Tuesday 6 November 2018

Django custom caching library v2

In a previous post we looked at a very early version of the caching library used in my Django project. It has since been enhanced with new features as requirements came up. Although the library grew out of practical requirements, its two primary APIs are documented well, so that the user knows what the library handles well and can avoid performance degradation. The main motivation for writing it has been to keep caching code DRY. Compared to the previous version there are no changes to the models. There are three additions.

i) Prefetched relation support

Django documentation on Prefetch is available here.

In Django it is common practice to prefetch related objects while querying a model. While this is a good idea, a carelessly built prefetch can really degrade performance by adding O(N) SQL queries, where N is the number of prefetched rows. To address prefetching, both APIs accept a tuple of Prefetch objects rather than the names of the related fields. The reason is that Prefetch objects allow more control over what is prefetched, which helps performance, especially when combined with the queryset's .only(*fields) API.

Suppose we want to get a web page and prefetch its related page word counts. We control which columns are fetched from the prefetched relation, PageWordCount, using a queryset, and then pass the Prefetch to the API. This matters for caching because too much prefetched data not only consumes memory on the database and web server but can also cause Django to fail silently when the data is set in memcached: memcached has a 1MB object size limit by default, which is configurable. Note that the only() fields must include the foreign key reference to the web page.

In order to understand the pitfall that causes extra SQL to be fired, we need to understand how Django handles prefetch. Django first brings in the web pages with the primary query and then uses a second query with an IN clause to bring in the PageWordCounts. It then does the join in Python, i.e. it tries to find the PageWordCounts that belong to each WebPage. For that it needs the foreign key field. If you do not list it in only(*fields), Django will send out an SQL query for exactly that field, once for each prefetched row.

The other API supports Prefetch as well. There we pre-load the cache with a list of all WebPages, which is a better example of where forgetting the point above will cost a lot.

Of the two APIs, the first allows fetching rows based on fields, and the cache entry is set based on the specified fields. The second fetches all rows.

ii) select_related

Django doc on this is here.

This is a simple forwarding of the required fields to select_related. It is similar to prefetch, but for one-to-one and foreign key relations.

iii) Chunked bulk updates to memcached

Once all the rows are fetched using the all_ins_from_cache API, we have a list of instances. This list can be huge. The API loops through the list and sets the individual cache entries using set_many. However, set_many was failing silently with as few as 100-120 entries, possibly because of the large amount of data passed in a single call. To avoid this, the list of instances is broken into manageable chunks and each chunk is passed to set_many. The chunk size can be configured.
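The chunking idea can be sketched in plain Python. This is an illustration under stated assumptions rather than the library's actual code: chunked, set_many_chunked and RecordingCache are hypothetical names, and RecordingCache is only a stand-in that mimics the set_many method of a Django cache backend (which takes a dict and returns the keys that failed) so that the sketch runs without memcached.

```python
from itertools import islice


def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def set_many_chunked(cache, entries, chunk_size=50, **set_many_kwargs):
    """Write `entries` (a dict of key -> value) to `cache` chunk by chunk.

    Returns the keys the backend reported as failed, across all chunks.
    """
    failed = []
    for chunk in chunked(entries.items(), chunk_size):
        failed.extend(cache.set_many(dict(chunk), **set_many_kwargs) or [])
    return failed


class RecordingCache:
    """Hypothetical stand-in for a cache backend; records each call size."""

    def __init__(self):
        self.calls = []
        self.store = {}

    def set_many(self, data, timeout=None):
        self.calls.append(len(data))
        self.store.update(data)
        return []  # no failed keys


cache = RecordingCache()
entries = {f"webpage:{i}": {"id": i} for i in range(120)}
failed = set_many_chunked(cache, entries, chunk_size=50)
print(cache.calls, failed)
```

Collecting the failed keys from every chunk gives the caller a way to notice writes that would otherwise fail silently.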

The resulting library is more usable with the Django project's data set. The cache set/get code is more sophisticated and helps to keep code DRY.