Wednesday 28 November 2018

*Updates with Runtimes* Web Search Engine with Word Counts

Here we look at the implementation of a web search engine. The project already has word count data for web pages: pages added to the project are crawled and their content word counts are stored periodically. This was primarily for generating word clouds and text content analysis. However, the word counts can also be used to build a search index for the set of web pages. Given a set of words, the search index returns the list of pages in which those words occur. In addition, each word count carries the timestamp at which the page was processed, which helps to find more recent occurrences.

Quick overview of steps involved: 

A) Filter word counts within a time period, e.g. the past 4 (or N) days.
B) Build a trie data structure with the data. 
C) Compress the trie so that it can be held in memory. 
D) For a given search string made up of multiple words, find the set of web pages where the words occur. The compressed trie helps with this. Time complexity is described below. 
E) Find the intersection of the sets of web pages. 
F) Extract the required information and send back results. As in other search engines, this includes the full URL of the page, the time of crawling (word count generation) and a title. 
G) Cache information as necessary to speed up the web view.

Runtimes with cProfile are as follows:

1) Building the trie takes 3.583 seconds for 173693 words. Pickled size is 119.2 MB.

2) Compressing the trie takes 2.38 seconds; the pickled size is 6.2 MB.

3) A single search, including fetching the resulting web pages, takes ~4ms.

4) Searching all 10 strings above, including fetching results, takes ~116ms.

Some screenshots of the engine at work are shown below.

Apart from the trie index, the rest of the data is already part of the project database. The trie itself is not stored in the database; it is generated when required, compressed and held in memory.

Quick overview of tries: at the core of the index is a data structure called a trie (compressed). A trie is an m-ary tree where each node branches based on the next character of a key. The interesting property is this: for an alphabet of K unique characters, each node holds K+1 pointers (one per character plus an end-of-key slot), and the number of nodes depends on which keys are inserted. For a given trie with node count S, key count N and longest key length L, the search for any key runs in O(L), independent of K and N. The storage requirement is (K+1) x S x P bits, where P is the number of bits in a pointer; this too is independent of N, the number of keys in the trie.
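A minimal sketch of trie insertion and lookup (not the project's actual implementation; here children are kept in a dict rather than a fixed K+1 pointer array, and leaf payloads are page id sets as the index requires):

```python
class TrieNode:
    __slots__ = ("children", "pages")
    def __init__(self):
        self.children = {}   # character -> child node (up to K entries)
        self.pages = None    # payload at a key's terminal node: set of page ids

def insert(root, word, page_id):
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    if node.pages is None:
        node.pages = set()
    node.pages.add(page_id)

def lookup(root, word):
    """O(L) in the key length L, independent of the number of stored keys."""
    node = root
    for ch in word:
        node = node.children.get(ch)
        if node is None:
            return set()
    return node.pages or set()

root = TrieNode()
insert(root, "search", 1)
insert(root, "searching", 2)
insert(root, "trie", 1)
print(lookup(root, "search"))   # {1}
print(lookup(root, "tri"))      # set() -- a prefix only, not a stored key
```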

Compressing the trie: once the trie has been constructed it can be compressed. Techniques such as Patricia tries and de la Briandais trees could be used; however, the project uses a different approach. Any trie with S nodes over a K-character alphabet can be represented by an S x (K+1) table, and since most cells are empty the table can be shrunk further using a sparse matrix. The difference in the serialised size of the trie index is shown below.
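A sketch of the table idea, under the assumption that a plain dict keyed by (row, character) stands in for the sparse matrix: number the nodes breadth-first, give each a row with one column per character, and store only the occupied cells.

```python
from collections import deque

def build_trie(words):
    """Plain dict-based trie; '$' marks the end of a key."""
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node["$"] = True
    return root

def to_sparse_table(root):
    """Number nodes breadth-first and keep only occupied cells:
    table[(row, ch)] = child_row. Empty cells are simply absent."""
    table = {}
    ids = {id(root): 0}
    queue = deque([root])
    next_id = 1
    while queue:
        node = queue.popleft()
        row = ids[id(node)]
        for ch, child in node.items():
            if ch == "$":
                table[(row, ch)] = -1   # sentinel: a key ends at this row
                continue
            ids[id(child)] = next_id
            table[(row, ch)] = next_id
            next_id += 1
            queue.append(child)
    return table

def sparse_lookup(table, word):
    row = 0
    for ch in word:
        nxt = table.get((row, ch))
        if nxt is None:
            return False
        row = nxt
    return (row, "$") in table

table = to_sparse_table(build_trie(["cat", "car", "dog"]))
print(sparse_lookup(table, "car"))  # True
print(sparse_lookup(table, "ca"))   # False
```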

For the 79 web pages in the project, there are at minimum 2 crawls within a week, so ~160 word count data rows for the web pages. Object sizes were also monitored using pympler trackers under Python 3.

Uncompressed trie size: 25628388 bytes (~25 MB)
Compressed trie size: 21586721 bytes (~21 MB)
Compressed trie with minimal data in leaf nodes: 7281173 bytes (~7.2 MB)

This 7.2 MB trie index can be held in memory or cached. 

The search results are in decreasing order of timestamps. 

The architecture of the crawler project was discussed previously here. Crawling and word counting are executed as Celery async tasks. This architecture is shown below.

Future work: 

1) Currently a set intersection over the per-word page sets is used (AND semantics). More options like OR and NOT can be supported using expression trees. 
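A minimal sketch of how such expression trees might look, assuming each word maps to a set of page ids and NOT is taken relative to the universe of all indexed pages (the index contents are made up for illustration):

```python
# Each node is a function from (index, universe) to a set of page ids.
def AND(*kids):  return lambda idx, uni: set.intersection(*(k(idx, uni) for k in kids))
def OR(*kids):   return lambda idx, uni: set.union(*(k(idx, uni) for k in kids))
def NOT(kid):    return lambda idx, uni: uni - kid(idx, uni)
def word(w):     return lambda idx, uni: idx.get(w, set())

index = {"python": {1, 2}, "trie": {2, 3}, "django": {3}}
universe = {1, 2, 3}

# python AND NOT django
query = AND(word("python"), NOT(word("django")))
print(query(index, universe))  # {1, 2}
```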

2) Storing the pages themselves in the filesystem for reference would be great, but this is not feasible with the present disk allowance. 

3) Since the word counts are timestamped, a date-time search window option can be given to users. However, holding the index over a longer period also increases its size.

4) It would be feasible to rank the pages on a more complex parameter than just timestamps. Relevant visits and counts can be used along with timestamps.

5) Word edit distance can be used to correct words as in popular search engines.

Tuesday 6 November 2018

Django custom caching library v2

In a previous post we looked at a very early version of a caching library used in my Django project. It has since been enhanced with new features as requirements came up. Although the library is driven by practical requirements, the two primary APIs are documented well, so that the user is aware of what the library handles well and can avoid performance degradation. The library exists primarily to keep caching code DRY. Compared to the previous version there are no changes at the model level. There are three additions.

i) Prefetched relation support

Django documentation on Prefetch is available here.

In Django it is common practice to prefetch related relations while querying a model. While this is a good idea, doing it carelessly can really degrade performance by adding O(N) SQL queries, where N is the number of prefetched rows. To address prefetching, both APIs accept a tuple of Prefetch objects rather than the prefetch-related names. The reason is that Prefetch objects allow more control over what is prefetched, which helps performance, especially via the queryset .only(*fields) api as shown below.

In the code we want to get a web page and prefetch its related page word counts. We control which columns are needed from the prefetched relation, PageWordCount, using a queryset, and then pass the Prefetch to the api. This is important for caching: too much prefetched data not only consumes memory at the database and web server but also causes Django to silently fail when the data is set to memcached, which has a configurable 1MB object size limit. Notice the foreign key reference to the web page in the only fields.
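The original code screenshot is not reproduced here; a minimal sketch of the idea follows, assuming the model names WebPage and PageWordCount from this post. The related name, field names and `page_id` variable are assumptions for illustration:

```python
from django.db.models import Prefetch

# Trim the prefetched columns; field and related names are assumptions.
wordcount_qs = PageWordCount.objects.only(
    "word", "count",
    "web_page",   # the foreign key back to WebPage; omitting it costs
)                 # one extra query per prefetched row

page = (
    WebPage.objects
    .prefetch_related(Prefetch("pagewordcount_set", queryset=wordcount_qs))
    .get(pk=page_id)
)
```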

To understand the loophole that will cause extra SQL to be fired, we need to understand how Django handles prefetch. On the primary relation Django brings in the web pages, then uses an IN SQL query to bring in the PageWordCounts. It then does the join in Python, i.e. it matches the PageWordCounts to each WebPage. For that it needs the foreign key field. If you do not mention it in only(*fields), Django will send out an SQL query for exactly that, for each prefetched row. 

Prefetch support in the other api is shown below. Here we pre-load the cache with a list of all WebPages, a better example of where forgetting the above point will cost a lot.
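A sketch of the same idea for the bulk api; all_ins_from_cache is the api named in this post, while the keyword argument names are assumptions:

```python
# Pre-load the cache with all WebPages, prefetching trimmed word counts.
pages = all_ins_from_cache(
    WebPage,
    prefetches=(
        Prefetch(
            "pagewordcount_set",
            queryset=PageWordCount.objects.only("word", "count", "web_page"),
        ),
    ),
)
```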

The api signatures are shown below. The first allows fetching rows based on fields; the cache entry is set based on the specified fields. The second fetches all rows.
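Hypothetical signatures reconstructed from the description: only all_ins_from_cache is actually named in this post, and the first api's name and the keyword arguments are illustrative guesses, not the library's real interface:

```python
def ins_from_cache(model, prefetches=(), select_related=(), **field_filters):
    """Fetch rows matching field_filters; the cache entry is keyed on the
    specified fields. (Hypothetical name and signature.)"""

def all_ins_from_cache(model, prefetches=(), select_related=()):
    """Fetch all rows of the model, caching each instance.
    (Hypothetical signature; the name appears in this post.)"""
```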

ii) select_related

Django doc on this is here.

This is a simple forwarding of required fields. Similar to prefetch but for one-to-one and foreign keys relations.

iii) Chunked bulk updates to memcached

Once all the rows are fetched using the all_ins_from_cache api, we have a list of instances. This list can be huge. The api loops through the list and sets the individual cache entries using set_many. However, set_many was silently failing at around 100-120 entries, possibly due to the large amount of data being passed in a single call. To avoid this, the instances list is broken into manageable chunks and each chunk is passed to set_many. The chunk size is configurable.
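The chunking itself can be sketched in plain Python. The FakeCache class below is a stand-in for the Django cache backend so the example is self-contained; the real library would call the configured cache:

```python
def chunked(items, size):
    """Yield successive chunks of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def cache_instances(cache, instances, key_fn, chunk_size=50):
    """Set cache entries in manageable batches so that one oversized
    set_many call cannot silently drop entries. Returns the keys that
    still failed (Django's cache.set_many returns failed keys)."""
    failed = []
    for chunk in chunked(instances, chunk_size):
        failed.extend(cache.set_many({key_fn(obj): obj for obj in chunk}))
    return failed

class FakeCache:
    """Stand-in for the Django cache backend, for illustration only."""
    def __init__(self):
        self.store = {}
    def set_many(self, mapping):
        self.store.update(mapping)
        return []  # no failed keys

cache = FakeCache()
failed = cache_instances(cache, list(range(120)), key_fn=lambda n: "webpage:%d" % n)
print(len(cache.store), failed)  # 120 []
```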

The resulting library is more usable across the Django project's data set. Cache set/get code is more sophisticated and stays DRY.