Wednesday, 30 August 2017

Tuning the word cloud: Nuances with Mobile-support/REST-api/Cache

The word cloud web application in the previous post has been updated with the following features.



1) Mobile/Tablet support using Bootstrap css.
2) Configurable ignore lists to reduce noise in the cloud.
3) Caching.
4) REST.

1)  Mobile/Tablet support

Bootstrap makes the app usable on various device screens. Bootstrap intro page here. A page from the application now looks like this.


2) Ignore list via Configuration 

A web page can now be set with a list of words to ignore. There is no point so far in counting the word "like" on a social media page. Now the application can have words associated with lists and those lists associated with a web page. i.e configurable ignore lists. Example.


After applying an ignore list there is visible reduction in noise. (before Vs after).



















3) Caching

This is most challenging in any application. When the data has not changed in the db, time spent on db/disk accesses for the same can be saved thus improving app latency. Memcached is used on 2 hosts and the application is configured to use the same. When a page is requested, data access is first checked against the cache, if it is a miss then only the db is queried. This data from the db is cached right away so that subsequent requests for the same data return fast. Common challenges in caching is described below. It is important to develop a feature without caching and then put the caching logic in. Screens show cache hits/miss on 2 consecutive requests for the same data. The second request returns from cache hits.



4) REST API

Data needs to be provisioned in a way that can be consumed by mobile applications, web apps or any program that wants to talk to your application. REST is good especially for native mobile applications that need the data but, handle display on their own. Django REST makes this manageable with a few quirks of its own. A rest response from the application using Chrome is shown and the same is accessible using curl command too.


  
The rest of the post is about common programming challenges in DRF and caching.

DRF challenge

DRF is good at so many levels. However it is tightly couple with the queryset. So if there is a list of instances rather than a queryset, things become incompatible. For example, customizing a foreign key related field to use cached data before hitting the database. While this may be a remote requirement the problem is well explained here. Curiously, Django1.10 has a new feature that lets queryset api with instances list. This api feature is described here. So if the instances that match are already in the cache they can be used. DRF serializer can be made to work with model instances like



The eager loading idea is from here but has been modified for the specific requirement with instances. After this what remains is overriding the get_queryset on the rest view. Works with Django 1.10. If the api is heavily used and database accesses can be saved, it is worth the effort and work too.


Caching Challenges

 

1) What to cache? 

To avoid access to the database, it is common to cache database entries. Images, static pages and json responses are also cached.

 

2) Cache code

Where to put it? How generalized should it be? Good cache key? In tutorials it is common to see cache access code and if that fails the object is retrieved from db and set to cache. This is leads to a lot of duplication of code. It is better to identify common sql access patterns to the db and cache those. The code that caches the db can be set as a behavior of the database entity or coded as a utility. Example for cache access logic as part of the db entity facade/ORM. This orm approach is described here.


Modify that for reuse: common accesses like the above on primary key can be factored out for re-use in a utility module.

Some access types are difficult to generalize and doing so can make the code base difficult to maintain. For example, fetching an entry in table A that corresponds to a foreign key. While it may be tempting (and feasible) to make a utility for this too, it is better to leave this alone until replication makes the use-cases clear and creates a demand for factoring this out into a utility.
  
What works in terms of generalizing the cache code depends on how many requirements come up. After a number of use cases, a general pattern should emerge in any application, like the primary key and foreign key access above.

 

3) Cache Key? 

A good cache key is one which describes the cached content. Also it should not be so long that the time to compute the hash beats the whole purpose.

 

4) Cache invalidation? 

If the data that is cached has changed then the cache needs to be updated. If the data is updated at many points in the application, it may become difficult to keep track of this. The example shows a signal handler that updates the cache on a db update/save. This approach works well.

 

5) Cache loading

Finally, when the application starts up, it is good to have the cache initiated with some data. What qualifies to be loaded to cache on startup? A fixed number of entries/tables with fixed row count or most frequented data points make good candidates. The application data, UI and usage need to be analyzed to determine the same. For example, data that corresponds to drop down lists with fixed number of entries. Say the names of states in a country shown in a drop down list.

Saturday, 12 August 2017

Word clouds: Text analysis for interesting patterns

This post utilises text analysis to identify trends in popular web pages. Of interest are home pages of news sites, Twitter pages of popular personalities and other sites where content changes frequently. Analysing words and projecting to a cloud provides a visual representation of text data which in turn helps to quickly perceive overall changes in trends and interests. Here the cloud is generated on content from Saturday and Sunday.




About the application:

The web application generates a word cloud for each submitted link at a given time. The application first generates a word count for each page using natural language tool kit and stores the results in a database. The counts are converted to a word cloud using opensource javascript libraries. Web links can be added to the application as needed. The word count task is performed asynchronously using celery worker nodes. 

Software built using:

Django1.10/Python3.5, Asynchronous tasks using Celery 4.1.0, Javascript, Natural Language Processing Tool Kit (NLTK), Postgres Database. 3 hosts. One for web application, one for celery workers and Database and third for the message broker. 

Home pages analysed:

RT.com
BBC.com
Google News
Fox News
Indian Prime Minister Twitter Page
Obama Twitter page
President Trump Twitter Page

Celery Django Architecture

>> Word cloud on Saturday to left and Sunday to right

RT.com

BBC.com

Fox News

Google News

Obama


 President Trump and  the Indian Prime Minister Twitter Handles