Sunday 23 December 2018

Throughput | Reaching 250 requests per second

In this post we look at how the throughput of my data visualization web application was improved to 267 requests per second in this iteration. This is a significant improvement, achieved with additional caching and modifications to the application architecture. Note that this new throughput is for the web application with many more features, such as language support, machine learning models etc.

New throughput on load test: 267 requests per second

Background

Previous JMeter test results with the load balanced deployment and model caching are here. The same load test is used again to measure the improvement.


The load balanced architecture is described here.

JMeter test results with new throughput
 
The JMeter Response Time Graph is shown below.



The JMeter Graph Results are shown below.



Techniques so far and modifications applied

The previous caching technique focused on avoiding database hits. 

That employed a Django-model-based custom caching library. In addition to that, it also marked static files like JS and images with downstream cache-control headers so that the browser does not download them each time.

So far, so good. However:

1) With the application there is scope for improvement, especially since a set of new features has been added, among which language support is prominent. The Django application builds HTML templates to serve to clients. This includes all the HTML templates like the navigation bar, user profile templates and page footer templates. Some template contents have to change based on, say, time, user or user language. However, most templates, once they have been generated based on one or more of the above, can remain the same and be reused. That is where template fragment caching comes in. A few things that can trip you up if not understood:
  • The gain from a single cached template fragment is tiny. The return on investing in template caching only shows as the number of concurrent requests on the application goes up.
  • Locality of the cache matters for template fragments. The time saved on each request is small, so having to go to a cache on a different host can cost more than just building the template! This mandates a local cache and is a modification to the architecture. A sketch of fragment caching is shown below.
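
As a rough illustration, a navbar fragment can be cached with Django's {% cache %} template tag; the fragment name, timeout and vary-on arguments here are illustrative. Varying on the user and the active language keeps per-user/per-language variants separate. If a CACHES alias named 'template_fragments' is configured, the tag uses it instead of the default cache, which is how the fragments can be pinned to a cache local to the web host.

    {% load cache %}
    {% cache 600 navbar request.user.username LANGUAGE_CODE %}
        ... navigation bar markup ...
    {% endcache %}
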
2) Each view generates a response based on the request. For most requests the response is the same; for example, the response to a request for 'Word counts for www.cnn.com at 3 PM on 25 Dec 2018' is always going to be the same. Such views need to be identified and cached, which helps improve throughput. A sketch is shown below.
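A minimal sketch of caching such a view with Django's cache_page decorator; the view name and the 15-minute timeout are illustrative.

    from django.views.decorators.cache import cache_page

    @cache_page(60 * 15)
    def word_counts(request, site):
        # The response depends only on the URL, so the cached copy can be
        # served to every client asking for the same site and period.
        ...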

3) Finally, one physical aspect of your deployment that can affect performance is thermal throttling of CPUs. It is a good idea to check this too.

Tuesday 11 December 2018

Lang support Dutch, Swedish... for WebApp | HUD

Supporting multiple languages in a Django web application is straightforward. 1) Add LocaleMiddleware to the list of middleware. 2) Supply language files with translations in PO files. 3) Compile the PO files into MO files. When a request comes in, Django checks the following places for the required language: the URL for a language prefix, the session, a language cookie and the Accept-Language header, in that order. This order can be seen in the locale middleware code at django > middleware > locale. Once the required language is identified, it is activated. A sketch of the relevant settings is shown below.
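A sketch of the relevant settings, assuming Django 2.1 as in the referenced docs; the language codes mirror the languages supported below.

    import os

    USE_I18N = True
    LANGUAGE_CODE = 'en'
    LANGUAGES = [
        ('en', 'English'),
        ('nl', 'Dutch'),
        ('sv', 'Swedish'),
        ('nb', 'Norwegian'),
        ('fr', 'French'),
    ]
    LOCALE_PATHS = [os.path.join(BASE_DIR, 'locale')]  # PO/MO files live here

    MIDDLEWARE = [
        'django.contrib.sessions.middleware.SessionMiddleware',
        'django.middleware.locale.LocaleMiddleware',  # after the session middleware
        'django.middleware.common.CommonMiddleware',
        # ...
    ]

    # PO files are compiled to MO with: django-admin compilemessages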

Within the application a user can have a preferred language. This is implemented the same way as the preferred timezone: users can go to their preferences page and choose from a list of supported languages. Currently English, Dutch, Swedish, Norwegian and French are supported. Two users with Dutch and Swedish language preferences are shown below.


When a user accesses the application for the first time, the login page is shown in the language indicated by the Accept-Language header from the browser. Here Chrome was set to French (browser settings shown at the end) and Opera was left at its default language settings. The login pages based on request headers (for the users above) are shown below.


The following screens show the two users accessing the application after logging in. For both browsers the initial language was the one the browser specified in the request header. Once the user logs in, the user's preferred language is activated. This is shown below, where the preferred language of the user on the left is Dutch and that of the user on the right is Swedish.
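A sketch of how the preference could be activated on login; 'profile.language' is an assumed field name, and in Django 2.1 LocaleMiddleware picks this session key up on subsequent requests.

    from django.utils import translation

    def activate_preferred_language(request, user):
        lang = user.profile.language  # e.g. 'nl' or 'sv'; assumed field
        translation.activate(lang)
        request.session[translation.LANGUAGE_SESSION_KEY] = lang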


Chrome language settings indicating the French selection are as follows.



References

Django docs on translation

https://docs.djangoproject.com/en/2.1/topics/i18n/translation/

Wednesday 28 November 2018

*Updates with Runtimes* Web Search Engine with Word Counts

Here we look at the implementation of a web search engine. The project already has word count data for web pages: pages added to the project are crawled and content word counts are stored periodically. This was primarily for generating word clouds and text content analysis. However, the word counts can also be used to build a search index for the set of web pages. Given a set of words, the search index gives back the list of pages in which the words occur. In addition, each word count is tagged with the timestamp at which the page was processed. This helps find more recent occurrences.

Quick overview of steps involved: 

A) Filter word counts within a time period. Past 4 (or N) days.
B) Build a trie data structure with the data. 
C) Compress the trie so that it can be held in memory. 
D) For a given search string made up of multiple words, find the set of web pages where the words occur. The compressed trie helps with this. Time complexity is described below. 
E) Find the intersection of the sets of web pages. 
F) Extract the required information and send back results. As in other search engines, this includes the full URL of the page, the time of crawling (word count generation) and a title. 
G) Cache information as necessary to speed up the web view.

Runtimes with cProfile are as follows:

1) Building the trie takes 3.583 seconds for 173693 words. The pickled size is 119.2MB.


2) Compressing the trie takes 2.38 seconds and the pickled size is 6.2MB.

3) Searching, including fetching the resulting web pages: ~4ms.



4) Searching all 10 strings above, including fetching results: ~116ms.

 
Some screenshots of the engine at work are shown below.




Apart from the trie index, the rest of the data is already part of the project database. The trie is not stored in the database; it is generated when required, compressed and held in memory.

Quick overview of tries: At the core of the index is a data structure called a trie (compressed). A trie is an m-ary tree where each node branches out based on the character encountered in a key. The interesting property of tries is this: for an alphabet of K unique characters, a node has K+1 pointers, and the number of nodes depends on the keys inserted. For a given trie, if S is the node count, N the key count and L the length of the longest key, then the search for any key is O(L), independent of K and N. The storage requirement is (K+1) x S x P bits, independent of N, where P is the number of bits in a pointer. A minimal sketch is shown below.
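A minimal trie sketch for the index idea; names are illustrative, and the project's trie also carries timestamps in the leaf data.

    class TrieNode:
        def __init__(self):
            self.children = {}   # char -> TrieNode
            self.pages = None    # page-id set on end-of-word nodes

    class Trie:
        def __init__(self):
            self.root = TrieNode()

        def insert(self, word, page_id):
            node = self.root
            for ch in word:
                node = node.children.setdefault(ch, TrieNode())
            if node.pages is None:
                node.pages = set()
            node.pages.add(page_id)

        def search(self, word):  # O(L) with L = len(word)
            node = self.root
            for ch in word:
                node = node.children.get(ch)
                if node is None:
                    return set()
            return node.pages or set()

A multi-word query is then the intersection of the per-word results, as in step E above: set.intersection(*(trie.search(w) for w in query.split())).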

Compressing the trie: Once the trie has been constructed it can be compressed. Multiple techniques such as Patricia tries and de la Briandais trees could be used. However, this project uses a different technique: any trie with S nodes and a K-character alphabet can be represented by an S x K table, and the table can be shrunk further using a sparse matrix. Below we see the difference in serialised size of the trie index.
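A rough sketch of the table idea, reusing the trie sketch above: nodes are numbered breadth-first, the alphabet is mapped to columns, and since the root is never a transition target, 0 can stand for 'no transition'. scipy's sparse matrices do the shrinking; the exact representation the project uses may differ.

    from scipy.sparse import dok_matrix

    def trie_to_table(root, alphabet):
        col = {ch: i for i, ch in enumerate(alphabet)}
        nodes = [root]
        transitions = {}
        i = 0
        while i < len(nodes):            # breadth-first numbering
            for ch, child in nodes[i].children.items():
                transitions[(i, col[ch])] = len(nodes)
                nodes.append(child)
            i += 1
        table = dok_matrix((len(nodes), len(alphabet)), dtype='int32')
        for cell, target in transitions.items():
            table[cell] = target
        return table.tocsr()             # compressed rows pickle compactly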

For the 79 web pages in the project, within a week there are at least 2 crawls, so ~160 word count data rows for the web pages. Object sizes were also monitored using pympler trackers on Python 3.

Uncompressed trie size: 25628388 bytes ~ 25 MB
Compressed trie size: 21586721 bytes ~ 21 MB
Compressed trie with minimum selected data in leaf nodes: 7281173 bytes ~ 7.2 MB

This 7.2 MB trie index can be held in memory or cached. 

The search results are in decreasing order of timestamps. 

The architecture of the crawler project was discussed previously here. Crawling and word counting are executed in Celery async tasks. This architecture is shown below.


Future work: 

1) Currently a set intersection of the words' page sets is used. More options like OR and NOT can be supported using expression trees. 

2) Storing the pages themselves in the filesystem for reference would be great, but this is not feasible within the present disk allowance.

3) Since the word counts are timestamped, a date-time search window option can be given to users. Holding the index over a longer period of time increases the size of the index too.

4) Pages could be ranked on a more complex parameter than just timestamps; relevant visits and counts can be used along with them.

5) Word edit distance can be used to correct words, as in popular search engines.

Tuesday 6 November 2018

Django custom caching library v2

In a previous post we looked at a very early version of a caching library used in my Django project. It has since been enhanced with new features as requirements came up. Although this library is based on practical requirements that showed up, the two primary APIs are well documented, so that the user is aware of what the library can handle well and can avoid performance degradation. Coding up this library has primarily been about keeping caching code DRY. Compared to the previous version there are no changes to the models. There are three additions.

i) Prefetched relation support

Django documentation on Prefetch is available here.

In Django it is common practice to prefetch related relations while querying a model. While this is a good idea, it can really degrade performance by increasing the number of SQL queries by O(N), where N is the number of prefetched rows. To address prefetching, both APIs accept a tuple of Prefetch objects, not the prefetch related names. The reason is as follows: Prefetch objects allow more control over what is prefetched. This helps with performance, especially via the queryset .only(*fields) API, as shown below.
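A sketch of the call, assuming models named WebPage and PageWordCount with a foreign key 'webpage'; the actual names in the project may differ.

    from django.db.models import Prefetch

    word_counts = Prefetch(
        'pagewordcount_set',
        queryset=PageWordCount.objects.only('id', 'word', 'count', 'webpage'),
    )
    # The foreign key 'webpage' is in only() on purpose -- see below.
    page = WebPage.objects.prefetch_related(word_counts).get(pk=page_id)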


In the code we want to get a web page and prefetch its related page word counts. We control which columns are needed from the prefetched relation, PageWordCount, using a queryset, then pass the Prefetch object to the API. This is important for caching: too much prefetched data will not only drive up memory consumption at the database and web server, but can also cause Django to silently fail when the data is set to Memcached, which has a configurable 1MB object size limit. Notice the foreign key reference to the web page in the only() fields.

In order to understand the loophole which will cause SQL to be fired, we need to understand how Django handles prefetching. On the primary relation Django brings in the web pages and uses an IN SQL query to bring in the PageWordCounts. It then does the join in Python, i.e. it finds the PageWordCounts that belong to each WebPage. For that you need the foreign key field. If you did not mention it in only(*fields), Django will send out an SQL query for exactly that, for each prefetched row.

Prefetch support in the other API is shown below. Here we are pre-loading the cache with a list of all WebPages. This is a better example of where forgetting the above point will cost a lot.


The API signatures are shown below. The first allows fetching rows based on fields; the cache entry is set based on the specified fields. The second fetches all rows.
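The exact signatures are in the screenshots; as a hypothetical reconstruction from the description (the first function's name is an assumption, while all_ins_from_cache is named later in this post):

    def ins_from_cache(model, fields, prefetches=(), select_related=()):
        """Fetch instances matching the given field values; cache entries
        are keyed on those fields."""

    def all_ins_from_cache(model, prefetches=(), select_related=()):
        """Fetch all rows of the model, pre-loading the cache."""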




ii) select_related

Django doc on this is here.

This is a simple forwarding of the required fields. Similar to prefetch, but for one-to-one and foreign key relations.

iii) Chunked bulk updates to memcached

Once all the rows are fetched using the all_ins_from_cache API, we have a list of instances. This list can be huge. The API loops through the list and sets the individual cache entries using set_many. However, set_many was silently failing at around 100-120 entries, possibly due to the large amount of data being passed in a single call. To avoid this, the instances list is broken into manageable chunks and each chunk is passed to set_many. The chunk size is configurable, as sketched below.
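A minimal sketch of the chunking, assuming a per-instance cache key of the form 'webpage:<pk>'; the library's real key scheme is not shown here.

    from django.core.cache import cache

    CHUNK_SIZE = 50  # configurable

    def cache_instances(instances, chunk_size=CHUNK_SIZE):
        entries = {'webpage:%s' % obj.pk: obj for obj in instances}
        keys = list(entries)
        for start in range(0, len(keys), chunk_size):
            chunk = {k: entries[k] for k in keys[start:start + chunk_size]}
            cache.set_many(chunk)  # each call now stays a manageable size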



The resulting library is more usable across the Django project's data set. The cache set/get code is more sophisticated and helps keep code DRY.


Wednesday 17 October 2018

HUD | Enable/Disable Django Apps


This post describes how Django apps in a project can be enabled/disabled via settings. The requirement is to be able to enable/disable Django apps with flags. If an app is enabled in the project then it is loaded, and its URLs and templates are available to users. On the other hand, if an app is disabled then its templates and URLs are not available to the user.

For example, in the screenshot below the word cloud app is enabled on the left: the app is available on the navbar and homepage. The deployment on the right does not have the word cloud app enabled. Notice that the templates have adjusted themselves based on the configuration.



Preventing Django from loading an app is easy: just do not add it to INSTALLED_APPS in the Django settings. However, the root URL conf, ROOT_URLCONF, will need to be changed accordingly. Likewise, URL references in templates cannot be enabled/disabled using just INSTALLED_APPS. Editing templates and root URL confs to tailor them for a specific deployment is not recommended; it creates additional effort as that particular deployment will need to be tracked and maintained separately.

A better implementation is to specify whether an app is enabled or not and have the project load appropriately. For each app we need its

a) Switch name: This tells the rest of the project whether the app is enabled or disabled, i.e. a flag like wordcloud_enabled to check against. This is particularly useful for templates. 

b) URL regex: This is the base URL pattern for the app. For example, all URLs for the articles app will have the "articles/" base pattern as a prefix, and the URL for posting articles can be https://www.server.com/articles/post

c) URLs module: The Python module that holds the app's URL patterns. In the above example, the URL patterns for posting, editing, deleting articles etc. are specified in this module.

These 3 details are easily configured as named tuples, as shown below.
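A sketch of such a configuration; field and app names are illustrative, not the project's actual values.

    from collections import namedtuple

    Application = namedtuple('Application', ['switch', 'url_regex', 'urls_module'])

    APPLICATIONS = [
        Application('wordcloud_enabled', r'^wordcloud/', 'wordcloud.urls'),
        Application('ml_enabled', r'^ml/', 'ml.urls'),
    ]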

1) Controlling apps in templates: 

Users should not be able to view or use the URLs of a disabled app. To achieve this, templates need to know whether an app is disabled. This is made possible using a list of Application tuples and a template context processor. The switches tell the template whether an app is enabled or not. The context that contains each app's state is generated from the Application tuples and made available by a context processor. A sketch is shown below.
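A minimal context processor sketch: it assumes the APPLICATIONS list above plus a settings.ENABLED_APPS set of switch names (both assumptions), and it would be registered under TEMPLATES[0]['OPTIONS']['context_processors'].

    from django.conf import settings
    from hud.app_config import APPLICATIONS  # assumed module path

    def app_switches(request):
        return {app.switch: app.switch in settings.ENABLED_APPS
                for app in APPLICATIONS}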



With this solution, as long as the templates are modular, i.e. they utilise the template hierarchy to separate template fragments, we can enable/disable parts of the user interface. The solution becomes as simple as the following check in the navbar template for the ml machine learning application.
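A sketch of what that check could look like in the navbar template; 'ml:home' is an assumed URL name.

    {% if ml_enabled %}
        <li class="nav-item">
            <a class="nav-link" href="{% url 'ml:home' %}">Machine Learning</a>
        </li>
    {% endif %}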


Notice that the template uses the flag 'ml_enabled' to check whether the ml app is enabled. Each application needs a flag that describes its state. This flag/label is also configurable.

2) Now come the root URL confs. 

For each enabled application we need the URL regex and the URLs module. These are added to urlpatterns in the root URL conf, as shown below.
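A sketch of the root urlconf doing this, reusing the APPLICATIONS and ENABLED_APPS assumptions from above; re_path is the Django 2.x spelling.

    from django.conf import settings
    from django.urls import include, re_path
    from hud.app_config import APPLICATIONS  # assumed module path

    urlpatterns = [
        # ... urls that are always on ...
    ]

    for app in APPLICATIONS:
        if app.switch in settings.ENABLED_APPS:
            urlpatterns.append(re_path(app.url_regex, include(app.urls_module)))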



Thursday 11 October 2018

Deploying application updates | Quick automation using shell scripts

This post describes how the visualisation and machine learning application is currently updated on the servers. See the application demo on YouTube using the playlist.

There are 2 Ubuntu server host machines in the load balanced pool. Python 3.5.3 and virtualenv are compiled and available on the machines; the system Python is not used. The web application updates are made available on a particular branch of the Bitbucket git repository. The repository is configured with read-only keys for release use; these are different from the developer keys. On the Ubuntu hosts the read-only keys are configured such that they get added to ssh-agent on agent startup. Only the branch with the tested final release is cloned. 

There is a root folder within which each release is cloned into a specific date-timed folder, so there will be multiple folders, each representing a particular update or release. A symbolic link 'current' points to the latest release running in production. The mod_wsgi-express server config points to this symbolic link. This avoids having to modify the mod_wsgi-express configuration on every application feature update, and it makes rollback as easy as pointing the link back to the previous release folder. A simplified sketch is shown below.
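A simplified sketch of the per-release steps; paths, branch and repository names are placeholders, not the project's actual values.

    ROOT=/srv/app
    RELEASE="$ROOT/releases/$(date +%Y%m%d-%H%M%S)"

    # Clone only the tested release branch into a date-timed folder
    git clone --branch release --single-branch \
        git@bitbucket.org:user/app.git "$RELEASE"

    # Each release gets its own virtual environment
    python3 -m virtualenv "$RELEASE/env"
    "$RELEASE/env/bin/pip" install -r "$RELEASE/requirements.txt"

    # Atomically repoint 'current'; rollback is the same command
    # pointed at an older release folder
    ln -sfn "$RELEASE" "$ROOT/current"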

All of this is done using a simple shell script. The root folder, branch to clone, number of processes, thread counts etc. are configurable. The figure shows the file with the list of server IPs and the update scripts.


This shell script logs into the deployment servers using ssh (another key for admins). 


Then it performs the tasks mentioned above. The deployment also ensures that each release runs in its own Python virtual environment: after cloning, the requirements.txt is used to create a new environment in the release folder. If needed, each of the date-timed folders mentioned above can be put back into production. The script accesses the servers listed in a configuration file, and each machine is updated sequentially. The figure shows the script logging into the first server in the list.


The ssh-agent is also used only for the duration of the update: the agent is started at the beginning of the script and stopped at the end. Finally, the mod_wsgi-express server is restarted by the script. This is not strictly necessary and will be removed in the next iteration. The figure shows the script finishing up on the first server and moving on to the next server in the list. 


Notes:

This script-based approach configures everything from the virtualenv upwards. The infrastructure itself can also be configured and kept in code using tools like Puppet, Chef and the like. Here the focus is on the application side and is kept simple with shell scripting.

Monday 6 August 2018

Machine Learning | Cancer prediction

In a previous post a number of machine learning models were trained on a dataset; details of the data set are here. The relative importance of each of the 32 dimensions was also deduced from a multi-dimensional visualisation. This post builds on those results to finally use the models for prognosis.

video for post content

The models that were trained included a gradient boosting classifier, a random forest, an SVC and a k-neighbours classifier. In addition to those, another gradient boosting classifier was trained with parameter tuning: a range of values for learning_rate, max_features and max_depth was specified in a parameter grid, and a model selection and evaluation tool like GridSearchCV with a scoring parameter was used to choose the model with the best quality. A sketch is shown below.
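A sketch of that tuning step, assuming scikit-learn with training data X_train, y_train already prepared; the grid values are illustrative, not the ones used in the project.

    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import GridSearchCV

    param_grid = {
        'learning_rate': [0.05, 0.1, 0.2],
        'max_features': ['sqrt', 'log2'],
        'max_depth': [2, 3, 4],
    }

    search = GridSearchCV(GradientBoostingClassifier(), param_grid,
                          scoring='accuracy', cv=5)
    search.fit(X_train, y_train)
    best_model = search.best_estimator_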

Using the generated models:

After training and generating the models, we need to use them. Also, when there are a number of algorithms, samples and models, it would be great to be able to select a model and run an out-of-bag sample through it. 

To do that with this cancer data set, we serialise the models and samples and provision them through an application. The models and their details are stored in the database with a link to their respective files. The same is done for samples. Since each model utilises a particular algorithm, a list of algorithms is also maintained. So algorithms, models and samples can be added to the application, and a user interface is presented to choose a sample and the model to apply. A sketch of the serialisation is shown below.
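A sketch of the serialisation side, assuming joblib; the model path would come from the database row that links a model to its file.

    import joblib

    # At generation time
    joblib.dump(best_model, '/srv/models/gbc-tuned.joblib')

    # At prognosis time, after the user picks a model and a sample
    model = joblib.load(model_path)
    prediction = model.predict([sample])  # sample: one feature row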

This enables users in, say, a medical institution to add new models/samples as they are generated and conduct a prognosis. Models can also be revisited and looked up as they are updated.

Screens of conducting a prognosis on stored models and samples are shown below. 

1) Selecting a sample


2) Selecting a model

3) Result



Another prognosis with K-neighbour classifier on a different sample.

Details of a particular model can also be viewed.





Sunday 15 July 2018

Porting/Upgrading a Python 2 library to Python 3

In this post we look at porting the coils library from Python 2.7 to Python 3.5.3 while keeping it working on both versions. Coils is a data structure library that was coded in 2015. It has implementations of basic data structures like hash tables, binary search trees, splay trees, lists and heaps. The coils code is available here. The steps described here are based on the official documentation on porting, here. It is worth going through it before attempting to port your own project.

The ported code can be installed from https://pypi.org/project/pycoils/

Let's see the steps for porting the coils library:

1) Create 2 virtual environments, one for Python 2.7.15 and one for Python 3.5.3.

2) pip install future in both environments.

3) Ensure that all tests pass in the Python 2.7.15 environment before beginning the port.

You need tests and good coverage in your project. One way to be sure that the code works as expected after porting to Python 3, and still works on Python 2, is to ensure that all tests pass on both versions. Here we see a screen of the project's 255 tests on both.




coils also had good test coverage, as shown below.


If you do not have any tests in your project, there is no way to identify regressions introduced by porting. This is because porting involves not only using tools to refactor, but also manual changes to make tests pass. 

If your coverage is low, write or modify tests to increase it.

4) Run futurize --stage1 -w **/*.py

(When using the globstar **, make sure that it is enabled in your shell.)

This will modernize the code without introducing any dependencies. You will see a lot of output about the code that was refactored, as shown below.


Once the run completes without errors, you can use a diff tool to see what code was changed to support Python 3 in stage 1. In this example we see the Python 3 style in operator brought in to replace has_key, as illustrated below.
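An illustrative example of that kind of stage 1 change (not the project's actual diff):

    # Before (Python 2 only):
    if table.has_key(word):
        counts = table[word]

    # After stage 1 (valid on both versions):
    if word in table:
        counts = table[word]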


5) Commit the changes. Rerun the tests in the Python 2 environment and manually fix any errors to ensure the tests pass.

6) Run futurize --stage2 -w **/*.py

The aim of stage 2 is to ensure that the code works on Python 3 and then back on Python 2 with the future package. As before, the command will output the refactoring done. 


The changes can be checked with a diff tool too. Here we see support for Python 3 style division on both Python 3 and Python 2 via the future package.



7) Run the tests again on Python 3. Note that the test cases may also have been ported to Python 3 and will need changes to run on both Python 3 and Python 2. 

Here we see some manual fixes to the test cases. These were to a) drop deprecated keyword arguments and b) drop deprecated TestCase class method aliases. Illustrative examples follow.
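Illustrative examples of such fixes; the deprecated aliases are real unittest names, though the project's exact edits are in the screenshots.

    # Deprecated TestCase alias -> current method
    self.assertEquals(size, 255)         # old
    self.assertEqual(size, 255)          # replacement

    self.failUnless(tree.contains(key))  # old
    self.assertTrue(tree.contains(key))  # replacement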





8) Once the tests pass on both Python 2 and Python 3, you have a stable version that can be used in both environments. 

The coils library has an examples package which demonstrates how the library's data structures can be used in code, for example creating a min heap, adding data and performing heap operations. 

This is also used to ensure that things work as expected on both versions.



There are some post-conversion steps (depending on your code). After that, with the tests and examples running on both versions, we have coils for Python 3/Python 2.

Thursday 28 June 2018

Load testing and insights

In this post the visualisation/ml web application is load tested with JMeter. More about the application here. The setup is described in the following video.



The application architecture is here

The first run performs a load test on a single deployment cell, i.e. only one application server in the pool; the rest of the architecture remains the same. In the second run, 2 deployment cells are load balanced and JMeter is pointed at the load balancer. 

In both runs the application is tested with slightly more than the expected load. In addition to monitoring and understanding server resources and the application stack under load, we can also observe expected and unexpected application behaviour. As we will see in this post, monitoring the application logs during such tests can also help identify and address blind spots.

Topology

#Load balancer: 1 Nginx deployment
#Web app servers: 2 vms
#Database hosts: 2 vms
#Media and assets hosts: 1 vm
#Memcached hosts: 2 hosts
#Celery hosts: 3 (2 shared with memcached hosts)

JMeter test details

#Users: 100
#Ramp up: 10
#Loop: 4

Listeners

Graph Results: gives an overall throughput number.
Response Time Graph: for each sample (an app in this setup) a response time is plotted. This helps compare apps, for example the Equake app and the WordCloud app in the project.

Each application page is accessed twice as many times as the number of pages in each loop.



In the Django project the Equake app is heavier than the others. It accesses Leaflet for map tiles and remote external REST APIs, and caches results. It is expected that the app will have a few slow requests (the ones that trigger cache misses). The timeout for accessing external APIs is set to 4 seconds for the Equake application. The load test will reveal the impact the Equake app exerts on server resources and on other apps like WordCloud, tax statistics and Dota2.

Results

Without load balancing

Test time: 30 mins
Throughput: 922.498 per minute
Response time graph is shown below



As expected, the Equake app home page for the live earthquake view (yellow) has spikes. Other Equake app pages like the monthly, weekly and daily earthquake views time out at 4 seconds. The rest of the apps have elevated response times along with Equake, but not as much.

With load balancing

Test time: 5 min 22 seconds
Throughput: 1707.32 per minute
Response time graph is shown below



All apps except Equake have come down to response times at or below 200-300 milliseconds. The Equake pages that access external APIs time out at 4 seconds. The home page (live earthquake view) for the Equake app has a maximum response time of ~17 seconds, compared to 1.5 minutes without load balancing.

Insights

1) Throughput increases by close to a factor of 2.  

2) All apps except the Equake app behave consistently as expected under load. They start off with a response time of close to 1 second and quickly drop to <= 200 milliseconds. Caching also works predictably, as seen in the application logs. 

3) So, from the load test what is happening with the Equake app?

The remote external REST API has a high response time. This should go down like the other apps since caching is enabled, but the application logs reveal cache misses for the fetched external REST API data. The data was being fetched; however, the monthly and weekly earthquake GeoJSON data are too large, around 5-7 MB when saved to file. The default entry size limit in Memcached is 1MB, so the data was simply being discarded, causing the entire payload to be fetched again. The Memcached entry size limit can be increased with the -I flag. However, this means that ~7-8 MB of data will be fetched from the cache each time, and for this type of data the size also varies: 5 MB for this week's Equake data, 8 MB for the next. It would be better to use a dedicated cache with an increased item size limit. Memcached pools with different item sizes can be set up: regular apps use the default pool, and apps with larger entry size requirements utilise the other pool. Another approach would be to host the Equake app separately on dedicated app servers. Yet another approach would be to chunk the data and store it in the regular Memcached pool. A sketch of the two-pool idea is shown below.
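A sketch of the two-pool idea in Django settings; hosts and ports are placeholders, and the second pool's memcached instance would be started with a larger item size, e.g. memcached -I 8m.

    CACHES = {
        'default': {
            'BACKEND': 'django.core.cache.backends.memcached.MemcachedCache',
            'LOCATION': ['10.0.0.5:11211', '10.0.0.6:11211'],
        },
        'large_items': {
            'BACKEND': 'django.core.cache.backends.memcached.MemcachedCache',
            'LOCATION': '10.0.0.7:11211',  # started with -I 8m
        },
    }

    # The Equake app would then use the large-item pool explicitly:
    # from django.core.cache import caches
    # caches['large_items'].set(key, geojson_data, timeout=300)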




Sunday 10 June 2018

Signup form with Captcha and User profile/preferences page



Two new features have been added to the visualization and machine learning application.

1) Sign up form for new users:

This form asks new users for basic information to create their account. The form is also protected by the captchas app developed in the previous iteration. More about that here.

2) A User profile/preferences page:

The application has been updated with a user preferences page. This page allows users to change their settings within the application. At present users can change their
  • Preferred timezone: the timezone in which the user wants to view data.
  • Profile picture: an avatar shown on the navigation bar and in the profile page.

Tuesday 24 April 2018

Implementing Captchas


"Completely Automated Public Turing Tests to Tell Computers And Humans Apart (CAPTCHAs) are challenge-response tests used to determine whether or not the user is a human" -- Wikipedia

These are used to ensure that a human is indeed in front of the screen making a web request. Sensitive parts of the application are protected with captchas. A form that adds data to a database may need protection from automated postings: without it, a stream of automated posts can not only swamp the application but can also fill the database/disk with valid-looking junk, for example gibberish in a feedback form. 

Contemporary captchas are usually one of the following

1) One image or a set of images with a challenge question
An image will have text in it and the user has to type the text into a box. Another type is a set of images where the user has to pick a subset. Say you are shown 4 images and there are animals in only 2 of them; you have to select the ones with the animals.

2) A click request on a box
3) An audio playback 
4) Even a short video playback can be used

The last two have storage and bandwidth impact on a web application.

Requirements 

A pluggable captcha that can be used in any Django web application, and a mechanism to add and configure captchas with challenges. Once a captcha is added, the system must pick it up from there. The captchas have to be one level up in difficulty, i.e. something more than just 'enter the text', although those can also be used. 

Implementation

A single-image captcha with additional semantic requirements is implemented. A reusable Django app 'captchas' holds the model, form etc. to select and process captchas. The template can be included in any HTML form; the default display is a Bootstrap 4 card, and how and where this card renders on a form is up to the page designer. Django views just need to send a form in responses to GET requests and process the submitted form on a POST. The validation of the captcha is isolated in its form, as sketched below.
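A hedged sketch of how the validation could be isolated in the form; the Captcha model and its 'answer' field are assumptions, not the app's actual names.

    from django import forms
    from captchas.models import Captcha  # assumed model

    class CaptchaForm(forms.Form):
        captcha_id = forms.IntegerField(widget=forms.HiddenInput)
        response = forms.CharField(label='Your answer')

        def clean(self):
            cleaned = super().clean()
            captcha = Captcha.objects.get(pk=cleaned.get('captcha_id'))
            if cleaned.get('response', '').strip().lower() != captcha.answer.lower():
                raise forms.ValidationError('Captcha challenge failed.')
            return cleaned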

The add web page functionality in the HUD application is protected with these captchas. The implementation can ask not only for the text but for anything based on the image. This iteration includes captchas with challenges like 

- Type the text
- Type only the numbers seen in image
- Type only the letters
- Type the last four letters in reverse

Examples are


Or anything that can be inferred from the image, i.e. the challenge is configurable as shown. In this iteration basic coloured images were used. Strikethroughs, blurs and other effects on the images can further confuse models. It is also worthwhile to vary the size of the images, as that slightly increases the processing cost for an attacker.


Advantages

1) There is a one-to-many relation between images and challenges. With many images and challenges this approach can mitigate the effect of a sweatshop: a captcha image will show up with a different challenge each time, mitigating image-signature-based attacks. If an attacker is getting past this security, it has to be through expensive discipline.

2) There are free online captcha services that can easily be integrated into sites. However, these tend to follow one pattern or another, and the popular services may already have been subjected to continuous automated machine learning to create models. Such models are posed here with a custom, unfamiliar challenge, making things difficult.

3) The ability to change the challenge over time allows for reuse. This is because it is the challenge that holds the semantic requirement on a static image.

4) Even if the captcha images are harvested from the application, the challenge remains unknown. The challenge on a harvested image can be changed to a more complicated question.