Sunday, 15 July 2018

Porting/Upgrading a Python 2 library to Python 3

In this post we look at porting the coils library from Python 2.7 to Python 3.5.3 so that it runs on both versions. Coils is a data structure library written in 2015. It has implementations of basic data structures such as hash tables, binary search trees, splay trees, lists and heaps. The coils code is available here. The steps described here are based on the official porting documentation here; it is worth reading through it before attempting to port your project.

The ported code can be installed from https://pypi.org/project/pycoils/

Let's see the steps for porting the coils library:

1) Create two virtual environments, one for Python 2.7.15 and one for Python 3.5.3. For example:
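The exact commands depend on your setup; something like the following works (the environment names are illustrative):

    virtualenv -p python2.7 env27    # Python 2.7.15 environment
    python3.5 -m venv env35          # Python 3.5.3 environment
    source env27/bin/activate        # activate one of them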

2) pip install future in both environments.

3) Ensure that all tests pass in the Python 2.7.15 environment before beginning the port.

You need tests and good coverage in your project. The only way to be confident that the code works as expected on Python 3, and still on Python 2, after porting is to ensure that all tests pass on both versions. Here we see a screenshot of the project's 255 tests on both.
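Assuming the project follows the standard unittest layout (the actual test runner may differ), the suite can be run in each environment with:

    # run inside each virtual environment
    python -m unittest discover -v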




coils also had good test coverage as shown below.


If you do not have any tests in your project, there is no way to identify regressions introduced by porting. Porting involves not only automated refactoring tools but also manual changes, and tests are the only way to verify them.

If your coverage is low, write or modify tests to increase it.

4) Run futurize --stage1 -w **/*.py

(When using the globstar pattern **, make sure it is enabled in your shell.)

This modernizes the code without introducing any new dependencies. The command prints a lot of output describing the refactorings it applied, as shown below.


Once the run completes without errors, you can use a diff tool to see what was changed to support Python 3 in stage 1. In this example, the Python 3 style in operator replaces the removed has_key method.
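Stage 1 rewrites Python 2 only idioms into forms that are valid on both versions. An illustrative before/after (not the exact coils diff):

    # Python 2 only
    if table.has_key(key):
        handle(key)

    # After futurize stage 1: works on Python 2 and Python 3
    if key in table:
        handle(key)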


5) Commit the changes. Rerun the tests in the Python 2 environment and manually fix any errors until all tests pass.

6) Run futurize --stage2 -w **/*.py

The aim of stage 2 is to make the code work on Python 3, and still on Python 2 via the future package. As before, the command outputs the refactorings it performed.


The changes can again be checked with a diff tool. Here we see how division is made to behave identically on Python 3 and on Python 2 with the future package.
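An illustrative example of what stage 2 does with division (not the exact coils diff): dividing two ints with / returns a float on Python 3 but a truncated int on Python 2, so futurize makes the old behaviour explicit:

    from past.utils import old_div

    # Python 2 style: (low + high) / 2 truncates to an int
    # After stage 2: explicit, identical behaviour on both versions
    mid = old_div(low + high, 2)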



7) Run the tests again on Python 3. Note that the test cases themselves have also been ported and may need manual changes to run on both Python 3 and Python 2.

Here we see some manual fixes to the test cases: a) dropping deprecated keyword arguments and b) replacing deprecated TestCase method aliases.
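For example, the deprecated TestCase aliases map onto their modern equivalents like this (illustrative):

    # Deprecated aliases
    self.assertEquals(result, expected)
    self.failUnless(flag)

    # Preferred methods, valid on both versions
    self.assertEqual(result, expected)
    self.assertTrue(flag)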





8) Once the tests pass on both Python 2 and Python 3, you have a stable version that can be used in either environment.

Coils has an examples package which demonstrates how the library's data structures can be used in code, for example creating a min heap, adding data and performing heap operations.

The examples are also run on both versions to ensure that everything works as expected.



There may be some post-conversion steps, depending on your code. After those, with tests and examples running on both versions, we have coils for Python 3 and Python 2.

Thursday, 28 June 2018

Load testing and insights

In this post the visualisation/ml web application is load tested with JMeter. More about the application here. The setup is described in the following video



The application architecture is here

The first run performs a load test on a single deployment cell, i.e. only one application server in the pool; the rest of the architecture remains the same. In the second run, two deployment cells are load balanced and JMeter is pointed at the load balancer.

In both runs the application is tested with load slightly above the expected load. In addition to monitoring and understanding server resources and the application stack under load, we can also notice expected and unexpected application behaviour. As we will see in this post, monitoring the application logs during such tests can help identify and address some blind spots.

Topology

#Load balancer: 1 Nginx deployment
#Web app servers: 2 vms
#Database hosts: 2 vms
#Media and assets hosts: 1 vm
#Memcached hosts: 2
#Celery hosts: 3 (2 shared with memcached hosts)

JMeter test details

#Users: 100
#Ramp up: 10
#Loop: 4

Listeners

Graph results: This gives an overall throughput number.
Response time graph: For each sample (an app page in this setup), the response time is plotted. This helps to compare apps, for example the Equake app and the WordCloud app in the project.

Each of the application pages is accessed twice as many times as the number of pages in each loop.



In the Django project, the Equake app is heavier than the others. It accesses Leaflet for map tiles, calls remote external REST APIs and caches the results. It is expected that a few of its requests (the ones that trigger cache misses) will be slow. The timeout for accessing external APIs is set to 4 seconds in the Equake application. The load test will reveal the impact the Equake app exerts on server resources and on other apps like WordCloud, tax statistics and Dota2.
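A minimal sketch of this fetch-and-cache pattern, assuming the Django cache API and the requests library (the URL, cache key and expiry are placeholders, not the project's actual values):

    import requests
    from django.core.cache import cache

    FEED_URL = 'https://example.com/earthquakes/monthly.geojson'  # placeholder

    def get_monthly_quakes():
        data = cache.get('equake:monthly')
        if data is None:
            # External API call bounded by the 4 second timeout mentioned above
            response = requests.get(FEED_URL, timeout=4)
            response.raise_for_status()
            data = response.json()
            cache.set('equake:monthly', data, 15 * 60)  # cache for 15 minutes
        return data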

Results

Without load balancing

Test time: 30 mins
Throughput: 922.498 per minute
Response time graph is shown below



As expected, the Equake app home page with the live earthquake view (yellow) has spikes. Other Equake app pages, such as the monthly, weekly and daily earthquake views, time out at 4 seconds. The rest of the apps also show elevated response times, though not as high as Equake's.

With load balancing

Test time: 5 min 22 seconds
Throughput: 1707.32 per minute
Response time graph is shown below



All apps except Equake have come down to response times of 200-300 milliseconds or less. The Equake pages that access external APIs still time out at 4 seconds. The Equake home page (live earthquake view) has a maximum response time of about 17 seconds, compared to 1.5 minutes without load balancing.

Insights

1) Throughput increases by close to a factor of 2.

2) All apps except the Equake app behave consistently as expected under load. They start off with a response time of close to 1 second and quickly drop down to <= 200 milliseconds. Caching also works predictably as seen in the application logs. 

3) So what does the load test reveal about the Equake app?

The remote external REST API has a high response time. But this should drop like the other apps' once caching kicks in. The application logs reveal cache misses for the fetched external REST API data: the data was being fetched, but the monthly and weekly earthquake GeoJSON payloads are too large, around 5-7 MB when saved to file. The default entry size limit in Memcached is 1 MB, so the entries were being silently discarded, causing the entire data set to be fetched again on every request. The Memcached entry size limit can be increased with the -I flag. However, that means about 7-8 MB of data is fetched from the cache each time, and for this type of data the size also varies: 5 MB for this week's earthquake data, 8 MB for the next. It would be better to use a dedicated cache with an increased item size limit. Memcached pools with different item size limits can be set up: regular apps use the default pool, and apps with larger entries use the other pool. Another approach is to host the Equake app separately on dedicated app servers. Yet another is to chunk the data and store it in the regular Memcached pool, as sketched below.
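A hedged sketch of the chunking approach using Django's cache API (the key scheme, chunk size and expiry are assumptions for illustration):

    import json
    from django.core.cache import cache

    CHUNK_SIZE = 900 * 1024  # stay under Memcached's default 1 MB item limit

    def cache_large(key, obj, timeout=15 * 60):
        blob = json.dumps(obj)
        chunks = [blob[i:i + CHUNK_SIZE] for i in range(0, len(blob), CHUNK_SIZE)]
        cache.set('%s:count' % key, len(chunks), timeout)
        for index, chunk in enumerate(chunks):
            cache.set('%s:%d' % (key, index), chunk, timeout)

    def get_large(key):
        count = cache.get('%s:count' % key)
        if count is None:
            return None
        parts = [cache.get('%s:%d' % (key, i)) for i in range(count)]
        if any(part is None for part in parts):
            return None  # a chunk was evicted; treat as a cache miss
        return json.loads(''.join(parts))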




Sunday, 10 June 2018

Signup form with Captcha and User profile/preferences page



Two new features have been added to the visualization and machine learning application.

1) Sign up form for new users:

This form asks new users for basic information to create their account. It is also protected by the captchas app developed in the previous iteration; more about that here

2) A User profile/preferences page:

The application has been updated with a user preferences page. This page allows users to change their settings within the application. At present users can change their
  • Preferred timezone: the timezone in which the user wants to view data.
  • Profile picture: an avatar shown on the navigation bar and in the profile page.

Tuesday, 24 April 2018

Implementing Captchas


"Completely Automated Public Turing Tests to Tell Computers And Humans Apart (CAPTCHAs) are challenge-response tests used to determine whether or not the user is a human" -- Wikipedia

These are used to ensure that a human is actually at the screen making a web request. Sensitive parts of an application are protected with captchas. A form that adds data to a database may need to be protected from automated postings. Without protection, a stream of automated posts can not only swamp the application but also fill the database/disk with valid-looking junk, for example gibberish in a feedback form.

Contemporary captchas are usually one of the following

1) One image or a set of images with a challenge question
An image contains text and the user has to type the text into a box. Another type shows a set of images and the user has to pick a subset of them: say you are shown 4 images and only 2 contain animals, you have to select the 2 with animals.

2) A click request on a box
3) An audio playback 
4) Even a short video playback can be used

The last two have storage and bandwidth impact on a web application.

Requirements 

A pluggable captcha that can be used in any Django web application, with a mechanism to add and configure captchas with challenges; once a captcha is added, the system must pick it up from there. Captchas have to be one level up in difficulty, i.e. something more than just 'enter the text', although that kind can also be used.

Implementation

A single image captcha with additional semantic requirements is implemented. A reusable Django app, captchas, holds the model, form etc. to select and process captchas. The template can be included in any HTML form; the default display is a Bootstrap 4 card, and how and where this card renders on a form is up to the page designer. Django views just need to send the form in responses to GET requests and process the submitted form on POST. Validation of the captcha is isolated in its form.
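A minimal sketch of what the form side might look like (the model and field names here are assumptions for illustration, not the project's actual code):

    from django import forms
    from django.core.exceptions import ValidationError

    from .models import Captcha  # hypothetical model: image, challenge text, expected answer

    class CaptchaForm(forms.Form):
        # Hidden field identifying which captcha/challenge pair was shown
        captcha_id = forms.IntegerField(widget=forms.HiddenInput)
        answer = forms.CharField(max_length=64)

        def clean(self):
            cleaned = super(CaptchaForm, self).clean()
            answer = (cleaned.get('answer') or '').strip()
            try:
                captcha = Captcha.objects.get(pk=cleaned.get('captcha_id'))
            except Captcha.DoesNotExist:
                raise ValidationError('Unknown captcha.')
            # Validation stays isolated in the form: the challenge defines the answer
            if answer.lower() != captcha.expected_answer.lower():
                raise ValidationError('Captcha answer is incorrect.')
            return cleaned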

The add-web-page functionality in the HUD application is protected with these captchas. This implementation can ask not just for the text but for anything based on the image. This iteration includes captchas with challenges like

- Type the text
- Type only the numbers seen in image
- Type only the letters
- Type the last four letters in reverse

Examples are


Or anything else that can be inferred from the image, i.e. the challenge is configurable as shown. In this iteration basic colored images were used. Strikethroughs, blurs and other effects on the images can further confuse models. Varying the size of the images also helps, as it slightly increases an attacker's processing cost.


Advantages

1) There is a one-to-many relation between images and challenges. With many images and many challenges, this approach can blunt the effect of a captcha-solving sweatshop: the same captcha image shows up with a different challenge each time, mitigating image-signature-based attacks. If an attacker does get past the security, it has to be through expensive, disciplined effort.

2) There are free online captcha services that can be easily integrated into sites. However, these tend to follow one pattern or another, and the popular services may already have been subjected to continuous automated machine learning to create solver models. A custom, unfamiliar challenge poses a harder problem for such models.

3) The ability to change the challenge over time allows images to be reused, since it is the challenge that holds the semantic requirement on a static image.

4) Even if the captcha images are harvested from the application, the challenge remains unknown. The challenge on a harvested image can be changed to a more complicated question.

Saturday, 7 April 2018

Migrating to Bootstrap 4

This post describes the effort of migrating the HUD project to the latest stable version 4 of Bootstrap. Screenshots from mobile, tablet and wide screens are posted at the end.


Project details


# Web framework: Django 1.10
# HTML files / Templates: 66
# Custom css files: 10
# Effort: 4.5 days

Approach

Branch out and import Bootstrap 4 into the project. Things will look out of place at first. Go through the official Migration Guide for each user interface component used and adopt the changes.

 

Notable points

Panels: There are no panels in Bootstrap 4. If you have been using a lot of them, you need to find and replace them with cards. Migration details link.

Cards: These can do everything panels did in the application and more, so the move was not fruitless. The most important advantage in the application is better UX all the way from a really small mobile, through tablets, to the desktop. It was easier to create a tile-like structure with images out of cards; previously this was done with custom CSS.

Navbar: There are a number of class changes for navbar elements. Glyphicons are no longer part of Bootstrap; this did not have much impact here, but they need to be brought in manually. The new approach avoids nesting ul, li and a tags to create navbar items: a dropdown-menu class can be applied to a div, and a dropdown-item class makes an a tag a drop-down item, with no li nesting. This can wreak havoc if CSS combinators were used to build custom navbars. There was limited impact on the project and the CSS was migrated. Migration details link.

Display/hide: Some classes have been dropped altogether. For example, hidden-sm was used to hide a component on small screens and save space on the navbar. This is replaced by a combination of display classes and breakpoints such as .d-none .d-sm-block. This is explained here. img-responsive is now img-fluid.

Forms: It is always important to keep forms simple; this is one area where both users and developers can complicate the layout. Forms were kept simple in the application, so migration was not much of a problem with the changes listed here. Labels have new classes, including ones to control sizes. The help-block class for help text is now the form-text class, which is a better option in a Django application: the error check in the template can be moved outside the element.

Grids: Although this can appear daunting, a quick read helps to push through the changes. In reality the added tier gives finer control over flexible display. After migration the application was tested on a portrait mobile device (a Lumia 630!), a tablet in portrait and landscape, and a wide-screen laptop. The screen widths at which the transitions occur have changed; the exact details are here.

Margin and padding: New systematically named classes are in place, which help control the values at each breakpoint. mb-4 is a simple one, applying the level-4 bottom margin (1.5rem by default). Another class, mr-auto, helps with right-aligned navbar links.

Flex layout: d-flex was used to create a container holding the avatar icon and user name in the navbar. More about flex here. This was used in conjunction with mr-auto, one of the new margin classes, to right-align the avatar and user name.

The effort was worth it when the result is viewed on multiple screens. Some screenshots from mobile, tablet and desktop follow.

Mobile 480 x 854 pixels


Samsung Galaxy Tab S2 1536 x 2048 (portrait)


4K Laptop

Friday, 30 March 2018

Speeches: Content Analysis


Visualising topical content of 5 speeches

1) 'Duty, Honour, Country' speech by General MacArthur
2) 'Why I killed Gandhi' speech by Nathuram Godse
3) The Cuban Missile Crisis Address to the Nation by President Kennedy
4) The Pearl Harbor Address to the Nation by President Roosevelt
5) The League of Nations talk by President Wilson


Sunday, 11 February 2018

Updated* Machine Learning | Insights from Visualisation | Multi-dimensional data


Data sets for machine learning are multi-dimensional. The number of dimensions depends on the data domain. For example, data for a collaborative item-to-item recommender includes users and items, which fit easily into a co-occurrence matrix. However, medical diagnosis data has more than 3 dimensions, as do air pollution measurements, vehicle stats and other data sets. When the number of dimensions is 5 or less, it is easy to visualise the data before deciding on an approach to machine learning.

Visualising data before applying machine learning has advantages.

1) Visually identify distinctive patterns in particular dimensions. Those dimensions are likely more important for decision making.

2) Help decide on candidate model(s). For example, would it be better to use a Random Forest classifier, a Support Vector Machine or a Nearest Neighbour classifier?

3) Select a subset of dimensions (before model fitting) that are more suitable for machine learning. Not all collected dimensions turn out useful. Subsequently adding dimensions leads to different use cases and insights.

The rest of the post is divided into
1) Visualization
2) Prediction results using Classifiers
3) Plotting feature weights from fitted models to confirm insights from visualization.

1) Visualization

The sample data set for this post is the Breast Cancer Wisconsin (Diagnostic) Data Set from the University of California, Irvine Machine Learning Repository. The dimensions are measurements of cell nuclei; there are 30 of them. How are the benign and malignant cell readings distributed across the dimensions? In other words, what differs between benign and malignant cells? Visualising the data on parallel coordinates gives a sense of its dimensions. (Red = Malignant, Blue = Benign)
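The post uses an interactive visualisation; a static approximation can be sketched with pandas (normalisation and colours are illustrative choices):

    import matplotlib.pyplot as plt
    import pandas as pd
    from pandas.plotting import parallel_coordinates
    from sklearn.datasets import load_breast_cancer

    data = load_breast_cancer()
    features = pd.DataFrame(data.data, columns=data.feature_names)
    # Normalise each axis so dimensions with large ranges do not dominate
    features = (features - features.min()) / (features.max() - features.min())
    features['diagnosis'] = [data.target_names[t] for t in data.target]

    parallel_coordinates(features, 'diagnosis', color=('red', 'blue'))
    plt.show()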


As an example, the perimeter dimension in the top chart is interesting; filtering on it looks like this.
While the perimeter dimension alone cannot support a binary decision, the visualisation shows that the compactness and concavity dimensions can help to decide from that point on. When you interact with the visualization above, notice that it starts to look like decision making. So is this data set a better candidate for a Random Forest classifier, a Nearest Neighbour classifier or an SVC?

2) Prediction results

Applying these classifiers to held-out test data yields different levels of prediction accuracy. The Random Forest classifier is more accurate than the Nearest Neighbour classifier. The results of applying these classifiers are shown below; RandomForestClassifier/GradientBoostingClassifier is the best choice, as was already suggested by the visualization.
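A minimal sketch of such a comparison with scikit-learn (hyperparameters and split are illustrative, and exact scores depend on the split):

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.svm import SVC

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    for clf in (RandomForestClassifier(n_estimators=100, random_state=0),
                GradientBoostingClassifier(random_state=0),
                KNeighborsClassifier(),
                SVC()):
        clf.fit(X_train, y_train)
        # Accuracy on the held-out split
        print(type(clf).__name__, clf.score(X_test, y_test))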



3) Confirm dimension insights from visualization
Fitted tree models expose an attribute, feature_importances_, that gives the weight of each dimension. This lets us confirm whether the dimensions that looked important in the visualization really have predictive value.
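Continuing the sketch above, the weights can be read off a fitted forest and plotted (the plotting details are illustrative):

    import matplotlib.pyplot as plt
    import numpy as np

    feature_names = load_breast_cancer().feature_names
    forest = RandomForestClassifier(n_estimators=100, random_state=0)
    forest.fit(X_train, y_train)

    # Sort dimensions by importance for a readable horizontal bar chart
    order = np.argsort(forest.feature_importances_)
    plt.barh(np.arange(len(order)), forest.feature_importances_[order])
    plt.yticks(np.arange(len(order)), feature_names[order])
    plt.xlabel('feature importance')
    plt.tight_layout()
    plt.show()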

From the two plots, notice that the radius, perimeter, concavity and concave_points dimensions carry more weight than the others in both classifiers, and that the standard-error dimensions are less important.