Tuesday, 24 April 2018

Implementing Captchas

"Completely Automated Public Turing Tests to Tell Computers And Humans Apart (CAPTCHAs) are challenge-response tests used to determine whether or not the user is a human" -- Wikipedia

These are used to ensure that a human is indeed in front of the screen making a web request. Sensitive parts of an application are protected with captchas. A form that adds data into a database may need to be protected from accepting automated postings. Without protection, a stream of automated posts can not only swamp the application but also fill the database/disk with valid junk, for example gibberish in a feedback form.

Contemporary captchas are usually one of the following

1) One image or a set of images with a challenge question
An image will have text in it. The user has to type the text into a box. Another type is a set of images and the user has to pick a subset of the images. Say you are shown 4 images and there are animals in only 2 of them. You have to select the correct ones with the animals.

2) A click request on a box
3) An audio playback 
4) Even a short video playback can be used

The last two have storage and bandwidth impact on a web application.


The requirement was a pluggable captcha that can be used in any Django web application, with a mechanism to add and configure captchas with challenges. Once a captcha is added, the system must pick it up from there. Captchas have to be one level up in difficulty, i.e. something more than just 'enter the text', although these can also be used.


A single image captcha with additional semantic requirements is implemented. A reusable Django app 'captchas' holds the model, form, etc. to select and process captchas. The template can be included in any HTML form. The default display is as a Bootstrap 4 card. How and where this card renders on a form is up to the page designer. Django views just need to send a form in the response to a GET and process the submitted form in a POST. The validation of the captcha is isolated in its form.

The 'add web page' functionality in the HUD application is protected with these captchas. This implementation can ask not only for the text but also for anything based on the image. This iteration includes captchas with challenges like

- Type the text
- Type only the numbers seen in image
- Type only the letters
- Type the last four letters in reverse

Examples are

Or anything that can be inferred from the image, i.e. the challenge is configurable as shown. In this iteration basic colored images were used. Using strike-throughs, blurs and other effects on the images can further confuse models. It is also important to change the size of the image as this slightly increases the processing cost.
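Each challenge listed above can be treated as a function from the text in the image to the answer the user is expected to type. The following is a minimal Python sketch of that idea. It is not the code of the 'captchas' app; the function names, the dictionary and the exact-match comparison are assumptions made for illustration.

```python
# Hypothetical sketch: each challenge maps the text rendered in the image
# to the answer the user is expected to type.

def only_digits(text):
    """Keep only the digits, in the order they appear."""
    return "".join(ch for ch in text if ch.isdigit())


def only_letters(text):
    """Keep only the letters, in the order they appear."""
    return "".join(ch for ch in text if ch.isalpha())


def last_four_letters_reversed(text):
    """Take the last four letters and reverse them."""
    return only_letters(text)[-4:][::-1]


CHALLENGES = {
    "Type the text": lambda text: text,
    "Type only the numbers seen in image": only_digits,
    "Type only the letters": only_letters,
    "Type the last four letters in reverse": last_four_letters_reversed,
}


def is_correct(image_text, challenge, response):
    """Compare the response with the answer derived from the image text.

    Assumes a case-sensitive exact match after trimming whitespace.
    """
    expected = CHALLENGES[challenge](image_text)
    return response.strip() == expected
```

Because the answer is derived from the image text at validation time, adding a new challenge only means adding a new function; the stored image does not change.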


1) There is a one-to-many relation between the images and challenges. With many images and challenges this approach can mitigate the effect of a sweatshop. A captcha image will show up with a different challenge, thus mitigating image signature based attacks. If an attacker is getting past the security then it has to come at considerable expense.

2) There are free online captcha services that can be easily integrated into sites. However, these tend to follow one pattern or another. The popular services may have already been subjected to continuous automated machine learning to create models. Here such models are faced with a custom, unfamiliar challenge, which makes the captcha difficult for them.

3) Ability to change the challenge over time allows for reuse. This is because it is the challenge that can hold a semantic requirement on a static image.

4) Even if the captcha images are harvested from the application, the challenge remains unknown. The challenge on a harvested image can be changed to a more complicated question.
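The one-to-many relation between an image and its challenges described in the points above can be sketched in a few lines of Python. This is an illustration under assumptions, not the model used in the 'captchas' app: the `CaptchaImage` class, its fields and `pick_captcha` are hypothetical names.

```python
import random
from dataclasses import dataclass, field


@dataclass
class CaptchaImage:
    """One stored image with any number of challenges attached to it."""
    filename: str   # image file shown to the user
    text: str       # text rendered in the image
    # Each challenge is a (prompt, solver) pair; solver maps text to answer.
    challenges: list = field(default_factory=list)


def pick_captcha(pool, rng=random):
    """Pick a random image, then a random challenge attached to that image."""
    image = rng.choice(pool)
    prompt, solver = rng.choice(image.challenges)
    return image, prompt, solver(image.text)
```

Appending a new (prompt, solver) pair to `challenges` changes the expected answer for an image that may already have been harvested, without regenerating the image.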

Saturday, 7 April 2018

Migrating to Bootstrap 4

This post lists the effort in migrating HUD project to the latest stable version 4 of Bootstrap. Screens from mobile, tablet and wide screens are posted at the end.

Project details

# Web framework: Django 1.10
# HTML files / Templates: 66
# Custom css files: 10
# Effort: 4.5 days


Branch out and import Bootstrap 4 into the project. Things will look out of place. Go through the official Migration Guide for each user interface component used and adopt the changes.


Notable points

Panels: There are no panels in bootstrap 4. If you have been using a lot of them, then you need to find and replace them with cards. Migration details link.

Cards: These can do all that panels did in the application and more. So moving was not fruitless. The most important advantage in the application is the better ux on a really small mobile, through the tablet to desktop. It was easier to create a tile like structure with images out of cards. Previously this was done using custom css. 

Navbar: There are a number of changes in the classes for navbar elements. Glyphicons are no longer part of Bootstrap. Although this did not have much of an impact, they need to be brought in manually. The new approach can avoid nesting ul, li and a tags to create navbar items: a dropdown-menu class can be applied to a div, and a dropdown-item class makes an a tag a drop down item, so nesting with li is avoided. This can wreak havoc if there are css combinators applied to create custom navbars. There was limited impact on the project and the css was migrated. Migration details link

Display/hide: Some classes have been dropped altogether. For example, hidden-sm was used to hide a component on small screens and save space on the navbar. This has been replaced by combining display values with screen sizes, as in .d-none .d-sm-block. This is explained here. img-responsive is now img-fluid.

Forms: It is always important to keep forms simple. This is one area where both users and developers can make the layout complicated. Forms were kept simple in the application, so migration was not much of a problem with the changes listed here. Labels have new classes, including ones to control size. The help-block class for help text is now the form-text class. This is a better option in a Django application. The error check in the template can be moved outside the element.

Grids: Although this can appear daunting, a quick read can help to push through the changes. In reality the addition of a tier gives finer control of flexible display. For example, the application after migration was tested on a portrait mobile device (Lumia 630!), a tablet in portrait and landscape, and a wide screen laptop. The screen widths at which the transitions occur have changed. The exact details are here.

Margin and Padding: New spacing classes are in place. These help to control the values for each break point. mb-4 is a simple one; with the default settings it sets a bottom margin of 1.5rem. Another class, mr-auto, helps with right aligned navbar links.

Flex layout: d-flex was used to create a container to hold the avatar icon and user name in the navbar. More about Flex here. This was used in conjunction with mr-auto, one of the new margin classes, to right align the avatar and user name.
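The find-and-replace part of the migration can be partly scripted. The following Python sketch rewrites class names inside class="..." attributes. It is an illustration only and not the process used for this project. The rename table covers the panel-to-card equivalents and the two renames mentioned in the notes above; the function name is hypothetical, and the regular expression assumes double-quoted class attributes.

```python
import re

# Bootstrap 3 class -> Bootstrap 4 class. Only direct renames are listed;
# classes without a one-to-one equivalent (for example panel-default or
# hidden-sm) are left untouched for manual review.
RENAMES = {
    "panel": "card",
    "panel-heading": "card-header",
    "panel-title": "card-title",
    "panel-body": "card-body",
    "panel-footer": "card-footer",
    "img-responsive": "img-fluid",
    "help-block": "form-text",
}

# Assumes attributes are written as class="..." with double quotes.
CLASS_ATTR = re.compile(r'class="([^"]*)"')


def migrate_classes(html):
    """Return html with known Bootstrap 3 class names renamed."""
    def swap(match):
        tokens = match.group(1).split()
        renamed = " ".join(RENAMES.get(token, token) for token in tokens)
        return 'class="{}"'.format(renamed)

    return CLASS_ATTR.sub(swap, html)
```

Matching whole class tokens rather than substrings avoids turning panel-default into card-default, which does not exist. The output still needs a visual check, since cards and panels do not render identically.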

The effort was worth it when viewed on multiple screens. Some screens from mobiles, tablets and desktops follow.

Mobile 480 x 854 pixels

Samsung Galaxy Tab S2 1536 x 2048 (portrait)

4K Laptop

Friday, 30 March 2018

Speeches: Content Analysis

Visualising topical content of 5 speeches

1) 'Duty, Honour, Country' speech by General MacArthur
2) 'Why I killed Gandhi' speech by Nathuram Godse
3) The Cuban Missile Crisis Address to Nation by President Kennedy
4) Pearl Harbour Address to Nation by President Roosevelt
5) The League of Nations talk by President Wilson

Sunday, 11 February 2018

Updated* Machine Learning | Insights from Visualisation | Multi-dimensional data

Data sets for machine learning are multi dimensional. The number of dimensions depends on the data domain. For example data for a collaborative item-to-item recommender includes a user and an item. This can easily be fit to a co-occurrence matrix. However, data from medical diagnosis have more than 3 dimensions. This is the same for air pollution measurements, vehicle stats and other data sets. When the number of dimensions is 5 or less, it is easy to visualise data before deciding on an approach to machine learning.

Visualising data before applying machine learning has advantages.

1) Visually identify unique patterns in dimensions. Those dimensions are likely more important to help in decision making.

2) Help to decide candidate model(s). For example, would it be better to use a Random Forest Classifier, a Support Vector Machine or a Nearest Neighbour Classifier?

3) Select a subset of dimensions (before model fitting) that are more suitable for machine learning. Not all collected dimensions turn out useful. Subsequently adding dimensions leads to different use cases and insights.

The rest of the post is divided into
1) Visualization
2) Prediction results using Classifiers
3) Plotting feature weights from fitted models to confirm insights from visualization.

1) Visualization

The sample data set for this post is the Breast Cancer Wisconsin (Diagnostic) Data Set from University of California, Irvine Machine Learning Repository. The dimensions are measurements on cell nuclei. There are 30 dimensions. How are the Benign and Malignant cell readings distributed across the dimensions? In other words what is different between Benign and Malignant cells? Visualising the data on parallel coordinates gives a sense of its dimensions. (Red=Malignant Blue=Benign)

 As an example the perimeter dimension in the top chart is interesting. Filtering based on that looks like this.
While the perimeter dimension cannot help with a binary decision, visualisation shows that the compactness and concavity dimensions can help to decide from that point on. When you go through the above visualization and interact with it, notice that it looks more like decision making. So is this data set a good candidate for Random Forest Classifier or Nearest Neighbour Classifier or SVC?

2) Prediction results

Applying these classifiers to out-of-bag test data yields different levels of accuracy in prediction. Random Forest Classifier is more accurate than Nearest Neighbour Classifier. The result of applying these classifiers is shown below. RandomForestClassifier/GradientBoostingClassifier is the best choice, as was clear from the visualization.

3) Confirm dimension insights from visualization
Fitted tree models have an attribute that shows the weight of each dimension. This allows us to confirm if the dimensions we thought were important from the visualization are really of any predictive value.

From the two plots notice that the dimensions radius, perimeter, concavity, concave_points have more weights than others in both classifiers. Notice that the standard error dimensions are less important.
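The prediction and feature weight steps can be sketched with scikit-learn, which ships a copy of the same Breast Cancer Wisconsin (Diagnostic) data set. This is a minimal sketch and not the code behind the results in this post: the 70/30 split, the random seed and the classifier parameters are assumptions, so the numbers it produces will differ from the plots.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


def compare_classifiers(random_state=0):
    """Fit two classifiers and rank the dimensions by random forest weight."""
    data = load_breast_cancer()
    X_train, X_test, y_train, y_test = train_test_split(
        data.data, data.target,
        test_size=0.3, random_state=random_state, stratify=data.target,
    )

    forest = RandomForestClassifier(n_estimators=100, random_state=random_state)
    forest.fit(X_train, y_train)
    knn = KNeighborsClassifier(n_neighbors=5)
    knn.fit(X_train, y_train)

    # Accuracy on the held-back 30% of the rows.
    scores = {
        "random_forest": forest.score(X_test, y_test),
        "nearest_neighbour": knn.score(X_test, y_test),
    }

    # feature_importances_ holds one weight per dimension, summing to 1.
    ranked = sorted(
        zip(data.feature_names, forest.feature_importances_),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return scores, ranked
```

Printing the first few entries of `ranked` gives the dimensions to compare against the ones that stood out on the parallel coordinates chart.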

Tuesday, 23 January 2018

Multi-DB design for high performance web applications

Applications with a small user base employ a design with a single centralised database. The rate of data growth and load do not create contention for such applications. So a single db design will work well. However, each database has a maximum connection limit. PostgreSQL is 'good for a few hundred connections, but for 1000s it would be better to look at a connection pooling solution'. Caching and db connection pooling can be utilised to postpone hitting the limit. This single db approach is easy to maintain, test and orchestrate with developer operations. 

When there are specific performance requirements for the number of users to support, API requests per second, read/write patterns and data isolation, the single database design would be broken by design. Hitting its upper limit on performance/load will impact users. A competitor with a better design to address this will win. A multi-database design addresses the shortcomings by separating data across multiple databases.

A) Data can be logically partitioned. For example, user authentication and profile information can reside on one database while other application data, such as catalog, reviews and feedback, can reside in a different database. That way authentication connections are routed to that specific server while the other servers can utilise connections fully to provision data.

B) Another approach is to partition based on read and write operations. All writes will go to one database. This database will be periodically synced to another read-only database. This helps when there are regular writes to the db but the read density is too high. Multiple read-only slaves which mirror the primary can be used as needed.

C) A third approach is to use dedicated databases for different regions/locations. This is useful when usage varies across locations. Each database can be configured differently to handle its own access/load patterns. For example, utilise separate databases to store data from each state, New South Wales, Victoria, WA and so on.

D) To partition all data with redundancy, it would be better to use a database that has sharding built into its design. MongoDB is a good choice. This is described in a previous post.
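In a Django project, approaches A and B above can be expressed as database routers, which are plain Python classes that Django consults for each query. The sketch below is one possible shape and not a design taken from a real project. The aliases auth_db, primary, replica1 and replica2 are hypothetical and would have to exist in the DATABASES setting, and keeping replicas in sync is left to the database's own replication.

```python
import random


class AuthRouter:
    """Approach A: keep authentication data on its own database."""

    auth_apps = {"auth", "contenttypes"}
    auth_alias = "auth_db"  # hypothetical alias from settings.DATABASES

    def db_for_read(self, model, **hints):
        if model._meta.app_label in self.auth_apps:
            return self.auth_alias
        return None  # no opinion; Django asks the next router

    def db_for_write(self, model, **hints):
        return self.db_for_read(model, **hints)

    def allow_migrate(self, db, app_label, model_name=None, **hints):
        if app_label in self.auth_apps:
            return db == self.auth_alias
        return None


class PrimaryReplicaRouter:
    """Approach B: writes go to the primary, reads are spread over replicas."""

    primary = "primary"                  # hypothetical aliases
    replicas = ["replica1", "replica2"]

    def db_for_read(self, model, **hints):
        return random.choice(self.replicas)

    def db_for_write(self, model, **hints):
        return self.primary

    def allow_relation(self, obj1, obj2, **hints):
        # Primary and replicas hold the same data, so relations are allowed.
        return True

    def allow_migrate(self, db, app_label, model_name=None, **hints):
        # Assumption: schema changes run on the primary and are replicated.
        return db == self.primary
```

The routers are activated by listing their dotted paths in the DATABASE_ROUTERS setting, with AuthRouter first so that it gets the first say on authentication models.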

Multiple databases mean more effort in development, operations, support and maintenance. So tangible performance benefits must justify their use.

Friday, 29 December 2017

Australia Tax Stats 2011-12 | Visualisation & Insights

This post visualises Australian Tax data for Individuals during the income year 2011-12. This web application allows the user to explore tax filing data to derive a number of insights. The data is available from the Australia Government website here

Watch the video for a look at this application. 

Specific Insights

1) Size of the workforce has not changed much: If the Wikipedia page on Australian economy is correct, then the workforce has not changed from 2011-12. The workforce count on the data confirms this.

2) Workforce made up of Equal numbers from both genders: Tax data shows that almost the same number of men and women figure in the filings.

3) State-wise gender gap in registered salary: Although women are roughly equal in number to men in the workforce, the data shows that in WA their cumulative salary is about half that of men in dollar terms. This gap varies between the states. Since the number of women is almost the same, this wage gap triggers cause-analysis such as:

  1. Women are paid less in the same job compared to men.
  2. If (1) is not true, then women are not in the same pay band/scale or jobs as men. This would explain why the salary is less even though women make up the same head count.

4) Taxable income/salaries for those under 18 and above 70: The data shows that there is healthy income (but < 1 Billion) for these two age groups.

5) Comparison of states: On all accounts the ranking is NSW, Victoria, Queensland, WA, SA, TAS and other.

There are other data points like child support by filing individual, refundable, exemptions etc which can be analysed.

Thursday, 14 December 2017

Data visualisation: Plotting USGS earthquakes

This post makes use of USGS's live feed of earthquakes. The feed is plotted on a Leaflet map to visualize the data. Leaflet is a JavaScript library for interactive maps (http://leafletjs.com/). The advantage is that it consumes GeoJSON well.

Feeds are categorised into all quakes and significant ones. The app's hourly feed is checked continuously for new data. When new data is available the map is updated. For each point of interest a Leaflet layer group is utilised. This holds a circular marker layer and a place holder. A popup layer is also added which displays text with the title from the GeoJSON. This title includes the magnitude and location of the quake. All layer groups are added to the map's layer group. This allows individual layers associated with a point to be updated or removed when data is updated.

Although not all earthquakes are significant, it is clear from the data that earthquakes are more common along plate boundaries and fault lines (not surprising but nice to visualise). Switching to a weekly or monthly view makes vulnerable locations apparent.

Significant quakes are also seen on a daily, weekly and monthly basis. Adding filters to query on magnitude, or changing the circle marker size based on magnitude, would be good. However, significant earthquakes are far fewer in number and are shown separately in the app. It should also be noted that Leaflet slowed down when a lot of layers were used for each point to support individual point updates. It eases up when the map is simply cleared and the updates are added.
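The app itself does this work in JavaScript with Leaflet. The update logic, and the idea of sizing markers by magnitude, can still be sketched in Python against the structure of a GeoJSON feed. The function names and the radius formula are hypothetical; the field names used (`id`, `properties.title`, `properties.mag`, `geometry.coordinates`) are those of the USGS GeoJSON summary feeds.

```python
def index_by_id(feed):
    """Map each feature id to the fields needed to draw its marker."""
    return {
        feature["id"]: {
            "title": feature["properties"]["title"],
            "mag": feature["properties"]["mag"],
            # GeoJSON order is [longitude, latitude, depth].
            "coordinates": feature["geometry"]["coordinates"],
        }
        for feature in feed["features"]
    }


def diff_feeds(shown, latest):
    """Return the ids to add to the map and the ids to remove from it."""
    to_add = sorted(set(latest) - set(shown))
    to_remove = sorted(set(shown) - set(latest))
    return to_add, to_remove


def marker_radius(mag, base=4.0, scale=3.0):
    """Hypothetical scaling of circle marker radius (pixels) by magnitude.

    A missing or negative magnitude falls back to the base radius.
    """
    return base + scale * max(mag or 0.0, 0.0)
```

Diffing by id corresponds to the per-point layer group approach, where only changed points are touched. Clearing the map and redrawing everything corresponds to ignoring the diff and drawing all of `latest`.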


Python 3.5.3
Django 1.10
Bootstrap 3.3.7
Leaflet 1.2.0
Javascript, JQuery