Sunday, 12 July 2015

Text search 2: Rabin-Karp rolling hash in Python

Searching for sub-strings is a functionality provided by programming language libraries. There are multiple methods to implement this. One algorithm we looked at previously is the "Looking Glass" algorithm which performs some preliminary operations and creates a structure that helps in the search speed. Rabin-Karp on the contrary uses hashing for pattern matching. The interesting part is in how the hash is continuously used to find a match. 

For a given string "the", a hash can be calculated once. Say it is H1. We need to find where this string occurs in a larger string say "Three things cannot be hidden for long: the sun, the moon and the truth". The length of the string to search is 3 here and we know the hash of this search string as H1. We just need to go through the larger string and find where all this hash matches. So we need to hash 3 characters of the larger string dropping a character and taking one character at a time. The hashes hash("Thr"), hash("hre"), hash("ree") etc all the way to the end hash("uth") are compared to H1. This is better shown below

At each shift right we need to compare H1 (calculated once only) to a new hash. Hashes that move through the larger string or the rolling hash are calculated by applying Horner's rule for polynomials. To modify hash by adding a letter we use <new hash = old hash * alphabet_size + letter>. To modify hash by dropping a letter we use <new hash = old hash - letter * power(alphabet_size, length_of_hashed_string - 1) >. This is better explained using numbers as follows.

1) This algorithm is useful when we want to match multiple patterns in the same go. 

2) However this algorithm does not jump ahead intelligently on a mis-match. i.e we could move to the next "t" and continue the search. Looking Glass method does the skipping on a mismatch. So a combination of these to approaches is great. 

3) Also, hashing leads to collisions. So at each match we have to still compare the search string letters individually to make sure it is not a hash-collision. A hash match becomes an indication of a possible positive match.

A python implementation of the algorithm is shown below.

4) Runtime for this algorithm is O(length of search string + length of larger string). Runtime with cProfile for python is shown below for main string with 1024 length and 2 different search string lengths. See that the change in the length of the search string has no effect. The algorithm still goes through the main string one character at a time. The key to the runtime of this algorithm is the hash function.

Wednesday, 22 April 2015

Understand It Yourself Kit: *args and *kwargs for Python

This is a post aimed at Python programmers. Also, this will be useful for engineers who work in other languages like Java and are finding their way in Python.

Programmers are used to argument lists. For example, Java programmers are used to writing the main entry point of their program with an argument list. This looks like

Although other languages can have variable list of arguments in the form of maps and lists in all methods, this does not appear all over the languages for no particular reason. However, this is mainstream in Python code. So we will be seeing a lot of *args, and **kwargs as following in method definitions.

Programmers who come from other languages where this is not mainstream try to understand this from the caller environment whereas it is better the other way around. This kind of understanding also helps in designing flexible Python code. What that means is

If a method has formal arguments, *args and **kwargs, it is saying that, whatever you pass in addition to the formal arguments, will go into *args and any name value arguments will go into **kwargs. So you can call it with quite a lot of arguments in a number of ways. Close to anything goes.

For example the aFunction can be called as follows

But there are some rules to watch out for. For that here is a Understand It Yourself Kit, which you run and just read the output/code and figure it out yourself. A few base code for this was taken from 'Python Epiphanies from Stuart Williams'. This also has a lot of additions which bring up interesting parts of the rules.

Download and run the code from here.

How it is useful: Within the code, notice the call and how args and kwargs get assigned from the console. The extra arguments get assigned to args and kwargs.

Now have a look at the next call and its output. aFunction was called with name-value pairs packed into a dict (comes as kwargs). This got unpacked to its formal arguments.

Thats where this *args and **kwargs are useful. The additional parameters that are passed to aFunction even if they are not useful/make sense to aFunction, will get assigned properly to another function further up the call stack. aFunction just has to pass *args and **kwargs and if it makes sense to the next method, it will.

This is also useful when writing classes in Python. A child class constructor just passes the extra arguments to the parent's constructor call. An example is shown below

Thursday, 16 October 2014

The Server side: webapp performance architecture with Django, load balancing and caching

Web apps need to be quick by reducing latency while serving client requests. In all applications persistent data is on a database on disk (some use solid state too) and the web app on server accesses it. While a few concurrent connections will be easily handled by issuing database requests, when it comes to thousands or millions of requests this approach will fail fast. Scaling up by utilizing powerful hardware is a solution which can buy time until the next growth level of client requests. Even on a scaled up hardware there is room to squeeze more requests through if data caching is part of the design. Accessing data from disk / another server on the network is slower than accessing it from the memory. If the round trip can be avoided this enables fewer comparatively low cost hardware to easily handle considerable loads without increasing latency. A profiling of response times is on the next post. Once server side things like what is the app's actions, when it acts, how fast & efficiently are thought out, the data model can be designed for maximum throughput too.

Here we look at horizontally deploying multiple application servers and a single back end database which handles these app servers' data requests. These app servers are load balanced in round robin method for requests from the external world. Each request to a webapp is given to a different app server. Sticky sessions can also be enabled but that won't be round robin'ed. The objective of round robin is to decrease latency caused by the wait if it was only one server. However, in this scenario the database handles all the app servers. If not addressed this becomes a contention easily thus taking everything to square one. On closer analysis of most web apps/sites/services it would be possible to avoid unnecessary database calls with caching. (The next level to improve performance would be database sharding) What this helps is saving the cost of a network/disk call. Imagine having a good pizza ready on the kitchen table rather than you having call Dominos' or Pizza hut's home delivery. In the case you had to call home delivery, ask for pizzas and keep some ready in your refrigerator just in case someone suddenly gets the urge thus saving you and them the wait next time.  Pizzas become stale, so do data. There has to be precise cache invalidation when the database record is updated. The point is that you might have to hit the database but it is not needed every time a request is made. 

OS: Opensuse 13.1 or any Linux
Web app: developed on Django framework.
Webserver: Apache2 with mod_wsgi to host python webapp
Caching: Memcached
Database: Postgresql
Load balancing: nginx
# application servers: 3
# loadbalancing node: 1

Django code with sample view that uses the cache:

Wireshark trace showing the round robin requests and responses

Memcached on the three server with hits count thus avoiding the database:

Monday, 25 August 2014

Celatum Pro 3.0 with secure chat feature : "Every message matters"

Celatum Pro version 3.0 has been published. This version of the Android app has a new feature for secure chat. It allows you to secure your instant messages with encryption. Every message is encrypted with a different key.

Availability: Check out Google Play and Amazon app stores for downloads in select countries.
Languages Supported: German, Russian, English, Spanish, French and Portuguese.

Think your instant messages are secure? Think again. On all existing messaging systems the channel is secured with SSL and the like. This does not mean your data is secure. Once out of the channel at the server side, the data itself is available in plain sight at the service provider's server.

Celatum Pro offers encryption of the data itself so that, when the message is not viewable for the messaging service provider. This feature is an extension of the existing encryption feature of Celatum Pro which allowed secure text attachments.

Checkout the features on Play and Amazon stores for the app.

Wednesday, 25 June 2014

Web page scraper design using blueprints

This post is available as a presentation here

Services and apps that supply information from multiple sites generate the data by scraping information from web pages. For example, an application that provides third party retail product information gets the data from the retailers web site. This is done by crawler programs which visit the webpage and scraper programs which extract the required information from the page. Scraping is done by taking advantage of how a site designs or structures its web pages. Sites structure webpages using html and look & feel via Cascaded Style Sheets. Every site has a particular structure or template used by the web developers. For example, the news website CNN has a structure for presenting various information. Their html page structure is different from BBC's page. Retail websites' pages are no different. In order to get to the required information, scraper programs can extract the html or CSS elements of interest one-by-one until it gets to the data. For example, the following snap shows the product name in an element within an html webpage. 

This works until the websites change the structure of the pages altogether. So scrapers need to be designed to quickly adapt to changes in structure of the target pages. Pages populated in a lazy manner with Java Script pose another challenge. But, this latter challenge can be addressed by using tools like Selenium. A scraper must simply adapt to the page changes i.e once the elements containing the target data have been identified, the scraper should go after those elements.

Scrapers can be designed to take metadata of information that needs to be extracted. We can call this the blueprint based on which the scraper will work on a particular page set. The meta data can contain the list of elements to be scraped for a piece of data i.e data piece and list of elements. The blue print for the whole information to be obtained from a page can include a list of data pieces and their corresponding elements. Another program can generate this blue print and once tested can be given to the scraper. A scraper library in Java which uses this blueprint approach is show below. This also includes a small API to generate the blue print and also to use the scraper. 

The core idea in this scraper design is devolution of responsibilities. The scraper/library used is a program that take a web page and a blueprint. It scrapes the page based on the blueprint and returns the information. It should not have to or be made to think of the type or nature of the object/information scraped. i.e it should scrape an abstract object. It should be the job of the program requesting the information to tightly set an identity to the scraped information. 

For example, when a retail web page needs to be scraped, the scraper need not know that it is an object, say a Java object of type say, RetailItem. It only needs to know that, this scraped object has an attribute "name" and also an attribute "price". Plus their values i.e the actual name and price (HydroFlask and $50). Nothing more. A scraper like this builds key value pairs to represent the object scraped. In programming this is an object as a collection of attributes in a map.

Not only that, the blue print itself can be a set of key value pairs where the keys are attribute names and values are the list of elements to be pulled.

A blueprint screen with key "brand" and one html/css element to be pulled is shown below.

Selectors as shown in the image specify the elements to be followed or pulled to get to the information. In this example, a div element with id product-tabs needs to be pulled from the page. 

Now that, the scraper just takes a blue print and generates an abstract object as keys and values, all that remains is sharing the blueprint. This can be done using JSON format. GSON library can be used to handle the format and conversions to and from binary. 

The object that is scraped is a simple java object which holds a map as shown below.

The blueprint is also similar in structure as a key value pair holder.

 The API altogether looks like this in final

The blueprint can be built one selector at a time using the API as follows.

Finally a sample blueprint itself in JSON is as follows

Friday, 23 May 2014

Accomodate data via interning - Java

Software receives data from multiple sources and holds necessary portions of it in memory. Data may be read from disks and networks. More data can be fit into a program's memory if, only one copy of relevant data is stored. For example, lets say a software keeps track of vehicles and related information like number plates. Vehicles' number plates are registered in a state. The software does not need to have multiple copies of a Registered State object for each vehicle registered in the same state. If there are a 100000 vehicles registered in Colorado, then only one instance of the Registered State of Colorado is needed. If the Object's state does not change for the execution time, this saves space. This approach called Flyweight Pattern or interning is used to hold 'immutables' in a program. 

The following screen shows the profiling of the example described above. A collection of 100000 vehicles is built and all of them are set to Registered State (Colorado). On the profiler output we see that, once the data loading is finished there is only one Live Registered State representing Colorado. 

Profile Output:

Code for intern store
Here we use a synchronized map to hold weak references to an instance. Each class is again mapped to these 'maps of instances'. For the purpose of storing the objects in synchronized maps, we override the hashcode method for the data object. For this the program we use HashCodeBuilder from Apache libraries (commons-lang-2.4.jar).

Source for the sample is here

This is the case with Python too. An example on the Python shell to count the active reference to a string object and object ids is shown below.

1) ACM Webinar on Taking the Big out of Big Data By Kate Matsurdia
2) Apache Commons library

Tuesday, 20 May 2014

Java Non-blocking IO - Part 1

Non blocking Input Output package aims to increase performance and improve responses. In regular blocking IO you read from a socket's stream and block on the read() until you get data. In NIO, you register to be notified of events and respond only when that event occurs. Programs can avoid blocking and take action on being notified. NIO uses channels and 'a channel represents a connection to stuff like sockets, files etc and can perform multiple io operations like read, write'. Instead of an inputstream and outputstream we have a single channel to both. A profiling of NIO servers here shows that a design with NIO may not always be better than a normal multi-threaded blocking IO design for applications. Using NIO on a client like mobile devices may not (or may) be useful but for a server it may be different. Registering for events is done via 'Selectors'. Multiplexing channels is made possible via Selectors. A Selectable (multiplexable) Channel's registration with a Selector for events is represented by Keys in the Selector for different operations. When a bunch of operations on the SelectableChannel is available, the selector gives keys representing those operations.

To write a simple NIO server which accepts connections
1) open a ServerSocketChannel & set is as non-blocking
2) get the underlying socket for the channel just created & bind it to an address.
3) open a  selector
4) register the ServerSocketChannel with the Selector for a OP_ACCEPT operation. 

To handle different events
1) check the selectors for the number of ready operations (keys set)
2) if it is -1 continue to wait for an event.
3) Once there are keys to go with, retrieve the keySet which holds the keys whose channels are ready.
4) Loop through each key and handle the corresponding IO.
4a) Filter operations types by comparing the keys' ready operation code.
5) mark key as handled. (remove it)

To Write and Read from Channel:
Channels perform two-way communication and handle blocks of Data. These blocks of data are read 'into' and written 'out from' Buffers. Buffers have internal byte arrays to hold the data and can be retrieved.
Buffers have a current position from where data will be next read or written into. Flipping a buffer sets the limit as the current position and returns the position to the beginning. So if we want to use the data so far read into a buffer b and send the same data out using the same buffer, we need to flip it. This code just sends the data back.

The full code for the project is here. The client app in the project simply keeps track of time taken to get a response from the server.