Sunday 12 July 2015

Text search 2: Rabin-Karp rolling hash in Python

Searching for sub-strings is a functionality provided by programming language libraries. There are multiple methods to implement this. One algorithm we looked at previously is the "Looking Glass" algorithm which performs some preliminary operations and creates a structure that helps in the search speed. Rabin-Karp on the contrary uses hashing for pattern matching. The interesting part is in how the hash is continuously used to find a match. 

For a given string "the", a hash can be calculated once. Say it is H1. We need to find where this string occurs in a larger string say "Three things cannot be hidden for long: the sun, the moon and the truth". The length of the string to search is 3 here and we know the hash of this search string as H1. We just need to go through the larger string and find where all this hash matches. So we need to hash 3 characters of the larger string dropping a character and taking one character at a time. The hashes hash("Thr"), hash("hre"), hash("ree") etc all the way to the end hash("uth") are compared to H1. This is better shown below

At each shift right we need to compare H1 (calculated once only) to a new hash. Hashes that move through the larger string or the rolling hash are calculated by applying Horner's rule for polynomials. To modify hash by adding a letter we use <new hash = old hash * alphabet_size + letter>. To modify hash by dropping a letter we use <new hash = old hash - letter * power(alphabet_size, length_of_hashed_string - 1) >. This is better explained using numbers as follows.

1) This algorithm is useful when we want to match multiple patterns in the same go. 

2) However this algorithm does not jump ahead intelligently on a mis-match. i.e we could move to the next "t" and continue the search. Looking Glass method does the skipping on a mismatch. So a combination of these to approaches is great. 

3) Also, hashing leads to collisions. So at each match we have to still compare the search string letters individually to make sure it is not a hash-collision. A hash match becomes an indication of a possible positive match.

A python implementation of the algorithm is shown below.

4) Runtime for this algorithm is O(length of search string + length of larger string). Runtime with cProfile for python is shown below for main string with 1024 length and 2 different search string lengths. See that the change in the length of the search string has no effect. The algorithm still goes through the main string one character at a time. The key to the runtime of this algorithm is the hash function.

No comments: