Tuesday, 13 December 2011

Classic Longest common subsequence via Dynamic programming

Run time Profiling results of classic longest common subsequence algorithm. 

An example application of this algorithm is finding the difference between two text files. We need to find the longest common subsequence of two sequences, e.g. for "bdca" and "bcbda" the longest common subsequence has length 3 and is "bda".
The core dynamic programming recurrence is as follows:

Lcs(i, j) = { Lcs(i-1, j-1) + 1                 ; if match at i & j
            { Max( Lcs(i-1, j), Lcs(i, j-1) )   ; if mismatch at i & j
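
A minimal Python sketch of this recurrence, written as plain recursion. The base case (return 0 once either index runs off the front of its sequence) is an assumption, since the recurrence above does not state one, and the name lcs_length is only illustrative.

```python
def lcs_length(a, b, i=None, j=None):
    """Length of the longest common subsequence of a[:i+1] and b[:j+1]."""
    if i is None:
        # Start from the last character of each sequence.
        i, j = len(a) - 1, len(b) - 1
    # Assumed base case: an empty prefix shares nothing with anything.
    if i < 0 or j < 0:
        return 0
    if a[i] == b[j]:
        # Match at i & j.
        return lcs_length(a, b, i - 1, j - 1) + 1
    # Mismatch at i & j.
    return max(lcs_length(a, b, i - 1, j), lcs_length(a, b, i, j - 1))


print(lcs_length("bdca", "bcbda"))  # 3
```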

This can be implemented recursively using a two-dimensional m x n array, where m and n are the sequence lengths. For example, for the two sequences "bdca" and "bcbda" the algorithm above produces the following matrix (-1 marks a cell that was never calculated):

          0    1    2    3    4
          b    c    b    d    a
0 b  |    1    1    1   -1   -1
1 d  |    1    1    1    2   -1
2 c  |   -1    2    2    2   -1
3 a  |   -1   -1   -1   -1    3
So here 3 is the length of the longest common subsequence, based on the core algorithm above.
Runtime complexity is about O(mn) when each cell of the matrix is calculated only once. This particular basic program yields the time shown here.


By analysing the recursion tree of this particular problem, we can see an improvement that can be made to the runtime. The addition to the core algorithm is as follows:

If Lcs(i, j) has not been calculated so far:

    Lcs(i, j) = { Lcs(i-1, j-1) + 1                 ; if match at i & j
                { Max( Lcs(i-1, j), Lcs(i, j-1) )   ; if mismatch at i & j

else return the calculated Lcs(i, j)
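
The check can be sketched in Python as below. This is an illustration under assumptions, not the profiled program: -1 is used as the "not calculated" marker, the base case of 0 for an empty prefix is assumed, and the names lcs_memo and table are hypothetical.

```python
def lcs_memo(a, b):
    """Return (length, table); cells still at -1 were never needed."""
    table = [[-1] * len(b) for _ in range(len(a))]

    def lcs(i, j):
        if i < 0 or j < 0:            # assumed base case: empty prefix
            return 0
        if table[i][j] == -1:         # not calculated so far
            if a[i] == b[j]:          # match at i & j
                table[i][j] = lcs(i - 1, j - 1) + 1
            else:                     # mismatch at i & j
                table[i][j] = max(lcs(i - 1, j), lcs(i, j - 1))
        return table[i][j]            # reuse the stored value

    return lcs(len(a) - 1, len(b) - 1), table


length, table = lcs_memo("bdca", "bcbda")
print(length)  # 3
for row in table:
    print(row)
```

For "bdca" and "bcbda" the printed rows agree with the matrix shown earlier, including the -1 cells that the recursion never visits.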

This small addition saves the code from recalculating duplicate recursion subtrees. The time saved (~ 30 ms) is evident here.

The sample recursion tree is shown below. There are a lot of repeated tasks.

                     4,4

            3,4               4,3

        2,4     3,3       3,3     4,2

    1,4  2,3 .... 2,3  3,2 .... 3,2  4,1

Reconstruction is again based on the algorithm. We backtrack through its two cases, i.e. whether a match occurred or the Max of the sub-problems was taken.

1. We start at the bottom-right cell of the matrix, i.e. the root of the recursion tree.
2. If there was a match at i & j, we take sequence[i] (= sequence[j], as in step 1 of the algorithm) and go to (i-1, j-1).
3. Otherwise we can go either to (i-1, j) or to (i, j-1) in the matrix. The logical thing to do is to go where the value is larger.
4. If the matrix values at (i-1, j) and (i, j-1) are the same, we may have to investigate both, since each can lead to a different longest common subsequence.
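
The backtracking steps above can be sketched as follows. To keep the example self-contained it fills the whole matrix bottom-up first (with an extra row and column of zeros for the empty prefix), which is an assumption rather than the approach profiled in this post. On a tie it simply moves to (i-1, j) instead of investigating both directions, so it returns one longest common subsequence, not all of them. The name lcs_string is illustrative.

```python
def lcs_string(a, b):
    """Return one longest common subsequence of a and b."""
    m, n = len(a), len(b)
    # t[i][j] = LCS length of a[:i] and b[:j]; row 0 / column 0 = empty prefix.
    t = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                t[i][j] = t[i - 1][j - 1] + 1
            else:
                t[i][j] = max(t[i - 1][j], t[i][j - 1])

    # Backtrack from the bottom-right cell.
    out = []
    i, j = m, n
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])      # match: take the character
            i, j = i - 1, j - 1
        elif t[i - 1][j] >= t[i][j - 1]:
            i -= 1                    # larger value (or tie): go to (i-1, j)
        else:
            j -= 1                    # larger value is at (i, j-1)
    return "".join(reversed(out))


print(lcs_string("bdca", "bcbda"))  # bda
```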

The disadvantage of this approach is that as the lengths of the sequences go up, the memory for the matrix grows as m x n. One way to reduce memory is to keep only two rows, since each step looks only at the current and the previous row. However, we may then need to keep track of which way we stepped in order to reconstruct the LCS.
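
A minimal sketch of the two-row idea, assuming a bottom-up fill rather than recursion. It returns only the length; as noted above, reconstructing the sequence itself would need extra bookkeeping that this sketch does not attempt. The function name is hypothetical.

```python
def lcs_length_two_rows(a, b):
    """LCS length using O(len(b)) memory instead of a full m x n matrix."""
    prev = [0] * (len(b) + 1)         # row for the previous character of a
    for ch in a:
        curr = [0]                    # column 0: empty prefix of b
        for j, other in enumerate(b, start=1):
            if ch == other:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr                   # the current row becomes the previous one
    return prev[-1]


print(lcs_length_two_rows("bdca", "bcbda"))  # 3
```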

Wednesday, 7 December 2011

A take on Huffman coding algorithm. How to squeeze data. 133 bytes to 69 bytes.

Problem: Compress data in a text file using the Huffman encoding algorithm.
Basic approach:  
1 Read the file and build a Huffman encoding tree.
2 Get encoded bit string.
3 Decode and get the original data by walking the Huffman encoding tree.

This method of compression is based on an inefficiency in the normal representation of data strings. The inefficiency is that we use 8 bits to encode a character that appears 1000 times and 8 bits again for a character that appears only once in the input data. So, if we use fewer bits to represent a high-frequency character and comparatively more for a low-frequency character, we can compress the data. This gives rise to two new issues.

a) We need to know the character set, or alphabet, of the data in advance in order to go ahead with this type of encoding.

b) Encoding the characters as bit sequences of different lengths can lead to one character's code being a prefix of another's. This can lead the decoding program astray, e.g. if m is 01101 and e is 011, then we have a problem decoding 01101 in our stream, since the start of the m might be decoded as an e.
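
Issue b can be checked mechanically. The sketch below is illustrative only: is_prefix_free is a hypothetical helper, and the second code table is made up for contrast.

```python
def is_prefix_free(codes):
    """True if no code word is a prefix of another code word."""
    words = sorted(codes.values())
    # After sorting, a word that prefixes any other word is immediately
    # followed by a word it prefixes, so adjacent pairs are enough.
    return not any(nxt.startswith(cur) for cur, nxt in zip(words, words[1:]))


print(is_prefix_free({"m": "01101", "e": "011"}))          # False
print(is_prefix_free({"m": "00", "e": "01", "t": "1"}))    # True
```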

For the time being, let's take the ASCII character set to address issue a. To deal with issue b, we can use a binary tree to represent the encoding, with characters stored only at the leaves, so that no character's bit sequence is a prefix of another's. We use the Huffman algorithm to build this tree, which is the Huffman encoding tree. An example encoding tree is shown here.

Input: is a file with the content  
"this is a test file input to huffman encoding algorithm. This will be compressed to a huffmann code. Huffmann encoding tree is used."

The frequency of the characters is calculated as shown here.
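
A minimal sketch of how such a count could be computed with Python's collections.Counter. It assumes the text is held in a string exactly as quoted above, without a trailing newline; the variable names are illustrative.

```python
from collections import Counter

text = ("this is a test file input to huffman encoding algorithm. "
        "This will be compressed to a huffmann code. "
        "Huffmann encoding tree is used.")

freq = Counter(text)          # maps each character to its number of occurrences
print(freq["e"])              # 11
print(freq["b"])              # 1
```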

Notice that e occurs 11 times and is encoded with only 3 bits, whereas b occurs only once and is encoded with 7 bits! The Huffman encodings for the characters are shown here.

Result: The Huffman encoding tree has 45 nodes, and the encoded bit stream of the input above is 551 bits, i.e. about 68.9 bytes, as opposed to the 133 bytes in the input. A bit vector and a heap help the encoding tree generation algorithm. The tree itself, however, seems to be better off as a pointer-based tree than as an array-based tree, since it may not be balanced. This kind of compression can be pretty handy when sending considerable amounts of data over the network.

Tree building algorithm:

1 Read the data from the file.
2 Build a table with the count for each character in your alphabet.
3 Build a Huffman encoding tree based on the frequencies of your alphabet.
  a) create leaf nodes for all the characters occurring in the file along with their frequencies, and put them in a list.
  b) take the two nodes which have the smallest frequencies out of the list.
  c) combine these two nodes as the left and right children of a new node.
  d) push the new node into the list with its frequency set to the sum of the two child nodes.
4 Repeat steps b to d of step three until you end up with one node in the list, i.e. your root node for the encoding tree.
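
The steps above can be sketched with Python's heapq module standing in for the list of nodes. This is an illustration under assumptions, not the program that produced the results in this post: a leaf is represented by its character, an internal node by a (left, right) tuple, the input is assumed to be non-empty, and ties between equal frequencies are broken by insertion order. The names build_huffman_tree and count_nodes are hypothetical.

```python
import heapq
from collections import Counter
from itertools import count


def build_huffman_tree(text):
    """Return the root: a leaf is a character, an internal node is (left, right)."""
    tiebreak = count()  # keeps heap entries comparable when frequencies are equal
    # Steps 2 and 3a: count the characters and create one leaf per character.
    heap = [(freq, next(tiebreak), ch) for ch, freq in Counter(text).items()]
    heapq.heapify(heap)
    # Step 4: repeat until a single node, the root, is left.
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)     # step 3b: two smallest frequencies
        f2, _, right = heapq.heappop(heap)
        # Steps 3c and 3d: combine them and push the new node back.
        heapq.heappush(heap, (f1 + f2, next(tiebreak), (left, right)))
    return heap[0][2]


def count_nodes(node):
    """Total number of nodes (leaves plus internal nodes) in the tree."""
    if isinstance(node, tuple):
        return 1 + count_nodes(node[0]) + count_nodes(node[1])
    return 1


tree = build_huffman_tree("aaaabbc")
print(tree)               # (('c', 'b'), 'a')
print(count_nodes(tree))  # 5
```

A tree built this way with k distinct characters always has 2k - 1 nodes, because every combine step removes two nodes from the list and adds one.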

To encode:
1. Replace each character with the bits encountered on the path from the root of the tree to that character's leaf (0 for a left branch, 1 for a right branch).
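
A minimal sketch of the encoding step. It assumes the same convention as the decoding steps (0 for left, 1 for right) and a tree made of nested (left, right) tuples with characters at the leaves; the small tree and the names code_table and encode are hypothetical, chosen only for illustration.

```python
def code_table(node, prefix="", table=None):
    """Map each leaf character to the bit string of its root-to-leaf path."""
    if table is None:
        table = {}
    if isinstance(node, tuple):               # internal node: (left, right)
        code_table(node[0], prefix + "0", table)
        code_table(node[1], prefix + "1", table)
    else:                                     # leaf: a single character
        table[node] = prefix or "0"           # a lone-leaf tree still needs one bit
    return table


def encode(text, tree):
    """Replace each character with its code and join the bits into one string."""
    table = code_table(tree)
    return "".join(table[ch] for ch in text)


tree = (("c", "b"), "a")                      # hypothetical hand-built tree
print(code_table(tree))                       # {'c': '00', 'b': '01', 'a': '1'}
print(encode("abca", tree))                   # 101001
```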

To decode:
1 Input your stream of encoded bits and start at the root of the tree.
2 Repeat until the input is exhausted:
3 For each bit encountered:
  a move to the left child if the bit is 0
  b move to the right child if the bit is 1
4 If you have reached a leaf node of the encoding tree, output its character and go back to the root.
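
The decoding walk can be sketched as below. It assumes the bits arrive as a string of "0" and "1" characters and that the tree is made of nested (left, right) tuples with at least one internal node; the small tree and the name decode are hypothetical.

```python
def decode(bits, tree):
    """Walk the tree bit by bit, emitting a character at every leaf."""
    out = []
    node = tree
    for bit in bits:
        # 0 moves to the left child, 1 moves to the right child.
        node = node[0] if bit == "0" else node[1]
        if not isinstance(node, tuple):   # reached a leaf
            out.append(node)
            node = tree                   # go back to the root
    return "".join(out)


tree = (("c", "b"), "a")                  # hypothetical hand-built tree
print(decode("101001", tree))             # abca
```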