Wednesday, 7 December 2011

A take on Huffman coding algorithm. How to squeeze data. 133 bytes to 69 bytes.

Problem: Compress data in a text file using huffman encoding algorithm.
Basic approach:  
1 Read the file and build a Huffman encoding tree.
2 Get encoded bit string.
3 Decode and get the original data by walking the Huffman encoding tree.

This method of compression is based on an inefficiency in normal representation of data strings. The inefficiency is that, we use 8 bits to encode a character that appears 1000 times and 8 bits again for a character that appears only once in the input data. So, if we use less number of bits to represent a high frequency character and comparatively more for low frequency character, then we can compress the data. This gives rise to two new issues.

a) The issue then becomes one of knowing the character set or the alphabet of your data in advance inorder to go ahead with this type of encoding. 

b) Encoding the characters as bit sequences can lead to one character being the prefix of another. This can lead decoding program astray. i.e if m is 01101 and e is 011, then we have problem decoding 01101 in out stream since the m might be decoded as an e.

For the time being lets take the ascii character set to address issue a. To deal with issue b we can use a binary tree to represent the encoding. To make sure that, no bit sequnce is  a prefix of another character we use the huffmann algorithm to build the tree. This tree is the huff encoding tree. For example an encoding tree is shown here. 

Input: is a file with the content  
"this is a test file input to huffman encoding algorithm. This will be compressed to a huffmann code. Huffmann encoding tree is used."

The frequency of the characters is calculated as shown here.

Notice that, e occurs 11 times and is encoded with only 3 bits where as b occurs only once and is encoded by 7 bits!. The huffman encodings for the characters are shown here. 

Result: The huffman encoding tree has 45 nodes and the encoded bit stream of the input above is 551 bits ie 68.8 bytes as opposed to the 133 bytes in the input. Bit vector and heap help the encoding tree generation algorithm. But, the tree itself seems to be better of as a pointer based tree as opposed to a array based tree (may not be a balanced tree). This can be pretty handy when sending considerable amounts of data over the network in a software environment.


Tree building algorithm:

1 Read the data from the file.
2 Build a table with the count for each character in your alphabet.
3 Build a huffmann encoding tree based on the frequency of your alphabet.
  a) create leaf nodes with all the characters occuring in the file along with their     
      frequencies.
  b) take two nodes which have the smallest frequencies.
  c) combine these two nodes as the left and right children of a new node.
  d) push the new node into the list with frequency set to the sum of the two child nodes.
4 repeat step three until you end up with one node in the list i.e your root node for the encoding tree.

To encode:
1. Replace each character with the bits encountered while getting to the character form the root of the tree.

To decode:
1 input your stream of encoded bits
2 Until your input is not over
3 for each of the bit encountered 
  a move to the left of the tree if it is a bit 0
  b move to the right of the tree if 1 bit
4 if you have reached the leaf node of the encoding tree, go to the root.

No comments: