Thursday, 2 June 2011

Spell checker design using scatter hash algorithm

This is the design of a spell checker that, I finally coded up. The principle is same as that of McIlroy but, the hash and the bit vector distribution is based on trial and error. I used a list of words that, I stumbled upon in the web. It is by no means optimized for spell corrections. It has 58112 words which I assume as almighty. I take each word and subsequently hash it 32 times to get 32 values which are mapped into a bit vector. The bit vector from the previous post was reused. So, if a word not in the dictionary is entered and checked, atleast one of the hashes would fail. If it failed, the word is not in the dictionary. Happy. Since this uses scattered hash/sort of :), an invalid word can match the hashes accidentally. The fun part was, how this is affected by the hash table size and the hash function. You can literally see the error increase/decrease as you alter these. 

The input: a list of 58112 word.
The output behaviour:

                Bit vector size     invalid word accepted?
--------------------------------------------------------------
case a                500KB                No.
case b                375KB               No.
case c                250KB                No.
case d                125KB                Yes.
Final case:      93.75 KB                Yes.

The first numbers on the bit vector memory feel not so good for a 58K word list. The mathematics that, describes the relation between the hash bit length (32 here), word list length and the probablity of an invalid word hashing correctly is described by McIlroy. This comes into play in the last two cases above. I did not prove it but, runs of the program point to that clearly. It can still be downsized but, the point was to try the scatter technique for the spell checker.

Observation:  I did change the 32 values used in the hash functions which gave me a good reduction in bit vector size and lower error rate. The
final run used 32 consecutive values from the farther end of the proposed hash table size. Previously, I was using differences of 100 ie N - 100, N - 200, N -300, N -400 ...... N - (32 * 100). All that left me was with even values in the hash functions. This affected the accuracy of the hash I can say from the run of the program (i.e invalid words matched seeminlgy easy). But, when I used the consecutive values i.e N - 1, N -2....N - 32 I have introduced primes not purposefully. But, the urge to change the hash values was there. McIlroy says that, the use of n primes for the n hash functions is very good and he used it. A lot of reading on the web claimed that, the use of primes in hash function and their advantage was not that great. But, this small program with only a couple of primes seems to be faring better than with out them. Why 32? McIlroy used 27 I picked 32!. :)). This choice is to be based on the mathematical relation I pointed to above (refer McIlRoy paper).

The program when profiled in Netbeans takes 0.169ms to find if a word in the dictionary. A run of the program on this 573 word post gave optimized, Netbeans, McIlroy, atleast, probablity, seeminlgy as invalid. True. None of these were in the word list. "seeminlgy" is misspelt in this post! The checking code took 20.4 ms.



No comments: