Friday 20 May 2011

Designing a spell check or corrector. Round 1

Problem: How to implement a simple dictionary. Given a list of words 60000 words in my case. The tough part is not with checking if a word is in the dictionary or not. It is the correction part that is challenging. 

I was aware of how McIlroy used a hash to convert strings and subsequently used the hash as an index into a table. The space compression using bits is also there. So, how to go about the correcting a word. For example, type in the words 'intelligence', 'indelligence', 'itelligance', 'indelligance' into Google search. The search engine asks if you meant the word intelligence. The similarities with all these is how they sound. Even 'entelligant' will get you intelligent. Some reading gave two methods to approach this problem one was to play with the word and making slight changes to see if the new word is in the dictionary. The second is known as the Soundex method that, identifies words that sound alike. The concept of a signature seems to be crucial in this business but the better the signature is in representing all the letter, their sequence in the final signature the more successful the checker will be. 

Could the signature itself be used to find the locality of the word if not the word itself?

A simple not so effective solution: I used the most common signature for strings. Given a word 'cabbage' the signature is a2b2c1e1g1. Kind of beats the purpose to store a signature larger than the word. So another common knowledge to use a2b2ceg. This could tell me some relations between the existing word in the dictionary only. But, when it comes to a wrongly spelt word for example 'cabage' the signature becomes 'a2bceg'. Binary search through the signature list will not help because the signature can lead you astray. A modification with a look ahead logic to see if the direction of binary search will yeild a result or close, was also not up to the mark. The real question is whether I need to store the correct words' list. Data structures like tries, suffix arrays all come to mind. But, search engine uses tries to go to the list of pages with that word. Here I just want to correct the word.  My instinct says no if I was to go with the first method mentioned above.  

I am curious to look at Knuth's description of the soundex algorithm and went through some documents of english words. But, for the time being I feel like abandoning both, although not completely, in favour of a simple approach to do a simple spell correction in a given list of words. More on this as I near a good to disclose method.

No comments: