Tuesday, 25 October 2011

Spell Suggestion Like Google / Microsoft Word

Problem: 

The goal is to implement a spell suggestion program similar to the ones in the Google search box, Microsoft Word and the like. An English-language dictionary is used.

Solution / Approach

The algorithm uses a trie to store the keys. It walks through the trie along with the user's input to find the suggestions. An alternative approach is to generate word edits and use the spell checker from a previous post.

Input: An English dictionary of 59000 words.

Advantages: Searching for a key is very fast. Suggestions can be found in good time too.

Disadvantages

The size of the trie can be a limiting factor. However, more than one trie compression technique is available, especially for English. For this experiment I focused only on the run time of the first implementation.

Theory

A trie is an M-ary tree. An indexing operation is done at each node of the tree, and the indices are the characters of the search key. Suppose there are k possible character values for a key; in English, a-z gives k = 26. Each node of the trie then has k + 1 pointers: one for each character possible at that node, plus one for a unique end-of-key marker, which gives 27. This is similar to the way we search for a word in a printed dictionary. For example, the word "alpha" is looked up by finding the character "a", then within that finding "l", and so on; you can turn to the pages of all words beginning with "al". In a trie, look up the first character of the key at the root, then follow the pointer in the table corresponding to that character, and repeat with the next character. For this word list the number of nodes in the trie is 143699! The word list is by no means compact or preprocessed.

Searching for an item simply follows the characters of the key through the tree. You have a miss if a pointer is missing along the way, and a hit if the key is exhausted at a node that marks the end of a word. Inserting a key works like searching: you add nodes wherever the search misses.
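The search and insert steps described above can be sketched roughly as follows. This is a minimal illustration and not the program from the experiment: the class name TrieSketch and its methods are made up for this example, keys are assumed to be lowercase a-z only, and a boolean flag stands in for the end-of-key marker pointer.

```java
/** Minimal array-based trie over lowercase a-z. A sketch, not the original program. */
class TrieSketch {
    private static final int ALPHABET = 26;          // assumption: keys use a-z only

    private static final class Node {
        final Node[] next = new Node[ALPHABET];      // one slot per possible letter
        boolean isWord;                              // stands in for the end-of-key marker
    }

    private final Node root = new Node();
    private int nodeCount = 1;                       // the root counts as a node

    /** Walks the key through the trie, adding a node wherever the walk misses. */
    void insert(String key) {
        Node node = root;
        for (int i = 0; i < key.length(); i++) {
            int c = index(key.charAt(i));
            if (node.next[c] == null) {
                node.next[c] = new Node();
                nodeCount++;
            }
            node = node.next[c];
        }
        node.isWord = true;
    }

    /** Hit only if the whole key is consumed and it ends on a stored word. */
    boolean contains(String key) {
        Node node = root;
        for (int i = 0; i < key.length(); i++) {
            char ch = key.charAt(i);
            if (ch < 'a' || ch > 'z') {
                return false;                        // character outside the alphabet
            }
            node = node.next[ch - 'a'];
            if (node == null) {
                return false;                        // missing pointer: a miss
            }
        }
        return node.isWord;
    }

    int nodeCount() {
        return nodeCount;
    }

    private static int index(char ch) {
        if (ch < 'a' || ch > 'z') {
            throw new IllegalArgumentException("only a-z supported: " + ch);
        }
        return ch - 'a';
    }
}
```

Both methods do one array lookup per character of the key, so the work depends on the key length and not on the number of words stored.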

In a trie, a search takes O(length of the key) time. The snapshots speak for themselves on this.


On the other hand, the memory requirement for the basic trie is high. If there are n nodes, then there are n * (k + 1) pointers. Three variations can be used to reduce this:

1) A linked list for the pointers in each node, since most of the table is empty in a trie for English. A trie that uses linked lists is a de la Briandais trie.

2) Then there is the Practical Algorithm To Retrieve Information Coded In Alphanumeric: Patricia tries! Here, the algorithms for building and maintaining the tree change, since the keys have to be stored too.

3) Tries can also be represented as two-dimensional arrays without the key values being stored in the table. This can save quite a bit of space compared to the basic trie.
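As an illustration of the first variation, here is a rough sketch of a de la Briandais style node, where each node keeps only a link to its first child and a link to its next sibling instead of a full table of pointers. The class and field names are invented for this example, and it was not part of the experiment.

```java
/** Linked-list (de la Briandais style) trie. A sketch with hypothetical names. */
class BriandaisTrieSketch {
    private static final class Node {
        final char ch;        // character this node stands for
        Node child;           // first node of the next level
        Node sibling;         // next alternative character at the same level
        boolean isWord;       // marks the end of a stored key

        Node(char ch) {
            this.ch = ch;
        }
    }

    private final Node root = new Node('\0');   // the root carries no character

    void insert(String key) {
        Node node = root;
        for (int i = 0; i < key.length(); i++) {
            node = findOrAddChild(node, key.charAt(i));
        }
        node.isWord = true;
    }

    boolean contains(String key) {
        Node node = root;
        for (int i = 0; i < key.length() && node != null; i++) {
            node = findChild(node, key.charAt(i));
        }
        return node != null && node.isWord;
    }

    /** Scans the sibling list of parent's children for the given character. */
    private static Node findChild(Node parent, char ch) {
        for (Node n = parent.child; n != null; n = n.sibling) {
            if (n.ch == ch) {
                return n;
            }
        }
        return null;
    }

    private static Node findOrAddChild(Node parent, char ch) {
        Node found = findChild(parent, ch);
        if (found != null) {
            return found;
        }
        Node added = new Node(ch);
        added.sibling = parent.child;   // push onto the front of the sibling list
        parent.child = added;
        return added;
    }
}
```

The trade-off is that each step now scans a short list instead of indexing an array, so a lookup can cost up to one comparison per possible character at each level, in exchange for storing only the pointers that are actually used.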

Experiment:

The run of the Java program in the NetBeans profiler is shown here. This is for a short word.


For a really long word the time is a bit more, as shown in the profiler.


Example runs of the suggestions. The algorithm continues from the point where the user strayed. Although this throws up more suggestions, it works.
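One way to read "follows from where the user strayed" is sketched below: walk the trie along the typed word for as long as it matches, stop at the first character that has no pointer, and list the words stored below the last node reached. This is my interpretation and not necessarily the exact program used for the runs. The class name SuggestTrieSketch, the suggest method and its limit parameter are all assumptions made for the example, and keys are assumed to be lowercase a-z.

```java
import java.util.ArrayList;
import java.util.List;

/** Suggestion lookup on a basic trie. A sketch under the assumptions stated above. */
class SuggestTrieSketch {
    private static final class Node {
        final Node[] next = new Node[26];   // assumption: keys use a-z only
        boolean isWord;
    }

    private final Node root = new Node();

    /** Inserts a lowercase a-z key. */
    void insert(String key) {
        Node node = root;
        for (int i = 0; i < key.length(); i++) {
            int c = key.charAt(i) - 'a';
            if (node.next[c] == null) {
                node.next[c] = new Node();
            }
            node = node.next[c];
        }
        node.isWord = true;
    }

    /** Follows the input while it matches, then lists up to limit words below that point. */
    List<String> suggest(String input, int limit) {
        Node node = root;
        StringBuilder prefix = new StringBuilder();
        for (int i = 0; i < input.length(); i++) {
            char ch = input.charAt(i);
            if (ch < 'a' || ch > 'z' || node.next[ch - 'a'] == null) {
                break;                       // the user strayed from the trie here
            }
            node = node.next[ch - 'a'];
            prefix.append(ch);
        }
        List<String> out = new ArrayList<>();
        collect(node, prefix, out, limit);
        return out;
    }

    /** Depth-first walk in alphabetical order, collecting stored words. */
    private static void collect(Node node, StringBuilder prefix, List<String> out, int limit) {
        if (out.size() >= limit) {
            return;
        }
        if (node.isWord) {
            out.add(prefix.toString());
        }
        for (int c = 0; c < 26; c++) {
            if (node.next[c] != null) {
                prefix.append((char) ('a' + c));
                collect(node.next[c], prefix, out, limit);
                prefix.setLength(prefix.length() - 1);   // backtrack
            }
        }
    }
}
```

With the words "alpha", "alps", "all" and "beta" inserted, suggest("alpx", 10) matches "alp", strays at "x", and returns "alpha" and "alps". The earlier the user strays, the larger the subtree below the last matched node, which is why this approach tends to return more suggestions; the limit parameter is one simple way to cap them.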
