Friday 20 May 2011

Designing a spell check or corrector. Round 1

Problem: How to implement a simple dictionary. Given a list of words 60000 words in my case. The tough part is not with checking if a word is in the dictionary or not. It is the correction part that is challenging. 

I was aware of how McIlroy used a hash to convert strings and subsequently used the hash as an index into a table. The space compression using bits is also there. So, how to go about the correcting a word. For example, type in the words 'intelligence', 'indelligence', 'itelligance', 'indelligance' into Google search. The search engine asks if you meant the word intelligence. The similarities with all these is how they sound. Even 'entelligant' will get you intelligent. Some reading gave two methods to approach this problem one was to play with the word and making slight changes to see if the new word is in the dictionary. The second is known as the Soundex method that, identifies words that sound alike. The concept of a signature seems to be crucial in this business but the better the signature is in representing all the letter, their sequence in the final signature the more successful the checker will be. 

Could the signature itself be used to find the locality of the word if not the word itself?

A simple not so effective solution: I used the most common signature for strings. Given a word 'cabbage' the signature is a2b2c1e1g1. Kind of beats the purpose to store a signature larger than the word. So another common knowledge to use a2b2ceg. This could tell me some relations between the existing word in the dictionary only. But, when it comes to a wrongly spelt word for example 'cabage' the signature becomes 'a2bceg'. Binary search through the signature list will not help because the signature can lead you astray. A modification with a look ahead logic to see if the direction of binary search will yeild a result or close, was also not up to the mark. The real question is whether I need to store the correct words' list. Data structures like tries, suffix arrays all come to mind. But, search engine uses tries to go to the list of pages with that word. Here I just want to correct the word.  My instinct says no if I was to go with the first method mentioned above.  

I am curious to look at Knuth's description of the soundex algorithm and went through some documents of english words. But, for the time being I feel like abandoning both, although not completely, in favour of a simple approach to do a simple spell correction in a given list of words. More on this as I near a good to disclose method.

Saturday 14 May 2011

Serialization to the rescue

I wanted to post this ever since I had been through Bentley's articles. This happened when I was working on mobile devices and subject is about exploring alternate solutions for a given software problem, each with its own plus, minus points. The lesson I learned is that, there is always a scope and environment to the code that you write and making the right tradeoffs can get you a solution faster.

The Problem: To display visual warning to users on a PDA (MDA Compact 3 & HTC Qtek S9100) when they cross the speed limit. Simple enough. But, there are strings attached. You have to build the feature on an existing solution which displays a map on the screen and provides navigation, traffic info to the users. The exisitng solution is keeping track of the location via gps <= every second and you have to use that data. Now, the display should be like the speed signs on the streets. You need to provide audio alerts too. Things are not simplified by the fact that, a version of TomTom will also be running on the PDA. PDA has 200MHz processor, 64MB RAM. There is the previous version with still limited features that, the software is to work. How do we show the sign/live feed on the screen ?  A sample sign 

Solution 1: Since we had the speed info from the existing components, we could simply draw our way through. A good drawing algorithms to put the street symbol with the speed data on the screen. After a couple of runs and stress tests (remember the existing load too), it was not very responsive plus the screen was left in an inconsistent state. Also, the platform was Microsoft Windows mobile, programming was done using embedded VC++  and Microsoft foundation classes.

Solution 2: Show the sign on a seperate window and use MFC window region classes. What about the Sign itself ? Store it in the memory. The device could spare some. So why not? The bottle neck was with the MFC classes in the device sdk. It was stripped down and did not have the full region functionalities. You can set the region but, how to define the circular region? Again a dead end.

At this time I was pretty much finished with Jeff Prosise's Programming windows with MFC. It and other old Wintellect books have been my favourite while on Microsoft platform. Not to mention Charles Petzold. At this time I remembered about Serialization. Yes, that can let you save the state of an object. Aha. So, I can define the region on my desktop and serialise the region classes. Read the region classes in the PDA and use them as I like. I ran back to my workstation to try this out and see the effects of the stress tests. And it worked. Next my test representative and myself took a version with the modification on a field test. It performed better than the other option.

Although there were a number of tradeoffs in terms of memory, response time etc, solution worked. I discussed it with my lead, tech consultant, project colleagues and we had a good laugh.

Thursday 12 May 2011

Oho! algorithms

Again I have been coding some stuff and analysing the runtimes. This time was the two algorithms for rotating a vector left by N locations. This, a reverser algorithm, is used by Brian Kerningham and Plaugher in their text editor and also by Ken Thomson in the editor for Unix and by Bentley too. So, if ABCD is a vector you want to rotate it by 2 locations to the left, we get CDAB. 

The first algorithm is a Juggler algorithm which juggles the values at indices similar to a shift. This is something that, I arrived at and so I will describe the other 'Reverser' algorithm. It goes like this. AB be the string. A is N units length and we want to rotate by N so the result is BA. First do a Ar B where Ar stands for the reverse of A. Then Ar Br. Then ( Ar Br )r. This should give you BA. Brian reported that, this worked first time. I also am glad to report that, it worked first time ;) :)). This is ofcourse if you have a reverse function to do the Ar and Br etc. In my case it is just a switch of array locations starting from the extremities and moving to the center of the array. 
The general analysis of the two algorithms is not much a do to share on a blog. The reverser is always better than the Juggler algorithm. They did exhibit some minor changes in their runtimes of the order of 1 or 2 ms. 

Well that said, *There is a general question of how the GCD of N and the length of the string figure in the analysis. So, this is what I did (open to questioning).

Input a 1248000 long string.
N is such that, N is the GCD.  

At lower values of N (the GCD), both algorithms are the closest in terms of runtimes. Here is N = 4. Almost same. 10 ms and 10.7 ms.

At huge N for N = 1248, the Juggler takes almost twice the time of the Reverser. 6.99 and 13.1.

This seems to be consistent. Anyways, I think I will use the Reverser in my encrypting text editor Crystal. :)

Tuesday 10 May 2011

Getting runtime from ~ 15 seconds to 4 seconds

This is a bit of information on how I got a ~ 15 second program to work in less than 4 seconds. Although the insights are obvious, using the profiler defenitely gave more info than raw intuition.

The Problem: The anagram problem as discussed by Bentley. For a quick example 'rickets', 'sticker' and 'tickers' are anagrams. Using a permutations approach is defenitely unacceptable because of the time complexity. This brings us to the signature approach where we identify anagrams based on signatures. My program uses quick sort to build a signature. The partition code for the quick sort is that discussed by Bentley as I know from my college days the bugs and the confusion involved in coding the partition using the regular method.

The input to the code is a 58111 long list of english words.

Case 1: Using a swap function, quick sort and signature function. Output was dumped to the screen using 'System.out'.  The total run time of the program was around 15 seconds. The quick sort took close to under 5 seconds. The screen dump took around 7-8 seconds !.

* I replaced the screen dump code with a write to a text file. So, I wrote all the anagrams (the hash map contents) to a text file. But, this was done using Buffered Writer. This saved 8 seconds leaving the whole code to around 6-7seconds.

This is a general approach I always used.

Case 2: I had a look at the time taken by the functions and noted the following. Notice the quicksort, partition, swap methods in the hot spots window. The invocation frquency for the simple swap is a place to tune. I replaced the 3 line swap code at 3 invocation points in the quick sort. So swap as a seperate function is gone.

Before the substitution was made partition code took colse to 2 seconds. After eliminating the method invocation, it takes 414 ms ! Quick sort now takes 2.3 seconds. But, overall time saved is 2 seconds! Now the runtime is 3.6 seconds compared to 5.5 before.

Case 3: Something I have taken for granted is Horners method in hashing a string. This is explained by Robert Lafore in his book. Hasing the strings took 120 ms.

Case 4: Since we can save time on method invocations, I believed that a non-recursive version of quick sort can have a serious effect on the running time. This was not the case. The self time of the quick sort function was more by 1 second.  The program itself was a bit slow by 3-4 seconds. There are more instructions in the quicksort function so, the increase is 'explainable'. The stack which was used to bypass the recurssion was being called, for pop and push a huge number of times. The point is that, the non recursive version of quick sort did not yield anything better. On the contrary it is performing a bit worse (7-9 seconds for the whole program ! looking for the missing 3-4 seconds ).