Programming Communications: June 2011

Sunday 12 June 2011

Lejos

Lejos is a firmware jvm that can live in the mindstorms nxt brick. I installed it a couple of days ago after some hiccups. I was following some documents on the web to install this on the brick. After a bit of frustration I was able to get it right. Thanks to mindstormscreator and his installation videos on youtube. This also shows how to get everything up and running even with eclispe IDE. The test run of a java program that plays a tune on the mindstorms brick was successful. At one time after failed flashes my brick was showing the lejos version like a counter 0.7 -0.8, also showing a java exception on the lego brick, finally settling in 0.8. But, it is all sorted out now. A particular book on robotics studio was very tiring even after a number of reads. So, I decided to go the Java way. More information on how to on Lejos is here Lejos. Can't wait to program in java for the mindstorms brick. Hoping for challenges as in Bentley!

Thursday 9 June 2011

Google.com Doodle Guitar 9 June 2011

Check out the new doodle from Google. If you have not seen it go to Google.com. Not your regional google search page but, the US or .com level page. The doodle is a playable guitar. This is my favourite doodle and also the shortest post on this blog ! Also, Google voice search is available on this google.com domain. in case you did not notice. A picture of the doodle

Thursday 2 June 2011

Spell checker design using scatter hash algorithm

This is the design of a spell checker that, I finally coded up. The principle is same as that of McIlroy but, the hash and the bit vector distribution is based on trial and error. I used a list of words that, I stumbled upon in the web. It is by no means optimized for spell corrections. It has 58112 words which I assume as almighty. I take each word and subsequently hash it 32 times to get 32 values which are mapped into a bit vector. The bit vector from the previous post was reused. So, if a word not in the dictionary is entered and checked, atleast one of the hashes would fail. If it failed, the word is not in the dictionary. Happy. Since this uses scattered hash/sort of :), an invalid word can match the hashes accidentally. The fun part was, how this is affected by the hash table size and the hash function. You can literally see the error increase/decrease as you alter these.

The input: a list of 58112 word.
The output behaviour:

Bit vector size invalid word accepted?

--------------------------------------------------------------
case a                500KB                No.
case b                375KB               No.
case c                250KB                No.
case d                125KB                Yes.
Final case:     93.75 KB                Yes.

The first numbers on the bit vector memory feel not so good for a 58K word list. The mathematics that, describes the relation between the hash bit length (32 here), word list length and the probablity of an invalid word hashing correctly is described by McIlroy. This comes into play in the last two cases above. I did not prove it but, runs of the program point to that clearly. It can still be downsized but, the point was to try the scatter technique for the spell checker.

Observation: I did change the 32 values used in the hash functions which gave me a good reduction in bit vector size and lower error rate. The
final run used 32 consecutive values from the farther end of the proposed hash table size. Previously, I was using differences of 100 ie N - 100, N - 200, N -300, N -400 ...... N - (32 * 100). All that left me was with even values in the hash functions. This affected the accuracy of the hash I can say from the run of the program (i.e invalid words matched seeminlgy easy). But, when I used the consecutive values i.e N - 1, N -2....N - 32 I have introduced primes not purposefully. But, the urge to change the hash values was there. McIlroy says that, the use of n primes for the n hash functions is very good and he used it. A lot of reading on the web claimed that, the use of primes in hash function and their advantage was not that great. But, this small program with only a couple of primes seems to be faring better than with out them. Why 32? McIlroy used 27 I picked 32!. :)). This choice is to be based on the mathematical relation I pointed to above (refer McIlRoy paper).

The program when profiled in Netbeans takes 0.169ms to find if a word in the dictionary. A run of the program on this 573 word post gave optimized, Netbeans, McIlroy, atleast, probablity, seeminlgy as invalid. True. None of these were in the word list. "seeminlgy" is misspelt in this post! The checking code took 20.4 ms.