Tuesday 13 December 2011

Classic Longest common subsequence via Dynamic programming

Run time Profiling results of classic longest common subsequence algorithm. 

An example application of this algorithm is finding the difference of two text files. We need to find the longest common subsequence of two sequences. i.e for "bdca" and "bcbda" the longest common subsequence is of length 3 and is  "bda"
The core algorithm using dynamic programming for this is as follows

Lcs(i,j) = {      Lcs(i-1, j-1) + 1 ; if match at i & j.
               Max( Lcs(i-1, j), Lcs(i, j-1) ); if mismatch at i & j
        }

This can be implemented recursively using a two dimensional array m x n where m, n are sequence lengths. For example, two sequences "bdca" and "bcbda" we have the matrix for the algorithm above

0 1 2 3 4

b c b d a
     --------------------------------------
0 b  | 1   1   1   -1   -1 
     |
1 d  | 1   1   1   2   -1
     |
2 c  | -1   2   2   2   -1   
     |
3 a  | -1   -1   -1   -1   3
     --------------------------------------  
   
So here 3 is the length of the longest common subsequence. base on the core algorithm above.
Runtime complexity is about O(mn). This particular basic program yeilds a time as show here

Improvement:

By analysing the recurssion tree of this particular problem there is an improvement that can be made to the runtime. The addition to the core algorithm is as follows

If Lcs(i, j) has not been calculated so far ?

    Lcs(i,j) = { 
                        Lcs(i-1, j-1) + 1 ; if match at i & j.
        Max( Lcs(i-1, j), Lcs(i, j-1) ); if mismatch at i & j
            }

else return the calculated Lcs(i, j);

This small addition saves the code from going to calculate duplicate recursion trees. The time saved (~ 30 ms) is evident here

The sample recurssion tree is shown below. There are a lot of repeated duplicate tasks.

     4, 4

   3,4     4, 3

   2, 4, 4,3          3, 3   4, 2   

1, 4 2, 3 .... 2, 3   3,2 .... 3, 2 4,1
     
Reconstruction is based again on the algorithm. We backtrack based on the two steps of the algorithm. ie if match occurred or the Max of sub-problems.

1. We start at the right end of the tree
2. if there was a match at i & j then we take sequence[i] (= sequence[j] as in step 1 of algorithm) and go to (i-1, j-1)
3. else we can go either to (i-1, j) or (i, j-1) in the matrix. The logical thing to do is to go where there is maximum value.
4. if matrix values at  (i-1, j) or (i, j-1) are same we may have to investigate both.

The disadvantage of this approach is that, as the length of the sequences go up the memory for the matrix also goes up as m x n. One way to improve the memory is to use only two rows since we look at only those every time. But, we may need to keep track of which way we step to reconstruct the lcs sequence. 

Wednesday 7 December 2011

A take on Huffman coding algorithm. How to squeeze data. 133 bytes to 69 bytes.

Problem: Compress data in a text file using huffman encoding algorithm.
Basic approach:  
1 Read the file and build a Huffman encoding tree.
2 Get encoded bit string.
3 Decode and get the original data by walking the Huffman encoding tree.

This method of compression is based on an inefficiency in normal representation of data strings. The inefficiency is that, we use 8 bits to encode a character that appears 1000 times and 8 bits again for a character that appears only once in the input data. So, if we use less number of bits to represent a high frequency character and comparatively more for low frequency character, then we can compress the data. This gives rise to two new issues.

a) The issue then becomes one of knowing the character set or the alphabet of your data in advance inorder to go ahead with this type of encoding. 

b) Encoding the characters as bit sequences can lead to one character being the prefix of another. This can lead decoding program astray. i.e if m is 01101 and e is 011, then we have problem decoding 01101 in out stream since the m might be decoded as an e.

For the time being lets take the ascii character set to address issue a. To deal with issue b we can use a binary tree to represent the encoding. To make sure that, no bit sequnce is  a prefix of another character we use the huffmann algorithm to build the tree. This tree is the huff encoding tree. For example an encoding tree is shown here. 

Input: is a file with the content  
"this is a test file input to huffman encoding algorithm. This will be compressed to a huffmann code. Huffmann encoding tree is used."

The frequency of the characters is calculated as shown here.

Notice that, e occurs 11 times and is encoded with only 3 bits where as b occurs only once and is encoded by 7 bits!. The huffman encodings for the characters are shown here. 

Result: The huffman encoding tree has 45 nodes and the encoded bit stream of the input above is 551 bits ie 68.8 bytes as opposed to the 133 bytes in the input. Bit vector and heap help the encoding tree generation algorithm. But, the tree itself seems to be better of as a pointer based tree as opposed to a array based tree (may not be a balanced tree). This can be pretty handy when sending considerable amounts of data over the network in a software environment.


Tree building algorithm:

1 Read the data from the file.
2 Build a table with the count for each character in your alphabet.
3 Build a huffmann encoding tree based on the frequency of your alphabet.
  a) create leaf nodes with all the characters occuring in the file along with their     
      frequencies.
  b) take two nodes which have the smallest frequencies.
  c) combine these two nodes as the left and right children of a new node.
  d) push the new node into the list with frequency set to the sum of the two child nodes.
4 repeat step three until you end up with one node in the list i.e your root node for the encoding tree.

To encode:
1. Replace each character with the bits encountered while getting to the character form the root of the tree.

To decode:
1 input your stream of encoded bits
2 Until your input is not over
3 for each of the bit encountered 
  a move to the left of the tree if it is a bit 0
  b move to the right of the tree if 1 bit
4 if you have reached the leaf node of the encoding tree, go to the root.

Friday 28 October 2011

How to Code your own Search Engine Like Google, Bing or Yahoo

This subject of this post has been upgraded with new algorithm for set intersection. Go February 2012 posts for a better solution.


Problem: 
How to code your own search engine like Google or Bing? How to design the search engine? The search engine should give results in run time Order of (length of search key words)
Input: 
100 html files which have a lot of english words in it. Here we focus on the words only. Parsing out the html tags can be done too.
Solution: 
Here I use Trie and Binary Tree data structures to design the core search engine. Data can be extracted from pages and the data structures can be built by another program called crawler. This can be combined with the spell suggestion program in the previous post to make it close to real life. 
Design considerations here:
The design for the search engine is shown below. The tries as explained in the previous post gives search results in the Order of length of key words. Here we have two words "search " and "engine". The trie has paths for each of these search terms and end in a list of HTML files that have these search terms in them. This list is maintained as a binary search tree. This makes it to find the set intersection of the files which have both the search terms! Any data structure that efficiently finds set intersection can be used.


Working:
1) Crawl the files. Take the words from the files and add them to the trie. At the trie leaf node set the list of files in a binary tree. 
2) Input the search terms.
3) For each term retrieve the set of files. Here it is available as binary search trees at the leaf nodes. Since we use a trie, for any number of files, I get the set of files for any given word proportional to the length of the search key word.
4) Find the set intersection of these and display to the user. Here a binary search tree is used to represent sets. Any set representation can be as long as the process of finding set intersection is done efficiently.


The following screen shot shows the search engine in action for the key words, "search", "engine" and "search engine"




The run-time:
To find a single word is around 39 microseconds on my computer using the netbeans profiler (as the trie in the previous blog post). The run time for my search engine is shown below. The search took only 62 microseconds. Now that is how Google is fast. As long as your search data structure and set data structure is chosen correct, you can have it as fast as Google or Bing. Plus you can add additional meta data such for paid pages, relevance to terms, page rank etc and make it better. As for memory you can use a better trie as mentioned in the previous post. 

Tuesday 25 October 2011

Spell Suggestion Like Google / Microsoft Word

Problem: 

To implement a spell suggestion program similar to the one you see on Google search box, Microsoft Word and the like. English Language dictionary is used.

Solution / Approach

Algorithm uses a trie to store your keys. Walk through the trie with the user to find the suggestions. An alternative approach is use word edits and employ the spell checker in a previous post. 

Input: A english dictionary of 59000 words.

Advantages: Search for a Key is very fast. Suggestions can be found in good time too.

Disadvantages

The size of the trie can be a limiting factor. But, more than one trie compression technique is available especially for English. But, for this experiment I was focussing on run-time with the first implementation only.

Theory

Trie is a M-ary tree. An indexing operation is done at each node of the tree. The indices are the characters of the search Key. Suppose there are k possible character values for key. In English a-z  + unique marker gives 27. Then each node of the trie has k+1 pointers one for each characters possible at that node. This is similar to the way we search for a word in the dictionary. For example, the word "alpha" is looked up by  looking up character "a" then within that, lookup "l" and so on. You can have the pages of all words beginning with "al". In a trie lookup the first character of the key at the root. Then, follow the pointer in the table corresponding to that character. For the word list the number of nodes in the Trie is 143699 ! The word list is by no means compact / processed for that matter. 

Searching an item simply follows characters in the key through the tree and if you have exhausted the key, you have a hit or a miss. Inserting Keys is simple as in searching, you add nodes when you have a miss for the key values.

In a trie, you search in O(length of the Key). The snapshots speak for themselves on this.     


On the other hand the memory requirement for the basic trie is high. If there are n nodes, then there are n * (k + 1) pointers. The two variations that can be used are 

1) Linked list for Pointers in nodes as most of the table is empty in a Trie for English. The trie using the linked list is a de la Briandais Trie. 

2) Then there is Practical Algorithm To Retrieve Information Coded In Alphanumeric. Patricia Tries! Here, the algorithms for building and maintaining the tree will change since we have to store the keys too.

3) Tries can also be represented as two dimensional arrays without the key values being stored in the table. This can save quite a bit compared to basic trie. 

Experiment:

The run of the java program in NetBeans profiler is here. This is for a small word.


For a really long word the time is bit more as shown in profiler.


Suggestions example runs. The algorithm follows from where the user strayed. Although this throws up more suggestions, it works.

Tuesday 2 August 2011

Beginning Java Security on Android

Android provides all of the java security libraries in Java. Android provides cipher engine classes, Keystore, algorithms, public key cryptography, key generators and the like. The list of algorithms that are supported seems as in the java language updates. Some searching and trials may be needed on this. This is true for the simulator that comes with the android devices too. 


One difference is that, while the default security provider for java on desktops is SunJCE the one on android is bouncycastle. So the default keystore type is BKS rather than JKS. This was not so obvious at first but, a few trials showed that, the keystore was BKS and you can also have PKCS12 keystore. The rest is all the same except that, if you create a PKCS12 keystore you need to store an entry as a PKCS12 type/stream otherwise you get an exception asking for a PKCS12 type. 


Also, If the keystore path has a '.' anywhere in the path, storing the keystore throws exception. So if you stor your keystore at a location with your 'package folder' name for example 'com.myapp' it may result in exception. Removing the '.' character from the path name avoided this.

Monday 1 August 2011

Encrypting Text Editor Crystal v 1.0 for Android

About the application


In the past few weeks I put together an android application which is similar to Crystal in my old blog post. This application allows the user to store text data on an android device by encrypting it. Crystal v 1.0 for android uses java security libraries available on android sdk. Crystal allows you to create a master password for your data files. Once you set your master password, you can write files using crystal and save them in encrypted form and read them later with your master password. 


Crystal v 1.0 for android apk file is here . You will need a third party application to install it on your android phone. (Until I get my android market account set properly !)


    Welcome Screen
     
 More screen shots and how to use crystal is here 


Current limitations
1) Your master password is immutable.
2) Crystal Selects the location for files on your device automatically.
3) AES is the only algorithm supported in the current version.

How it works
Crystal uses java security on android to create a master keystore on your device. This key store holds your key which is used to encrypt/decrypt your data files. The keystore has a password which is the master password. You set this password while using the application and it automatically generates a symmetric key and stores it in your keystore.  The master password is never stored anywhere on your device. Your key is also safe as long as you keep your master password.

Saturday 16 July 2011

Android User Interface Part 2

To build a compound control as shown below.
As in the previous post the steps are to 
1) Define the layout in xml.
2) Define a class to represent the layout in code.
3) Inflate the layout in code.
4) Set it to an activity.
The xml layout.
-------------------
The difference is in the attributes that are specified on the xml layout that allows the buttons or any other view to take up enough space as we need it to.
The layout specifies a editview, a list view and two buttons. The two buttons have their width set to 0 dip. The table row which has the button has a collective weight sum of 2. Each of the buttons has a weight of 1. So they occupy half the width each.
The layout class
-------------------
This is similar to the class in the previous blog but, here it is sub-classed from tablelayout. The rest is same.
Using this in an activity
----------------------------
This is also the same as in previous post. Insatntiate a class of the layout just created with the activity as the context. Set it to the activity using 'setContentView' method. The method also shows how to use the arrayadapter to set values to a listview.
The advantage of using the xml layout is that, once it is defined and represented in a layout sub-class it can be used at multiple application activities on the run.

Android User Interface: Using Xml layout resources Part 1

Problem: Getting a layout in xml, to bring it up on an activity. 


In android Views are widgets or controls like textviews, buttons, lists and the like. Activity is what you see on the screen i.e your window or frame that is to be displayed. Layout is how you arrange / position your views on the screen. This is similar to java layouts.


Solution:


The first thing to do is to get your layout defined in an xml layout resource. This can easily be done by using the tools in eclispe for android development. Either you define the xml as shown below in hand or you can use the graphical editor. 


In the xml file below, there is a textview that acts as a banner and a listview. The attributes of the views are mostly straight forward. We can get more control over the views using these attributes. For the time being we focus on getting this xml layout on the screen.
The second step is to represent layout in code using a class which is subclassed from android.widget.LinearLayout. We will use this class in code to build the layout and the views in it (inflate) at runtime. 


(i) Define a class that extends from android.widget.LinearLayout
(ii) Define fields in the class for the textview and listview. For an instance of    this layout class we will get the corresponding view objects into these.
(iii) Override the Constructor for the view, place a call to the super class constructor. Now we add the code to inflate the xml layout. 
(iv) Inflating the layout is done using a system service in android called layout inflator service. Once we inflate the layout we get a reference to the textview and the listview of this layout by calling the findViewById method.


So far the layout is there to be used on an activity. To use this layout class in an activity,


(i) Declare this layout object in your activity.
(ii) In the onCreate method of this activity create the layout object using the current activity as context.
(iii) Set the layout using the 'setContentView' method.


You can set some values for the text view and the list view after this or in the layout class itself. Here an arrayadapter is used to set values to the listview. 




The advantage of using the xml layout is that, you dont build the layout in code and keep you user interface decoupled from the code. When the layout inflator inflates the layout, it can determine the best by using the xml attributes.


** Note:  Never forget to list your own new activities into the android manifest.











Sunday 12 June 2011

Lejos

Lejos is a firmware jvm that can live in the mindstorms nxt brick. I installed it a couple of days ago after some hiccups. I was following some documents on the web to install this on the brick. After a bit of frustration I was able to get it right. Thanks to mindstormscreator and his installation videos on youtube. This also shows how to get everything up and running even with eclispe IDE. The test run of a java program that plays a tune on the mindstorms brick was successful. At one time after failed flashes my brick was showing the lejos version like a counter 0.7 -0.8, also showing a java exception on the lego brick, finally settling in 0.8. But, it is all sorted out now. A particular book on robotics studio was very tiring even after a number of reads. So, I decided to go the Java way. More information on how to on Lejos is here Lejos. Can't wait to program in java for the mindstorms brick. Hoping for challenges as in Bentley!

Thursday 9 June 2011

Google.com Doodle Guitar 9 June 2011

Check out the new doodle from Google. If you have not seen it go to Google.com. Not your regional google search page but, the US or .com level page. The doodle is a playable guitar. This is my favourite doodle and also the shortest post on this blog ! Also, Google voice search is available on this google.com domain. in case you did not notice. A picture of the doodle


Thursday 2 June 2011

Spell checker design using scatter hash algorithm

This is the design of a spell checker that, I finally coded up. The principle is same as that of McIlroy but, the hash and the bit vector distribution is based on trial and error. I used a list of words that, I stumbled upon in the web. It is by no means optimized for spell corrections. It has 58112 words which I assume as almighty. I take each word and subsequently hash it 32 times to get 32 values which are mapped into a bit vector. The bit vector from the previous post was reused. So, if a word not in the dictionary is entered and checked, atleast one of the hashes would fail. If it failed, the word is not in the dictionary. Happy. Since this uses scattered hash/sort of :), an invalid word can match the hashes accidentally. The fun part was, how this is affected by the hash table size and the hash function. You can literally see the error increase/decrease as you alter these. 

The input: a list of 58112 word.
The output behaviour:

                Bit vector size     invalid word accepted?
--------------------------------------------------------------
case a                500KB                No.
case b                375KB               No.
case c                250KB                No.
case d                125KB                Yes.
Final case:      93.75 KB                Yes.

The first numbers on the bit vector memory feel not so good for a 58K word list. The mathematics that, describes the relation between the hash bit length (32 here), word list length and the probablity of an invalid word hashing correctly is described by McIlroy. This comes into play in the last two cases above. I did not prove it but, runs of the program point to that clearly. It can still be downsized but, the point was to try the scatter technique for the spell checker.

Observation:  I did change the 32 values used in the hash functions which gave me a good reduction in bit vector size and lower error rate. The
final run used 32 consecutive values from the farther end of the proposed hash table size. Previously, I was using differences of 100 ie N - 100, N - 200, N -300, N -400 ...... N - (32 * 100). All that left me was with even values in the hash functions. This affected the accuracy of the hash I can say from the run of the program (i.e invalid words matched seeminlgy easy). But, when I used the consecutive values i.e N - 1, N -2....N - 32 I have introduced primes not purposefully. But, the urge to change the hash values was there. McIlroy says that, the use of n primes for the n hash functions is very good and he used it. A lot of reading on the web claimed that, the use of primes in hash function and their advantage was not that great. But, this small program with only a couple of primes seems to be faring better than with out them. Why 32? McIlroy used 27 I picked 32!. :)). This choice is to be based on the mathematical relation I pointed to above (refer McIlRoy paper).

The program when profiled in Netbeans takes 0.169ms to find if a word in the dictionary. A run of the program on this 573 word post gave optimized, Netbeans, McIlroy, atleast, probablity, seeminlgy as invalid. True. None of these were in the word list. "seeminlgy" is misspelt in this post! The checking code took 20.4 ms.



Friday 20 May 2011

Designing a spell check or corrector. Round 1

Problem: How to implement a simple dictionary. Given a list of words 60000 words in my case. The tough part is not with checking if a word is in the dictionary or not. It is the correction part that is challenging. 

I was aware of how McIlroy used a hash to convert strings and subsequently used the hash as an index into a table. The space compression using bits is also there. So, how to go about the correcting a word. For example, type in the words 'intelligence', 'indelligence', 'itelligance', 'indelligance' into Google search. The search engine asks if you meant the word intelligence. The similarities with all these is how they sound. Even 'entelligant' will get you intelligent. Some reading gave two methods to approach this problem one was to play with the word and making slight changes to see if the new word is in the dictionary. The second is known as the Soundex method that, identifies words that sound alike. The concept of a signature seems to be crucial in this business but the better the signature is in representing all the letter, their sequence in the final signature the more successful the checker will be. 

Could the signature itself be used to find the locality of the word if not the word itself?

A simple not so effective solution: I used the most common signature for strings. Given a word 'cabbage' the signature is a2b2c1e1g1. Kind of beats the purpose to store a signature larger than the word. So another common knowledge to use a2b2ceg. This could tell me some relations between the existing word in the dictionary only. But, when it comes to a wrongly spelt word for example 'cabage' the signature becomes 'a2bceg'. Binary search through the signature list will not help because the signature can lead you astray. A modification with a look ahead logic to see if the direction of binary search will yeild a result or close, was also not up to the mark. The real question is whether I need to store the correct words' list. Data structures like tries, suffix arrays all come to mind. But, search engine uses tries to go to the list of pages with that word. Here I just want to correct the word.  My instinct says no if I was to go with the first method mentioned above.  

I am curious to look at Knuth's description of the soundex algorithm and went through some documents of english words. But, for the time being I feel like abandoning both, although not completely, in favour of a simple approach to do a simple spell correction in a given list of words. More on this as I near a good to disclose method.

Saturday 14 May 2011

Serialization to the rescue

I wanted to post this ever since I had been through Bentley's articles. This happened when I was working on mobile devices and subject is about exploring alternate solutions for a given software problem, each with its own plus, minus points. The lesson I learned is that, there is always a scope and environment to the code that you write and making the right tradeoffs can get you a solution faster.

The Problem: To display visual warning to users on a PDA (MDA Compact 3 & HTC Qtek S9100) when they cross the speed limit. Simple enough. But, there are strings attached. You have to build the feature on an existing solution which displays a map on the screen and provides navigation, traffic info to the users. The exisitng solution is keeping track of the location via gps <= every second and you have to use that data. Now, the display should be like the speed signs on the streets. You need to provide audio alerts too. Things are not simplified by the fact that, a version of TomTom will also be running on the PDA. PDA has 200MHz processor, 64MB RAM. There is the previous version with still limited features that, the software is to work. How do we show the sign/live feed on the screen ?  A sample sign 



Solution 1: Since we had the speed info from the existing components, we could simply draw our way through. A good drawing algorithms to put the street symbol with the speed data on the screen. After a couple of runs and stress tests (remember the existing load too), it was not very responsive plus the screen was left in an inconsistent state. Also, the platform was Microsoft Windows mobile, programming was done using embedded VC++  and Microsoft foundation classes.

Solution 2: Show the sign on a seperate window and use MFC window region classes. What about the Sign itself ? Store it in the memory. The device could spare some. So why not? The bottle neck was with the MFC classes in the device sdk. It was stripped down and did not have the full region functionalities. You can set the region but, how to define the circular region? Again a dead end.

At this time I was pretty much finished with Jeff Prosise's Programming windows with MFC. It and other old Wintellect books have been my favourite while on Microsoft platform. Not to mention Charles Petzold. At this time I remembered about Serialization. Yes, that can let you save the state of an object. Aha. So, I can define the region on my desktop and serialise the region classes. Read the region classes in the PDA and use them as I like. I ran back to my workstation to try this out and see the effects of the stress tests. And it worked. Next my test representative and myself took a version with the modification on a field test. It performed better than the other option.

Although there were a number of tradeoffs in terms of memory, response time etc, solution worked. I discussed it with my lead, tech consultant, project colleagues and we had a good laugh.




Thursday 12 May 2011

Oho! algorithms

Again I have been coding some stuff and analysing the runtimes. This time was the two algorithms for rotating a vector left by N locations. This, a reverser algorithm, is used by Brian Kerningham and Plaugher in their text editor and also by Ken Thomson in the editor for Unix and by Bentley too. So, if ABCD is a vector you want to rotate it by 2 locations to the left, we get CDAB. 

The first algorithm is a Juggler algorithm which juggles the values at indices similar to a shift. This is something that, I arrived at and so I will describe the other 'Reverser' algorithm. It goes like this. AB be the string. A is N units length and we want to rotate by N so the result is BA. First do a Ar B where Ar stands for the reverse of A. Then Ar Br. Then ( Ar Br )r. This should give you BA. Brian reported that, this worked first time. I also am glad to report that, it worked first time ;) :)). This is ofcourse if you have a reverse function to do the Ar and Br etc. In my case it is just a switch of array locations starting from the extremities and moving to the center of the array. 
The general analysis of the two algorithms is not much a do to share on a blog. The reverser is always better than the Juggler algorithm. They did exhibit some minor changes in their runtimes of the order of 1 or 2 ms. 

Well that said, *There is a general question of how the GCD of N and the length of the string figure in the analysis. So, this is what I did (open to questioning).

Input a 1248000 long string.
N is such that, N is the GCD.  

At lower values of N (the GCD), both algorithms are the closest in terms of runtimes. Here is N = 4. Almost same. 10 ms and 10.7 ms.




  
At huge N for N = 1248, the Juggler takes almost twice the time of the Reverser. 6.99 and 13.1.


This seems to be consistent. Anyways, I think I will use the Reverser in my encrypting text editor Crystal. :)

Tuesday 10 May 2011

Getting runtime from ~ 15 seconds to 4 seconds

This is a bit of information on how I got a ~ 15 second program to work in less than 4 seconds. Although the insights are obvious, using the profiler defenitely gave more info than raw intuition.




The Problem: The anagram problem as discussed by Bentley. For a quick example 'rickets', 'sticker' and 'tickers' are anagrams. Using a permutations approach is defenitely unacceptable because of the time complexity. This brings us to the signature approach where we identify anagrams based on signatures. My program uses quick sort to build a signature. The partition code for the quick sort is that discussed by Bentley as I know from my college days the bugs and the confusion involved in coding the partition using the regular method.

The input to the code is a 58111 long list of english words.

Case 1: Using a swap function, quick sort and signature function. Output was dumped to the screen using 'System.out'.  The total run time of the program was around 15 seconds. The quick sort took close to under 5 seconds. The screen dump took around 7-8 seconds !.

* I replaced the screen dump code with a write to a text file. So, I wrote all the anagrams (the hash map contents) to a text file. But, this was done using Buffered Writer. This saved 8 seconds leaving the whole code to around 6-7seconds.

This is a general approach I always used.



Case 2: I had a look at the time taken by the functions and noted the following. Notice the quicksort, partition, swap methods in the hot spots window. The invocation frquency for the simple swap is a place to tune. I replaced the 3 line swap code at 3 invocation points in the quick sort. So swap as a seperate function is gone.

Before the substitution was made partition code took colse to 2 seconds. After eliminating the method invocation, it takes 414 ms ! Quick sort now takes 2.3 seconds. But, overall time saved is 2 seconds! Now the runtime is 3.6 seconds compared to 5.5 before.








Case 3: Something I have taken for granted is Horners method in hashing a string. This is explained by Robert Lafore in his book. Hasing the strings took 120 ms.

Case 4: Since we can save time on method invocations, I believed that a non-recursive version of quick sort can have a serious effect on the running time. This was not the case. The self time of the quick sort function was more by 1 second.  The program itself was a bit slow by 3-4 seconds. There are more instructions in the quicksort function so, the increase is 'explainable'. The stack which was used to bypass the recurssion was being called, for pop and push a huge number of times. The point is that, the non recursive version of quick sort did not yield anything better. On the contrary it is performing a bit worse (7-9 seconds for the whole program ! looking for the missing 3-4 seconds ).


Thursday 27 January 2011

Crystal v1.0 An encryting text editor.

I have been coding and developing an encrypting text editor recently after I felt the need for such an editor myself.  It protects your text file by encrypting it. It is not complicated to use and has familiar menu actions.

 Some technical details.

1. It uses TripleDES key to encrypt your data onto the disk.

2. The data on the disk i.e. the encrypted binary is in base64 encoding.

3. Built out of SUN JAVA.

Features:

1. It performs the actions of a simple text editor well.

2. Provides encryption for the data. TripleDES with a strong key is good enough and is supported by SUN Java.

3. User provides password from which the encryption key is generated.

4. Since it is purely Java it can run on multiple platforms.

5.  The data from the file can be copied on to an email text and then you can use it at the other end with Crystal and the password.

Limitations and Future features:

1. Handles only one file at version 1.0.

2. AES encryption on next version.

3. Another thought that came to my mind is that, the encryption module can be used in application to protect configuration files.

Installation:

1. Crystal is here  OR http://www.filefactory.com/file/b52f2h0/n/Crystal.jar

2. You will need SUN Java 1.6.x for this to work.

3. Since it used TripleDES you will also need unlimited encryption strength policy files from SUN website here  OR https://cds.sun.com/is-bin/INTERSHOP.enfinity/WFS/CDS-CDS_Developer-Site/en_US/-/USD/ViewProductDetail-Start?ProductRef=jce_policy-6-oth-JPR@CDS-CDS_Developer

4. Copy the policy files to your JRE/lib/security folder. Thats it.
Hope you find this useful.

Screen shots







Sunday 16 January 2011

Bit Vector approach in Java

This is my implementation of the bit vector mentioned in programming pearls. As mentioned in the book the bit vector has execution time and space advantage.1/4th execution time to order a list of 1 million integers in my experiments compared to a quick sort. The execution time was less than a second. The code is self explanatory. May be some more optimization is possible but, this seems ok. The limitation is that, none of the integers repeat in the list plus and integer type is expected for the array index.

An example bit vector representing the numbers 8, 10, 2, 4, 7 is 0010100110100000 with the corresponding bit location set. (0 is inclusive in the list)

/**
 *
 * @author Hari
 */
public class BitVector
{
    //the array of bytes.
   byte[] bitvector;
   //this is the maximum number we expect in the range.
   int nMaxLimit;
   //byte count needed.
   int nByteCount;
   // the zero mask.
   byte byZero = 0x00;
   //byte values used to set and retireve individual bits.
...................
...................
...................
...................
...................

   // Generate the bytes used to set - reset bits in the sequence
   private void genSet()
   {
       bySets[0] = 0x01;
       for(int nIndex = 1; nIndex < 8; nIndex++)
       {
           bySets[nIndex] = (byte) (bySets[nIndex - 1] << 1);
       }//for
   }//

   // Initialize all the bits to zero.
   private void setAllToZero()
   {
       for(int nIndex = 0; nIndex< nByteCount; nIndex++)
       {
           bitvector[nIndex] &= byZero;
       }//for
   }//


   //set the nth bit in the array.
   public void setAt(int n) throws Exception
   {
...................

...................
...................
...................
...................

       locInVector = qt;
       bitvector[locInVector] |= bySets[7 - rem];
          
   }//

   // check if n is in the vector. i.e by checking the nth bit in the byte array.
   public boolean isPresent(int n) throws Exception
   {
       if(n > nMaxLimit || n < 0)
           throw new Exception("Argument outside range of bit vector.");

       int locInVector;
       int qt = n / 8;
       int rem = n % 8;

       locInVector = qt;
       if((bitvector[locInVector] & bySets[7 - rem])==0)
       {
           return false;
       }
       return true;
   }//

...................
...................
...................
...................
...................

}//class