Friday, 28 October 2011

How to Code your own Search Engine Like Google, Bing or Yahoo

This subject of this post has been upgraded with new algorithm for set intersection. Go February 2012 posts for a better solution.


Problem: 
How to code your own search engine like Google or Bing? How to design the search engine? The search engine should give results in run time Order of (length of search key words)
Input: 
100 html files which have a lot of english words in it. Here we focus on the words only. Parsing out the html tags can be done too.
Solution: 
Here I use Trie and Binary Tree data structures to design the core search engine. Data can be extracted from pages and the data structures can be built by another program called crawler. This can be combined with the spell suggestion program in the previous post to make it close to real life. 
Design considerations here:
The design for the search engine is shown below. The tries as explained in the previous post gives search results in the Order of length of key words. Here we have two words "search " and "engine". The trie has paths for each of these search terms and end in a list of HTML files that have these search terms in them. This list is maintained as a binary search tree. This makes it to find the set intersection of the files which have both the search terms! Any data structure that efficiently finds set intersection can be used.


Working:
1) Crawl the files. Take the words from the files and add them to the trie. At the trie leaf node set the list of files in a binary tree. 
2) Input the search terms.
3) For each term retrieve the set of files. Here it is available as binary search trees at the leaf nodes. Since we use a trie, for any number of files, I get the set of files for any given word proportional to the length of the search key word.
4) Find the set intersection of these and display to the user. Here a binary search tree is used to represent sets. Any set representation can be as long as the process of finding set intersection is done efficiently.


The following screen shot shows the search engine in action for the key words, "search", "engine" and "search engine"




The run-time:
To find a single word is around 39 microseconds on my computer using the netbeans profiler (as the trie in the previous blog post). The run time for my search engine is shown below. The search took only 62 microseconds. Now that is how Google is fast. As long as your search data structure and set data structure is chosen correct, you can have it as fast as Google or Bing. Plus you can add additional meta data such for paid pages, relevance to terms, page rank etc and make it better. As for memory you can use a better trie as mentioned in the previous post. 

Tuesday, 25 October 2011

Spell Suggestion Like Google / Microsoft Word

Problem: 

To implement a spell suggestion program similar to the one you see on Google search box, Microsoft Word and the like. English Language dictionary is used.

Solution / Approach

Algorithm uses a trie to store your keys. Walk through the trie with the user to find the suggestions. An alternative approach is use word edits and employ the spell checker in a previous post. 

Input: A english dictionary of 59000 words.

Advantages: Search for a Key is very fast. Suggestions can be found in good time too.

Disadvantages

The size of the trie can be a limiting factor. But, more than one trie compression technique is available especially for English. But, for this experiment I was focussing on run-time with the first implementation only.

Theory

Trie is a M-ary tree. An indexing operation is done at each node of the tree. The indices are the characters of the search Key. Suppose there are k possible character values for key. In English a-z  + unique marker gives 27. Then each node of the trie has k+1 pointers one for each characters possible at that node. This is similar to the way we search for a word in the dictionary. For example, the word "alpha" is looked up by  looking up character "a" then within that, lookup "l" and so on. You can have the pages of all words beginning with "al". In a trie lookup the first character of the key at the root. Then, follow the pointer in the table corresponding to that character. For the word list the number of nodes in the Trie is 143699 ! The word list is by no means compact / processed for that matter. 

Searching an item simply follows characters in the key through the tree and if you have exhausted the key, you have a hit or a miss. Inserting Keys is simple as in searching, you add nodes when you have a miss for the key values.

In a trie, you search in O(length of the Key). The snapshots speak for themselves on this.     


On the other hand the memory requirement for the basic trie is high. If there are n nodes, then there are n * (k + 1) pointers. The two variations that can be used are 

1) Linked list for Pointers in nodes as most of the table is empty in a Trie for English. The trie using the linked list is a de la Briandais Trie. 

2) Then there is Practical Algorithm To Retrieve Information Coded In Alphanumeric. Patricia Tries! Here, the algorithms for building and maintaining the tree will change since we have to store the keys too.

3) Tries can also be represented as two dimensional arrays without the key values being stored in the table. This can save quite a bit compared to basic trie. 

Experiment:

The run of the java program in NetBeans profiler is here. This is for a small word.


For a really long word the time is bit more as shown in profiler.


Suggestions example runs. The algorithm follows from where the user strayed. Although this throws up more suggestions, it works.