Saturday, 18 May 2013

Knapsack 0/1 problem and algorithm: Implementation in Python, Dynamic programming and Memoization

This post is on the 0/1 Knapsack algorithm, which solves the following problem. You have a list of items, each with a value and a weight, and a total weight capacity as the limit. You have to pick the items that maximise the total value without exceeding the limit. A good example, similar to the one given by MIT Professor John Guttag (in OpenCourseWare), is: you break into a bank vault with a sack that can hold 100Kg. You are faced with a pile of gold coins, million-dollar bonds, a huge stack of Marlboro cigarettes, and stacks of Red Bull and Heineken. How much of each should you take, given your limit? Another example: you are going to Mars, your allotment in the shuttle is 400Kg, and you have to take the items that are most valuable within that limit.

The algorithm is recursive but uses memoization to cut short the search through the solution tree. That is, for a pair (items remaining to choose from, remaining capacity), say you have 5 more items to consider and the weight limit is now 30Kg, what do you take? If this sub-problem has already been solved somewhere in the recursion, we reuse that result instead of recursing down again. This is where memoization comes in, and a Python dictionary is well suited for it. So you have sub-problems, you find optimal solutions to those, and you use them to solve the parent problem.

The result of the algorithm is a pair (V, I), where V is the total value of the items we decided to take and I is the list of those items.

Knapsack 0/1 Algorithm Pseudo code:

knapsack(<L: list of items>, <W: weight limit now>, <S: solutions accumulated over sub-problems so far>)
{
        IF S is empty i.e. this is the first call THEN
            Initialize S, the dictionary that holds solutions to sub-problems

        IF the combination (L, W) is already in S THEN
            (V, I) for this sub-problem = S(L, W)
        ELSE IF L is empty or W is 0 THEN
            (V, I) is (0, nothing)
        ELSE IF the weight of the first item in L, when added, exceeds the limit W THEN
            (V, I) = knapsack(L - first item, W, S)
        ELSE
            /* We assume we take the first item and compute the result for the remaining
               items and the reduced capacity. We also compute the result if we don't
               take the item. We choose whichever solution maximises the value. */
            Take the first item of the list:
                V1 = value of first item + V from knapsack(L - first item, W - weight of first item, S)
                I1 = first item + I from that call
            Skip the first item:
                (V2, I2) = knapsack(L - first item, W, S)

            IF V1 is greater than V2 THEN
                (V, I) = (V1, I1)
            ELSE
                (V, I) = (V2, I2)
        Add the resulting (V, I) to S under the key (L, W)
        Return (V, I)
}
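
A minimal Python implementation along these lines (the function name and the (name, weight, value) tuple format are my choices for this sketch; the code on the Drive link may differ in detail):

def knapsack(items, capacity, memo=None):
    # items is a list of (name, weight, value) tuples.
    # Returns (V, I): the best total value and the list of items taken.
    if memo is None:
        memo = {}   # first call: initialize the dictionary of sub-problem solutions

    # A sub-problem is identified by (number of items still to consider, remaining capacity).
    # Since items are always dropped from the front, the count uniquely identifies the suffix.
    key = (len(items), capacity)
    if key in memo:
        return memo[key]

    if not items or capacity == 0:
        result = (0, [])
    else:
        name, weight, value = items[0]
        if weight > capacity:
            # The first item cannot fit: solve for the rest of the list.
            result = knapsack(items[1:], capacity, memo)
        else:
            # Option 1: take the first item.
            take_value, take_items = knapsack(items[1:], capacity - weight, memo)
            take_value += value
            # Option 2: skip the first item.
            skip_value, skip_items = knapsack(items[1:], capacity, memo)
            if take_value > skip_value:
                result = (take_value, [items[0]] + take_items)
            else:
                result = (skip_value, skip_items)

    memo[key] = result
    return result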

A sample run with the following items, given as <Item Name, Weight, Value>:

('Gold Biscuits', 20, 100)
('Heineken', 2, 30)
('Marlboro', 1, 10)
('Currency in $10 bills', 10, 10)
('Currency in 50$ bills', 45, 40)
('Million$ Govt Bonds', 1, 1000)
('Diamonds', 25, 80)
('Lays Chips', 1, 4)
('RedBull Cans', 20, 35)
('Predictions on WallStreet', 1, 200)
('Key Combination of Vault', 1, 200)
('Harry Potters List of real life Spells', 2, 1)
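
Using the implementation above, a run over these items with, say, a 100Kg limit (the capacity here is just an example) looks like this:

items = [('Gold Biscuits', 20, 100), ('Heineken', 2, 30), ('Marlboro', 1, 10),
         ('Currency in $10 bills', 10, 10), ('Currency in 50$ bills', 45, 40),
         ('Million$ Govt Bonds', 1, 1000), ('Diamonds', 25, 80), ('Lays Chips', 1, 4),
         ('RedBull Cans', 20, 35), ('Predictions on WallStreet', 1, 200),
         ('Key Combination of Vault', 1, 200), ('Harry Potters List of real life Spells', 2, 1)]
value, taken = knapsack(items, 100)
print(value)
print([item[0] for item in taken])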

Sample output: as you can see, the program picks the items of most value. :)


The full solution Python code is available at
https://drive.google.com/folderview?id=0BxhHg0qy5gi4TkpuTHQ4OWZGYnc&usp=sharing

References:

1. MIT OpenCourseWare, Introduction to Computer Science and Programming, Prof. John Guttag

Monday, 13 May 2013

Quick Sort: Partition Algorithm

A post on quicksort is not particularly interesting for most developers. But when you see more and more people using a complicated partitioning algorithm to explain quicksort, it is worth posting a simpler one. This algorithm and similar ones appeared in the Communications of the ACM and were retold by Bentley in his books. Still, they seem to have been forgotten.
 
Quicksort works recursively by partitioning an array around a pivot element. After partitioning, all the elements at indices less than the pivot's will be less than the pivot itself, and those at indices greater than the pivot's index will be greater. Then you repeat the same for the elements on either side of the pivot. For example, if the array is 4, 19, 1, 3, 43, 45, 11, 2 and you choose 4 as the pivot, then after partitioning around 4 the array looks like this (using the partition algorithm explained further down):

2, 1, 3, 4, 43, 45, 11, 19

Then do quick sort on (2, 1, 3) 4 (43, 45, 11, 19). You get the idea.

So now comes the simpler partitioning algorithm. Starting from the beginning of the array, we keep track of the first element greater than the pivot, and whenever we see an element less than the pivot we swap it with that tracked greater element.

So in our example with 4 as the pivot, 19 is the first element greater than the pivot. We move on to 1, see that it is less than the pivot, and swap the two to get 4, 1, 19, 3, 43, 45, 11, 2. We keep tracking the (index of the) first element greater than the pivot, which is again 19. We swap it with 3 and the array becomes 4, 1, 3, 19, 43, 45, 11, 2. Finally we meet 2, swap it with 19, and the array becomes 4, 1, 3, 2, 43, 45, 11, 19. We have now reached the end of the array. Our post-condition is that the pivot should sit at an index such that all elements to its left are less and all elements to its right are greater. This is achieved by simply swapping the pivot with the last element that is less than it (here 2). So we swap 4 and 2 and get 2, 1, 3, 4, 43, 45, 11, 19.

A sketch of the partition code is given below.
You can step through it in debug mode and see for yourself how this works.
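
Here is a minimal Python rendering of this partition scheme (the Java version in the project linked below follows the same idea):

def partition(arr, lo, hi):
    # Partition arr[lo..hi] in place around arr[lo] (the pivot).
    # Returns the final index of the pivot.
    pivot = arr[lo]
    last_small = lo                      # index of the last element known to be < pivot
    for i in range(lo + 1, hi + 1):
        if arr[i] < pivot:
            last_small += 1
            # Swap the smaller element with the first element greater than the pivot.
            arr[last_small], arr[i] = arr[i], arr[last_small]
    # Put the pivot into its final position.
    arr[lo], arr[last_small] = arr[last_small], arr[lo]
    return last_small

def quicksort(arr, lo=0, hi=None):
    if hi is None:
        hi = len(arr) - 1
    if lo < hi:
        p = partition(arr, lo, hi)
        quicksort(arr, lo, p - 1)
        quicksort(arr, p + 1, hi)

# Example from the walkthrough above:
# partition([4, 19, 1, 3, 43, 45, 11, 2], 0, 7) leaves the array as [2, 1, 3, 4, 43, 45, 11, 19]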

The Eclipse Java project is available at the Google Docs folder given below.

https://drive.google.com/folderview?id=0BxhHg0qy5gi4V2Q0ZlhrelZlOVU&usp=sharing

Tuesday, 7 May 2013

Google AppEngine: The network is becoming the computer

This post is a take on Google AppEngine, a cloud computing solution living in the Platform as a Service (PaaS) layer. Basic sample code is available at the end, but more concrete examples are available in the resources mentioned in the reference section below.

Google AppEngine is Google's, to some extent free, web app execution environment. You can develop web applications and deploy them on AppEngine in an instant; the production execution environment is what is on offer. Traditionally, you design a webapp, the pages, the middle tier, the database and so on, and you also have to install software for your production environment. With AppEngine, you just write the code and upload it. It is available immediately on Google's infrastructure, which you don't have to spend time and money on. On top of it all is the ability to scale and replicate like Google.

In order to use AppEngine, you just need to sign up on the AppEngine site with an email and password; a Google account is enough. The applications best suited to AppEngine are the ones that serve requests over HTTP(S) endpoints, basically web-based applications like an online forum. You are restricted from doing traditional stuff like creating local files or opening outgoing sockets. But you do have a lot of options to add to your site, like email, instant messaging, a distributed storage API, and memory caching (my favorite). You can develop using Python 2.7, which is what I used, or Java or Google Go, with the corresponding SDKs from Google. There are a number of supported web frameworks, including Django. Here I followed the very basic webapp framework.

If you have done servlet-based development, it is easy to get a first hold of AppEngine. As always, you have an application configuration file, similar to web.xml; here it is a .yaml file. My project's yaml file looks like the one below.
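
A minimal app.yaml for a Python 2.7 app looks roughly like this (the application id below is a placeholder, and the script entry points at the Python module and application object discussed next):

application: your-app-id
version: 1
runtime: python27
api_version: 1
threadsafe: true

handlers:
- url: /.*
  script: helloworld.app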

You specify the script that will handle the requests to your server URL. This script can be thought of as the controller servlet in Java web programming. Then, within that Python code, you create the application, specify the URL mappings and the classes that handle requests for those mappings, and you are basically done. So in helloworld.py I have something similar to an MVC web app, as sketched after the next paragraph.

The main handler has get and post methods to handle requests, much like a servlet's doGet and doPost.
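
A sketch of what such a helloworld.py looks like with the webapp2 framework (the handler name and response text are just placeholders):

import webapp2

class MainHandler(webapp2.RequestHandler):
    # get() and post() play the role of a servlet's doGet() and doPost()
    def get(self):
        self.response.headers['Content-Type'] = 'text/plain'
        self.response.write('Hello from AppEngine!')

    def post(self):
        name = self.request.get('name', 'anonymous')
        self.response.write('Hello, %s' % name)

# Create the application and map URLs to handler classes (the MVC-style routing).
app = webapp2.WSGIApplication([
    ('/', MainHandler),
], debug=True)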

Set up on Ubuntu Linux:
1) Download the AppEngine Python sdk from Google. This is a zip file.
2) Extract it to, say, /usr/local, which gives you /usr/local/google_appengine
3) Create links in /usr/local/bin to the appcfg.py and dev_appserver.py files in the above path, so that you can run these programs from anywhere. This can be done using the commands

 sudo ln -s /usr/local/google_appengine/appcfg.py /usr/local/bin/
 sudo ln -s /usr/local/google_appengine/dev_appserver.py /usr/local/bin/
4) Your /usr/local/bin folder will now contain symlinks to both scripts.
5) Optionally, download the Eclipse plugin for AppEngine development. This is not necessary unless you use Eclipse.

Uploading your project to AppEngine
1) Go to the root of your webapp project and execute the command shown below. You will be asked for an email and password for authentication, and then your app will be provisioned instantly. You can access it at http://<appid you chose when creating the app on the AppEngine site>.appspot.com
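
With the SDK scripts on your path, the upload command is along these lines, run from the directory containing app.yaml:

 appcfg.py update .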

2) The running site is then visible on Google AppEngine at your appspot.com URL.
3) Download the code for the project from this link:
https://drive.google.com/folderview?id=0BxhHg0qy5gi4SFh6ejhBSFE2ZU0&usp=sharing

More explanations of the code and features in the next post.

Reference:
--------------
1) Core Python Applications Programming Chapter 12 on Google AppEngine.

Monday, 6 May 2013

Part 2 - Friend Suggestion like Facebook, Google+ or Product Suggestion like Amazon: Implementing it for your own web app or site

In a previous post, in the first half of 2012, I described roughly how to do friend suggestions like Google+ or Facebook, or item suggestions like Amazon. I will recall the approach from that post before describing how to use machine learning to get better results.

In the previous approach, we stored vectors of user interactions on the Hadoop file system. These vectors may capture interactions between users or with the site. We then used MapReduce jobs to derive some meaning from the interactions, and a weighted matrix to apply thresholds and priorities and arrive at the final result.

One thing I did not mention in that post is the use of machine learning libraries, which have algorithms that can help in such scenarios. For example, Amazon uses machine learning techniques to show you product recommendations. The counts we described in the previous post are crude, although not completely off the mark. Machine learning is also used (in addition to proprietary algorithms; Facebook is coming up with Graph Search, and algorithms on top of that could possibly make this easier) by Facebook and LinkedIn for 'you may know this person' type recommendations. The focus of this post is that we should use machine learning libraries to process interaction patterns and arrive at results.

Input: We imagine that we have large logs of user interactions, user likes (Facebook), followed items (Google+) and so on, somewhere on a Hadoop cluster. The best option would be HBase, as it is column-oriented and is reportedly used at Facebook. Examples: a user viewed another profile, a user liked a movie, a user watched a video of a particular type, a user changed their home town. All sorts of interactions. Labels and tags help with this (remember the tags on YouTube videos and the labels on Blogger posts!).

Basically, every interaction or relation can be called a dimension. For example, viewing another user's profile is a dimension; likes on movies is a dimension; watching a video of a particular type, say 'of Lady Gaga', is a vector in a dimension. Once we have these interaction vectors, we can find the user's affinity in that dimension or type. A good place to start for finding affinity is the percentage frequency count, or better, the percentage frequency in a particular time period, the latter being more relevant. For example, by running MapReduce over the interaction logs on Hadoop, we can arrive at numbers saying that, in the last day, userA watched 4 videos tagged 'Lady Gaga' and viewed the profiles of 20 users in the Washington D.C. area. Such vectors are easy to generate; refer to Tom White's 'Hadoop: The Definitive Guide', which has similar examples on US earthquake data. In a way you don't even need to store raw logs: you can feed the interaction vectors directly into HBase since it is column-oriented! But you still need to derive a basic meaning out of the interactions before feeding them to machine learning. A sketch of such a counting job is shown below.
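
As a rough illustration, here is what a Hadoop Streaming style counting job could look like in Python; the tab-separated log format (timestamp, user, action, tag) is an assumption made just for this sketch:

# mapper.py: emit one (user, action:tag, 1) record per interaction line read from stdin
import sys

for line in sys.stdin:
    parts = line.rstrip('\n').split('\t')
    if len(parts) != 4:
        continue                      # skip malformed lines
    timestamp, user, action, tag = parts
    print('%s\t%s:%s\t1' % (user, action, tag))

# reducer.py: sum the counts per (user, dimension); input arrives sorted by key
import sys

current, count = None, 0
for line in sys.stdin:
    user, dimension, n = line.rstrip('\n').split('\t')
    if (user, dimension) == current:
        count += int(n)
    else:
        if current is not None:
            print('%s\t%s\t%d' % (current[0], current[1], count))
        current, count = (user, dimension), int(n)
if current is not None:
    print('%s\t%s\t%d' % (current[0], current[1], count))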

Machine learning can be used to classify, cluster and more. Here we are interested in item recommendations, clustering and user neighborhoods, which are a good place to start when you want to analyse user behaviour. Clustering is useful when you want an overall view across all the dimensions/vectors we discussed. So the vectors from the MapReduce jobs are fed to machine learning, which gives you results like: userA is in the neighborhood of userB; userA may like itemC since he/she is in the same neighborhood as userB; in the movie-likes dimension userA is similar to userB; and so on. You can refer to 'Mahout in Action' by Sean Owen on how to get going with this; the sample code in its chapters is helpful.
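
To make the neighborhood idea concrete, here is a small Python sketch (plain Python, not the Mahout API) that finds a user's nearest neighbors with cosine similarity and suggests items from them; the preference dictionaries are assumed to come from counts like the ones above:

from math import sqrt

def cosine(u, v):
    # u and v are sparse preference vectors: dicts mapping item -> score.
    common = set(u) & set(v)
    dot = sum(u[i] * v[i] for i in common)
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def neighborhood(user, prefs, k=2):
    # The k users most similar to 'user'.
    scores = [(cosine(prefs[user], prefs[other]), other)
              for other in prefs if other != user]
    return [other for _, other in sorted(scores, reverse=True)[:k]]

def recommend(user, prefs, k=2):
    # Suggest items the user's neighbors have scored but the user has not seen yet.
    seen = set(prefs[user])
    suggestions = {}
    for other in neighborhood(user, prefs, k):
        for item, score in prefs[other].items():
            if item not in seen:
                suggestions[item] = suggestions.get(item, 0) + score
    return sorted(suggestions, key=suggestions.get, reverse=True)

# prefs could come straight out of the counting job above, e.g.
# prefs = {'userA': {'video:LadyGaga': 4, 'profile:WashingtonDC': 20}, ...}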

So, altogether, the new approach looks similar to the case studies mentioned in Tom White's book:

Logs / HBase data -> MapReduce -> Machine learning -> Recommendations -> Store in an RDBMS for provisioning.

This seems a better way to do item recommendations and the like compared to the previous post. More ways of doing this are available online.

References
--------------
1. https://www.youtube.com/watch?v=kI4YIYInou0
2. https://www.youtube.com/watch?v=9A0PnPvQks4
3. Hadoop: The Definitive Guide - Tom White (book & code)
4. Mahout in Action - Sean Owen and team (book)