Saturday, 18 May 2013

Knapsack 0/1 problem and algorithm: Implementation in Python, Dynamic programming and Memoization

This post is on the Knapsack algorithm which does the following. You have a list of items each with a value and weight. You have a total capacity as limit. You have to take as many items to increase value within the limit. A good example, similar to the one given by MIT Professor John Guttag (in open courseware) is, you break into a bank vault with a sack which can hold 100Kg of weight. You are faced with a pile of gold coins,  million dollar bonds, a huge stack of Marlboro cigarettes, stacks of red bull and Heineken. How much of each should you take  given your limit? A better example is you are going to Mars and your allotment in the shuttle is 400Kg and you have to take items that are valuable in that limit.

The Algorithm is recursive but, uses memoization to cut short the search through the solution tree. i.e for a pair (number of things remaining to take from, remaining capacity) i.e you have 5 more things to take from and weight limit is now 30Kg then, what do you take for this? If this is already solved in the recursion somewhere we use that, instead going down to that. This is where memoization comes in and Python dictionary is well suited for this. So, you have sub problems and find optimal solutions to those problems and solve the parent problem.

The result of the algorithm is a pair (V, I) where V is the accrued value of items we decided to take. I is the list of items we decided to take.

Knapsack 0/1 Algorithm Pseudo code:

knapsack(<L: list of items>, <W: weight limit now>, <S: solutions accumulated over sub-problems so far>)
{
        IF S is empty i.e first call THEN
           Initialize S the set/dictionary to hold solutions to sub-problems

       IF  combination of (L, W) is already in S THEN
           (V,I) for this sub-problem =  S(L, W)
       ELSE IF L is empty and W is 0 THEN
           (V, I) is (0, nothing)    
       ELSE IF we take the first item on list and its weight, when added, exceeds the limit THEN
              (V,I) = knapsack(L - first item at this call, W, S)
       ELSE
               /* What we do here is. We assume we take the first item. Then what is (V,I) for the remaining
                   walks. Also, what is the result (V,I) id we don't take the item at this point. Choose which
                   ever solution that maximises the value.
               */
               We assume we take the first item of the list.
                Let V1 = Value of first item + (V, I) for knapsack(L - first item, W - weight of first Item, S)
                We also find (V2, I) if we don't take the item.
                       i.e (V2, I) for knapsack(L - first item, W, S)
             
                IF V1 is better than V2 THEN
                       go with that result (V1, I)
                ELSE
                       go with the other result (V2, I)
       Add the corresponding result (V, I) to S
}

Sample with the following items. <Item Name, Weight, Value>

('Gold Biscuits', 20, 100)
('Heineken', 2, 30)
('Marlboro', 1, 10)
('Currency in $10 bills', 10, 10)
('Currency in 50$ bills', 45, 40)
('Million$ Govt Bonds', 1, 1000)
('Diamonds', 25, 80)
('Lays Chips', 1, 4)
('RedBull Cans', 20, 35)
('Predictions on WallStreet', 1, 200)
('Key Combination of Vault', 1, 200)
('Harry Potters List of real life Spells', 2, 1)

Sample Output: As you can see the program takes the things of value :))


Solution Python Code: Is available at
https://drive.google.com/folderview?id=0BxhHg0qy5gi4TkpuTHQ4OWZGYnc&usp=sharing

References:

1. MIT Open courseware on Introduction to Computer Science and Programming, Prof John Guttag

Monday, 13 May 2013

Quick Sort: Partition Algorithm

A post on quick sort is not particularly interesting for most developers. But, when you see more and more people using a complicated partitioning algorithm when explaining quick sort then, it is ok post a simpler partitioning algorithm. This algorithm and similar ones came in the Communications of the ACM and were retold by Bentley in his books. Still, they seem to be lost.
 
Quicksort works recursively by partitioning an array around a pivot element. So, after partitioning all the elements at indices less than the pivot's will be less the pivot itself. Those at indices greater than the pivot's index will be greater. Then, you repeat the same for elements on either side of the pivot. For example, if the array was 4, 19, 1, 3, 43, 45, 11, 2 and you say that, the pivot is 4 then, after partition around element 4, the array looks like this using this partition algorithm (explained further down).

2, 1, 3, 4, 43, 45, 11, 19

Then do quick sort on (2, 1, 3) 4 (43, 45, 11, 19). You get the idea.

So now comes the simpler partitioning algorithm. Here starting from the beginning
we keep track of elements greater than the pivot and whenever we see an element less than the pivot we replace the element which is greater than the pivot (we have it) with the one that is lesser.

So in our example if 4 was the pivot, 19 is the first element that is greater than the pivot, we go on to 1 and see that it is less than the pivot. We swap the two to get 4, 1, 19, 3, 43, 45, 11, 2. Now also we keep track of (index of) the first element that is greater than pivot. Again, we swap it for 3 and the array becomes 4, 1, 3, 19, 43, 45, 11, 2. Then, finally we meet 2 and then swap it for 19 and the array becomes 4, 1, 3, 2, 43, 45, 11, 19. So we have reached the end of the array. Our post condition is that, the pivot should be in such an index that, all elements to the left are less and those to the right are greater. This is achieved by simply swapping the pivot with our track index (here the index of 2). So we swap 4 and 2. Thus we get 2, 1, 3, 4, 43, 45, 11, 19.

Here is a snap of the partition code.
You can run the code in debug mode and see for yourself as to how this works.

The eclipse java project is available at theGoogle docs folder given below.

https://drive.google.com/folderview?id=0BxhHg0qy5gi4V2Q0ZlhrelZlOVU&usp=sharing

Tuesday, 7 May 2013

Google AppEngine: The network is becoming the computer

This post is a take on the Google AppEngine which is a cloud computing solution living in the Platform As a Service (PaaS) layer. A basic sample code is available at the end but, more concrete examples of this post are available on the resources mentioned in the reference section below.

Google AppEngine is Google's, to some extent free, web app execution environment for you. You can develop web applications and deploy it on AppEngine in an instant. You production execution environment is the offer. Traditionally, you design a webapp, the pages, middle tier and the database etc. You also have to install software for your production environment. With AppEngine, you just write the code and upload it. It will be available immediately on Google's infrastructure which you don't have to spend time and money on. On top of it all, is the ability to scale, replicate like Google.

In order to use AppEngine, you just need to signup on the AppEngine site with a email and password. If you have a Google account it is enough. The main applications that are suitable for AppEngine are the ones that require to talk to an HTTP(s) endpoint. Basically, web based applications like an online forum. You are restricted from doing traditional stuff like creating local files, opening outgoing sockets etc. But, you do have a lot of overwhelming options to add to your site like, email, instant messaging, distributed filesystem API, Memory caching, my favorite. You can develop using Python 2.7 which is what I used and also using Java, Google Go with corresponding SDKs from Google. There are number of web frameworks supported including Django. Here I followed the very basic webapp. 

If you have done servlet based development then, it is easy to get first hold of AppEngine. As always you have an application configuration file similar to web.xml. Here it is a .yaml file. My project's yaml file is shown below

You specify the script which will handle the requests to your server url. This script can be thought of as a controller servlet in Java Web programming using servlets. Then with in that python code, you create the application, specify the url mappings, the classes to handle the url mapping requests and you are basically done. So, in helloworld.py, I have the following similar to an MVC web app.

Also the code for the main handler as below, which has get and post to handle the request like servlets.

Set up on Ubuntu Linux:
1) Download the AppEngine Python sdk from Google. This is a zip file.
2) Extract it to say, /usr/local/appengine-python
3) Create links to appcfg.py and dev_appserver.py file in the above path at /usr/local/bin. So that, you can run these programs. This can be done using the command

 sudo ln -s /usr/local/google_appengine/appcfg.py /usr/local/bin/
4) Your folder /usr/local/bin will look something like this
5) Download the Eclipse plugin for AppEngine development and you are done. Although this is not necessary unless you use eclipse.

Uploading your project to AppEngine
1) Go to the root of you webapp project and execute the command as shown below. You will be asked a email and password for authentication and then your app will be provisioned instantly. You can access it at http://<appid when you created it on Appengine site>.appspot.com

2) The site on Google AppEngine
3) Download code for the project from link:
https://drive.google.com/folderview?id=0BxhHg0qy5gi4SFh6ejhBSFE2ZU0&usp=sharing

More explanations on code and features in next post.

Reference:
--------------
1) Core Python Applications Programming Chapter 12 on Google AppEngine.

Monday, 6 May 2013

Part 2 - Friend Suggestion like Facebook, Google+ or Product Suggestion like Amazon: Implementing for your own web app, site

In a previous post in the first half of 2012, I described how to roughly do a friend suggestion like Google or Facebook or item suggestion like Amazon. I am recalling the approach in that post before describing how to use machine learning to get better results. 

In the previous approach, we used to store vectors of user interaction on the hadoop file system. These vectors may include interaction between users or with the site. There we used MapReduce jobs to derive some meaning of the interactions and then using a weighted matrix to apply thresholds and priorities to arrive at the final result.

One thing that, I did not mention in that post is the use of machine learning libraries which have algorithms that can help in such scenarios. For example, Amazon uses machine learning techniques to show you Product recommendations. The counts we described in the previous post are crude although not completely off the chart. This is also used (in addition to proprietary algorithms, Facebook is coming up with a Graph search and algorithms on top of that could possibly make this easy) by Facebook / Linkedin to suggest 'you may know this person' type of recommendations. The focus of this post is that, we should use machine learning libraries to process interaction patterns and arrive at results. 

Input: We imagine that, we have large logs of user interactions, user likes (Facebook), Following items (Google plus) etc at some place on a Hadoop cluster. The best thing would be to use HBase as it is column oriented and is supposedly used at Facebook. Some examples can be user viewed another profile, user likes a movie, user viewed a video of a particular type, user changed his home town. All sorts of interactions. Labels and tags are used to aid this (remember the tags on videos in YouTube and labels on blogger posts!). 

Basically for every interaction or relation we can call it a dimension. For example, viewing another users profile is a dimension. Likes on movies is a dimension. Watching a video of a particular type say 'of Lady Gaga' is a vector in a dimension. Once we have these interaction vectors, we can find what is the affinity of the user in that dimension or type. A good place to start for finding affinity is the percentage frequency count or better the percentage frequency in a particular time period, the latter being more relevant. For example by using MapReduce on the interaction logs on Hadoop, we can arrive at numbers saying that, for the last day userA watched 4 videos tagged with 'Lady Gaga' and viewed profiles of 20 users in Washington D.C area. Such vectors are easy to generate. Refer the 'Hadoop Definitive guide' Tom White's book which has examples similar to this on earthquake data for the US. In a way you don't need to store as logs, you can directly feed the interaction vectors to HBase since it is column oriented ! But, you still need to derive a basic meaning out of the interactions before feeding them to machine learning.

Machine learning can be used to classify, cluster and some other stuff too. Here we are interested in these. Item Recommendations, Clustering and User Neighborhoods are a good place to start when you want to analyse user behavior. Clustering is useful when you want an overall view across all dimensions/vectors we discussed. So, the vectors from the MapReduce jobs are fed to machine learning which gives you results like userA is in neighborhood of userB, userA may like  itemC since he/she is in the same neighborhood as userB, in movies/movie likes dimension userA is similar to userB etc. You can refer to 'Mahout in Action By Sean Owen' on how to get going with this. Sample code in Chapters will be helpful. 

So all together the new approach looks similar to the case studies mentioned in Tom White's book.

Logs /HBase data -> Map Reduce -> Machine Learning -> Recommendations -> Store to RDBMS for provisioning. 

This seems a better way for item recommendations and the like compared to the previous post. More ways on how to this are available online. 

References
--------------
https://www.youtube.com/watch?v=kI4YIYInou0
https://www.youtube.com/watch?v=9A0PnPvQks4
Hadoop the definitive guide - Tom White (Book & code)
Mahout in Action - Sean Owen and team (Book)

Monday, 8 April 2013

Programming for Social Media : Twitter : Reading Tweets using python

Just to bring the aspect of search, social media into perspective before the post: 

Todays ACM technews has an article from Eberly College of Science relating to tweets analysis relating to a particular vaccine. It is available at 
http://science.psu.edu/news-and-events/2013-news/Salathe4-2013 Tweets were analysed for terms relating to the vaccine/vaccination. Then, user opinion was deduced. This is an example of using social media to derive trends or responses or acceptance.

As another example of impact of internet search: Earlier in December, during an informal Big Data meetup involving, Ministry of Health, Visa and SingTel, the head of MoH software division commented that, Google could find out if, there would be an outbreak of a particular disease say a flu in a particular region. This is due to the fact that, people would be searching for medicines, symptoms, prevention etc on Google. This could be analyzed to deduce what health issues are more prevalent in a particular region. 

Thinking on that line, Google does not know if, you actually bought the medicine or anything after a search. Whereas Amazon or eBay knows what you bought but, does not know if, you bought the item after a search or your affinity for that item. Standing out is Facebook which knows everything about you and your friends but, does not have the above two factors.

That said here is a small program that, demonstrates Python Twitter API. This retrieves tweets on an interesting subject and processes them. The tags in each tweet are counted and displayed in ascending order. For example, if #Tottenham appears 20 times and  #OldTrafford appears 40 times in the tweet stream, then it comes up in the list. Basically, a measure of what people are dicussing and how popular the topic is with the public.

Objective: Retrieve tweets with "premierleague" on it. Then find the most popular tags. Basically you can search for any key word.

Language used Python 2.7 and Twython for Twitter interface. Operating system Ubuntu 12.04. 

Twython API core code snapshot for Twitter


Sample Screen run on keyword "premierleague". (Printing tag counts by minute rather than by hour so that, output is available frequently)


The complete eclipse python project is available here as zip. You will need a postgresql database and Twython installed. The database table format is available here. You can also run TaTwitterQn/TwitterSearcher.py without the database to get tweets and show the tag counts.

Project Link:
https://docs.google.com/file/d/0B-oeIeog2xb7WDRFLUdzM0w0cEE/edit?usp=sharing

Some Issues/observations developing this program: https://github.com/ryanmcgrath/twython/tree/master/core_examples

1) Tweepy a popular Twitter API for python returns an 401 failure for search / filter operations. This is on the latest version of Tweepy from Git. But, tweepy has streaming api.

2) python-twitter seems to be a good api but, does not authenticate at times and also some of the documented parts do not seem to work. For example, the getHashTags part of the Tweets API does not work.

3) The same program can be coded in Java but, the effort would be more and the code size too. Python on the other hands was small and easy. If you already have a Java environment like Hadoop running then, it seems ok.

References:
1) https://github.com/ryanmcgrath/twython/tree/master/core_examples This is a good list of examples to use twython and you can get started to do your own programs easily.

Sunday, 31 March 2013

ANTLR 4.0

If you do use Google Analytics, you may have noticed that, the API is available for developers to use. Also, on a Youtube video showing "Google Analytics End to End", the engineers show 2 ways that developers can interact with Google Analytics. One is through an API that Google has exposed for developers to use. The second is a small edit box where you can type in the metrics / filters and you click on a button and you get the results based on the filters you mentioned.

Typing things in an English like question / expression which is not exactly sql but, will get you the results you are looking for does come in handy especially since, ordinary users can interface with your system easily. This is conceptually same as the Interactive Shop Finder in malls except that, you type in the question you are looking for. Users will be able to relate to this kind of interface as they will be able to test drive your software. 

Considering that, you need to cook up an expression / question that you will take from the user, parse that and get the results; means you need a small language that will allow your users to express themselves. This may have you filters as pre-defined keywords. For example, ga.City = Pasadena and ga.Country = USA is similar to what Google gives you. 

ANother Tool for Language Recognition will allow you to specify the Backus Naur Form of your small language and build you a parser. So all you have to do is;

u
1) Use ANTLR to specify the BNF of your small language
2) Get the parser for your language using ANTLR.
3) Use the parser to parse the user question
4) Catch the elements in the question that makes sense to you. i.e walk the expression tree. This is possible by assigning labels to your grammar rules. This is extremely helpful when you are walking the tree.
5) Build the real query that your database takes and get the result.



References:

ANTLR website has a lot of examples, and the BNF of the mainstream languages like Java and the like. So developers can hit the ground running designing their own small languages.

Sunday, 23 December 2012

Data Analytics : Hatari !

Embedding Pig in Python - Hatari!
---------------------------------------------
Pig Latin does not allow control structures like Java and other high-level languages. So inorder to get more control over the execution context, loop through a set of parameters etc we can embedd Pig in Python. This is supported only for Python 2.7 and not above. At first this may seem not so useful. But, if you have a single script, i.e a parameterised pig script like the one in the previous post, you can attach parameters to the script from Python. Plus since Python can also be used to write User Defined Functions, all the code Pig, Python, UDFs reside in the same project. This is usefull when things get more complex. In the previous post note that, I used a java project to create a java UDF which was used in Pig. So I had 2 deployments a pig script and a Java jar. But, if we embedd pig in python, we have only one deployment, the python code.

1) Write pig script in a file.
2) Use compileFromFile() function to load and compile your script. 
3) Bind using your list of map of parameters. yes a list of map of parameters.
4) Run the script using runSingle()
5) Get status on how things went using runSingle(). You can also use isSuccessFul().

Better yet Do this again for all your scripts. You can even read parameters from a file, add threading where each thread will execute a script with one set of parameters ;-)

Though this is is very useful. I find it easier to write the pig script in a text pad and test it in my pseudo cluster command line. Then, write a python program as follows to load the script file in the python code then, compile, bind and run.

You need Jython 2.5 for this to work since you use jython to interpret the python script.  May be it is something that I am doing but, some things that may crop up while you develop this are

a) Eclipse hangs when I add the Jython interpretor.

b) Eclipse does not show auto-complete options for such scripts. This is important because, you dont know if a function is available or if you typed it correctly. But, you can always download pig source and refer to get around this.

c) Another error that I came across was this one

ERROR 2998: Unhandled internal error. org/python/util/PythonInterpreter
java.lang.NoClassDefFoundError: org/python/util/PythonInterpreter at org.apache.pig.scripting.jython.JythonScriptEngine.main(JythonScriptEngine.java:338

I got rid of this by adding the jython jar to the Hadoop lib folder. Basically a classpath issue it seems. More at stackoverflow here http://stackoverflow.com/questions/13795993/embedding-pig-into-python

Ref: More details can be found in the presentation at http://www.slideshare.net/julienledem/presentation-pig-scripting by Julien Le Dem.

Ref: Good Read Chapter 9 on Book, Programming Pig by Alan F Gates

Data Analytics: Using Apache Pig User Defined Functions

Apache pig helps to analyse large data sets. We can define data flows in pig latin scripts and extract interesting information from the initial data set. For example, we may have a large data dump of user actions on your website as a csv file or in a database like hbase. Then we want to find the most frequented part of your site or the top 10 things that interests your user base in your site. To write up a solution for this is possible in a highlevel language like Java. The optimizations for handling data size as it grows, splitting the task into multiple parts, keeping track of those individual tasks will be challenging and will consume a lot of effort. Apache Hadoop and Pig help to solve this problem by providing a framework with which we can focus on the data flow rather than on the plumbing.

Derieving interesting data is always tied to a time period. We may want to extract interesting information from the whole life time of the data or we want to perform the same on a given time period say the last month or week. To specify such options, we have user defined functions in pig. This allows us to write filters and operations that we want pig to perform on each entry of the data set. This gives more control on the data filtering, flow. Some of the nuances of pig udf are explained in the example below.

Pig Version of this example: Apache Pig version 0.10.0 (r1328203)

Objective: You want to write a filter function in PIG to filter data rows according to a date range that you are interested in. You want to invoke the script from a scheduler which passes in the date range as command line parameters. A pig script is shown in the image below.



1) Passing command line parameters to pig script: You need to pass command line arguments like this pig -param datefrom='2012-10-20 00:00:00' -param dateto='2012-10-23 00:00:00' -x mapreduce user-based-analytics.pig. (I am actually calling the script from Python, which we will see in the next post).

Here I am using these two date parameters to build my Java UDF. If you are passing parameters with space character, it has to be like this otherwise pig will throw an error saying that, it cannot create your UDF java class.

2) With in the pig script: You refer to the command line parameters using the format '$' for example, '$dateto' and '$datefrom' as in the image above. unless it is an integer like $parallelism.

3) Defining your UDF reference using DEFINE pig keyword: This allows you to create a reference to your UDF which you can call using an alias. For example, the script defines a reference to the UDF as follows,

define date-based-filter com.home.pig.udfs.TimeStampFilter('$datefrom', '$dateto')

where date-based-filter is the alias that I will use to call my UDF com.home.pig.udfs.TimeStampFilter java class.

4) Calling your UDF filter in a pig script using FILTER keyword: Pig does not have a boolean data type. But, expressions are evaluated to boolean true or false. You need to call your UDF as follows, with the alias for your UDF. Here we are checking for datebasedfilter(ts) == TRUE i.e does my UDF 'com.home.pig.udfs.TimeStampFilter' acting on the current row with 'dateto' and 'datefrom' return Java Boolean true or false.

filter-by-date = filter site-data by date-based-filter(ts) == TRUE;

5) Now the Java Class that does the filtering.

a) We have to create a java class that extends FilterFunc from org.apache.pig.FilterFunc. The constructor has to take parameters that, you set in the script above. So we have two parameters.

b) Override the public Boolean exec(Tuple arg0) member function to define how this filter will handle tuples from the script. Here I just get the date from the string and check if it is within the range.

c) Package this in a jar and put it in the same location path as your script.

Why use Pig and UDF
s? Writing UDFs can be easy and saves a lot of time compared to writing a MapReduce Java program or any other option. Plus, if you have a ton of data or will end up with one this is better option since Hadoop will scale and Pig will do the jobs like data groupings, filtering for you.

Better to use Python? Although it is easy to write the UDF in Java and the justification that Pig is in Java, there is already a java environment turned on; it may be better to write User Defined Functions in Python and also trigger the script in for greater control! plus every thing will be at one place.

For more on this topic Refer to Chapter 10 in Book, Programming Pig by Alan F Gates