Monday, 6 May 2013

Part 2 - Friend Suggestion like Facebook, Google+ or Product Suggestion like Amazon: Implementing for your own web app or site

In a previous post from the first half of 2012, I described roughly how to build friend suggestions like those on Google or Facebook, or item suggestions like those on Amazon. I will recap the approach from that post before describing how to use machine learning to get better results.

In the previous approach, we stored vectors of user interactions on the Hadoop file system (HDFS). These vectors may include interactions between users or with the site. We then used MapReduce jobs to derive some meaning from the interactions, and a weighted matrix to apply thresholds and priorities and arrive at the final result.
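As a rough reminder of what that weighting step can look like, here is a minimal Python sketch. The interaction types, the weights and the threshold are made-up values for illustration, not numbers from the previous post, and in practice the counts would come out of the MapReduce jobs.

```python
# Hypothetical weights per interaction type (assumed values for illustration).
WEIGHTS = {"profile_view": 1.0, "message": 3.0, "common_friend": 2.0}
THRESHOLD = 5.0  # assumed cut-off: candidates scoring below this are dropped


def score_candidates(counts):
    """counts: {candidate: {interaction_type: count}} -> [(candidate, score)], best first."""
    scored = []
    for candidate, by_type in counts.items():
        # Weighted sum of interaction counts; unknown types contribute nothing.
        score = sum(WEIGHTS.get(kind, 0.0) * n for kind, n in by_type.items())
        if score >= THRESHOLD:
            scored.append((candidate, score))
    # Highest score first; ties broken by name so the order is stable.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


if __name__ == "__main__":
    counts_for_user_a = {
        "userB": {"profile_view": 2, "common_friend": 3},  # 2*1 + 3*2 = 8
        "userC": {"profile_view": 1},                      # 1, below the threshold
        "userD": {"message": 2},                           # 2*3 = 6
    }
    print(score_candidates(counts_for_user_a))
```

The weights play the role of priorities and the cut-off plays the role of the threshold; both would need tuning by hand, which is exactly the crude part that machine learning is meant to improve on.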

One thing I did not mention in that post is the use of machine learning libraries, which have algorithms that help in such scenarios. For example, Amazon uses machine learning techniques to show you product recommendations. The counts we described in the previous post are crude, although not completely off the mark. Facebook and LinkedIn also use such techniques for 'you may know this person' type recommendations, in addition to proprietary algorithms (Facebook is coming up with Graph Search, and algorithms on top of that could possibly make this easier). The point of this post is that we should use machine learning libraries to process interaction patterns and arrive at results.

Input: We assume that we have large logs of user interactions, user likes (Facebook), followed items (Google+), etc. somewhere on a Hadoop cluster. HBase would be a good choice, as it is column-oriented and is reportedly used at Facebook. Some examples: a user viewed another profile, a user liked a movie, a user viewed a video of a particular type, a user changed their home town. All sorts of interactions count. Labels and tags help here (remember the tags on YouTube videos and the labels on Blogger posts!).

Basically, we can call every kind of interaction or relation a dimension. For example, viewing another user's profile is a dimension. Likes on movies is a dimension. Watching videos of a particular type, say 'Lady Gaga' videos, is a vector in a dimension. Once we have these interaction vectors, we can work out the user's affinity in that dimension or type. A good place to start is the percentage frequency count or, better, the percentage frequency within a particular time period, the latter being more relevant. For example, by running MapReduce on the interaction logs on Hadoop, we can arrive at numbers saying that, over the last day, userA watched 4 videos tagged 'Lady Gaga' and viewed the profiles of 20 users in the Washington, D.C. area. Such vectors are easy to generate. Refer to Tom White's book 'Hadoop: The Definitive Guide', which works through similar examples on weather data. In a way, you don't need to store these as logs; you can feed the interaction vectors directly to HBase, since it is column-oriented. But you still need to derive a basic meaning from the interactions before feeding them to machine learning.
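To make the percentage frequency idea concrete, here is a small Python sketch that mimics the map and reduce steps in memory. It is not a Hadoop job; the event tuple layout and the function name `affinity` are my own assumptions, and I take "percentage" to mean the share of a user's interactions inside the time window.

```python
from collections import Counter, defaultdict
from datetime import datetime, timedelta


def affinity(events, now, window=timedelta(days=1)):
    """events: iterable of (user, dimension, value, timestamp).

    Returns {user: {(dimension, value): percentage of that user's
    interactions that fall inside the time window}}.
    """
    counts = defaultdict(Counter)
    for user, dimension, value, ts in events:
        # "Map": keep only events inside the window and key them.
        if now - window <= ts <= now:
            # "Reduce": sum the occurrences per (user, key).
            counts[user][(dimension, value)] += 1
    result = {}
    for user, counter in counts.items():
        total = sum(counter.values())
        result[user] = {key: 100.0 * n / total for key, n in counter.items()}
    return result


if __name__ == "__main__":
    now = datetime(2013, 5, 6, 12, 0)
    events = [
        ("userA", "video_tag", "Lady Gaga", now - timedelta(hours=1)),
        ("userA", "video_tag", "Lady Gaga", now - timedelta(hours=2)),
        ("userA", "video_tag", "Lady Gaga", now - timedelta(hours=3)),
        ("userA", "profile_view", "Washington D.C.", now - timedelta(hours=5)),
        ("userA", "video_tag", "Lady Gaga", now - timedelta(days=3)),  # too old, ignored
    ]
    print(affinity(events, now))
```

On a real cluster the filter would live in the mapper and the sum in the reducer, with the (user, dimension, value) triple as the key.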

Machine learning can be used for classification, clustering and more. Here we are interested in a few of these: item recommendations, clustering and user neighborhoods are good places to start when you want to analyse user behavior. Clustering is useful when you want an overall view across all the dimensions/vectors we discussed. So, the vectors from the MapReduce jobs are fed to machine learning, which gives you results such as: userA is in the neighborhood of userB; userA may like itemC since he/she is in the same neighborhood as userB; in the movie likes dimension, userA is similar to userB; and so on. You can refer to 'Mahout in Action' by Sean Owen and team on how to get going with this. The sample code in its chapters will be helpful.
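To show what a user neighborhood means, here is a plain Python sketch that uses cosine similarity between interaction vectors. This is a stand-in that I wrote for illustration, not Mahout's API and not how any of the sites above actually do it; the function names and the sample data are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def neighborhood(user, vectors, k=2):
    """Up to k users most similar to `user`, most similar first."""
    others = [(cosine(vectors[user], vec), name)
              for name, vec in vectors.items() if name != user]
    ranked = sorted(others, key=lambda t: (-t[0], t[1]))[:k]
    # Users with no overlap at all are not neighbours.
    return [name for sim, name in ranked if sim > 0]


def recommend(user, vectors, k=2):
    """Items the user's neighbours have and the user does not, best first."""
    scores = {}
    for neighbour in neighborhood(user, vectors, k):
        sim = cosine(vectors[user], vectors[neighbour])
        for item, weight in vectors[neighbour].items():
            if item not in vectors[user]:
                # Weight each candidate item by how similar the neighbour is.
                scores[item] = scores.get(item, 0.0) + sim * weight
    return sorted(scores, key=lambda item: (-scores[item], item))


if __name__ == "__main__":
    vectors = {
        "userA": {"movie1": 1, "movie2": 1},
        "userB": {"movie1": 1, "movie2": 1, "itemC": 1},
        "userC": {"movie9": 1},
    }
    print(neighborhood("userA", vectors))  # userB shares likes with userA
    print(recommend("userA", vectors))     # itemC comes from userB
```

A library like Mahout gives you several similarity measures and neighborhood definitions to choose from, and handles data far larger than fits in a dict.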

So, altogether, the new approach looks similar to the case studies in Tom White's book:

Logs / HBase data -> MapReduce -> Machine Learning -> Recommendations -> Store to RDBMS for provisioning.
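The last step of this pipeline, storing the recommendations in an RDBMS so the web app can read them, can be sketched with SQLite from the Python standard library standing in for a production database. The table name, column names and helper functions are hypothetical.

```python
import sqlite3


def store_recommendations(conn, recs):
    """recs: iterable of (user_id, item_id, score).

    A newer score for the same (user, item) pair replaces the older one.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS recommendations ("
        "user_id TEXT NOT NULL, item_id TEXT NOT NULL, score REAL NOT NULL, "
        "PRIMARY KEY (user_id, item_id))"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO recommendations (user_id, item_id, score) "
        "VALUES (?, ?, ?)",
        recs,
    )
    conn.commit()


def top_items(conn, user_id, limit=5):
    """Item ids recommended for a user, highest score first."""
    rows = conn.execute(
        "SELECT item_id FROM recommendations WHERE user_id = ? "
        "ORDER BY score DESC, item_id LIMIT ?",
        (user_id, limit),
    )
    return [item_id for (item_id,) in rows]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    store_recommendations(conn, [("userA", "itemC", 0.82), ("userA", "itemD", 0.40)])
    print(top_items(conn, "userA"))
```

The web app then only runs a cheap lookup per page view, while the heavy computation stays in the batch jobs.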

This seems a better way to produce item recommendations and the like than the approach in the previous post. More ways to do this are described online.

References
--------------
https://www.youtube.com/watch?v=kI4YIYInou0
https://www.youtube.com/watch?v=9A0PnPvQks4
Hadoop: The Definitive Guide - Tom White (book & code)
Mahout in Action - Sean Owen and team (book)
