Sunday, 23 December 2012

Data Analytics : Hatari !

Embedding Pig in Python - Hatari!
---------------------------------------------
Pig Latin has no control structures of the kind found in Java and other high-level languages. To get more control over the execution context, for example to loop through a set of parameters, we can embed Pig in Python. Only Python 2 syntax is supported (nothing newer than 2.7), because the script is interpreted by Jython. At first this may not seem so useful. But if you have a single parameterised Pig script, like the one in the previous post, you can attach parameters to the script from Python. Plus, since Python can also be used to write User Defined Functions, all the code (Pig, Python, UDFs) resides in the same project. This is useful when things get more complex. In the previous post, note that I used a Java project to create a Java UDF which was used in Pig. So I had two deployments: a Pig script and a Java jar. If we embed Pig in Python, we have only one deployment, the Python code.

1) Write the Pig script in a file.
2) Use the compileFromFile() function to load and compile your script.
3) Bind it using a map of parameters, or a list of maps. Yes, a list of maps: one map per run.
4) Run the script using runSingle() (or run() when you bound a list of maps).
5) Check how things went using the result that runSingle() returns; you can call isSuccessful() on it.

Better yet, do this again for all your scripts. You can even read the parameters from a file and add threading, where each thread executes a script with one set of parameters ;-)
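As a rough sketch of the "parameters from a file plus threads" idea: the file format below (one parameter set per line, name=value pairs separated by semicolons) and the run_one callable are assumptions made up for illustration, not anything Pig prescribes. run_one stands for whatever function does the compile, bind and run for a single set of parameters. Whether concurrent runs behave well with your Pig version is something to verify before relying on it.

```python
import threading


def parse_param_lines(lines):
    """Turn lines like 'datefrom=...; dateto=...' into a list of dicts.

    The line format is hypothetical: one parameter set per line,
    name=value pairs separated by ';'. Blank lines and lines that
    start with '#' are skipped.
    """
    bindings = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        params = {}
        for pair in line.split(";"):
            name, value = pair.split("=", 1)
            params[name.strip()] = value.strip()
        bindings.append(params)
    return bindings


def run_in_threads(bindings, run_one):
    """Call run_one(params) in its own thread for every parameter set.

    run_one is a hypothetical callable supplied by the caller.
    Results come back in the same order as the bindings.
    """
    results = [None] * len(bindings)

    def worker(index, params):
        results[index] = run_one(params)

    threads = [threading.Thread(target=worker, args=(i, p))
               for i, p in enumerate(bindings)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

With a file on disk you would pass open("params.txt") (a hypothetical file name) to parse_param_lines, since it only needs something it can iterate over line by line.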

Though this is very useful, I find it easier to write the Pig script in a text editor and test it from the command line on my pseudo cluster. Then I write a Python program that loads the script file, then compiles, binds and runs it.
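A minimal sketch of such a program, following the five steps listed earlier. It assumes the file is launched through Pig (typically something like pig myscript.py, where myscript.py is a made-up name), because the org.apache.pig.scripting module only exists inside that Jython environment. The helper names run_script and run_for_each are mine, and the code sticks to syntax that Jython 2.5 understands.

```python
#!/usr/bin/python


def run_script(script_path, params):
    """Compile, bind and run one parameterised Pig script.

    Returns True when Pig reports the run as successful.
    """
    # Imported inside the function: this module is only available when
    # the file is executed by Pig under Jython, not in plain Python.
    from org.apache.pig.scripting import Pig

    compiled = Pig.compileFromFile(script_path)  # load and compile the script
    bound = compiled.bind(params)                # bind one map of parameters
    stats = bound.runSingle()                    # run it once
    return stats.isSuccessful()                  # status of the run


def run_for_each(script_path, bindings, runner=run_script):
    """Run the same script once per parameter map.

    Returns a list of (params, succeeded) pairs. The runner argument
    exists only so the loop can be tried out without a cluster.
    """
    outcome = []
    for params in bindings:
        outcome.append((params, runner(script_path, params)))
    return outcome
```

Inside a Pig-launched script you would call, for example, run_for_each('user-based-analytics.pig', [{'datefrom': '2012-10-20 00:00:00', 'dateto': '2012-10-23 00:00:00'}]) and inspect the returned pairs.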

You need Jython 2.5 for this to work, since Jython is what interprets the Python script. Maybe it is something that I am doing, but these are some things that may crop up while you develop this:

a) Eclipse hangs when I add the Jython interpreter.

b) Eclipse does not show auto-complete options for such scripts. This matters because you don't know whether a function is available or whether you typed it correctly. You can always download the Pig source and refer to it to get around this.

c) Another error that I came across was this one:

ERROR 2998: Unhandled internal error. org/python/util/PythonInterpreter
java.lang.NoClassDefFoundError: org/python/util/PythonInterpreter at org.apache.pig.scripting.jython.JythonScriptEngine.main(JythonScriptEngine.java:338

I got rid of this by adding the Jython jar to the Hadoop lib folder. It seems to be basically a classpath issue. More on Stack Overflow here: http://stackoverflow.com/questions/13795993/embedding-pig-into-python

Ref: More details can be found in the presentation at http://www.slideshare.net/julienledem/presentation-pig-scripting by Julien Le Dem.

Ref: A good read is Chapter 9 of the book Programming Pig by Alan F Gates.

Data Analytics: Using Apache Pig User Defined Functions

Apache Pig helps to analyse large data sets. We can define data flows in Pig Latin scripts and extract interesting information from the initial data set. For example, we may have a large dump of user actions on our website as a CSV file or in a database like HBase, and we want to find the most frequented part of the site or the top 10 things that interest our user base. Writing a solution for this in a high-level language like Java is possible, but handling the data as it grows, splitting the task into multiple parts and keeping track of those individual tasks is challenging and consumes a lot of effort. Apache Hadoop and Pig help to solve this problem by providing a framework in which we can focus on the data flow rather than on the plumbing.

Deriving interesting data is always tied to a time period. We may want to extract information from the whole lifetime of the data, or only from a given period such as the last month or week. To specify such options we have user defined functions (UDFs) in Pig. They let us write the filters and operations that we want Pig to perform on each entry of the data set, which gives more control over data filtering and flow. Some of the nuances of Pig UDFs are explained in the example below.

Pig Version of this example: Apache Pig version 0.10.0 (r1328203)

Objective: You want to write a filter function in Pig to filter data rows according to a date range that you are interested in. You want to invoke the script from a scheduler that passes in the date range as command line parameters. A Pig script is shown in the image below.



1) Passing command line parameters to the Pig script: You need to pass command line arguments like this: pig -param datefrom='2012-10-20 00:00:00' -param dateto='2012-10-23 00:00:00' -x mapreduce user-based-analytics.pig (I am actually calling the script from Python, which we will see in the next post).

Here I am using these two date parameters to construct my Java UDF. If a parameter value contains a space character, it has to be quoted as shown above; otherwise Pig throws an error saying that it cannot create your UDF Java class.
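If a scheduler written in Python launches the command, one way to avoid shell quoting mistakes is to build the argument list directly. This is only a sketch: it assumes the pig launcher is on the PATH, and the function names build_pig_command and run_pig are made up for illustration. Passing a list to subprocess means each value, spaces included, reaches pig as a single argument, which is what the single quotes achieve on the shell command line.

```python
import subprocess


def build_pig_command(script, datefrom, dateto, exectype="mapreduce"):
    """Build the argument list for the pig command shown above."""
    return [
        "pig",
        "-param", "datefrom=%s" % datefrom,
        "-param", "dateto=%s" % dateto,
        "-x", exectype,
        script,
    ]


def run_pig(command):
    """Run the command and return its exit code (assumes pig is on PATH)."""
    return subprocess.call(command)


command = build_pig_command(
    "user-based-analytics.pig",
    "2012-10-20 00:00:00",
    "2012-10-23 00:00:00",
)
```

Calling run_pig(command) would then start the job; checking the returned exit code is left to the scheduler.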

2) Within the Pig script: You refer to the command line parameters using the format '$name', for example '$dateto' and '$datefrom' as in the image above. The quotes are left out when the value is an integer, like $parallelism.

3) Defining your UDF reference using the DEFINE keyword: This lets you create a reference to your UDF, which you can then call using an alias. For example, the script defines a reference to the UDF as follows:

define date_based_filter com.home.pig.udfs.TimeStampFilter('$datefrom', '$dateto');

where date_based_filter is the alias that I will use to call my UDF, the Java class com.home.pig.udfs.TimeStampFilter.

4) Calling your UDF filter in a Pig script using the FILTER keyword: Pig does not have a boolean data type, but expressions are evaluated to boolean true or false. You need to call your UDF through its alias, as follows. Here we are checking date_based_filter(ts) == TRUE, i.e. whether my UDF com.home.pig.udfs.TimeStampFilter, constructed with 'datefrom' and 'dateto' and acting on the current row, returns Java Boolean true.

filter_by_date = filter site_data by date_based_filter(ts) == TRUE;
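Parameter substitution is a plain text replacement that happens before Pig parses the script. The sketch below imitates that step with Python's string.Template so you can see what the DEFINE line looks like once the two dates are filled in. It only approximates Pig's own preprocessor, and the alias is written as date_based_filter because Pig identifiers can contain letters, digits and underscores.

```python
from string import Template

# The DEFINE line from the script, with the two $ placeholders left in.
DEFINE_LINE = Template(
    "define date_based_filter "
    "com.home.pig.udfs.TimeStampFilter('$datefrom', '$dateto');"
)


def preview(params):
    """Return the line as it would read after the parameters are filled in."""
    return DEFINE_LINE.substitute(params)


line = preview({
    "datefrom": "2012-10-20 00:00:00",
    "dateto": "2012-10-23 00:00:00",
})
# line is now:
# define date_based_filter com.home.pig.udfs.TimeStampFilter('2012-10-20 00:00:00', '2012-10-23 00:00:00');
```

This also shows why the quotes around '$datefrom' matter: without them the substituted value, which contains a space, would no longer be a single string argument to the UDF constructor.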

5) Now the Java Class that does the filtering.

a) We have to create a Java class that extends FilterFunc (org.apache.pig.FilterFunc). The constructor has to take the parameters that you set in the script above, so here it takes two.

b) Override the public Boolean exec(Tuple arg0) member function to define how this filter will handle tuples from the script. Here I just parse the date from the string and check whether it is within the range.

c) Package this in a jar and put it in the same location as your script.
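The range check described in (b) is small enough to sketch outside Java. The version below is plain Python, not the Java class from this post and not a registered Pig UDF. The timestamp format matches the command line parameters shown earlier, while the inclusive bounds and the decision to reject empty or unparsable values are assumptions of mine.

```python
from datetime import datetime

# Same layout as the datefrom/dateto command line parameters.
TS_FORMAT = "%Y-%m-%d %H:%M:%S"


class TimeStampRange(object):
    """Keeps the two bounds, like the two-argument UDF constructor."""

    def __init__(self, datefrom, dateto):
        self.start = datetime.strptime(datefrom, TS_FORMAT)
        self.end = datetime.strptime(dateto, TS_FORMAT)

    def accepts(self, ts):
        """Return True when ts lies within the range (bounds included).

        Missing or unparsable timestamps are rejected; that choice is
        an assumption, the original filter may behave differently.
        """
        if ts is None:
            return False
        try:
            when = datetime.strptime(ts, TS_FORMAT)
        except ValueError:
            return False
        return self.start <= when <= self.end


date_range = TimeStampRange("2012-10-20 00:00:00", "2012-10-23 00:00:00")
inside = date_range.accepts("2012-10-21 12:30:00")   # within the range
outside = date_range.accepts("2012-10-24 00:00:00")  # after dateto
```

Parsing the two bounds once in the constructor, rather than on every call, mirrors why the dates are passed to the UDF constructor through DEFINE instead of with every row.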

Why use Pig and UDFs? Writing UDFs can be easy and saves a lot of time compared to writing a MapReduce Java program or any other option. Plus, if you have a ton of data, or will end up with it, this is the better option, since Hadoop will scale and Pig will do jobs like data grouping and filtering for you.

Better to use Python? Although it is easy to write the UDF in Java, and the justification is that Pig itself is written in Java so a Java environment is already running, it may be better to write user defined functions in Python and also trigger the script from Python for greater control. Plus, everything will be in one place.

For more on this topic, refer to Chapter 10 of the book Programming Pig by Alan F Gates.