Sunday, 23 December 2012

Data Analytics: Using Apache Pig User Defined Functions

Apache Pig helps to analyse large data sets. We can define data flows in Pig Latin scripts and extract interesting information from an initial data set. For example, you may have a large dump of user actions on your website as a CSV file or in a database like HBase, and you want to find the most frequented part of your site or the top 10 things that interest your user base. Writing a solution for this is possible in a high-level language like Java, but the optimizations for handling data as it grows, splitting the task into multiple parts, and keeping track of those individual tasks are challenging and consume a lot of effort. Apache Hadoop and Pig solve this problem by providing a framework that lets us focus on the data flow rather than on the plumbing.

Deriving interesting data is almost always tied to a time period. We may want to extract information from the whole lifetime of the data, or only from a given period, say the last month or week. To support such options, Pig provides user defined functions (UDFs), which let us write the filters and operations that Pig should perform on each entry of the data set. This gives more control over data filtering and flow. Some of the nuances of Pig UDFs are explained in the example below.

Pig version used in this example: Apache Pig 0.10.0 (r1328203)

Objective: You want to write a filter function in Pig to filter data rows according to a date range that you are interested in, and you want to invoke the script from a scheduler that passes in the date range as command line parameters. A sketch of such a pig script is shown below.
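
Here is a minimal sketch of such a script (the input file name, the column layout, and the chararray timestamp column ts are my assumptions for illustration):

register 'pig-udfs.jar'; -- the jar that contains the UDF class
define date_based_filter com.home.pig.udfs.TimeStampFilter('$datefrom', '$dateto');
site_data = load 'site-actions.csv' using PigStorage(',') as (ts:chararray, userid:chararray, action:chararray);
filter_by_date = filter site_data by date_based_filter(ts) == TRUE;
store filter_by_date into 'filtered-site-actions';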



1) Passing command line parameters to the pig script: You pass command line arguments like this:

pig -param datefrom='2012-10-20 00:00:00' -param dateto='2012-10-23 00:00:00' -x mapreduce user-based-analytics.pig

(I am actually calling the script from Python, which we will see in the next post.)

Here I am using these two date parameters to build my Java UDF. If a parameter value contains a space character, it has to be quoted exactly like this; otherwise Pig will throw an error saying that it cannot create your UDF Java class.

2) Within the pig script: You refer to the command line parameters by prefixing the name with $, quoting string parameters as '$dateto' and '$datefrom' in the script above, unless it is an integer like $parallelism, which needs no quotes.

3) Defining your UDF reference using the DEFINE keyword: This allows you to create a reference to your UDF which you can call through an alias. For example, the script defines a reference to the UDF as follows:

define date_based_filter com.home.pig.udfs.TimeStampFilter('$datefrom', '$dateto');

where date_based_filter is the alias that I will use to call my UDF, the com.home.pig.udfs.TimeStampFilter Java class. (Pig aliases may contain only letters, digits, and underscores, so a hyphenated name would not parse.)

4) Calling your UDF filter in a pig script using the FILTER keyword: Pig had no boolean data type before version 0.10, but filter expressions are evaluated to boolean true or false. You call your UDF through its alias as follows. Here we are checking date_based_filter(ts) == TRUE, i.e. does my UDF com.home.pig.udfs.TimeStampFilter, acting on the current row with 'datefrom' and 'dateto', return Java Boolean true or false.

filter_by_date = filter site_data by date_based_filter(ts) == TRUE;
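
Since a FilterFunc already returns a Boolean, the comparison with TRUE is optional; assuming the same aliases as above, the shorter form below behaves the same.

filter_by_date = filter site_data by date_based_filter(ts);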

5) Now the Java class that does the filtering; a sketch follows the three steps below.

a) We have to create a Java class that extends org.apache.pig.FilterFunc. The constructor has to take the parameters that you set in the script above, so here it takes two string parameters.

b) Override the public Boolean exec(Tuple arg0) member function to define how this filter handles each tuple from the script. Here I just parse the date from the string and check whether it falls within the range.

c) Package this in a jar, put it in the same path as your script, and register the jar at the top of the script.
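
Putting the three steps together, a minimal sketch of the UDF class is shown below. The timestamp pattern and the null handling are my assumptions for illustration; adapt them to your data.

package com.home.pig.udfs;

import java.io.IOException;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

import org.apache.pig.FilterFunc;
import org.apache.pig.data.Tuple;

public class TimeStampFilter extends FilterFunc {

    private static final String PATTERN = "yyyy-MM-dd HH:mm:ss";

    private final SimpleDateFormat parser = new SimpleDateFormat(PATTERN);
    private final Date from;
    private final Date to;

    // Pig calls this constructor with the strings given in the DEFINE statement.
    public TimeStampFilter(String datefrom, String dateto) {
        try {
            this.from = parser.parse(datefrom);
            this.to = parser.parse(dateto);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Dates must match " + PATTERN, e);
        }
    }

    // Called once per tuple; returning false drops the row from the relation.
    @Override
    public Boolean exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return false;
        }
        try {
            Date ts = parser.parse((String) input.get(0));
            // Keep the row if its timestamp lies inside [from, to].
            return !ts.before(from) && !ts.after(to);
        } catch (ParseException e) {
            return false; // drop rows whose timestamp cannot be parsed
        }
    }
}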

Why use Pig and UDFs? Writing UDFs is easy and saves a lot of time compared to writing a MapReduce Java program or any other option. Plus, if you have a ton of data, or will end up with one, this is the better option, since Hadoop will scale and Pig will do jobs like data grouping and filtering for you.

Better to use Python? Writing the UDF in Java is easy, and since Pig itself runs on the JVM a Java environment is already turned on; still, it may be better to write User Defined Functions in Python and also trigger the script from Python for greater control. Plus, everything will be in one place.

For more on this topic, refer to Chapter 10 of the book Programming Pig by Alan F. Gates.
