Thursday, 12 September 2013

Big Data Analytics: Using Python UDFs for better managing code and development

Apache Pig is a data flow scripting language. We specify how we want the data to be grouped, filtered and so on, and Pig generates the Hadoop MapReduce jobs that correspond to the Pig statements. Pig does not, however, have the facilities of a general-purpose programming language. User defined functions (UDFs) fill that gap by letting us extend Pig: we can write our own data manipulation logic and use it alongside the built-in Pig operators. For example, if we are filtering data, we can write a UDF that decides which records to keep and then ask Pig to use it in a FILTER ... BY statement. In a previous post we saw how to write user defined functions in Java. As mentioned in that post, since Hadoop and Pig are themselves written in Java, it makes sense to write a UDF in Java. It was also mentioned that writing a UDF in Python could mean less effort and better code management.

For Java, we create a jar file that contains the UDF code and register that jar with Pig. As the number of UDFs goes up and changes become frequent, even in the development stage, this adds to the effort: every change means editing the code, testing it, building the jar and then checking how it works with Pig. This can end up taking a lot of time (refer to the previous post).

To save time on development iterations, we can embed Apache Pig in Python scripts, as we have also seen before. UDFs can be written in Python too, which makes things a lot easier. The Pig script is embedded in a Python file, and the UDFs can live in the same file or in separate modules in the same project. The net effect is that you avoid the overhead of switching to another project to change and re-jar your UDF. There is also the matter of specifying the schema of your UDF's output: this takes extra code in Java, but in Python it boils down to a one-line annotation (a decorator) on the function. In short, there is less code to write and maintain, and deployment becomes easier. Python UDFs and the simplicity involved are demonstrated below.

There are two ways to use Python and Pig together.

Example: We have a dump of data from seismic sensors. We need to find all the locations where there has been an earthquake of magnitude > 5, and we want the number of such quakes in the data. We filter the data with one UDF and use another UDF to align the data properly. Python UDFs are used here for the sake of the demo. The full code for the project is available here. Notice that everything is under one project. In the code there are two folders, QuakeDataRunner and QuakeDataRunner2, which demonstrate the two approaches.
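To make the target concrete, here is a small plain-Python sketch of the computation the Pig job is meant to perform. The sample records and the (location, magnitude) layout are assumptions made up for illustration; they are not taken from the actual sensor dump.

```python
from collections import Counter

# Made-up sample records in an assumed (location, magnitude) layout.
readings = [
    ("Tokyo", 5.8),
    ("Lima", 4.9),
    ("Tokyo", 6.1),
    ("Anchorage", 5.0),
    ("Lima", 7.2),
]


def count_strong_quakes(records, threshold=5.0):
    """Count readings per location whose magnitude is strictly above threshold."""
    return Counter(loc for loc, mag in records if mag > threshold)


counts = count_strong_quakes(readings)
print(dict(counts))
```

With this sample, Tokyo is counted twice and Lima once, while the Anchorage reading of exactly 5.0 is dropped because the condition is strictly greater than 5.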

A) Keep the Python UDFs and the Pig script that uses them in separate files.
In this case, register the UDF file in the Pig script with a statement of the form REGISTER '<udf_file>.py' USING jython AS <namespace>; Pig uses its internal Jython engine to load the functions. Two files are involved: the Python file holding the UDFs and the Pig script that calls them.
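A minimal sketch of what such a UDF file might look like is given here. The file name quake_udfs.py, the function names is_strong and clean_location, and the cleaning rule are all hypothetical; the project's real UDFs may differ. The sketch assumes Pig makes the outputSchema decorator available when the file is registered with USING jython, and adds a small fallback so the same file can be imported by a plain Python interpreter for testing.

```python
# quake_udfs.py (hypothetical file name)

# Assumption: Pig supplies the outputSchema decorator when this file is
# registered with USING jython. The fallback below exists only so the file
# can also be imported and tested with a plain Python interpreter.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def decorate(func):
            func.outputSchema = schema
            return func
        return decorate


@outputSchema("is_strong:int")
def is_strong(magnitude):
    """Return 1 when the magnitude is above 5, else 0 (missing values give 0)."""
    if magnitude is None:
        return 0
    return 1 if float(magnitude) > 5.0 else 0


@outputSchema("location:chararray")
def clean_location(location):
    """Collapse extra whitespace and normalise case so equal places group together."""
    if location is None:
        return None
    return " ".join(location.split()).title()
```

The schema string in each decorator is all that is needed to tell Pig the name and type of the returned field, which is the part that takes extra code in a Java UDF.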

The Pig script calls the UDFs through the namespace given at registration, for example <namespace>.<function>(field).
If you have pig in your PATH, you can run the script from the project folder as pig -x local <pig_file>.

B) Embedding Pig script in Python
Here too the program structure does not change. All that needs to be done is to use the Java wrappers for Pig from Python. This code is also run with the pig command. The UDF file is the same as in approach A, but the main program is now a Python script that contains the Pig script.
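As a rough sketch of this approach, the driver below compiles a Pig script held in a Python string, binds its parameters and runs it through the Pig scripting wrapper (org.apache.pig.scripting.Pig). The file names, the comma-separated (location, magnitude) input layout and the UDF names quake.is_strong and quake.clean_location are assumptions for illustration only, not the project's actual code.

```python
# quake_runner.py (hypothetical file name); meant to be run through the pig command.
try:
    # Java wrapper for Pig; importable only under Pig's Jython engine.
    from org.apache.pig.scripting import Pig
except ImportError:
    Pig = None  # plain Python: the script text can still be inspected

# Assumed input layout: comma-separated lines of location,magnitude.
PIG_SCRIPT = """
REGISTER 'quake_udfs.py' USING jython AS quake;
readings = LOAD '$input' USING PigStorage(',')
           AS (location:chararray, magnitude:double);
strong = FILTER readings BY quake.is_strong(magnitude) == 1;
cleaned = FOREACH strong GENERATE quake.clean_location(location) AS location;
by_location = GROUP cleaned BY location;
counts = FOREACH by_location GENERATE group AS location, COUNT(cleaned) AS quakes;
STORE counts INTO '$output';
"""


def run(input_path, output_path):
    """Compile the script, bind its $input and $output parameters, and run it once."""
    compiled = Pig.compile(PIG_SCRIPT)
    bound = compiled.bind({"input": input_path, "output": output_path})
    stats = bound.runSingle()
    if not stats.isSuccessful():
        raise RuntimeError("Pig job failed")
    return stats


if __name__ == "__main__" and Pig is not None:
    # Hypothetical input file and output directory.
    run("quakes.csv", "strong_quake_counts")
```

Keeping the Pig script as a string next to the Python control code means one file drives the whole job, while the UDFs stay in their own module in the same project.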

To run it, pass the Python file to the pig command. As before, you can add pig to your PATH to avoid referring to the full path of the pig executable in every command.

Click for the code used in this post

1. Programming Pig by Alan Gates.
2. Embedding Pig in Scripting Languages by Julien Le Dem.