Sunday, 23 December 2012

Data Analytics : Hatari !

Embedding Pig in Python - Hatari!
Pig Latin does not have control structures like those in Java and other high-level languages. So, in order to get more control over the execution context, loop through a set of parameters, etc., we can embed Pig in Python. This is supported only for Python 2.7 and not above. At first this may not seem very useful. But if you have a single script, i.e. a parameterised Pig script like the one in the previous post, you can attach parameters to the script from Python. Plus, since Python can also be used to write User Defined Functions, all the code (Pig, Python, UDFs) resides in the same project. This is useful when things get more complex. Note that in the previous post I used a Java project to create a Java UDF which was used in Pig, so I had 2 deployments: a Pig script and a Java jar. But if we embed Pig in Python, we have only one deployment: the Python code.

1) Write pig script in a file.
2) Use compileFromFile() function to load and compile your script. 
3) Bind the script using your list of maps of parameters. Yes, a list of maps of parameters.
4) Run the script using runSingle().
5) Get the status of how things went from the object that runSingle() returns. You can also call isSuccessful() on it.
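
The steps above can be sketched roughly as follows. This is only a minimal sketch: the script name and the parameter names are made up for illustration, and it binds a single map of parameters rather than a list. The import of Pig only works when the file is run through Pig's Jython engine, so it is guarded here to keep the helper importable elsewhere.

```python
# Minimal sketch of steps 1-5 (assumed file and parameter names).
try:
    # Only importable when the file is run by Pig's Jython script engine.
    from org.apache.pig.scripting import Pig
except ImportError:
    Pig = None  # lets the helper below be imported and tested outside Pig

SCRIPT_FILE = 'myscript.pig'                  # hypothetical Pig script (step 1)
PARAMS = {'input': '/data/in',                # hypothetical parameters that the
          'output': '/data/out'}              # script refers to as $input, $output


def run_script(pig, script_file, params):
    """Compile, bind and run one Pig script; return True if it succeeded."""
    compiled = pig.compileFromFile(script_file)   # step 2: load and compile
    bound = compiled.bind(params)                 # step 3: bind one map of parameters
    stats = bound.runSingle()                     # step 4: run the bound script
    return stats.isSuccessful()                   # step 5: check how things went


if Pig is not None:
    if run_script(Pig, SCRIPT_FILE, PARAMS):
        print('Pig job succeeded')
    else:
        print('Pig job failed')
```

Passing Pig in as an argument is just a design choice for this sketch; it makes it possible to try the helper with a stand-in object before running it on a cluster.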

Better yet, do this again for all your scripts. You can even read parameters from a file, and add threading where each thread executes a script with one set of parameters ;-)
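
Here is a rough sketch of that idea. The file format (one parameter set per line, written as key=value pairs separated by spaces) and the function names are my own assumptions, not anything Pig prescribes. The function that actually runs the Pig script is passed in as run_one, so this part stays independent of Pig.

```python
import threading


def read_param_sets(path):
    """Read an assumed format, e.g. 'input=/data/a output=/out/a' per line,
    into a list of dicts. Blank lines and lines starting with # are skipped."""
    param_sets = []
    handle = open(path)
    try:
        for line in handle:
            line = line.strip()
            if not line or line.startswith('#'):
                continue
            # Every item on the line must look like key=value.
            pairs = [item.split('=', 1) for item in line.split()]
            param_sets.append(dict(pairs))
    finally:
        handle.close()
    return param_sets


def run_all(run_one, param_sets):
    """Call run_one(params) in one thread per parameter set.
    Returns the results in the same order as param_sets."""
    results = [None] * len(param_sets)

    def worker(index, params):
        results[index] = run_one(params)

    threads = [threading.Thread(target=worker, args=(i, p))
               for i, p in enumerate(param_sets)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

In the embedded case run_one would be a small function that compiles the script, binds the given map and calls runSingle(). Whether several Pig runs behave well side by side in your setup is something to check on a test cluster first.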

Though this is very useful, I find it easier to write the Pig script in a text pad and test it on the command line of my pseudo cluster. Then I write a Python program that loads the script file, then compiles, binds and runs it.

You need Jython 2.5 for this to work, since you use Jython to interpret the Python script. Maybe it is something that I am doing, but some things that may crop up while you develop this are:

a) Eclipse hangs when I add the Jython interpreter.

b) Eclipse does not show auto-complete options for such scripts. This is important because you don't know whether a function is available or whether you typed it correctly. But you can always download the Pig source and refer to it to get around this.

c) Another error that I came across was this one:

ERROR 2998: Unhandled internal error. org/python/util/PythonInterpreter
java.lang.NoClassDefFoundError: org/python/util/PythonInterpreter at org.apache.pig.scripting.jython.JythonScriptEngine.main(

I got rid of this by adding the Jython jar to the Hadoop lib folder. It seems to be basically a classpath issue. There is more on this at Stack Overflow.

Ref: More details can be found in the presentation by Julien Le Dem.

Ref: A good read is Chapter 9 of the book Programming Pig by Alan F Gates.