The mapper and reducer code in Python is shown below.
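The actual scripts ship with the linked project; as a stand-in, here is a minimal filter-and-count pair. It assumes comma-separated records with the region in the second field and the magnitude in the third, and a hypothetical cutoff of 4.0 (all three are assumptions, adjust them to match the data-gen script's layout).

mapper.py:

#!/usr/bin/python
# mapper.py: keep only quakes above a magnitude threshold,
# emitting region<TAB>magnitude for each surviving record
import sys

THRESHOLD = 4.0  # hypothetical cutoff

for line in sys.stdin:
    fields = line.strip().split(',')
    if len(fields) < 3:
        continue                       # skip malformed lines
    region, magnitude = fields[1], fields[2]
    try:
        if float(magnitude) >= THRESHOLD:
            print '%s\t%s' % (region, magnitude)
    except ValueError:
        continue                       # skip headers / bad magnitude values

reducer.py:

#!/usr/bin/python
# reducer.py: streaming delivers input sorted by key, so we can
# count consecutive runs of each region and emit region<TAB>count
import sys

current_region, count = None, 0
for line in sys.stdin:
    region, _ = line.strip().split('\t', 1)
    if region == current_region:
        count += 1
    else:
        if current_region is not None:
            print '%s\t%d' % (current_region, count)
        current_region, count = region, 1
if current_region is not None:
    print '%s\t%d' % (current_region, count)

Note the #!/usr/bin/python shebang at the top of both files; its absence is the cause of the first error described further down.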
Run the streaming command on Hadoop as follows:
hadoop jar ~/Development/hadoop-1.0.3/contrib/streaming/hadoop-streaming-1.0.3.jar \
-input /hadoop/streampython/large-earthquake.data \
-output /hadoop/streampython/filtered \
-file ./stream/mapper.py \
-mapper 'mapper.py' \
-file ./stream/reducer.py \
-reducer 'reducer.py'
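Once the job completes you can sanity-check the result straight from HDFS; the part-file name below assumes the default naming and will vary with the number of reducers:

hadoop fs -ls /hadoop/streampython/filtered
hadoop fs -cat /hadoop/streampython/filtered/part-00000 | head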
Sample code: the Git/Eclipse project for this is available at this link. It also includes a data-gen script that can generate 2 GB+ of test earthquake data.
Screenshots:
a) Input files
b) SPLIT_RAW_BYTES for the map task
c) Job status with two reducers
Some Errors to watch out for:
1) You see the following on the JobTracker -> TaskTracker page for your map tasks, and your maps appear to have been killed more times than allowed:
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 2
    at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:362)
    at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:576)
    at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:135)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
    at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:416)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
    at org.apache.hadoop.mapred.Child.main(Child.java:249)
In my case, this was because I had not added #!/usr/bin/python to the top of my Python files. Once added, there were no further issues; there was no need to change permissions on any of the files.
2) Basic Hadoop Nuances:
a) Add the following property to your core-site.xml (see the sketch after this list). It tells Hadoop which location to use for its tmp folder: pick one that does not get wiped across reboots.
b) Add the following two properties to your hdfs-site.xml (also shown below). These tell Hadoop where to store the NameNode metadata and the data blocks; you can browse the block files for your data under the configured data directory.
Otherwise your DataNode and NameNode will play up after a reboot, and you may see exceptions in the log files such as "could only be replicated to 0 nodes" and "DataNode connection to NameNode timed out after n tries".
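For reference, here is a sketch of what those properties might look like. The property names (hadoop.tmp.dir, dfs.name.dir, dfs.data.dir) are the standard Hadoop 1.x ones, but the paths are placeholders you should swap for stable directories on your own machine.

core-site.xml:

<property>
  <name>hadoop.tmp.dir</name>
  <!-- placeholder path: pick any directory that survives reboots -->
  <value>/home/youruser/hadoop/tmp</value>
</property>

hdfs-site.xml:

<property>
  <name>dfs.name.dir</name>
  <!-- NameNode metadata is stored here -->
  <value>/home/youruser/hadoop/name</value>
</property>
<property>
  <name>dfs.data.dir</name>
  <!-- HDFS block files for your data end up here -->
  <value>/home/youruser/hadoop/data</value>
</property>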