Python streaming MapReduce job on Hadoop failed - missing log4j?

I tried to run a Python word count on Hadoop 2.7.1, installed on Ubuntu 15.10, and got an error:

log4j:WARN No appenders could be found for logger (org.apache.hadoop.ipc.Server).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See for more info.

I also get a RuntimeException in the terminal, along with a message that streaming failed, and no output file is created.

I found a few threads saying that probably and log4j.xml are missing, along with examples of what they should contain. I tried one example, without success. Where do I find these files in the Hadoop directory (if they exist), or how can I create them with the right configuration?

The code for mapper and reducer for wordcount is taken from here and it runs absolutely fine with


However, I tried several times to run it on Hadoop and it fails. I used different commands, both with the Python files copied to HDFS and with them on the local file system. This one did not work:

hadoop hadoop-streaming-2.7.1.jar -mapper /user/ -reducer /user/ -input/input_file.txt -output /user/output

nor this one:

hadoop hadoop-streaming-2.7.1.jar -mapper "python /user/" -reducer "python /user/" -input/input_file.txt -output /user/output

This one did work (python files in the local file system):

hadoop hadoop-streaming-2.7.1.jar -mapper "python /home/user_name/Documents/" -reducer "python /home/user_name/Documents/" -input /user/input_file.txt -output /user/output

All the files have the right permissions.

The output - after the standard beginning - is as follows:

16/02/15 09:47:48 INFO mapreduce.Job:  map 0% reduce 0%
16/02/15 09:48:05 INFO mapreduce.Job: Task Id : attempt_1455529218252_0001_m_000000_0, Status : FAILED
Error: java.lang.RuntimeException: Error in configuring object
    at org.apache.hadoop.util.ReflectionUtils.setJobConf(
    at org.apache.hadoop.util.ReflectionUtils.setConf(
    at org.apache.hadoop.util.ReflectionUtils.newInstance(
    at org.apache.hadoop.mapred.MapTask.runOldMapper(
    at org.apache.hadoop.mapred.YarnChild$
    at Method)
    at org.apache.hadoop.mapred.YarnChild.main(
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(
    at java.lang.reflect.Method.invoke(
    at org.apache.hadoop.util.ReflectionUtils.setJobConf(
... 9 more
Caused by: java.lang.RuntimeException: Error in configuring object
    at org.apache.hadoop.util.ReflectionUtils.setJobConf(
    at org.apache.hadoop.util.ReflectionUtils.setConf(
    at org.apache.hadoop.util.ReflectionUtils.newInstance(
    at org.apache.hadoop.mapred.MapRunner.configure(
... 14 more
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(
    at java.lang.reflect.Method.invoke(
    at org.apache.hadoop.util.ReflectionUtils.setJobConf(
... 17 more
Caused by: java.lang.RuntimeException: configuration exception
    at org.apache.hadoop.streaming.PipeMapRed.configure(
    at org.apache.hadoop.streaming.PipeMapper.configure(
... 22 more
Caused by: Cannot run program "/user/mr/": error=2, No such file or directory
    at java.lang.ProcessBuilder.start(
    at org.apache.hadoop.streaming.PipeMapRed.configure(
... 23 more
Caused by: error=2, No such file or directory
    at java.lang.UNIXProcess.forkAndExec(Native Method)
    at java.lang.UNIXProcess.<init>(
    at java.lang.ProcessImpl.start(
    at java.lang.ProcessBuilder.start(
... 24 more

There is a lot more, but the final output reports that the streaming job failed:

16/02/15 09:49:07 INFO mapreduce.Job: Counters: 13
    Job Counters 
        Failed map tasks=7
        Killed map tasks=1
        Launched map tasks=8
        Other local map tasks=6
        Data-local map tasks=2
        Total time spent by all maps in occupied slots (ms)=135543
        Total time spent by all reduces in occupied slots (ms)=0
        Total time spent by all map tasks (ms)=135543
        Total vcore-seconds taken by all map tasks=135543
        Total megabyte-seconds taken by all map tasks=138796032
    Map-Reduce Framework
        CPU time spent (ms)=0
        Physical memory (bytes) snapshot=0
        Virtual memory (bytes) snapshot=0
16/02/15 09:49:07 ERROR streaming.StreamJob: Job not successful!
Streaming Command Failed!

What could be the reason the Python code does not work when invoked from HDFS?



You should just supply the names of the local Python files as arguments to -mapper and -reducer. They don't need to be on HDFS, nor should you supply a string with the command line to execute the scripts.
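The last cause in the trace, "error=2, No such file or directory", is the operating system's ENOENT code: the task tries to start the mapper as a process on the node's local file system, where an HDFS path does not exist. A minimal Python sketch of the same failure follows; the path and the try_launch helper are made up for illustration.

```python
import errno
import subprocess


def try_launch(path):
    """Start `path` as a local process; return 0 if it launched, else the OS error number."""
    try:
        subprocess.call([path])
    except OSError as exc:
        return exc.errno
    return 0


# Hypothetical path: imagine it exists in HDFS but not on the local disk.
code = try_launch("/no/such/local/dir/mapper_script")
print(code == errno.ENOENT)  # ENOENT is error number 2
```

The script has to be present on the machine that runs the task, which is what shipping it with the job takes care of.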

You also need to supply a -file argument for each script. Try using

hadoop hadoop-streaming-2.7.1.jar -file /home/user_name/Documents/ -file /home/user_name/Documents/ -mapper /home/user_name/Documents/ -reducer /home/user_name/Documents/ -input /input_file.txt -output /user/output
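When a script is passed by file name rather than as "python script", it generally needs an interpreter line at the top and execute permission, because it is started directly as a program. The sketch below is not the code linked in the question; it is a minimal word count with hypothetical function names (map_lines, reduce_records), kept in one file so it can be tried locally. A real job would normally use two separate scripts that read sys.stdin and print to stdout.

```python
#!/usr/bin/env python
"""Minimal word-count sketch for Hadoop Streaming (illustrative only)."""
from itertools import groupby


def map_lines(lines):
    """Mapper step: emit one 'word<TAB>1' record per whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield "%s\t1" % word


def reduce_records(records):
    """Reducer step: sum the counts of consecutive records that share a key.

    Streaming sorts mapper output by key before the reducer reads it,
    so records for the same word arrive next to each other.
    """
    pairs = (record.rstrip("\n").split("\t", 1) for record in records)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield "%s\t%d" % (word, sum(int(count) for _, count in group))


# Local dry run: sorted() stands in for the shuffle/sort phase.
sample = ["foo bar foo\n", "bar baz\n"]
for record in reduce_records(sorted(map_lines(sample))):
    print(record)
```

Running the same logic locally on a small sample like this is a quick way to rule out problems in the scripts before looking at how the job is submitted.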