I am very new to Hadoop and am trying to run a MapReduce job on my university's cluster. I have tested my mapper and reducer locally and they seem to work, but when I run them through Hadoop Streaming I get the error below over and over again.
The goal is to find the difference between the maximum and minimum wind speed for each day.
My mapper:
#!/usr/bin/env python
import sys

# Read the header line and strip whitespace
header = [x.strip() for x in sys.stdin.readline().split(',')]

# Find the positions of the columns we need
date_index = header.index('YearMonthDay')
wind_speed_index = header.index('Wind Speed (kt)')

expected_fields = 21  # some lines have an extra field; this is how many there should be

for line in sys.stdin:
    fields = line.strip().split(',')  # remove any leading or trailing whitespace
    if len(fields) != expected_fields:  # if there are more or fewer fields, skip the row
        continue
    date = fields[date_index]
    wind_speed = fields[wind_speed_index]
    if wind_speed == '-':  # '-' marks a missing wind speed, so skip this line
        continue
    print('%s\t%s' % (date, wind_speed))
My reducer:
#!/usr/bin/env python
import sys

current_date = None
date = None  # initialize date so the final comparison works on empty input
max_wind_speed = 0
min_wind_speed = float('inf')

for line in sys.stdin:
    line = line.strip()
    date, wind_speed = line.split('\t', 1)
    try:
        wind_speed = float(wind_speed)
    except ValueError:
        continue
    if current_date == date:
        max_wind_speed = max(max_wind_speed, wind_speed)
        min_wind_speed = min(min_wind_speed, wind_speed)
    else:
        if current_date:
            print('%s\t%s' % (current_date, max_wind_speed - min_wind_speed))
        max_wind_speed = wind_speed
        min_wind_speed = wind_speed
        current_date = date

# Emit the result for the last date
if current_date == date:
    print('%s\t%s' % (current_date, max_wind_speed - min_wind_speed))
The command I'm using in the terminal:
hadoop jar /opt/hadoop/current/share/hadoop/tools/lib/hadoop-streaming-3.3.0.jar \
-file mapper.py -mapper mapper.py \
-file reducer.py -reducer reducer.py \
-input filename.txt \
-output output1
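(The first WARN line in the log says -file is deprecated. In case that is relevant: as far as I understand, the equivalent invocation with the generic -files option, which has to come before the streaming-specific options, would be the following. I have not confirmed this changes anything.)

```shell
hadoop jar /opt/hadoop/current/share/hadoop/tools/lib/hadoop-streaming-3.3.0.jar \
  -files mapper.py,reducer.py \
  -mapper mapper.py \
  -reducer reducer.py \
  -input filename.txt \
  -output output1
```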
The full error text:
2023-06-15 18:38:18,871 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
packageJobJar: [mapper.py, reducer.py, /tmp/hadoop-unjar16415805188223852506/] []/tmp/streamjob4666494938393678435.jar tmpDir=null
2023-06-15 18:38:19,468 INFO client.DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at lena-master/128.86.245.64:8032
2023-06-15 18:38:19,565 INFO client.DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at lena-master/128.86.245.64:8032
2023-06-15 18:38:19,751 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/achap001/.staging/job_1684218082735_7133
2023-06-15 18:38:20,026 INFO mapred.FileInputFormat: Total input files to process: 1
2023-06-15 18:38:20,094 INFO mapreduce.JobSubmitter: number of splits:2
2023-06-15 18:38:20,236 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1684218082735_7133
2023-06-15 18:38:20,236 INFO mapreduce.JobSubmitter: Executing with tokens: []
2023-06-15 18:38:20,337 INFO conf.Configuration: resource-types.xml not found
2023-06-15 18:38:20,337 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2023-06-15 18:38:20,575 INFO impl.YarnClientImpl: Submitted application application_1684218082735_7133
2023-06-15 18:38:20,610 INFO mapreduce.Job: The url to track the job: http://lena-master:8088/proxy/application_1684218082735_7133/
2023-06-15 18:38:20,611 INFO mapreduce.Job: Running job: job_1684218082735_7133
2023-06-15 18:38:25,692 INFO mapreduce.Job: Job job_1684218082735_7133 running in uber mode : false
2023-06-15 18:38:25,695 INFO mapreduce.Job: map 0% reduce 0%
2023-06-15 18:38:27,756 INFO mapreduce.Job: Task Id : attempt_1684218082735_7133_m_000000_0, Status : FAILED
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:326)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:539)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:465)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:349)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:178)
at java.base/java.security.AccessController.doPrivileged(Native Method)
at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1845)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:172)
2023-06-15 18:38:28,780 INFO mapreduce.Job: map 50% reduce 0%
2023-06-15 18:38:30,797 INFO mapreduce.Job: Task Id : attempt_1684218082735_7133_m_000000_1, Status : FAILED
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
    (stack trace identical to the first failed attempt)
2023-06-15 18:38:33,821 INFO mapreduce.Job: Task Id : attempt_1684218082735_7133_m_000000_2, Status : FAILED
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
    (stack trace identical to the first failed attempt)
2023-06-15 18:38:37,846 INFO mapreduce.Job: map 100% reduce 100%
2023-06-15 18:38:37,858 INFO mapreduce.Job: Job job_1684218082735_7133 failed with state FAILED due to: Task failed task_1684218082735_7133_m_000000
Job failed as tasks failed. failedMaps:1 failedReduces:0 killedMaps:0 killedReduces: 0
2023-06-15 18:38:37,959 INFO mapreduce.Job: Counters: 42
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=352981
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=645602
HDFS: Number of bytes written=0
HDFS: Number of read operations=3
HDFS: Number of large read operations=0
HDFS: Number of write operations=0
HDFS: Number of bytes read erasure-coded=0
Job Counters
Failed map tasks=4
Killed reduce tasks=1
Launched map tasks=5
Launched reduce tasks=1
Other local map tasks=3
Rack-local map tasks=2
Total time spent by all maps in occupied slots (ms)=43550
Total time spent by all reduces in occupied slots (ms)=34160
Total time spent by all map tasks (ms)=8710
Total time spent by all reduce tasks (ms)=6832
Total vcore-milliseconds taken by all map tasks=8710
Total vcore-milliseconds taken by all reduce tasks=6832
Total megabyte-milliseconds taken by all map tasks=44595200
Total megabyte-milliseconds taken by all reduce tasks=34979840
Map-Reduce Framework
Map input records=5004
Map output records=4960
Map output bytes=74400
Map output materialized bytes=84326
Input split bytes=99
Combine input records=0
Spilled Records=4960
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=18
CPU time spent (ms)=810
Physical memory (bytes) snapshot=355909632
Virtual memory (bytes) snapshot=6371753984
Total committed heap usage (bytes)=1052770304
Peak Map Physical memory (bytes)=355909632
Peak Map Virtual memory (bytes)=6371753984
File Input Format Counters
Bytes Read=645503
2023-06-15 18:38:37,960 ERROR streaming.StreamJob: Job not successful!
Streaming Command Failed!
Does anyone know what may be causing this error?
I have tested it many times, created samples of the data and run tests on those, but with no luck. Locally everything works fine, but as soon as I run it through Hadoop Streaming it stops working.
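For reference, this is roughly how I tested the logic locally: a self-contained sketch that re-implements the mapper and reducer as functions and runs them over a made-up 21-column sample (the dates, values, and all column names other than the two the job actually uses are placeholders I invented for the test).

```python
# The two real column names from the data; the other 19 are placeholders
# to match the expected 21-field layout.
header_cols = ['YearMonthDay', 'Wind Speed (kt)'] + ['col%d' % i for i in range(19)]

sample_rows = [
    ['20230101', '10'] + [''] * 19,
    ['20230101', '4'] + [''] * 19,
    ['20230102', '-'] + [''] * 19,   # '-' means missing, the mapper should skip it
    ['20230102', '7'] + [''] * 19,
    ['20230102', '12'] + [''] * 19,
]

def run_mapper(lines):
    """Same logic as mapper.py, but over a list of lines instead of stdin."""
    it = iter(lines)
    header = [x.strip() for x in next(it).split(',')]
    date_index = header.index('YearMonthDay')
    wind_speed_index = header.index('Wind Speed (kt)')
    out = []
    for line in it:
        fields = line.strip().split(',')
        if len(fields) != 21:       # skip malformed rows
            continue
        if fields[wind_speed_index] == '-':  # skip missing wind speeds
            continue
        out.append('%s\t%s' % (fields[date_index], fields[wind_speed_index]))
    return out

def run_reducer(lines):
    """Same logic as reducer.py; expects input sorted by key."""
    results = []
    current_date = None
    max_wind_speed = 0
    min_wind_speed = float('inf')
    for line in lines:
        date, wind_speed = line.strip().split('\t', 1)
        try:
            wind_speed = float(wind_speed)
        except ValueError:
            continue
        if current_date == date:
            max_wind_speed = max(max_wind_speed, wind_speed)
            min_wind_speed = min(min_wind_speed, wind_speed)
        else:
            if current_date:
                results.append('%s\t%s' % (current_date, max_wind_speed - min_wind_speed))
            max_wind_speed = min_wind_speed = wind_speed
            current_date = date
    if current_date:
        results.append('%s\t%s' % (current_date, max_wind_speed - min_wind_speed))
    return results

# Simulate map -> shuffle/sort -> reduce
lines = [','.join(header_cols)] + [','.join(r) for r in sample_rows]
reduced = run_reducer(sorted(run_mapper(lines)))
print(reduced)  # prints ['20230101\t6.0', '20230102\t5.0']
```

This produces the expected max-minus-min value per day, which is why I am confused that the same scripts fail inside the streaming job.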