I'm trying to convert XML files through a MapReduce job and receive this error:
2023-04-04 09:41:52,515 INFO mapreduce.Job: map 0% reduce 0%
2023-04-04 09:42:12,676 INFO mapreduce.Job: Task Id : attempt_1680592009322_0021_m_000000_0, Status : FAILED
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
I'm launching it with the command:
yarn jar /opt/hadoop-3.2.1/share/hadoop/tools/lib/hadoop-streaming-3.2.1.jar \
-file "script/mapper_convert_xml.py" -mapper "python3 mapper_convert_xml.py" \
-file "script/reducer_convert_xml.py" -reducer "python3 reducer_convert_xml.py" \
-input /input/philharmonie_data/AIC94.xml -output /output/philharmonie_data
My scripts have the #!/usr/bin/env python header, and I applied chmod 744 to the mapper, the reducer, and the ElementTree.py script.
The scripts run fine locally with the command line:
cat /home/philharmonie_data/AIC94.xml | python3 /script/mapper_convert_xml.py | python3 /script/reducer_convert_xml.py
Here is my mapper script:
#!/usr/bin/env python
from operator import itemgetter
import xml.etree.ElementTree as ET
import sys
import json

parser = ET.XMLParser(encoding="utf-8")
tree = ET.fromstring(sys.stdin.read().encode("utf-8", "replace"))
for element in tree:
    child_values = ""
    for child in element:
        if child.tag == "{http://www.loc.gov/MARC21/slim}controlfield":
            child_values = child_values + child.attrib["tag"] + "\t" + child.text + "_|_"
        if child.tag == "{http://www.loc.gov/MARC21/slim}datafield":
            for field in child:
                code_uni = child.attrib["tag"] + "$" + field.attrib["code"]
                value = field.text
                if code_uni is not None:
                    child_values = child_values + code_uni + "\t" + value + "_|_"
    print(child_values)
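Since "subprocess failed with code 1" hides the actual Python traceback, I also tried a sketch of the same mapper restructured so that any exception is printed to stderr, where it shows up in the YARN task logs. The parse_records helper is a name I introduce for illustration, and field.text / child.text are guarded against None, which would otherwise raise a TypeError on concatenation:

```python
#!/usr/bin/env python3
import sys
import traceback
import xml.etree.ElementTree as ET

NS = "{http://www.loc.gov/MARC21/slim}"

def parse_records(xml_text):
    """Yield one tab-separated line per record, as in the original mapper."""
    tree = ET.fromstring(xml_text)
    for element in tree:
        child_values = ""
        for child in element:
            if child.tag == NS + "controlfield":
                # child.text can be None for an empty field; guard it
                child_values += child.attrib["tag"] + "\t" + (child.text or "") + "_|_"
            if child.tag == NS + "datafield":
                for field in child:
                    code_uni = child.attrib["tag"] + "$" + field.attrib["code"]
                    child_values += code_uni + "\t" + (field.text or "") + "_|_"
        yield child_values

if __name__ == "__main__":
    data = sys.stdin.read() if not sys.stdin.isatty() else ""
    if data.strip():
        try:
            for line in parse_records(data):
                print(line)
        except Exception:
            # the full traceback lands in the YARN task's stderr log
            traceback.print_exc(file=sys.stderr)
            sys.exit(1)
```

This way the task still fails, but the stderr log of the failed attempt shows the real exception instead of only the exit code.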
I've tried chmod +x on the scripts and adding the #!/usr/bin/env python3 header to them.
Using -files instead of -file on the command line doesn't change the error either.
Following this question, I tried running this command, again with no change in the error:
yarn jar /opt/hadoop-3.2.1/share/hadoop/tools/lib/hadoop-streaming-3.2.1.jar \
-file "script/mapper_convert_xml.py" -mapper "python3 mapper_convert_xml.py --python-bin /usr/bin/python3" \
-file "script/reducer_convert_xml.py" -reducer "python3 reducer_convert_xml.py --python-bin /usr/bin/python3" \
-input /input/philharmonie_data/AIC94.xml -output /output/philharmonie_data
Edit: I also tried re-wrapping the standard streams, with no change:
import io
# Set encoding explicitly
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')
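On the encoding angle, errors="replace" on a TextIOWrapper keeps undecodable bytes from raising a UnicodeDecodeError; a minimal sketch demonstrating this on an in-memory stream (sys.stdin can be re-wrapped the same way as sys.stdout above):

```python
import io

# A TextIOWrapper with errors="replace" turns undecodable bytes into U+FFFD
# instead of raising; \xe9 is latin-1 and invalid as a UTF-8 sequence here.
raw = io.BytesIO(b"caf\xe9 records")
wrapped = io.TextIOWrapper(raw, encoding="utf-8", errors="replace")
text = wrapped.read()

# sys.stdin could be re-wrapped the same way:
#   sys.stdin = io.TextIOWrapper(sys.stdin.buffer, encoding="utf-8", errors="replace")
```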
Since the wordcount example runs fine, I suspect the import xml.etree.ElementTree as ET line breaks the code on the cluster. Any idea how to make the job work? Thanks