I am trying to write a mapreduce job for WARC files using the warc library of Python. The following code works for me, but I need it to run as a hadoop mapreduce job:
import warc
f = warc.open("test.warc.gz")
for record in f:
    print record['WARC-Target-URI'], record['Content-Length']
I want this code to read streaming input from warc files, i.e.
zcat test.warc.gz | warc_reader.py
Kindly tell me how I can modify this code for streaming inputs. Thanks
warc.open() is a shorthand for warc.WARCFile(), and warc.WARCFile() can receive a fileobj argument, where sys.stdin is exactly a file object. So what you need to do is something as simple as this:
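import sys
import warc

# read the already-decompressed WARC stream from stdin,
# e.g. zcat test.warc.gz | python warc_reader.py
f = warc.WARCFile(fileobj=sys.stdin)
for record in f:
    print record['WARC-Target-URI'], record['Content-Length']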
But things are a little bit difficult under hadoop streaming when your input file is a .gz, as hadoop will replace every \r\n in the WARC file with \n, which breaks the WARC format (refer to this question: hadoop converting \r\n to \n and breaking ARC format). As the warc package uses the regular expression "WARC/(\d+.\d+)\r\n" to match headers (matching \r\n exactly), you will probably get a header-parsing error.

So you will either have to modify your PipeMapper.java file as recommended in the referred question, or write your own parsing script that parses the WARC file line by line (see the sketch at the end of this answer).

BTW, simply modifying warc.py to use \n instead of \r\n when matching headers won't work, because it reads content of exactly Content-Length bytes and expects two empty lines after that. What hadoop does will therefore make the length of the content mismatch the Content-Length attribute and cause another parsing error.
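For the second option, here is a minimal sketch of such a line-by-line parser. It only extracts the header fields (it skips over payload lines rather than reading Content-Length bytes), and it tolerates both \r\n and \n line endings, so it keeps working after hadoop's rewriting:

import sys

def read_headers(stream):
    # collect "Key: Value" header lines until the first blank line
    headers = {}
    for line in stream:
        line = line.rstrip('\r\n')
        if not line:
            break
        if ':' in line:
            key, value = line.split(':', 1)
            headers[key.strip()] = value.strip()
    return headers

for line in sys.stdin:
    # a version line such as "WARC/1.0" marks the start of a record;
    # note that a payload line starting with "WARC/" would fool this
    # check, which is one reason this is only a sketch
    if line.startswith('WARC/'):
        headers = read_headers(sys.stdin)
        print headers.get('WARC-Target-URI'), headers.get('Content-Length')

You can run it exactly the way you wanted, zcat test.warc.gz | python warc_reader.py, or plug it in as the mapper of your streaming job.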