How to write a streaming MapReduce job for WARC files in Python


I am trying to write a MapReduce job for WARC files using the warc library for Python. The following code works for me, but I need it to run as a Hadoop MapReduce job.

import warc

# open a local (optionally gzipped) WARC file and walk its records
f = warc.open("test.warc.gz")
for record in f:
    # print the target URI and declared length of each record
    print record['WARC-Target-URI'], record['Content-Length']

I want this code to read streaming input from WARC files, i.e.

zcat test.warc.gz | warc_reader.py

How can I modify this code to read streaming input? Thanks.

1 Answer

warc.open() is shorthand for warc.WARCFile(), and warc.WARCFile() accepts a fileobj argument; sys.stdin is exactly such a file object. So all you need is something like this:

import sys
import warc

# read WARC records from standard input instead of a named file
f = warc.open(fileobj=sys.stdin)
for record in f:
    print record['WARC-Target-URI'], record['Content-Length']

But things get a bit more difficult under Hadoop streaming when your input file is a .gz, because Hadoop replaces every \r\n in the WARC file with \n, which breaks the WARC format (see this question: hadoop converting \r\n to \n and breaking ARC format). Since the warc package uses the regular expression "WARC/(\d+.\d+)\r\n" to match header lines (matching \r\n exactly), you will probably get this error:

IOError: Bad version line: 'WARC/1.0\n'

So you can either modify your PipeMapper.java file as recommended in the referenced question, or write your own parsing script that parses the WARC file line by line, as sketched below.
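As a starting point for the second option, here is a minimal sketch of such a line-by-line parser (a hypothetical illustration, not part of the warc package). It treats any line matching WARC/x.y as the start of a new record and accepts \n as well as \r\n, so it survives Hadoop's rewriting. Note that it will misfire if a record body happens to contain a line that looks like a WARC version line, and it ignores Content-Length entirely:

import re
import sys

# a 'WARC/x.y' line starts a new record; accept \n as well as \r\n
VERSION_RE = re.compile(r'^WARC/\d+\.\d+\r?\n?$')

def read_records(stream):
    headers, body, in_headers = None, [], False
    for line in stream:
        if VERSION_RE.match(line):
            # a new record starts; emit the previous one, if any
            if headers is not None:
                yield headers, ''.join(body)
            headers, body, in_headers = {}, [], True
        elif in_headers:
            if line.strip() == '':
                in_headers = False  # a blank line ends the header block
            else:
                key, _, value = line.partition(':')
                headers[key.strip()] = value.strip()
        elif headers is not None:
            body.append(line)  # body lines, including trailing blank lines

    if headers is not None:
        yield headers, ''.join(body)

if __name__ == '__main__':
    for headers, body in read_records(sys.stdin):
        print headers.get('WARC-Target-URI'), headers.get('Content-Length')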

By the way, simply modifying warc.py to match \n instead of \r\n in the headers won't work either, because the library reads exactly Content-Length bytes of content and then expects two empty lines after that. Hadoop's rewriting shrinks the content, so its actual length no longer matches the Content-Length header, which causes another error like:

IOError: Expected '\n', found 'abc\n'
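A quick way to see the mismatch (an illustrative sketch, not the library's actual code): every \r\n that Hadoop collapses to \n removes one byte, so reading Content-Length bytes overruns into the lines that follow.

# illustrative only: each '\r\n' collapsed to '\n' removes one byte,
# so the declared Content-Length overshoots the converted body
original = 'line one\r\nline two\r\n'       # 20 bytes, as declared in Content-Length
converted = original.replace('\r\n', '\n')  # what hadoop streaming hands you: 18 bytes
print len(original), len(converted)         # prints: 20 18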