'Search for pattern exhausted' when loading a WARC file with warcat in Python 3


I'm trying to extract plain text from a WARC dataset (Yahoo! Webscope L2), and I keep getting ValueError: Search for pattern exhausted when calling the load() function of the Python 3 module warcat. I have tried a few other example WARC files, and they all loaded without problems.

The dataset does ask for a further license agreement to be submitted (after which a password would be provided, according to the readme file; do WARC files come with passwords?), but for now I have no way to send a fax.

I also checked the warcat source code and found that the ValueError is raised when file_obj.read(size) returns a falsy (empty) value. That doesn't make sense to me, so I'm asking here.
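To make the failure mode concrete, here is a minimal sketch of a chunked pattern search over a file object. It is not warcat's actual code: the function name find_pattern, the chunk size, and the sample inputs are all made up for illustration. It only shows the general idea that read() returns an empty bytes object at end of file, so a search that never sees the delimiter eventually runs out of data.

```python
import io
import re


def find_pattern(file_obj, pattern, chunk_size=4096):
    """Return the offset just past the first match of `pattern`.

    Hypothetical helper for illustration only; not warcat's implementation.
    """
    start = file_obj.tell()
    buffer = b""
    while True:
        chunk = file_obj.read(chunk_size)
        if not chunk:
            # read() returned b'': end of file reached without a match.
            raise ValueError("Search for pattern exhausted")
        buffer += chunk
        match = re.search(pattern, buffer)
        if match:
            return start + match.end()


# A well-formed record header ends with a blank line (CRLF CRLF).
good = io.BytesIO(b"WARC/1.0\r\nWARC-Type: warcinfo\r\n\r\nbody")
print(find_pattern(good, rb"\r\n\r\n"))

# Data without that delimiter exhausts the search.
bad = io.BytesIO(b"\x00\x01\x02 not a WARC header")
try:
    find_pattern(bad, rb"\r\n\r\n")
except ValueError as exc:
    print(exc)
```

If the real file behaves like the second input, the error would suggest that the expected header delimiter never appears in it, for example because the content is compressed, encrypted, or not WARC at all.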

The code:

>>> import warcat
>>> import warcat.model
>>> warc = warcat.model.WARC()
>>> warc.load('ydata-embedded-metadata-v1_0.warc')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.4/site-packages/warcat/model/warc.py", line 32, in load
    self.read_file_object(f)
  File "/usr/local/lib/python3.4/site-packages/warcat/model/warc.py", line 39, in read_file_object
    record, has_more = self.read_record(file_object)
  File "/usr/local/lib/python3.4/site-packages/warcat/model/warc.py", line 75, in read_record
    check_block_length=check_block_length)
  File "/usr/local/lib/python3.4/site-packages/warcat/model/record.py", line 59, in load
    inclusive=True)
  File "/usr/local/lib/python3.4/site-packages/warcat/util.py", line 66, in find_file_pattern
    raise ValueError('Search for pattern exhausted')
ValueError: Search for pattern exhausted
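One quick check before digging further is to look at the first few bytes of the file. The sketch below assumes only well-known signatures: an uncompressed WARC record starts with the version line "WARC/", a gzip stream starts with the bytes 1f 8b, and a ZIP archive starts with "PK\x03\x04". The names classify_header and sniff_file are hypothetical helpers, not part of warcat.

```python
def classify_header(head):
    """Guess the container type from the leading bytes of a file."""
    if head.startswith(b"WARC/"):
        return "plain WARC"
    if head.startswith(b"\x1f\x8b"):
        return "gzip"
    if head.startswith(b"PK\x03\x04"):
        return "zip"
    # Anything else may be encrypted, a different archive type, or corrupt.
    return "unknown"


def sniff_file(path, size=8):
    """Read the first `size` bytes of `path` and classify them."""
    with open(path, "rb") as f:
        return classify_header(f.read(size))
```

Calling sniff_file('ydata-embedded-metadata-v1_0.warc') would show whether the file really starts with a WARC version line. A result of "unknown" would be consistent with the password mentioned in the readme, though that is only a guess.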

Thanks in advance.
