I am using PyQuery to process a large amount of documents from the Web. PyQuery uses lxml to parse the HTML documents.
As a matter of fact, a lot of the documents are not valid HTML. As a consequence, those invalid documents cannot be successfully parsed by lxml, which prevents me from getting the information further. And the the following exceptions are raised quite often:
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/twisted/internet/base.py", line 1201, in mainLoop
self.runUntilCurrent()
File "/usr/local/lib/python2.7/dist-packages/twisted/internet/base.py", line 824, in runUntilCurrent
call.func(*call.args, **call.kw)
File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 382, in callback
self._startRunCallbacks(result)
File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 490, in _startRunCallbacks
self._runCallbacks()
--- <exception caught here> ---
File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 577, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "/home/hxiao/hiit/crawl/crawl/spiders/basic.py", line 40, in parse
doc = pq(response.body)
File "/usr/local/lib/python2.7/dist-packages/pyquery/pyquery.py", line 226, in __init__
elements = fromstring(context, self.parser)
File "/usr/local/lib/python2.7/dist-packages/pyquery/pyquery.py", line 70, in fromstring
result = getattr(lxml.html, meth)(context)
File "/usr/local/lib/python2.7/dist-packages/lxml/html/__init__.py", line 706, in fromstring
doc = document_fromstring(html, parser=parser, base_url=base_url, **kw)
File "/usr/local/lib/python2.7/dist-packages/lxml/html/__init__.py", line 600, in document_fromstring
value = etree.fromstring(html, parser, **kw)
File "lxml.etree.pyx", line 3032, in lxml.etree.fromstring (src/lxml/lxml.etree.c:68121)
File "parser.pxi", line 1786, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:102470)
File "parser.pxi", line 1674, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:101299)
File "parser.pxi", line 1074, in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:96481)
File "parser.pxi", line 582, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:91290)
File "parser.pxi", line 683, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:92476)
File "parser.pxi", line 631, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:91904)
lxml.etree.XMLSyntaxError: line 649: htmlParseEntityRef: expecting ';'
What I am asking:
I would like a way to let lxml
to parse in a less strict way so that this invalidity can be ignored.
This answer may not be very helpful, but I investigated similar problem.
Maybe you can have a look at this tip of pyquery ?
http://pythonhosted.org/pyquery/tips.html