I am using BeautifulSoup to parse a bunch of possibly very dirty HTML documents. I stumbled upon a very bizarre thing.
The HTML comes from this page: http://www.wvdnr.gov/
It contains multiple errors, such as multiple <html></html> pairs, a <title> outside the <head>, and so on.
However, html5lib usually works well even in these cases. In fact, when I do:
soup = BeautifulSoup(document, "html5lib")
and I pretty-print soup, I see the following output: http://pastebin.com/8BKapx88
which contains a lot of <a> tags.
However, when I do soup.find_all("a"), I get an empty list. With lxml I get the same result.
So: has anybody stumbled on this problem before? What is going on? How do I get the links that html5lib found but isn't returning with find_all?
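For completeness, a minimal sketch of what I'm doing (fetching the page with urllib.request is a simplification here; the question doesn't show how the document was loaded):

from urllib.request import urlopen
from bs4 import BeautifulSoup

document = urlopen("http://www.wvdnr.gov/").read()

soup = BeautifulSoup(document, "html5lib")
print(soup.find_all("a"))   # [] -- empty, even though the pretty-printed soup shows <a> tags

soup = BeautifulSoup(document, "lxml")
print(soup.find_all("a"))   # [] here too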
When it comes to parsing malformed and tricky HTML, the choice of parser is very important:
html.parser worked for me. Demo:
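A reconstructed sketch of that demo (the original snippet didn't survive the formatting; fetching the page with urllib.request is my assumption):

from urllib.request import urlopen
from bs4 import BeautifulSoup

document = urlopen("http://www.wvdnr.gov/").read()

# html.parser is Python's built-in parser, so no extra dependency is needed
soup = BeautifulSoup(document, "html.parser")

links = soup.find_all("a")
print(len(links))           # non-empty with html.parser, per the answer above
for link in links[:5]:
    print(link.get("href"))

Note that html.parser, lxml, and html5lib can each build a different tree from the same broken markup, so when one parser yields a soup that find_all can't search, trying another parser is a reasonable first step.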