What I need is a way to use the html5lib parser to generate a real xml.etree.ElementTree. (lxml is not an option for portability reasons.)
ElementTree.parse can take a parser as an optional parameter:
xml.etree.ElementTree.parse(source, parser=None)
but it's not clear what such a parser would look like. Is there a class or object within html5lib I could use for the parser argument? Documentation for both libraries on this issue is thin.
Context:
I have a malformed XHTML file that can't be parsed with ElementTree.parse:
<?xml version="1.0" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<title>Title</title></head>
<body><div class="cls">Note that this br<br>is missing a closing slash</div></body>
</html>
So I used html5lib.parse instead with the default treebuilder="etree" parameter, which worked fine.
But html5lib apparently does not output an xml.etree.ElementTree object, just one with a near-identical API. There are two problems with this:
- html5lib's find does not support the namespaces parameter, making XPath excessively verbose without a clumsy wrapper function.
- The Eclipse debugger does not support drill-through of html5lib etrees.
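For illustration, a wrapper of the kind mentioned in the first point could look like the sketch below. The names qualify, find, and PREFIXES are hypothetical, the sample markup is a well-formed stand-in for the file above, and only simple slash-separated paths are handled.

```python
import xml.etree.ElementTree as etree

# Hypothetical prefix map; it holds only the XHTML namespace used above.
PREFIXES = {"h": "http://www.w3.org/1999/xhtml"}


def qualify(path, prefixes=PREFIXES):
    """Rewrite 'h:div' style steps into ElementTree's '{uri}div' form.

    Only plain slash-separated steps are handled; predicates that
    contain ':' or '/' are out of scope for this sketch.
    """
    steps = []
    for step in path.split("/"):
        prefix, sep, tag = step.partition(":")
        if sep and prefix in prefixes:
            step = "{%s}%s" % (prefixes[prefix], tag)
        steps.append(step)
    return "/".join(steps)


def find(element, path, prefixes=PREFIXES):
    """Stand-in for element.find(path, namespaces=...)."""
    return element.find(qualify(path, prefixes))


# Well-formed stand-in for the XHTML above (the <br> is closed here).
root = etree.fromstring(
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    '<div class="cls">text<br/>more</div></body></html>'
)
div = find(root, "h:body/h:div")
```

Because the prefixes are expanded before the path reaches find, this does not depend on whether the underlying element type accepts a namespaces argument.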
So I cannot use either ElementTree or html5lib alone.
Given xml.etree.ElementTree as etree (as it is commonly imported):
What's returned is not an etree.ElementTree, but rather an etree.Element (this is the same as what etree.fromstring returns; only etree.parse returns an etree.ElementTree). It is genuinely part of the etree module; it's not something with a similar API. The problem you've run into applies to etree.fromstring as much as it does to html5lib.
The Python documentation for xml.etree.ElementTree doesn't mention the namespaces argument; it seems to be an undocumented feature of ElementTree objects (but not Element objects). As such, it's probably not something that should really be relied on! Your best bet is likely going to be to use a wrapper function.
The fact that Eclipse cannot go through the trees is down to the fact that html5lib defaults to xml.etree.cElementTree when it exists, which is meant to be identical, per the module's documentation, but is implemented in C using CPython's API, stopping Eclipse's debugger from functioning. You can get a treebuilder that uses the non-accelerated version by passing the xml.etree.ElementTree module as the implementation argument of html5lib.getTreeBuilder (note that from Python 3.3 both are the C implementation; cElementTree merely survives as a deprecated alias).
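A minimal sketch of that, assuming html5lib is installed; the sample markup and variable names are illustrative, and the builder is passed to html5lib.HTMLParser through its tree argument:

```python
import xml.etree.ElementTree as etree

import html5lib

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Ask html5lib for an etree tree builder backed by xml.etree.ElementTree
# rather than its default choice of implementation.
tb = html5lib.getTreeBuilder("etree", implementation=etree)
parser = html5lib.HTMLParser(tree=tb)

# Illustrative input: the unclosed <br> from the question is tolerated.
root = parser.parse(
    "<title>Title</title>"
    '<div class="cls">Note that this br<br>is missing a closing slash</div>'
)

# parse() hands back the root <html> element, which is an etree.Element.
print(isinstance(root, etree.Element))
print(root.tag)
```

If an etree.ElementTree object is needed rather than the root element, etree.ElementTree(root) wraps it.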