By default lxml doesn't understand the wbr tag, used to add word-breaks in long words. It formats it as <wbr></wbr> when it should be formatted simply as <wbr>, similar to the br tag.
How do I add this behavior to lxml?
Good news! This is totally impossible. HTML tag names are baked right into libxml2.
And lxml.html.html5parser contains a couple severe bugs whose fixes haven't yet made it into a release.
But heck, let's fix them locally and see what happens.
>>> import lxml.html, lxml.html.html5parser
>>> lxml.html.tostring(lxml.html.html5parser.fromstring('<p>hello<wbr>world!</p>'), encoding=unicode)
u'<html:p xmlns:html="http://www.w3.org/1999/xhtml">hello<html:wbr></html:wbr>world!</html:p>'
So close, and yet so far. The structure is correct, at least.
One more try:
>>> lxml.html.tostring(lxml.html.html5parser.fromstring('<p>hello<wbr>world!</p>', parser=lxml.html.html5parser.HTMLParser(namespaceHTMLElements=False)), encoding=unicode)
u'<p>hello<wbr></wbr>world!</p>'
Welp.
It's not wrong, at least.
I think I might go file some bugs against lxml and libxml2.
As a quick fix, why not use the replace method of strings to remove the close tags?
>>> t = 'Thisisa<wbr></wbr>test'
>>> t.replace('</wbr>', '')
'Thisisa<wbr>test'
Since <wbr> only exists in HTML5, I suspect the Right Thing to do is use lxml.html.html5parser.
Short of that, the list of empty tags is defined in regular Python code, so you could always just monkeypatch it; see lxml.html.defs.empty_tags. Patches are welcome, I'm sure. :)
Actually it is not difficult to patch libxml2 (this walkthrough was done on Ubuntu 11.04 with Python 2.7.3).
First define a test program, wbr_test.py.
Make sure that it fails by running python wbr_test.py. It should insert a </wbr> before </body>, and print not ok at the end.
Download, extract and compile libxml2.
Install it, then install the Python libxml2 bindings.
Test your wbr_test.py once more, to make sure it fails with the latest libxml2 version.
Before editing, make a copy of HTMLparser.c, e.g. in /var/tmp.
Now edit the file HTMLparser.c at the top level of the libxml2 source. Search for the word forced (only one occurrence). You will be at the <br> tag definition. Copy the three lines starting with the line you just found. The most appropriate insert point is just before the end of the table (after the definition of <var>). To get the final comma right in the table, insert the three lines before the line containing just '}', not the one with '};'.
In the newly inserted code, replace br with wbr and change DECL clear_attrs to NULL (assuming that a new tag does not have deprecated attributes).
The result should differ from the copy in /var/tmp (diff -u HTMLparser.c /var/tmp) only in the three inserted lines.
Make and install again.
Test your wbr_test.py once more. It should now show OK.
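For reference, here is a rough sketch of what a test program along the lines of wbr_test.py could look like. It is not the original listing. It assumes the libxml2 Python bindings are importable as libxml2, and the sample markup, the verdict helper and the exact messages are my own choices for illustration.

```python
# Sample document containing the tag under test (an assumption, not the
# original test input).
SAMPLE = "<html><body><p>hello<wbr>world!</p></body></html>"


def verdict(serialized):
    """Return 'OK' if no closing </wbr> was emitted, else 'not ok'."""
    if isinstance(serialized, bytes):
        serialized = serialized.decode("utf-8", "replace")
    return "not ok" if "</wbr>" in serialized.lower() else "OK"


def serialize_with_libxml2(markup):
    """Parse markup with libxml2's HTML parser and serialize it again."""
    # Imported here so verdict() stays usable without the bindings.
    import libxml2

    doc = libxml2.htmlParseDoc(markup, "utf-8")
    try:
        return doc.serialize()
    finally:
        doc.freeDoc()  # libxml2 documents must be freed explicitly


def main():
    out = serialize_with_libxml2(SAMPLE)
    print(out)
    print(verdict(out))


if __name__ == "__main__":
    try:
        main()
    except ImportError:
        print("libxml2 Python bindings are not installed; nothing to test")
```

The check only looks for a closing </wbr> anywhere in the output, which covers both the <wbr></wbr> form and the case where the close tag lands just before </body>.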