lxml and <wbr> tags

999 Views Asked by At

By default lxml doesn't understsand the wbr tag, used to add word-breaks in long words. It formats it as <wbr></wbr> when it should be formatted simply as <wbr>, similar to the br tag.

How do I add this behavior to lxml?

4

There are 4 best solutions below

6
On BEST ANSWER

Actually it is not difficult to patch libxml2 (this walkthrough was done on Ubuntu 11.04 with Python 2.7.3)

First define a test program wbr_test.py:

from lxml import etree
from cStringIO import StringIO

wbr_html = """\
<html>
  <head>
    <title>wbr test</title>
  </head>
<body>
  Test for a breakable<wbr>word implemenation change
</body>
</html>
"""

parser = etree.HTMLParser()
tree   = etree.parse(StringIO(wbr_html), parser)

result = etree.tostring(tree.getroot(),
                         pretty_print=True, method="html")
if result.split() != wbr_html.split(): # split, as we are not interested in whitespace differences
    print(result)
    print("not ok")
else:
    print("OK")

Make sure that it fails by running python wbr_test.py. It should insert a <\wbr> before <\body>, and print not ok at the end.

Download, extract and compile libxml2:

wget ftp://xmlsoft.org/libxml2/libxml2-2.8.0.tar.gz
tar xvf libxml2-2.8.0.tar.gz 
cd libxml2-2.8.0/
./configure --prefix=/usr
make -j8  # adjust number to match your number of cores

Install, and install python libxml2 bindings:

sudo make install
cd to_python_bindings
sudo python setup.py install

Test your wbr_test.py once more, to make sure it fails with the latest libxml2 version.

First make a copy of HTMLparser.c e.g. in /var/tmp.

Now edit the the file HTMLparser.c at the toplevel of the libxml2 source. Search for the word forced (only one occurrence). You will be at the <br> tag definition. Copy the three lines starting with the line you just found. The most appropriate insert point is just before the end (after the definition of <var>). To get the final comma right in the table insert the three lines before the one with just '}' not the one with '};'.

In the newly inserted code Replace br with wbr and change DECL clear_attrs to NULL (assuming that a new tag does not have deprecated attributes).

The result should diff with the version in /var/tmp ( diff -u HTMLparser.c /var/tmp) as follows:

@@ -1039,6 +1039,9 @@
 },
 { "var",   0, 0, 0, 0, 0, 0, 1, "instance of a variable or program argument",
DECL html_inline, NULL, DECL html_attrs, NULL, NULL
+},
+{ "wbr",   0, 2, 2, 1, 0, 0, 1, "possible line break ",
+   EMPTY , NULL , DECL core_attrs, NULL , NULL
 }
 };

Make and install:

make && sudo make install

Test your wbr_test.py once more. Should show OK

0
On

Good news! This is totally impossible. HTML tag names are baked right into libxml2.

And lxml.html.html5parser contains a couple severe bugs whose fixes haven't yet made it into a release.

But heck, let's fix them locally and see what happens.

>>> lxml.html.tostring(lxml.html.html5parser.fromstring('<p>hello<wbr>world!</p>'), encoding=unicode)
u'<html:p xmlns:html="http://www.w3.org/1999/xhtml">hello<html:wbr></html:wbr>world!</html:p>'

So close, and yet so far. The structure is correct, at least.

One more try:

>>> lxml.html.tostring(lxml.html.html5parser.fromstring('<p>hello<wbr>world!</p>', parser=lxml.html.html5parser.HTMLParser(namespaceHTMLElements=False)), encoding=unicode)
u'<p>hello<wbr></wbr>world!</p>'

Welp.

It's not wrong, at least.

I think I might go file some bugs against lxml and libxml2.

0
On

As a quick fix, why not use the replace method of strings to remove the close tags?

>>> t = 'Thisisa<wbr></wbr>test'
>>> t.replace('</wbr>', '')
'Thisisa<wbr>test'
2
On

Since <wbr> only exists in HTML5, I suspect the Right Thing to do is use lxml.html.html5parser.

Short of that, the list of empty tags is defined in regular Python code, so you could always just monkeypatch it; see lxml.html.defs.empty_tags. Patches are welcome, I'm sure. :)