lxml and tags

Question

lxml and tags

988 Views Asked by bukzor At 20 August 2025 at 02:45

By default lxml doesn't understsand the wbr tag, used to add word-breaks in long words. It formats it as  when it should be formatted simply as , similar to the br tag.

How do I add this behavior to lxml?

Original Q&A

There are 4 best solutions below

Eevee On 23 May 2012 at 22:04

Good news! This is totally impossible. HTML tag names are baked right into libxml2.

And lxml.html.html5parser contains a couple severe bugs whose fixes haven't yet made it into a release.

But heck, let's fix them locally and see what happens.

>>> lxml.html.tostring(lxml.html.html5parser.fromstring('<p>hello<wbr>world!</p>'), encoding=unicode)
u'<html:p xmlns:html="http://www.w3.org/1999/xhtml">hello<html:wbr></html:wbr>world!</html:p>'

So close, and yet so far. The structure is correct, at least.

One more try:

>>> lxml.html.tostring(lxml.html.html5parser.fromstring('<p>hello<wbr>world!</p>', parser=lxml.html.html5parser.HTMLParser(namespaceHTMLElements=False)), encoding=unicode)
u'<p>hello<wbr></wbr>world!</p>'

Welp.

It's not wrong, at least.

I think I might go file some bugs against lxml and libxml2.

Roland Smith On 29 May 2012 at 19:57

As a quick fix, why not use the replace method of strings to remove the close tags?

>>> t = 'Thisisa<wbr></wbr>test'
>>> t.replace('</wbr>', '')
'Thisisa<wbr>test'

Eevee On 26 April 2012 at 22:19

Since  only exists in HTML5, I suspect the Right Thing to do is use lxml.html.html5parser.

Short of that, the list of empty tags is defined in regular Python code, so you could always just monkeypatch it; see lxml.html.defs.empty_tags. Patches are welcome, I'm sure. :)

**Anthon** · Accepted Answer

Actually it is not difficult to patch libxml2 (this walkthrough was done on Ubuntu 11.04 with Python 2.7.3)

First define a test program wbr_test.py:

from lxml import etree
from cStringIO import StringIO

wbr_html = """\
<html>
  <head>
    <title>wbr test</title>
  </head>
<body>
  Test for a breakable<wbr>word implemenation change
</body>
</html>
"""

parser = etree.HTMLParser()
tree   = etree.parse(StringIO(wbr_html), parser)

result = etree.tostring(tree.getroot(),
                         pretty_print=True, method="html")
if result.split() != wbr_html.split(): # split, as we are not interested in whitespace differences
    print(result)
    print("not ok")
else:
    print("OK")

Make sure that it fails by running python wbr_test.py. It should insert a <\wbr> before <\body>, and print not ok at the end.

Download, extract and compile libxml2:

wget ftp://xmlsoft.org/libxml2/libxml2-2.8.0.tar.gz
tar xvf libxml2-2.8.0.tar.gz 
cd libxml2-2.8.0/
./configure --prefix=/usr
make -j8  # adjust number to match your number of cores

Install, and install python libxml2 bindings:

sudo make install
cd to_python_bindings
sudo python setup.py install

Test your wbr_test.py once more, to make sure it fails with the latest libxml2 version.

First make a copy of HTMLparser.c e.g. in /var/tmp.

Now edit the the file HTMLparser.c at the toplevel of the libxml2 source. Search for the word forced (only one occurrence). You will be at the   tag definition. Copy the three lines starting with the line you just found. The most appropriate insert point is just before the end (after the definition of <var>). To get the final comma right in the table insert the three lines before the one with just '}' not the one with '};'.

In the newly inserted code Replace br with wbr and change DECL clear_attrs to NULL (assuming that a new tag does not have deprecated attributes).

The result should diff with the version in /var/tmp ( diff -u HTMLparser.c /var/tmp) as follows:

@@ -1039,6 +1039,9 @@
 },
 { "var",   0, 0, 0, 0, 0, 0, 1, "instance of a variable or program argument",
DECL html_inline, NULL, DECL html_attrs, NULL, NULL
+},
+{ "wbr",   0, 2, 2, 1, 0, 0, 1, "possible line break ",
+   EMPTY , NULL , DECL core_attrs, NULL , NULL
 }
 };

Make and install:

make && sudo make install

Test your wbr_test.py once more. Should show OK

lxml and <wbr> tags

There are 4 best solutions below

Related Questions in PYTHON

Related Questions in HTML

Related Questions in LXML

Related Questions in WBR

Trending Questions

Popular # Hahtags

Popular Questions