printing html entities using lxml in python

2.1k Views Asked by At

I'm trying to make a div element from the below string with html entities. Since my string contains html entities, & reserved char in the html entity is being escaped as & in the output. Thus html entities are displayed as plain text. How can I avoid this so html entities are rendered properly?

s = 'Actress Adamari López And Amgen Launch Spanish-Language Chemotherapy: Myths Or Facts™ Website And Resources'

div = etree.Element("div")
div.text = s

lxml.html.tostring(div)

output:
<div>Actress Adamari L&amp;#243;pez And Amgen Launch Spanish-Language Chemotherapy: Myths Or Facts&amp;#8482; Website And Resources</div>
1

There are 1 best solutions below

0
On

You can specify encoding while calling tostring():

>>> from lxml.html import fromstring, tostring
>>> s = 'Actress Adamari L&#243;pez And Amgen Launch Spanish-Language Chemotherapy: Myths Or Facts&#8482; Website And Resources'
>>> div = fromstring(s)
>>> print tostring(div, encoding='unicode')
<p>Actress Adamari López And Amgen Launch Spanish-Language Chemotherapy: Myths Or Facts™ Website And Resources</p>

As a side note, you should definitely use lxml.html.tostring() while dealing with HTML data:

Note that you should use lxml.html.tostring and not lxml.tostring. lxml.tostring(doc) will return the XML representation of the document, which is not valid HTML. In particular, things like <script src="..."></script> will be serialized as <script src="..." />, which completely confuses browsers.

Also see: