Can the lxml / Scrapy selector keep HTML entities instead of converting them?


How can I get the original HTML entities back after using the lxml / Scrapy selector's xpath?

I've already tried lxml directly instead of the parsel package; same issue.

import parsel

mytext = '<html><body><span>go&nbsp;od</span></body></html>'
sel = parsel.Selector(text=mytext)
sel.xpath('//body').extract()

Actual output:

['<body><span>go\xa0od</span></body>']

Expected output:

['<body><span>go&nbsp;od</span></body>']

The &nbsp; got converted; how can I keep it as it is?


There is 1 answer below

Answered by Rithin Chalumuri

lxml decodes character entities such as &nbsp; into their Unicode equivalents (\xa0 in this case) while parsing, so the markup returned by the .extract() and .getall() methods no longer contains the original entity.

The .extract_first() and .get() methods return only the first item in that list. When you print the result, \xa0 is rendered as an ordinary space, but the character is still present in the string.

print(sel.xpath('//body').get())

Outputs:

<body><span>go od</span></body>
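
For example, inspecting the same string with repr() (an illustrative check, not part of the answer's code) shows that the non-breaking space is still in the value:

print(repr(sel.xpath('//body').get()))

Outputs:

'<body><span>go\xa0od</span></body>'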

But if you really want &nbsp; characters instead of a plain space or \xa0, one solution is to do a regular string replace for those characters.

Example:

body = sel.xpath('//body').extract()
result = [i.replace('\xa0', '&nbsp;') for i in body]
print(result)

Outputs:

['<body><span>go&nbsp;od</span></body>']
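
As an aside, and not part of the answer above: if numeric character references are acceptable instead of the named &nbsp; entity, Python's built-in xmlcharrefreplace error handler can re-escape every non-ASCII character in one pass, without listing each one by hand. A minimal sketch, reusing the body list from the example above; note it produces &#160; rather than &nbsp;:

escaped = [i.encode('ascii', 'xmlcharrefreplace').decode('ascii') for i in body]
print(escaped)

Outputs:

['<body><span>go&#160;od</span></body>']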