Can the lxml / Scrapy selector keep HTML entities instead of converting them?


How can I get the original HTML entities back after using the lxml / Scrapy selector's xpath?

I've already tried lxml directly instead of the parsel package; same issue.

import parsel

mytext = '<html><body><span>go&nbsp;od</span></body></html>'
sel = parsel.Selector(text=mytext)
sel.xpath('//body').extract()

Actual output:

['<body><span>go\xa0od</span></body>']

Expected output:

['<body><span>go&nbsp;od</span></body>']

The &nbsp; got converted; how can I keep it as it is?


There is 1 answer below

Answered by Rithin Chalumuri

lxml decodes character entities such as &nbsp; into their Unicode equivalents (\xa0 in this case) while parsing, so the markup returned by the .extract() and .getall() methods no longer contains the original entity.

The .extract_first() and .get() methods return only the first item in that list. When you print the result, \xa0 is rendered as an ordinary space, but the character is still present in the string.

print(sel.xpath('//body').get())

Outputs:

<body><span>go od</span></body>
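
For example, inspecting the same string with repr() (an illustrative check, not part of the answer's code) shows that the non-breaking space is still in the value:

print(repr(sel.xpath('//body').get()))

Outputs:

'<body><span>go\xa0od</span></body>'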

But if you really want &nbsp; characters instead of a plain space or \xa0, one solution is to do a regular string replace for those characters.

Example:

body = sel.xpath('//body').extract()
result = [i.replace('\xa0', '&nbsp;') for i in body]
print(result)

Outputs:

['<body><span>go&nbsp;od</span></body>']
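
As an aside, and not part of the answer above: if numeric character references are acceptable instead of the named &nbsp; entity, Python's built-in xmlcharrefreplace error handler can re-escape every non-ASCII character in one pass, without listing each one by hand. A minimal sketch, reusing the body list from the example above; note it produces &#160; rather than &nbsp;:

escaped = [i.encode('ascii', 'xmlcharrefreplace').decode('ascii') for i in body]
print(escaped)

Outputs:

['<body><span>go&#160;od</span></body>']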