How to keep the original HTML entities after using an lxml / scrapy selector XPath?
I've already tried lxml instead of the parsel package, same issue.
import parsel

mytext = '<html><body><span>go&nbsp;od</span></body></html>'
sel = parsel.Selector(text=mytext)
sel.xpath('//body').extract()
Actual output:
['<body><span>go\xa0od</span></body>']
Expected output:
['<body><span>go&nbsp;od</span></body>']
The &nbsp; entity got converted to \xa0. How can I keep the entities as they are?
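According to the docs, .extract() and .getall() currently return the raw HTML as a list of strings, with the &nbsp; entity decoded to the Unicode character \xa0. More info here.
However, .extract_first() and .get() return only the first item of that list, as a single string. Printing that string shows the non-breaking space itself rather than the \xa0 escape, but the character is still U+00A0, not the &nbsp; entity. (Docs)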
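A minimal sketch of why the list and the single string look different. It uses only the standard library; the `extracted` string is an assumption that stands in for what the selector returns, copied from the "Actual output" in the question.

```python
import html

# Stand-in for the selector result (assumption: copied from the question's
# "Actual output"; no parsel call is made here).
extracted = '<body><span>go\xa0od</span></body>'

# The parser decodes the entity: &nbsp; and U+00A0 are the same character.
same_char = html.unescape('&nbsp;') == '\xa0'

# Displaying a list uses repr() on each item, which escapes U+00A0 as \xa0.
list_view = repr([extracted])

# The single string still contains the real non-breaking space character.
has_nbsp = '\xa0' in extracted

print(same_char)
print(list_view)
print(extracted)
```

So the difference between the two outputs is one of display, not of content: in both cases the entity has already been decoded.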
But if you really want &nbsp; in the output instead of the \xa0 character, one solution is a regular string replace on the extracted strings.
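The string replace could look roughly like this. It is a sketch, not the library's own API: `restore_nbsp` is a hypothetical helper name, and the input list stands in for the result of `.getall()` as shown in the question.

```python
def restore_nbsp(markup: str) -> str:
    """Replace every U+00A0 character with the literal text '&nbsp;'."""
    return markup.replace('\xa0', '&nbsp;')


# Stand-in for sel.xpath('//body').getall() (assumption: copied from the
# question's "Actual output").
extracted = ['<body><span>go\xa0od</span></body>']

restored = [restore_nbsp(item) for item in extracted]
print(restored)  # ['<body><span>go&nbsp;od</span></body>']
```

Note that this rewrites every non-breaking space, including any that were typed as a literal U+00A0 character in the source rather than as &nbsp;, and it leaves other decoded entities untouched.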