I have an HTML file with some divs like this (heavily simplified):
<div num="1" class="class1">
  <div class="class1-text">
    <span class="class2">
      <span class="class3"> some chinese text </span>
      some english text
    </span>
  </div>
</div>
I'm trying to remove all the Chinese text by removing the span node that contains it with lxml:
from lxml import etree as et

parser = et.XMLParser(remove_blank_text=True, recover=True)
documentXml = et.parse(html_FileName, parser)
for class1Node in documentXml.xpath('//div[@class="class1-text"]'):
    # xpath() returns a list, so handle each matching span in turn
    for chineseNode in class1Node.xpath('.//span[@class="class3"]'):
        chineseNode.getparent().remove(chineseNode)
but instead of getting just the span class3 node from xpath I seem to get the span class2 node, so I end up removing all the content (even the English text).
If I don't parse with lxml I get parsing errors (maybe because of the Chinese characters or malformed HTML).
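You can try the strip_elements() function with with_tail=False. remove() deletes an element together with its tail text, and here the English text is the tail of the class3 span, which is why it disappears as well. strip_elements() with with_tail=False removes the element but leaves its tail in place.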
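A minimal sketch of the strip_elements() approach with with_tail=False, not necessarily the exact code originally posted. It assumes the markup from the question is inlined as a string (the names `html`, `root` and `result` are illustrative), and that the only spans nested under the class2 span are the ones to drop, because strip_elements() matches by tag name rather than by class:

```python
from lxml import etree as et

# Sample markup from the question, inlined so the sketch is self-contained.
html = """<div num="1" class="class1">
<div class="class1-text">
<span class="class2">
<span class="class3"> some chinese text </span>
some english text
</span>
</div>
</div>"""

parser = et.XMLParser(remove_blank_text=True, recover=True)
root = et.fromstring(html, parser)

for span in root.xpath('//div[@class="class1-text"]//span[@class="class3"]'):
    parent = span.getparent()
    # Assumption: every <span> below the parent should go. with_tail=False
    # keeps the text that follows the removed element (the English text).
    et.strip_elements(parent, 'span', with_tail=False)

result = et.tostring(root, encoding="unicode")
print(result)
```

If other nested spans must survive, an alternative is to copy the span's tail onto the previous sibling or the parent's text before calling remove().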