lxml: I can't remove a span tag and the text inside


I have an HTML file with some divs like this (heavily simplified):

<div num="1" class="class1">
  <div class="class1-text">
    <span class="class2">
      <span class="class3"> some chinese text </span>
      some english text
    </span>
  </div>
</div>

I'm trying to remove all the Chinese text by removing the span node that contains it with lxml:

from lxml import etree as et

parser = et.XMLParser(remove_blank_text=True, recover=True)
documentXml = et.parse(html_FileName, parser)
for class1Node in documentXml.xpath('//div[@class="class1-text"]'):
    # xpath() returns a list, so handle each matching span in turn
    for chineseNode in class1Node.xpath('.//span[@class="class3"]'):
        chineseNode.getparent().remove(chineseNode)

But instead of getting just the class3 span from xpath, I seem to get the class2 span, so I end up removing all of the content (even the English text).

If I don't parse with lxml I get parsing errors (maybe because of the Chinese characters or malformed HTML).

There are 2 answers below.

Best answer

You can try the strip_elements() function with with_tail=False, which removes the matching elements but keeps the text that follows them:

from lxml import etree as et

parser = et.XMLParser(remove_blank_text=True, recover=True)
documentXml=et.parse(html_FileName, parser)
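for class1Node in documentXml.xpath('//div[@class="class1-text"]'):
    chineseNode = class1Node.xpath('.//span[@class="class3"]')
    if chineseNode:  # xpath() returns a list, which may be empty
        # Removes every <span> below the parent; with_tail=False keeps the text after each one
        et.strip_elements(chineseNode[0].getparent(), 'span', with_tail=False)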

print(et.tostring(documentXml))

It yields:

b'<div num="1" class="class1"><div class="class1-text"><span class="class2">\n      some english text\n    </span></div></div>'
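
To see why with_tail=False matters here, the following is a minimal sketch that runs strip_elements() both ways on a cut-down snippet. The SNIPPET string and the strip_spans helper are made up for illustration and are not part of the question's file; the "zh" text is a stand-in for the Chinese text.

```python
from lxml import etree as et

# Hypothetical, cut-down version of the markup from the question
SNIPPET = '<span class="class2"><span class="class3">zh</span> some english text</span>'

def strip_spans(with_tail):
    """Strip all <span> descendants of the root and return the serialized result."""
    root = et.fromstring(SNIPPET)
    # The element passed in is never removed itself, only its matching descendants
    et.strip_elements(root, 'span', with_tail=with_tail)
    return et.tostring(root, encoding="unicode")

kept = strip_spans(False)    # tail text after the inner span survives
dropped = strip_spans(True)  # default behaviour: tail text is removed too

print(kept)
print(dropped)
```

In lxml the text after a closing tag is stored as that element's tail, so the English text belongs to the class3 span rather than to class2. That is why it disappears unless with_tail=False is passed.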

Other answer

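You should be able to simplify your XPath selector to a single expression:

for chineseNode in documentXml.xpath("//div[@class='class1-text']//span[@class='class3']"):
    # Caution: remove() also discards chineseNode.tail, which is the English text here
    chineseNode.getparent().remove(chineseNode)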
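
If you want to stick with remove(), one way around the lost English text is to move the tail somewhere safe before removing the node. This is only a sketch: remove_keep_tail is a hypothetical helper name, and the HTML string is the sample from the question inlined so the example runs without a file.

```python
from lxml import etree as et

# Sample markup from the question, inlined instead of read from a file
HTML = """<div num="1" class="class1">
  <div class="class1-text">
    <span class="class2">
      <span class="class3"> some chinese text </span>
      some english text
    </span>
  </div>
</div>"""

def remove_keep_tail(node):
    """Remove node from its parent but keep the text that follows it."""
    parent = node.getparent()
    if node.tail:
        prev = node.getprevious()
        if prev is not None:
            # Append the tail to the previous sibling's tail
            prev.tail = (prev.tail or "") + node.tail
        else:
            # No previous sibling: the tail becomes part of the parent's text
            parent.text = (parent.text or "") + node.tail
    parent.remove(node)

parser = et.XMLParser(remove_blank_text=True, recover=True)
root = et.fromstring(HTML, parser)
for node in root.xpath("//div[@class='class1-text']//span[@class='class3']"):
    remove_keep_tail(node)

result = et.tostring(root, encoding="unicode")
print(result)
```

The class2 span stays in place with the English text inside it, and only the class3 span and its own content are gone.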