Add a new element surrounding a given word in the texts of a given element and its tail using lxml


So I have a relatively complex XML encoding where the text can contain an arbitrary number of elements. Let's take this simplified example:

<div>
<p>-I like James <stage><hi>he said to her </hi></stage>, but I am not sure James understands <hi>Peter</hi>'s problems.</p>
</div>

I want to enclose all named entities in the sentence (the two instances of James and Peter) with an rs element:

<div>
<p>-I like <rs>James</rs> <stage><hi>he said to her </hi></stage>, but I am not sure <rs>James</rs> understands <hi><rs>Peter</rs></hi>'s problems.</p>
</div>

To simplify this, let's say I have a list of names I could find in the text, such as:

names = ["James", "Peter", "Mary"]

I want to use lxml for this. I know I could use the etree.SubElement() and append a new element at the end of the p element, but I don't know how to deal with the tails and the other possible elements.

I understand that I need to handle the three references in my example differently.

  1. The first James is in the text of the p element. I could just do this:
p = etree.SubElement(div, "p")
p.text = "-I like <rs>James</rs>"

Right?

  2. The second James is in the tail of the stage element. I don't know how to deal with that.
  3. The reference to Peter is in the text of the hi element. I guess I have to iterate through all elements, look at both the text and the tail of each one, and search for the named entities in my list.
rs = etree.SubElement(hi, "rs")
rs.text = "<rs>Peter</rs>"
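
For reference, here is a minimal sketch of where lxml keeps each piece of text in the simplified example. The variable names (`locations`, `probe`, `escaped`) are only illustrative. It also checks what happens when a string containing markup is assigned to `.text`: it is stored as literal characters and escaped on output, not parsed into an element.

```python
from lxml import etree

xml = (
    "<div><p>-I like James <stage><hi>he said to her </hi></stage>, "
    "but I am not sure James understands <hi>Peter</hi>'s problems.</p></div>"
)
div = etree.fromstring(xml)

# Each string lives either in .text (before an element's first child) or in
# .tail (after an element's closing tag, still inside its parent).
locations = [(el.tag, el.text, el.tail) for el in div.iter()]
for tag, text, tail in locations:
    print(f"{tag}: text={text!r} tail={tail!r}")
# The second "James" shows up in the tail of <stage>, not in <p>.

# Markup assigned to .text is treated as plain characters, so no <rs>
# child element is created and the angle brackets are escaped on output.
probe = etree.Element("p")
probe.text = "-I like <rs>James</rs>"
escaped = etree.tostring(probe).decode()
print(escaped)
```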

My guess is that there is a much better way to handle all of this. Any help? Thanks in advance!

Jack Fleeting On BEST ANSWER

It's a little convoluted, but can be done.

Let's say your XML looks like this:

play = '''<?xml version="1.0" encoding="UTF-8"?>
<root>
   <div>
      <p>
         -I like James
         <stage>
            <hi>he said to her</hi>
         </stage>
         , but I am not sure James understands
         <hi>Peter</hi>
         's problems.
      </p>
   </div>
   <div>
      <p>
         -I like Mary
         <stage>
            <hi>he said to her</hi>
         </stage>
         , but I am not sure Peter understands
         <hi>James</hi>
         's problems.
      </p>
   </div>
</root>
'''

I inserted another div, and added formatting for clarity. Note that this assumes that each <div> contains only one <p>; if that's not the case, it will have to be refined more.

from lxml import etree

doc = etree.XML(play.encode())
names = ["James", "Peter", "Mary"]

#find all the divs that need changing
destinations = doc.xpath('//div')

#extract the string representation of the current <p> (the "target")
for destination in destinations:
    target = destination.xpath('./p')[0]
    target_str = etree.tostring(target).decode()

    #replace the names with the required tag:
    for name in names:
        if name in target_str:
            target_str = target_str.replace(name, f'<rs>{name}</rs>')
    
    #remove the original <p> and replace it with the new one,
    #as an element formed from the new string 
    destination.remove(target)
    destination.insert(0,etree.fromstring(target_str))

print(etree.tostring(doc).decode())

In this case, the output should be:

<root>
   <div>
      <p>
         -I like <rs>James</rs>
         <stage>
            <hi>he said to her</hi>
         </stage>
         , but I am not sure <rs>James</rs> understands
         <hi><rs>Peter</rs></hi>
         's problems.
      </p></div>
   <div>
      <p>
         -I like <rs>Mary</rs>
         <stage>
            <hi>he said to her</hi>
         </stage>
         , but I am not sure <rs>Peter</rs> understands
         <hi><rs>James</rs></hi>
         's problems.
      </p></div>
</root>
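
Replacing strings in the serialized markup is short, but it would also rewrite a name that happens to appear in a tag name or attribute value, and the whitespace after each `</p>` is lost (visible as `</p></div>` in the output). As a possible alternative, here is a sketch that stays on the tree and splits each `.text` and `.tail` in place. The helper names (`split_on_names`, `wrap_names`, `make`) are made up for this example, and it assumes the names should only match as whole words.

```python
import re
from lxml import etree


def split_on_names(value, pattern):
    """Return [before, name, between, name, ..., after], or None if no match."""
    if not value:
        return None
    parts = pattern.split(value)
    return parts if len(parts) > 1 else None


def wrap_names(root, names, tag="rs"):
    """Wrap whole-word occurrences of `names` in <tag>, modifying the tree in place."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, names)) + r")\b")

    def make(name, after):
        new = etree.Element(tag)
        new.text = name
        new.tail = after or None  # text that followed the name
        return new

    # Snapshot the nodes first, because the loop adds elements to the tree.
    for el in list(root.iter()):
        # .text: the text before the first child.
        # Comments/PIs and elements that are already <tag> are skipped.
        if isinstance(el.tag, str) and el.tag != tag:
            parts = split_on_names(el.text, pattern)
            if parts:
                el.text = parts[0] or None
                for i in range(1, len(parts), 2):
                    el.insert(i // 2, make(parts[i], parts[i + 1]))
        # .tail: the text after the closing tag; new siblings go into the parent.
        parent = el.getparent()
        if parent is not None:
            parts = split_on_names(el.tail, pattern)
            if parts:
                el.tail = parts[0] or None
                pos = parent.index(el) + 1
                for i in range(1, len(parts), 2):
                    parent.insert(pos + i // 2, make(parts[i], parts[i + 1]))
    return root


doc = etree.fromstring(
    "<div><p>-I like James <stage><hi>he said to her </hi></stage>, "
    "but I am not sure James understands <hi>Peter</hi>'s problems.</p></div>"
)
wrap_names(doc, ["James", "Peter", "Mary"])
result = etree.tostring(doc).decode()
print(result)
```

Because each new `<rs>` carries the text that followed the name as its own tail, the surrounding text and the other child elements stay where they were.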
Michael Kay On

I know you want to use lxml, but XSLT is custom-made for this sort of thing. In XSLT 3.0,

<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  version="3.0" expand-text="yes">
<xsl:mode on-no-match="shallow-copy"/>
<xsl:param name="names" select="'James', 'Peter', 'Mary'"/>
<xsl:template match="text()">
  <xsl:analyze-string select="." 
                      regex="{string-join($names,'|')}">
     <xsl:matching-substring>
       <rs>{.}</rs>
     </xsl:matching-substring>
     <xsl:non-matching-substring>{.}</xsl:non-matching-substring>
  </xsl:analyze-string>
</xsl:template>
</xsl:transform>
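
Note that lxml's built-in XSLT support is based on libxslt, which implements XSLT 1.0, so a stylesheet using `xsl:analyze-string` needs an XSLT 3.0 processor such as Saxon. To make the matching/non-matching split easier to picture, here is a rough Python analogue on a single text node, using `re.split` with a capturing group. The sample string and variable names are invented for illustration. It also shows that a plain alternation matches inside longer words such as "Jameson", which the `\b` variant on the Python side avoids.

```python
import re

names = ["James", "Peter", "Mary"]
text = ", but I am not sure James understands Jameson"

# Plain alternation, comparable to string-join($names, '|') in the stylesheet.
plain = re.compile("(" + "|".join(map(re.escape, names)) + ")")
# With one capturing group, odd indexes are the matches and even indexes
# are the non-matching text in between.
plain_pieces = plain.split(text)
print(plain_pieces)

# Variant with word boundaries, so that "Jameson" is left alone.
bounded = re.compile(r"\b(" + "|".join(map(re.escape, names)) + r")\b")
bounded_pieces = bounded.split(text)
print(bounded_pieces)
```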