Wrapping Software Mentions in XML

87 Views Asked by At

I have an xml file with mentions of software inside of <p> tags. I would like to create a script to catch those sofwtares and wrapp them in a some kind of tag (like <software>) just to be able to identify them after.

so far i have tried using :

  • xmltree
  • beautifull soup
  • lxml
  • converting the xml to json

The main issue is that my mentions are usually located inside a <p> tag that has some <ref>tags. Here is an example of one the <p> tags and the mention of the software "MODELLER"(I truncated it because the <p> can be up to 3 or 4 time bigger):

<p> The reliability of the model structure was tested using the ENERGY commands of MODELLER <ref type="bibr" target="#b45">(Sali and Blundell, 1993)</ref>. The modelled structures were also validated using the program PROSA <ref type="bibr" target="#b54">(Wiederstein and Sippl, 2007)</ref>.</p>

the expected result :

<p> The reliability of the model structure was tested using the ENERGY commands of <software>MODELLER</software> <ref type="bibr" target="#b45">(Sali and Blundell, 1993)</ref>. The modelled structures were also validated using the program PROSA <ref type="bibr" target="#b54">(Wiederstein and Sippl, 2007)</ref>.</p>

A simple search and replace operation won't work as I'm trying to find a specific mention of the software. To be precise I have a JSON with every mentions found in the xml. In the JSON every mentions is linked with a the context. So for the entry for MODELLER in the json would be linked to this context : "The reliability of the model structure was tested using the ENERGY commands of MODELLER (Sali and Blundell, 1993)."(the context contains the sentence with the software and without any sub-child).

an example of the JSON:

{
            "type": "software",
            "software-type": "software",
            "software-name":
            {
                "rawForm": "MODELLER",
                "normalizedForm": "MODELLER",
                "offsetStart": 79,
                "offsetEnd": 87,
                "boundingBoxes":
                [
                    {
                        "p": 4,
                        "x": 495.722,
                        "y": 320.344,
                        "w": 53.9323,
                        "h": 8.0517
                    }
                ]
            },
            "context": "The reliability of the model structure was tested using the ENERGY commands of MODELLER (Sali and Blundell, 1993)."
}

The best script that I made would be this one :

import xml.etree.ElementTree as ET

xml = '<root><p>hello, this is a test sentence with a <ref type="table" target="tab_0">reference_1</ref>, my software is MOD and here is another <ref type="table" target="tab_0">reference_2</ref> and some more text</p></root>'
root = ET.fromstring(xml)

p = root.find('p')
p_string = "".join(p.itertext())

software = 'MOD'

p_str_wo_child = str(p.text)

sub_tags_list = []
list_p = list(p)
p_software_index= p_str_wo_child.find(software)
for elm in list_p:
    index = p_string.find(elm.text)
    print(elm.tail)
    if index != -1:
        sub_tags_list.append([elm.tag, elm.text, elm.tail, index])
        p_string = p_string.replace(elm.text, '', 1)
    p_str_wo_child += str(elm.tail)
    p.remove(elm)

sub_tags_list = sorted(sub_tags_list, key=lambda x: x[3])
print(sub_tags_list)
p_software_index= p_str_wo_child.find(software)

if p_software_index:

    sub_tags_list.append(["software", software, False,p_software_index])
    sub_tags_list = sorted(sub_tags_list, key=lambda x: x[3])
    print(sub_tags_list)
    for tag_name, tag_content, tail ,index in sub_tags_list:
        tag = ET.Element(tag_name)
        tag.text = tag_content
        if tail == False:
            index_software = last_tail.find(software)
            tag.tail = last_tail[index_software:].replace(software,'')
            print(tag.tail)
        else:
            if tail.find(software):
                tag.tail = tail[:tail.find(software)].replace(software, '')
                last_tail = tail
            else:
                tag.tail = tail
                last_tail = tail
        p.append(tag)

modified_xml = ET.tostring(root, encoding='unicode')

print(modified_xml)
1

There are 1 best solutions below

0
Martin Honnen On

I might try to tackle this using XSLT 3 (and saxonche from PyPi) as follows:

<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  version="3.0"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  exclude-result-prefixes="#all"
  expand-text="yes">
  
  <xsl:template match="p">
    <xsl:iterate select="$json-doc?*[?type = 'software']">
      <xsl:param name="p" select="."/>
      <xsl:on-completion select="$p"/>
      <xsl:if test="contains($p, ?context)">
        <xsl:variable name="transformed-p" as="element(p)">
          <xsl:apply-templates select="$p" mode="process">
            <xsl:with-param name="software" select="." tunnel="yes"/>
          </xsl:apply-templates>
        </xsl:variable>
        <xsl:next-iteration>
          <xsl:with-param name="p" select="$transformed-p"/>
        </xsl:next-iteration>
      </xsl:if>
    </xsl:iterate>
  </xsl:template>
  
  <xsl:mode name="process" on-no-match="shallow-copy"/>
  
  <xsl:template mode="process" match="p//text()">
    <xsl:param name="software" tunnel="yes" as="map(*)"/>
    <xsl:choose>
      <xsl:when test="contains($software?context, .) and contains(., $software?software-name?normalizedForm)">
        <xsl:apply-templates select="analyze-string(., $software?software-name?normalizedForm)" mode="wrap"/>
      </xsl:when>
      <xsl:otherwise>
        <xsl:next-match/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>
  
  <xsl:template mode="wrap" match="*:match">
    <software>{.}</software>
  </xsl:template>
  

  <xsl:output method="xml" indent="no"/>

  <xsl:mode on-no-match="shallow-copy"/>
  
  <xsl:param name="json-doc" select="json-doc('your-json-file.json')"/>
 
</xsl:stylesheet>

Example fiddle. There the JSON data is inlined but you can of course read it from a file.