I have an xml file with mentions of software inside of <p> tags. I would like to create a script to catch those sofwtares and wrapp them in a some kind of tag (like <software>) just to be able to identify them after.
so far i have tried using :
- xmltree
- beautifull soup
- lxml
- converting the xml to json
The main issue is that my mentions are usually located inside a <p> tag that has some <ref>tags.
Here is an example of one the <p> tags and the mention of the software "MODELLER"(I truncated it because the <p> can be up to 3 or 4 time bigger):
<p> The reliability of the model structure was tested using the ENERGY commands of MODELLER <ref type="bibr" target="#b45">(Sali and Blundell, 1993)</ref>. The modelled structures were also validated using the program PROSA <ref type="bibr" target="#b54">(Wiederstein and Sippl, 2007)</ref>.</p>
the expected result :
<p> The reliability of the model structure was tested using the ENERGY commands of <software>MODELLER</software> <ref type="bibr" target="#b45">(Sali and Blundell, 1993)</ref>. The modelled structures were also validated using the program PROSA <ref type="bibr" target="#b54">(Wiederstein and Sippl, 2007)</ref>.</p>
A simple search and replace operation won't work as I'm trying to find a specific mention of the software. To be precise I have a JSON with every mentions found in the xml. In the JSON every mentions is linked with a the context. So for the entry for MODELLER in the json would be linked to this context : "The reliability of the model structure was tested using the ENERGY commands of MODELLER (Sali and Blundell, 1993)."(the context contains the sentence with the software and without any sub-child).
an example of the JSON:
{
"type": "software",
"software-type": "software",
"software-name":
{
"rawForm": "MODELLER",
"normalizedForm": "MODELLER",
"offsetStart": 79,
"offsetEnd": 87,
"boundingBoxes":
[
{
"p": 4,
"x": 495.722,
"y": 320.344,
"w": 53.9323,
"h": 8.0517
}
]
},
"context": "The reliability of the model structure was tested using the ENERGY commands of MODELLER (Sali and Blundell, 1993)."
}
The best script that I made would be this one :
import xml.etree.ElementTree as ET
xml = '<root><p>hello, this is a test sentence with a <ref type="table" target="tab_0">reference_1</ref>, my software is MOD and here is another <ref type="table" target="tab_0">reference_2</ref> and some more text</p></root>'
root = ET.fromstring(xml)
p = root.find('p')
p_string = "".join(p.itertext())
software = 'MOD'
p_str_wo_child = str(p.text)
sub_tags_list = []
list_p = list(p)
p_software_index= p_str_wo_child.find(software)
for elm in list_p:
index = p_string.find(elm.text)
print(elm.tail)
if index != -1:
sub_tags_list.append([elm.tag, elm.text, elm.tail, index])
p_string = p_string.replace(elm.text, '', 1)
p_str_wo_child += str(elm.tail)
p.remove(elm)
sub_tags_list = sorted(sub_tags_list, key=lambda x: x[3])
print(sub_tags_list)
p_software_index= p_str_wo_child.find(software)
if p_software_index:
sub_tags_list.append(["software", software, False,p_software_index])
sub_tags_list = sorted(sub_tags_list, key=lambda x: x[3])
print(sub_tags_list)
for tag_name, tag_content, tail ,index in sub_tags_list:
tag = ET.Element(tag_name)
tag.text = tag_content
if tail == False:
index_software = last_tail.find(software)
tag.tail = last_tail[index_software:].replace(software,'')
print(tag.tail)
else:
if tail.find(software):
tag.tail = tail[:tail.find(software)].replace(software, '')
last_tail = tail
else:
tag.tail = tail
last_tail = tail
p.append(tag)
modified_xml = ET.tostring(root, encoding='unicode')
print(modified_xml)
I might try to tackle this using XSLT 3 (and saxonche from PyPi) as follows:
Example fiddle. There the JSON data is inlined but you can of course read it from a file.