
Generate TRAIN_DATA for spaCy from XML


I have XML data that looks like this:

<item n="main"><anchor type="b" ana="regO.lemID_12" xml:id="TidB13" />Stuttgart<anchor type="e" ana="reg0.lemID_12" xml:id="TidE13" /> d. 20. Sept [19]97<lb/>Lieber Herr Schmidt!<lb/>Ich bin sehr glücklich über die Aufnahme <anchor type="b" ana="regW.lemID_17" xml:id="TidB22" />meines <anchor type="b" ana="regP.lemID_4" xml:id="TidB4" />Shakespeare<anchor type="e" ana="regP.lemID_4" xml:id="TidE4" /><anchor type="e" ana="regW.lemID_17" xml:id="TidE22" /> bei euch, vielen Dank.</item>

I want to use texts like this as training data in spaCy, so I need them in the form spaCy requires:

doc = nlp("Laura flew to Silicon Valley.")
gold_dict = {"entities": [(0, 5, "PERSON"), (14, 28, "LOC")]}
example = Example.from_dict(doc, gold_dict)

The part I still can't get right is computing the offsets, i.e. where each entity starts and where it ends. Is there a particularly suitable approach for this?

Edit: here is what I have tried so far with ElementTree:

from xml.etree import ElementTree as ET

data = '''
<root>
<item n="main"><anchor type="b" ana="regO.lemID_12" xml:id="TidB13" />Stuttgart<anchor type="e" ana="reg0.lemID_12" xml:id="TidE13" /> d. 20. Sept [19]97<lb/>Lieber Herr Schmidt!<lb/>Ich bin sehr glücklich über die Aufnahme <anchor type="b" ana="regW.lemID_17" xml:id="TidB22" />meines <anchor type="b" ana="regP.lemID_4" xml:id="TidB4" />Shakespeare<anchor type="e" ana="regP.lemID_4" xml:id="TidE4" /><anchor type="e" ana="regW.lemID_17" xml:id="TidE22" /> bei euch, vielen Dank.</item>
</root>
'''
def get_entity_type(ana):
    if 'regO' in ana:
        return 'PLACE'
    if 'regP' in ana:
        return 'PERSON'
    if 'regW' in ana:
        return 'WORK'
    if 'regP' in ana:
        return "PERIODICA"
 
root = ET.fromstring(data)
print(root)
#text = ""
entities = []
current_pos = 0

for node in root.iter():
    #print(node)
    if node.tag == "anchor" and node.get('type')=='b':
        start_pos = current_pos
        ana = node.get('ana')
        entity_type = get_entity_type(ana)
        #print(entity_type)
    elif node.tag == "anchor" and node.get('type')=='e':
        entities.append((entity_type, start_pos, current_pos))       
                    
#print (entities)

So detecting the entity types works, but my approach to capturing the start and end positions of the entities is wrong. I also tried to do it with pawpaw, but it always fails to find "Ito".

That's what I tried with pawpaw:

from pawpaw import ito
root = ET.fromstring(data)
elements = root.findall('.//')
print(elements)

for e in elements:
    plain_text = e.Ito.find('*[d:text]')
#     print(plain_text)

There is 1 best solution below

Answer by Hermann12:

To get the text, you need each element's .tail attribute, which holds the text that follows the element's closing tag:

import xml.etree.ElementTree as ET

xml_str = """
<item n="main"><anchor type="b" ana="regO.lemID_12" xml:id="TidB13" />Stuttgart<anchor type="e" ana="reg0.lemID_12" xml:id="TidE13" /> d. 20. Sept [19]97<lb/>Lieber Herr Schmidt!<lb/>Ich bin sehr glücklich über die Aufnahme <anchor type="b" ana="regW.lemID_17" xml:id="TidB22" />meines <anchor type="b" ana="regP.lemID_4" xml:id="TidB4" />Shakespeare<anchor type="e" ana="regP.lemID_4" xml:id="TidE4" /><anchor type="e" ana="regW.lemID_17" xml:id="TidE22" /> bei euch, vielen Dank.</item>
"""
root = ET.fromstring(xml_str)

text = []
for elem in root.iter():
    if elem.tail is not None:
        # with linebreak \n
        text.append(elem.tail+'\n')
        
t = ''.join(text)
print(t)
print(repr(t))

Output:

Stuttgart
 d. 20. Sept [19]97
Lieber Herr Schmidt!
Ich bin sehr glücklich über die Aufnahme 
meines 
Shakespeare
 bei euch, vielen Dank.

'Stuttgart\n d. 20. Sept [19]97\nLieber Herr Schmidt!\nIch bin sehr glücklich über die Aufnahme \nmeines \nShakespeare\n bei euch, vielen Dank.\n'
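
Building on the .tail idea, the character offsets asked about in the question could be derived by keeping a running count of the text collected so far and recording that count at every begin and end anchor. The following is only a minimal sketch under several assumptions that are not confirmed by the original post: begin and end anchors are paired through the numeric suffix of xml:id (TidB13 with TidE13), the label comes from the prefix of the begin anchor's ana value via the hypothetical LABELS mapping, and each lb is rendered as a newline. The helper name extract_text_and_entities is made up for illustration.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Assumed mapping from the prefix of the begin anchor's "ana" value to a label.
LABELS = {"regO": "PLACE", "regP": "PERSON", "regW": "WORK"}

xml_str = """<item n="main"><anchor type="b" ana="regO.lemID_12" xml:id="TidB13" />Stuttgart<anchor type="e" ana="reg0.lemID_12" xml:id="TidE13" /> d. 20. Sept [19]97<lb/>Lieber Herr Schmidt!<lb/>Ich bin sehr glücklich über die Aufnahme <anchor type="b" ana="regW.lemID_17" xml:id="TidB22" />meines <anchor type="b" ana="regP.lemID_4" xml:id="TidB4" />Shakespeare<anchor type="e" ana="regP.lemID_4" xml:id="TidE4" /><anchor type="e" ana="regW.lemID_17" xml:id="TidE22" /> bei euch, vielen Dank.</item>"""


def extract_text_and_entities(xml_source):
    """Return (plain_text, [(start, end, label), ...]) for one <item>."""
    parts = []       # text fragments collected so far
    open_spans = {}  # id suffix -> (start offset, label) of open entities
    entities = []
    pos = 0          # number of characters collected so far

    def add(fragment):
        nonlocal pos
        if fragment:
            parts.append(fragment)
            pos += len(fragment)

    def walk(elem):
        if elem.tag == "anchor":
            xml_id = elem.get(XML_ID, "")
            if elem.get("type") == "b":
                # Remember where this entity starts and which label it gets.
                prefix = elem.get("ana", "").split(".")[0]
                open_spans[xml_id[len("TidB"):]] = (pos, LABELS.get(prefix))
            elif elem.get("type") == "e":
                # Close the entity opened by the anchor with the same suffix.
                start, label = open_spans.pop(xml_id[len("TidE"):], (None, None))
                if label is not None:
                    entities.append((start, pos, label))
        elif elem.tag == "lb":
            add("\n")  # assumption: a line break becomes a newline
        add(elem.text)
        for child in elem:
            walk(child)
            add(child.tail)  # text that follows the child's closing tag

    walk(ET.fromstring(xml_source))
    return "".join(parts), entities


text, entities = extract_text_and_entities(xml_str)
for start, end, label in entities:
    print(start, end, label, repr(text[start:end]))
```

The returned pair should then fit the pattern from the question, for example Example.from_dict(nlp.make_doc(text), {"entities": entities}). Note that the WORK span here contains the PERSON span; as far as I know, spaCy's NER annotations must not overlap, so nested spans like these would probably have to be filtered first or handled with a span categorizer instead.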