iterparse elements getting cleared before I can capture the data

I'm trying to use Python to parse a large XML file (27 GB) using cElementTree and iterparse. I'm able to extract all the tags, but for some reason none of the element text is being retrieved (it's always showing 'None'). I've checked the documentation and StackOverflow, but to no avail. I tried parsing with lxml as a last resort and it works, but I'd prefer to figure it out with cElementTree if possible.

Update: When I comment out the elem.clear() line, it shows the data being parsed, but now I'm trying to figure out why the clear() method is wiping the data before it gets printed (ultimately I want to put the data into a separate data structure, such as a database). I'm assuming I need to clear the data so that I don't max out memory during the parse. Is this one of those "everything in Python is an object" situations?

Using a smaller sample extracted from the file, I'm still getting the same problem. The XML file looks something like this (albeit with many more entries):

<?xml version="1.0" encoding="UTF-8" standalone="yes"?><entityList><entity 
xmlns:ns2="urn:hl7-org:v3" xmlns:ns3="urn:axolotl-com:pdo">
<fragmentId>d68e616e-a6bc-4630-b104-3891859a8ce4</fragmentId>
<aggregateId>H1060734453</aggregateId>
<source>b6167864-5f74-40e5-97c5-7e551a3a4a7d</source>
<sourceName>SHM ADT</sourceName>
<sourceOid>2.16.840.1.113883.3.2.2.3.1.21.3</sourceOid>
<sourceAaoid>2.16.840.1.113883.3.62.2</sourceAaoid>
</entity></entityList>

Here's a snippet of the misbehaving code:

import xml.etree.ElementTree as etree
xml=r'C:\sample.xml'

count = 0

for event, elem in etree.iterparse(xml):
    if event == 'end':
        if elem.tag == 'entity':
            count += 1
            for child in elem:
                print(child.tag, child.attrib, child.text)
    elem.clear()
print(count)

I'm getting

fragmentId {} None
aggregateId {} None
source {} None
sourceName {} None
sourceOid {} None
sourceAaoid {} None

Why does elem.clear() wipe the text even though it looks like the printing should happen first? Any suggestions?

There are 2 best solutions below

Answer by gold_cy

Here is how I would do it. I am not sure what you want to do with the data, so I am just printing it as you are:

import xml.etree.ElementTree as ET

tree = ET.parse(path_to_xml)
root = tree.getroot()

def tree_parser(root):
    # Iterate over the element directly; getchildren() was removed in Python 3.9.
    for child in root:
        if len(child) == 0:  # leaf element: no sub-elements
            print(child.tag, child.text)
        else:
            tree_parser(child)

tree_parser(root)

fragmentId d68e616e-a6bc-4630-b104-3891859a8ce4
aggregateId H1060734453
source b6167864-5f74-40e5-97c5-7e551a3a4a7d
sourceName SHM ADT
sourceOid 2.16.840.1.113883.3.2.2.3.1.21.3
sourceAaoid 2.16.840.1.113883.3.62.2

As per your comment:

def tree_parser(root, seen=None):
    # Avoid a mutable default argument; create the set on first use.
    if seen is None:
        seen = set()
    for child in root:
        if len(child) == 0:  # leaf element: no sub-elements
            seen.add((child.tag, child.text))
        else:
            tree_parser(child, seen)
    return seen

c = set()
for _, element in ET.iterparse(path_to_xml):
    tree_parser(element, c)

print(c)

{('aggregateId', 'H1060734453'),
 ('fragmentId', 'd68e616e-a6bc-4630-b104-3891859a8ce4'),
 ('source', 'b6167864-5f74-40e5-97c5-7e551a3a4a7d'),
 ('sourceAaoid', '2.16.840.1.113883.3.62.2'),
 ('sourceName', 'SHM ADT'),
 ('sourceOid', '2.16.840.1.113883.3.2.2.3.1.21.3')}

Answer by pizza

Moving elem.clear() to the block under the if elem.tag == 'entity': statement works. This ensures that the child elements are cleared only after you have processed them.

count = 0

for event, elem in etree.iterparse(xml):
    if event == 'end':
        if elem.tag == 'entity':
            count+=1        
            for child in elem:
                print (child.tag, child.attrib, child.text)
            elem.clear()    # Clear only if </entity> is encountered
print(count)
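
Since the stated goal is to load each record into another data structure rather than print it, the same pattern could be wrapped in a generator. This is only a minimal sketch under assumptions: the name iter_entities and the shortened SAMPLE document are hypothetical, and io.BytesIO stands in for the real file path.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical, shortened sample mirroring the structure in the question.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<entityList><entity xmlns:ns2="urn:hl7-org:v3" xmlns:ns3="urn:axolotl-com:pdo">
<fragmentId>d68e616e-a6bc-4630-b104-3891859a8ce4</fragmentId>
<aggregateId>H1060734453</aggregateId>
</entity></entityList>"""


def iter_entities(source):
    """Yield one {tag: text} dict per <entity>, clearing it afterwards."""
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "entity":
            # The children are still intact here: nothing has cleared them yet.
            record = {child.tag: child.text for child in elem}
            elem.clear()  # release the children only after copying their text
            yield record


records = list(iter_entities(io.BytesIO(SAMPLE)))
print(records)
```

Each yielded dict holds plain strings, so it stays valid after the element is cleared. The emptied entity elements remain attached to the root, so for a file of this size it may also be worth clearing the root periodically.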

In your original example, by the time the </entity> closing tag is encountered, all the child elements have already been cleared (their closing tags are encountered earlier).

count = 0

for event, elem in etree.iterparse(xml):
    if event == 'end':
        if elem.tag == 'entity':
            count += 1
            for child in elem:
                print(child.tag, child.attrib, child.text)
    elem.clear()    # Clears fragmentId ... sourceAaoid before </entity>
print(count)
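
The ordering can be checked directly by recording the tags in the order their 'end' events arrive. The following is a small illustrative sketch; the two-field SAMPLE document and the variable names are assumptions made for the demonstration, not part of the original file.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical two-field document with the same nesting as the question's XML.
SAMPLE = (
    b"<entityList><entity>"
    b"<fragmentId>abc</fragmentId><source>xyz</source>"
    b"</entity></entityList>"
)

order = []                 # tags in the order their closing tags are seen
texts_seen_at_entity = []  # child text as observed when </entity> arrives

for event, elem in ET.iterparse(io.BytesIO(SAMPLE), events=("end",)):
    order.append(elem.tag)
    if elem.tag == "entity":
        texts_seen_at_entity = [child.text for child in elem]
    elem.clear()  # unconditional clear, as in the original code

print(order)                 # ['fragmentId', 'source', 'entity', 'entityList']
print(texts_seen_at_entity)  # [None, None]
```

The children's 'end' events fire before the parent's, so each child has already had its text reset to None by the time the loop reaches the entity element.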