I'm trying to use Python to parse a large XML file (27 GB) using cElementTree and iterparse. I'm able to extract all the tags, but for some reason none of the element text is being retrieved (it's always showing 'None'). I've checked the documentation and StackOverflow, but to no avail. I tried parsing with lxml as a last resort and it works, but I'd prefer to figure it out with cElementTree if possible.
Update: When I comment out the elem.clear() line, the parsed data is shown, but now I'm trying to figure out why the clear() method wipes the data before it gets printed (ultimately I want to put the data into a separate data structure, such as a database). I'm assuming I need to clear the data so that I don't max out memory while parsing the file. Is this one of those "everything in Python is an object" situations?
Using a smaller sample extracted from the file, I still get the same problem. The XML file looks something like this (albeit with many more entries):
<?xml version="1.0" encoding="UTF-8" standalone="yes"?><entityList><entity
xmlns:ns2="urn:hl7-org:v3" xmlns:ns3="urn:axolotl-com:pdo">
<fragmentId>d68e616e-a6bc-4630-b104-3891859a8ce4</fragmentId>
<aggregateId>H1060734453</aggregateId>
<source>b6167864-5f74-40e5-97c5-7e551a3a4a7d</source>
<sourceName>SHM ADT</sourceName>
<sourceOid>2.16.840.1.113883.3.2.2.3.1.21.3</sourceOid>
<sourceAaoid>2.16.840.1.113883.3.62.2</sourceAaoid>
</entity></entityList>
Here's a snippet of the misbehaving code:
import xml.etree.ElementTree as etree
xml = r'C:\sample.xml'
count = 0
for event, elem in etree.iterparse(xml):
    if event == 'end':
        if elem.tag == 'entity':
            count += 1
            for child in elem:
                print(child.tag, child.attrib, child.text)
    elem.clear()
print(count)
I'm getting:
fragmentId {} None
aggregateId {} None
source {} None
sourceName {} None
sourceOid {} None
sourceAaoid {} None
Why does elem.clear() wipe the text even though it looks like the printing should happen first? Any suggestions?
Here is how I would do it. I am not sure what you want to do with the data, so I am just printing it as you are:
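A minimal sketch of the idea, not a definitive implementation: iterparse reports the 'end' event of each child before the 'end' event of its parent, so calling elem.clear() on every event empties fragmentId, aggregateId, and the rest before the enclosing entity is ever reached. Clearing only the entity element, after its children have been printed, avoids that. The in-memory SAMPLE document and the print_entities helper are assumptions made so the example runs on its own; with a real file, pass the path instead.

```python
import io
import xml.etree.ElementTree as etree

# Hypothetical stand-in for the real file, trimmed from the sample in the question.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<entityList><entity xmlns:ns2="urn:hl7-org:v3" xmlns:ns3="urn:axolotl-com:pdo">
<fragmentId>d68e616e-a6bc-4630-b104-3891859a8ce4</fragmentId>
<aggregateId>H1060734453</aggregateId>
<sourceName>SHM ADT</sourceName>
</entity></entityList>"""


def print_entities(source):
    """Print each child of every <entity> and return how many entities were seen."""
    count = 0
    # iterparse yields only 'end' events by default, so no event check is needed.
    for event, elem in etree.iterparse(source):
        if elem.tag == 'entity':
            count += 1
            for child in elem:
                print(child.tag, child.attrib, child.text)
            # Clear only now, after the children's text has been read.
            elem.clear()
    return count


count = print_entities(io.BytesIO(SAMPLE))
print(count)
```

Cleared entity elements still remain attached to the root entityList, so for a file this large it may also be worth clearing the root from time to time.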
As per your comment:
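Assuming the comment refers to the goal stated in the question of loading the data into a separate structure such as a database, one possible sketch is a generator that copies each entity's child text into a plain dict before releasing the parsed elements. The iter_entities name and the in-memory SAMPLE are illustrative assumptions, not part of the original post.

```python
import io
import xml.etree.ElementTree as etree

# Hypothetical stand-in for the real file, trimmed from the sample in the question.
SAMPLE = b"""<entityList><entity>
<fragmentId>d68e616e-a6bc-4630-b104-3891859a8ce4</fragmentId>
<aggregateId>H1060734453</aggregateId>
<sourceName>SHM ADT</sourceName>
</entity></entityList>"""


def iter_entities(source, tag='entity'):
    """Yield one {child tag: child text} dict per matching element."""
    context = iter(etree.iterparse(source, events=('start', 'end')))
    # The first event is the 'start' of the root element; keep a handle on it.
    _, root = next(context)
    for event, elem in context:
        if event == 'end' and elem.tag == tag:
            # Copy the text out first, so nothing is lost when the tree is cleared.
            record = {child.tag: child.text for child in elem}
            # Clearing the root drops the finished entity and keeps memory flat.
            root.clear()
            yield record


for record in iter_entities(io.BytesIO(SAMPLE)):
    # A database insert could go here instead of print.
    print(record)
```

Because the values are copied into the dict before root.clear() runs, each record stays valid after the element it came from has been discarded.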