I have inherited some XML that I need to process in Python. I am using xml.etree.cElementTree, and I am having some trouble associating text that occurs after an empty element with that empty element's tag. The XML is quite a bit more complicated than what I have pasted below, but I have simplified it to make the problem clearer (I hope!).
The result I would like to have is a dict like this:
DESIRED RESULT
{(9, 1): 'As they say, A student has usually three maladies:', (9, 2): 'poverty, itch, and pride.'}
The tuples can also contain strings (e.g., ('9', '1')). I really don't care at this early stage.
Here is the XML:
test1.xml
<div1 type="chapter" num="9">
  <p>
    <section num="1"/> <!-- The empty element -->
    As they say, A student has usually three maladies: <!-- Here lies the trouble -->
    <section num="2"/> <!-- Another empty element -->
    poverty, itch, and pride.
  </p>
</div1>
WHAT I HAVE TRIED
Attempt 1
>>> import xml.etree.cElementTree as ET
>>> tree = ET.parse('test1.xml')
>>> root = tree.getroot()
>>> chapter = root.attrib['num']
>>> d = dict()
>>> for p in root:
        for section in p:
            d[(int(chapter), int(section.attrib['num']))] = section.text
>>> d
{(9, 2): None, (9, 1): None} # This of course makes sense, since the elements are empty
Attempt 2
>>> for p in root:
        for section, text in zip(p, p.itertext()): # unfortunately, p and p.itertext() are two different lengths, which also makes sense
            d[(int(chapter), int(section.attrib['num']))] = text.strip()
>>> d
{(9, 2): 'As they say, A student has usually three maladies:', (9, 1): ''}
As you can see in the latter attempt, p and p.itertext() are two different lengths. The value stored under (9, 2) is the one I am trying to associate with key (9, 1), and the value I want to associate with (9, 2) does not even show up in d (since zip stops at the end of the shorter p and drops the last item of p.itertext()).
Any help would be appreciated. Thanks in advance.
Have you tried using .tail?
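To spell out the .tail suggestion: in ElementTree, text that follows an element's end tag (or follows an empty element) is stored on that element's tail attribute rather than on text, which is why section.text is None in Attempt 1. Below is a minimal sketch, assuming the sample document from the question is inlined as a string instead of read from test1.xml. It imports xml.etree.ElementTree, since cElementTree was removed in Python 3.9. The helper name sections_by_tail is made up for illustration.

```python
import xml.etree.ElementTree as ET

# The sample document from the question, inlined so the sketch is self-contained.
XML = """<div1 type="chapter" num="9">
<p>
<section num="1"/> <!-- The empty element -->
As they say, A student has usually three maladies: <!-- Here lies the trouble -->
<section num="2"/> <!-- Another empty element -->
poverty, itch, and pride.
</p>
</div1>"""


def sections_by_tail(root):
    """Map (chapter, section) to the text that follows each empty <section/>.

    Hypothetical helper: assumes root is the <div1> chapter element and that
    every <section> carries a numeric num attribute.
    """
    chapter = int(root.attrib['num'])
    result = {}
    for section in root.iter('section'):
        # tail holds the text between this element and the next tag;
        # it is None when nothing follows, so fall back to ''.
        text = (section.tail or '').strip()
        result[(chapter, int(section.attrib['num']))] = text
    return result


root = ET.fromstring(XML)
d = sections_by_tail(root)
print(d)
```

With the default parser the XML comments are discarded, so the text on either side of a comment ends up in the same tail, and strip() removes the surrounding whitespace and newlines. For the real file, ET.parse('test1.xml').getroot() would take the place of ET.fromstring(XML).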