How to go to sub child node using Python xml.dom.minidom

130 Views Asked by At

I have the below xml structure, and I need to get data in below format to be exported to spreadsheet Output to be written in spreadsheet

Image of expected spreadsheet output

Source XML Data:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<root testAttr="testValue">
    <family name="Hardwood">
        <child name="Jack">First</child>
        <child name="Rose">Second</child>
        <child name="Blue Ivy">Third
                <grandchildren>
                    <data>One</data>
                    <data>Two</data>
                    <unique>Twins</unique>
                </grandchildren>
            </child>
        <child name="Jane">Fourth</child>
    </family>
    <family name="Downie">
        <child name="Bill">First</child>
        <child name="Rosie">Second</child>
        <child name="Edward">
            Third
            </child>
        <child name="Jane">Fourth</child>
    </family>
</root>

I have tried the below however not able to enter into grandchild tag.

import xml.dom.minidom
doc=xml.dom.minidom.parse('Sample_XML.xml')
children = doc.getElementsByTagName('family')
for child in children:
    print(child.getAttribute('name'))
    print(child.getElementsByTagName('child')[0].childNodes[0].nodeValue)
    print(child.getElementsByTagName('child')[1].childNodes[0].nodeValue)
    print(child.getElementsByTagName('child')[2].childNodes[0].nodeValue)
    print(child.getElementsByTagName('child')[2].childNodes[0].nodeValue)
2

There are 2 best solutions below

1
Jack Fleeting On

There may be a simpler way of doing it, but it can be done if you use lxml instead of minidom, because of the former's better support of xpath. lxml parser the document and creates a dataframe which can be saved as-is, as csv or as an Excel file. It's not simple (because your xml is deeply nested), but if pretty self explanatory, I believe, if you know xpath. If not, you should read up on it...

from lxml import etree
import pandas as pd

doc = etree.parse("Sample_XML.xml")
families = doc.xpath('//family')

rows = []
cols = ["Family","Child","Child Name","Grandchild"]

for family in families:
    children = family.xpath('.//child')
    for child in children:
        gcs = child.xpath('.//grandchildren') 
        entry = [gc.xpath('.//*/text()') for gc in gcs] if gcs else None
        row = [family.xpath('./@name')[0], child.xpath('./text()')[0].strip(), child.xpath('./@name')[0]]
        if entry is not None:            
            for ent in entry[0]:                
                subrow = row+[ent]
                rows.append(subrow)
        else:            
            row.append("none")
            rows.append(row)

pd.DataFrame(rows,columns=cols)

Output should look like this (pardon the formatting):

    Family    Child     Child Name  Grandchild
0   Hardwood    First   Jack      none
1   Hardwood    Second  Rose      none
2   Hardwood    Third   Blue Ivy    One
3   Hardwood    Third   Blue Ivy    Two
4   Hardwood    Third   Blue Ivy    Twins
5   Hardwood    Fourth  Jane    none
6   Downie  First   Bill      none
7   Downie  Second  Rosie     none
8   Downie  Third   Edward    none
9   Downie  Fourth  Jane      none
1
Parfait On

Consider multiple nested loops, walking down the tree from family to child to grandchildren while building a list of dictionaries. Special handling is needed for the optional grandchildren nodes. From there use, DictWriter to write data to CSV.

from csv import DictWriter
import xml.dom.minidom as md

doc = md.parse('Sample_XML.xml')

# PARSE XML
data = []
for family in doc.getElementsByTagName('family'):
    for child in family.getElementsByTagName('child'):
        inner = {}
        inner["Family"] = family.getAttribute('name')
        inner["Child"] = child.childNodes[0].nodeValue.strip()
        inner["Child Name"] = child.getAttribute('name')
        inner["Grandchild"] = None
        
        for grandchildren in child.getElementsByTagName('grandchildren'):
            for grandchild in grandchildren.childNodes:
                if grandchild.nodeType == md.Node.ELEMENT_NODE:
                    inner = {}
                    inner["Family"] = family.getAttribute('name')
                    inner["Child"] = child.childNodes[0].nodeValue.strip()
                    inner["Child Name"] = child.getAttribute('name')
                    inner["Grandchild"] = grandchild.childNodes[0].nodeValue
                    
                    data.append(inner)
                    
        data.append(inner)

# WRITE CSV
dkeys = list(data[0].keys())

with open("Output.csv", "w", newline="") as f:
    dw = DictWriter(f, fieldnames=dkeys)
    dw.writeheader()    
    dw.writerows(data)