Issues parsing multiple nested children in XML using lxml


I'm having issues parsing out each child node within an XML file. The number of nodes can change per Instrument_Root. For instance, Instrument_Watch is NULL here, but will be populated in other instances after this. My goal is to have each child node parsed individually (Instrument_Ratings, Instrument_Attribute_Ratings, Instrument_Organization, Instrument_Supports, etc.).

I tried the following, but it just returned the first instance repeatedly: there are 3700 Instrument_Root elements in the file, and the Instrument_Rating of the first Instrument_Root was repeated 3700 times. I also ran into errors with etree due to the namespace.

from lxml import objectify

xml = objectify.parse(file)
root = xml.getroot()

tree1 = []
tree2 = []
tree3 = []
tree4 = []
for children in range(len(root.getchildren())):
    tree1.append([child.text for child in root.getchildren()[children].iterchildren()])
    for children2 in root.Instrument_Root.Instrument_Ratings.Instrument_Rating.getchildren():
        tree2.append([child2.text for child2 in  root.Instrument_Root.Instrument_Ratings.Instrument_Rating.getchildren()])
        for children3 in root.Instrument_Root.Instrument_Ratings.Instrument_Rating.Instrument_Rating_Attributes.Instrument_Rating_Attribute.getchildren():
            tree3.append([child3.text for child3 in root.Instrument_Root.Instrument_Ratings.Instrument_Rating.Instrument_Rating_Attributes.Instrument_Rating_Attribute.getchildren()])
            for children4 in root.Instrument_Root.Instrument_Organizations.Instrument_Organization.getchildren():
                tree4.append([child4.text for child4 in root.Instrument_Root.Instrument_Organizations.Instrument_Organization.getchildren()])


<?xml version="1.0" encoding="utf-8"?>              
<Instrument_Roots xmlns="" xmlns:xsi="http://www.XXXXXXX.XMLSchema-instance" file_type="Baseline" frequency="Hourly-12" generation_time="2020-04-06T12:00:00Z">             
        <Security_Description>Class B</Security_Description>        
        <Instrument_Type_Text>PASS-THRU CTFS</Instrument_Type_Text>     
        <Private_Placement_Text>Not Applicable</Private_Placement_Text>     
        <Coupon_Rate xsi:nil="true"/>       
        <Instrument_Description xsi:nil="true"/>        
        <Product_Line_Description>MBS - Prime</Product_Line_Description>        
        <Series_Class_Text>Class B</Series_Class_Text>      
                <Rating_Class_Text>Senior Secured</Rating_Class_Text>
                <Duration_Text>Long-Term Debt Rating</Duration_Text>
                <Seniority_Text>Senior Secured</Seniority_Text>
                <Evaluation_Type_Text>Credit Risk</Evaluation_Type_Text>
                <Rating_Subclass_Code xsi:nil="true"/>
                <Rating_Subclass_Text xsi:nil="true"/>
                <Currency_Capd_Text>Local Currency</Currency_Capd_Text>
                <Credit_Grade xsi:nil="true"/>
                <Rating_Direction_Text>DECISION NOT TO RATE</Rating_Direction_Text>
                <Rating_Type_Text>Long-Term Debt Rating</Rating_Type_Text>
                <Rating_Termination_Date xsi:nil="true"/>
                <Rating_Termination_Local_Date xsi:nil="true"/>
                <Rating_Reason_Text>DECISION NOT TO RATE</Rating_Reason_Text>
                <Rating_Currency_Text>Australian Dollar</Rating_Currency_Text>
                <Instrument_Watchlist xsi:nil="true"/>
        <Instrument_Supports xsi:nil="true"/>       
                <Organization_Role_Text>Issuer Account Bank</Organization_Role_Text>
                <Termination_Date xsi:nil="true"/>
                <Termination_Date xsi:nil="true"/>
                <Termination_Date xsi:nil="true"/>
                <Termination_Date xsi:nil="true"/>
                <Termination_Date xsi:nil="true"/>
                <Termination_Date xsi:nil="true"/>
                <Termination_Date xsi:nil="true"/>
                <Termination_Date xsi:nil="true"/>
                <Termination_Date xsi:nil="true"/>
                <Termination_Date xsi:nil="true"/>
                <Termination_Date xsi:nil="true"/>
                <Termination_Date xsi:nil="true"/>
                <Termination_Date xsi:nil="true"/>
                <Organization_Role_Text>Cash Manager</Organization_Role_Text>
                <Termination_Date xsi:nil="true"/>
                <Termination_Date xsi:nil="true"/>
                <Termination_Date xsi:nil="true"/>
        <Instrument_Identifiers xsi:nil="true"/>        
                <Rating_Attribute_Type_Text>SF Indicator</Rating_Attribute_Type_Text>
                <Termination_Date xsi:nil="true"/>
                <Rating_Attribute_Type_Text>SEC Exempt</Rating_Attribute_Type_Text>
                <Termination_Date xsi:nil="true"/>
                <Termination_Date xsi:nil="true"/>

Any ideas on how to attack this would be greatly appreciated. Thanks.


There is 1 best solution below


The source of your problem is that your XML declares a default namespace, so every attempt to locate an element must include this namespace (your code failed on this detail).
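To illustrate the effect of a default namespace, here is a minimal sketch on a made-up two-element document. The namespace URI `http://example.com/instruments` and the `Instrument_ID` element are placeholders chosen for this example, not values from your file. The import falls back to the standard library's `ElementTree`, which accepts the same `findall(path, namespaces)` call, so the snippet also runs where lxml is not installed.

```python
try:
    from lxml import etree as et
except ImportError:  # the standard library offers the same findall() API
    import xml.etree.ElementTree as et

# Placeholder URI (an assumption); use whatever xmlns declares in the real file.
NS_URI = "http://example.com/instruments"

doc = et.fromstring(
    f'<Instrument_Roots xmlns="{NS_URI}">'
    "<Instrument_Root><Instrument_ID>1</Instrument_ID></Instrument_Root>"
    "<Instrument_Root><Instrument_ID>2</Instrument_ID></Instrument_Root>"
    "</Instrument_Roots>"
)

# Without the namespace, the search matches nothing.
unqualified = doc.findall("Instrument_Root")

# With a prefix mapped to the namespace URI, both elements are found.
ns = {"xx": NS_URI}
qualified = doc.findall("xx:Instrument_Root", ns)

# Clark notation ({uri}tag) is an equivalent way to write the same name.
clark = doc.findall(f"{{{NS_URI}}}Instrument_Root")

print(len(unqualified), len(qualified), len(clark))  # prints: 0 2 2
```

The prefix `xx` is arbitrary; only the URI it maps to has to match the document.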

To process your XML file I used the following code:

  1. Import:

    from lxml import etree as et
  2. Read the XML file:

    parser = et.XMLParser(remove_blank_text=True)
    tree = et.parse('Instrum.xml', parser)
    root = tree.getroot()
  3. Define the namespace used:

    ns = {'xx': ''}

    (the value must match the URI declared by xmlns on the root element; ns will be used below).

  4. Fill tree1 with text content of children of each Instrument_Root:

    tree1 = []
    for elem in root.findall('xx:Instrument_Root/*', ns):
        txt = elem.text
        if txt is not None:
            tree1.append(txt)

    Note that Instrument_Root is a direct child of the root node, so it is enough to put just the node name.

  5. Fill tree2 with text content of children of each Instrument_Rating:

    tree2 = []
    for elem in root.findall('.//xx:Instrument_Rating/*', ns):
        txt = elem.text
        if txt is not None and len(txt.strip()) > 0:
            tree2.append(txt)

    This time Instrument_Rating is located somewhere deeper in the XML tree, so the XPath must start with .// to perform an "all levels" search.

    I also added some logic to avoid appending either non-existing text or text containing only whitespace chars (delete it if you don't want to skip them).
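The difference between the two paths in steps 4 and 5 can be seen on a small made-up document (not the file from the question). This is a minimal sketch: the namespace URI is a placeholder, the document contains a single Instrument_Root, and the import falls back to the standard library's `ElementTree`, which supports the same `findall` call.

```python
try:
    from lxml import etree as et
except ImportError:  # the standard library offers the same findall() API
    import xml.etree.ElementTree as et

NS_URI = "http://example.com/instruments"  # placeholder, not the real URI
ns = {"xx": NS_URI}

xml_text = f"""
<Instrument_Roots xmlns="{NS_URI}">
  <Instrument_Root>
    <Security_Description>Class B</Security_Description>
    <Instrument_Ratings>
      <Instrument_Rating>
        <Rating_Class_Text>Senior Secured</Rating_Class_Text>
      </Instrument_Rating>
    </Instrument_Ratings>
  </Instrument_Root>
</Instrument_Roots>
""".strip()
root = et.fromstring(xml_text)

# A bare name only matches direct children of the context node,
# and Instrument_Rating is not a direct child of the root.
direct = root.findall("xx:Instrument_Rating", ns)

# The .// prefix searches all levels below the context node.
deep = root.findall(".//xx:Instrument_Rating", ns)

# Container elements hold only indentation as text, hence the strip() test.
tree1 = [e.text for e in root.findall("xx:Instrument_Root/*", ns)
         if e.text is not None and e.text.strip()]
tree2 = [e.text for e in root.findall(".//xx:Instrument_Rating/*", ns)
         if e.text is not None and e.text.strip()]

print(len(direct), len(deep))  # prints: 0 1
print(tree1, tree2)            # prints: ['Class B'] ['Senior Secured']
```

Because this sketch parses without `remove_blank_text=True`, the whitespace test is applied to both lists.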

For your XML input sample I got:

  1. tree1:

    ['831295951', '831275547', '18705', 'Pass-Through', 'PAS', '2020-03-21T00:00:00',
     'AUD', 'N', '2051-03-21T00:00:00', '2051', '2020-03-21T00:00:00', '7.2534316791',
     'N', 'N', 'Class B', '24657', 'PASS-THRU CTFS', '24922', 'Not Applicable',
     '26', 'Floating', 'FLT', '16', 'Monthly', 'MON', 'MBS - Prime', 'Class B',
     'AUSTRALIA', '11.2500000000', 'Y', '3']
  2. tree2:

    ['831295951', '37203', '2020-03-02T01:30:03', '831295958', 'I', 'Senior Secured',
     '18705', 'Pass-Through', 'PAS', '25636', 'Long-Term Debt Rating', 'LT',
     '18743', 'Senior Secured', 'SS', '25648', 'Credit Risk', '5734', 'Enhanced',
     'ENH', '19142', 'Local Currency', 'NR', '0', '19102', 'DECISION NOT TO RATE',
     'NR', '534', 'Long-Term Debt Rating', 'LT', 'ENH', '2020-03-02T17:30:03', '25530',
     'DECISION NOT TO RATE', '20525', 'Australian Dollar', 'AUD', '1', 'Y']

Note that there is no need for any nested loops.

I think that based on the code above you will know how to extract content to fill tree3 and tree4.
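As a starting point, the same pattern might be applied to tree3 and tree4 as sketched below. The element names Instrument_Rating_Attribute and Instrument_Organization are taken from the code in your question; the namespace URI, the helper `child_texts` and the inline document are assumptions made for this example only.

```python
try:
    from lxml import etree as et
except ImportError:  # the standard library offers the same findall() API
    import xml.etree.ElementTree as et

NS_URI = "http://example.com/instruments"  # placeholder, not the real URI
ns = {"xx": NS_URI}


def child_texts(node, path):
    """Hypothetical helper: non-blank text of every element matched by path."""
    return [e.text for e in node.findall(path, ns)
            if e.text is not None and e.text.strip()]


xml_text = f"""
<Instrument_Roots xmlns="{NS_URI}">
  <Instrument_Root>
    <Instrument_Ratings>
      <Instrument_Rating>
        <Instrument_Rating_Attributes>
          <Instrument_Rating_Attribute>
            <Rating_Attribute_Type_Text>SF Indicator</Rating_Attribute_Type_Text>
          </Instrument_Rating_Attribute>
          <Instrument_Rating_Attribute>
            <Rating_Attribute_Type_Text>SEC Exempt</Rating_Attribute_Type_Text>
          </Instrument_Rating_Attribute>
        </Instrument_Rating_Attributes>
      </Instrument_Rating>
    </Instrument_Ratings>
    <Instrument_Organizations>
      <Instrument_Organization>
        <Organization_Role_Text>Issuer Account Bank</Organization_Role_Text>
      </Instrument_Organization>
      <Instrument_Organization>
        <Organization_Role_Text>Cash Manager</Organization_Role_Text>
      </Instrument_Organization>
    </Instrument_Organizations>
  </Instrument_Root>
</Instrument_Roots>
""".strip()
root = et.fromstring(xml_text)

# Same "all levels" search as for tree2, only the element name changes.
tree3 = child_texts(root, ".//xx:Instrument_Rating_Attribute/*")
tree4 = child_texts(root, ".//xx:Instrument_Organization/*")

print(tree3)  # prints: ['SF Indicator', 'SEC Exempt']
print(tree4)  # prints: ['Issuer Account Bank', 'Cash Manager']
```

If the values should stay separate per Instrument_Root, the helper can be called on each element returned by `root.findall('xx:Instrument_Root', ns)` instead of on `root`, since the `.//` search is relative to the node it is called on.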