Is there a way to skip nodes/elements with iterparse lxml?


Is there a way using lxml iterparse to skip an element without checking the tag? Take this xml for example:

<root>
    <sample>
        <tag1>text1</tag1>
        <tag2>text2</tag2>
        <tag3>text3</tag3>
        <tag4>text4</tag4>
    </sample>
    <sample>
        <tag1>text1</tag1>
        <tag2>text2</tag2>
        <tag3>text3</tag3>
        <tag4>text4</tag4>
    </sample>
</root>
    

If I only care about tag1 and tag4, checking tag2 and tag3 wastes time. For a small file it doesn't really matter, but with a million <sample> nodes I could cut the search time if I didn't have to check tag2 and tag3. They're always there and I never need them.

Here is what I have so far, using iterparse in lxml:

from lxml import etree

xmlfile = 'myfile.xml'
my_list = []

# Report only the end of each <sample> element
context = etree.iterparse(xmlfile, events=('end',), tag='sample')

for event, elem in context:
    for child in elem:
        if child.tag == 'tag1':
            my_list.append(child.text)

            # HERE I'd like to advance the loop twice without checking tag2 and tag3 at all
            # something like:

            # next(child)
            # next(child)

        elif child.tag == 'tag4':
            my_list.append(child.text)
    
Best answer, by Daniel Haley:

If you pass the tag argument to iterchildren, as you already do with iterparse, lxml only yields the matching children, so you can "skip" elements other than tag1 and tag4.

Example...

from lxml import etree

xmlfile = "myfile.xml"

my_list = []

for event, elem in etree.iterparse(xmlfile, tag="sample"):
    for child in elem.iterchildren(tag=["tag1", "tag4"]):
        if child.tag == "tag1":
            my_list.append(child.text)
        elif child.tag == "tag4":
            my_list.append(child.text)

print(my_list)

Printed output...

['text1', 'text4', 'text1', 'text4']
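
For a file with a very large number of <sample> nodes, it may also help to release each element once it has been processed. The following is only a sketch of the same approach under a few assumptions: the XML is held in a hypothetical in-memory XML constant standing in for myfile.xml so that it runs as-is, collect_wanted is an illustrative helper name rather than anything from lxml, and the elem.clear() call is an addition that the answer above does not use.

```python
import io
from lxml import etree

# Hypothetical in-memory stand-in for myfile.xml, so the sketch runs as-is.
XML = b"""<root>
    <sample>
        <tag1>text1</tag1>
        <tag2>text2</tag2>
        <tag3>text3</tag3>
        <tag4>text4</tag4>
    </sample>
    <sample>
        <tag1>text1</tag1>
        <tag2>text2</tag2>
        <tag3>text3</tag3>
        <tag4>text4</tag4>
    </sample>
</root>"""


def collect_wanted(source, wanted=("tag1", "tag4")):
    """Collect the text of the wanted children of every <sample> element."""
    values = []
    # tag="sample" limits the reported events to <sample> elements; the
    # default event is "end", so each element arrives with its children parsed.
    for _event, elem in etree.iterparse(source, tag="sample"):
        # iterchildren takes one or more tag names and yields only matches,
        # so tag2 and tag3 never reach the Python-level loop.
        for child in elem.iterchildren(*wanted):
            values.append(child.text)
        # Assumption: nothing else needs this subtree, so drop its contents
        # to keep memory use down on large files.
        elem.clear()
    return values


my_list = collect_wanted(io.BytesIO(XML))
print(my_list)  # ['text1', 'text4', 'text1', 'text4']
```

The parser still has to read tag2 and tag3 from the file; the filter only avoids handling them in Python code.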