etree.ElementTree parses XML and builds a tree: is that an efficiently searchable data structure?

I have an XML string

<tags>
   <person1>dave jones</person1>
   <person2>ron matthews</person2>
   <person3>sally van heerden</person3>
   <place>tygervalley</place>
   <ocassion>shopping</ocassion>
</tags>

and I would like to search this XML string for terms such as "Sally Van Heerden" or "Tygervalley".

Is it faster to use a regex to find the terms in this string, or is Python's find() method fast enough? I could also parse the string with the ElementTree XML parser, build the XML tree, and search that, but I fear it will be too slow.

Which of these three is the fastest? Any other suggestions are welcome.

There are 2 best solutions below

I compared regex and lxml on small XML files and found no significant difference between them.

The answer will really depend on what you are going to do with the search results. The only case when you should even consider not using an XML parser is when you don't remotely care about the XML document structure.

If that is the case, you can try timing all three, but building a tree is then unnecessary and will likely take too long to compete with a plain substring search.
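To make the trade-off concrete, here is a minimal sketch of the three approaches applied to the XML string from the question. The function names are made up for illustration, it uses the standard-library xml.etree.ElementTree rather than lxml, and the case-insensitive matching is an assumption based on the capitalised search terms in the question. Only the parser version can say which element a term was found in; the other two only say that it occurs somewhere.

```python
import re
import xml.etree.ElementTree as ET

XML = """<tags>
   <person1>dave jones</person1>
   <person2>ron matthews</person2>
   <person3>sally van heerden</person3>
   <place>tygervalley</place>
   <ocassion>shopping</ocassion>
</tags>"""


def found_by_substring(xml_text, term):
    # Plain substring test; lowercasing both sides makes it case-insensitive.
    return term.lower() in xml_text.lower()


def found_by_regex(xml_text, term):
    # re.escape keeps characters in the term from being read as regex syntax.
    return re.search(re.escape(term), xml_text, re.IGNORECASE) is not None


def tags_containing(xml_text, term):
    # Parse into a tree and list the tags whose text equals the term.
    root = ET.fromstring(xml_text)
    wanted = term.lower()
    return [el.tag for el in root.iter()
            if (el.text or "").strip().lower() == wanted]


print(found_by_substring(XML, "Tygervalley"))
print(found_by_regex(XML, "Sally Van Heerden"))
print(tags_containing(XML, "Sally Van Heerden"))
```

Note that the substring and regex versions would also match a term that happens to appear inside a tag name or attribute, which is exactly the structure they ignore.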

Time all three to see the difference on a typical file for your problem. For instance, on your small example file:

$ python -m timeit "any('tygervalley' in line for line in open('t.xml'))"
100000 loops, best of 3: 14.6 usec per loop

$ python -m timeit "import re" "for line in open('t.xml'):" "    re.findall('tygervalley', line)"
10000 loops, best of 3: 27.4 usec per loop


$ python -m timeit "from lxml.etree import parse" "tree = parse('t.xml')" "tree.xpath('//*[text()=\'tygervalley\']')"
10000 loops, best of 3: 133 usec per loop

You can experiment with which methods to call; there is always more than one option.
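As one way to try other variants, the same kind of comparison can be scripted with the timeit module instead of the command line. This is only a sketch under assumptions: it searches an in-memory string rather than t.xml, uses the standard-library ElementTree instead of lxml, and the candidate names and the time_candidates helper are invented for this example. The numbers it prints will differ from the ones above and between machines.

```python
import re
import timeit
import xml.etree.ElementTree as ET

# Shortened in-memory stand-in for the example file (an assumption).
XML = (
    "<tags><person1>dave jones</person1>"
    "<place>tygervalley</place>"
    "<ocassion>shopping</ocassion></tags>"
)
TERM = "tygervalley"
PATTERN = re.compile(re.escape(TERM))  # compiled once, outside the timed code

# Each candidate returns True when the term is found.
candidates = {
    "in operator": lambda: TERM in XML,
    "str.find": lambda: XML.find(TERM) != -1,
    "compiled regex": lambda: PATTERN.search(XML) is not None,
    "ElementTree": lambda: any(
        el.text == TERM for el in ET.fromstring(XML).iter()
    ),
}


def time_candidates(number=10000):
    # Seconds per call for each candidate, taking the best of 3 repeats.
    results = {}
    for name, func in candidates.items():
        best = min(timeit.repeat(func, number=number, repeat=3))
        results[name] = best / number
    return results


for name, secs in time_candidates().items():
    print(f"{name:15s} {secs * 1e6:8.2f} usec per call")
```

Here the regex is compiled ahead of time and the tree is rebuilt on every call, so the regex figure excludes setup while the ElementTree figure includes parsing, which is the cost the question is worried about.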

Edit: note how things change on a file 100 times longer:

$ python -m timeit "any('tygervalley' in line for line in open('t.xml'))"
100000 loops, best of 3: 20.8 usec per loop

$ python -m timeit "import re" "for line in open('t.xml'):" "    re.findall('tygervalley', line)"
1000 loops, best of 3: 252 usec per loop

$ python -m timeit "from lxml.etree import parse" "tree = parse('t.xml')" "tree.xpath('//*[text()=\'tygervalley\']')"
1000 loops, best of 3: 1.34 msec per loop

Be careful interpreting the results :)
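One thing worth keeping in mind when reading these numbers: any() stops at the first line that matches, while the re.findall loop and the tree parse always process the whole file. The sketch below counts how many lines each style actually reads. It assumes the longer file is simply the small one repeated 100 times, which the answer does not state, and the counting helper is made up for this illustration.

```python
import re


def counting(lines, counter):
    # Yield lines one by one while recording how many were consumed.
    for line in lines:
        counter[0] += 1
        yield line


sample = [
    "<tags>",
    "   <person1>dave jones</person1>",
    "   <place>tygervalley</place>",
    "</tags>",
]
long_doc = sample * 100  # assumed shape of the "100 times longer" file

# Style 1: any() with a generator short-circuits on the first hit.
seen_any = [0]
found = any("tygervalley" in line for line in counting(long_doc, seen_any))

# Style 2: the findall loop visits every line regardless of matches.
seen_loop = [0]
for line in counting(long_doc, seen_loop):
    re.findall("tygervalley", line)

print("lines read by any():      ", seen_any[0])
print("lines read by findall loop:", seen_loop[0])
```

If the term sits near the top of the file, the any() timing says little about how the search scales; a term that is absent, or located at the end, would be a fairer comparison.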