I have a script that goes through a directory with many XML files and extracts or adds information to these files. I use XPath to identify the elements of interest.
The relevant piece of code is this:
import os

import lxml.etree as et
import lxml.sax

# deleted non relevant code

for root, dirs, files in os.walk(ROOT):
    # iterate all files
    for file in files:
        if file.endswith('.xml'):
            # join the directory currently being walked and the file name
            # (using root rather than ROOT so files in subdirectories resolve)
            file_path = os.path.join(root, file)
            # load root element from file
            file_root = et.parse(file_path).getroot()
            # This is a function that I define elsewhere in which I use XPath to identify relevant
            # elements and extract, change or add some information
            xml_dosomething(file_root)
            # init tree object from file_root
            tree = et.ElementTree(file_root)
            # save modified xml tree object to a new file name so that I can keep a copy of the original
            tree.write(
                file_path.replace('.xml', '-clean.xml'),
                encoding='utf-8',
                doctype='<!DOCTYPE document SYSTEM "estcorpus.dtd">',
                xml_declaration=True,
            )
I have seen in various places that people recommend using SAX to speed up the processing of large files. After checking the documentation of the lxml SAX module (https://lxml.de/sax.html), I'm at a loss as to how to modify my code so that I can leverage it. I can see the following in the documentation:
handler = lxml.sax.ElementTreeContentHandler()
Then there is a list of statements like handler.startElementNS((None, 'a'), 'a', {}) that populate the handler's "document" (?) with what would be the elements of the XML document. After that I see:
tree = handler.etree
lxml.etree.tostring(tree.getroot())
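Put together, the pattern from the documentation looks roughly like the sketch below. The events are fired by hand, which is exactly the part I do not know how to replace with file input; the variable name result is my own addition.

```python
import lxml.etree
import lxml.sax

# The handler collects SAX events and builds an lxml tree from them.
handler = lxml.sax.ElementTreeContentHandler()

# Hand-written events describing <a><b foo="bar">Hello world</b></a>.
handler.startElementNS((None, 'a'), 'a', {})
handler.startElementNS((None, 'b'), 'b', {(None, 'foo'): 'bar'})
handler.characters('Hello world')
handler.endElementNS((None, 'b'), 'b')
handler.endElementNS((None, 'a'), 'a')

# The tree assembled from the events above.
tree = handler.etree
result = lxml.etree.tostring(tree.getroot())
print(result)
```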
I think I understand what handler.etree does, but my problem is that I want the handler's input to be the files in the directory I'm working with, rather than a document that I build by hand with handler.startElementNS and the like. What do I need to change in my code so that the SAX module does its work with the files as input?
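As a point of comparison, the standard library's xml.sax.parse takes a file path and a content handler directly, so the parser fires the events rather than my own code. Below is a minimal sketch of that file-driven pattern, assuming the standard-library parser is an acceptable stand-in for illustration. ElementCounter, count_elements and the sample file are hypothetical names made up for this example, and I have not checked whether this is the intended way to drive the lxml handler.

```python
import os
import tempfile
import xml.sax


class ElementCounter(xml.sax.ContentHandler):
    """Hypothetical handler: counts start tags by element name."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def startElement(self, name, attrs):
        # Called by the parser once for every opening tag in the file.
        self.counts[name] = self.counts.get(name, 0) + 1


def count_elements(path):
    """Stream the file at `path` through the handler and return its counts."""
    handler = ElementCounter()
    xml.sax.parse(path, handler)
    return handler.counts


# Small demonstration on a throwaway file (stands in for one of my XML files).
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.xml")
    with open(sample_path, "w", encoding="utf-8") as fh:
        fh.write("<document><p>one</p><p>two</p></document>")
    sample_counts = count_elements(sample_path)

print(sample_counts)
```

Here the file never exists as a complete tree in memory; the handler only sees one event at a time, which is why XPath over the whole document is not available in this style.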