How to skip a node which raises an error when using cElementTree.iterparse()


I am trying to parse a very large XML file, lowercasing the text and removing punctuation. When I parse the file with cElementTree's iterparse(), which is meant for large files, it eventually hits a badly formed tag or character and raises a syntax error:

SyntaxError: not well-formed (invalid token): line 639337, column 4

Note: It is nearly impossible for me to open and read the file, so I cannot see where the problem is.

How can I skip or fix this?

from xml.etree import cElementTree as cET

for event, elem in cET.iterparse(xmlFile, events=("start", "end")):
    pass  # ...do something...

There are 2 best solutions below

AudioBubble On

You could use a tool like xmllint to verify and clean your XML. The errors reported by this tool should help you to fix the XML file.

Edit: An example:

$ cat invalid.xml 
<?xml version="1.0"?>
<foo>
<bar>
</foo>
$ xmllint invalid.xml 
invalid.xml:4: parser error : Opening and ending tag mismatch: bar line 3 and foo
</foo>
      ^
invalid.xml:5: parser error : Premature end of data in tag foo line 2

^
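If the file is too large to open in an editor, the line number in the error message can still be inspected from Python. The following is a minimal sketch using only the standard library, not part of the original answer: the function name show_parse_error and its context parameter are hypothetical. It relies on ParseError (a subclass of SyntaxError, which is why the traceback says SyntaxError) carrying a position attribute with the line and column of the failure.

```python
import itertools
import xml.etree.ElementTree as ET


def show_parse_error(path, context=2):
    """Parse `path` incrementally. On a ParseError, return the offending
    line number, the column, and the surrounding lines of the file.
    Returns None if the file parses cleanly."""
    try:
        for _event, elem in ET.iterparse(path, events=("end",)):
            elem.clear()  # drop element contents once processed to limit memory use
    except ET.ParseError as err:
        line_no, column = err.position  # line is 1-based
        first = max(line_no - context, 1)
        # Read as bytes: the invalid token may not be decodable text.
        with open(path, "rb") as fh:
            # islice streams through the file and keeps only the wanted lines
            window = list(itertools.islice(fh, first - 1, line_no + context))
        return line_no, column, [(first + i, raw) for i, raw in enumerate(window)]
    return None
```

Calling show_parse_error(xmlFile) on the file from the question should return the lines around line 639337, which is usually enough to see what kind of damage the parser ran into. Note that this only locates the first error; it does not skip or repair anything.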
Martijn Pieters On

Use lxml instead of the standard library ElementTree. It supports the same API but can also handle broken XML, attempting to recover from errors where at all possible:

from lxml import etree

# iterparse() takes parser options such as recover as keyword arguments
context = etree.iterparse(filename, events=("start", "end"), recover=True)