Parsing large XML file using 'xmltodict' module results in OverflowError

I have a fairly large XML file (about 3 GB) that I want to parse in streaming mode using the 'xmltodict' module. My code iterates over each item, builds a dictionary from it, and appends it to a list in memory, which is eventually dumped to a file as JSON.

I have the following working perfectly on a small XML data set:

    import xmltodict, json
    import io

    output = []

    def handle(path, item):
       #do stuff
       return True  # xmltodict stops parsing if the callback returns a falsy value

    doc_file = open("affiliate_partner_feeds.xml","r")
    doc = doc_file.read()        
    xmltodict.parse(doc, item_depth=2, item_callback=handle)

    f = open('jbtest.json', 'w')
    json.dump(output,f)

On a large file, I get the following:

    Traceback (most recent call last):
      File "jbparser.py", line 125, in <module>
        xmltodict.parse(doc, item_depth=2, item_callback=handle)
      File "/usr/lib/python2.7/site-packages/xmltodict.py", line 248, in parse
        parser.Parse(xml_input, True)
    OverflowError: size does not fit in an int

The exact location of the exception inside xmltodict.py is the parser.Parse call at the end of parse():

    def parse(xml_input, encoding=None, expat=expat, process_namespaces=False,
              namespace_separator=':', **kwargs):
        handler = _DictSAXHandler(namespace_separator=namespace_separator,
                                  **kwargs)
        if isinstance(xml_input, _unicode):
            if not encoding:
                encoding = 'utf-8'
            xml_input = xml_input.encode(encoding)
        if not process_namespaces:
            namespace_separator = None
        parser = expat.ParserCreate(
            encoding,
            namespace_separator
        )
        try:
            parser.ordered_attributes = True
        except AttributeError:
            # Jython's expat does not support ordered_attributes
            pass
        parser.StartElementHandler = handler.startElement
        parser.EndElementHandler = handler.endElement
        parser.CharacterDataHandler = handler.characters
        parser.buffer_text = True
        try:
            parser.ParseFile(xml_input)
        except (TypeError, AttributeError):
            parser.Parse(xml_input, True)  # <-- OverflowError is raised here
        return handler.item

Is there any way to get around this? AFAIK, the xmlparser object is not exposed, so I cannot go in and change 'int' to 'long' myself. More importantly, what is really going on here? I would really appreciate any leads on this. Thanks!

There are 1 best solutions below

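The problem is that the whole file is read into memory and then parsed as one string. When xmltodict.parse receives a string, parser.ParseFile fails and it falls back to parser.Parse(xml_input, True); as far as I can tell, the expat binding in Python 2.7 passes the length of that buffer as a C int, which cannot hold a size of about 3 GB, hence "size does not fit in an int". Instead of calling doc_file.read(), pass the file object itself (opened in binary mode, "rb") to xmltodict.parse, for example xmltodict.parse(doc_file, item_depth=2, item_callback=handle). xmltodict then uses parser.ParseFile, which reads the input in chunks. Note that the callback has to return True for parsing to continue. Alternatively, run xmltodict.py as a command-line filter and unserialize the items it writes with marshal.load(sys.stdin), so the file is used as a stream.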

Here is an example:

    >>> import xmltodict
    >>> from gzip import GzipFile
    >>> def handle_artist(_, artist):
    ...     print(artist['name'])
    ...     return True  # a falsy return value stops the parse
    ...
    >>> xmltodict.parse(GzipFile('discogs_artists.xml.gz'),
    ...     item_depth=2, item_callback=handle_artist)
    A Perfect Circle
    Fantômas
    King Crimson
    Chris Potter
    ...
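
To see why a file object behaves differently from one huge string, here is a minimal sketch that uses the standard library's expat directly, the same parser the xmltodict source quoted in the question wraps. It only counts elements; the function name count_items, the item_tag parameter, and the tiny sample document are made up for illustration, and it is written for Python 3.

```python
import io
import xml.parsers.expat as expat


def count_items(xml_file, item_tag="item"):
    """Stream-parse a binary file-like object and count <item_tag> elements."""
    counts = {"items": 0}

    def start_element(name, attrs):
        # Called by expat once for every opening tag it encounters.
        if name == item_tag:
            counts["items"] += 1

    parser = expat.ParserCreate()
    parser.StartElementHandler = start_element
    # ParseFile calls xml_file.read(n) repeatedly, so the document is
    # consumed chunk by chunk and never held in a single string.
    parser.ParseFile(xml_file)
    return counts["items"]


# Hypothetical stand-in for a large file opened with open(path, "rb").
sample = io.BytesIO(b"<feed><item>a</item><item>b</item></feed>")
print(count_items(sample))
```

Passing a file object to xmltodict.parse takes this ParseFile branch; passing the result of read() takes the parser.Parse branch that fails on very large inputs.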

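STDIN (the consumer script when xmltodict.py is run as a filter, e.g. cat affiliate_partner_feeds.xml | xmltodict.py 2 | myscript.py):

    import sys, marshal
    # marshal needs the binary stream on Python 3; Python 2 has no .buffer
    stdin = getattr(sys.stdin, 'buffer', sys.stdin)
    while True:
        try:
            _, article = marshal.load(stdin)
        except EOFError:
            break  # the producer closed the pipe: no more items
        print(article['title'])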
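
The stream that script reads is just a sequence of marshal-dumped (path, item) tuples written back to back. A rough sketch of both ends of that pipe, using an in-memory buffer instead of a real pipe, might look like this. The helper names write_items and read_items and the sample records are hypothetical, and the path is simplified to a plain string here.

```python
import io
import marshal


def write_items(stream, items):
    """Producer side: dump each (path, item) tuple back to back."""
    for pair in items:
        marshal.dump(pair, stream)


def read_items(stream):
    """Consumer side: yield tuples until the stream is exhausted."""
    while True:
        try:
            yield marshal.load(stream)
        except EOFError:
            return  # end of stream: nothing left to unserialize


# In-memory stand-in for the pipe between producer and consumer.
buf = io.BytesIO()
write_items(buf, [("page", {"title": "A"}), ("page", {"title": "B"})])
buf.seek(0)
titles = [item["title"] for _, item in read_items(buf)]
print(titles)
```

Only one item is unserialized at a time, so memory use stays bounded by the size of a single item rather than the size of the whole file.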