I have a fairly large XML file (about 3 GB) that I want to parse in streaming mode using the xmltodict utility. My code iterates through each item, builds a dictionary for it, and appends it to a list in memory, which is eventually dumped to a file as JSON.
I have the following working perfectly on a small XML data set:
import xmltodict, json
import io
output = []
def handle(path, item):
    #do stuff
    return
doc_file = open("affiliate_partner_feeds.xml","r")
doc = doc_file.read()
xmltodict.parse(doc, item_depth=2, item_callback=handle)
f = open('jbtest.json', 'w')
json.dump(output,f)
On a large file, I get the following:
Traceback (most recent call last):
  File "jbparser.py", line 125, in <module>
    xmltodict.parse(doc, item_depth=2, item_callback=handle)
  File "/usr/lib/python2.7/site-packages/xmltodict.py", line 248, in parse
    parser.Parse(xml_input, True)
OverflowError: size does not fit in an int
The exact location of the exception inside xmltodict.py is:
def parse(xml_input, encoding=None, expat=expat, process_namespaces=False,
          namespace_separator=':', **kwargs):
    handler = _DictSAXHandler(namespace_separator=namespace_separator,
                              **kwargs)
    if isinstance(xml_input, _unicode):
        if not encoding:
            encoding = 'utf-8'
        xml_input = xml_input.encode(encoding)
    if not process_namespaces:
        namespace_separator = None
    parser = expat.ParserCreate(
        encoding,
        namespace_separator
    )
    try:
        parser.ordered_attributes = True
    except AttributeError:
        # Jython's expat does not support ordered_attributes
        pass
    parser.StartElementHandler = handler.startElement
    parser.EndElementHandler = handler.endElement
    parser.CharacterDataHandler = handler.characters
    parser.buffer_text = True
    try:
        parser.ParseFile(xml_input)
    except (TypeError, AttributeError):
        parser.Parse(xml_input, True)  # <-- OverflowError is raised here
    return handler.item
Is there any way to get around this? AFAIK, the xmlparser object is not exposed for me to play around with and change 'int' to 'long'. More importantly, what is really going on here? I would really appreciate any leads on this. Thanks!
Try using marshal.load(file) or marshal.load(sys.stdin) to deserialize the file (or consume it as a stream), instead of reading the whole file into memory and then parsing it as a whole.
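As for what is going on: the traceback ends in parser.Parse(xml_input, True), which is the fallback branch taken when xml_input is a string rather than a file. Most likely the roughly 3 GB string returned by doc_file.read() is longer than a C int can describe, which is what "size does not fit in an int" is complaining about. Judging from the xmltodict source quoted in the question, passing the open file object itself (probably opened with "rb", which is an assumption on my part) instead of doc_file.read() should take the parser.ParseFile branch and avoid building that string at all. Below is a minimal sketch of that mechanism using only the standard-library expat module; the function name stream_item_names and the sample XML are made up for illustration and are not part of xmltodict.

```python
import io
from xml.parsers import expat


def stream_item_names(xml_file, item_depth=2):
    """Return the tag name of every element found at item_depth.

    xml_file must be a binary file-like object; it is read in chunks.
    (Hypothetical helper, only meant to illustrate ParseFile.)
    """
    names = []
    depth = [0]  # one-element list so the nested handlers can mutate it

    def start(name, attrs):
        depth[0] += 1
        if depth[0] == item_depth:
            names.append(name)

    def end(name):
        depth[0] -= 1

    parser = expat.ParserCreate()
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    # ParseFile repeatedly calls xml_file.read(nbytes), so expat never
    # receives one giant string the way parser.Parse(whole_doc, True) does.
    parser.ParseFile(xml_file)
    return names


# Tiny in-memory stand-in for the real feed (made-up data).
sample = io.BytesIO(b"<feeds><feed id='1'/><feed id='2'/><other/></feeds>")
print(stream_item_names(sample))
```

With a real file you would pass open("affiliate_partner_feeds.xml", "rb") in place of the BytesIO object. Note that this only removes the giant input string; accumulating every item in the output list can still use a lot of memory for a 3 GB feed.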
Here is an example:
STDIN: