How to speed up XBRL file parsing in Python with lxml?

I am trying to parse an XBRL file (1.35 GB) with Arelle. While debugging I noticed that execution hangs at ModelDocument.py:157 for more than 30 minutes. The Python process takes about 8 GB of RAM and its memory consumption keeps slowly growing:

[screenshot of the process's memory usage]

It looks like Python parses the XML at 20-50 KB/s, which is extremely slow, especially given that lxml's parser is C code (libxml2) under the hood. Note also that one core is loaded at 100%, so the CPU is doing heavy work (but what exactly?)
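One way to answer the "what exactly" is to have the interpreter print its own stack while it hangs. A minimal diagnostic sketch using the standard-library faulthandler module (the slow call itself is whatever loader is being debugged; the trailing comment is illustrative, not Arelle-specific code):

```python
import faulthandler
import sys

# Print every thread's traceback to stderr every 60 seconds. If the
# process is stuck, the repeated frames show exactly which line the
# CPU time is going into.
faulthandler.dump_traceback_later(60, repeat=True, file=sys.stderr)

# ... then run the slow load as usual, e.g. Arelle's loader or lxml.etree.parse
```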

Any ideas how XBRL parsing can be sped up?

System: Windows 10, Python 3.7.3 (v3.7.3:ef4ec6ed12, Mar 25 2019, 22:22:05)

1 Answer


Maybe my answer will be more relevant, in the long term, to developers of XBRL processors, but I would encourage taking a look at what makes an instance streaming-friendly, notably the following candidate recommendation by XBRL International:

https://specifications.xbrl.org/work-product-index-streaming-extensions-streaming-extensions-1.0.html

Producing and consuming large XBRL instances in a streaming fashion avoids getting stuck at the parsing call: instead of loading and parsing the entire instance in bulk, streaming keeps memory pressure low, because the facts can be converted on the fly into the processor's internal memory structures.
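Even without a streaming-aware processor, lxml itself can stream. A minimal sketch using lxml.etree.iterparse, assuming a flat instance document where facts are direct children of the root <xbrli:xbrl> element and carry a contextRef attribute (tuples and footnotes are ignored for brevity; the file name is illustrative):

```python
from lxml import etree

def stream_facts(path):
    # "end" events fire once each element is fully parsed, so the full
    # tree never has to be materialised in memory.
    for _, elem in etree.iterparse(path, events=("end",)):
        parent = elem.getparent()
        # Direct children of the root are contexts, units, and facts;
        # of those, only facts carry a contextRef attribute.
        if parent is not None and parent.getparent() is None:
            if elem.get("contextRef") is not None:
                yield etree.QName(elem).localname, elem.get("contextRef"), elem.text
            # Free memory: drop this subtree and any already-seen siblings.
            elem.clear()
            while elem.getprevious() is not None:
                del parent[0]

for name, ctx, value in stream_facts("instance.xbrl"):
    pass  # convert each fact on the fly into your own data structure
```

With this pattern, memory stays roughly constant regardless of instance size, at the cost of re-implementing the fact-to-model conversion that a full DOM load otherwise gives you for free.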

In general, just streaming through 1-2 GB of data while doing simple things with it takes much less than a minute. If it takes 30 minutes, the processor's implementation likely has optimization potential. I do not think this is an issue with Arelle alone, and I expect that, as more users open larger files, implementers will at some point start looking into this.