How to split a large XBRL file?


I have an XBRL file which is ~50 GB. When I try to open it via Arelle I get a MemoryError. Is there a way to split an XBRL file into smaller pieces? Does the XBRL specification support this?


There are 2 best solutions below


Let me first give a general comment from a database perspective (agnostic to XBRL).

When dealing with large amounts of data, it is common practice in data management to indeed split the input into multiple, smaller files (up to a few hundred MB each) located in the same directory, typically with file names carrying increasing integers. This has practical benefits, such as making it much easier to copy the dataset over to other locations.

I am not sure, however, whether there is yet a public standard for splitting XBRL instances in this way (even though this would be relatively straightforward for an engine developer to implement: just partition the facts and write each partition to its own file, together with only the contexts and units in its transitive closure -- it is really a matter of standardizing the way it is done).

Very large files (50 GB and beyond), however, can in general still be read with limited memory (say, 16 GB or even less) for queries that are streaming-friendly (such as filtering, projecting, counting, or converting to another format).

In the case of XBRL, the trick is to structure the file in such a way that it can be read in a streaming fashion, as pdw mentions. I recommend looking at the following official document by XBRL International [1], which is now a candidate recommendation and which explains how to create XBRL instances that can be read in a streaming fashion:

[1] https://specifications.xbrl.org/work-product-index-streaming-extensions-streaming-extensions-1.0.html

If the engine supports this, there is no theoretical limit to the size an instance can have, except for the capacity of your disk and how much intermediate data the query needs to maintain in memory as it streams through (for example, a grouping query aggregating a count will need to keep track of its keys and the associated counts). 50 GB is relatively small compared to what can be done. I would still expect it to take at least a one- or two-digit number of minutes to process, depending on the exact use case.
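To make this concrete, here is a minimal sketch of a streaming-friendly query in Python with lxml: it counts facts per concept while discarding each element as soon as it has been processed, so memory stays roughly proportional to the number of distinct concepts rather than to the file size. It assumes a plain xBRL-XML instance in which item facts can be recognized by their contextRef attribute, and instance.xbrl is just a placeholder path.

```python
from collections import Counter
from lxml import etree

def count_facts_per_concept(path):
    """Stream a (possibly huge) xBRL-XML instance and count facts per concept."""
    counts = Counter()
    for _, elem in etree.iterparse(path, events=("end",)):
        if elem.get("contextRef") is not None:   # item facts carry a contextRef
            counts[etree.QName(elem).localname] += 1
        elem.clear()                             # drop the element's content
        while elem.getprevious() is not None:    # and already-processed siblings
            del elem.getparent()[0]
    return counts

if __name__ == "__main__":
    for concept, n in count_facts_per_concept("instance.xbrl").most_common(10):
        print(n, concept)
```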

I am not sure whether Arelle supports streaming at this point. Most XBRL processors today materialize the instance in memory, but I expect that there will be some XBRL processors out there that will implement the Streaming Extensions.

Finally, I second pdw's point that reducing the size of the input, for example by using the CSV syntax, can help with both speed and memory footprint. It is likely that a 50 GB XBRL instance can be stored in much less than 50 GB with the right format, and tables (CSV) are a pretty good way to do that. Having said that, one should also keep in mind that the syntax used on disk does not have to match the data structures in memory, which any engine is free to design the way it sees fit as long as the outside behavior is unchanged.
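As a rough, back-of-the-envelope illustration of why a tabular representation is denser (this is not spec-conformant xBRL-CSV, just a size comparison between one fact written in the XML syntax and the same fact written as a row of a fact table, with the column meanings declared once elsewhere):

```python
# One fact in xBRL-XML syntax: element name, context/unit references, decimals, value.
xml_fact = (
    '<us-gaap:Revenues contextRef="c-2023" unitRef="usd" decimals="-6">'
    '1234000000</us-gaap:Revenues>'
)

# The same fact as a CSV row; the column meanings (concept, period, unit,
# decimals, value) would be stored once in a header or metadata file.
csv_row = "Revenues,2023,usd,-6,1234000000"

print(len(xml_fact), len(csv_row))  # ~95 bytes vs. ~31 bytes for this one fact
```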


There is not an easy or standard way to split up an XBRL file into smaller pieces, although there are ways it can be done. You could copy batches of facts into separate files, but when doing so, you'd need to make sure that you also copy the referenced context and unit definitions for those facts. This is made trickier by the fact that the contexts and units may appear before or after the facts that reference them, so you'd probably need to do it in multiple streaming parses.
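For what it is worth, here is a rough sketch of that multi-pass approach in Python with lxml. It assumes a plain xBRL-XML instance in which item facts can be recognized by their contextRef attribute; the output file name pattern, the batch size, and the omission of schemaRef, the original namespace declarations, tuples and footnote links are all simplifications, so treat it as a starting point rather than a working tool.

```python
from lxml import etree

XBRLI = "http://www.xbrl.org/2003/instance"
FACTS_PER_FILE = 100_000  # batch size; tune to the output file size you want


def iter_top_level(path):
    """Stream the direct children of the root element, freeing memory as we go."""
    for _, elem in etree.iterparse(path, events=("end",)):
        parent = elem.getparent()
        if parent is not None and parent.getparent() is None:  # child of the root
            yield elem
            elem.clear()
            while elem.getprevious() is not None:  # drop already-handled siblings
                del parent[0]


def split_instance(path, prefix):
    """Split a large xBRL-XML instance into smaller ones (sketch only)."""
    # Pass 1: collect context and unit definitions by id. They may appear
    # before or after the facts that reference them, hence the separate pass.
    contexts, units = {}, {}
    for elem in iter_top_level(path):
        qn = etree.QName(elem)
        if qn.namespace == XBRLI and qn.localname in ("context", "unit"):
            store = contexts if qn.localname == "context" else units
            store[elem.get("id")] = etree.tostring(elem)

    # Pass 2: stream the facts, buffer them in batches, and write each batch
    # to its own file together with only the contexts/units it references.
    batch, part = [], 0

    def flush():
        nonlocal part
        if not batch:
            return
        part += 1
        with open(f"{prefix}-{part:04d}.xbrl", "wb") as out:
            out.write(b'<?xml version="1.0"?>\n')
            out.write(f'<xbrli:xbrl xmlns:xbrli="{XBRLI}">\n'.encode())
            # NOTE: schemaRef and the original namespace declarations would
            # also have to be copied here in a real implementation.
            for cid in sorted({c for _, c, _ in batch}):
                out.write(contexts[cid] + b"\n")
            for uid in sorted({u for _, _, u in batch if u}):
                out.write(units[uid] + b"\n")
            for fact, _, _ in batch:
                out.write(fact + b"\n")
            out.write(b"</xbrli:xbrl>\n")
        batch.clear()

    for elem in iter_top_level(path):
        if elem.get("contextRef") is not None:  # item facts carry a contextRef
            batch.append((etree.tostring(elem),
                          elem.get("contextRef"), elem.get("unitRef")))
            if len(batch) >= FACTS_PER_FILE:
                flush()
    flush()


# Usage (hypothetical file names):
# split_instance("huge-instance.xbrl", "huge-instance-part")
```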

If you are generating the data yourself, I'd recommend looking at xBRL-CSV. This is a new specification suited to representing large, record-based XBRL datasets in a much more compact form. I believe that there is initial support for this in Arelle.