How to make Dask read BSON files?

I need to read BSON files from a MongoDB dump.

I'm thinking about doing this by writing a custom Dask BSON reader.

The problem is parsing the MongoDB BSON files and iterating through a whole file. While iterating through the BSON, I need to find where each BSON block ends, so that I can separate the blocks and avoid loading the whole file into memory. The bson module from pymongo can do this, but it only returns a file iterator, and that iterator cannot be used in something like dask.bag.load_csv(file).map(iterator).

What is the boilerplate for writing a custom dask.bag.bson_read? Or does anyone have other ideas?

There is 1 best solution below

If the block endings are consistent, you can use the dask.bytes.read_bytes function (passing the block ending as its delimiter argument) to create a list of Dask delayed objects. You can then apply your chunk-of-bytes -> list-or-dataframe function to each of those delayed objects, and then use from_delayed from either dask.bag or dask.dataframe to build a collection from the results.

https://docs.dask.org/en/latest/delayed-collections.html
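
A minimal sketch of the delayed-then-from_delayed pattern described above, under some assumptions. BSON documents start with a 4-byte little-endian length rather than ending with a fixed delimiter, so this sketch swaps read_bytes for a cheap first pass that reads only those length prefixes and records byte ranges covering whole documents. The names bson_chunk_ranges, load_chunk, read_bson and docs_per_chunk are hypothetical, not part of Dask or pymongo. It also assumes the file is a plain concatenation of BSON documents, as mongodump writes, and that pymongo's bson package is installed.

```python
import struct
from pathlib import Path


def bson_chunk_ranges(path, docs_per_chunk=10_000):
    """Return (start, length) byte ranges that each cover whole BSON documents.

    Only the 4-byte length prefix of every document is read, so the file is
    never loaded into memory.
    """
    ranges = []
    size = Path(path).stat().st_size
    with open(path, "rb") as f:
        pos = 0    # offset of the next document
        start = 0  # offset where the current chunk begins
        count = 0  # number of documents in the current chunk
        while pos < size:
            f.seek(pos)
            header = f.read(4)
            if len(header) < 4:
                raise ValueError(f"truncated BSON length prefix at byte {pos}")
            (doc_len,) = struct.unpack("<i", header)
            # The smallest valid BSON document (an empty one) is 5 bytes long.
            if doc_len < 5 or pos + doc_len > size:
                raise ValueError(f"invalid BSON document length at byte {pos}")
            pos += doc_len
            count += 1
            if count == docs_per_chunk:
                ranges.append((start, pos - start))
                start, count = pos, 0
        if count:
            ranges.append((start, pos - start))
    return ranges


def load_chunk(path, start, length):
    """Decode one byte range into a list of dicts (runs inside a Dask task)."""
    import bson  # pymongo's bson package; imported lazily on the worker

    with open(path, "rb") as f:
        f.seek(start)
        return bson.decode_all(f.read(length))


def read_bson(path, docs_per_chunk=10_000):
    """Build a dask.bag with one partition per byte range."""
    import dask
    import dask.bag as db

    parts = [
        dask.delayed(load_chunk)(path, start, length)
        for start, length in bson_chunk_ranges(path, docs_per_chunk)
    ]
    return db.from_delayed(parts)
```

Something like read_bson("dump/collection.bson").map(lambda doc: doc["_id"]) would then run per partition, where the path and field name are placeholders. The indexing pass is sequential, but only the decoding step holds document data in memory, one chunk at a time per task.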