I need to read MongoDB dump BSON files, and I'm thinking about how to do this by writing a custom Dask BSON reader.
The problem is to parse the MongoDB BSON files and iterate through each one, finding the boundaries between BSON documents so each block can be processed separately without loading the whole file into memory. The pymongo bson module can do this, but it only returns a file iterator, and that iterator cannot be used in something like dask.bag.read_text(file).map(func).
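For reference, this is roughly what the pymongo route looks like; a minimal sketch, assuming a hypothetical dump path dump/mydb/mycollection.bson:

```python
import bson  # ships with pymongo

# decode_file_iter yields one decoded document at a time, so the whole
# file never has to fit in memory -- but the result is a plain Python
# iterator, which Dask cannot partition across workers.
# The path below is only a placeholder for illustration.
with open("dump/mydb/mycollection.bson", "rb") as f:
    for doc in bson.decode_file_iter(f):
        print(doc["_id"])
```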
What is the boilerplate for writing a custom dask.bag BSON reader (a read_bson)? Or any other ideas?
If the block endings are consistent, then you can use the dask.bytes.read_bytes function to create a list of Dask delayed objects. You can then apply your chunk-of-bytes -> list-or-dataframe function to each of those delayed objects, and finally use from_delayed from either dask.bag or dask.dataframe: https://docs.dask.org/en/latest/delayed-collections.html
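A minimal sketch of that approach, assuming a hypothetical dump path dump/mydb/*.bson and that each file is small enough to be handled as a single block (blocksize=None gives one delayed block per file, which sidesteps splitting length-prefixed BSON documents mid-file):

```python
import bson                      # ships with pymongo
import dask.bag as db
from dask import delayed
from dask.bytes import read_bytes

@delayed
def chunk_to_records(chunk):
    # chunk is a bytes object containing whole BSON documents;
    # bson.decode_all parses every document in the buffer into dicts
    return bson.decode_all(chunk)

# read_bytes returns a sample plus, per file, a list of delayed byte blocks.
# blocksize=None keeps each file as one block; splitting inside a file would
# require locating document boundaries yourself, since BSON is length-prefixed
# rather than newline-delimited.
sample, blocks = read_bytes("dump/mydb/*.bson", blocksize=None)

partitions = [chunk_to_records(block)
              for file_blocks in blocks
              for block in file_blocks]
bag = db.from_delayed(partitions)

print(bag.take(5))  # first few documents as plain dicts
```

If the files are too large for one block each, the chunk function would instead have to scan for document boundaries itself (each BSON document starts with a 4-byte little-endian length), which is the "consistent block endings" caveat above.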