I need to read MongoDB dump BSON files, and I'm thinking about how to do this by writing a custom Dask BSON reader.
The problem is to parse the MongoDB BSON files and iterate through each one, finding the boundaries between BSON documents so each block can be processed separately without loading the whole file into memory. The pymongo bson module can do this, but it only returns a file iterator, and that iterator cannot be used in something like dask.bag.read_text(file).map(func).
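For reference, this is roughly what the pymongo route looks like; a minimal sketch, assuming a hypothetical dump path dump/mydb/mycollection.bson:

```python
import bson  # ships with pymongo

# decode_file_iter yields one decoded document at a time, so the whole
# file never has to fit in memory -- but the result is a plain Python
# iterator, which Dask cannot partition across workers.
# The path below is only a placeholder for illustration.
with open("dump/mydb/mycollection.bson", "rb") as f:
    for doc in bson.decode_file_iter(f):
        print(doc["_id"])
```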
What is the boilerplate for writing a custom dask.bag BSON reader (a read_bson)? Or any other ideas?
If the block endings are consistent, then you can use the dask.bytes.read_bytes function to create a list of Dask delayed objects. You can then apply your chunk-of-bytes -> list-or-dataframe function to each of those delayed objects, and finally use from_delayed from either dask.bag or dask.dataframe: https://docs.dask.org/en/latest/delayed-collections.html
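A minimal sketch of that approach, assuming a hypothetical dump path dump/mydb/*.bson and that each file is small enough to be handled as a single block (blocksize=None gives one delayed block per file, which sidesteps splitting length-prefixed BSON documents mid-file):

```python
import bson                      # ships with pymongo
import dask.bag as db
from dask import delayed
from dask.bytes import read_bytes

@delayed
def chunk_to_records(chunk):
    # chunk is a bytes object containing whole BSON documents;
    # bson.decode_all parses every document in the buffer into dicts
    return bson.decode_all(chunk)

# read_bytes returns a sample plus, per file, a list of delayed byte blocks.
# blocksize=None keeps each file as one block; splitting inside a file would
# require locating document boundaries yourself, since BSON is length-prefixed
# rather than newline-delimited.
sample, blocks = read_bytes("dump/mydb/*.bson", blocksize=None)

partitions = [chunk_to_records(block)
              for file_blocks in blocks
              for block in file_blocks]
bag = db.from_delayed(partitions)

print(bag.take(5))  # first few documents as plain dicts
```

If the files are too large for one block each, the chunk function would instead have to scan for document boundaries itself (each BSON document starts with a 4-byte little-endian length), which is the "consistent block endings" caveat above.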