I am working on a Machine Learning project with a large dataset (10+ GB) stored in a JSON file. I found out that one of the best practices is to use Dask. However, I encounter an error when reading the file with chunksize.
PS: I want to use chunksize because reading the whole file takes a lot of time.
import pandas as pd
import dask.dataframe as dd
df=dd.read_json('data/train.jsonl', chunksize=1000)
This outputs the following error:
ValueError: An error occurred while calling the read_json method registered to the pandas backend.
Original Message: I/O operation on closed file.
I also tried to use this:
with pd.read_json('data/train.jsonl', lines=True, chunksize=100000) as reader:
but I'm having a hard time making it work so that I can do some preprocessing and ML on it.
Finally, do you have any tips or best practices for working in such scenarios?
Thank you!
The argument for specifying partition sizes in dask.dataframe.read_json is blocksize, not chunksize. Per the API documentation, additional keyword arguments are passed through to pandas, so you were essentially using dask to create a single-partition dask.dataframe whose contents were a pandas chunked-reader object, which doesn't work.
So the following should do what you’re looking for:
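import dask.dataframe as dd

# blocksize is measured in bytes, so each partition will be roughly this size
df = dd.read_json('data/train.jsonl', blocksize=1000)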
You will, however, want to change your blocksize so it's a reasonable number of bytes for a partition; something like int(1e8) (about 100 MB) should work.
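For instance, a minimal sketch of reading the file lazily and running some preprocessing per partition might look like this (clean_partition below is just a hypothetical placeholder for your own preprocessing logic):

import dask.dataframe as dd

# read the line-delimited JSON into roughly 100 MB partitions
df = dd.read_json('data/train.jsonl', lines=True, blocksize=int(1e8))

# hypothetical preprocessing applied to each partition (a pandas DataFrame)
def clean_partition(pdf):
    return pdf.dropna()

cleaned = df.map_partitions(clean_partition)

# nothing is actually read until you trigger a computation
print(cleaned.head())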