How to read a large json dataset using Dask?


I am working on a machine learning project with a large dataset (10+ GB) stored in a JSON file. I found out that one of the best practices is to use Dask. However, I encounter an error while reading the file using chunksize.

PS: I want to use chunksize because reading the whole file takes a lot of time.

import pandas as pd
import dask.dataframe as dd

df = dd.read_json('data/train.jsonl', chunksize=1000)

This outputs the following error:

ValueError: An error occurred while calling the read_json method registered to the pandas backend.
Original Message: I/O operation on closed file.

I also tried to use this:

with pd.read_json('data/train.jsonl', lines=True, chunksize=100000) as reader: 

but I'm having a hard time making it work so that I can do some preprocessing and ML on it.
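
In case it helps, this is roughly the pattern I have been trying; the dropna call is just a placeholder for my actual preprocessing:

import pandas as pd

# Iterate over the line-delimited JSON in chunks of 100,000 rows;
# each chunk is an ordinary pandas DataFrame
with pd.read_json('data/train.jsonl', lines=True, chunksize=100000) as reader:
    for chunk in reader:
        processed = chunk.dropna()  # placeholder preprocessing step
        # ... feed `processed` into an incremental training step here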

Finally, do you have any tips or best practices for working in such scenarios?

Thank you!

1 Answer

The argument for specifying partition sizes in dask.dataframe.read_json is blocksize, not chunksize. From the API documentation:

blocksize: None or int
If None, files are not blocked, and you get one partition per input file. If int, which can only be used for line-delimited JSON files, each partition will be approximately this size in bytes, to the nearest newline character.

Additional keyword arguments are passed through to pandas, so you were essentially using Dask to create a single-partition dask.dataframe whose contents were a pandas JsonReader (the chunk iterator pandas returns when chunksize is passed), which doesn't work.

So the following should do what you’re looking for:

df = dd.read_json('data/train.jsonl', blocksize=1000)

You will, however, want to increase blocksize to a reasonable number of bytes per partition; something like int(1e8) (roughly 100 MB) should work.
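
For completeness, here is a minimal sketch of how this might look end to end; the dropna step is only a placeholder for whatever preprocessing you actually need:

import dask.dataframe as dd

# ~100 MB per partition; blocksize only works for line-delimited JSON,
# so lines=True is passed explicitly
df = dd.read_json('data/train.jsonl', lines=True, blocksize=int(1e8))

print(df.npartitions)   # number of partitions created from the file

# Operations are lazy; .head()/.compute() trigger the actual work
df = df.dropna()        # placeholder preprocessing step
print(df.head())        # reads only as many partitions as needed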