Create Dataframe in Pandas - Out of memory error while reading Parquet files

I have a Windows 10 machine with 8 GB RAM and 5 cores.

I have created a Parquet file compressed with gzip. The size of the file after compression is 137 MB. When I try to read the Parquet file with Pandas, Dask, or Vaex, I get memory errors:

Pandas:

import pandas as pd

df = pd.read_parquet("C:\\files\\test.parquet")
OSError: Out of memory: realloc of size 3915749376 failed

Dask:

import dask.dataframe as dd
df = dd.read_parquet("C:\\files\\test.parquet").compute()
OSError: Out of memory: realloc of size 3915749376 failed

Vaex:

import vaex

df = vaex.open("C:\\files\\test.parquet")
OSError: Out of memory: realloc of size 3915749376 failed

Since Pandas/Python is meant to be efficient, and a 137 MB file is well below a problematic size, are there any recommended ways to create dataframes efficiently? Libraries like Vaex and Dask claim to be very efficient.

There are 4 answers below.

Answer 1 (7 votes):

For a single machine, I would recommend Vaex with the HDF5 file format. The data stays on disk and is memory-mapped, so you can work with datasets bigger than RAM. Vaex has a built-in function that reads a big CSV file in chunks and converts it to HDF5:

import vaex

# convert=True writes an HDF5 copy (my_big_file.csv.hdf5) next to the source
df = vaex.from_csv('./my_data/my_big_file.csv', convert=True, chunk_size=5_000_000)
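Once the conversion has run, later sessions can open the HDF5 file directly; vaex memory-maps it, so opening is near-instant. The .csv.hdf5 name below assumes vaex's default convert naming:

df = vaex.open('./my_data/my_big_file.csv.hdf5')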

Dask, by contrast, is optimized for distributed systems: you read the big file in chunks and then scatter them among worker machines, as in the sketch below.
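A minimal sketch of that pattern, assuming a local dask.distributed cluster as a stand-in for real worker machines; the CSV path is illustrative:

import dask.dataframe as dd
from dask.distributed import Client

client = Client()  # starts local worker processes

# blocksize splits the CSV into ~64 MB partitions read in parallel
df = dd.read_csv("C:\\files\\big_file.csv", blocksize="64MB")
df = df.persist()  # materialize partitions on the workers instead of collecting
print(df.head())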

Answer 2 (0 votes):

It is entirely possible that a 137 MB Parquet file expands to 4 GB in memory, due to Parquet's efficient compression and encoding. You may have some options at load time; please show your schema. Are you using fastparquet or pyarrow?
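For example, one way to inspect the schema and row-group layout with pyarrow (using the path from the question):

import pyarrow.parquet as pq

pf = pq.ParquetFile("C:\\files\\test.parquet")
print(pf.schema_arrow)             # column names and types
print(pf.metadata.num_row_groups)  # 1 means the file cannot be split on read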

Since all of the engines you are trying are capable of loading one "row group" at a time, I suppose your file contains only a single row group, and so splitting won't work. You could load only a selection of columns to save memory, if that is enough to accomplish your task (all of the loaders support this); see the sketch below.
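A minimal sketch of both ideas with pandas and pyarrow; the column names "id" and "value" are hypothetical:

import pandas as pd
import pyarrow.parquet as pq

# Load only the columns you actually need
df = pd.read_parquet("C:\\files\\test.parquet", columns=["id", "value"])

# Or stream the file in fixed-size batches to cap peak memory
pf = pq.ParquetFile("C:\\files\\test.parquet")
for batch in pf.iter_batches(batch_size=100_000, columns=["id", "value"]):
    chunk = batch.to_pandas()  # process each chunk, then discard it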

Answer 3 (0 votes):

pip install pyarrow==0.15.0 worked for me.

Answer 4 (0 votes):

Check that you are using the latest version of pyarrow; updating has helped me a few times.

pip install -U pyarrow