Create Dataframe in Pandas - Out of memory error while reading Parquet files

I have a Windows 10 machine with 8 GB RAM and 5 cores.

I have created a Parquet file compressed with gzip. The size of the file after compression is 137 MB. When I try to read the Parquet file with Pandas, Dask, or Vaex, I get memory errors:

Pandas:

import pandas as pd

df = pd.read_parquet("C:\\files\\test.parquet")
OSError: Out of memory: realloc of size 3915749376 failed

Dask:

import dask.dataframe as dd
df = dd.read_parquet("C:\\files\\test.parquet").compute()
OSError: Out of memory: realloc of size 3915749376 failed

Vaex:

import vaex

df = vaex.open("C:\\files\\test.parquet")
OSError: Out of memory: realloc of size 3915749376 failed

Since Pandas/Python is meant to be efficient, and a 137 MB file is well below a problematic size, are there any recommended ways to create dataframes efficiently? Libraries like Vaex and Dask claim to be very efficient.

There are 4 answers below.

Answer 1 (7 votes):

For a single machine, I would recommend Vaex with the HDF5 file format. The data stays on disk and is memory-mapped, so you can work with datasets bigger than RAM. Vaex has a built-in function that reads a big CSV file in chunks and converts it to HDF5:

import vaex

# convert=True writes an HDF5 copy (my_big_file.csv.hdf5) next to the source
df = vaex.from_csv('./my_data/my_big_file.csv', convert=True, chunk_size=5_000_000)
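Once the conversion has run, later sessions can open the HDF5 file directly; vaex memory-maps it, so opening is near-instant. The .csv.hdf5 name below assumes vaex's default convert naming:

df = vaex.open('./my_data/my_big_file.csv.hdf5')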

Dask, by contrast, is optimized for distributed systems: you read the big file in chunks and then scatter them among worker machines, as in the sketch below.
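A minimal sketch of that pattern, assuming a local dask.distributed cluster as a stand-in for real worker machines; the CSV path is illustrative:

import dask.dataframe as dd
from dask.distributed import Client

client = Client()  # starts local worker processes

# blocksize splits the CSV into ~64 MB partitions read in parallel
df = dd.read_csv("C:\\files\\big_file.csv", blocksize="64MB")
df = df.persist()  # materialize partitions on the workers instead of collecting
print(df.head())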

Answer 2 (0 votes):

It is entirely possible that a 137 MB Parquet file expands to 4 GB in memory, due to Parquet's efficient compression and encoding. You may have some options at load time; please show your schema. Are you using fastparquet or pyarrow?
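For example, one way to inspect the schema and row-group layout with pyarrow (using the path from the question):

import pyarrow.parquet as pq

pf = pq.ParquetFile("C:\\files\\test.parquet")
print(pf.schema_arrow)             # column names and types
print(pf.metadata.num_row_groups)  # 1 means the file cannot be split on read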

Since all of the engines you are trying are capable of loading one "row group" at a time, I suppose your file contains only a single row group, and so splitting won't work. You could load only a selection of columns to save memory, if that is enough to accomplish your task (all of the loaders support this); see the sketch below.
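A minimal sketch of both ideas with pandas and pyarrow; the column names "id" and "value" are hypothetical:

import pandas as pd
import pyarrow.parquet as pq

# Load only the columns you actually need
df = pd.read_parquet("C:\\files\\test.parquet", columns=["id", "value"])

# Or stream the file in fixed-size batches to cap peak memory
pf = pq.ParquetFile("C:\\files\\test.parquet")
for batch in pf.iter_batches(batch_size=100_000, columns=["id", "value"]):
    chunk = batch.to_pandas()  # process each chunk, then discard it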

Answer 3 (0 votes):

pip install pyarrow==0.15.0 worked for me.

Answer 4 (0 votes):

Check that you are using the latest version of pyarrow; updating has helped me a few times.

pip install -U pyarrow