I have a Windows 10 machine with 8 GB RAM and 5 cores.
I have created a parquet file compressed with gzip. The size of the file after compression is 137 MB. When I try to read the parquet file with Pandas, Dask, or Vaex, I get memory errors:
Pandas :
import pandas as pd
df = pd.read_parquet("C:\\files\\test.parquet")
OSError: Out of memory: realloc of size 3915749376 failed
Dask:
import dask.dataframe as dd
df = dd.read_parquet("C:\\files\\test.parquet").compute()
OSError: Out of memory: realloc of size 3915749376 failed
Vaex:
import vaex
df = vaex.open("C:\\files\\test.parquet")
OSError: Out of memory: realloc of size 3915749376 failed
Since Pandas/Python is meant to be efficient and a 137 MB file is well below a problematic size, are there any recommended ways to create DataFrames from it efficiently? Libraries like Vaex and Dask claim to be very efficient.
For a single machine, I would recommend Vaex with the HDF5 file format. The data is memory-mapped from disk, so you can work with datasets larger than RAM. Vaex has a built-in function that reads a large CSV file in chunks and converts it to HDF5:
df = vaex.from_csv('./my_data/my_big_file.csv', convert=True, chunk_size=5_000_000)
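The DataFrame returned by from_csv is backed by the memory-mapped HDF5 file, so aggregations run out of core instead of loading everything into RAM. A minimal sketch (the column name value is hypothetical):
import vaex
df = vaex.from_csv('./my_data/my_big_file.csv', convert=True, chunk_size=5_000_000)
# Aggregations stream over the memory-mapped data rather than materializing it
mean_value = df.mean(df.value)  # "value" is a hypothetical column name
print(mean_value)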
Dask is optimized for distributed systems: the big file is read in chunks (partitions) and the work is scattered among worker machines. A sketch of this is shown below.
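As a rough sketch (the worker count, memory limit, and column name value are illustrative assumptions, not requirements), keep the computation lazy and only pull the reduced result back, rather than calling .compute() on the whole DataFrame:
from dask.distributed import Client
import dask.dataframe as dd

# Start a local cluster (or connect to a remote one); workers process partitions in parallel
client = Client(n_workers=4, memory_limit="1.5GB")  # limits are illustrative

# Builds a lazy task graph; nothing is loaded into memory yet
ddf = dd.read_parquet("C:\\files\\test.parquet")

# Only the aggregated result is materialized on the client, not the whole table
mean_value = ddf["value"].mean().compute()  # "value" is a hypothetical column name
print(mean_value)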