Process killed when reading big feather file with pandas due to OOM


I'm trying to read a big feather file and my process gets killed:

gerardo@hal9000:~/Projects/Morriello$ du -h EtOH.feather 
5.3G    EtOH.feather

I'm using pandas and pyarrow; here are the versions:

gerardo@hal9000:~/Projects/Morriello$ pip freeze | grep "pandas\|pyarrow"
pandas==2.2.1
pyarrow==15.0.0

When I try to load the dataset into a dataframe I just get the process killed:

In [1]: import pandas as pd

In [2]: df = pd.read_feather("EtOH.feather", dtype_backend='pyarrow')
Killed
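
One thing I haven't tried yet is restricting the read to a subset of columns, which I understand `read_feather` supports; something like this (the column names below are just placeholders):

```python
import pandas as pd

# Untested idea: read only a few columns to keep memory usage down.
# "col_a" and "col_b" are placeholders for real column names.
df = pd.read_feather(
    "EtOH.feather",
    columns=["col_a", "col_b"],
    dtype_backend="pyarrow",
)
```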

I'm on Linux, using Python 3.12, on a machine with 16 GB of RAM.

In the kernel log I can see that the process gets killed by the OOM killer:

Out of memory: Killed process 918058 (ipython) total-vm:24890996kB, anon-rss:8455548kB, file-rss:640kB, shmem-rss:0kB, UID:1000 pgtables:17228kB oom_score_adj:100

I've also tried reading it in batches, as suggested here and by @David, but the process still gets killed:

In [4]: import pyarrow

In [5]: reader = pyarrow.ipc.open_file('./EtOH.feather')

In [6]: first_batch = reader.get_batch(0)
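
I haven't tried memory-mapping the file yet; my understanding is that something like the sketch below only avoids the big allocation if the record batches were written uncompressed (Feather V2 defaults to LZ4 compression), so I'm not sure it would help here:

```python
import pyarrow as pa

# Untested variant: memory-map the file so Arrow can read from disk lazily
# instead of loading everything into RAM up front.
with pa.memory_map("./EtOH.feather", "r") as source:
    reader = pa.ipc.open_file(source)
    print(reader.num_record_batches)   # how many record batches the file holds
    first_batch = reader.get_batch(0)  # still allocates memory for this batch
    df = first_batch.to_pandas()
```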

How do I read the file in this case? And if I manage to read it, would there be noticeable advantages in converting it to the Parquet format?
