I'm going from a dataframe to a Parquet file using either pyarrow directly or the pandas DataFrame method 'to_parquet', and both let you specify what kind of compression you want. The issue is that when I generate the Parquet files with these libraries, the file is twice the size of the output of an AWS DataBrew job with all of the same settings and referencing the same input data.
For pyarrow:
import pyarrow as pa
import pyarrow.parquet as pq

df = df.convert_dtypes()
stream = pa.BufferOutputStream()
table = pa.Table.from_pandas(df)
pq.write_table(table, stream, compression='SNAPPY')
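From what I've read, the codec is only one of several writer settings that affect the final size; row group size, dictionary encoding, and the data page version all matter too, and I have no idea what defaults DataBrew uses. This is just a sketch of the kinds of knobs I mean, with placeholder values, not anything DataBrew documents:

# Placeholder values; just illustrating which pyarrow writer options change the size
pq.write_table(
    table,
    stream,
    compression='SNAPPY',
    row_group_size=128 * 1024,   # rows per row group
    use_dictionary=True,         # dictionary-encode string-like columns
    data_page_version='1.0',     # '1.0' or '2.0'
)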
For pandas dataframe:
import io

df = df.convert_dtypes()
stream = io.BytesIO()
df.to_parquet(stream, compression='snappy', engine='pyarrow', index=False)
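As far as I can tell, pandas forwards extra keyword arguments straight through to the pyarrow writer, so the same knobs should be reachable from 'to_parquet' as well (again, the values here are placeholders):

df.to_parquet(
    stream,
    engine='pyarrow',
    compression='snappy',
    index=False,
    row_group_size=128 * 1024,  # forwarded to pyarrow.parquet.write_table
    use_dictionary=True,
)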
I've checked that the dataframe contains exactly the data that goes through AWS and that all of the datatypes match, but I'm not getting anywhere close to the same file size.
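To figure out where the extra bytes are coming from, my plan is to compare the Parquet metadata of my file against the DataBrew output, since that should show differences in row groups, encodings, and per-column compressed sizes (the file paths below are just placeholders):

meta_mine = pq.ParquetFile('my_output.parquet').metadata        # placeholder path
meta_brew = pq.ParquetFile('databrew_output.parquet').metadata  # placeholder path
print(meta_mine.num_row_groups, meta_brew.num_row_groups)
col = meta_mine.row_group(0).column(0)
print(col.compression, col.encodings, col.total_compressed_size, col.total_uncompressed_size)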
I've also tried:
pa.compress(df, codec='snappy', memory_pool=None)
with compression in the 'to_parquet' calls set to None, but this gives me garbage data that AWS can't read, and somehow the file is even smaller than the size I'm expecting.
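If I understand the docs correctly, pa.compress just runs Snappy over a raw buffer and returns a bare compressed blob with no Parquet structure at all, which would explain why AWS can't read it (and passing a dataframe into it probably isn't doing what I want anyway). A minimal sketch of what it actually operates on:

buf = pa.py_buffer(b'raw bytes, not a parquet file')
compressed = pa.compress(buf, codec='snappy')
restored = pa.decompress(compressed, decompressed_size=buf.size, codec='snappy')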
Am I missing something? Do the 'to_parquet' functions actually compress the data? What kind of voodoo is AWS DataBrew doing to get its magical file size? I can't find good answers on Google or in the documentation, and it feels like I'm going in circles, so any help is much appreciated. From what I've seen, the AWS libraries use pyarrow for this kind of thing, which makes it even more confusing that I can't match the file sizes.