I'm going from a dataframe to a Parquet file using either pyarrow directly or the pandas DataFrame method 'to_parquet', and both let you specify what kind of compression you want. The issue is that when I generate the Parquet files with these libraries, the file is twice the size of the output of an AWS DataBrew job with all of the same settings and referencing the same input data.
For pyarrow:
import pyarrow as pa
import pyarrow.parquet as pq

df = df.convert_dtypes()
stream = pa.BufferOutputStream()
table = pa.Table.from_pandas(df)
pq.write_table(table, stream, compression='SNAPPY')
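From what I've read, the codec is only one of several writer settings that affect the final size; row group size, dictionary encoding, and the data page version all matter too, and I have no idea what defaults DataBrew uses. This is just a sketch of the kinds of knobs I mean, with placeholder values, not anything DataBrew documents:

# Placeholder values; just illustrating which pyarrow writer options change the size
pq.write_table(
    table,
    stream,
    compression='SNAPPY',
    row_group_size=128 * 1024,   # rows per row group
    use_dictionary=True,         # dictionary-encode string-like columns
    data_page_version='1.0',     # '1.0' or '2.0'
)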
For pandas dataframe:
import io

df = df.convert_dtypes()
stream = io.BytesIO()
df.to_parquet(stream, compression='snappy', engine='pyarrow', index=False)
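As far as I can tell, pandas forwards extra keyword arguments straight through to the pyarrow writer, so the same knobs should be reachable from 'to_parquet' as well (again, the values here are placeholders):

df.to_parquet(
    stream,
    engine='pyarrow',
    compression='snappy',
    index=False,
    row_group_size=128 * 1024,  # forwarded to pyarrow.parquet.write_table
    use_dictionary=True,
)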
I've checked that the dataframe contains exactly the data that goes through AWS and that all of the datatypes match, but I'm not getting anywhere close to the same file size.
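To figure out where the extra bytes are coming from, my plan is to compare the Parquet metadata of my file against the DataBrew output, since that should show differences in row groups, encodings, and per-column compressed sizes (the file paths below are just placeholders):

meta_mine = pq.ParquetFile('my_output.parquet').metadata        # placeholder path
meta_brew = pq.ParquetFile('databrew_output.parquet').metadata  # placeholder path
print(meta_mine.num_row_groups, meta_brew.num_row_groups)
col = meta_mine.row_group(0).column(0)
print(col.compression, col.encodings, col.total_compressed_size, col.total_uncompressed_size)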
I've also tried:
pa.compress(df, codec='snappy', memory_pool=None)
with compression in the 'to_parquet' calls set to None, but this gives me garbage data that AWS can't read, and somehow the file is even smaller than the size I'm expecting.
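If I understand the docs correctly, pa.compress just runs Snappy over a raw buffer and returns a bare compressed blob with no Parquet structure at all, which would explain why AWS can't read it (and passing a dataframe into it probably isn't doing what I want anyway). A minimal sketch of what it actually operates on:

buf = pa.py_buffer(b'raw bytes, not a parquet file')
compressed = pa.compress(buf, codec='snappy')
restored = pa.decompress(compressed, decompressed_size=buf.size, codec='snappy')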
Am I missing something? Do the 'to_parquet' functions actually compress the data? What kind of voodoo is AWS DataBrew doing to get its magical file size? I can't find good answers on Google or in the documentation, and it feels like I'm going in circles, so any help is much appreciated. From what I've seen, the AWS libraries use pyarrow for this kind of thing, which makes it even more confusing that I can't match the file sizes.