work with chunked data when groupby operations are needed


I have a dataset df with three columns: 'String_key_val', 'Float_other_val1', 'Int_other_val2'. I want to group by 'String_key_val' and then compute, for each group, the sum of 'Float_other_val1' and of 'Int_other_val2'. Here is my code:

import pandas

df = pandas.read_csv('test.csv')
grouped = df.groupby('String_key_val')

# Sum each value column within each key group.
series_calculus1 = grouped['Float_other_val1'].sum()
series_calculus2 = grouped['Int_other_val2'].sum()

# Combine the two per-group sums side by side and write them out.
res = pandas.concat([series_calculus1, series_calculus2], axis=1)
res.to_csv('output_test.csv')

My problem is that the input file is 10GB and I only have 4GB of RAM, so I need to chunk the computation, but I can't see how. I thought of using HDFStore, but since I only need to build a numerical dataset, I see no point in storing the complete DataFrame, and I don't think HDFStore can store simple arrays. What can I do?
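(For reference, HDFStore does persist whole DataFrames and can be appended to chunk by chunk; a minimal sketch, where 'store.h5' and the key 'raw' are placeholder names:)

import pandas

# Stream the CSV and append each chunk to an on-disk HDF5 table.
# 'store.h5' and the key 'raw' are placeholder names; min_itemsize reserves
# room for the string column so later chunks with longer keys still fit.
with pandas.HDFStore('store.h5') as store:
    for chunk in pandas.read_csv('test.csv', chunksize=50000):
        store.append('raw', chunk, data_columns=['String_key_val'],
                     min_itemsize={'String_key_val': 64})

That only moves the raw rows to disk, though; the aggregation itself would still have to be done in chunks.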


There is 1 solution below

I believe a simple approach would be something along these lines....

import pandas as pd

summary = None
chunker = pd.read_csv('test.csv', iterator=True, chunksize=50000)

for chunk in chunker:
    # Sum val1 and val2 within the chunk: one row per key seen in this chunk.
    out = chunk.groupby('String_key_val')[['Float_other_val1', 'Int_other_val2']].sum()
    if summary is None:
        summary = out
    else:
        # Fold the chunk's partial sums into the running totals and re-aggregate,
        # so summary never holds more than one row per unique key.
        summary = pd.concat([summary, out]).groupby(level='String_key_val').sum()
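Because addition is associative, summing within each chunk and then re-summing the partial results gives the same totals as a single groupby over the whole file, while memory use stays proportional to the number of distinct keys rather than to the file size. The running summary can then be written out just as in the question:

summary.to_csv('output_test.csv')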