I have a dataset df with three columns: 'String_key_val', 'Float_other_val1', 'Int_other_val2'. I want to group by key_val, then extract the sum of val1 (resp. val2) for each group. Here is my code:
import pandas

df = pandas.read_csv('test.csv')
grouped = df.groupby('String_key_val')
series_calculus1 = grouped['Float_other_val1'].sum()
series_calculus2 = grouped['Int_other_val2'].sum()
res = pandas.concat([series_calculus1, series_calculus2], axis=1)
res.to_csv('output_test.csv')
My problem is: my input dataset is 10 GB and I have 4 GB of RAM, so I need to process it in chunks, but I can't see how. I thought of using HDFStore, but since I only need to build a numerical result, I see no point in storing a complete DataFrame, and I don't think HDFStore can store simple arrays.
What can I do?
I believe a simple approach would be something along these lines....
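One possible sketch of a chunked approach, not necessarily what was intended above: read the CSV in pieces with the chunksize argument of pandas.read_csv, compute the per-group sums for each piece, and fold them into a running total. The function name chunked_group_sums and the default chunk size are assumptions made for illustration; it also assumes the number of distinct keys is small enough for the result to fit in memory.

```python
import pandas


def chunked_group_sums(csv_path, key, value_cols, chunksize=1_000_000):
    """Sum value_cols per key without loading the whole CSV at once.

    Hypothetical helper: the name and the default chunksize are
    illustrative choices, not taken from the original post.
    """
    total = None
    # With chunksize set, read_csv yields DataFrames of at most that many rows.
    reader = pandas.read_csv(csv_path, usecols=[key] + value_cols,
                             chunksize=chunksize)
    for chunk in reader:
        # Partial sums for this chunk only: one row per key seen in the chunk.
        partial = chunk.groupby(key)[value_cols].sum()
        if total is None:
            total = partial
        else:
            # Stack the running total on the new partial sums, then re-sum
            # rows that share a key. This keeps integer columns as integers.
            total = pandas.concat([total, partial]).groupby(level=0).sum()
    return total


# Usage with the file and column names from the question:
# res = chunked_group_sums('test.csv', 'String_key_val',
#                          ['Float_other_val1', 'Int_other_val2'])
# res.to_csv('output_test.csv')
```

This works because a sum over a group equals the sum of the partial sums over any split of the rows, so only one chunk plus the small running total is in memory at a time. The same trick does not carry over directly to statistics such as the median.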