Memory Error - itertools.product between pd.groupby() groups without using lists


I'm trying to take the itertools.product of rows from multiple very large DataFrames and perform simple math on each combination to filter out the unwanted ones.

The problem is that translating the results of list(itertools.product(*args)) into a pd.DataFrame is impossible: it always raises a MemoryError. I understand that I need to work with generators, or with pandas vectorization tools if any exist for this problem; I'm just not familiar with them yet.

Example:

I use pd.concat(dfs) to combine all the dfs into a single one; in this example each df has a different 'group' value and 3 rows.

df:

index   group   item    value
0       1       0       4
1       1       1       2
2       1       2       6
3       2       0       1
4       2       1       5
5       2       2       3
6       3       0       7
7       3       1       4
8       3       2       2

The result I want is every combination between the groups (1, 2, 3), as you would get with lists, but taken over the rows of each .groupby group, so that each combination contains exactly one row from each group's df.

product_df:

index   combination    group    item    value
0       1               1       0       4
1       1               2       0       1
2       1               3       0       7
3       2               1       0       4
4       2               2       0       1
5       2               3       1       4
6       3               1       0       4
7       3               2       0       1
8       3               3       2       2
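For the small example above, product_df could be built literally with itertools.product over the per-group row indices (a sketch only; it materializes every combination, which is exactly what blows up at scale):

```python
import itertools
import pandas as pd

# Example data: three groups of three rows each (the df shown above).
df = pd.DataFrame({
    'group': [1, 1, 1, 2, 2, 2, 3, 3, 3],
    'item':  [0, 1, 2, 0, 1, 2, 0, 1, 2],
    'value': [4, 2, 6, 1, 5, 3, 7, 4, 2],
})

# Collect each group's row indices, then take their cartesian product.
index_lists = [g.index.tolist() for _, g in df.groupby('group')]
parts = []
for combo_id, combo in enumerate(itertools.product(*index_lists), start=1):
    part = df.loc[list(combo)].copy()
    part.insert(0, 'combination', combo_id)  # tag rows with their combination
    parts.append(part)

product_df = pd.concat(parts, ignore_index=True)
# 3 * 3 * 3 = 27 combinations, each contributing 3 rows.
```
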

I would then perform vectorized math on multiple columns (in this case summing the 'value' column) and filter out the unwanted combinations.

 result_df = (
     product_df
     .groupby('combination', as_index=False, sort=False)['value']
     .sum()
     .rename(columns={'value': 'sum_value'})
 )

Here I need to keep only the wanted combinations and create a final df containing just those (e.g. combinations where sum_value > 8).

final_df:

index   combination    group    item    value
0       1               1       0       4
1       1               2       0       1
2       1               3       0       7
3       2               1       0       4
4       2               2       0       1
5       2               3       1       4
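The filtering step described above could look like this on a toy product_df (a sketch; sum_value > 8 is the example threshold from the question):

```python
import pandas as pd

# A minimal product_df with the three combinations from the table above.
product_df = pd.DataFrame({
    'combination': [1, 1, 1, 2, 2, 2, 3, 3, 3],
    'group':       [1, 2, 3, 1, 2, 3, 1, 2, 3],
    'item':        [0, 0, 0, 0, 0, 1, 0, 0, 2],
    'value':       [4, 1, 7, 4, 1, 4, 4, 1, 2],
})

# Per-combination sums, then keep only combinations with sum_value > 8.
sums = (product_df
        .groupby('combination', as_index=False)['value']
        .sum()
        .rename(columns={'value': 'sum_value'}))
keep = sums.loc[sums['sum_value'] > 8, 'combination']
final_df = (product_df[product_df['combination'].isin(keep)]
            .reset_index(drop=True))
```

Combinations 1 and 2 sum to 12 and 9 and survive; combination 3 sums to 7 and is dropped, matching final_df above.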

The problem is that the real df isn't 3 combined dfs: I have up to 15 of them, and each has many thousands of rows, if not more. I tried creating a list of indexes from each separate df (before the concat) to pass to itertools.product().

This produces an enormous amount of data that I can't materialize as a list to build a df from (not to mention then having to merge the index df back with the original df to recover the original rows, which is also a heavy task).

Is it possible to vectorize this, or use a generator, so that only the filtered results are saved as rows of the original data (like product_df), without building lists and hitting a MemoryError?
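One sketch of the generator idea: iterate itertools.product lazily, test each combination as it is produced, and yield only the rows of combinations that pass, so the full product is never held in memory at once. (filtered_product and min_sum are illustrative names, not from any library; note the product still has to be iterated in full, which for 15 groups of thousands of rows each is astronomically many combinations, so on-the-fly filtering saves memory but not time.)

```python
import itertools
import pandas as pd

def filtered_product(dfs, min_sum):
    """Lazily walk the cartesian product of the groups' rows and yield,
    row by row, only the combinations whose summed 'value' exceeds min_sum."""
    row_lists = [list(d.itertuples(index=False)) for d in dfs]
    combo_id = 0
    for combo in itertools.product(*row_lists):
        if sum(row.value for row in combo) > min_sum:
            combo_id += 1
            for row in combo:
                yield (combo_id,) + tuple(row)

# The three group dfs from the example, before any concat.
groups = [
    pd.DataFrame({'group': 1, 'item': [0, 1, 2], 'value': [4, 2, 6]}),
    pd.DataFrame({'group': 2, 'item': [0, 1, 2], 'value': [1, 5, 3]}),
    pd.DataFrame({'group': 3, 'item': [0, 1, 2], 'value': [7, 4, 2]}),
]

# Only the surviving rows are ever collected into the final DataFrame.
final_df = pd.DataFrame(filtered_product(groups, min_sum=8),
                        columns=['combination', 'group', 'item', 'value'])
```
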
