Memory Error - itertools.product between pd.groupby() groups without using lists


I'm trying to take the itertools.product of rows from multiple very large DataFrames and perform simple math on each combination to filter out the unwanted ones.

The problem is that translating the results of list(itertools.product(*args)) into a pd.DataFrame is impossible: it always raises a MemoryError. I understand that I need to work with generators, or with pandas vectorization tools if any exist for this problem; I'm just not familiar with them yet.

Example:

I use pd.concat(dfs) to combine all the dfs into a single one; in this example each df has a different 'group' value and 3 rows.

df:

index   group   item    value
0       1       0       4
1       1       1       2
2       1       2       6
3       2       0       1
4       2       1       5
5       2       2       3
6       3       0       7
7       3       1       4
8       3       2       2

The result I want is every combination between the groups (1, 2, 3), as you would get with lists, but taken over the rows of each .groupby group, so that each combination contains exactly one row from each group's df.

product_df:

index   combination    group    item    value
0       1               1       0       4
1       1               2       0       1
2       1               3       0       7
3       2               1       0       4
4       2               2       0       1
5       2               3       1       4
6       3               1       0       4
7       3               2       0       1
8       3               3       2       2
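For the small example above, product_df could be built literally with itertools.product over the per-group row indices (a sketch only; it materializes every combination, which is exactly what blows up at scale):

```python
import itertools
import pandas as pd

# Example data: three groups of three rows each (the df shown above).
df = pd.DataFrame({
    'group': [1, 1, 1, 2, 2, 2, 3, 3, 3],
    'item':  [0, 1, 2, 0, 1, 2, 0, 1, 2],
    'value': [4, 2, 6, 1, 5, 3, 7, 4, 2],
})

# Collect each group's row indices, then take their cartesian product.
index_lists = [g.index.tolist() for _, g in df.groupby('group')]
parts = []
for combo_id, combo in enumerate(itertools.product(*index_lists), start=1):
    part = df.loc[list(combo)].copy()
    part.insert(0, 'combination', combo_id)  # tag rows with their combination
    parts.append(part)

product_df = pd.concat(parts, ignore_index=True)
# 3 * 3 * 3 = 27 combinations, each contributing 3 rows.
```
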

I would then perform vectorized math on multiple columns (in this case summing the 'value' column) and filter out the unwanted combinations.

 result_df = (
     product_df
     .groupby('combination', as_index=False, sort=False)['value']
     .sum()
     .rename(columns={'value': 'sum_value'})
 )

Here I need to keep only the wanted combinations and create a final df containing just those (e.g. combinations where sum_value > 8).

final_df:

index   combination    group    item    value
0       1               1       0       4
1       1               2       0       1
2       1               3       0       7
3       2               1       0       4
4       2               2       0       1
5       2               3       1       4
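The filtering step described above could look like this on a toy product_df (a sketch; sum_value > 8 is the example threshold from the question):

```python
import pandas as pd

# A minimal product_df with the three combinations from the table above.
product_df = pd.DataFrame({
    'combination': [1, 1, 1, 2, 2, 2, 3, 3, 3],
    'group':       [1, 2, 3, 1, 2, 3, 1, 2, 3],
    'item':        [0, 0, 0, 0, 0, 1, 0, 0, 2],
    'value':       [4, 1, 7, 4, 1, 4, 4, 1, 2],
})

# Per-combination sums, then keep only combinations with sum_value > 8.
sums = (product_df
        .groupby('combination', as_index=False)['value']
        .sum()
        .rename(columns={'value': 'sum_value'}))
keep = sums.loc[sums['sum_value'] > 8, 'combination']
final_df = (product_df[product_df['combination'].isin(keep)]
            .reset_index(drop=True))
```

Combinations 1 and 2 sum to 12 and 9 and survive; combination 3 sums to 7 and is dropped, matching final_df above.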

The problem is that the real df isn't 3 combined dfs: I have up to 15 of them, and each has many thousands of rows, if not more. I tried creating a list of indexes from each separate df (before the concat) to pass to itertools.product().

This produces an enormous amount of data that I can't materialize as a list to build a df from (not to mention then having to merge the index df back with the original df to recover the original rows, which is also a heavy task).

Is it possible to vectorize this, or use a generator, so that only the filtered results are saved as rows of the original data (like product_df), without building lists and hitting a MemoryError?
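One sketch of the generator idea: iterate itertools.product lazily, test each combination as it is produced, and yield only the rows of combinations that pass, so the full product is never held in memory at once. (filtered_product and min_sum are illustrative names, not from any library; note the product still has to be iterated in full, which for 15 groups of thousands of rows each is astronomically many combinations, so on-the-fly filtering saves memory but not time.)

```python
import itertools
import pandas as pd

def filtered_product(dfs, min_sum):
    """Lazily walk the cartesian product of the groups' rows and yield,
    row by row, only the combinations whose summed 'value' exceeds min_sum."""
    row_lists = [list(d.itertuples(index=False)) for d in dfs]
    combo_id = 0
    for combo in itertools.product(*row_lists):
        if sum(row.value for row in combo) > min_sum:
            combo_id += 1
            for row in combo:
                yield (combo_id,) + tuple(row)

# The three group dfs from the example, before any concat.
groups = [
    pd.DataFrame({'group': 1, 'item': [0, 1, 2], 'value': [4, 2, 6]}),
    pd.DataFrame({'group': 2, 'item': [0, 1, 2], 'value': [1, 5, 3]}),
    pd.DataFrame({'group': 3, 'item': [0, 1, 2], 'value': [7, 4, 2]}),
]

# Only the surviving rows are ever collected into the final DataFrame.
final_df = pd.DataFrame(filtered_product(groups, min_sum=8),
                        columns=['combination', 'group', 'item', 'value'])
```
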
