I'm trying to itertools.product multiple very large DataFrames and perform simple math on each combination to filter out unwanted combinations.
The problem is that translating the results from list(itertools.product(*args)) to pd.DataFrame is impossible, it'll always create a MemoryError. I understand that I need to work with generators or with pandas vectorization tools if they exist for this issue. I'm just not familiar with these things yet.
Example:
I use pd.concat(dfs) to combine all dfs into a single one, in this example with each df has a different 'group' column and has 3 rows.
df:
index group item value
0 1 0 4
1 1 1 2
2 1 2 6
3 2 0 1
4 2 1 5
5 2 2 3
6 3 0 7
7 3 1 4
8 3 2 2
The result I want is to get all combinations between the groups (1 ,2 ,3) like you do with lists but with .groupby rows in each group, so that each combination has 1 row from each group df.
product_df:
index combination group item value
0 1 1 0 4
1 1 2 0 1
2 1 3 0 7
3 2 1 0 4
4 2 2 0 1
5 2 3 1 4
6 3 1 0 4
7 3 2 0 1
8 3 3 2 2
I would perform vectorized math on multiple columns (in this case sum the 'value' column) and filter out unwanted combinations.
result_df = product_df \
.groupby('combination', as_index=False, sort=False) \
.sum() \
['value'] \
.rename({'value': 'sum_value'}, axis=1)
Here I need to filter out only the wanted combinations and creates a final df with only the combinations I wanted (ex. sum_value > 8).
final_df:
index combination group item value
0 1 1 0 4
1 1 2 0 1
2 1 3 0 7
3 2 1 0 4
4 2 2 0 1
5 2 3 1 4
The problem is that the original df isn't 3 dfs combined, I have up to 15 of them, and the number of rows in each is in the many thousands if not more. I tried creating a list of indexes from each separate df (before concat) to put in itertools.product().
This produces an enormous amount of data that I can't translate into a list to create a df from (not to mention then having to merge the index df with the original df to get the original rows which is also a heavy task).
Is it possible to vectorize a dataframe or use a generator to save only the filtered results as the original individual rows in a row (like product_df) without using lists and causing a MemoryError?