I'm new to graphs, but I'm trying to find my way through.
Basically, the idea is very simple: we have "transactions" with multiple "features", and we need to assign the same Id to transactions that have 2 or more common features (same values). There are about 5,500,000 transactions.
For example:
| Transaction | A | B | C | D |
|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 2 |
| 1 | 2 | 1 | 1 | 7 |
| 2 | 3 | 1 | 2 | 9 |
| 3 | 4 | 1 | 3 | 8 |
| 4 | 5 | 2 | 3 | 4 |
Here only transactions 0 and 1 have 2 common features (B and C), so they should get the same Id:
| Transaction | Id |
|---|---|
| 0 | 1 |
| 1 | 1 |
| 2 | 2 |
| 3 | 3 |
| 4 | 4 |
My first approach was to create a graph with all transactions as nodes, then find the pairs of transactions in the dataframe that match in 2 or more features, and create edges between those nodes. The problem is that such a huge dataframe can't be processed in a reasonable amount of time, even with multiprocessing.
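To make the first approach concrete, here is a minimal pure-Python sketch on the example table. The real attempt used a dataframe; the names `transactions` and `matching_pairs` are made up for illustration, and features are assumed to be compared column by column. It checks every pair of transactions, which is why it doesn't scale:

```python
from itertools import combinations

# Example table from above: transaction -> (A, B, C, D).
transactions = {
    0: (1, 1, 1, 2),
    1: (2, 1, 1, 7),
    2: (3, 1, 2, 9),
    3: (4, 1, 3, 8),
    4: (5, 2, 3, 4),
}

def matching_pairs(rows, min_common=2):
    """Return pairs of transactions that agree in at least min_common columns."""
    edges = []
    # Every pair is compared once: n * (n - 1) / 2 comparisons in total.
    for (t1, f1), (t2, f2) in combinations(rows.items(), 2):
        common = sum(a == b for a, b in zip(f1, f2))
        if common >= min_common:
            edges.append((t1, t2))
    return edges

edges = matching_pairs(transactions)
print(edges)  # [(0, 1)]
```

With about 5,500,000 rows this is roughly 1.5 × 10^13 pair comparisons, so splitting the work across a handful of processes doesn't change the picture much.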
So my second approach was to create a bipartite graph where the source nodes are transactions and the target nodes are features.
I was then able to extract connected components, but the resulting groups were far too large, because transactions sharing even a single feature were grouped under the same Id.
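The over-grouping is easy to reproduce on the example table. The sketch below is a stand-in written with the standard library only, not the actual code used; it assumes feature nodes are keyed as `(column, value)` and transaction nodes as `("T", id)`, and `connected_components` is a hand-written BFS helper:

```python
from collections import defaultdict, deque

columns = ("A", "B", "C", "D")
transactions = {
    0: (1, 1, 1, 2),
    1: (2, 1, 1, 7),
    2: (3, 1, 2, 9),
    3: (4, 1, 3, 8),
    4: (5, 2, 3, 4),
}

# Bipartite graph: transaction nodes ("T", id) on one side,
# feature nodes (column, value) on the other.
graph = defaultdict(set)
for t, values in transactions.items():
    for col, val in zip(columns, values):
        graph[("T", t)].add((col, val))
        graph[(col, val)].add(("T", t))

def connected_components(adj):
    """Yield each connected component as a set of nodes (plain BFS)."""
    seen = set()
    for start in list(adj):
        if start in seen:
            continue
        seen.add(start)
        component, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for nxt in adj[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    component.add(nxt)
                    queue.append(nxt)
        yield component

# Keep only the transaction ids from each component.
groups = [
    sorted(node[1] for node in comp if node[0] == "T")
    for comp in connected_components(graph)
]
print(groups)  # [[0, 1, 2, 3, 4]]
```

Transactions 0 to 3 are all linked through B = 1, and transaction 4 joins them through C = 3, so everything collapses into a single group even though only 0 and 1 share two features.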
Now I'm struggling with how to get connected source nodes that have 2 or more target features in common.
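One possible direction, sketched here under assumptions and not tested at full scale: if two transactions share 2 or more features, they agree on at least one pair of columns, so transactions could be bucketed by every (column pair, value pair) key and merged with a union-find structure. This merges transitively (if 0 matches 1 and 1 matches 2, all three get one Id), which is an assumption about the desired behaviour. The helper names `find` and `union` and the plain-dict layout are illustrative only:

```python
from itertools import combinations

columns = ("A", "B", "C", "D")
transactions = {
    0: (1, 1, 1, 2),
    1: (2, 1, 1, 7),
    2: (3, 1, 2, 9),
    3: (4, 1, 3, 8),
    4: (5, 2, 3, 4),
}

# Union-find over transaction ids; each transaction starts as its own root.
parent = {t: t for t in transactions}

def find(x):
    # Follow parent links to the root, shortening the path on the way.
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def union(a, b):
    ra, rb = find(a), find(b)
    if ra != rb:
        parent[max(ra, rb)] = min(ra, rb)

# Remember the first transaction seen for each (column pair, value pair) key;
# any later transaction with the same key shares 2 features with it.
first_seen = {}
for t, values in transactions.items():
    row = dict(zip(columns, values))
    for c1, c2 in combinations(columns, 2):
        key = (c1, c2, row[c1], row[c2])
        if key in first_seen:
            union(first_seen[key], t)
        else:
            first_seen[key] = t

# Number the groups 1, 2, 3, ... in order of first appearance.
group_ids = {}
ids = {}
for t in transactions:
    ids[t] = group_ids.setdefault(find(t), len(group_ids) + 1)
print(ids)  # {0: 1, 1: 1, 2: 2, 3: 3, 4: 4}
```

Each row produces one key per pair of columns (6 keys for 4 columns), so the work grows linearly with the number of rows rather than quadratically, though the `first_seen` dictionary could get large at 5.5 million rows.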
Appreciate any help.