I have a large dataframe that looks something like that:
| id | x |
|---|---|
| 1 | 123 |
| 2 | 123 |
| 2 | 125 |
| 3 | 125 |
| 4 | 200 |
tibble(id = c(1, 2, 2, 3, 4),
x = c(123, 123, 125, 125, 200))
I want to link my records and expand them. I.e. where they have the same x, they should have the same id. Note, record id 2 & 3 are considered same entity because of their shared link with 1. In other words, without record id 1, records 2 and 3 will not link. I'd also like to give it the smallest id value where the records link.
Results should look something like:
| id | x |
|---|---|
| 1 | 123 |
| 1 | 123 |
| 1 | 125 |
| 1 | 125 |
| 4 | 200 |
An equivalent way to restate this problem is "if the data frame represented the 'to' and 'from' nodes of a directed graph, what would be the first ID of each unique component of the graph?"
We can show this graphically:
If we plot the graph, we will see that there are two unconnected clusters of nodes (called components); one for the grouped IDs 1:3, and one for ID 4:
We can use the function
componentsto work out which group each ID belongs to, and get thefirstID in each group:Created on 2023-09-07 with reprex v2.0.2