I have a data frame like the one below.
dat <- data.frame(v1=c("a","b","c","c","a","w","f"),
v2=c("z","a","a","w","p","e","h"))
v1 v2
1 a z
2 b a
3 c a
4 c w
5 a p
6 w e
7 f h
I want to add a group column based on whether these letters appear in the same row.
v1 v2 gp
1 a z 1
2 b a 1
3 c a 1
4 c w 1
5 a p 1
6 w e 1
7 f h 2
My idea is to first assign the first row to group 1; any other row in which v1 or v2 is "a" or "z" is then also assigned to group 1.
There are also cases like rows 3 and 4: "c" is assigned to group 1 because, in row 3, v2 is "a", and "w" is assigned to group 1 because, in row 4, v1 is "c", which was already assigned to group 1. But my list is very long, so I cannot keep checking all the "descendants" by hand.
Is there a way to group these letters and return a list of letters with their group numbers? Something like the table below would do.
letter gp
a 1
b 1
c 1
e 1
f 2
h 2
p 1
w 1
z 1
One way to solve this is to consider the letters as vertices of a graph and being in the same row as a link between the vertices. Then what you are asking for is the connected components of the graph. All of that is easy using the igraph package in R. The component membership can also be collected into a data.frame of letters and group numbers.
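The code that originally accompanied this answer is not shown here, so the following is only a minimal sketch of the approach, assuming the `dat` data frame from the question and igraph's `graph_from_data_frame()` and `components()` functions. The names `g`, `memb`, and `groups` are chosen for illustration.

```r
library(igraph)

dat <- data.frame(v1 = c("a", "b", "c", "c", "a", "w", "f"),
                  v2 = c("z", "a", "a", "w", "p", "e", "h"),
                  stringsAsFactors = FALSE)

# Treat each row as an undirected edge between its two letters
g <- graph_from_data_frame(dat, directed = FALSE)

# Named vector: names are the letters, values are component ids
memb <- components(g)$membership

# Lookup table of letter -> group, sorted by letter
groups <- data.frame(letter = names(memb), gp = unname(memb))
groups <- groups[order(groups$letter), ]

# Both letters in a row share a component, so v1 is enough to label the row
dat$gp <- unname(memb[dat$v1])
```

The component ids themselves are arbitrary labels; what matters is which letters share an id. With this data there should be two components, one containing "f" and "h" and one containing every other letter.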
To see why this works, it may help to look at the graph that was created and think about how it represents your problem.
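One way to look at the graph, sketched here under the same assumptions as before (the question's `dat` and igraph's standard `plot()` method), is to draw it with each vertex colored by its component:

```r
library(igraph)

dat <- data.frame(v1 = c("a", "b", "c", "c", "a", "w", "f"),
                  v2 = c("z", "a", "a", "w", "p", "e", "h"),
                  stringsAsFactors = FALSE)

# One vertex per letter, one edge per row
g <- graph_from_data_frame(dat, directed = FALSE)

# Color vertices by component id so each group stands out
plot(g, vertex.color = components(g)$membership)
```

Letters joined by any chain of edges end up in the same cluster of the drawing, which is exactly the "descendants" relationship described in the question.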