Function to find sub ID's of an ID in a data frame

319 Views Asked by At

I have a data frame that contains two columns, an ID column and a column with sub ID's that are related to the corresponding ID. The sub ID's can again have sub ID's (in this case the previous sub ID is now an ID).

library(tibble)

df <- tibble(id = c(1, 1, 2, 2, 3, 7), sub_id = c(2, 3, 4, 5, 6, 8))

df

# A tibble: 6 x 2
     id sub_id
  <dbl>  <dbl>
1     1      2
2     1      3
3     2      4
4     2      5
5     3      6
6     7      8

I would like to write a function that finds all sub ID's that are related to an ID. It should return a vector with all sub ID's.

find_all_sub_ids <- function (data, id) {
data %>% ...
}

find_all_sub_ids(df, id = 1)

[1] 2 3 4 5 6

find_all_sub_ids(df, id = 2)

[1] 4 5

find_all_sub_ids(df, id = 9)

[1] NULL

This is very different from everything I have done in R so far and it was hard for me to phrase a good title for this question. So it is possible that with the right phrasing I could have already found an answer by just googling.

My first intuition for solving this was while loops. Since I also do not know how many sublevels there could be the function should continue until all are found. I never used while loops though and don't really know how I could implement them here.

Maybe someone knows a good solution for this problem. Thanks!

Edit: Forgot to assign the tibble to df and to use this argument in the function call.

3

There are 3 best solutions below

5
On BEST ANSWER

With igraph:

library(igraph)
g <- graph_from_data_frame(d, directed = TRUE)

find_all_subs <- function(g,id){
  #find child nodes, first one being origin
  r <- igraph::subcomponent(g,match(id, V(g)$name),"out")$name
  #remove origin
  as.numeric(r[-1])
}
find_all_subs(g,1)
[1] 2 3 4 5 6

find_all_subs(g,2)
[1] 5 6
3
On

I think it's easiest to formulate this as a graph problem.
Your data.frame describes a directed graph (vertices going from id to sub_id), and you are interested in which nodes are reachable from a certain vertex.

Using tidygraph, this can be achieved as such:

library(tidyverse)
library(tidygraph)

df <- tibble(id = c(1, 1, 2, 2, 3, 7), sub_id = c(2, 3, 4, 5, 6, 8))

find_all_sub_ids <- function (id) {
  if (!(id %in% df$id)) {
    return(NULL)
  }

  
  grph <- df %>% 
    as_tbl_graph(directed = TRUE)
  
  id <- which(grph %>% pull(name) == as.character(id))
  
  grph %>% 
    activate(nodes) %>% 
    mutate(reachable = !is.na(bfs_dist(id))) %>% 
    as_tibble() %>% 
    filter(reachable) %>% 
    pull(name) %>% 
    as.numeric()
}

We see which nodes are reachable (they have a non-NA distance to your given node), we use bfs_dist (see here for explanation).
This gives

> find_all_sub_ids(1)
[1] 1 2 3 4 5 6

> find_all_sub_ids(2)
[1] 2 4 5

> find_all_sub_ids(9)
NULL

The advantage of such an approach is that it can search many levels deep without you needing to write a loop explicitly.

Edit There was a bug in my code, tidygraph::bfs_dist uses a differend id than I expected. Fixed it now.
On the new example:

> find_all_sub_ids(10)
[1]  10 200 300
1
On

I did it using a dataframe. The following works.

x= c(1,1,2,2,3,7)
y = c(2, 3, 4, 5, 6, 8)
df <- data.frame(cbind(x,y))
colnames(df) =c('id', 'sub_id')


find_all_sub_ids <- function (df, id_requested) {
  si <- df[df$id==id_requested,]$sub_id
  return(si)
}
find_all_sub_ids(df,id=2)
[1] 4 5