Collapse List into Unique Combinations Using Index Values

124 Views Asked by At

Firstly, I wish I could think of a better way to describe my problem. I'm trying to take the following example list:

lst<- list(c(2), c(1,3), c(2), c(4), c(6), c(5, 7), c(6,8), c(7))

And turn it into this:

res<- list(c(1,2,3), c(1,2,3), c(1,2,3), c(4), c(5,6,7,8), c(5,6,7,8),c(5,6,7,8), c(5,6,7,8))

OR even this:

also_good_answer<- list(c(1, 2, 3), logical(0), logical(0), 4, c(5, 6, 7, 8), logical(0), 
    logical(0), logical(0))

I've been using a couple while statements in combination with setdiff to do this, but I'm wondering if there is an eloquent and faster way to do this over a large list?

As always, thank you in advance. -nate

My method:

lst<- list(c(2), c(1,3), c(2), c(4), c(6), c(5, 7), c(6,8), c(7)) # Original List
to_sequence<- 1:length(lst) 
res<- lapply(1:length(lst), function(x) {return(NA)})  # Building Result Object

while(length(to_sequence) > 0){
  tres<- c()
  idx<- to_sequence[1]
  next_idxs<- setdiff(lst[[idx]], NA)
  tres<- c(tres, next_idxs)
  while(length(next_idxs) >=1){
    next_idxs<- lapply(seq_along(next_idxs), function(x){
      lst[[next_idxs[x]]]
    } ) %>% unlist() %>% setdiff(., unlist(tres)) # uses slow setdiff
    tres<-c(tres, next_idxs)
  }
  res[[idx]]<- tres
  to_sequence<- setdiff(to_sequence, tres) # Another slow setdiff
  cat("Length to_sequence:", base::prettyNum(length(to_sequence),big.mark = ","), "\n") 
}

res<- lapply(res, sort)

EDITS
Updated lst: lst<- list(c(2), c(1,3), c(2,8), c(4), c(6), c(5, 7), c(6,9), c(3),c(7))

Giving Desired Output: list(c(1, 2, 3, 8), logical(0), logical(0), 4, c(5, 6, 7, 9), logical(0), logical(0), logical(0), logical(0))

2

There are 2 best solutions below

0
G. Grothendieck On

The revised question includes commentary that implies that the problem can be interpreted as finding the connected components of an undirected graph described by an input list of vectors of nodes, lst, such that lst[[i]] is a vector of the nodes connected to node i. 

To compute the result, using the input lst (defined in the question and also shown in the Note at the end), first build the Rcpp_SimplexTree class graph object, st, extract the connected_components property, which in this case is c(1, 1, 1, 3, 5, 5, 5, 1, 5) and replace each component of lst with the unique ordered set of nodes in its connected component.

Finally replace duplicate elements in that with logical(0) . The ## line can be omitted if duplicates are ok.

library(simplextree)

st <- simplex_tree()
for(i in seq_along(lst)) insert(st, t(cbind(lst[[i]],i)))

out <- ave(lst, st$connected_components, FUN = \(x) list(sort(unique(unlist(x)))))
out[duplicated(out)] <- list(logical(0)) ##

# check - see Note at end taken from question for desired
identical(out, desired)
## [1] TRUE

plot(st)

screenshot

Old

In light of the revised explanation in the question this is no longer relevant. Use above instead.

I am not completely sure I understand the processing in the question but the following code does not use setdiff (although it does use intersect) and there are no explicit loops giving the same result for the example. ok is TRUE for each element for which all elements before and after it -- the after includes the element itself - are disjoint. cumsum(ok) gives a grouping vector - c(1, 1, 1, 2, 3, 3, 3, 3) -- and then we create a list L which contains the group components themselves and finally repeat it the required number of times. Another thing to try is to see whether Rfast2::Intersect or easy.utils::fastIntersect speeds up the intersect operation.

ok <- sapply(seq_along(lst), function(i, i1 = seq_len(i-1)) {
  (i == 1) || length(intersect(unlist(lst[i1]), unlist(lst[-i1]))) == 0
})
L <- tapply(lst, cumsum(ok), function(x) sort(unique(unlist(x))))
result <- rep(L, lengths(L))

str(result)

giving

List of 8
 $ 1: num [1:3] 1 2 3
 $ 1: num [1:3] 1 2 3
 $ 1: num [1:3] 1 2 3
 $ 2: num 4
 $ 3: num [1:4] 5 6 7 8
 $ 3: num [1:4] 5 6 7 8
 $ 3: num [1:4] 5 6 7 8
 $ 3: num [1:4] 5 6 7 8

Note

lst <- list(2, c(1, 3), c(2, 8), 4, 6, c(5, 7), c(6, 9), 3, 7)

desired <- list(
  c(1, 2, 3, 8), logical(0), logical(0), 4, c(5, 6, 7, 9), logical(0),
  logical(0), logical(0), logical(0)
)
0
ThomasIsCoding On

Update

According to the updated example and expected output, we use replace and duplicated to nullify the cluster duplicates (see the last line out <- ...)

library(igraph)
g <- graph_from_data_frame(stack(setNames(lst, seq_along(lst))))
clst <- membership(components(g))
nms <- as.integer(names(clst))
res <- unname(by(nms, clst, sort))[clst[order(nms)]]
out <- replace(res, duplicated(res), list(logical(0)))

and we can obtain

> out
[[1]]
[1] 1 2 3 8

[[2]]
logical(0)

[[3]]
logical(0)

[[4]]
[1] 4

[[5]]
[1] 5 6 7 9

[[6]]
logical(0)

[[7]]
logical(0)

[[8]]
logical(0)

[[9]]
logical(0)

Previous Answer

You can use membership from igraph to detect the clusters

library(igraph)
g <- graph_from_data_frame(stack(setNames(lst, seq_along(lst))))
clst <- membership(components(g))
nms <- as.integer(names(clst))
res <- unname(by(nms, clst, sort))[clst[order(nms)]]

and you will obtain

> res
[[1]]
[1] 1 2 3

[[2]]
[1] 1 2 3

[[3]]
[1] 1 2 3

[[4]]
[1] 4

[[5]]
[1] 5 6 7 8

[[6]]
[1] 5 6 7 8

[[7]]
[1] 5 6 7 8

[[8]]
[1] 5 6 7 8

Graph Interpretation

If you are interested in how the problem is interpreted into a graph, you can use plot(as.undirected(g)), which shows

enter image description here