This is a followup of a question I asked yesterday, now extended to include more than 2 inputs. I was able to find two related answers on SO, but none of them gave quite enough information for me to solve this in a performant way.
I would like to combine a list of IRanges into a single IRanges. Here's an example input:
[[1]]
IRanges object with 2 ranges and 1 metadata column:
start end width | on_betalac
<integer> <integer> <integer> | <logical>
[1] 1 21 21 | FALSE
[2] 22 22 1 | TRUE
[[2]]
IRanges object with 2 ranges and 1 metadata column:
start end width | on_other
<integer> <integer> <integer> | <logical>
[1] 1 21 21 | FALSE
[2] 22 22 1 | TRUE
[[3]]
IRanges object with 1 range and 1 metadata column:
start end width | on_pen
<integer> <integer> <integer> | <logical>
[1] 1 22 22 | FALSE
[[4]]
IRanges object with 3 ranges and 1 metadata column:
start end width | on_quin
<integer> <integer> <integer> | <logical>
[1] 1 3 3 | FALSE
[2] 4 13 10 | TRUE
[3] 14 22 9 | FALSE
For ease of replication, the dput of this list is at the end of my post.
And my desired output is:
IRanges object with 4 ranges and 4 metadata columns:
start end width | on_betalac on_other on_pen on_quin
<integer> <integer> <integer> | <logical> <logical> <logical> <logical>
[1] 1 3 3 | FALSE FALSE FALSE FALSE
[2] 4 13 10 | FALSE FALSE FALSE TRUE
[3] 14 21 8 | FALSE FALSE FALSE FALSE
[4] 22 22 1 | TRUE TRUE FALSE FALSE
You can see that the output is kind of like the disjoin of the input, but with the mcols propagated through, so that each output row has the mcols of the input row that "gave rise" to it.
Here's my solution, which works, but is quite slow.
combine_exposures <- function(exposures) {
cd <- do.call(what = c, args = exposures)
mc <- mcols(cd)
dj <- disjoin(x = cd, with.revmap = TRUE)
r <- mcols(dj)$revmap
d <- as.data.frame(matrix(nrow = length(dj), ncol = ncol(mc)))
names(d) <- names(mc)
for (i in 1:length(dj)) {
d[i,] <- sapply(X = 1:ncol(mc), FUN = function(j) { mc[r[[i]][j], j] })
}
mcols(dj) <- d
return(dj)
}
And here's the dput of the sample input:
list(new("IRanges", start = c(1L, 22L), width = c(21L, 1L), NAMES = NULL,
elementType = "ANY", elementMetadata = new("DataFrame", rownames = NULL,
nrows = 2L, listData = list(on_betalac = c(FALSE, TRUE
)), elementType = "ANY", elementMetadata = NULL, metadata = list()),
metadata = list()), new("IRanges", start = c(1L, 22L), width = c(21L,
1L), NAMES = NULL, elementType = "ANY", elementMetadata = new("DataFrame",
rownames = NULL, nrows = 2L, listData = list(on_other = c(FALSE,
TRUE)), elementType = "ANY", elementMetadata = NULL, metadata = list()),
metadata = list()), new("IRanges", start = 1L, width = 22L,
NAMES = NULL, elementType = "ANY", elementMetadata = new("DataFrame",
rownames = NULL, nrows = 1L, listData = list(on_pen = FALSE),
elementType = "ANY", elementMetadata = NULL, metadata = list()),
metadata = list()), new("IRanges", start = c(1L, 4L, 14L),
width = c(3L, 10L, 9L), NAMES = NULL, elementType = "ANY",
elementMetadata = new("DataFrame", rownames = NULL, nrows = 3L,
listData = list(on_quin = c(FALSE, TRUE, FALSE)), elementType = "ANY",
elementMetadata = NULL, metadata = list()), metadata = list()))
I've come up with a more efficient version, but still suspect it could be much faster.
I ran bench::mark and found that this version is about 3x faster. This is probably good enough for my application, but I get the sense I'm not using IRanges quite right.