Patterns in a deep matrix/dataframe

I am trying to find the frequency of patterns in a deep (long-format) matrix/dataframe (columns: id, variable, value) with tens of millions of rows. This is easy to do with a wide matrix, as shown below. I was wondering whether there is a way to do the same with a deep matrix without first converting it to wide format. Thanks.

require(dplyr)
require(tidyr)

set.seed(100)
ncol <- 10
nrow <- 100000

#create sample matrix in wide format
df1 <- as.data.frame(matrix((runif(nrow*ncol)>0.8) + 0, ncol=ncol))
cols <- colnames(df1)
df1 <- filter(df1, rowSums(df1)>0)
df1 <- cbind(id=seq_len(nrow(df1)), df1)

#compute frequency of patterns
out1 <- df1 %>%
    group_by(across(all_of(cols))) %>% summarise(freq = n()) %>%
    as.data.frame() %>% arrange(desc(freq))

#convert to deep format
df2 <- df1 %>% 
    gather(variable, value, -id) %>% filter(value>0)

#compute frequency of patterns
out2 <- df2 %>% spread(variable, value, fill = 0) %>%
    group_by(across(all_of(cols))) %>% summarise(freq = n()) %>%
    as.data.frame() %>% arrange(desc(freq))

identical(out1, out2)

2 Answers

Answer 1 (2 votes)

(too long for a comment)

I doubt that it's possible (can't say for sure, though).

Two challenges:

  • In the long format, each unique "pattern" is distributed over at least ncol rows. How would you use summarise() to collapse that into a single row, when a single row could hold only a single value and therefore only an incomplete pattern?
  • The second problem I see with your sample code: when you create df2 and use filter(value > 0), you effectively destroy most of the existing patterns, because the vast majority of patterns (in wide format) contain 0s in some columns. The only pattern whose rows all survive the filter is the one consisting entirely of 1s, right? (A small sketch follows this list.)
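
A minimal sketch of that second point, on a tiny made-up example (the toy object below is hypothetical, not from the question):

library(dplyr)
library(tidyr)

# two ids with different wide patterns: (1,0,0) and (1,0,1)
toy <- data.frame(id = 1:2, V1 = c(1, 1), V2 = c(0, 0), V3 = c(0, 1))

toy %>% gather(variable, value, -id) %>% filter(value > 0)
#   id variable value
# 1  1       V1     1
# 2  2       V1     1
# 3  2       V3     1
# the 0 cells are gone: each pattern now survives only implicitly, through the
# rows that are absent, and an all-1s pattern is the only one whose rows all remain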

More precisely: it might be possible, but I believe it would require a bigger workaround than simply converting from long to wide.


I just changed my mind, though I'm not sure whether this is really so different from converting from long to wide format:

out2 <- group_by(df2, id) %>% arrange(id, variable) %>%   # consistent variable order within each id
          summarise(pattern = toString(value)) %>%         # one pattern string per id
          count(pattern)                                    # tally identical pattern strings

Result:

> out2 %>% arrange(desc(n))
Source: local data frame [896 x 2]

                        pattern    n
1  0, 0, 0, 0, 0, 0, 0, 1, 0, 0 2794
2  0, 0, 1, 0, 0, 0, 0, 0, 0, 0 2754
3  0, 0, 0, 0, 0, 0, 0, 0, 0, 1 2742
4  0, 0, 0, 0, 0, 0, 0, 0, 1, 0 2716
5  0, 0, 0, 0, 0, 1, 0, 0, 0, 0 2716
6  1, 0, 0, 0, 0, 0, 0, 0, 0, 0 2710
7  0, 1, 0, 0, 0, 0, 0, 0, 0, 0 2685
8  0, 0, 0, 1, 0, 0, 0, 0, 0, 0 2633
9  0, 0, 0, 0, 1, 0, 0, 0, 0, 0 2630
10 0, 0, 0, 0, 0, 0, 1, 0, 0, 0 2618
..                          ...  ...

To compare with the other result and to generate df2, I use:

set.seed(100)
ncol <- 10
nrow <- 100000

#create sample matrix in wide format
df1 <- as.data.frame(matrix((runif(nrow*ncol)>0.8) + 0, ncol=ncol))
cols <- colnames(df1)
df1 <- filter(df1, rowSums(df1)>0)
df1 <- cbind(id=seq_len(nrow(df1)), df1)

#compute frequency of patterns
out1 <- df1 %>%
    group_by(across(all_of(cols))) %>% summarise(freq = n()) %>%
    as.data.frame() %>% arrange(desc(freq))

#convert to deep format
df2 <- df1 %>%                      # this is the input for my code
    gather(variable, value, -id)    # note that I don't use `filter(value>0)` here!

Compare with out1:

> head(out1[order(-out1$freq),])
  V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 freq
1  0  0  0  0  0  0  0  1  0   0 2794
2  0  0  1  0  0  0  0  0  0   0 2754
3  0  0  0  0  0  0  0  0  0   1 2742
4  0  0  0  0  0  0  0  0  1   0 2716
5  0  0  0  0  0  1  0  0  0   0 2716
6  1  0  0  0  0  0  0  0  0   0 2710

Obviously, I can't use identical(out1, out2) here because out2 has only 2 columns ... but I can use it on the frequency counts:

identical(out1$freq, out2$n)
#[1] TRUE

... and if you wanted to convert out2 into something identical to out1, you could use separate() from tidyr:

separate(out2, col = pattern, into = paste0("V", seq_len(ncol)), sep = ", ")
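
As an aside (a hedged variant, not part of the original answer): separate() returns character columns, so adding convert = TRUE would coerce the pieces back to integer and bring the result closer to out1's column types:

separate(out2, col = pattern, into = paste0("V", seq_len(ncol)), sep = ", ", convert = TRUE)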

Answer 2 (5 votes)

One possibility for the 'wide' arrangement is to paste the columns together

id = do.call(paste, c(df1[, -1], sep="*"))

and table the results

table(id)
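
For a quick look at the most frequent patterns (a small usage sketch along the same lines, not part of the original answer):

head(sort(table(id), decreasing = TRUE))   # the six most common pasted-together patterns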

Somehow this seems simpler than the dplyr syntax, though it relies on a 'trick' rather than general operations. Other summaries are straightforward, e.g., by row index and count

uid = unique(id)
data.frame(rowid=match(uid, id), count=tabulate(match(id, uid)))

or augmenting a unique version of the data.frame with the count information

cbind(df1[!duplicated(id),,drop=FALSE], count = tabulate(match(id, uid)))

For the (filtered) deep representation, I generated the data

set.seed(100)
ncol <- 10; nrow <- 100000
m1 <- matrix((runif(nrow*ncol)>0.8) + 0, ncol=ncol)
m1 <- m1[rowSums(m1) != 0,]                         # filter
m2 <- cbind(id=as.vector(row(m1)), var=as.vector(col(m1)), val=as.vector(m1))

then iterated through each variable to calculate a unique 'key' per id, shifting the (index of the unique) value by enough at each step to keep distinct patterns distinct

nid <- max(m2[,"id"])
nvar <- max(m2[,"var"])
key <- numeric(nid)
scale <- 1
for (i in seq_len(nvar)) {
    idx <- m2[, "var"] == i
    id <- m2[idx, "id"]
    val <- m2[idx, "val"]
    uval <- sort(unique(val))    # sort() not strictly necessary
    key[id] <- key[id] + scale * (match(val, uval) - 1L)
                                 # match() allows for non-integer 'val'
    scale <- scale * length(uval)
}

The keys can be summarized into a count table

ukey <- unique(key)
out2m <- data.frame(ukey=ukey,
                    rowid=seq_len(nid)[match(ukey, key)],
                    count=tabulate(match(key, ukey)))

and displayed in various ways

o <- order(out2m$count, decreasing=TRUE)
head(out2m[o,])
m1[out2m$rowid[head(o)],]
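
As a sanity check (a sketch I'm adding, assuming out1 and out2m were both built from set.seed(100) as above), the two approaches should agree on the counts even though the results are arranged differently:

all(sort(out1$freq, decreasing = TRUE) == sort(out2m$count, decreasing = TRUE))
# expected TRUE: same multiset of pattern frequencies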

This is faster and more memory-efficient than dplyr, but again a special-purpose algorithm. It also requires that the final scale not exceed the largest integer exactly representable as a double, about 2^53.
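
To make that limit concrete (a small sketch using the ncol defined above): with 0/1 values every variable contributes a factor of 2, so the final scale here is only 2^10, far below the point where consecutive integers stop being representable as doubles:

2^ncol                   # 1024: the final 'scale' when every variable takes both values 0 and 1
.Machine$double.digits   # 53 mantissa bits, hence the ~2^53 limit
(2^53 + 1) == 2^53       # TRUE: beyond 2^53, distinct keys could collide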

It's a little hard to know where to start and end a benchmark, but since the data could just as easily be a data frame or a matrix, and since we're apparently interested in counts, maybe the following are reasonable

fdf2 <- function(df2) {
    group_by(df2, id) %>% arrange(id, variable) %>%
        summarise(pattern = toString(value)) %>%
            count(pattern)
}

fm2 <- function(m2) {
    nid <- max(m2[,"id"])
    nvar <- max(m2[,"var"])
    key <- numeric(nid)
    scale <- 1
    for (i in seq_len(nvar)) {
        idx <- m2[, "var"] == i
        id <- m2[idx, "id"]
        val <- m2[idx, "val"]
        uval <- sort(unique(val))
        key[id] <- key[id] + scale * (match(val, uval) - 1L)
        scale <- scale * length(uval)
    }

    ukey <- unique(key)
    data.frame(ukey=ukey, rowid=seq_len(nid)[match(ukey, key)],
               count=tabulate(match(key, ukey)))
}

No need for microbenchmark or similar:

> system.time(fdf2(df2))
   user  system elapsed 
  4.640   0.000   4.639 
> system.time(fm2(m2))
   user  system elapsed 
  0.587   0.000   0.587 

It could be that the simulated data are not realistic, or that the algorithms scale differently and that one or the other is more competitive with the real data; the question as originally posed is not phrased clearly enough to perform more relevant tests.

Memory use is harder to gauge in R; I guessed that fm2 needs no more than the memory to hold about 3 vectors of nid elements, e.g., if all doubles

> print(object.size(double(nid)) * 3, units="auto")
2 Mb

I guess dplyr is clever, so it's harder to reason about, but some of the intermediate objects are large, e.g.,

> print(object.size(group_by(df2, id)), units="auto")
22.5 Mb

I'm actually not sure how to characterize memory use rigorously and easily, especially since dplyr calls into C code and may well use memory that R doesn't manage.
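
One rough way to compare allocations (a sketch, assuming the bench package is available; it only sees allocations made through R's allocator, so it shares the blind spot just mentioned):

library(bench)
bench::mark(
    dplyr  = fdf2(df2),
    matrix = fm2(m2),
    check = FALSE        # the two functions return differently shaped objects
)
# the mem_alloc column reports total R-level allocations for each expression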