I am trying to find the frequency of patterns in a deep matrix/dataframe (cols: id, variable, value) with 10s of millions of rows. This is easy to do in a wide matrix as shown below. I was wondering if there was a way to do the same (in a deep matrix) without first converting to wide format. Thanks.
require(dplyr)
require(tidyr)
set.seed(100)
ncol <- 10
nrow <- 100000
#create sample matrix in wide format
df1 <- as.data.frame(matrix((runif(nrow*ncol)>0.8) + 0, ncol=ncol))
cols <- colnames(df1)
df1 <- filter(df1, rowSums(df1)>0)
df1 <- cbind(id=seq_len(nrow(df1)), df1)
#compute frequency of patterns
out1 <- df1 %>%
group_by_(.dots=cols) %>% summarise(freq=n()) %>% as.data.frame() %>% arrange(desc(freq))
#convert to deep format
df2 <- df1 %>%
gather(variable, value, -id) %>% filter(value>0)
#compute frequency of patterns
out2 <- df2 %>% spread(variable, value, fill=0) %>%
group_by_(.dots=cols) %>% summarise(freq=n()) %>% as.data.frame() %>% arrange(desc(freq))
identical(out1, out2)
(too long for a comment)
I doubt that it's possible (can't say for sure, though).Two challenges:
ncol
rows. How would you use "summarise" and break this down to a single row (which could only hold a single value which means it's an incomplete pattern)?filter(value > 0)
you effectively destroy most of the existing patterns because the vast majority of patterns (in wide format) include 0s in some rows. The only complete pattern that you could still observe then could consist of only 1s, right?More precisely: It might be possible, but I believe it would require a greater workaround than conversion from long to wide.
I just changed my mind, but I'm not sure whether this really so much different than conversion from long to wide format:
Result:
To compare to the other data and generate
df2
, I use:Compare with out1:
Obviously, I can't use
identical(out1, out2)
here becauseout2
only has 2 columns .. but I can use it on the frequency counts:.. and if you wanted to convert out2 to something identical to out1, you could use
separate
from tidyr: