Clustering a large, very sparse, binary matrix in R

3.7k Views Asked by At

I have a large, sparse binary matrix (roughly 39,000 x 14,000; most rows have only a single "1" entry). I'd like to cluster similar rows together, but my initial plan takes too long to complete:

d <- dist(inputMatrix, method="binary")
hc <- hclust(d, method="complete")

The first step doesn't finish, so I'm not sure how the second step would fare. What are some approaches to efficiently grouping similar rows of a large, sparse, binary matrix in R?

3

There are 3 best solutions below

5
On BEST ANSWER

I've written some Rcpp code and R code which works out the binary/Jaccard distance of a binary matrix approx. 80x faster than dist(x, method = "binary"). It converts the input matrix into a raw matrix which is the transpose of the input (so that the bit patterns are in the correct order internally). This is then used in some C++ code which handles the data as 64 bit unsigned integers for speed. The Jaccard distance of two vectors x and y is equal to x ^ y / (x | y) where ^ is the xor operator. The Hamming Weight calculation is used to count the number of bits set if the result of the xor or or is non-zero.

I've put together the code on github at https://github.com/NikNakk/binaryDist/ and reproduced the two files below. I've confirmed that the results are the same as dist(x, method = "binary") for a few random datasets.

On a dataset of 39000 rows by 14000 columns with 1-5 ones per row, it took about 11 minutes. The output distance matrix was 5.7 GB.

bDist.cpp

#include <Rcpp.h>
using namespace Rcpp;

//countBits function taken from https://en.wikipedia.org/wiki/Hamming_weight#Efficient_implementation

const uint64_t m1  = 0x5555555555555555; //binary: 0101...
const uint64_t m2  = 0x3333333333333333; //binary: 00110011..
const uint64_t m4  = 0x0f0f0f0f0f0f0f0f; //binary:  4 zeros,  4 ones ...
const uint64_t h01 = 0x0101010101010101; //the sum of 256 to the power of 0,1,2,3...

int countBits(uint64_t x) {
  x -= (x >> 1) & m1;             //put count of each 2 bits into those 2 bits
  x = (x & m2) + ((x >> 2) & m2); //put count of each 4 bits into those 4 bits 
  x = (x + (x >> 4)) & m4;        //put count of each 8 bits into those 8 bits 
  return (x * h01)>>56;  //returns left 8 bits of x + (x<<8) + (x<<16) + (x<<24) + ... 
}

// [[Rcpp::export]]
int countBitsFromRaw(RawVector rv) {
  uint64_t* x = (uint64_t*)RAW(rv);
  return(countBits(*x));
}

// [[Rcpp::export]]
NumericVector bDist(RawMatrix mat) {
  int nr(mat.nrow()), nc(mat.ncol());
  int nw = nr / 8;
  NumericVector res(nc * (nc - 1) / 2);
  // Access the raw data as unsigned 64 bit integers
  uint64_t* data = (uint64_t*)RAW(mat);
  uint64_t a(0);
  // Work through each possible combination of columns (rows in the original integer matrix)
  for (int i = 0; i < nc - 1; i++) {
    for (int j = i + 1; j < nc; j++) {
      uint64_t sx = 0;
      uint64_t so = 0;
      // Work through each 64 bit integer and calculate the sum of (x ^ y) and (x | y)
      for (int k = 0; k < nw; k++) {
        uint64_t o = data[nw * i + k] | data[nw * j + k];
        // If (x | y == 0) then (x ^ y) will also be 0
        if (o) {
          // Use Hamming weight method to calculate number of set bits
          so = so + countBits(o);
          uint64_t x = data[nw * i + k] ^ data[nw * j + k];
          if (x) {
            sx = sx + countBits(x);
          }
        }
      }
      res(a++) = (double)sx / so;
    }
  }
  return (res);
}

R source

library("Rcpp")
library("plyr")
sourceCpp("bDist.cpp")

# Converts a binary integer vector into a packed raw vector,
# padding out at the end to make the input length a multiple of packWidth
packRow <- function(row, packWidth = 64L) {
  packBits(as.raw(c(row, rep(0, (packWidth - length(row)) %% packWidth))))
}

as.PackedMatrix <- function(x, packWidth = 64L) {
  UseMethod("as.PackedMatrix")
}

# Converts a binary integer matrix into a packed raw matrix
# padding out at the end to make the input length a multiple of packWidth
as.PackedMatrix.matrix <- function(x, packWidth = 64L) {
  stopifnot(packWidth %% 8 == 0, class(x) %in% c("matrix", "Matrix"))
  storage.mode(x) <- "raw"
  if (ncol(x) %% packWidth != 0) {
    x <- cbind(x, matrix(0L, nrow = nrow(x), ncol = packWidth - (ncol(x) %% packWidth)))
  }
  out <- packBits(t(x))
  dim(out) <- c(ncol(x) %/% 8, nrow(x))
  class(out) <- "PackedMatrix"
  out
}

# Converts back to an integer matrix
as.matrix.PackedMatrix <- function(x) {
  out <- rawToBits(x)
  dim(out) <- c(nrow(x) * 8L, ncol(x))
  storage.mode(out) <- "integer"
  t(out)
}

# Generates random sparse data for testing the main function
makeRandomData <- function(nObs, nVariables, maxBits, packed = FALSE) {
  x <- replicate(nObs, {
    y <- integer(nVariables)
    y[sample(nVariables, sample(maxBits, 1))] <- 1L
    if (packed) {
      packRow(y, 64L)
    } else {
      y
    }
  })
  if (packed) {
    class(x) <- "PackedMatrix"
    x
  } else {
    t(x)
  }
}

# Reads a binary matrix from file or character vector
# Borrows the first bit of code from read.table
readPackedMatrix <- function(file = NULL, text = NULL, packWidth = 64L) {
  if (missing(file) && !missing(text)) {
    file <- textConnection(text)
    on.exit(close(file))
  }
  if (is.character(file)) {
    file <- file(file, "rt")
    on.exit(close(file))
  }
  if (!inherits(file, "connection")) 
    stop("'file' must be a character string or connection")
  if (!isOpen(file, "rt")) {
    open(file, "rt")
    on.exit(close(file))
  }
  lst <- list()
  i <- 1
  while(length(line <- readLines(file, n = 1)) > 0) {
    lst[[i]] <- packRow(as.integer(strsplit(line, "", fixed = TRUE)[[1]]), packWidth = packWidth)
    i <- i + 1
  }
  out <- do.call("cbind", lst)
  class(out) <- "PackedMatrix"
  out
}

# Wrapper for the C++ code which 
binaryDist <- function(x) {
  if (class(x) != "PackedMatrix") {
    x <- as.PackedMatrix(x)
  }
  dst <- bDist(x)
  attr(dst, "Size") <- ncol(x)
  attr(dst, "Diag") <- attr(dst, "Upper") <- FALSE
  attr(dst, "method") <- "binary"
  attr(dst, "call") <- match.call()
  class(dst) <- "dist"
  dst
}

x <- makeRandomData(2000, 400, maxBits = 5, packed = TRUE)

system.time(bd <- binaryDist(x))

From original answer:

Other things to consider would be doing some prefiltering of comparisons between two rows with single ones since the distance will either be 0 for duplicates or 1 for any other possibility.

A couple of relatively straightforward options that might be faster without needing much code are the vegdist function from the vegan package and the Dist function from the amap package. The latter will probably only be quicker if you have multiple cores and take advantage of the fact it supports parallelisation.

0
On

The reason this takes so long to compute is that the call to dist is computing and storing more than 760 million pairwise distances. If your data is stored sparsely, this will take a long time and huge amount of storage. If your data is not stored sparsely, then each distance computation requires at least 14,000 operations, for a total operation count exceeding 1 quadrillion!

An approach that will be much quicker is k-means clustering, since it doesn't require pre-computing a distance matrix; at each iteration you will need only 39000*k distance calculations, where k is the number of clusters. To get pairwise distances that are similar to the Jaccard index (0 if identical, 1 if no indices coincide, in between if some but not all indices coincide), you could divide each row x by sqrt(2*sum(x^2)). For instance, if you had the following input matrix:

(mat <- rbind(c(1, 0, 0, 0, 0), c(0, 0, 0, 1, 1)))
#      [,1] [,2] [,3] [,4] [,5]
# [1,]    1    0    0    0    0
# [2,]    0    0    0    1    1

the normalized version would be (assuming binary values only in the matrix; if this were not the case you would use rowSums(mat^2)):

(mat.norm <- mat / sqrt(2*rowSums(mat)))
#           [,1] [,2] [,3] [,4] [,5]
# [1,] 0.7071068    0    0  0.0  0.0
# [2,] 0.0000000    0    0  0.5  0.5

These two observations (which have no indices in common), have Euclidean distance 1, coinciding with the Jaccard distance for this case.

dist(mat.norm, "euclidean")
#   1
# 2 1

Additionally, identical observations will clearly have Euclidean distance 0, again corresponding to the Jaccard distance.

0
On
  1. do you have duplicate rows? There is no need to compute their distances twice.

  2. all rows with a single 1 will be 100% different from all rows with a single one in a different place.

Thus, it does not make sense to run clustering on such data. The output is rather predictable, and boils down to finding the 1.

Try restricting your data set to those objects that have more than one 1 only. Unless you can get interesting results on these only, no need to continue further. Binary data has too little information.