I have a large, sparse binary matrix (roughly 39,000 x 14,000; most rows have only a single "1" entry). I'd like to cluster similar rows together, but my initial plan takes too long to complete:
```r
d <- dist(inputMatrix, method="binary")
hc <- hclust(d, method="complete")
```
The first step doesn't finish, so I'm not sure how the second step would fare. What are some approaches to efficiently grouping similar rows of a large, sparse, binary matrix in R?
I've written some Rcpp code and R code which works out the binary/Jaccard distance of a binary matrix approx. 80x faster than `dist(x, method = "binary")`. It converts the input matrix into a raw matrix which is the transpose of the input (so that the bit patterns are in the correct order internally). This is then used in some C++ code which handles the data as 64-bit unsigned integers for speed. The Jaccard distance of two vectors x and y is the number of set bits in `x ^ y` divided by the number of set bits in `x | y`, where `^` is the xor operator and `|` is the or operator. A Hamming weight calculation is used to count the number of bits set whenever the result of the `xor` or the `or` is non-zero.

I've put together the code on GitHub at https://github.com/NikNakk/binaryDist/ and reproduced the two files below. I've confirmed that the results are the same as `dist(x, method = "binary")` for a few random datasets.

On a dataset of 39000 rows by 14000 columns with 1-5 ones per row, it took about 11 minutes. The output distance matrix was 5.7 GB.
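The full implementation is in the repository and the two files below. Purely as a rough, byte-wise illustration of the packing and bit-counting idea (the real code works on 64-bit words, and the function and helper names here are mine, not from bDist.cpp), a single pair of rows could be compared like this:

```r
library(Rcpp)

# Kernighan-style popcount over the xor and or of two packed rows.
cppFunction('
double jaccard_raw(RawVector x, RawVector y) {
  int num = 0, den = 0;
  for (R_xlen_t i = 0; i < x.size(); ++i) {
    unsigned char a = x[i] ^ y[i];   // bits set in exactly one row
    unsigned char b = x[i] | y[i];   // bits set in either row
    while (a) { a &= a - 1; ++num; }
    while (b) { b &= b - 1; ++den; }
  }
  return den == 0 ? 0.0 : (double) num / den;
}')

# Pack a 0/1 row into a raw (byte) vector, padding to a multiple of 8 bits.
pack_row <- function(v) {
  pad <- (-length(v)) %% 8
  packBits(as.logical(c(v, rep(0, pad))), type = "raw")
}

# Quick sanity check against dist() on a small random binary matrix.
set.seed(1)
m <- matrix(rbinom(20 * 16, 1, 0.2), nrow = 20)
all.equal(jaccard_raw(pack_row(m[1, ]), pack_row(m[2, ])),
          as.matrix(dist(m, method = "binary"))[1, 2])
```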
**bDist.cpp**

**R source**
**From original answer:**
Another thing to consider would be prefiltering the comparisons between rows that each contain only a single 1, since the distance between two such rows is either 0 (duplicates) or 1 (any other possibility).
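A minimal sketch of that prefiltering, assuming `inputMatrix` is the 0/1 matrix from the question (the variable names and helper logic are mine):

```r
# Rows with exactly one 1 can be grouped by the column holding that 1:
# rows in the same group are duplicates (distance 0), rows in different
# groups share no 1s (Jaccard distance exactly 1), so neither case needs
# the full pairwise distance computation.
singletons <- which(rowSums(inputMatrix) == 1)
single_col <- max.col(inputMatrix[singletons, , drop = FALSE])
singleton_groups <- split(singletons, single_col)

# Only the remaining rows need the expensive pairwise calculation.
rest <- setdiff(seq_len(nrow(inputMatrix)), singletons)
```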
A couple of relatively straightforward options that might be faster without needing much code are the `vegdist` function from the vegan package and the `Dist` function from the amap package. The latter will probably only be quicker if you have multiple cores and take advantage of the fact that it supports parallelisation.
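For completeness, this is roughly how those two calls would look; the argument names below are from my reading of the packages' documentation rather than from the original answer, so check `?vegdist` and `?Dist` before relying on them:

```r
library(vegan)
library(amap)

# vegan: Jaccard distance on presence/absence data, which matches
# dist(x, method = "binary") for a 0/1 matrix.
d_vegan <- vegdist(inputMatrix, method = "jaccard", binary = TRUE)

# amap: the same binary distance, with nbproc spreading the work over
# several cores (only helps if you actually have them).
d_amap <- Dist(inputMatrix, method = "binary", nbproc = 4)
```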