Given a matrix like mat
> set.seed(1)
> mat <- matrix(rbinom(100,1,0.5),10,10)
> rownames(mat) <- paste0(sample(LETTERS[1:2],10,replace=T),c(1:nrow(mat)))
> colnames(mat) <- paste0(sample(LETTERS[1:2],10,replace=T),c(1:ncol(mat)))
> mat
A1 A2 A3 B4 B5 B6 B7 A8 B9 B10
B1 0 0 1 0 1 0 1 0 0 0
B2 0 0 0 1 1 1 0 1 1 0
B3 1 1 1 0 1 0 0 0 0 1
A4 1 0 0 0 1 0 0 0 0 1
A5 0 1 0 1 1 0 1 0 1 1
A6 1 0 0 1 1 0 0 1 0 1
A7 1 1 0 1 0 0 0 1 1 0
B8 1 1 0 0 0 1 1 0 0 0
A9 1 0 1 1 1 1 0 1 0 1
A10 0 1 0 0 1 0 1 1 0 1
I want to sample submatrices of the form:
A B
A 0 1
B 1 0
EDIT: Specifically, the submatrix should contain a 1 in the A-row/B-column, a 1 in the B-row/A-column, and 0s in the other two cells.
I can do this by randomly selecting one A-row and one B-row, then randomly selecting one A-column and one B-column, then checking whether it has the desired pattern. But, I'm trying to find a more efficient method that will work even in large/sparse matrices. Thanks!
One could enumerate all possible pairwise combinations of elements containing the value
1
, then eliminate pairs that share a row or column and pairs that would not result in0
elements for the submatrix main diagonal. The resulting rows and columns from each remaining pair would define all possible submatrices that meet the desired pattern and these would be trivial to sample. This is feasible for matrices with a relatively small number of1
elements (e.g., <100K--depending on available memory).For sparse matrices with a large number of
1
elements, a straightforward way to get an efficient, vectorized solution is also with rejection: sample pairs of1
elements for each submatrix's antidiagonal and reject if the corresponding main diagonal elements are not0
. The below solution assumes more elements are0
than1
. (If the opposite is true, it should be modified to sample two0
elements for the main diagonal and reject if the antidiagonal elements are not1
.) The rejection rate will depend mainly on the density of the sparse matrix. The example matrix is rather dense, so it has a relatively high rejection rate.Here it is implemented as a function that returns a specified number of samples either with or without replacement.