Is filebacked.big.matrix in the bigmemory package memory neutral?


I have been using filebacked.big.matrix to store a very large matrix (~1 million x 20 thousand). I am working on a cluster with very high memory, but not quite that much. I previously used the ff package, which worked great and kept memory usage consistent regardless of matrix size, but it died when the matrix exceeded .Machine$integer.max (~2^31) items (the R community really needs to fix that problem). filebacked.big.matrix initially seemed to work very well and generally runs without problems, but when I check on the memory usage it sometimes spikes into the hundreds of GBs. I am careful to read/write only a relatively few rows of the matrix at a time, so I don't think there should be much in memory at any given moment.

Does it do some sort of automatic memory caching or something that is driving the memory usage up? If so, can this caching be disabled or limited? The high memory usage is causing some nasty side effects on the cluster, so I need a way to do this that is memory neutral. I have checked the filebacked.big.matrix help page but can't find any pertinent information there. Thanks!
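For reference, this is roughly the access pattern I use, reduced to a toy example so I can watch gc() while pulling in one chunk (the dimensions, object names, and file names here are made up for illustration, not my real data):

library(bigmemory)

# Much smaller than my real matrix, just to show the chunked access pattern
x = filebacked.big.matrix(100000, 100, type = 'double',
                          backingfile = 'test.bin',
                          descriptorfile = 'test.desc',
                          backingpath = tempdir())

gc(reset = TRUE)                     # reset the "max used" counters
chunk = x[1:40000, , drop = FALSE]   # pull one block of rows into an ordinary R matrix
rm(chunk)
gc()                                 # R heap drops back down; OS-reported memory may not

Watching gc() this way shows R's own allocations staying small, which is part of why the numbers the cluster reports confuse me.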

UPDATE:
I am also using bigmemoryExtras.
I was wrong earlier: the problem happens when I loop through the entire matrix, copying it into a different, smaller file-backed matrix like this:

tmpGeno = fileBackedMatrix(rowIndex - 1, numMarkers, 'double', tmpDir)
front = 1
back = 40000
# the large matrix must be copied in chunks to avoid .Machine$integer.max errors
while (front < rowIndex - 1) {
    if (back > rowIndex - 1) back = rowIndex - 1
    tmpGeno[front:back, 1:numMarkers] = genotypeMatrix[front:back, 1:numMarkers, drop = FALSE]
    front = front + 40000
    back = back + 40000
}

The physical memory usage is initially very low (with virtual memory very high). But while this loop is running, and even after it has finished, it seems to keep most of the matrix in physical memory. I need it to keep only the one small chunk of the matrix in memory at a time.
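Would adding an explicit flush() after each chunk help? Something like this (a sketch only; I am assuming tmpGeno behaves like a regular bigmemory file-backed big.matrix, for which flush() is defined):

chunkSize = 40000
tmpGeno = fileBackedMatrix(rowIndex - 1, numMarkers, 'double', tmpDir)
front = 1
back = chunkSize
while (front < rowIndex - 1) {
    if (back > rowIndex - 1) back = rowIndex - 1
    tmpGeno[front:back, 1:numMarkers] = genotypeMatrix[front:back, 1:numMarkers, drop = FALSE]
    flush(tmpGeno)   # ask bigmemory to write this chunk out to the backing file
    gc()             # R's heap stays small even when top reports far more
    front = front + chunkSize
    back = back + chunkSize
}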

UPDATE 2:
It is a bit confusing to me: the cluster metrics and the top command say that it is using tons of memory (~80 GB), but the gc() command says that memory usage never went over 2 GB. The free command says that all the memory is used, but its -/+ buffers/cache line says only 7 GB are being used in total.
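One way I can think of to compare the numbers from inside R is something like this (Linux-only sketch; it just sets gc()'s report next to the VmRSS line from /proc/self/status):

heap_mb = sum(gc()[, 2])   # Ncells + Vcells currently used, in Mb
status = readLines('/proc/self/status')
rss_kb = as.numeric(gsub('[^0-9]', '', grep('^VmRSS', status, value = TRUE)))
cat(sprintf('R heap: %.0f MB, resident set size: %.0f MB\n', heap_mb, rss_kb / 1024))

In my runs the first number stays small while the second (and the cluster's own accounting) is huge, which is the discrepancy I am trying to understand.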
