Very slow raster::sampleRandom, what can I do as a workaround?

358 Views Asked by At

tl;dr: why is raster::sampleRandom taking so much time? e.g. to extract 3k cells from 30k cells (over 10k timesteps). Is there anything I can do to improve the situation?
EDIT: workaround at bottom.

Consider a R script in which I have to read a big file (usually more than 2-3GB) and perform quantile calculation over the data. I use the raster package to read the (netCDF) file. I'm using R 3.1.2 under 64bit GNU/Linux with 4GB of RAM, 3.5GB available most of the time.

As the files are often too big to fit into memory (even 2GB files for some reason will NOT fit into 3GB of available memory: unable to allocate vector of size 2GB) I cannot always do this, which is what I would do if I had 16GB of RAM:

pr <- brick(filename[i], varname=var[i], na.rm=T)
qs <- quantile(getValues(pr)*gain[i], probs=qprobs, na.rm=T, type=8, names=F)

But instead I can sample a smaller number of cells in my files using the function sampleRaster() from the raster package, still getting good statistics.
e.g.:

pr <- brick(filename[i], varname=var[i], na.rm=T)
qs <- quantile(sampleRandom(pr, cnsample)*gain[i], probs=qprobs, na.rm=T, type=8, names=F)

I perform this over 6 different files (i goes from 1 to 6) which all have about 30k cells and 10k timesteps (so 300M values). Files are:

  1. 1.4GB, 1 variable, filesystem 1
  2. 2.7GB, 2 variables, so about 1.35GB for the variable that I read, filesystem 2
  3. 2.7GB, 2 variables, so about 1.35GB for the variable that I read, filesystem 2
  4. 2.7GB, 2 variables, so about 1.35GB for the variable that I read, filesystem 2
  5. 1.2GB, 1 variable, filesystem 3
  6. 1.2GB, 1 variable, filesystem 3

Note that:

  1. files are on three different nfs filesystem, whose performance I'm not sure of. I cannot rule out the fact that the nfs filesystems can greatly vary in performance from one moment to the other.
  2. RAM usage is 100% all of the time when the script runs, but the system does not use all of it's swap.
  3. sampleRandom(dataset, N) takes N non-NA random cells from one layer (= one timestep), and reads their content. Does so for the same N cells for each layer. If you visualize the dataset as a 3D matrix, with Z as timesteps, the function takes N random non-NA columns. However, I guess the function does not know that all the layers have the NAs in the same positions, so it has to check that any column it chooses does not have NAs in it.
  4. When using the same commands on files with 8393 cells (about 340MB in total) and reading all the cells, the computing time is a fraction of trying to read 1000 cells from a file with 30k cells.

The full script which produces the output below is here, with comments etc.


If I try to read all the 30k cells:

  1. cannot allocate vector of size 2.6 Gb

If I read 1000 cells:

  1. 5 minutes
  2. 45 m
  3. 30 m
  4. 30 m
  5. 20 m
  6. 20 m

If I read 3000 cells:

  1. 15 minutes
  2. 18 m
  3. 35 m
  4. 34 m
  5. 60 m
  6. 60 m

If I try to read 5000 cells:

  1. 2.5 h
  2. 22 h
  3. for >2 I had to stop after 18h, I had to use the workstation for other tasks

With more tests, I've been able to find out that it's the sampleRandom() function that's taking most of the computing time, not the calculation of the quantile (which I can speed up using other quantile functions, such as kuantile()).

  1. Why is sampleRandom() taking so long? Why does it perform so strangely, sometimes fast and sometimes very slow?
  2. What is the best workaround? I guess I could manually generate N random cells for the 1st layer and then manually raster::extract for all timesteps.

EDIT: Working workaround is to do:

cells <- sampleRandom(pr[[1]], cnsample, cells=T) #Extract cnsample random cells from the first layer, exluding NAs
cells[,1]
prvals <- pr[cells[,1]] #Read those cells from all layers
qs <- quantile(prvals, probs=qprobs, na.rm=T, type=8, names=F) #Compute quantile

This works and is very fast because all layers have NAs in the same positions. I think this should be an option that sampleRandom() could implement.

0

There are 0 best solutions below