I'm using doMPI in R to parallelize saving netCDF climate data. The data is stored in R in a 4-dimensional array m, holding 6 variables at 20000 timepoints over a latitude/longitude grid, so m is indexed as m[lon,lat,time,variable]. Based on how netCDF stores its data on disk, the most efficient way to write the data is by timeslice. Consequently, I'd like to iterate over m one timeslice at a time, for each variable. Currently, my code looks like this:
ntime <- 20000
output.vars <- list("rainfall", "snowfallwateq", "snowmelt", "newsnow",
                    "snowdepth", "swe")

for (var.index in seq_along(output.vars)) {
  ncout <- nc_open(output.files[var.index], write = TRUE)
  val <- foreach(time.index = 1:ntime, .packages = c("ncdf4")) %dopar% {
    ncvar_put(ncout, output.vars[[var.index]],
              vals = m[, , time.index, var.index],
              start = c(1, 1, time.index),
              count = c(nlon, nlat, 1))
  }
  nc_close(ncout)
}
This unnecessarily copies the entire m array to each worker, which takes a prohibitively large amount of memory, and I need to reduce the amount of copied data. What I took from this answer is that I could iterate over each timeslice of the array, so that only the data for that timeslice is copied to each worker at each iteration. The foreach construct allows multiple objects to be iterated over simultaneously, so I could even have the time index alongside the matrix timeslice without a problem. Unfortunately, I don't know of any way to iterate over an array by timeslice. Is there a way to do so, such that at each iteration t of the foreach loop for variable var, I have a variable data which holds the 2-dimensional matrix m[,,t,var]?
I've already tried the intuitive approach below, but it iterates over each individual element, rather than an entire timeslice at a time.
val <- foreach(time.index=1:ntime, slice=m[,,,var], ...
If you can massage your data in the main R process, you could try transforming each 2-dimensional slice into a big.matrix from the bigmemory package, and use that in your parallel workers. This would only be useful if the time required to process each slice in the worker process is significant. See the sketch below, and notice that you can nest two foreach loops with %:%.
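A minimal sketch of that idea, assuming a single-machine backend where bigmemory's shared memory is visible to all workers (with doMPI spread across nodes you would need file-backed big.matrix objects on a shared filesystem); the names slice_descs and slice are made up for illustration, and the actual per-slice work is left as a placeholder:

library(bigmemory)
library(foreach)

# In the main process: turn each 2-d timeslice into a big.matrix and keep
# only its small descriptor; workers re-attach through shared memory.
slice_descs <- lapply(seq_along(output.vars), function(v) {
  lapply(1:ntime, function(t) describe(as.big.matrix(m[, , t, v])))
})

val <- foreach(var.index = seq_along(output.vars)) %:%
  foreach(time.index = 1:ntime, .packages = "bigmemory") %dopar% {
    # Only the descriptor travels to the worker, not the slice data itself
    slice <- attach.big.matrix(slice_descs[[var.index]][[time.index]])
    # ... do the per-slice work here (e.g. the ncvar_put call from the
    # question) using slice[, ], which is the 2-d matrix m[,,t,v] ...
    dim(slice)  # placeholder result
  }

Note that slice_descs itself still gets auto-exported, but it only holds small descriptors, not the data.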
If you can't/won't use bigmemory, or if processing each 2-dimensional slice is too fast (yes, this may be problematic for multi-processing, see this answer), maybe you can extract 3-dimensional slices from your data and use .noexport to only copy one at a time, something like the sketch below.
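The following is a rough sketch of that approach, not the original example; it assumes one 3-dimensional slab per variable and uses .noexport to keep m and the full list of slabs out of the automatic export (the names slices_3d and slice_3d are illustrative, and the per-timeslice work is again a placeholder):

library(foreach)

# One 3-dimensional slab (lon x lat x time) per variable, built once in
# the main R process
slices_3d <- lapply(seq_along(output.vars), function(v) m[, , , v])

for (var.index in seq_along(output.vars)) {
  slice_3d <- slices_3d[[var.index]]   # only this slab should be exported
  val <- foreach(time.index = 1:ntime,
                 .noexport = c("m", "slices_3d")) %dopar% {
    timeslice <- slice_3d[, , time.index]  # 2-d matrix for this timepoint
    # ... do the per-timeslice work here, as in the question ...
    dim(timeslice)  # placeholder result
  }
}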
I'm actually not 100% sure the above would prevent copying the whole slices_3d; if not, you may have to manually extract subsets in chunks in the main R process (e.g. slices_3d[1L:num_parallel_workers] and so on each time), and make sure that only one chunk is exported in each call to foreach.