What I want to achieve is to split my dataset into n
chunks, where n is the number of available processors, and send each chunk to a different node: node 1 gets chunk 1, node 2 gets chunk 2, ..., and node n gets chunk n.
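For concreteness, the chunking I have in mind looks roughly like this (a toy sketch; `data` here is a small stand-in for my real dataset, and `ncores` is the number of workers):

```r
library(parallel)

# toy stand-in for my real dataset
data <- data.frame(x = 1:10, y = rnorm(10))
ncores <- 2

# splitIndices() from parallel divides 1..nrow(data) into ncores
# roughly equal index groups, one per worker
idx <- splitIndices(nrow(data), ncores)
chunks <- lapply(idx, function(i) data[i, , drop = FALSE])
```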
Previously, I sent the whole dataset to every node and subset it on each node, but as my dataset has grown larger I can no longer afford to do so: clusterExport
fails with the following error message:
Error in serialize(data, node$con) : error writing to connection
I have tried many different versions of my function with clusterApply
or clusterCall,
but none of them worked. Apparently, passing the data (partitioned as a list with n
elements) as an argument to the function in the clusterApply
call is not much different from exporting the data to the nodes: R
still serializes a copy of each chunk to the workers, which results in a similar error.
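A minimal version of what I tried (names are illustrative, with a small toy dataset in place of my real one):

```r
library(parallel)

# toy data, partitioned as a list with one element per worker
chunks <- split(1:10, rep(1:2, each = 5))

cluster <- makeCluster(2)
# clusterApply() sends chunks[[i]] to worker i, but R still has to
# serialize each chunk over the node connection, so with very large
# chunks this fails the same way clusterExport() does
result <- clusterApply(cluster, chunks, function(chunk) sum(chunk))
stopCluster(cluster)
```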
I have also tried splitting the data on the master (i.e. creating data_1, ..., data_n) and having each node read its own part, without sending the whole dataset, using:
# str_c() is from the stringr package
data_dist <- function(data) {
  node_data <<- eval(as.name(data))
}
clusterApply(cluster, str_c("data", 1:ncores, sep = "_"), data_dist)
but the data are still not accessible on the nodes, presumably because data_1, ..., data_n exist only in the master's workspace. Any help would be greatly appreciated.