I have two data frames (here for reproducibility), trainFin1 and trainFin2, both sampled from the same larger dataset.

I'm trying to run cross-validated rpart on them using caret, spread across multiple processors with the doSNOW package.

Interestingly, trainFin1 trained fine across 4 processors (finishing in about 25 seconds). But trainFin2 seems to be stuck on a single processor (as observed in Windows Task Manager), and it still had not finished after almost half an hour.
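
Since both frames come from the same dataset, one thing that might be worth checking is whether they differ structurally, for example in column classes or in the number of distinct values per column. Below is a minimal sketch of such a comparison using base R only. The frames toy1 and toy2 are made-up stand-ins, not the real trainFin1 and trainFin2, and describe_columns is a hypothetical helper name.

```r
# Toy stand-ins for trainFin1 and trainFin2 (hypothetical data, not the real samples)
toy1 <- data.frame(
  Happiness = factor(c("yes", "no", "yes", "no")),
  Income    = factor(c("low", "mid", "high", "mid"),
                     levels = c("low", "mid", "high"), ordered = TRUE),
  Age       = c(21, 35, 48, 62)
)
toy2 <- toy1
toy2$Income <- as.character(toy2$Income)  # simulate a column whose type differs

# Hypothetical helper: report the class and number of distinct values per column
describe_columns <- function(df) {
  data.frame(
    column   = names(df),
    class    = vapply(df, function(col) class(col)[1], character(1)),
    n_levels = vapply(df, function(col) length(unique(col)), integer(1)),
    stringsAsFactors = FALSE,
    row.names = NULL
  )
}

summary1 <- describe_columns(toy1)
summary2 <- describe_columns(toy2)

# Columns whose class differs between the two frames
mismatched <- summary1$column[summary1$class != summary2$class]
print(mismatched)
```

Running the same helper on the real frames would show whether any predictor has a different type or far more distinct values in one sample than in the other. This is only a diagnostic and does not by itself explain the single-processor behaviour.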

My code is below:


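library(caret)
library(doSNOW)

# 5-fold cross-validation, repeated 5 times
fitControl <- trainControl(method = "repeatedcv", number = 5, repeats = 5)

# set up parallel processing: 4 socket workers, registered as the foreach backend
cl <- makeCluster(4, type = "SOCK")
registerDoSNOW(cl)

# x/y interface: predictors are passed as-is, without dummy-variable expansion
firstSet <- train(x = trainFin1[, names(trainFin1) != "Happiness"],
                  y = trainFin1$Happiness,
                  method = "rpart2", trControl = fitControl)

secondSet <- train(x = trainFin2[, names(trainFin2) != "Happiness"],
                   y = trainFin2$Happiness,
                   method = "rpart2", trControl = fitControl)

# release the workers when done
stopCluster(cl)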


Note that I avoided the formula interface in train and passed the raw data instead, so that caret does not convert my ordinal variables into dummy variables (see answer to this question). When I use the formula interface (i.e. train(Happiness ~ ., data = trainFin2, method = "rpart2", trControl = fitControl)), parallel processing seems to work without any issue. But I want to avoid the formula interface, as per the other question.
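
To illustrate the conversion mentioned above: as far as I understand, the formula interface builds its predictor matrix through model.matrix, which expands factor columns into contrast columns. A minimal sketch on toy data, using base R only (the column names and values are made up for illustration):

```r
# Toy data with one ordered predictor (hypothetical, for illustration only)
toy <- data.frame(
  Income = factor(c("low", "mid", "high", "mid"),
                  levels = c("low", "mid", "high"), ordered = TRUE),
  Age    = c(21, 35, 48, 62)
)

# A formula-style expansion replaces the ordered factor with contrast columns
expanded <- model.matrix(~ ., data = toy)
print(colnames(expanded))

# Passing the data frame directly keeps Income as a single ordered factor
print(class(toy$Income))
```

With the expansion, the tree no longer sees Income as one ordered variable, which is the behaviour I am trying to avoid by using the x/y interface.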

Any suggestions on how I can train on this data in parallel without converting the predictors to dummy variables?


