How to debug errors like "dim(X) must have a positive length" with caret


I'm running a predict over a fit similar to what is found in the caret guide:

Caret Measuring Performance

predictions <-  predict(caretfit, testing, type = "prob")

But I get the error:

Error in apply(x, 1, paste, collapse = ",") : 
dim(X) must have a positive length

I would like to know 1) the general way to diagnose errors like this, which result from bad inputs to functions, or 2) why my code is failing.

1) Looking at the error, it has something to do with 'X'. Which argument is X? Obviously the first one in 'apply', but which argument in predict is eventually passed to apply? Looking at traceback():

10: stop("dim(X) must have a positive length")
9: apply(x, 1, paste, collapse = ",")
8: paste(apply(x, 1, paste, collapse = ","), collapse = "\n")
7: makeDataFile(x = newdata, y = NULL)
6: predict.C5.0(modelFit, newdata, type = "prob")
5: predict(modelFit, newdata, type = "prob") at C5.0.R#59
4: method$prob(modelFit = modelFit, newdata = newdata, submodels = param)
3: probFunction(method = object$modelInfo, modelFit = object$finalModel, 
   newdata = newdata, preProc = object$preProcess)
2: predict.train(caretfit, testing, type = "prob")
1: predict(caretfit, testing, type = "prob")

Now, this problem would be easy to solve if I could follow the code through and understand the problem as opposed to these general errors. I can trace the code using this traceback to the code at C5.0.R#59. (It looks like there's no way to get line numbers on every trace?) I can follow this code as far as this line 59 and then (I think) the predict function on line 44:

Github Caret C5.0 source

But after this I'm not sure where the logic flows. I don't see 'makeDataFile' anywhere in the caret source or, if it's in another package, how it got there. I've also tried RStudio debugging, debug() and browser(). None provides the stack trace I would expect from other languages. Any suggestions on how to follow the code when you don't know what an error message means?

2) As for my particular inputs, 'caretfit' is simply the result of a caret fit and the testing data is 3 million rows by 59 columns:

fitcontrol <- trainControl(method = "repeatedcv",
                           number = 10,
                           repeats = 1,
                           classProbs = TRUE,
                           summaryFunction = custom.summary,
                           allowParallel = TRUE)

fml <- as.formula(paste("OUTVAR ~", paste(colnames(training[, 1:(ncol(training) - 2)]), collapse = "+")))
caretfit <- train(fml,
                  data = training[1:200000, ],
                  method = "C5.0",
                  trControl = fitcontrol,
                  verbose = FALSE,
                  na.action = na.pass)

There are 1 best solutions below

On BEST ANSWER

1 Debugging Procedure

You can pinpoint the problem using a couple of functions.

Although there still doesn't seem to be any way to get a full stack trace with line numbers (boo!), you can take the function names that the traceback does give you and look each one up with getAnywhere(). For example, you can do:

getAnywhere(makeDataFile)

to see the location and source. (This also works well on Windows, where the libraries are often bundled up as binaries.) You then have to use the package source or GitHub to find the specific line numbers or to trace through the logic of the code.
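As a minimal sketch of this lookup that runs without caret or C50 installed, the same idea can be tried on a method that ships with base R. Here predict.lm stands in for predict.C5.0; the choice of function is only an illustration and is not part of the original problem.

```r
# Look up a function by name across all loaded namespaces,
# including objects that are not exported.
hit <- utils::getAnywhere("predict.lm")
print(hit$where)                 # every place the name was found

# Fetch the S3 method that predict() would dispatch to for class "lm".
method <- utils::getS3method("predict", "lm")
ns <- environmentName(environment(method))   # namespace that defines it
print(ns)

# ':::' reaches into a namespace directly, whether or not the
# object is exported.
direct <- stats:::predict.lm
print(is.function(direct))
```

Printing `hit` itself (rather than `hit$where`) also shows the function source, which is what makes it possible to match a traceback entry to a file on GitHub.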

In my particular problem if I run:

newdata <- testing
caseString <- C50:::makeDataFile(x = newdata, y = NULL)

(Note the three colons.) I can see that this step completes when called directly at this level, so it appears that something is happening to my testing dataset along the way.

So by using getAnywhere() and GitHub over and over through my traceback, I can find the line numbers manually (boo!):

  1. in caret/R/predict.train.R, predict.train (defined on line 108) calls probFunction on line 153
  2. in caret/R/probFunction, probFunction (defined on line 3) calls method$prob, a function stored in the fit object as caretfit$modelInfo$prob, which can be inspected by entering that expression into the console. This is the same function found in caret/models/files/C5.0.R on line 58, which calls 'predict' on line 59
  3. predict is an S3 generic, so it dispatches on the class of the model to predict.C5.0 in C50/R/predict.C5.0.R, which you can find by searching with getAnywhere()
  4. this function runs makeDataFile on line 25 (part of the C50 package)
  5. which calls paste, which calls apply, which dies with stop
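The chain above can be reproduced in miniature with base R alone. This is a sketch under stated assumptions: toy_predict and toy_make_data_file are hypothetical stand-ins for the caret and C50 functions, and the input is a bare vector chosen because it has no dim attribute. It captures the call stack at the moment of the error with sys.calls(), which is the same information traceback() prints afterwards.

```r
# Hypothetical stand-ins for the nested calls seen in the traceback.
toy_make_data_file <- function(x) {
  paste(apply(x, 1, paste, collapse = ","), collapse = "\n")
}
toy_predict <- function(newdata) toy_make_data_file(x = newdata)

stack <- NULL
msg <- tryCatch(
  withCallingHandlers(
    toy_predict(c(1, 2, 3)),   # a bare vector: dim() is NULL
    # The calling handler runs before the stack unwinds,
    # so sys.calls() still sees every frame.
    error = function(e) stack <<- sys.calls()
  ),
  error = function(e) conditionMessage(e)
)

print(msg)
# Deparse each captured call into one line of text for inspection.
frames <- vapply(stack, function(cl) paste(deparse(cl), collapse = " "),
                 character(1))
print(frames)
```

Reading `frames` from the bottom up shows which outer argument ended up as the first argument of apply, which is the question the traceback is meant to answer.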

2 Particular Problem with caret's predict

As for my problem, I kept inspecting the code and feeding inputs in at different levels, and each call would complete successfully. What happens is that some modification is made to my dataset in predict.train.R which causes it to fail. It turns out that I wasn't passing the 'na.action' argument to predict, which I had set to 'na.pass' when training my tree-based model. If I include this argument:

prediction <- predict(caretfit, testing, type = "prob", na.action = na.pass)

it works as expected. Line 126 of predict.train makes use of this argument to decide whether to include incomplete cases in the prediction. My data has no complete cases, so it failed, complaining that it needed a matrix of some positive length.
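A quick way to check for this situation before calling predict is to count the complete cases. The sketch below uses a small made-up data frame (testing_toy is hypothetical) and assumes that the default handling drops incomplete rows the way stats::na.omit does; it does not call caret itself.

```r
# Toy data in which every row has at least one NA.
testing_toy <- data.frame(a = c(1, NA, 3),
                          b = c(NA, 2, NA))

n_complete <- sum(complete.cases(testing_toy))   # rows with no NA at all
n_omit <- nrow(na.omit(testing_toy))             # rows left after dropping NAs
n_pass <- nrow(na.pass(testing_toy))             # rows left when NAs are kept

print(c(complete = n_complete, na.omit = n_omit, na.pass = n_pass))
```

If the complete-case count is zero or far below the row count, any step that silently omits NAs will hand later functions much less data than expected.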

Now, how one would know that this apply error is due to a missing na.action argument is not obvious at all, hence the need for a good debugging procedure. If anyone knows of other ways to debug (keeping in mind that on Windows, stepping through library source in RStudio doesn't work very well), please answer or comment.