Cross-validation errors with GBM and decision tree models in R


I have been trying for several days to calculate the LogLoss and AUC metrics for my GBM and decision tree models, but I keep running into one error after another. For GBM I have tried two scripts. The first one is:

library(gbm)
install.packages("xfun", type = "binary")


gbmWithCrossValidation = gbm(Class ~ .,
                             distribution = "bernoulli",
                             data = datos.test,
                             n.trees = 500,
                             shrinkage = 0.01,
                             n.minobsinnode = 100, 
                             cv.folds = 5)

But when it reaches fold 5, R aborts and closes.
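
For reference, a minimal sketch of the same built-in cross-validation call on synthetic data. It assumes the crash may be related to how the response is coded or to the parallel CV workers; neither is confirmed by anything shown here. The data frame `toy`, the 0/1 column `y`, and the parameter values are hypothetical. gbm's "bernoulli" distribution is documented for 0-1 outcomes, so the factor is converted to a numeric column first, and `n.cores = 1` limits the CV loop to a single core.

```r
library(gbm)

set.seed(123)
n <- 400
# Hypothetical stand-in for the real data
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$Class <- factor(ifelse(toy$x1 + rnorm(n) > 0, "good", "bad"))

# Explicit numeric 0/1 response instead of a factor
toy$y <- as.integer(toy$Class == "good")

fit <- gbm(y ~ x1 + x2,
           distribution = "bernoulli",
           data = toy,
           n.trees = 200,
           shrinkage = 0.05,
           n.minobsinnode = 10,
           cv.folds = 5,
           n.cores = 1)   # assumption: one core, to rule out parallel issues

# Number of trees that minimises the cross-validated deviance
best_iter <- gbm.perf(fit, method = "cv", plot.it = FALSE)
```

If this runs cleanly, swapping in the real data one change at a time (response coding first, then `n.minobsinnode`, then `n.cores`) may help narrow down which setting triggers the abort.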

Because of that crash, I am now working with this script, which builds the folds with vtreat:

library(vtreat)
library(gbm)
set.seed(123)
k <-5
splitPlan.GBM <- kWayCrossValidation(nRows = dim(Credito_mod1)[1], k)
splitPlan.GBM
splitPlan.GBM[[1]]


testLogLossCV5.GBM=NULL
testAUCCV5.GBM=NULL

for(i in 1:k) {
  split.GBM <- splitPlan.GBM[[i]]
  modelo_gbm <- gbm(Class ~ ., distribution = "bernoulli",data=Credito_mod1[split.GBM$train,],n.trees=500)
  yprob.GBM <- predict(modelo_gbm, newdata = Credito_mod1[split.GBM$app,], n.trees = 500, type = "response")
  ypred.GBM <- factor(as.numeric(yprob.GBM >= 0.5 ), labels = levels(Credito_mod1$Class))
  testLogLossCV5.GBM[i]<-MLmetrics::LogLoss(yprob.GBM,as.numeric(Credito_mod1[split.GBM$app,]$Class)-1)
  testAUCCV5.GBM[i]<-MLmetrics::AUC(yprob.GBM,as.numeric(Credito_mod1[split.GBM$app,]$Class)-1)
  
}

gbm.iter = gbm.perf(modelo_gbm, method = "test")


testLogLossCV5.GBM
mean(testLogLossCV5.GBM)

#AUC
testAUCCV5.GBM
mean(testAUCCV5.GBM)

However, when I look at the results, LogLoss contains only NaN values and the AUC is much lower than when I fit the model without CV.

> testLogLossCV5.GBM
[1] NaN NaN NaN NaN NaN
> mean(testLogLossCV5.GBM)
[1] NaN
> testAUCCV5.GBM
[1] 0.5234419 0.5378472 0.5419651 0.4497702 0.5530294
> mean(testAUCCV5.GBM)
[1] 0.5212108
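
A minimal base-R sketch for checking where the NaN values might come from. It assumes `Class` is a two-level factor and that some predicted probabilities may themselves be NaN; both are guesses, not something the output above proves. The helper names `log_loss` and `auc_rank` and the toy vectors are hypothetical.

```r
# Binary log loss with clipping so log(0) never occurs
log_loss <- function(p, y, eps = 1e-15) {
  p <- pmin(pmax(p, eps), 1 - eps)
  -mean(y * log(p) + (1 - y) * log(1 - p))
}

# AUC via the rank-sum (Mann-Whitney) identity
auc_rank <- function(p, y) {
  r <- rank(p)
  n_pos <- sum(y == 1)
  n_neg <- sum(y == 0)
  (sum(r[y == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Hypothetical labels stored as a factor
Class <- factor(c("bad", "good", "good", "bad", "good", "bad"))
y <- as.integer(Class == "good")   # explicit 0/1 coding
p <- c(0.2, 0.9, 0.6, 0.4, 0.7, 0.1)

ll  <- log_loss(p, y)
auc <- auc_rank(p, y)

# A single NaN prediction is enough to make the whole fold's log loss missing
p_bad <- p
p_bad[1] <- NaN
ll_bad <- log_loss(p_bad, y)
```

Inside the fold loop, checks such as `anyNA(yprob.GBM)` and `class(Credito_mod1$Class)` would show whether the predictions or the response coding are the source of the missing values.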

For the decision tree I have been using a single script:

library(caret)

set.seed(123)

# Create the control object for cross-validation
ctrl <- trainControl(method = "cv", number = 5)

# Train the decision tree model with cross-validation
model_tree <- train(
  Class ~ .,
  data = rbind(datos.entreno, datos.test),
  method = "rpart",
  trControl = ctrl,
  metric = "LogLoss",  # You can choose the metric appropriate for your problem
  verbose = FALSE
)

print(model_tree)

predicted_probs.tree.test <- predict(model_tree, newdata = datos.test, type = "prob")

logloss_tree <- logLoss(predicted_probs.tree.test, as.numeric(datos.test$Class))
auc_tree <- roc(as.numeric(datos.test$Class), predicted_probs.tree.test[, 2])$auc

But I simply get an error and the operation is terminated.
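
A minimal sketch of one caret setup for cross-validated log loss with `rpart`, on synthetic data. The data frame `toy`, the level names "good" and "bad", and the 70/30 split are hypothetical. It assumes the error comes from the metric configuration: caret's log loss summary is `mnLogLoss`, which needs `classProbs = TRUE`, and the metric name is then `"logLoss"`. The `verbose` argument is left out here, and `pROC` is called explicitly for the AUC.

```r
library(caret)

set.seed(123)
n <- 300
# Hypothetical stand-in for the real data; level names must be valid R names
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$Class <- factor(ifelse(toy$x1 + 0.5 * toy$x2 + rnorm(n) > 0, "good", "bad"))

idx      <- createDataPartition(toy$Class, p = 0.7, list = FALSE)
train_df <- toy[idx, ]
test_df  <- toy[-idx, ]

ctrl <- trainControl(method = "cv", number = 5,
                     classProbs = TRUE,            # needed for probability metrics
                     summaryFunction = mnLogLoss)  # reports "logLoss"

model_tree <- train(Class ~ ., data = train_df,
                    method = "rpart",
                    trControl = ctrl,
                    metric = "logLoss",
                    maximize = FALSE)              # lower log loss is better

# Held-out class probabilities, one column per level
probs <- predict(model_tree, newdata = test_df, type = "prob")

# Log loss on the held-out set, computed by hand with clipping
y <- as.integer(test_df$Class == "good")
p <- pmin(pmax(probs[, "good"], 1e-15), 1 - 1e-15)
logloss_tree <- -mean(y * log(p) + (1 - y) * log(1 - p))

# AUC on the held-out set, with "good" as the positive class
roc_tree <- pROC::roc(test_df$Class, probs[, "good"],
                      levels = c("bad", "good"), direction = "<", quiet = TRUE)
auc_tree <- as.numeric(pROC::auc(roc_tree))
```

Unlike the script above, this sketch fits the model on the training rows only, so the test-set metrics are not computed on data the tree has already seen.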

I want to find the LogLoss and AUC metrics so that I can compare the models. Processing time is not a problem for me, since I have a powerful machine, but I am not very familiar with the R language.
