I have been trying for several days to compute the LogLoss and AUC metrics for my GBM and decision tree models, but I keep running into one error after another. For GBM I have tried two scripts. The first one is:
library(gbm)
install.packages("xfun", type = "binary")
gbmWithCrossValidation = gbm(Class ~ .,
                             distribution = "bernoulli",
                             data = datos.test,
                             n.trees = 500,
                             shrinkage = 0.01,
                             n.minobsinnode = 100,
                             cv.folds = 5)
But when it reaches fold 5, the R session aborts and closes.
That is why I am now working with this second script instead:
library(vtreat)
library(gbm)
set.seed(123)
k <- 5
splitPlan.GBM <- kWayCrossValidation(nRows = nrow(Credito_mod1), k)
splitPlan.GBM
splitPlan.GBM[[1]]
testLogLossCV5.GBM=NULL
testAUCCV5.GBM=NULL
for (i in 1:k) {
  split.GBM <- splitPlan.GBM[[i]]
  # Fit on the training rows of this fold
  modelo_gbm <- gbm(Class ~ ., distribution = "bernoulli",
                    data = Credito_mod1[split.GBM$train, ], n.trees = 500)
  # Predicted probabilities for the held-out rows of this fold
  yprob.GBM <- predict(modelo_gbm, newdata = Credito_mod1[split.GBM$app, ],
                       n.trees = 500, type = "response")
  # Hard class labels at a 0.5 threshold (uses yprob.GBM, not an undefined yprob)
  ypred.GBM <- factor(as.numeric(yprob.GBM >= 0.5), labels = levels(Credito_mod1$Class))
  testLogLossCV5.GBM[i] <- MLmetrics::LogLoss(yprob.GBM, as.numeric(Credito_mod1[split.GBM$app, ]$Class) - 1)
  testAUCCV5.GBM[i] <- MLmetrics::AUC(yprob.GBM, as.numeric(Credito_mod1[split.GBM$app, ]$Class) - 1)
}
gbm.iter = gbm.perf(modelo_gbm, method = "test")
testLogLossCV5.GBM
mean(testLogLossCV5.GBM)
#AUC
testAUCCV5.GBM
mean(testAUCCV5.GBM)
However, when I look at the results, LogLoss contains only NaN values and the AUC is much lower than when I fit the model without CV:
> testLogLossCV5.GBM
[1] NaN NaN NaN NaN NaN
> mean(testLogLossCV5.GBM)
[1] NaN
> testAUCCV5.GBM
[1] 0.5234419 0.5378472 0.5419651 0.4497702 0.5530294
> mean(testAUCCV5.GBM)
[1] 0.5212108
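One thing that may be worth checking here (a guess, since the structure of `Credito_mod1` is not shown) is whether `Class` is a factor: `gbm` with `distribution = "bernoulli"` expects a numeric 0/1 response, and NaN predictions would propagate into a NaN LogLoss and a near-chance AUC. The sketch below is not the original data or a definitive fix. It uses a synthetic data frame `toy`, a numeric response `y`, and two hypothetical base-R helpers, `log_loss` and `auc_rank`, to show one way the five-fold loop could be written, assuming the gbm package is installed.

```r
library(gbm)   # assumption: the gbm package is installed

set.seed(123)
# Synthetic stand-in for Credito_mod1 (hypothetical data, not the original)
n <- 600
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
# Numeric 0/1 response, as expected by distribution = "bernoulli"
toy$y <- as.numeric(toy$x1 - toy$x2 + rnorm(n) > 0)

# Binary log loss; probabilities are clipped so log(0) cannot occur
log_loss <- function(y_true, y_prob, eps = 1e-15) {
  stopifnot(all(y_true %in% c(0, 1)), length(y_true) == length(y_prob))
  p <- pmin(pmax(y_prob, eps), 1 - eps)
  -mean(y_true * log(p) + (1 - y_true) * log(1 - p))
}

# AUC via the rank-sum (Mann-Whitney) identity; ties get average ranks
auc_rank <- function(y_true, y_prob) {
  r <- rank(y_prob)
  n_pos <- sum(y_true == 1)
  n_neg <- sum(y_true == 0)
  (sum(r[y_true == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

k <- 5
fold_id <- sample(rep(seq_len(k), length.out = n))  # random fold label per row
cv_logloss <- numeric(k)
cv_auc <- numeric(k)

for (i in seq_len(k)) {
  train_df <- toy[fold_id != i, ]
  test_df  <- toy[fold_id == i, ]
  fit <- gbm(y ~ ., distribution = "bernoulli", data = train_df,
             n.trees = 500, shrinkage = 0.01, verbose = FALSE)
  prob <- predict(fit, newdata = test_df, n.trees = 500, type = "response")
  stopifnot(!anyNA(prob))   # fail early if predictions are NA/NaN
  cv_logloss[i] <- log_loss(test_df$y, prob)
  cv_auc[i] <- auc_rank(test_df$y, prob)
}

mean(cv_logloss)
mean(cv_auc)
```

On real data with a factor `Class`, the conversion to 0/1 would depend on which level is the positive class, so the `toy$y` line above is only a placeholder for that step.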
For the decision tree I have been using a single script:
library(caret)
set.seed(123)
# Create the control object for cross-validation
ctrl <- trainControl(method = "cv", number = 5)
# Train the decision tree model with cross-validation
model_tree <- train(
  Class ~ .,
  data = rbind(datos.entreno, datos.test),
  method = "rpart",
  trControl = ctrl,
  metric = "LogLoss", # You can choose the appropriate metric for your problem
  verbose = FALSE
)
print(model_tree)
predicted_probs.tree.test <- predict(model_tree, newdata = datos.test, type = "prob")
logloss_tree <- logLoss(predicted_probs.tree.test, as.numeric(datos.test$Class))
auc_tree <- roc(as.numeric(datos.test$Class), predicted_probs.tree.test[, 2])$auc
But I simply get an error and the operation is terminated.
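For comparison, here is a minimal sketch of a caret setup that reports both metrics from the cross-validation itself. It rests on several assumptions: synthetic data `toy` stands in for the real data, the class levels are valid R names (`"no"`, `"yes"`), the caret, rpart and pROC packages are installed, and `both_metrics` is a hypothetical helper that combines caret's `mnLogLoss` and `twoClassSummary`. It sets `classProbs = TRUE`, spells the metric `"logLoss"` as `mnLogLoss` names it, and leaves out `verbose = FALSE`, which as far as I can tell is not an argument that rpart accepts.

```r
library(caret)   # assumption: caret, rpart and pROC are installed

set.seed(123)
# Synthetic stand-in data (hypothetical); Class is a factor with valid R names
n <- 400
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$Class <- factor(ifelse(toy$x1 + 0.5 * toy$x2 + rnorm(n) > 0, "yes", "no"),
                    levels = c("no", "yes"))

# Hypothetical helper: report log loss and ROC AUC for each resample
both_metrics <- function(data, lev = NULL, model = NULL) {
  c(mnLogLoss(data, lev = lev, model = model),
    twoClassSummary(data, lev = lev, model = model))
}

ctrl <- trainControl(method = "cv", number = 5,
                     classProbs = TRUE,            # needed for probability-based metrics
                     summaryFunction = both_metrics)

model_tree <- train(Class ~ ., data = toy,
                    method = "rpart",
                    trControl = ctrl,
                    metric = "logLoss")            # name as returned by mnLogLoss

# Cross-validated logLoss and ROC (AUC) for each candidate cp value
print(model_tree$results[, c("cp", "logLoss", "ROC")])
```

Because the metrics here come from the held-out folds, they should be more comparable to the GBM cross-validation numbers than metrics computed by predicting on rows that were also used for training.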
I want to obtain the LogLoss and AUC metrics so that I can compare the models. Processing time is not a problem for me, since I have a powerful machine, but I am not very familiar with the R language.