As a learning exercise, I am trying to manually write the code (in R) for "stacking" (ensembling) different machine learning models; the goal is binary classification. Using the popular "Sonar" dataset from the mlbench package, I first fit a random forest and an AdaBoost model on some training data. I then take the predicted probabilities from both of these models and feed them to xgboost for the final prediction. For some reason, this results in a model with 0 training error, which cannot be right.
Can someone please tell me what I am doing wrong and how I can fix this problem? I have attached my code below.
library(mlbench)
library(randomForest)
library(ada)
library(xgboost)
library(caret)
data(Sonar)
index = createDataPartition(y=Sonar$Class, p=0.75, list=FALSE)
train_set = Sonar[index,]
test_set = Sonar[-index,]
########Fit Random Forest
model_rf = randomForest(Class ~ ., train_set, mtry = 12, ntree = 500)
model_rf
####### Fit ada model
model_ada = ada(train_set[,-61],train_set$Class, nu=0.01, iter = 100, type="discrete")
model_ada
######### Predict on train data
pred_train_rf = predict(model_rf,train_set[,-61], type="prob")
pred_train_ada = predict(model_ada,train_set[,-61], type="prob")
######### Append predicted probabilities (for class "M") to the training set
train_set$pred_rf = pred_train_rf[,1]
train_set$pred_ada = pred_train_ada[,1]
############# Fit xgboost model on the predicted probabilities of earlier two models
data_matrix <- as.matrix(train_set[, c("pred_rf", "pred_ada")])
output_vector = ifelse(train_set$Class == "M", 1, 0)
model_xgboost <- xgboost(data = data_matrix, label = output_vector, max.depth = 2,
eta = 1, nthread = 2, nrounds = 10,objective = "binary:logistic")
#########################################
[1] train-error:0.000000
[2] train-error:0.000000
[3] train-error:0.000000
[4] train-error:0.000000
[5] train-error:0.000000
[6] train-error:0.000000
[7] train-error:0.000000
[8] train-error:0.000000
[9] train-error:0.000000
[10] train-error:0.000000
Thanks.
The behaviour here is not unexpected: a random forest is almost always (close to) perfectly fitted to the data it was trained on.
You can see this by working out the accuracy of the RF model on the training data;
almost every time you run your code it comes out at 100%. So, if I am correct, you are more or less using Class to predict Class in your xgboost code.
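For example, a quick check (run this right after fitting model_rf, before the prediction columns are appended to train_set):

######### Check RF accuracy on its own training data
pred_train_class = predict(model_rf, train_set[,-61])  # class predictions
mean(pred_train_class == train_set$Class)              # usually 1, i.e. 100%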
What you need to do is change the workflow a bit: instead of building data_matrix from the training data, use the test set. That should do the job. You'll also need to write a loop and repeat this a large number of times to get a more realistic idea of predictive performance, because with such a small dataset any single accuracy measure is noisy.
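Here is a minimal sketch of what I mean, reusing your libraries and object names (the number of repetitions n_reps and the 0.5 probability cutoff are my own choices):

######### Repeated train/test splits: stack on test-set probabilities
set.seed(1)
n_reps = 20
acc = numeric(n_reps)
for (i in 1:n_reps) {
  index = createDataPartition(y = Sonar$Class, p = 0.75, list = FALSE)
  train_set = Sonar[index,]
  test_set = Sonar[-index,]
  # Fit the base learners on the training data only
  model_rf = randomForest(Class ~ ., train_set, mtry = 12, ntree = 500)
  model_ada = ada(train_set[,-61], train_set$Class, nu = 0.01, iter = 100, type = "discrete")
  # Base-learner probabilities for class "M" on the *test* data
  data_matrix = cbind(pred_rf = predict(model_rf, test_set[,-61], type = "prob")[,1],
                      pred_ada = predict(model_ada, test_set[,-61], type = "prob")[,1])
  output_vector = ifelse(test_set$Class == "M", 1, 0)
  # Meta-learner on the test-set probabilities
  model_xgboost = xgboost(data = data_matrix, label = output_vector, max.depth = 2,
                          eta = 1, nthread = 2, nrounds = 10,
                          objective = "binary:logistic", verbose = 0)
  pred = predict(model_xgboost, data_matrix)
  acc[i] = mean(as.numeric(pred > 0.5) == output_vector)
}
mean(acc)  # average accuracy over the repeated splits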