I am trying to train two GBM models: the first takes the claim frequency as the response variable, and the second takes the number of claims as the response with exposure as an offset column. However, after hyperparameter tuning I see no difference between the two best models: I get the same RMSE.
DF=data[-extreme_ind, ]
DF[,c(4:60)]<- lapply(DF[,c(4:60)], factor)
df=as.h2o(DF)
splits <- h2o.splitFrame(df, 0.8, seed=1234)
train <- h2o.assign(splits[[1]], "train.hex")
valid <- h2o.assign(splits[[2]], "valid.hex")
MOD_1_v2 <- h2o.gbm(x=c(4:56, 58:60),y = 61, training_frame = train, validation_frame =valid, ntrees=200) #100
summary(MOD_1_v2)
plot(MOD_1_v2,timestep="number_of_trees",metric="RMSE")
gbm1_parameters <- list(learn_rate = c(0.01, 0.05, 0.1),
                        max_depth = c(3, 5, 6),
                        sample_rate = c(0.7, 0.75, 0.8),
                        col_sample_rate = c(0.2, 0.5, 1.0))
gbm1_grid <- h2o.grid("gbm", x = c(4:56, 58:60), y = 61,
                      grid_id = "gbm_grid",
                      training_frame = train,
                      validation_frame = valid,
                      ntrees = 20, #30
                      seed = 1,
                      hyper_params = gbm1_parameters)
gbm1_gridp <- h2o.getGrid(grid_id = "gbm_grid",
                          sort_by = "rmse",
                          decreasing = FALSE)
print(gbm1_gridp)
best_MOD_1=h2o.getModel(gbm1_gridp@model_ids[[1]])
summary(best_MOD_1)
best_gbm_perf1 <- h2o.performance(model = best_MOD_1,newdata = valid)
best_gbm_perf1
plot(best_MOD_1,timestep="number_of_trees",metric="rmse")
h2o.varimp_plot(best_MOD_1)
MOD_2_v2 <- h2o.gbm(x=c(4:56, 58:60),y = 2,offset_column="APVI", training_frame = train, validation_frame = valid,ntrees=55)
summary(MOD_2_v2) # after removing outliers
plot(MOD_2_v2,timestep="number_of_trees",metric="RMSE")
gbm2_parameters <- list(learn_rate = c(0.01, 0.05, 0.1),
                        max_depth = c(3, 5),
                        sample_rate = c(0.7, 0.75, 0.8),
                        col_sample_rate = c(0.2, 0.5, 1.0))
gbm2_grid <- h2o.grid("gbm", x = c(4:56, 58:60), y = 2,
                      grid_id = "gbm_grid",
                      training_frame = train,
                      validation_frame = valid,
                      ntrees = 55, #10
                      seed = 123,
                      hyper_params = gbm2_parameters)
gbm2_gridp <- h2o.getGrid(grid_id = "gbm_grid",
                          sort_by = "rmse",
                          decreasing = FALSE)
print(gbm2_gridp)
best_MOD_2=h2o.getModel(gbm2_gridp@model_ids[[1]])
summary(best_MOD_2)
best_gbm_perf2 <- h2o.performance(model = best_MOD_2,newdata = valid)
best_gbm_perf2
How can I fix this problem?
Could you also share the printed output, please?
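My first thought is that both grids use the same grid_id = "gbm_grid". Please try giving the second grid a different ID: when an ID is reused, the second h2o.getGrid call may also return models from the first grid, so the "best" model can end up being the same one.
Also, the only difference between your two h2o.grid calls is the response column (y = 61 in the first grid, y = 2 in the second). I don't see an offset_column set in the second grid, although you do set it in the standalone MOD_2_v2 model.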
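To make both suggestions concrete, here is a minimal sketch on made-up data. The data frame `toy` and its columns (`x1`, `x2`, `exposure`, `claims`, `freq`), the grid IDs, and the small hyperparameter lists are all hypothetical, not taken from your data. It assumes a local H2O cluster can be started with h2o.init(), and it passes the raw exposure as the offset only to mirror your offset_column = "APVI" call.

```r
library(h2o)
h2o.init()

# Hypothetical toy data standing in for the real claims table.
set.seed(42)
n <- 500
toy <- data.frame(
  x1 = rnorm(n),
  x2 = factor(sample(c("a", "b", "c"), n, replace = TRUE)),
  exposure = runif(n, 0.1, 1)
)
toy$claims <- rpois(n, lambda = toy$exposure * exp(0.3 * toy$x1))
toy$freq <- toy$claims / toy$exposure

hf <- as.h2o(toy)
splits <- h2o.splitFrame(hf, 0.8, seed = 1234)
train <- splits[[1]]
valid <- splits[[2]]

# Small search space so the sketch runs quickly.
params <- list(learn_rate = c(0.05, 0.1),
               max_depth = c(3, 5))

# Grid 1: frequency as the response, with its own grid_id.
h2o.grid("gbm", x = c("x1", "x2"), y = "freq",
         grid_id = "gbm_grid_freq",
         training_frame = train,
         validation_frame = valid,
         ntrees = 20,
         seed = 1,
         hyper_params = params)

# Grid 2: claim count as the response, a different grid_id,
# and the offset column passed to the grid itself.
h2o.grid("gbm", x = c("x1", "x2"), y = "claims",
         offset_column = "exposure",
         grid_id = "gbm_grid_claims",
         training_frame = train,
         validation_frame = valid,
         ntrees = 20,
         seed = 1,
         hyper_params = params)

# Fetch each grid by its own ID, sorted by validation RMSE.
grid_freq <- h2o.getGrid(grid_id = "gbm_grid_freq",
                         sort_by = "rmse", decreasing = FALSE)
grid_claims <- h2o.getGrid(grid_id = "gbm_grid_claims",
                           sort_by = "rmse", decreasing = FALSE)

best_freq <- h2o.getModel(grid_freq@model_ids[[1]])
best_claims <- h2o.getModel(grid_claims@model_ids[[1]])
```

With separate IDs, each h2o.getGrid call should only see the models trained for that response, so best_freq and best_claims are drawn from different sets of models.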
I will also try this suggestion with my generic data to see if this is the issue.
Thanks!
Edit: I tried your code with different data and got models with different RMSEs. So please check that your data makes sense.