Recursive Feature Elimination with H2O Random Forest


I am using the h2o package in Python to build a fairly complex model. It has around 1500 features, but I know that most of them are not really important, and I would like to extract the subset of a given size (let's say 100) that maximizes the R squared of my model.
Is there a method that already implements this for h2o in Python?

Otherwise I would need to code it myself, but that also means running the model multiple times, and I am not sure I would code it correctly.

One possible way to code it is this one:

  • Save the R2 for the model, then remove the k least important features
  • Create a second model without the removed features
  • Calculate the R2 for the new model and compare to the previous R2. Use a metric to decide whether to keep the new model or stick with the old.
  • Iterate these steps until the previous step chooses the old model as the best one

I am pretty sure this will not give me the 'best subset' of features, but I really hope it would be sufficient.
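The steps above can be sketched in plain Python, independent of any particular library. This is only a minimal sketch under stated assumptions: `backward_eliminate`, the `fit` callback, `min_size` and `tol` are hypothetical names, not part of h2o. The callback is assumed to train a model on the given features and return its R2 together with a feature-to-importance mapping; with h2o it could wrap an `H2ORandomForestEstimator` and read `model.r2()` and `model.varimp()`. The "metric to decide" is assumed here to be a simple tolerance on the R2 drop.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical callback: trains a model on `features` and returns
# (R2, {feature: importance}). Not an h2o API; you would write it yourself.
FitFn = Callable[[List[str]], Tuple[float, Dict[str, float]]]


def backward_eliminate(
    features: List[str],
    fit: FitFn,
    k: int = 50,
    min_size: int = 1,
    tol: float = 0.01,
) -> Tuple[List[str], float, List[Tuple[int, float]]]:
    """Drop the k least important features per round until R2 falls by more than tol."""
    if k < 1:
        raise ValueError("k must be at least 1")
    current = list(features)
    r2, importance = fit(current)
    history = [(len(current), r2)]  # (number of features, R2) of each accepted model
    while len(current) > min_size:
        drop = min(k, len(current) - min_size)
        # Rank by importance, least important first (features missing from the map count as 0).
        ranked = sorted(current, key=lambda f: importance.get(f, 0.0))
        to_drop = set(ranked[:drop])
        candidate = [f for f in current if f not in to_drop]
        new_r2, new_importance = fit(candidate)
        # Assumed decision rule: keep the old model if R2 dropped by more than tol.
        if r2 - new_r2 > tol:
            break
        current, r2, importance = candidate, new_r2, new_importance
        history.append((len(current), r2))
    return current, r2, history
```

Setting `min_size=100` would stop the loop once 100 features remain, which matches the "subset of a given size" goal. Like any greedy procedure, it is not guaranteed to find the best subset.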

The second method I thought of is the following:

  • Set the number of features N you want in the new model and the number of iterations K
  • Save the original model R2 as reference
  • Extract N features at random from the original model, using their relative importance as the probability of being extracted (more important features are more likely to be extracted), and train a new model on them
  • For each model save the list of features and the new R2
  • After iterating K times stop the algorithm and compare the R2
  • Choose the set of features with the closest R2 to the original one
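This second method could look roughly like the sketch below. The names `weighted_sample`, `random_subset_search` and the `score` callback are hypothetical, not h2o functions; `score` is assumed to train a model on the given features and return its R2. Sampling without replacement in proportion to importance is done here with the Efraimidis-Spirakis key trick (each feature gets the key u^(1/w) and the N largest keys win), which is my choice of technique rather than something from the question.

```python
import heapq
import random
from typing import Callable, Dict, List, Optional, Tuple


def weighted_sample(importance: Dict[str, float], n: int, rng: random.Random) -> List[str]:
    """Draw n distinct features, favouring those with larger importance."""
    keyed = []
    for feature, weight in importance.items():
        # Efraimidis-Spirakis key; zero-importance features are picked only as a last resort.
        key = rng.random() ** (1.0 / weight) if weight > 0 else -1.0
        keyed.append((key, feature))
    return [feature for _, feature in heapq.nlargest(n, keyed)]


def random_subset_search(
    importance: Dict[str, float],
    score: Callable[[List[str]], float],  # hypothetical: trains a model, returns its R2
    n: int,
    iterations: int,
    reference_r2: float,
    seed: Optional[int] = None,
) -> Tuple[Tuple[List[str], float], List[Tuple[List[str], float]]]:
    """Try `iterations` random subsets of size n and return the one closest to reference_r2."""
    rng = random.Random(seed)
    trials = []
    for _ in range(iterations):
        subset = weighted_sample(importance, n, rng)
        trials.append((subset, score(subset)))
    # Pick the subset whose R2 is closest to the original model's R2.
    best = min(trials, key=lambda trial: abs(reference_r2 - trial[1]))
    return best, trials
```

Each iteration trains one model, so K directly sets the cost. Fixing `seed` makes a run reproducible.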

There is 1 solution below

Wendy On

Actually, we have a toolbox just for this using GLM. You are looking for the best K predictors to use to build your model. For each subset size, the best model is the one with the highest R2. ModelSelection will build GLM models, selecting the best 1-predictor model, the best 2-predictor model, ..., up to the best K-predictor model. Please check out our ModelSelection toolbox: https://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/model_selection.html?highlight=modelselection

Sample code looks like this:

import h2o
from h2o.estimators.model_selection import H2OModelSelectionEstimator

h2o.init()  # start or connect to a local H2O cluster

train = h2o.import_file("import your dataset")
response = "response"  # set your response column
predictors = train.names
predictors.remove(response)

# maxrsweep builds the best 1-predictor, 2-predictor, ..., 100-predictor GLM by R2
maxrsweep_model = H2OModelSelectionEstimator(mode="maxrsweep", max_predictor_number=100)
maxrsweep_model.train(x=predictors, y=response, training_frame=train)

# to get the best predictor subset
best_100_predictors = maxrsweep_model.coef(predictor_size=100)
print(best_100_predictors)  # print predictor names and GLM coefficients

I hope this helps.