I am using the h2o package in Python to build a fairly complex model.
It has around 1500 features, but I know that most of them are not really important, and I would like to extract a subset of a given size (let's say 100) that maximizes the R squared of my model.
Is there some method that is already implementing this for h2o in python?
Otherwise I would need to code it myself, but that also implies running the model multiple times, and I am not sure I would code it in the correct way.
One possible way to code it is the following:
- Save the R2 for the model, then remove the k least important features
- Create a second model without the removed features
- Calculate the R2 for the new model and compare to the previous R2. Use a metric to decide whether to keep the new model or stick with the old.
- Iterate these steps until the comparison in the previous step chooses the old model as the best one

I am pretty sure this will not give me the 'best subset' of features, but I really hope it would be sufficient (a rough sketch of this loop is shown below).
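To make this concrete, here is a rough, untested sketch of how that loop could look with h2o in Python. The GBM estimator, the my_data.csv path, the response column "y", the drop size k, the target of 100 features, and the 0.01 R2 tolerance are all placeholders for whatever the actual setup would use:

```python
import h2o
from h2o.estimators import H2OGradientBoostingEstimator

h2o.init()
train = h2o.import_file("my_data.csv")       # placeholder data set
response = "y"                               # placeholder response column
features = [c for c in train.columns if c != response]

model = H2OGradientBoostingEstimator(seed=42)
model.train(x=features, y=response, training_frame=train)
best_r2 = model.r2()

k = 50              # features dropped per round (placeholder)
target_size = 100   # desired final number of features
tolerance = 0.01    # maximum acceptable drop in R2 (placeholder metric)

while len(features) > target_size:
    # varimp() returns features sorted from most to least important
    vi = model.varimp(use_pandas=True)
    n_keep = max(target_size, len(features) - k)
    candidate_features = list(vi["variable"].head(n_keep))

    candidate = H2OGradientBoostingEstimator(seed=42)
    candidate.train(x=candidate_features, y=response, training_frame=train)

    if candidate.r2() < best_r2 - tolerance:
        break                                # the old model wins; stop here
    features, model, best_r2 = candidate_features, candidate, candidate.r2()
```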
The second method I thought of is the following:
- Set the number of features N you want in the new model and the number of iterations K
- Save the original model R2 as a reference
- Extract N features at random from the original model, using their relative importance as the probability of being extracted (more important features are more likely to be extracted)
- For each model, save the list of features and the new R2
- After iterating K times, stop the algorithm and compare the R2 values
- Choose the set of features with the R2 closest to the original one (a rough sketch of this procedure is shown below)
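A similarly rough sketch of this second idea, reusing the train, response, and fitted model objects from the previous sketch; the importance-weighted sampling is done with numpy, and N, K, and the seed are placeholders:

```python
import numpy as np
from h2o.estimators import H2OGradientBoostingEstimator

N, K = 100, 20                       # subset size and number of random draws (placeholders)
rng = np.random.default_rng(42)

vi = model.varimp(use_pandas=True)   # importances from the full 1500-feature model
names = vi["variable"].to_numpy()
probs = vi["percentage"].to_numpy()
probs = probs / probs.sum()          # make sure the weights sum to exactly 1
base_r2 = model.r2()

best_subset, best_gap = None, float("inf")
for _ in range(K):
    # draw N distinct features, weighted by importance
    # (zero-importance features can never be drawn with these weights)
    subset = list(rng.choice(names, size=N, replace=False, p=probs))
    m = H2OGradientBoostingEstimator(seed=42)
    m.train(x=subset, y=response, training_frame=train)
    gap = abs(base_r2 - m.r2())
    if gap < best_gap:
        best_subset, best_gap = subset, gap

print(best_subset)
```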
Actually, we have a toolbox just for this using GLM. You are looking for the best K predictors to use to build your model. The best model is chosen as the one with the highest R2. ModelSelection will build a GLM model selecting the best 1-predictor model, the best 2-predictor model, ..., up to the best K-predictor model. Please check out our ModelSelection documentation: https://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/model_selection.html?highlight=modelselection
A sample code will look like this:
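(A minimal sketch rather than a full example: it assumes a training frame loaded from a hypothetical my_data.csv, a numeric response column "y", and the remaining columns as predictors; adjust the mode, max_predictor_number, and seed to your problem.)

```python
import h2o
from h2o.estimators import H2OModelSelectionEstimator

h2o.init()

# hypothetical data set and column names -- replace with your own
train = h2o.import_file("my_data.csv")
response = "y"
predictors = [c for c in train.columns if c != response]

# build the best 1-predictor, 2-predictor, ..., up to 100-predictor GLM models
model = H2OModelSelectionEstimator(max_predictor_number=100,
                                   mode="maxr",
                                   seed=12345)
model.train(x=predictors, y=response, training_frame=train)

# result() returns a frame summarizing the best model found for each subset size
results = model.result()
print(results)
```

Note that mode="allsubsets" is exhaustive and will likely not scale to 1500 features, so a sequential mode such as "maxr" is the more realistic choice here.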
I hope this helps.