I am using the h2o package in Python to build a fairly complex model.
It has around 1500 features, but I know that most of them are not really important, and I would like to extract a subset of a given size (let's say 100) that maximizes the R squared of my model.
Is there some method that is already implementing this for h2o in python?
Otherwise I would need to code it myself, but that also implies running the model multiple times, and I am not sure I would code it in the correct way.
One possible way to code it is the following:
- Save the R2 for the model, then remove the k least important features
- Create a second model without the removed features
- Calculate the R2 for the new model and compare to the previous R2. Use a metric to decide whether to keep the new model or stick with the old.
- Iterate these steps until the comparison in the previous step chooses the old model as the best one

I am pretty sure this will not give me the 'best subset' of features, but I really hope it would be sufficient (a rough sketch of this loop is shown below).
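To make this concrete, here is a rough, untested sketch of how that loop could look with h2o in Python. The GBM estimator, the my_data.csv path, the response column "y", the drop size k, the target of 100 features, and the 0.01 R2 tolerance are all placeholders for whatever the actual setup would use:

```python
import h2o
from h2o.estimators import H2OGradientBoostingEstimator

h2o.init()
train = h2o.import_file("my_data.csv")       # placeholder data set
response = "y"                               # placeholder response column
features = [c for c in train.columns if c != response]

model = H2OGradientBoostingEstimator(seed=42)
model.train(x=features, y=response, training_frame=train)
best_r2 = model.r2()

k = 50              # features dropped per round (placeholder)
target_size = 100   # desired final number of features
tolerance = 0.01    # maximum acceptable drop in R2 (placeholder metric)

while len(features) > target_size:
    # varimp() returns features sorted from most to least important
    vi = model.varimp(use_pandas=True)
    n_keep = max(target_size, len(features) - k)
    candidate_features = list(vi["variable"].head(n_keep))

    candidate = H2OGradientBoostingEstimator(seed=42)
    candidate.train(x=candidate_features, y=response, training_frame=train)

    if candidate.r2() < best_r2 - tolerance:
        break                                # the old model wins; stop here
    features, model, best_r2 = candidate_features, candidate, candidate.r2()
```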
The second method I thought of is the following:
- Set the number of features N you want in the new model and the number of iterations K
- Save the original model R2 as a reference
- Extract N features at random from the original model, using their relative importance as the probability of being extracted (more important features are more likely to be extracted)
- For each model, save the list of features and the new R2
- After iterating K times, stop the algorithm and compare the R2 values
- Choose the set of features with the R2 closest to the original one (a rough sketch of this procedure is shown below)
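A similarly rough sketch of this second idea, reusing the train, response, and fitted model objects from the previous sketch; the importance-weighted sampling is done with numpy, and N, K, and the seed are placeholders:

```python
import numpy as np
from h2o.estimators import H2OGradientBoostingEstimator

N, K = 100, 20                       # subset size and number of random draws (placeholders)
rng = np.random.default_rng(42)

vi = model.varimp(use_pandas=True)   # importances from the full 1500-feature model
names = vi["variable"].to_numpy()
probs = vi["percentage"].to_numpy()
probs = probs / probs.sum()          # make sure the weights sum to exactly 1
base_r2 = model.r2()

best_subset, best_gap = None, float("inf")
for _ in range(K):
    # draw N distinct features, weighted by importance
    # (zero-importance features can never be drawn with these weights)
    subset = list(rng.choice(names, size=N, replace=False, p=probs))
    m = H2OGradientBoostingEstimator(seed=42)
    m.train(x=subset, y=response, training_frame=train)
    gap = abs(base_r2 - m.r2())
    if gap < best_gap:
        best_subset, best_gap = subset, gap

print(best_subset)
```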
Actually, we have a toolbox just for this using GLM. You are looking for the best K predictors to use to build your model. The best model is chosen as the one with the highest R2. ModelSelection will build a GLM model selecting the best 1-predictor model, the best 2-predictor model, ..., up to the best K-predictor model. Please check out our ModelSelection documentation: https://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/model_selection.html?highlight=modelselection
A sample code will look like this:
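(A minimal sketch rather than a full example: it assumes a training frame loaded from a hypothetical my_data.csv, a numeric response column "y", and the remaining columns as predictors; adjust the mode, max_predictor_number, and seed to your problem.)

```python
import h2o
from h2o.estimators import H2OModelSelectionEstimator

h2o.init()

# hypothetical data set and column names -- replace with your own
train = h2o.import_file("my_data.csv")
response = "y"
predictors = [c for c in train.columns if c != response]

# build the best 1-predictor, 2-predictor, ..., up to 100-predictor GLM models
model = H2OModelSelectionEstimator(max_predictor_number=100,
                                   mode="maxr",
                                   seed=12345)
model.train(x=predictors, y=response, training_frame=train)

# result() returns a frame summarizing the best model found for each subset size
results = model.result()
print(results)
```

Note that mode="allsubsets" is exhaustive and will likely not scale to 1500 features, so a sequential mode such as "maxr" is the more realistic choice here.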
I hope this helps.