I need to get subsets of top 1, top 2, top 3, etc. features, and performance of my model on each of these subsets. Something like this:
Number of features | Features | Performance |
---|---|---|
1 | A | 0.7 |
2 | A, D | 0.72 |
3 | A, D, B | 0.75 |
I wanted to use RFE as a possible improvement over simply using feature importances from models.
In sklearn, the `RFECV` object has a `ranking_` attribute, which would let me create the feature subsets. The problem is that every feature inside the subset that RFECV found to be optimal gets rank 1, so the top k features are not ordered by importance among themselves.
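To make the problem concrete, here is a minimal sketch on synthetic data. The dataset, the `LogisticRegression` estimator, and the `average_precision` scorer are assumptions chosen only for illustration; they are not my real setup.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

# Synthetic, imbalanced data (roughly 90% / 10%); purely illustrative.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=4,
    weights=[0.9, 0.1], random_state=0,
)

selector = RFECV(
    LogisticRegression(max_iter=1000),
    step=1,
    cv=3,
    scoring="average_precision",  # assumed metric for imbalanced classes
    min_features_to_select=1,
)
selector.fit(X, y)

# Every feature in the selected subset shares rank 1, so ranking_
# gives no ordering inside that subset.
n_selected = selector.n_features_
n_rank_one = int((selector.ranking_ == 1).sum())
print("selected:", n_selected, "ranking_:", selector.ranking_)
```

The number of rank-1 entries always equals `n_features_`; only the eliminated features receive distinct ranks (2, 3, ...).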
I thought of using a simple `RFE` instead, but it doesn't accept the `scoring` parameter, and the default `accuracy` is not appropriate in my case where classes are very unbalanced.
Is there a way either to provide a scoring metric to sklearn's `RFE`, or to force `RFECV` (which does accept a `scoring` parameter) to keep ranking features below the 'optimal' number of features?
I also considered using SFS, but I have about 500 features and it takes days to run.
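For reference, one possible workaround is sketched below; it is an untested idea, not a confirmed solution. It assumes that running plain `RFE` with `n_features_to_select=1` yields a full ranking (1 to n), and that each top-k subset can then be scored separately with `cross_val_score` and a custom `scoring`. The data, estimator, and metric are again illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic, imbalanced data; purely illustrative.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=4,
    weights=[0.9, 0.1], random_state=0,
)

estimator = LogisticRegression(max_iter=1000)

# Eliminating down to a single feature gives every feature a distinct rank.
rfe = RFE(estimator, n_features_to_select=1, step=1).fit(X, y)
order = np.argsort(rfe.ranking_)  # feature indices, best-ranked first

# Score each nested top-k subset with the metric of choice.
results = []
for k in range(1, X.shape[1] + 1):
    top_k = order[:k]
    score = cross_val_score(
        estimator, X[:, top_k], y, cv=3, scoring="average_precision"
    ).mean()
    results.append((k, top_k.tolist(), score))

for k, feats, score in results:
    print(k, feats, round(score, 3))
```

Note that the ranking here is computed on the full data before cross-validation, so the reported scores are likely optimistic; nesting the `RFE` fit inside each fold would avoid that, at the cost of a possibly different ranking per fold.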