Does the index in RFECV grid_scores_ correspond to the number of selected features?

I would like some clarity on how the positions in estimator_RFECV.grid_scores_ (from RFECV) relate to the number of features.

I have used the following:

from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import RFECV

# X_train and y_train are assumed to be defined already.
estimator_RFECV = ExtraTreesClassifier(random_state=0)
estimator_RFECV = RFECV(estimator_RFECV, min_features_to_select=20, step=1, cv=5, scoring='accuracy', verbose=1, n_jobs=-1)
estimator_RFECV = estimator_RFECV.fit(X_train, y_train)

According to estimator_RFECV.ranking_, 27 features are selected through CV. However, when I look at estimator_RFECV.grid_scores_, the accuracy value at 27 is not the highest. Am I interpreting grid_scores_ incorrectly, or should I not expect 27 to have the highest accuracy?

There is 1 solution below

  1. Here, estimator_RFECV.ranking_ gives you an array of feature rankings, such that ranking_[i] is the ranking position of the i-th feature. Selected (i.e., estimated best) features are assigned rank 1; a feature ranked 2 is considered less important than one ranked 1, and so on.

So estimator_RFECV.ranking_ gives the ranking of the features, that is, their relative importance. The number of selected features is the number of entries equal to 1, which is also available as estimator_RFECV.n_features_.

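  2. However, estimator_RFECV.grid_scores_ gives an array of cross-validated scores whose length depends on min_features_to_select, step, and the total number of features available; the values are computed with the chosen scoring metric. With step=1, grid_scores_[i] is the score obtained with min_features_to_select + i features, so the index is offset from the feature count. In the above case, assuming X_train has 27 features in total, it should contain 8 elements, each representing the accuracy with the top X features, where X runs from 20 to 27. The score for 27 features is therefore at index 7, not index 27.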

And yes, it is always possible for a model with fewer features to have higher accuracy, because some of the features we considered may turn out to be irrelevant.
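To make the index offset concrete, here is a minimal sketch in plain Python. It mimics how the feature count shrinks by step down to min_features_to_select and lists the counts smallest first, which is the order grid_scores_ uses as far as I understand it. The helper names (feature_counts, best_feature_count), the total of 30 features, and the scores are all made up for illustration; they are not part of scikit-learn, and an integer step of at least 1 is assumed.

```python
def feature_counts(n_features, min_features_to_select, step=1):
    """Feature counts that get scored, smallest first (integer step assumed)."""
    if step < 1:
        raise ValueError("step must be an integer >= 1 in this sketch")
    counts = [n_features]
    # Drop `step` features at a time, never going below the minimum.
    while counts[-1] > min_features_to_select:
        counts.append(max(counts[-1] - step, min_features_to_select))
    return counts[::-1]


def best_feature_count(scores, n_features, min_features_to_select, step=1):
    """Return (index of the best score, number of features at that index)."""
    counts = feature_counts(n_features, min_features_to_select, step)
    if len(counts) != len(scores):
        raise ValueError("scores has a different length than expected")
    # max() keeps the first maximum, i.e. the smallest feature count on ties.
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return best_index, counts[best_index]


# Hypothetical data: 30 features in total, min_features_to_select=20, step=1.
scores = [0.80, 0.81, 0.81, 0.82, 0.82, 0.83, 0.84, 0.86, 0.85, 0.85, 0.84]
best_index, best_count = best_feature_count(scores, 30, 20, 1)
print(best_index, best_count)  # 7 27
```

In this made-up case the best score sits at index 7, which corresponds to 27 features, so the entry to compare against n_features_ is grid_scores_[7] rather than grid_scores_[27]. Note also that grid_scores_ was deprecated in scikit-learn 1.0 and removed in 1.2; in newer versions the equivalent values are, as far as I know, in estimator_RFECV.cv_results_["mean_test_score"].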

Also, the RFECV page in the official scikit-learn documentation could be helpful.