How can I use sklearn's RFECV to select the optimal features to pass to a LinearDiscriminantAnalysis(n_components=2) step for dimensionality reduction, before fitting my final estimator, a KNN?
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.feature_selection import RFECV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer
import matplotlib.pyplot as plt

pipeline = make_pipeline(Normalizer(),
                         LinearDiscriminantAnalysis(n_components=2),
                         KNeighborsClassifier(n_neighbors=10))

X = self.dataset
y = self.postures

min_features_to_select = 1  # Minimum number of features to consider
rfecv = RFECV(pipeline, step=1, cv=None, scoring='f1_weighted',
              min_features_to_select=min_features_to_select)
rfecv.fit(X, y)

print(rfecv.support_)
print(rfecv.ranking_)
print("Optimal number of features : %d" % rfecv.n_features_)

# Plot number of features vs. cross-validation scores
plt.figure()
plt.xlabel("Number of features selected")
plt.ylabel("Cross validation score (nb of correct classifications)")
plt.plot(range(min_features_to_select,
               len(rfecv.grid_scores_) + min_features_to_select),
         rfecv.grid_scores_)
plt.show()
I get the following error from this code. If I run it without the LinearDiscriminantAnalysis() step it works, but that step is an important part of my processing.
*** ValueError: when `importance_getter=='auto'`, the underlying estimator Pipeline should have `coef_` or `feature_importances_` attribute. Either pass a fitted estimator to feature selector or call fit before calling transform.
Your approach has an overall problem: KNeighborsClassifier does not have an intrinsic measure of feature importance, so it is not compatible with RFECV, whose documentation requires an estimator that provides information about feature importance through a coef_ or feature_importances_ attribute. You will always fail with KNeighborsClassifier; you need a different final classifier, such as RandomForestClassifier or SVC (with a linear kernel).

If you can choose another classifier, your pipeline still needs to expose the feature importance of its final estimator, because RFECV only inspects the object it is given, and a plain Pipeline does not forward these attributes. This can be handled with a custom Pipeline subclass that forwards the final estimator's attributes.
Define your pipeline like:
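As a minimal sketch of this idea: the subclass name FeatureImportancePipeline and the choice of SVC(kernel='linear') as the final step are my own, and the iris data here is only a stand-in for your X = self.dataset / y = self.postures.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.feature_selection import RFECV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Normalizer
from sklearn.svm import SVC


class FeatureImportancePipeline(Pipeline):
    """Pipeline that forwards the final estimator's importance
    attributes so that RFECV's importance_getter='auto' can find them."""

    @property
    def coef_(self):
        return self._final_estimator.coef_

    @property
    def feature_importances_(self):
        return self._final_estimator.feature_importances_


# Stand-in data; in your code this would be X = self.dataset, y = self.postures
X, y = load_iris(return_X_y=True)

pipeline = FeatureImportancePipeline([
    ('normalizer', Normalizer()),
    ('lda', LinearDiscriminantAnalysis(n_components=2)),
    ('svc', SVC(kernel='linear')),  # linear kernel, so coef_ is defined
])

# min_features_to_select=2: LDA(n_components=2) cannot fit on fewer features
rfecv = RFECV(pipeline, step=1, scoring='f1_weighted',
              min_features_to_select=2)
rfecv.fit(X, y)
print("Optimal number of features : %d" % rfecv.n_features_)
```

Note that min_features_to_select must stay at least 2 here, since LinearDiscriminantAnalysis(n_components=2) cannot fit on a single remaining feature (and it needs at least three classes in y). Also be aware that the forwarded coef_ is expressed over the two LDA components rather than the original input features, so treat the resulting ranking as a rough proxy.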
and it should work.