I want to use RFE for feature selection in a pipeline. It works in a pipeline without GridSearchCV, but as soon as I add GridSearchCV I keep getting a ValueError (NB: the models run fine without RFE).
I have tried using feature_selection, as suggested in this topic: "Grid Search with Recursive Feature Elimination in scikit-learn pipeline returns an error", but that results in the same error.
What could be wrong?
My error:
ValueError: Invalid parameter alpha for estimator RFE(estimator=Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
    normalize=True, random_state=None, solver='auto',
    tol=0.001),
    n_features_to_select=4, step=1, verbose=1). Check the list of available parameters with estimator.get_params().keys().
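This works fine:
# Imports used by all snippets in this question
import numpy as np
from sklearn import feature_selection
from sklearn.feature_selection import RFE
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rfe = RFE(estimator=LinearRegression(), n_features_to_select=4, verbose=1)
# Set up the pipeline steps (np.nan rather than np.NaN: the alias was removed in NumPy 2.0)
steps = [('scaler', StandardScaler()),
         ('imputation', SimpleImputer(missing_values=np.nan, strategy='most_frequent')),
         ('reg', rfe)]
# Create the pipeline: pipeline
pipeline = Pipeline(steps)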
# Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Fit the pipeline to the training set:
pipeline.fit(X_train, y_train)
# Predict the labels of the test set
y_pred = pipeline.predict(X_test)
print()
# Print the features and their ranking (high = dropped early on)
print(dict(zip(X.columns, rfe.ranking_)))
# Print the features that are not eliminated
print(X.columns[rfe.support_])
print()
print("R^2: {}".format(pipeline.score(X_test, y_test)))
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("Root Mean Squared Error: {}".format(rmse))
This doesn't work:
rfe = RFE(estimator=Ridge(normalize=True), n_features_to_select=4, verbose=1)
# Set up the pipeline steps
steps = [('scaler', StandardScaler()),
         ('imputation', SimpleImputer(missing_values=np.nan, strategy='most_frequent')),
         ('ridge', rfe)]
# Create the pipeline: pipeline
pipeline = Pipeline(steps)
# Define hyperparameters and range of Grid Search
parameters = {"ridge__alpha": np.linspace(0, 1, 100)}
# Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# run cross validation
cv = GridSearchCV(pipeline, param_grid = parameters, cv=3)
# Fit the pipeline to the training set:
cv.fit(X_train, y_train)
# Predict the labels of the test set
y_pred = cv.predict(X_test)
# Compute and print R^2 and RMSE
print("R^2: {}".format(cv.score(X_test, y_test)))
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("Root Mean Squared Error: {}".format(rmse))
print("Tuned Model Parameters: {}".format(cv.best_params_))
Using feature_selection also doesn't work:
selector = feature_selection.RFE(Ridge(normalize=True))
# Set up the pipeline steps
steps = [('scaler', StandardScaler()),
         ('imputation', SimpleImputer(missing_values=np.nan, strategy='most_frequent')),
         ('RFE', selector)]
# Create the pipeline: pipeline
pipeline = Pipeline(steps)
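The question is old, but in case someone stumbles upon it:
The grid key "ridge__alpha" is applied to the pipeline step named 'ridge', which is the RFE object, and RFE itself has no alpha parameter; alpha belongs to the Ridge estimator wrapped inside it. You can reach alpha, or any other parameter of the estimator passed as feature_selection(estimator=), with a key of the form '<feature_selection>__estimator__<your parameter>', where <feature_selection> is the name of the pipeline step. For the pipeline in the question, that key is 'ridge__estimator__alpha'.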
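As a minimal sketch of that naming scheme, the question's failing pipeline could be rewritten roughly as follows. The synthetic data from make_regression is an assumption standing in for the original X and y, which are not shown in the question, and Ridge is created without normalize=True because that argument was removed in recent scikit-learn releases (the StandardScaler step already scales the features).

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the question's X and y (assumption, not the original data)
X, y = make_regression(n_samples=200, n_features=10, n_informative=4,
                       noise=5.0, random_state=42)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("imputation", SimpleImputer(missing_values=np.nan, strategy="most_frequent")),
    ("ridge", RFE(estimator=Ridge(), n_features_to_select=4)),
])

# List the valid grid keys that end in "alpha"; "ridge__alpha" is not among them
print([key for key in pipeline.get_params() if key.endswith("alpha")])

# <step name>__estimator__<parameter> reaches through RFE to the wrapped Ridge
parameters = {"ridge__estimator__alpha": np.linspace(0.01, 1, 20)}

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

cv = GridSearchCV(pipeline, param_grid=parameters, cv=3)
cv.fit(X_train, y_train)

print("Tuned Model Parameters: {}".format(cv.best_params_))
print("R^2: {}".format(cv.score(X_test, y_test)))
```

The same pattern applies to RFE's own parameters, which sit one level higher: for example, "ridge__n_features_to_select" could be added to the grid to tune the number of selected features alongside alpha.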