I am trying to remove some unimportant features for one of my personal projects. Some, but not all, of the features need scaling. I decided to select the features using SequentialFeatureSelector from mlxtend with cross-validation. My understanding is that I need to use a Pipeline so that the scaler is fitted separately within each fold, but doing so gives me an error.
Here is my code:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import Ridge
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.compose import ColumnTransformer
from sklearn.utils import compute_sample_weight
import random
data = pd.DataFrame({
    'a': random.sample(range(1, 1000), 100),
    'b': random.sample(range(1, 1000), 100),
    'c': random.sample(range(1, 1000), 100),
    'd': random.sample(range(1, 1000), 100),
    'e': random.sample(range(1, 1000), 100),
})
X_train = data.drop(columns=['a'], axis=1)
y_train = data['a']
preprocessor = ColumnTransformer(transformers=[('scaler', StandardScaler(), ['b', 'd'])],
                                 remainder='passthrough')
pipeline = Pipeline([('preprocessor', preprocessor),
                     ('estimator', Ridge())])
weights = compute_sample_weight(class_weight='balanced', y=y_train)
sfs = SFS(pipeline,
          k_features='best',
          forward=False,
          floating=True,
          verbose=2,
          scoring='neg_root_mean_squared_error',
          cv=5)
sfs = sfs.fit(X_train, y_train, estimator__sample_weight=weights)
The error is:
/usr/local/lib/python3.7/dist-packages/sklearn/model_selection/_validation.py:372: FitFailedWarning:
5 fits failed out of a total of 5.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.
Below are more details about the failures:
--------------------------------------------------------------------------------
5 fits failed with the following error:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/sklearn/utils/__init__.py", line 409, in _get_column_indices
all_columns = X.columns
AttributeError: 'numpy.ndarray' object has no attribute 'columns'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/sklearn/model_selection/_validation.py", line 680, in _fit_and_score
estimator.fit(X_train, y_train, **fit_params)
File "/usr/local/lib/python3.7/dist-packages/sklearn/pipeline.py", line 390, in fit
Xt = self._fit(X, y, **fit_params_steps)
File "/usr/local/lib/python3.7/dist-packages/sklearn/pipeline.py", line 355, in _fit
**fit_params_steps[name],
File "/usr/local/lib/python3.7/dist-packages/joblib/memory.py", line 349, in __call__
return self.func(*args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/sklearn/pipeline.py", line 893, in _fit_transform_one
res = transformer.fit_transform(X, y, **fit_params)
File "/usr/local/lib/python3.7/dist-packages/sklearn/compose/_column_transformer.py", line 672, in fit_transform
self._validate_column_callables(X)
File "/usr/local/lib/python3.7/dist-packages/sklearn/compose/_column_transformer.py", line 352, in _validate_column_callables
transformer_to_input_indices[name] = _get_column_indices(X, columns)
File "/usr/local/lib/python3.7/dist-packages/sklearn/utils/__init__.py", line 412, in _get_column_indices
"Specifying the columns using strings is only "
ValueError: Specifying the columns using strings is only supported for pandas DataFrames
warnings.warn(some_fits_failed_message, FitFailedWarning)
[Parallel(n_jobs=1)]: Done 2 out of 2 | elapsed: 0.0s finished
/usr/local/lib/python3.7/dist-packages/mlxtend/feature_selection/sequential_feature_selector.py:612: RuntimeWarning: Mean of empty slice
all_avg_scores.append(np.nanmean(cv_scores))
[2022-04-09 06:26:03] Features: 1/1 -- score: nan
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-5-b7df58cec335> in <module>()
33 scoring='neg_root_mean_squared_error',
34 cv=5)
---> 35 sfs = sfs.fit(X_train, y_train, estimator__sample_weight=weights)
/usr/local/lib/python3.7/dist-packages/mlxtend/feature_selection/sequential_feature_selector.py in fit(self, X, y, custom_feature_names, groups, **fit_params)
566 best_subset = k
567 k_score = max_score
--> 568 k_idx = self.subsets_[best_subset]['feature_idx']
569
570 if self.k_features == 'parsimonious':
KeyError: None
I know that it has something to do with ColumnTransformer, because the code runs when I remove this step. I've tried adding an extra step to convert the data to a DataFrame, but it didn't help. Does anyone have any ideas?
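As far as I can tell, the inner ValueError can be reproduced without mlxtend at all. The sketch below assumes that the selector hands the pipeline a plain NumPy array rather than a DataFrame. I have not confirmed this in the mlxtend source, but it is what the AttributeError in the traceback suggests. The helper names make_preprocessor and fits_on are made up for illustration, and the data is random.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Small random frame with the same feature names as in the question.
rng = np.random.default_rng(0)
frame = pd.DataFrame(rng.integers(1, 1000, size=(20, 4)),
                     columns=['b', 'c', 'd', 'e'])


def make_preprocessor():
    # Same transformer as in the question: columns are selected by name.
    return ColumnTransformer(
        transformers=[('scaler', StandardScaler(), ['b', 'd'])],
        remainder='passthrough',
    )


def fits_on(X):
    """Return True if the name-based ColumnTransformer can be fitted on X."""
    try:
        make_preprocessor().fit(X)
        return True
    except ValueError:
        # Raised when string column names are used on a non-DataFrame input.
        return False


fits_with_frame = fits_on(frame)             # column names are available
fits_with_array = fits_on(frame.to_numpy())  # column names are lost
print(fits_with_frame, fits_with_array)
```

If that assumption holds, I would expect the first fit to succeed and the second to fail with the same "only supported for pandas DataFrames" message, which would explain why every CV fit fails and the score becomes nan.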