Using sklearn Pipelines with mlxtend SequentialFeatureSelector gives an error


I am trying to remove some unimportant features for one of my personal projects. Some, but not all, of the features need scaling. I decided to select the features using SequentialFeatureSelector from mlxtend with cross-validation. My understanding is that I need to use a Pipeline so that the scaler is fitted separately on each training fold, but doing so gives me an error.

Here is my code:

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import Ridge
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.compose import ColumnTransformer
from sklearn.utils import compute_sample_weight
import random

data = pd.DataFrame({
    'a': random.sample(range(1, 1000), 100),
    'b': random.sample(range(1, 1000), 100),
    'c': random.sample(range(1, 1000), 100),
    'd': random.sample(range(1, 1000), 100),
    'e': random.sample(range(1, 1000), 100),
    })

X_train = data.drop(columns=['a'])
y_train = data['a']

preprocessor = ColumnTransformer(transformers=[('scaler', StandardScaler(), ['b', 'd'])], 
                                 remainder='passthrough')
pipeline = Pipeline([('preprocessor', preprocessor), 
                     ('estimator', Ridge())])
weights = compute_sample_weight(class_weight='balanced', y=y_train)

sfs = SFS(pipeline, 
          k_features='best', 
          forward=False, 
          floating=True, 
          verbose=2, 
          scoring='neg_root_mean_squared_error', 
          cv=5)
sfs = sfs.fit(X_train, y_train, estimator__sample_weight=weights)

The error is:

/usr/local/lib/python3.7/dist-packages/sklearn/model_selection/_validation.py:372: FitFailedWarning: 
5 fits failed out of a total of 5.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
5 fits failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/sklearn/utils/__init__.py", line 409, in _get_column_indices
    all_columns = X.columns
AttributeError: 'numpy.ndarray' object has no attribute 'columns'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/sklearn/model_selection/_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/usr/local/lib/python3.7/dist-packages/sklearn/pipeline.py", line 390, in fit
    Xt = self._fit(X, y, **fit_params_steps)
  File "/usr/local/lib/python3.7/dist-packages/sklearn/pipeline.py", line 355, in _fit
    **fit_params_steps[name],
  File "/usr/local/lib/python3.7/dist-packages/joblib/memory.py", line 349, in __call__
    return self.func(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/sklearn/pipeline.py", line 893, in _fit_transform_one
    res = transformer.fit_transform(X, y, **fit_params)
  File "/usr/local/lib/python3.7/dist-packages/sklearn/compose/_column_transformer.py", line 672, in fit_transform
    self._validate_column_callables(X)
  File "/usr/local/lib/python3.7/dist-packages/sklearn/compose/_column_transformer.py", line 352, in _validate_column_callables
    transformer_to_input_indices[name] = _get_column_indices(X, columns)
  File "/usr/local/lib/python3.7/dist-packages/sklearn/utils/__init__.py", line 412, in _get_column_indices
    "Specifying the columns using strings is only "
ValueError: Specifying the columns using strings is only supported for pandas DataFrames

  warnings.warn(some_fits_failed_message, FitFailedWarning)
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.0s finished
/usr/local/lib/python3.7/dist-packages/mlxtend/feature_selection/sequential_feature_selector.py:612: RuntimeWarning: Mean of empty slice
  all_avg_scores.append(np.nanmean(cv_scores))

[2022-04-09 06:26:03] Features: 1/1 -- score: nan

---------------------------------------------------------------------------

KeyError                                  Traceback (most recent call last)

<ipython-input-5-b7df58cec335> in <module>()
     33           scoring='neg_root_mean_squared_error',
     34           cv=5)
---> 35 sfs = sfs.fit(X_train, y_train, estimator__sample_weight=weights)

/usr/local/lib/python3.7/dist-packages/mlxtend/feature_selection/sequential_feature_selector.py in fit(self, X, y, custom_feature_names, groups, **fit_params)
    566                     best_subset = k
    567             k_score = max_score
--> 568             k_idx = self.subsets_[best_subset]['feature_idx']
    569 
    570             if self.k_features == 'parsimonious':

KeyError: None

I know that it has something to do with ColumnTransformer, because the code runs when I remove that step. I've tried adding an extra pipeline step to convert the input back to a DataFrame, but it didn't help. Does anyone have any ideas?
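
To narrow this down, the inner ValueError from the traceback can be reproduced with scikit-learn alone, without mlxtend. The sketch below is only an illustration: the toy data and the `make_preprocessor` helper are hypothetical names, not part of the original code. It fits the same ColumnTransformer once on a DataFrame and once on the equivalent NumPy array. The `'numpy.ndarray' object has no attribute 'columns'` line in the traceback suggests that the pipeline receives an array rather than a DataFrame during cross-validation, which would explain why string column names such as 'b' and 'd' cannot be resolved.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data using the same feature names as in the question.
rng = np.random.default_rng(0)
X_df = pd.DataFrame(rng.integers(1, 1000, size=(20, 4)),
                    columns=['b', 'c', 'd', 'e'])


def make_preprocessor():
    # Same transformer as in the question: scale 'b' and 'd', pass the rest through.
    return ColumnTransformer(
        transformers=[('scaler', StandardScaler(), ['b', 'd'])],
        remainder='passthrough')


# A DataFrame carries column names, so the string selectors resolve.
scaled = make_preprocessor().fit_transform(X_df)

# A bare array has no column names, so the same call is expected to raise.
try:
    make_preprocessor().fit_transform(X_df.to_numpy())
    array_error = None
except ValueError as exc:
    array_error = str(exc)

print(scaled.shape)
print(array_error)
```

If the second call fails here in the same way, the problem is the input type that reaches ColumnTransformer rather than the Ridge estimator or the sample weights. Whether positional column indices would be a safe substitute is not verified here, since that depends on how the selector slices feature subsets before passing them to the pipeline.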
