Using a Pipeline containing ColumnTransformer in SciKit's RFECV


I'm trying to run RFECV on transformed data using scikit-learn.

To do that, I create a pipeline and pass it to RFECV as the estimator. This works fine unless the pipeline contains a ColumnTransformer step, in which case I get the following error:

ValueError: Specifying the columns using strings is only supported for pandas DataFrames

I have checked the answers to this question, but I'm not sure whether they apply here. The code is as follows:

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import Normalizer
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression

class CustomPipeline(Pipeline):
    @property
    def coef_(self):
        return self._final_estimator.coef_
    @property
    def feature_importances_(self):
        return self._final_estimator.feature_importances_

X = pd.DataFrame({
    'col1': [i for i in range(100)],
    'col2': [i*2 for i in range(100)],
})
y = pd.DataFrame({'out': [i*3 for i in range(100)]})
ct = ColumnTransformer([("norm", Normalizer(norm='l1'), ['col1'])])

pipe = CustomPipeline([
    ('col_transform', ct),
    ('lr', LinearRegression())
])

rfecv = RFECV(
    estimator=pipe, 
    step=1,
    cv=3,
)
# pipe.fit(X, y)  # fitting the pipeline on its own works without problems
rfecv.fit(X, y)  # raises the ValueError shown above

Obviously, I can do this transformation step outside the pipeline and then use the transformed X, but I was wondering if there is any workaround.
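For reference, the "transform outside the pipeline" workaround could look roughly like the sketch below. It is only an illustration under a few assumptions that are not in my original code: remainder="passthrough" is set so that col2 survives the transformation, y is a Series rather than a one-column DataFrame, and the plain LinearRegression is passed to RFECV directly, so the CustomPipeline wrapper is not needed.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import Normalizer

X = pd.DataFrame({"col1": range(100), "col2": [i * 2 for i in range(100)]})
y = pd.Series([i * 3 for i in range(100)], name="out")

# Assumption: remainder="passthrough" keeps col2, so the output has 2 columns.
ct = ColumnTransformer(
    [("norm", Normalizer(norm="l1"), ["col1"])],
    remainder="passthrough",
)

# Column names are resolved here, while X is still a DataFrame.
X_t = ct.fit_transform(X)

# RFECV now only ever sees a plain array and a plain estimator.
rfecv = RFECV(estimator=LinearRegression(), step=1, cv=3)
rfecv.fit(X_t, y)

print(rfecv.support_)  # boolean mask over the transformed columns
```

Normalizer works row by row and learns nothing during fit, so applying it before cross-validation should not leak information between folds. For a transformer that does learn from the data (a scaler, for example), transforming up front would leak, which is why I would still prefer a solution that keeps the step inside the pipeline.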

I'd also like to raise this as a design issue in RFECV: it converts X to a NumPy array as the first step, which discards the column names, while other tools with built-in cross-validation (e.g. GridSearchCV) do not do that.
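To illustrate the comparison, here is a minimal sketch of a pipeline with the same ColumnTransformer step being fitted through GridSearchCV on a DataFrame. As far as I can tell, GridSearchCV passes each fold to the pipeline still as a DataFrame, so the string column names resolve. The param_grid and remainder="passthrough" are my own choices for the example, not part of the code above.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Normalizer

X = pd.DataFrame({"col1": range(100), "col2": [i * 2 for i in range(100)]})
y = pd.Series([i * 3 for i in range(100)], name="out")

pipe = Pipeline([
    ("col_transform", ColumnTransformer(
        [("norm", Normalizer(norm="l1"), ["col1"])],
        remainder="passthrough",  # assumption: keep col2 as a feature
    )),
    ("lr", LinearRegression()),
])

# Hypothetical grid, chosen only so that there is something to search over.
search = GridSearchCV(pipe, param_grid={"lr__fit_intercept": [True, False]}, cv=3)
search.fit(X, y)  # X stays a DataFrame, so ['col1'] can be looked up by name

print(search.best_params_)
```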
