How to create pandas output for custom transformers?


scikit-learn 1.2.0 added pandas output support for its built-in transformers through set_output. How can I use it in a custom transformer?

In [1]: Here is my custom transformer which is a standard scaler:

from sklearn.base import BaseEstimator, TransformerMixin
import numpy as np

class StandardScalerCustom(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        self.mean = np.mean(X, axis=0)
        self.std = np.std(X, axis=0)
        return self

    def transform(self, X):
        return (X - self.mean) / self.std

In [2]: Created a specific scale pipeline

from sklearn.pipeline import make_pipeline
scale_pipe = make_pipeline(StandardScalerCustom())

In [3]: Added in a full pipeline where it may get mixed with scalers, imputers, encoders etc.

from sklearn.compose import ColumnTransformer

full_pipeline = ColumnTransformer([
    ("imputer", impute_pipe, ['column_1']),
    ("scaler", scale_pipe, ['column_2'])
])

# From documentation
full_pipeline.set_output(transform="pandas")

Got this error:

ValueError: Unable to configure output for StandardScalerCustom() because set_output is not available.


One workaround is to change the global configuration: set_config(transform_output="pandas")

But on a case-by-case basis, what method do I need to add to the StandardScalerCustom class to fix the error above?


There are 2 best solutions below


My guess is that one of the rationales behind adding the transform_output parameter to set_config() was precisely to let custom transformers output pandas DataFrames as well.

Looking at the underlying code, I found one hack that lets custom transformers output pandas DataFrames without explicitly setting the global configuration: it is sufficient to implement a dummy .get_feature_names_out() method. This works because _auto_wrap_is_configured() returns True only if .get_feature_names_out() is available. When it is, the transformer exposes .set_output(), so full_pipeline can call it and record transform="pandas" on that transformer. When it is not, ._safe_set_output() finds a transform method without a matching set_output and raises the ValueError that you're getting.

Here's a working example:

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.base import BaseEstimator, TransformerMixin
import numpy as np
import pandas as pd

df = pd.DataFrame({'column_1': [np.nan, 1.34, 10.98, 3.34, 5.32], 'column_2': [9.78, 20.34, 43.54, 1.98, 7.85]})

class StandardScalerCustom(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        self.mean = np.mean(X, axis=0)
        self.std = np.std(X, axis=0)
        return self

    def transform(self, X):
        return (X - self.mean) / self.std

    def get_feature_names_out(self, input_features=None):
        # Dummy method: its presence alone makes set_output available.
        pass

impute_pipe = make_pipeline(SimpleImputer())
scale_pipe = make_pipeline(StandardScalerCustom())

full_pipeline = ColumnTransformer([
    ("imputer", impute_pipe, ['column_1']),
    ("scaler", scale_pipe, ['column_2'])
])

full_pipeline.set_output(transform="pandas")
full_pipeline.fit_transform(df)

In most cases, a custom transform method returns a NumPy array. To convert the result back to a pandas DataFrame, store the column names while fitting, then add a get_feature_names_out method that returns them. Try this code:

from sklearn.base import BaseEstimator, TransformerMixin
import numpy as np

class StandardScalerCustom(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        self.columns_ = X.columns
        self.mean = np.mean(X, axis=0)
        self.std = np.std(X, axis=0)
        return self

    def transform(self, X):
        return (X - self.mean) / self.std
    
    def get_feature_names_out(self, *args, **params):
        return self.columns_