Including Scaling and PCA as parameter of GridSearchCV


I want to run a logistic regression using GridSearchCV, but I want to contrast the performance when scaling and PCA are used versus when they are not, so I don't want to apply them in all cases.

I basically would like to include PCA and scaling as "parameters" of the GridSearchCV.

I am aware I can make a pipeline like this:

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

mnl = LogisticRegression(fit_intercept=True, multi_class="multinomial")

pipe = Pipeline([
    ('scale', StandardScaler()),
    ('mnl', mnl)])

params_mnl = {'mnl__solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
              'mnl__max_iter':[500,1000,2000,3000]}

The thing is that, in this case, the scaling would be applied to every combination in the grid, right? Is there a way to make it so the scaling itself is "included" in the grid search?

EDIT:

I just read this answer and even though it's similar to what I want, it's not really it, because in that case the scaler is applied to the best estimator that comes out of the GridSearchCV.

What I want to do is, for example, let's say

params_mnl = {'mnl__solver': ['newton-cg', 'lbfgs']}

I want to run the regression with Scaler+newton-cg, No Scaler+newton-cg, Scaler+lbfgs, No Scaler+lbfgs.

There are 2 answers below.

Accepted answer

You can set the parameters with_mean and with_std of StandardScaler() to False so that it performs no standardization. In GridSearchCV, the parameter param_grid can then be set up as a list of dicts:

param_grid = [{'scale__with_mean': [False],
               'scale__with_std': [False],
               'mnl__solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
               'mnl__max_iter':[500,1000,2000,3000]
              },
              {'mnl__solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
               'mnl__max_iter':[500,1000,2000,3000]}
]

Then the first dict in the list is "No Scaler+mnl" and the second is "Scaler+mnl".
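
For completeness, here is a minimal sketch of how this param_grid could be passed to GridSearchCV, assuming the pipe from the question and placeholder training data X_train, y_train (the cv=5 choice is arbitrary):

from sklearn.model_selection import GridSearchCV

gs = GridSearchCV(pipe, param_grid=param_grid, cv=5, n_jobs=-1)
gs.fit(X_train, y_train)

# cv_results_ has one row per candidate from both dicts, so the
# "No Scaler" and "Scaler" variants are compared within one search
print(gs.best_params_)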

Ref:

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html

https://scikit-learn.org/stable/tutorial/statistical_inference/putting_together.html

Edit: I think it gets more complicated if you also want to turn PCA on/off... Maybe you need to define a customised PCA class that derives from the original PCA, with an additional boolean argument that determines whether the PCA should actually be executed or not:

from sklearn.decomposition import PCA

class MYPCA(PCA):
    def __init__(self, PCA_turn_on=True, n_components=None):
        super().__init__(n_components=n_components)
        self.PCA_turn_on = PCA_turn_on

    def fit(self, X, y=None):
        if self.PCA_turn_on:
            return super().fit(X, y)
        return self  # fitting is a no-op when PCA is turned off

    def transform(self, X):
        if self.PCA_turn_on:
            return super().transform(X)
        return X  # pass the data through unchanged

    def fit_transform(self, X, y=None):
        if self.PCA_turn_on:
            return super().fit_transform(X, y)
        return X

    # same idea for any other methods defined in PCA
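
A minimal sketch of how the toggle could then be searched over, assuming the MYPCA class above and the mnl estimator from the question (the reduced grid is just for illustration):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ('scale', StandardScaler()),
    ('pca', MYPCA()),
    ('mnl', mnl)])

param_grid = {'pca__PCA_turn_on': [True, False],
              'mnl__solver': ['newton-cg', 'lbfgs']}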
Second answer

From the documentation for Pipeline:

A step’s estimator may be replaced entirely by setting the parameter with its name to another estimator, or a transformer removed by setting it to ‘passthrough’ or None.

For example:

pipe = Pipeline([
    ('scale', StandardScaler()),
    ('mnl', mnl),
])

params = {
    'scale': ['passthrough', StandardScaler()],
    'mnl__solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
    'mnl__max_iter': [500, 1000, 2000, 3000],
}
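
This params dict can then be handed straight to GridSearchCV; a minimal sketch, assuming placeholder training data X_train and y_train:

from sklearn.model_selection import GridSearchCV

gs = GridSearchCV(pipe, param_grid=params, cv=5)
gs.fit(X_train, y_train)

# best_params_['scale'] is either the string 'passthrough' or a
# StandardScaler() instance, depending on which variant scored best,
# so the search itself decides whether scaling helps
print(gs.best_params_['scale'])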