When should feature selection be applied during hyperparameter optimization?


I am using skopt (scikit-optimize) to find the best hyperparameters for a random forest model. I have a lot of features, and to avoid overfitting I'd like to add feature selection, for example with RFE. But I am not sure when to apply the feature selection: inside each iteration, while evaluating each combination of parameters, or only after the best parameters have been found?

My main questions are:

  • If I add RFE inside each iteration, then each iteration may select a different subset of features, so the models are not consistent with each other?
  • If I move RFE to after the best parameters are found, then during the hyperparameter optimization the model might overfit in every iteration because of the large number of features? (A rough Pipeline-based sketch of the first option follows this list.)
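To make the first option concrete, this is roughly what I have in mind using an sklearn Pipeline, so that RFE is refit on the training part of every fold (only a sketch; the hyperparameter values are placeholders, not my real search space):

# sketch of option 1: RFE nested inside each CV fold via a Pipeline
# placeholder hyperparameters; in practice they would come from gp_minimize
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    # the selector is refit on each training fold, so the test fold never leaks in
    ("rfe", RFE(estimator=RandomForestClassifier(n_estimators=200), n_features_to_select=5)),
    ("clf", RandomForestClassifier(max_depth=5, n_estimators=200)),
])
scores = cross_val_score(pipe, X, y, cv=5, scoring="accuracy")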

For example, below is my current skopt code without feature selection:

import numpy as np
from sklearn import ensemble, metrics, model_selection


def optimize(params, param_names, x, y):
    """
    The main optimization function.
    This function takes all the arguments from the search space
    and training features and targets. It then initializes
    the models by setting the chosen parameters and runs
    cross-validation and returns a negative accuracy score
    :param params: list of params from gp_minimize
    :param param_names: list of param names. order is important!
    :param x: training data
    :param y: labels/targets
    :return: negative accuracy after 5 folds
    """

    # convert params to dictionary
    params = dict(zip(param_names, params))
    # initialize model with current parameters
    model = ensemble.RandomForestClassifier(**params)
    # initialize stratified k-fold
    kf = model_selection.StratifiedKFold(n_splits=5)
    # initialize accuracy list
    accuracies = []
    # loop over all folds
    for idx in kf.split(X=x, y=y):
        train_idx, test_idx = idx[0], idx[1]
        xtrain = x[train_idx]
        ytrain = y[train_idx]
        xtest = x[test_idx]
        ytest = y[test_idx]
        # fit model for current fold
        model.fit(xtrain, ytrain)
        #create predictions
        preds = model.predict(xtest)
        # calculate and append accuracy
        fold_accuracy = metrics.accuracy_score(ytest, preds)
        accuracies.append(fold_accuracy)
    # return negative accuracy
    return -1 * np.mean(accuracies)


from functools import partial
from skopt import gp_minimize
from skopt import space

# define a parameter space
param_space = [
    # max_depth is an integer between 3 and 15
    space.Integer(3, 15, name="max_depth"),
    # n_estimators is an integer between 100 and 1500
    space.Integer(100, 1500, name="n_estimators"),
    # criterion is a category. here we define list of categories
    space.Categorical(["gini", "entropy"], name="criterion"),
    # you can also have a Real-valued space and define the
    # distribution you want to sample from
    space.Real(0.01, 1, prior="uniform", name="max_features")
    ]
# make a list of param names
# this has to be same order as the search space
# inside the main function
param_names = [
    "max_depth",
    "n_estimators",
    "criterion",
    "max_features"
    ]
# by using functools.partial, I am creating a
# new function that has the same parameters as the
# optimize function, except that only one argument,
# the "params" parameter, is still required. this is
# how gp_minimize expects the objective function to be.
# you can get rid of this by reading the data inside
# the optimize function or by defining the optimize
# function here.
optimization_function = partial(
    optimize,    # optimize takes 4 arguments, so the other 3 are fixed here
    param_names=param_names,
    x=X,
    y=y
    )
# now we call gp_minimize from scikit-optimize
# gp_minimize uses bayesian optimization for
# minimization of the optimization function.
# we need a space of parameters, the function itself,
# the number of calls/iterations we want to have
result = gp_minimize(
    optimization_function,
    dimensions=param_space,
    n_calls=15,
    n_random_starts=10,
    verbose=10
    )
# create best params dict and print it
best_params = dict(zip(param_names, result.x))
print('Best parameters after skopt optimization are')
print(best_params)

Now I am thinking of adding RFE to the code above. My idea is to modify the optimize() function so that RFE is applied to the training data of each fold, as shown below. I am not sure whether this is the correct way to do it. Could you please give me some suggestions? Thanks.


from sklearn.feature_selection import RFE


def feature_selection_RFE(model, X, y, num_features):
    # fit RFE on the given (training) data and reduce it to num_features columns
    rfe = RFE(estimator=model, n_features_to_select=num_features)
    rfe.fit(X, y)
    X_transformed = rfe.transform(X)

    # collect the column indices of the selected features
    col_index_features_selected = []
    for i in range(X.shape[1]):
        if rfe.support_[i]:  # selected feature
            col_index_features_selected.append(i)

    return (X_transformed, col_index_features_selected)

def optimize(params,param_names,x,y):
    """
    The main optimization function.
    This function takes all the arguments from the search space
    and training features and targets. It then initializes
    the models by setting the chosen parameters and runs
    cross-validation and returns a negative accuracy score
    :param params: list of params from gp_minimize
    :param param_names: list of param names. order is important!
    :param x: training data
    :param y: labels/targets
    :return: negative accuracy after 5 folds
    """

    # convert params to dictionary
    params = dict(zip(param_names, params))
    # initialize model with current parameters
    model = ensemble.RandomForestClassifier(**params)
    # initialize stratified k-fold
    kf = model_selection.StratifiedKFold(n_splits=5)
    # initialize accuracy list
    accuracies = []
    # loop over all folds
    for idx in kf.split(X=x, y=y):
        train_idx, test_idx = idx[0], idx[1]
        xtrain = x[train_idx]
        ytrain = y[train_idx]
        # run RFE on the training fold only and keep the selected column indices
        (x_train_reduced_features, col_index_features_selected) = feature_selection_RFE(model, xtrain, ytrain, 5)
        xtest = x[test_idx]
        ytest = y[test_idx]
        # fit model for current fold on the reduced feature set
        model.fit(x_train_reduced_features, ytrain)
        # create predictions using the same selected columns of the test fold
        preds = model.predict(xtest[:, col_index_features_selected])
        # calculate and append accuracy
        fold_accuracy = metrics.accuracy_score(ytest, preds)
        accuracies.append(fold_accuracy)
    # return negative accuracy
    return -1 * np.mean(accuracies)
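For completeness, the alternative from my second bullet (running RFE only once, after gp_minimize has found the best parameters) would look roughly like this. It is only a sketch and assumes best_params, X and y from the earlier snippet:

# sketch of option 2: a single RFE pass with the tuned hyperparameters
# assumes best_params, X and y exist as defined above
final_model = ensemble.RandomForestClassifier(**best_params)
X_selected, selected_cols = feature_selection_RFE(final_model, X, y, 5)
final_model.fit(X_selected, y)
print('Selected feature column indices:', selected_cols)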