Not getting expected min_resources_ when using scikit-learn's HalvingGridSearchCV


I am trying to tune model hyperparameters with scikit-learn's HalvingGridSearchCV class, and the iterations it runs do not look correct to me. I am using the default min_resources="exhaust" because I want the last iteration to use the entire training set, or as much of it as possible. When I test 8 hyperparameter combinations with a halving factor of 2, I run into one of two problems.

The first is that in some cases it runs the last iteration for only one candidate hyperparameter set, something the documentation says it should not do:

the process stops at the first iteration which evaluates factor=2 candidates: the best candidate is the best out of these 2 candidates. It is not necessary to run an additional iteration

Over 1000 generated samples, running a halving search with a factor of 2 over a grid of 8 possible hyperparameter combinations, I would expect the search to finish in 3 iterations with n_candidates_=[8, 4, 2] and n_resources_=[250, 500, 1000]. What I actually get is n_candidates_=[8, 4, 2, 1] with n_resources_=[125, 250, 500, 1000], where the first iteration uses a lower min_resources than expected and the last iteration is unnecessary.

The other problem appears when I run the same search over 100 generated samples: instead of the expected 3 iterations with n_candidates_=[8, 4, 2] and n_resources_=[25, 50, 100], I get n_candidates_=[8, 4, 2] with n_resources_=[20, 40, 80]. Here the number of iterations is correct, but the initial n_resources_ is too low and the last iteration never reaches 100 samples, as it should with min_resources="exhaust".
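For reference, here is the back-of-the-envelope arithmetic behind the schedules I expect. This is only my own sketch of how I read min_resources="exhaust", not scikit-learn's actual implementation: run just enough iterations to get down to at most factor candidates, then pick min_resources so the last iteration uses (close to) the whole training set.

import math

def expected_schedule(n_samples, n_candidates, factor):
    # Number of iterations needed to reduce the grid to <= factor candidates.
    n_iterations = math.ceil(math.log(n_candidates) / math.log(factor))
    # Size min_resources so the last iteration uses (nearly) all samples.
    min_resources = n_samples // factor ** (n_iterations - 1)
    candidates, resources = [], []
    for i in range(n_iterations):
        candidates.append(n_candidates)
        resources.append(min_resources * factor ** i)
        n_candidates = math.ceil(n_candidates / factor)
    return candidates, resources

print(expected_schedule(1000, 8, 2))  # ([8, 4, 2], [250, 500, 1000])
print(expected_schedule(100, 8, 2))   # ([8, 4, 2], [25, 50, 100])
print(expected_schedule(200, 8, 2))   # ([8, 4, 2], [50, 100, 200])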

These issues appear to be data dependent. I first noticed that the iterations were not what I expected when running a small test on a subset of the Fashion-MNIST dataset. There, running the same hyperparameter search over 200 samples gives n_candidates_=[8, 4] with n_resources_=[100, 200], while on 200 generated samples I get n_candidates_=[8, 4, 2, 1] with n_resources_=[25, 50, 100, 200]. Neither is the expected n_candidates_=[8, 4, 2] with n_resources_=[50, 100, 200].

When I use a halving factor of 3 or 4, I get the results I expect; I have only seen this issue with a factor of 2.

I am not sure whether I am misunderstanding the expected behavior of successive halving, using it incorrectly in some way, or hitting a bug in scikit-learn. I am using version 1.4.0 of the library.

The following code snippet results in n_candidates_=[8, 4, 2, 1] and n_resources_=[125, 250, 500, 1000] instead of the n_candidates_=[8, 4, 2] and n_resources_=[250, 500, 1000] I expect. The other cases described above can be reproduced by changing num_samples, or by un-commenting the Fashion-MNIST code and commenting out the make_classification() line.

import numpy as np
from sklearn.datasets import fetch_openml, make_classification
from sklearn.experimental import enable_halving_search_cv  # required to expose HalvingGridSearchCV
from sklearn.model_selection import train_test_split, HalvingGridSearchCV
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
num_samples = 1000  # or 100, or 200

# If testing on Fashion-MNIST, this will download the dataset
# data_fash = fetch_openml(name="Fashion-MNIST")
# X_train, _, y_train, _ = train_test_split(
#                              data_fash.data, data_fash.target, train_size=num_samples, random_state=rng
#                              )

X_train, y_train = make_classification(n_samples=num_samples, n_features=20, random_state=rng)

print("Train samples: ", len(X_train))

param_grid = {"criterion": ["gini", "entropy"],
              "min_samples_split": [2, 3, 4, 5]}

cls = DecisionTreeClassifier()

grid = HalvingGridSearchCV(
           cls, param_grid, factor=2, min_resources="exhaust", scoring="accuracy", cv=5
           )

grid.fit(X_train, y_train)

print(grid.n_candidates_)
print(grid.n_resources_)
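
In case it helps, I also print a few more of the fitted attributes documented on HalvingGridSearchCV (added here only for diagnosis; they are not needed to reproduce the issue):

print(grid.min_resources_)
print(grid.max_resources_)
print(grid.n_iterations_)
print(grid.n_possible_iterations_)
print(grid.n_required_iterations_)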