Is there a discretised version of scipy.stats.loguniform?

When running hyperparameter tuning on a random forest, I sometimes want to specify a large integer range of values for integer parameters like min_samples_leaf (e.g. ranging from the default value of 1 up to 100).

Whilst I could specify this range using scipy.stats.randint(1, 100), I'd prefer to use a log-uniform distribution as my range covers two orders of magnitude. SciPy has stats.loguniform for continuous rvs, but doesn't seem to have a discretised equivalent.
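To illustrate why log-uniform suits a range spanning orders of magnitude (a quick sketch, not specific to the question): on [1, 100] a log-uniform distribution puts roughly half its mass below 10, whereas uniform integers put only about 9% of theirs there:

```python
import numpy as np
from scipy.stats import loguniform, randint

rng = np.random.RandomState(0)

# Log-uniform on [1, 100]: each order of magnitude gets equal mass,
# so about half the samples fall below 10 (ln(10)/ln(100) = 0.5)
lu = loguniform(1, 100).rvs(100_000, random_state=rng)
print((lu < 10).mean())  # ~0.5

# Uniform integers on [1, 100): only 9/99, i.e. ~9%, fall below 10
ri = randint(1, 100).rvs(100_000, random_state=rng)
print((ri < 10).mean())  # ~0.09
```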

A quick solution for approximating the discretised space is to just sample lots of continuous values and then convert the samples to integers:

from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor

from scipy.stats import loguniform
import numpy as np

#Draw lots of samples and discretise them, in order to approximate
# a discretised loguniform sample space
def discretised_loguniform_samples(low, high, seed=None, sample_size=100_000):
    float_rvs = loguniform(low, high).rvs(sample_size, random_state=seed)
    return float_rvs.round().astype(int)

Usage:

rf_param_distributions = {
    'min_samples_leaf': discretised_loguniform_samples(low=1, high=100, seed=0),
     ...
}

#This will draw n_iter=10 samples from the fixed list of integers created above.
# The list from which the samples are drawn is fixed in advance and therefore
# can't exploit the randomness imparted by the consumable random_state argument.
RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=np.random.RandomState(0)),
    param_distributions=rf_param_distributions,
    n_iter=10,
    random_state=np.random.RandomState(0),
    ...
)

The downside is that defining the list of integers in advance gives me only a single static space (however large) that stays fixed throughout the tuning process. I want to exploit the randomness imparted by the consumable random_state= in RandomizedSearchCV*, rather than being limited to a pre-defined list.

How can I modify loguniform in such a way that I get a discretised version of its samples for each call to rvs()?

*RandomizedSearchCV passes down its random_state= parameter to the distribution's rvs() method. The docs seem ambiguous on this point, stating that random_state= is "used for sampling from lists of possible values instead of scipy.stats distributions".
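The pass-through can be checked without sklearn, since a frozen SciPy distribution's rvs() accepts random_state= directly, and that is the hook RandomizedSearchCV uses (a minimal sketch):

```python
import numpy as np
from scipy.stats import loguniform

dist = loguniform(1, 100)

# Identically-seeded RandomState objects yield identical draws,
# confirming that rvs() consumes the random_state it is given
a = dist.rvs(size=5, random_state=np.random.RandomState(0))
b = dist.rvs(size=5, random_state=np.random.RandomState(0))
assert (a == b).all()
```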

Best answer, by Muhammed Yunus:

The approach below simply decorates/wraps loguniform.rvs() with a float-to-int function:

#Decorator that converts the float output of rvs() to integers
def float_to_int(rvs):
    def rvs_wrapper(*args, **kwargs):
        #Forward all arguments (including random_state=) to the original rvs()
        return rvs(*args, **kwargs).round().astype(int)
    return rvs_wrapper

def int_loguniform(low, high):
    #Create a loguniform object
    lu = loguniform(low, high)

    #Wrap its rvs() with float-to-int
    lu.rvs = float_to_int(lu.rvs)

    #Return modified loguniform object
    return lu

Usage:

rf_param_distributions = {
    'min_samples_leaf': int_loguniform(low=1, high=100),
     ...
}

#Each call to rvs() now consumes the supplied random_state,
# so we are no longer limited to drawing samples from a fixed list.
RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=np.random.RandomState(0)),
    param_distributions=rf_param_distributions,
    n_iter=10,
    random_state=np.random.RandomState(0),
    ...
)
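A quick standalone sanity check of the wrapper (re-defining the helpers above so it runs on its own): the wrapped rvs() returns integers in the requested range and still honours random_state=:

```python
import numpy as np
from scipy.stats import loguniform

def float_to_int(rvs):
    def rvs_wrapper(*args, **kwargs):
        return rvs(*args, **kwargs).round().astype(int)
    return rvs_wrapper

def int_loguniform(low, high):
    lu = loguniform(low, high)
    lu.rvs = float_to_int(lu.rvs)
    return lu

dist = int_loguniform(1, 100)
samples = dist.rvs(size=10, random_state=np.random.RandomState(0))

# Samples are integers within the requested range
assert samples.dtype.kind == 'i'
assert samples.min() >= 1 and samples.max() <= 100

# The same seed reproduces the same draw
repeat = dist.rvs(size=10, random_state=np.random.RandomState(0))
assert (samples == repeat).all()
```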