When running hyperparameter tuning on a random forest, I sometimes want to specify a large integer range of values for integer parameters like min_samples_leaf (e.g. ranging from the default value of 1 up to 100).
Whilst I could specify this range using scipy.stats.randint(1, 100), I'd prefer to use a log-uniform distribution as my range covers two orders of magnitude. SciPy has stats.loguniform for continuous rvs, but doesn't seem to have a discretised equivalent.
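To illustrate why log-uniform matters here, a quick sanity check (the cutoff of 10 is just an arbitrary value I picked for the comparison): `randint(1, 100)` puts most of its mass in the upper order of magnitude, whereas `loguniform(1, 100)` splits its mass evenly between [1, 10) and [10, 100):

```python
from scipy.stats import loguniform, randint

u = randint(1, 100).rvs(size=10_000, random_state=0)
g = loguniform(1, 100).rvs(size=10_000, random_state=0)

print((u < 10).mean())  # ≈ 0.09 — uniform: only ~9% of draws below 10
print((g < 10).mean())  # ≈ 0.5  — log-uniform: roughly half of draws below 10
```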
A quick solution for approximating the discretised space is to just sample lots of continuous values and then convert the samples to integers:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor
from scipy.stats import loguniform
import numpy as np

# Draw lots of continuous samples and discretise them, in order to
# approximate a discretised loguniform sample space
def discretised_loguniform_samples(low, high, seed=None, sample_size=100_000):
    float_rvs = loguniform(low, high).rvs(size=sample_size, random_state=seed)
    return float_rvs.round().astype(int)
Usage:
rf_param_distributions = {
    'min_samples_leaf': discretised_loguniform_samples(low=1, high=100, seed=0),
    ...
}
# This will draw n_iter=10 samples from the fixed list of integers created
# above. The list from which the samples are drawn is fixed in advance and
# therefore can't exploit the randomness imparted by the consumable
# random_state argument.
RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=np.random.RandomState(0)),
    param_distributions=rf_param_distributions,
    n_iter=10,
    random_state=np.random.RandomState(0),
    ...
)
The downside is that defining the list of integers in advance gives me only a single static space (however large) that stays fixed throughout the tuning process. I want to exploit the randomness imparted by the consumable random_state= in RandomizedSearchCV*, rather than being limited to a pre-defined list.
How can I modify loguniform in such a way that I get a discretised version of its samples for each call to rvs()?
*RandomizedSearchCV passes down its random_state= parameter to the distribution's rvs() method. The docs seem ambiguous on this point, stating that random_state= is "used for sampling from lists of possible values instead of scipy.stats distributions".
The approach below simply decorates/wraps loguniform.rvs() with a float-to-int function.
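A minimal sketch of that wrapper (the class name loguniform_int is my own invention; the key point is that RandomizedSearchCV only requires the distribution object to expose an rvs() method, through which it passes random_state=):

```python
import numpy as np
from scipy.stats import loguniform


class loguniform_int:
    """Log-uniform distribution whose rvs() yields rounded integers."""

    def __init__(self, a, b):
        self._distribution = loguniform(a, b)

    def rvs(self, *args, **kwargs):
        # Delegate to the continuous distribution, then discretise.
        # RandomizedSearchCV forwards random_state= via **kwargs.
        return self._distribution.rvs(*args, **kwargs).round().astype(int)
```

Usage: pass `loguniform_int(1, 100)` directly as the value of `'min_samples_leaf'` in param_distributions, in place of the pre-drawn array, so each call to rvs() draws fresh integer samples.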