I have a problem building a custom CV splitter for sklearn, and I cannot find out where the failure is. I tried to simplify the problem and the code block, so the split function in the custom splitter GroupShuffleTwoColumnsSplit shown here is for illustrative purposes only and does not perform the specific task it is actually intended for. At the moment it is a simple random split, but even here I get an error. The goal is to 1) split the data into a dev set and a test set, and 2) pass the splitter to the random search so that it derives the train and validation sets from the dev set.
I tried the following (see also the code below):
- defined my custom splitter class with the attributes n_splits, train_size, test_size, random_state
- wrote the split function (here only choosing indices at random) as a generator that yields indices_train and indices_test
- created the data
- split the data a first time to get the dev and test sets (initialized the custom splitter and got the dev and test indices with next())
- initialized the splitter a second time to pass it to the random search, to get the train and validation sets
- defined the random search with a linear regression
- passed the custom splitter to the random search
Here I get an error message: indices are out-of-bounds
On the other hand, I also tried the following:
- replaced my custom splitter with the built-in GroupShuffleSplit. Result: everything is fine, it works!
- skipped the first dev/test split (so I passed all X and y data to the random search). Result: everything is fine, it works.
- initialized the custom splitter only once (and used it both for the dev/test split and in the random search). Result: here I also get an error message: indices are out-of-bounds
Yes, I know that I do not need a custom splitter to split the data randomly, but this is only for simplicity. Later the splitter should solve one specific task (splitting the data based on two columns in a very specific way), but that is not relevant to this problem.
Can anyone help?
```python
import pandas as pd
import numpy as np
from sklearn.model_selection import GroupShuffleSplit
import random
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import RandomizedSearchCV


class GroupShuffleTwoColumnsSplit:
    def __init__(self, n_splits, train_size=None, test_size=None, random_state=None):
        self.n_splits = n_splits
        self.train_size = train_size
        self.test_size = test_size
        self.random_state = random_state
        if self.train_size is None:
            self.train_size = 1 - self.test_size
        if self.test_size:
            self.train_size = 1 - self.test_size

    def split(self, X, y, groups=None):
        series_0 = groups.iloc[:, 0]
        for n in range(self.n_splits):
            indices = series_0.index.tolist()
            ratio = 1 - self.train_size
            num_elements = int(len(indices) * ratio)
            indices_test = random.sample(indices, num_elements)
            indices_train = [x for x in indices if x not in indices_test]
            yield indices_train, indices_test

    def get_n_splits(self, X, y, groups=None):
        return self.n_splits


multiplicator = 10
list_values_1 = [831, 832, 833, 834, 835]
list_values_2 = [1, 2, 3, 4, 5]
col_1 = np.repeat(list_values_1, multiplicator)
col_2 = np.tile(list_values_2, multiplicator)
y_col = np.random.rand(len(list_values_1) * multiplicator)
df = pd.DataFrame({"serie": col_1, "poles": col_2, "y": y_col})
X = df.loc[:, ["serie", "poles"]]
y = df.loc[:, "y"]

### for initial train test split
group_columns = df.loc[:, ["serie", "poles"]]
splitter = GroupShuffleTwoColumnsSplit(n_splits=1, train_size=0.8)
dev_id, test_id = next(splitter.split(X, y, group_columns))
X_dev = df.loc[dev_id, :]
X_test = df.loc[test_id, :]
y_dev = df.loc[dev_id]
y_test = df.loc[test_id]

### for paramsearch
group_columns = X_dev.loc[:, ["serie", "poles"]]
splitter = GroupShuffleTwoColumnsSplit(n_splits=5, train_size=0.8)

# Define hyperparameter search space
param_dist = {
    "fit_intercept": [True, False],
}

# Create a random search object
random_search = RandomizedSearchCV(
    estimator=LinearRegression(),
    param_distributions=param_dist,
    n_iter=10,
    cv=splitter,
)

# Fit the model
random_search.fit(X_dev, y_dev, groups=group_columns)
```
I think it's because you are using the `.index` attribute of a dataframe to get the indices into a `numpy` array. In `split()`, ensure `indices` has values that are valid for indexing into the underlying `numpy` arrays of `X` and `y`. In other words, `indices` should expect to work with `X.to_numpy()` and `y.to_numpy()`.

Example below. It doesn't perform a meaningful split, but just shows how working with array indices (as opposed to dataframe `.index`) resolves the error.
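
A minimal sketch of that idea, under a few assumptions: the class name `PositionalShuffleSplit` is hypothetical, the split is a plain random permutation that ignores `groups`, and `.iloc` is used for the first dev/test split because the yielded values are positions rather than labels. The relevant point is that `split()` only ever yields integers in `range(len(X))`, whatever labels the dataframe carries.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import RandomizedSearchCV


class PositionalShuffleSplit:
    """Illustrative random splitter (hypothetical name) yielding positions 0..n-1."""

    def __init__(self, n_splits, test_size=0.2, random_state=None):
        self.n_splits = n_splits
        self.test_size = test_size
        self.random_state = random_state

    def split(self, X, y=None, groups=None):
        n_samples = len(X)
        n_test = int(n_samples * self.test_size)
        rng = np.random.default_rng(self.random_state)
        for _ in range(self.n_splits):
            # Permute row positions, not DataFrame index labels.
            permuted = rng.permutation(n_samples)
            yield permuted[n_test:], permuted[:n_test]

    def get_n_splits(self, X=None, y=None, groups=None):
        return self.n_splits


data_rng = np.random.default_rng(0)
df = pd.DataFrame({
    "serie": np.repeat([831, 832, 833, 834, 835], 10),
    "poles": np.tile([1, 2, 3, 4, 5], 10),
    "y": data_rng.random(50),
})
X, y = df[["serie", "poles"]], df["y"]

# First split into dev and test: index with .iloc, since these are positions.
dev_pos, test_pos = next(PositionalShuffleSplit(1, random_state=0).split(X, y))
X_dev, y_dev = X.iloc[dev_pos], y.iloc[dev_pos]
X_test, y_test = X.iloc[test_pos], y.iloc[test_pos]

# X_dev keeps its original labels (possibly as large as 49) but has only
# 40 rows; the splitter yields positions in range(40), so nothing is out of bounds.
search = RandomizedSearchCV(
    estimator=LinearRegression(),
    param_distributions={"fit_intercept": [True, False]},
    n_iter=2,
    cv=PositionalShuffleSplit(5, random_state=1),
)
search.fit(X_dev, y_dev, groups=X_dev[["serie", "poles"]])
```

If the real splitter has to choose rows based on the values in `groups`, one option is to compute the selection on the labels as before and then translate it to positions, for example with a boolean mask and `np.flatnonzero`, before yielding.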