Problems with building a custom cv splitter for sklearn


I am having trouble building a custom CV splitter for sklearn and cannot find where it fails. I have simplified the problem and the code: the split function of the custom splitter GroupShuffleTwoColumnsSplit shown here is for illustration only and does not yet do the task it is ultimately meant for. At the moment it is a simple random split, but even this produces an error. The goal is to 1) split the data into a dev set and a test set, and 2) pass the splitter to RandomizedSearchCV so that the search splits the dev set into train and validation sets.

Here is what I tried (see also the code below):

  • defined my custom splitter class with the attributes n_splits, train_size, test_size, random_state
  • wrote the split function (here it only chooses indices at random) as a generator that yields indices_train and indices_test
  • created the data
  • split the data a first time to get the dev and test sets (initialized the custom splitter and got the dev and test indices with next())
  • initialized the splitter a second time to pass it to the random search, to get the train and validation sets
  • defined the random search with linear regression
  • passed the custom splitter to the random search

Here I get an error message: indices are out-of-bounds

I also tried the following:

  • replaced my custom splitter with the built-in GroupShuffleSplit. Result: everything is fine, it works.
  • skipped the first dev/test split (so I passed all of X and y to the random search). Result: everything is fine, it works.
  • initialized the custom splitter only once (and used that instance both for the dev/test split and in the random search). Result: I again get the error message: indices are out-of-bounds

Yes, I know that I do not need a custom splitter to split the data randomly; this is only for simplicity. Later the splitter should solve one specific task (splitting the data based on two columns in a very specific way), but that is not relevant to this problem.

Can anyone help?

import pandas as pd
import numpy as np
from sklearn.model_selection import GroupShuffleSplit
import random
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import RandomizedSearchCV


class GroupShuffleTwoColumnsSplit:

    def __init__(self, n_splits, train_size=None, test_size=None, random_state=None):
        self.n_splits = n_splits
        self.train_size = train_size
        self.test_size = test_size
        self.random_state = random_state

        if self.train_size is None:
            self.train_size = 1 - self.test_size
        if self.test_size:
            self.train_size = 1 - self.test_size

    def split(self, X, y, groups=None):

        series_0 = groups.iloc[:, 0]

        for n in range(self.n_splits):

            indices = series_0.index.tolist()

            ratio = 1 - self.train_size
            num_elements = int(len(indices) * ratio)
            indices_test = random.sample(indices, num_elements)
            indices_train = [x for x in indices if x not in indices_test]
            yield indices_train, indices_test

    def get_n_splits(self, X, y, groups=None):
        return self.n_splits


multiplicator = 10
list_values_1 = [831, 832, 833, 834, 835]
list_values_2 = [1, 2, 3, 4, 5]
col_1 = np.repeat(list_values_1, multiplicator)
col_2 = np.tile(list_values_2, multiplicator)
y_col = np.random.rand(len(list_values_1) * multiplicator)

df = pd.DataFrame({"serie": col_1, "poles": col_2, "y": y_col})
X = df.loc[:, ["serie", "poles"]]
y = df.loc[:, "y"]

### for initial train test split

group_columns = df.loc[:, ["serie", "poles"]]
splitter = GroupShuffleTwoColumnsSplit(n_splits=1, train_size=0.8)


dev_id, test_id = next(splitter.split(X, y, group_columns))
X_dev = df.loc[dev_id, :]
X_test = df.loc[test_id, :]
y_dev = df.loc[dev_id]
y_test = df.loc[test_id]

### for paramsearch

group_columns = X_dev.loc[:, ["serie", "poles"]]
splitter = GroupShuffleTwoColumnsSplit(n_splits=5, train_size=0.8)


# Define hyperparameter search space
param_dist = {
    "fit_intercept": [True, False],
}

# Create a random search object
random_search = RandomizedSearchCV(
    estimator=LinearRegression(),
    param_distributions=param_dist,
    n_iter=10,
    cv=splitter,
)

# Fit the model
random_search.fit(X_dev, y_dev, groups=group_columns)

There is 1 answer below

Muhammed Yunus On

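I think it's because you are using the .index attribute of a dataframe to get indices that are then used to index into numpy arrays. sklearn applies the indices yielded by split() positionally, whereas .index holds row labels. After the first dev/test split, X_dev has fewer rows but keeps its original labels, so some labels can be larger than the number of rows and are then out of bounds. Without the first split, labels and positions coincide for the default index, which is why that case works. In split(), ensure indices holds values that are valid for indexing into the underlying numpy arrays of X and y. In other words, the indices should work with X.to_numpy() and y.to_numpy().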
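To make the label-versus-position difference concrete, here is a minimal sketch on a toy dataframe. The frame, the column name "a" and the chosen row subset are made up for illustration and are not taken from the question's data.

```python
import numpy as np
import pandas as pd

# Toy frame with a default RangeIndex: labels 0..9 equal positions 0..9.
df = pd.DataFrame({"a": np.arange(10)})

# Keep a subset of rows, as a dev/test split would; the row labels are kept too.
dev = df.loc[[0, 2, 4, 6, 8, 9]]

labels = dev.index.to_numpy()     # row labels of the subset
positions = np.arange(len(dev))   # valid positions into dev.to_numpy()

# Any label >= len(dev) cannot be used as a position.
out_of_bounds = labels[labels >= len(dev)]

try:
    dev.to_numpy()[labels]        # positional indexing with row labels
    label_indexing_failed = False
except IndexError:
    label_indexing_failed = True

print(labels, positions, out_of_bounds, label_indexing_failed)
```

On the full frame, labels and positions happen to be identical, so the same mistake goes unnoticed until a subset is passed in.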

Example below. It doesn't perform a meaningful split, but just shows how working with array indices (as opposed to dataframe .index) resolves the error.

...
    def split(self, X, y, groups=None):
        series_0 = groups.iloc[:, 0]

        for n in range(self.n_splits):
            indices = np.arange(len(series_0)).tolist()  # don't use series_0.index

            ...

            # sample num_elements without replacement
            indices_test = np.random.choice(indices, size=num_elements, replace=False)

            # remaining indices are for training
            indices_train = [x for x in indices if x not in indices_test]

            yield indices_train, indices_test
...
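For completeness, here is one way the whole flow could look end to end once the splitter yields positions. This is only a sketch under assumptions: the class name PositionalShuffleSplit is hypothetical, the random permutation is a stand-in for the real two-column logic, np.random.default_rng replaces the random module used in the question, and the test size is rounded to avoid floating-point truncation. Because the splitter returns positions, the first dev/test split uses .iloc rather than .loc.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import RandomizedSearchCV


class PositionalShuffleSplit:
    """Hypothetical stand-in splitter that yields positions, not row labels."""

    def __init__(self, n_splits, train_size=0.8, random_state=None):
        self.n_splits = n_splits
        self.train_size = train_size
        self.random_state = random_state

    def split(self, X, y=None, groups=None):
        # groups is accepted for API compatibility but unused in this sketch.
        rng = np.random.default_rng(self.random_state)
        n_samples = len(X)
        n_test = int(round(n_samples * (1 - self.train_size)))
        for _ in range(self.n_splits):
            permuted = rng.permutation(n_samples)  # positions 0..n_samples-1
            yield permuted[n_test:], permuted[:n_test]

    def get_n_splits(self, X=None, y=None, groups=None):
        return self.n_splits


# Same shape of data as in the question: 50 rows, two feature columns.
data_rng = np.random.default_rng(0)
df = pd.DataFrame({
    "serie": np.repeat([831, 832, 833, 834, 835], 10),
    "poles": np.tile([1, 2, 3, 4, 5], 10),
    "y": data_rng.random(50),
})
X, y = df[["serie", "poles"]], df["y"]

# First split: dev/test, selected by position with .iloc.
outer = PositionalShuffleSplit(n_splits=1, train_size=0.8, random_state=0)
dev_pos, test_pos = next(outer.split(X, y))
X_dev, y_dev = X.iloc[dev_pos], y.iloc[dev_pos]
X_test, y_test = X.iloc[test_pos], y.iloc[test_pos]

# Second split: train/validation inside the search, on the dev set only.
search = RandomizedSearchCV(
    estimator=LinearRegression(),
    param_distributions={"fit_intercept": [True, False]},
    n_iter=2,  # the search space has only two candidates
    cv=PositionalShuffleSplit(n_splits=5, train_size=0.8, random_state=1),
    random_state=0,
)
search.fit(X_dev, y_dev)
print(search.best_params_)
```

If the real splitting logic needs row labels (for example to group by column values), an alternative is to compute the split on labels and translate back to positions before yielding, or to call reset_index(drop=True) on the dev set before the search.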