Equivalent of scikit-learn's GroupShuffleSplit in dask-ml?

149 Views Asked by karldw At 17 August 2025 at 16:51

I'd like to split my data into training and test sets, but I have repeated observations of the same people over time, so the split has to keep each person's observations entirely in one set: no person should appear in both the training and the test data. In scikit-learn I'd do this with GroupShuffleSplit, something like:

import numpy as np
from sklearn.model_selection import GroupShuffleSplit

X = np.array([0.1, 0.2, 2.2, 2.4, 2.3, 4.55, 5.8, 0.001])
y = np.array(["a", "b", "b", "b", "c", "c", "c", "a"])
groups = np.array([1, 1, 2, 2, 3, 3, 4, 4])

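# test_size is the share of groups (people), not rows, held out for testing
gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
# split() yields (train_indices, test_indices); take the single split
train, test = next(gss.split(X, y, groups=groups))

X_train, y_train = X[train], y[train]
X_test, y_test = X[test], y[test]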

How can I do this with Dask or Dask-ML?
