I'd like to split my data into training and test sets, but I have repeated observations of people over time, so I need to split in a way that no person has observations in both the training and the test set. In scikit-learn, I'd do this kind of splitting with GroupShuffleSplit, something like this:
```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

X = np.array([0.1, 0.2, 2.2, 2.4, 2.3, 4.55, 5.8, 0.001])
y = np.array(["a", "b", "b", "b", "c", "c", "c", "a"])
groups = np.array([1, 1, 2, 2, 3, 3, 4, 4])

gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train, test = next(gss.split(X, y, groups=groups))

X_train, y_train = X[train], y[train]
X_test, y_test = X[test], y[test]
```
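To make the behavior I'm after concrete, here is a rough NumPy-only sketch of a group-wise split. It is not how GroupShuffleSplit is implemented; the helper name `group_train_test_indices` and its `seed` parameter are made up for illustration, and I'm assuming the test fraction applies to the number of groups rather than the number of rows.

```python
import numpy as np

def group_train_test_indices(groups, test_size=0.2, seed=0):
    """Hypothetical helper: return (train_idx, test_idx) so that each
    group's rows land entirely in one of the two sets."""
    groups = np.asarray(groups)
    unique_groups = np.unique(groups)

    # Shuffle the group ids, not the rows, so a group is never split up.
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(unique_groups)

    # Assumption: test_size is a fraction of groups; keep at least one.
    n_test = max(1, int(np.ceil(test_size * len(unique_groups))))
    test_groups = shuffled[:n_test]

    # Boolean mask over rows: True where the row's group is a test group.
    test_mask = np.isin(groups, test_groups)
    return np.flatnonzero(~test_mask), np.flatnonzero(test_mask)

groups = np.array([1, 1, 2, 2, 3, 3, 4, 4])
train_idx, test_idx = group_train_test_indices(groups, test_size=0.2, seed=0)

# No group id appears on both sides of the split.
assert not set(groups[train_idx]) & set(groups[test_idx])
```

The key step is the row mask built from a small set of test group ids, which is the part I don't know how to express well on larger-than-memory data.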
How can I do this with Dask or Dask-ML?