I have a dataset with the columns speaker_id, class, and sample_id. There are 7 classes of different sizes, each containing a number of speakers. The number of samples per speaker varies: some speakers have many samples, others only a few. My goal is to split the samples within each class into approximately 80/10/10 for the train, test, and validation sets, while ensuring that each speaker, together with all of their samples, appears in only one set.
If the proportions can't be met otherwise, I am open to deleting some samples from the speakers with the most samples.
I have tried various approaches, such as splitting the dataset into the 7 classes and applying GroupShuffleSplit to each one. Is there an algorithm that produces a speaker-disjoint split with these per-class proportions?
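
To make the requirement concrete, here is a minimal sketch of one possible approach: a greedy assignment that, within each class, hands out speakers from largest to smallest to whichever set is currently furthest below its target sample count. It is not a known library routine. The function name `split_speakers`, the `RATIOS` constant, and the set names are hypothetical, and the sketch assumes each speaker belongs to exactly one class.

```python
import random
from collections import defaultdict

# Target share of samples per set (assumed names and values, from the 80/10/10 goal).
RATIOS = {"train": 0.8, "test": 0.1, "val": 0.1}


def split_speakers(rows, ratios=RATIOS, seed=0):
    """Assign every speaker to exactly one set.

    rows: iterable of (speaker_id, class, sample_id) tuples.
    Returns a dict {speaker_id: set_name}.
    Assumes a speaker never appears in more than one class.
    """
    # Count samples per speaker, grouped by class.
    counts = defaultdict(lambda: defaultdict(int))
    for speaker_id, cls, _sample_id in rows:
        counts[cls][speaker_id] += 1

    rng = random.Random(seed)
    assignment = {}
    for cls in sorted(counts):
        speakers = list(counts[cls].items())
        rng.shuffle(speakers)  # randomize order among equally sized speakers
        speakers.sort(key=lambda item: item[1], reverse=True)  # stable, largest first

        total = sum(n for _, n in speakers)
        filled = {name: 0 for name in ratios}
        for speaker_id, n in speakers:
            # Choose the set with the largest remaining shortfall in samples.
            target = max(ratios, key=lambda name: ratios[name] * total - filled[name])
            assignment[speaker_id] = target
            filled[target] += n
    return assignment
```

Mapping the result back onto the data (for example, looking up each row's speaker_id in the returned dict) gives the set for every sample. Because whole speakers are moved, the achieved proportions are only approximate, and classes with very few speakers may end up with an empty test or validation set; capping the sample count of the largest speakers before calling the function would tighten the fit.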