How can I do a stratified downsampling?

359 Views Asked by Rafael Garcia At 27 July 2025 at 15:59

I need to build a classification model for protein sequences using machine learning techniques. Each observation is classified as either a 0 or a 1. However, I noticed that my training set contains a total of 170 000 observations, of which only 5000 are labeled as 1. Therefore, I wish to down sample the number of observations labeled as 0 to 5000.

One of the features I am currently using in the model is the length of the sequence. How can I down sample the data for my class 0 while making sure the distribution of length_sequence remains similar to the one in my class 1?

Here is the histogram of length_sequence for class 1:

Here is the histogram of length_sequence for class 0:

You can see that in both cases, the lengths go from 2 to 255 characters. However, class 0 has many more observations, and they also tend to be significantly longer than the ones seen in class 1.

How can I down sample class 0 and make the new histogram look similar to the one in class 1?

I am trying to do stratified down sampling with scikit-learn, but I'm stuck.
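
One possible approach, sketched here under some assumptions, is to bin length_sequence and then draw from class 0 the same number of observations per bin that class 1 has in that bin. This is not a scikit-learn function: the name `downsample_matching_lengths` and the parameters `bin_width` and `seed` are hypothetical, chosen only for illustration, and the sketch uses just the Python standard library. If a bin holds fewer class 0 observations than class 1 observations, it keeps all of them, so the result can be slightly smaller than the class 1 set.

```python
import random
from collections import Counter, defaultdict


def downsample_matching_lengths(majority_lengths, minority_lengths,
                                bin_width=10, seed=0):
    """Return indices into majority_lengths whose binned length
    distribution follows that of minority_lengths.

    Hypothetical helper for illustration; bin_width is an assumption
    and should be tuned to the data.
    """
    rng = random.Random(seed)

    # How many minority-class (class 1) sequences fall into each length bin.
    target_counts = Counter(length // bin_width for length in minority_lengths)

    # Group majority-class (class 0) row indices by the same length bins.
    majority_bins = defaultdict(list)
    for idx, length in enumerate(majority_lengths):
        majority_bins[length // bin_width].append(idx)

    selected = []
    for bin_id, n_wanted in target_counts.items():
        candidates = majority_bins.get(bin_id, [])
        # If a bin has too few class 0 rows, keep all of them.
        n_take = min(n_wanted, len(candidates))
        # Sample without replacement inside the bin.
        selected.extend(rng.sample(candidates, n_take))
    return sorted(selected)
```

The returned indices refer to positions in the class 0 data, so they can be used to select the matching rows before combining them with the class 1 rows. A smaller `bin_width` matches the class 1 histogram more closely but makes it more likely that some bins run short of class 0 observations.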
