How can I do a stratified downsampling?


I need to build a classification model for protein sequences using machine learning techniques. Each observation is labeled either 0 or 1. However, my training set contains 170,000 observations, of which only 5,000 are labeled 1. I therefore want to downsample the observations labeled 0 to 5,000.

One of the features I am currently using in the model is the length of the sequence. How can I downsample class 0 while making sure the distribution of length_sequence stays similar to the one in class 1?

Here is the histogram of length_sequence for class 1: (image not included)

Here is the histogram of length_sequence for class 0: (image not included)

In both cases, the lengths range from 2 to 255 characters. However, class 0 has many more observations, and its sequences tend to be significantly longer than the ones in class 1.

How can I downsample class 0 so that the new histogram looks similar to the one for class 1?

I am trying to do stratified downsampling with scikit-learn, but I'm stuck.
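One possible way to approach this, sketched here with pandas and NumPy rather than scikit-learn, is to bin length_sequence with shared bin edges, count how many class 1 observations fall in each bin, and then draw that many class 0 observations from the same bin. The column name `label`, the function name `downsample_to_match`, the number of bins, and the synthetic data are all assumptions made for illustration; only `length_sequence` comes from the question.

```python
import numpy as np
import pandas as pd


def downsample_to_match(df, label_col="label", feature_col="length_sequence",
                        n_bins=20, seed=0):
    """Downsample class 0 so its per-bin counts follow class 1 (hypothetical helper)."""
    pos = df[df[label_col] == 1]
    neg = df[df[label_col] == 0]

    # Shared, equal-width bin edges over the full range of the feature.
    edges = np.linspace(df[feature_col].min(), df[feature_col].max(), n_bins + 1)
    pos_bin = pd.cut(pos[feature_col], bins=edges, labels=False, include_lowest=True)
    neg_bin = pd.cut(neg[feature_col], bins=edges, labels=False, include_lowest=True)

    # Number of class 1 observations in each bin: the target for class 0.
    target_counts = pos_bin.value_counts()

    sampled = []
    for bin_id, group in neg.groupby(neg_bin.to_numpy()):
        # Cannot draw more rows than the bin holds when sampling without replacement.
        k = min(int(target_counts.get(bin_id, 0)), len(group))
        if k > 0:
            sampled.append(group.sample(n=k, random_state=seed))

    neg_down = pd.concat(sampled) if sampled else neg.iloc[0:0]
    # Combine both classes and shuffle the rows.
    return pd.concat([pos, neg_down]).sample(frac=1, random_state=seed)


# Synthetic stand-in data (assumption): class 0 is larger and skews longer.
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "length_sequence": np.concatenate([
        rng.integers(2, 120, size=500),     # class 1
        rng.integers(2, 256, size=20000),   # class 0
    ]),
    "label": np.concatenate([np.ones(500, dtype=int), np.zeros(20000, dtype=int)]),
})
balanced = downsample_to_match(data)
print(balanced["label"].value_counts())
```

If a bin holds fewer class 0 observations than class 1 observations, this sketch simply takes all of them, so the downsampled class 0 can end up slightly smaller than class 1. The bin width is a trade-off: narrower bins match the histogram more closely but make such shortfalls more likely.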
