I need to build a binary classification model for protein sequences using machine learning. Each observation is labeled either 0 or 1. However, my training set contains 170 000 observations, of which only 5000 are labeled 1. I therefore want to downsample the observations labeled 0 to 5000.
One of the features I currently use in the model is the length of the sequence. How can I downsample class 0 while making sure the distribution of length_sequence stays similar to the one in class 1?
Here is the histogram of length_sequence for class 1:
Here is the histogram of length_sequence for class 0:
You can see that in both cases the lengths range from 2 to 255 characters. However, class 0 has many more observations, and its sequences tend to be significantly longer than the ones in class 1.
How can I downsample class 0 so that the new histogram looks similar to the one for class 1?
I am trying to do stratified downsampling with scikit-learn, but I'm stuck.
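To make the goal concrete, here is a minimal sketch of one possible approach, written with pandas rather than scikit-learn: bin the lengths, then draw from class 0 as many rows per bin as class 1 has in that bin. The DataFrame layout, the column name `label`, the function name `downsample_matching_length`, and the `bin_width` of 10 are all assumptions made for illustration; only `length_sequence` comes from my data.

```python
import numpy as np
import pandas as pd


def downsample_matching_length(df, label_col="label",
                               length_col="length_sequence",
                               bin_width=10, seed=0):
    """Downsample class 0 so its length histogram follows class 1.

    Hypothetical helper: column names and bin_width are assumptions.
    """
    lengths = df[length_col]
    # Left-closed bins of fixed width covering the full range of lengths.
    edges = np.arange(lengths.min(), lengths.max() + bin_width + 1, bin_width)
    data = df.assign(_bin=pd.cut(lengths, bins=edges, right=False, labels=False))

    pos = data[data[label_col] == 1]
    neg = data[data[label_col] == 0]

    # Number of class 1 rows in each length bin: the target per-bin counts.
    target = pos["_bin"].value_counts()

    parts = [pos]
    for b, group in neg.groupby("_bin"):
        # Take as many class 0 rows as class 1 has in this bin,
        # capped by how many class 0 rows are actually available.
        n = min(int(target.get(b, 0)), len(group))
        if n > 0:
            parts.append(group.sample(n=n, random_state=seed + int(b)))

    # Combine, drop the helper column, and shuffle the rows.
    out = pd.concat(parts).drop(columns="_bin")
    return out.sample(frac=1, random_state=seed)
```

Calling `balanced = downsample_matching_length(df)` would return all class 1 rows plus the sampled class 0 rows. If class 0 has fewer rows than class 1 in some bin, the result ends up with slightly fewer than 5000 class 0 rows, and a smaller `bin_width` matches the histogram more closely at the cost of hitting that cap more often.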