I am trying to do random oversampling on a small dataset for linear regression. However, it seems the scikit-learn sampling API doesn't work with float values as the target variable. Is there any way to solve this?
This is a sample of my y_train values, which are log transformed.
3.688879 3.828641 3.401197 3.091042 4.624973
from imblearn.over_sampling import RandomOverSampler
X_over, y_over = RandomOverSampler(random_state=42).fit_resample(X_train, y_train)
--------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-53-036424abd2bd> in <module>
1 from imblearn.over_sampling import RandomOverSampler
~\Anaconda3\lib\site-packages\imblearn\base.py in fit_resample(self, X, y)
73 The corresponding label of `X_resampled`.
74 """
---> 75 check_classification_targets(y)
76 arrays_transformer = ArraysTransformer(X, y)
77 X, y, binarize_y = self._check_X_y(X, y)
~\Anaconda3\lib\site-packages\sklearn\utils\multiclass.py in check_classification_targets(y)
170 if y_type not in ['binary', 'multiclass', 'multiclass-multioutput',
171 'multilabel-indicator', 'multilabel-sequences']:
--> 172 raise ValueError("Unknown label type: %r" % y_type)
173
174
ValueError: Unknown label type: 'continuous'
Re-sampling strategies are not meant for regression problems. Hence, RandomOverSampler will not accept float (continuous) targets, which is what the "Unknown label type: 'continuous'" error reports. There are approaches to re-sample data with continuous targets, though. One example is reg_resample: its resampler object generates pseudo-classes based on your target values and then uses a classic re-sampling object from the imblearn package to re-sample your data. Note that the data you pass to the resampler object should contain all data, including the targets.
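To make the pseudo-class idea concrete, here is a minimal sketch that reproduces it by hand with NumPy only. It is not the reg_resample API and does not call imblearn; the function name oversample_by_bins, the n_bins parameter, the equal-width binning, and the toy feature matrix are all assumptions made for illustration. It bins the continuous target, then draws extra rows with replacement from each bin until every bin is as large as the biggest one.

```python
import numpy as np


def oversample_by_bins(X, y, n_bins=3, random_state=42):
    """Hypothetical helper: random oversampling on binned (pseudo-class) targets."""
    X = np.asarray(X)
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(random_state)

    # Interior edges of n_bins equal-width bins spanning the target range.
    edges = np.linspace(y.min(), y.max(), n_bins + 1)[1:-1]
    # Pseudo-class of each row: an integer in 0 .. n_bins - 1.
    pseudo_class = np.digitize(y, edges)

    labels, counts = np.unique(pseudo_class, return_counts=True)
    target_count = counts.max()

    # Keep every original row, then add duplicates for the smaller bins.
    chosen = [np.arange(len(y))]
    for label, count in zip(labels, counts):
        members = np.flatnonzero(pseudo_class == label)
        extra = rng.choice(members, size=target_count - count, replace=True)
        chosen.append(extra)

    idx = np.concatenate(chosen)
    return X[idx], y[idx]


# Target values from the question; the features are made-up placeholders.
y_train = np.array([3.688879, 3.828641, 3.401197, 3.091042, 4.624973])
X_train = np.arange(10).reshape(5, 2)

X_over, y_over = oversample_by_bins(X_train, y_train, n_bins=3)
print(X_over.shape, y_over.shape)  # (6, 2) (6,)
```

With three bins the sample targets fall into groups of 2, 2 and 1 rows, so the single row in the top bin is duplicated once. The number of bins is a judgment call: too many bins on a small dataset leaves most bins with one row, and the oversampling then just duplicates individual points.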