Will oversampling lead to an overfitted model?


The target attribute distribution is currently like this:

mydata.groupBy("Churn").count().show()

+-----+-----+
|Churn|count|
+-----+-----+
|    1|  483|
|    0| 2850|
+-----+-----+

My questions are:

  • Do oversampling methods (manual duplication, SMOTE, ADASYN) use the available data to create new data points?

  • If we use such data to train a classification model, will it not be an overfitted one?

Best answer:

My question was: will any oversampling method (manual, SMOTE, ADASYN) use the available data to create new data points?

  • Class imbalance is mostly handled in three ways:
    1. Over-sample the minority class.
    2. Under-sample the majority class.
    3. Synthesize new minority-class samples.
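The first two approaches can be sketched in a few lines of plain Python (a toy illustration, not the answer's own code, using random duplication for over-sampling and random dropping for under-sampling):

```python
import random

random.seed(0)

# Toy imbalanced label list: 6 majority (0) and 2 minority (1) samples.
labels = [0, 0, 0, 0, 0, 0, 1, 1]
minority = [y for y in labels if y == 1]
majority = [y for y in labels if y == 0]

# 1. Over-sample the minority class: duplicate random minority points
#    until both classes have the same count.
oversampled = majority + random.choices(minority, k=len(majority))

# 2. Under-sample the majority class: keep a random subset of majority
#    points, matching the minority count.
undersampled = random.sample(majority, k=len(minority)) + minority

print(len(oversampled), len(undersampled))  # 12 4
```

Note that over-sampling here only repeats existing points; no new information is created, which is exactly where the overfitting concern comes from.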

SMOTE (Synthetic Minority Over-sampling TEchnique) falls under the third approach: it creates new synthetic minority-class samples from the existing dataset.

SMOTE works as follows: for each minority-class sample, find its k nearest minority-class neighbours, pick one at random, and create a synthetic point somewhere on the line segment between the sample and that neighbour.

So, this is a bit smarter than simply duplicating existing points.
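The interpolation step described above can be sketched with NumPy (a minimal sketch of SMOTE's core idea only, not the full algorithm; `smote_sketch` is a hypothetical helper, not a library function):

```python
import numpy as np

rng = np.random.default_rng(42)

def smote_sketch(X_min, n_new, k=2):
    """Minimal SMOTE-style interpolation: pick a random minority sample,
    one of its k nearest minority neighbours, and a random point on the
    line segment between them."""
    new_points = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        # Distances from sample i to every minority sample.
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbours = np.argsort(d)[1:k + 1]  # skip the point itself
        j = rng.choice(neighbours)
        lam = rng.random()                   # interpolation factor in [0, 1)
        new_points.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(new_points)

# Three minority points in 2-D; generate four synthetic ones between them.
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
synthetic = smote_sketch(X_min, n_new=4)
print(synthetic.shape)  # (4, 2)
```

Because every synthetic point is a convex combination of two real minority points, the new data stays inside the region the minority class already occupies.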

If we use such data to build a classification model, will it not be an overfitted one?

The correct answer is: probably. Plain duplication makes the model see the same minority points many times, which encourages memorizing them; synthetic methods like SMOTE reduce that risk but do not remove it. Give it a try!

This is why we use test sets and cross-validation: to estimate how the model will perform on unseen data.
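One practical caveat worth adding (standard good practice, assumed rather than stated in the answer): split first, then oversample only the training portion, so no duplicated or synthetic point leaks into the test set. A plain-Python sketch:

```python
import random

random.seed(1)

# Toy dataset: (feature, label); label 1 is the minority class.
majority = [(float(i), 0) for i in range(10)]
minority = [(float(i) + 0.5, 1) for i in range(3)]

# Split FIRST: hold out the last rows of each class as an untouched test set.
train = majority[:7] + minority[:2]
test = majority[7:] + minority[2:]

# Oversample the minority class in the training split ONLY.
train_minority = [row for row in train if row[1] == 1]
train_majority = [row for row in train if row[1] == 0]
train_balanced = train_majority + random.choices(train_minority,
                                                 k=len(train_majority))

print(len(train_balanced), len(test))  # 14 4
```

Evaluating on the untouched test set gives an honest estimate; oversampling before splitting would let copies of the same point land on both sides and inflate the score.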