How to use a dictionary with SMOTE to resample each class of multi-class data differently?

I want to oversample my data with the SMOTE algorithm in Python, using imblearn.over_sampling. My input data has four target classes. I don't want to oversample every minority class up to the size of the majority class; I want to oversample each minority class to a different size.

When I use SMOTE(sampling_strategy=1, k_neighbors=2, random_state=1000), I get the following error:

ValueError: "sampling_strategy" can be a float only when the type of target is binary. For multi-class, use a dict.

Following the error message, I passed a dictionary as "sampling_strategy":

SMOTE(sampling_strategy={'1.0': 70, '3.0': 255, '2.0': 50, '0.0': 150}, k_neighbors=2, random_state=1000)

But this gives the following error:

ValueError: The {'2.0', '1.0', '0.0', '3.0'} target class is/are not present in the data.

Does anyone know how to define the dictionary so that SMOTE oversamples each class differently?

There is 1 best solution below

Put the number of samples you want for each class into a dictionary keyed by class label, and pass that dictionary as sampling_strategy to the SMOTE object.

Code:

import numpy as np
from collections import Counter
from imblearn.over_sampling import SMOTE

# 200 samples with 13 random integer features
x1 = np.random.randint(500, size=(200, 13))
# Imbalanced integer labels: 100, 65, 25 and 10 samples for classes 0 to 3
y1 = np.array([0] * 100 + [1] * 65 + [2] * 25 + [3] * 10)
np.random.shuffle(y1)
Counter(y1)

Output:

Counter({0: 100, 1: 65, 2: 25, 3: 10})

Code:

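# Keys are the class labels exactly as they appear in y1;
# values are the number of samples wanted per class after resampling
sm = SMOTE(sampling_strategy={0: 100, 1: 70, 2: 90, 3: 40})
X_res, y_res = sm.fit_resample(x1, y1)
Counter(y_res)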

Output:

Counter({0: 100, 2: 90, 1: 70, 3: 40})
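
If you would rather not type the whole dictionary by hand, you can derive it from the current class counts. The following is only a sketch: build_sampling_strategy is a hypothetical helper (it is not part of imblearn), and it assumes that an over-sampler will not accept a target below a class's current size, so it checks for that up front.

```python
from collections import Counter


def build_sampling_strategy(y, targets):
    """Return {label: n_samples} covering every label found in y.

    Hypothetical helper, not an imblearn function. `targets` maps a label
    to the count wanted after resampling; labels that are not listed keep
    their current count.
    """
    counts = Counter(y)

    # Catch mistyped labels before they ever reach the sampler.
    unknown = [label for label in targets if label not in counts]
    if unknown:
        raise KeyError(f"labels not present in y: {unknown}")

    strategy = {}
    for label, current in counts.items():
        wanted = targets.get(label, current)
        # Assumption: over-sampling can only add samples, never remove them.
        if wanted < current:
            raise ValueError(
                f"class {label!r} has {current} samples, cannot oversample to {wanted}"
            )
        strategy[label] = wanted
    return strategy


# Same class distribution as in the example above.
y = [0] * 100 + [1] * 65 + [2] * 25 + [3] * 10
strategy = build_sampling_strategy(y, {1: 70, 2: 90, 3: 40})
print(strategy)
```

The resulting dictionary can then be passed as sampling_strategy when constructing SMOTE.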

For more information, see the description of the sampling_strategy parameter in the imblearn documentation for SMOTE.

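You get the error because the keys in your dictionary don't match the actual labels in your data. Your keys are strings such as '1.0', while the labels in your target are most likely numeric (for example 1.0), so none of the keys is found among the classes. Use the labels exactly as they appear in the target.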
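
To make the mismatch concrete, here is a small sketch in plain Python. It assumes the labels in your target are the floats 0.0 to 3.0, which is a guess based on the keys you used; check your own data with np.unique(y) or Counter(y) first.

```python
# Assumed labels; replace with the unique values of your own target.
labels = [0.0, 1.0, 2.0, 3.0]

# The dictionary from the question, keyed by strings.
string_keyed = {'1.0': 70, '3.0': 255, '2.0': 50, '0.0': 150}

# None of the string keys equals a numeric label, hence "not present in the data".
missing = set(string_keyed) - set(labels)
print(sorted(missing))

# Convert the keys to the same type as the labels.
fixed = {float(key): value for key, value in string_keyed.items()}
print(set(fixed) <= set(labels))
```

With keys of the right type, the call becomes SMOTE(sampling_strategy=fixed, k_neighbors=2, random_state=1000).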