Identify the Synthetic Samples generated by SMOTE

58 Views Asked by Arindam At 24 January 2024 at 06:04

I have a labeled dataset where X has shape 7000 x 2400 and y has 7000 entries. The data is heavily imbalanced, so I am generating synthetic samples with SMOTE. However, I want to identify which samples SMOTE actually generated. As an example, here's a code snippet:

import pandas as pd
import numpy as np
from sklearn.datasets import load_iris
from imblearn.over_sampling import SMOTE

iris = load_iris()

X = iris['data']
y = iris['target']

# The iris data is balanced (50 samples per class), so I intentionally
# drop the last 25 samples to make the third class a minority
X = X[:125, :]
y = y[:125]

oversample = SMOTE()
X_smt, y_smt = oversample.fit_resample(X, y)

The arrays X_smt and y_smt contain both the original samples and the synthetic samples. Is there a simple way to identify the synthetic samples by index or some other mechanism?


There is 1 best solution below

Arindam On 29 January 2024 at 13:53 BEST ANSWER

I really feel stupid... the answer is that simple. It seems SMOTE just appends the new samples after the original samples. Adding these two lines shows this for the example above.

for i in range(X_smt.shape[0]):
  print(any(np.array_equal(X_smt[i],j) for j in X),i)

What we are doing is looking up each row of X_smt in X. Since X has 125 rows (indices 0 to 124), each of the first 125 rows of X_smt should be found in X, whereas rows indexed from 125 onwards shouldn't be in X (unless a synthetic row happens to coincide exactly with an original one). The printed output confirms this. Feel free to run the notebook here
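If the append-at-the-end behaviour observed above holds, the row-by-row search can be replaced by a single prefix comparison plus an index mask. Below is a minimal sketch of that idea, not part of imblearn: synthetic_mask is a hypothetical helper name, and the small X_demo / X_demo_res arrays are made-up stand-ins for X and X_smt so the snippet runs with NumPy alone. It assumes the resampler keeps the original rows first and in their original order, and it raises an error rather than guessing when that assumption fails.

```python
import numpy as np


def synthetic_mask(X_orig, X_res):
    """Boolean mask over X_res: True for rows appended after the originals.

    Hypothetical helper. Assumes the resampler kept the original rows
    first, in order, and appended the new rows after them.
    """
    X_orig = np.asarray(X_orig)
    X_res = np.asarray(X_res)
    n_orig = X_orig.shape[0]
    # Check the assumption instead of trusting it blindly.
    if X_res.shape[0] < n_orig or not np.array_equal(X_res[:n_orig], X_orig):
        raise ValueError("resampled array does not start with the original rows")
    return np.arange(X_res.shape[0]) >= n_orig


# Made-up stand-in data: three "original" rows followed by two appended rows.
X_demo = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]])
X_demo_res = np.vstack([X_demo, [[0.5, 0.5], [0.25, 0.25]]])

mask = synthetic_mask(X_demo, X_demo_res)
synthetic_idx = np.flatnonzero(mask)  # positions of the appended rows
print(synthetic_idx)
```

With the arrays from the question, synthetic_mask(X, X_smt) should give the same kind of mask, so X_smt[mask] and y_smt[mask] would select the synthetic rows and their labels. The prefix check costs one pass over the original rows, which matters more on a 7000 x 2400 dataset than on iris.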
