I have a labeled dataset where X has shape 7000 x 2400 and y has shape 7000. The data is heavily imbalanced, so I am trying to generate synthetic samples using SMOTE. However, I want to identify the synthetic samples that SMOTE actually generated. As an example, here's a code snippet:
import pandas as pd
import numpy as np
from sklearn.datasets import load_iris
from imblearn.over_sampling import SMOTE
iris = load_iris()
X = iris['data']
y = iris['target']
# The iris data is balanced, so I intentionally drop some samples to create an imbalance
X = X[:125, :]
y = y[:125]
oversample = SMOTE()
X_smt, y_smt = oversample.fit_resample(X, y)
The arrays X_smt and y_smt contain both the original samples and the synthetic samples. Is there a simple way to identify the synthetic samples by index or some other mechanism?
I really feel stupid... the answer is that simple. It seems SMOTE just appends the new samples after the original samples. Adding these two lines proves my point.
What we are doing is checking whether each row of X_smt can be found in X. Since X has 125 rows (indices 0 to 124), each of the first 125 rows of X_smt should be found in X, whereas rows indexed from 125 onwards should not be. The print statement confirms it. Feel free to run the notebook here