Oversampling Using SMOTE Removes a Label Category from y_train


I'm using an LSTM for sentiment analysis on an imbalanced dataset with 86% positive and 14% negative samples. It's a very small dataset of 472 sentences in a regional language. The train_test_split test size is 0.3. I'm having two issues with the implementation:

1: Training and validation accuracy stay constant throughout training (without SMOTE).
2: When using SMOTE for oversampling, the oversampled y_train has only one label column (see y_train_oversample.shape below).

from imblearn.over_sampling import SMOTE

ros = SMOTE()
# fit_sample is the older name for this call; current imbalanced-learn releases use fit_resample
X_train_oversample, y_train_oversample = ros.fit_resample(X_train_pad, y_train)

print(X_train_pad.shape)
print(X_train_oversample.shape)
print(y_train.shape)
print(y_train_oversample.shape)

The resulting shapes:

X_train_pad.shape                   (331, 832)
X_train_oversample.shape            (570, 832)
y_train.shape                       (331, 2)
y_train_oversample.shape            (570, 1)

However, the shapes of the original data are as follows:

Shape of X_train_pad tensor (331, 832)
Shape of y_train tensor (331, 2)
Shape of X_test_pad tensor (141, 832)
Shape of y_test tensor (141, 2)

As a result, training the LSTM fails with this error:

ValueError: Error when checking target: expected dense_1 to have shape (2,) but got array with shape (1,)

There is 1 best solution below


The output from SMOTE, y_train_oversample, is not one-hot encoded like your original y, so you have to one-hot encode it again. You could have spotted this by looking at the values in y_train_oversample.

You can do the one-hot encoding as follows, assuming it's a NumPy array of integer class labels:

import numpy as np
y_train_oversample = np.eye(2)[y_train_oversample.flatten()]  # row i of the 2x2 identity matrix is the one-hot vector for label i

Note that I call flatten there because when the input y is a matrix, the SMOTE output has shape (n_row, 1). If you one-hot encode that directly, you end up with an extra dimension, so it's better to flatten it first.
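To make the shape issue concrete, here is a minimal NumPy-only sketch. The array y_resampled is a made-up stand-in for y_train_oversample (a column of 0/1 labels), not the asker's data, and the other variable names are illustrative as well.

```python
import numpy as np

# Hypothetical stand-in for the SMOTE output: a column of integer
# class labels with shape (n_rows, 1), as in the question.
y_resampled = np.array([[0], [1], [1], [0], [1]])

# Indexing the identity matrix with the (n_rows, 1) column directly
# keeps the extra axis and yields a 3-D array.
wrong = np.eye(2)[y_resampled]
print(wrong.shape)  # (5, 1, 2)

# Flattening first gives one row of the identity matrix per label.
# astype(int) is a precaution in case the labels arrive as floats.
labels = y_resampled.flatten().astype(int)
y_one_hot = np.eye(2)[labels]
print(y_one_hot.shape)  # (5, 2)

# argmax along axis 1 recovers the integer labels from the one-hot rows.
recovered = y_one_hot.argmax(axis=1)
```

The (5, 2) result has the layout that a two-unit output layer such as dense_1 expects, while the 3-D array from the unflattened version does not.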