I'm using an LSTM for sentiment analysis on an imbalanced dataset with 86% positive and 14% negative samples. It's a very small dataset (472 sentences), and the sentences are in a regional language. The train_test_split test size is 0.3. I'm having two issues with the implementation:

1. Training and validation accuracy stay constant throughout training (without SMOTE).
2. When using SMOTE for oversampling, the oversampled y_train comes back with only one label column (see the shapes below).
```python
from imblearn.over_sampling import SMOTE

ros = SMOTE()
# Note: fit_sample() is named fit_resample() in newer imbalanced-learn releases
X_train_oversample, y_train_oversample = ros.fit_sample(X_train_pad, y_train)

print(X_train_pad.shape)
print(X_train_oversample.shape)
print(y_train.shape)
print(y_train_oversample.shape)
```
The printed shapes are:

```
X_train_pad.shape        (331, 832)
X_train_oversample.shape (570, 832)
y_train.shape            (331, 2)
y_train_oversample.shape (570, 1)
```

Note that y_train has 2 columns, while y_train_oversample has only 1.
However, the actual shapes of the data are as follows:

```
Shape of X_train_pad tensor (331, 832)
Shape of y_train tensor     (331, 2)
Shape of X_test_pad tensor  (141, 832)
Shape of y_test tensor      (141, 2)
```
Hence, training the LSTM fails with the following error:

```
ValueError: Error when checking target: expected dense_1 to have shape (2,) but got array with shape (1,)
```
The output from SMOTE, y_train_oversample, is not one-hot encoded like your original y, so you have to one-hot encode it yourself. You can verify this by looking at the values in y_train_oversample. Assuming it's a NumPy array, you can do the one-hot encoding as shown in the sketch below.
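A minimal sketch of that re-encoding, assuming the oversampled labels are integer class IDs (0/1) in a NumPy array and that Keras' to_categorical is available:

```python
import numpy as np
from keras.utils import to_categorical  # or: from tensorflow.keras.utils import to_categorical

# y_train_oversample comes back with shape (n_samples, 1), so flatten it to a
# 1-D vector of class IDs first, then re-encode it into the (n_samples, 2)
# one-hot format that the model's dense_1 output expects.
y_train_oversample = to_categorical(y_train_oversample.flatten(), num_classes=2)
```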
Note that I flatten there: if the input y is a matrix, the output is going to be (n_row, 1), and if you one-hot encode it directly you will run into a dimensionality issue, so it's better to flatten it first.
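For example, a quick sanity check after re-encoding (the expected numbers follow from the shapes reported above) would be:

```python
print(y_train_oversample.shape)  # expected (570, 2), matching dense_1's target shape (2,)
print(y_train_oversample[:3])    # one-hot rows such as [1., 0.] or [0., 1.]
```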