I have an imbalanced dataset called yeast4. The records are divided into two target classes, "positive" and "negative", and the positive class makes up only 3% of the records. I used the kNN algorithm for classification. I did not fix k in advance; instead I used 5-fold cross-validation on the training data. I found auc_knn_none = 0.7062473. I am interested in adding an oversampling algorithm to improve the quality of the model, so I used the SMOTE algorithm, again without fixing the k of kNN and again with 5-fold cross-validation on the training data. This time I found auc_knn_smote = 0.56676. Normally auc_knn_smote should be higher than auc_knn_none, so something seems to be wrong and I do not know where the problem is. Here is my code:
library(imbalance)
data(yeast4)
Data <- yeast4
Data$Mcg <- as.numeric(as.character(Data$Mcg))
Data$Gvh <- as.numeric(as.character(Data$Gvh))
Data$Alm <- as.numeric(as.character(Data$Alm))
Data$Mit <- as.numeric(as.character(Data$Mit))
Data$Erl <- as.numeric(as.character(Data$Erl))
Data$Pox <- as.numeric(as.character(Data$Pox))
Data$Vac <- as.numeric(as.character(Data$Vac))
Data$Nuc <- as.numeric(as.character(Data$Nuc))
U <- data.frame(Data[,-9])
U <- scale(U,center = TRUE ,scale=TRUE)
U <- data.frame(U)
q <- as.factor(unlist(Data$Class))
# Recode the class: labels starting with "n" (negative) become 0, everything else 1
Q <- as.factor(ifelse(substr(as.character(q), 1, 1) == "n", 0, 1))
Here I have centred and scaled my data, and recoded the class so that "negative" becomes 0 and everything else becomes 1. Here is the function that I used:
library(ROCR)
library(pROC)
library(caret)
library(ROSE)
library(DMwR)
library(nnet)
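AUC_KNN_SMOTE <- function(U, Q, k, M) {
  # Outer k-fold split used to estimate the AUC
  folds <- createFolds(Q, k)
  AUC <- vector()
  W <- vector()
  for (i in 1:k) {
    s <- data.frame(folds[i])[, 1]   # row indices of the i-th test fold
    TRAIN <- data.frame(U[-s, ])
    TEST <- data.frame(U[s, ])
    TRAIN$Class <- Q[-s]
    # Apply SMOTE to the training part only
    TRAIN.smote <- SMOTE(Class ~ ., data = TRAIN,
                         perc.over = 100, perc.under = 200)
    # Inner 5-fold CV to choose the number of neighbours by ROC
    trControl <- trainControl(method = "cv",
                              number = 5,
                              classProbs = TRUE,
                              summaryFunction = twoClassSummary)
    fit <- train(make.names(Class) ~ .,
                 method = "knn",
                 tuneGrid = expand.grid(k = 1:M),
                 trControl = trControl,
                 metric = "ROC",
                 data = TRAIN.smote)
    # Column 2 of fit$results holds the cross-validated ROC for each k
    W <- c(W, fit[["results"]][, 2])
    W <- matrix(W, nrow = M, ncol = i)
    J <- which.is.max(W[, i])        # number of neighbours with the highest ROC
    # Predict the held-out fold with the chosen number of neighbours
    mod <- class::knn(cl = TRAIN.smote$Class,
                      test = TEST,
                      train = TRAIN.smote[, -9],
                      k = J,
                      prob = TRUE)
    # Note: the "prob" attribute of class::knn is the vote share of the winning class
    X <- roc(Q[s], attributes(mod)$prob, quiet = TRUE)
    AUC <- c(AUC, as.numeric(X$auc))
  }
  return(mean(AUC))
}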
The result that I mentioned above was obtained with this function as follows:
b <- 0
# Average the cross-validated AUC over 1000 repetitions
for (i in 1:1000) {
  m <- AUC_KNN_SMOTE(U, Q, k = 5, M = 100) + b
  b <- m
}
auc_knn_smote <- m / 1000
auc_knn_smote=0.56676
Thank you for any help!
I think there is nothing wrong with your approach. It is the interpretation of the result that requires clarification.
As you have stated, on the initial imbalanced dataset you got an AUC score of 0.7062473. Then you applied the SMOTE data balancing algorithm and you got an AUC score of 0.56676. In both cases, 5-fold cross validation was applied.
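To make the comparison more concrete, it may help to see how much the training data changes under the SMOTE settings used in the question (perc.over = 100, perc.under = 200). The sketch below is plain arithmetic in base R. It assumes the sampling rules described in the DMwR documentation for SMOTE: perc.over/100 synthetic cases are generated per minority case, and perc.under/100 majority cases are kept per synthetic case. The helper name smote_counts and the fold size are hypothetical and chosen only for illustration; they are not taken from yeast4.

```r
# Rough class counts after DMwR-style SMOTE resampling (arithmetic only).
# Assumption: perc.over >= 100. smote_counts is a hypothetical helper,
# not a function from DMwR or any other package.
smote_counts <- function(n_min, perc.over = 100, perc.under = 200) {
  stopifnot(perc.over >= 100)
  n_synth <- n_min * floor(perc.over / 100)     # synthetic minority cases generated
  n_maj   <- floor(n_synth * perc.under / 100)  # majority cases kept
  c(minority = n_min + n_synth, majority = n_maj)
}

# Hypothetical training fold containing 40 positive cases
res <- smote_counts(40)
print(res)
```

Under these assumptions the resampled training set ends up balanced, but it is also much smaller than the original fold, because most majority cases are dropped. The test fold is not resampled, so it keeps its original imbalance.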
Explanation
Possible solution