kNN on an imbalanced dataset: adding SMOTE to improve performance gives worse results


I have an imbalanced dataset called yeast4. The records are divided into two target classes, "positive" and "negative", and the positive class makes up only 3% of the total. I used the kNN algorithm for classification; I did not fix k in advance but selected it with 5-fold cross-validation on the training data, and obtained auc_knn_none = 0.7062473. I would like to add an oversampling algorithm to improve the quality of the model, so I used the SMOTE algorithm, again without fixing k for kNN and again using 5-fold cross-validation on the training data. This time I obtained auc_knn_smote = 0.56676. Normally auc_knn_smote should be higher than auc_knn_none, so something is wrong and I do not know where the problem is. Here is my code:

library(imbalance)
data(yeast4)
Data <- yeast4

# Convert the eight feature columns from factor to numeric
feat_cols <- c("Mcg", "Gvh", "Alm", "Mit", "Erl", "Pox", "Vac", "Nuc")
Data[feat_cols] <- lapply(Data[feat_cols], function(x) as.numeric(as.character(x)))

# Centre and scale the features (column 9 is the Class label)
U <- data.frame(scale(Data[, -9], center = TRUE, scale = TRUE))

# Recode the target: "negative" -> 0, everything else -> 1
Q <- as.factor(ifelse(substr(as.character(Data$Class), 1, 1) == "n", 0, 1))
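
A quick sanity check of the class distribution (a minimal sketch; Q is the binary target built above) confirms the imbalance before fitting anything:

table(Q)                           # raw counts per class
round(prop.table(table(Q)), 3)     # class proportions; "1" (positive) should be around 0.03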

Here I have scaled and centred my data and recoded the target: "negative" becomes 0 and everything else becomes 1. And here is the function I have used:

library(ROCR)
library(pROC)
library(caret)
library(ROSE)
library(DMwR)
library(nnet)

AUC_KNN_SMOTE <- function(U, Q, k, M){
  folds <- createFolds(Q, k)       # k outer folds on the full data
  AUC <- vector()
  W <- vector()
  for(i in 1:k){
    s <- data.frame(folds[i])[, 1]
    TRAIN <- data.frame(U[-s, ])
    TEST  <- data.frame(U[s, ])
    TRAIN$Class <- Q[-s]
    # Oversample the minority class on the training fold only
    TRAIN.smote <- SMOTE(Class ~ ., data = TRAIN,
                         perc.over = 100, perc.under = 200)
    trControl <- trainControl(method  = "cv",
                              number  = 5,
                              classProbs = TRUE,
                              summaryFunction = twoClassSummary)
    # Inner 5-fold CV on the oversampled training fold to choose k for kNN
    fit <- train(make.names(Class) ~ .,
                 method     = "knn",
                 tuneGrid   = expand.grid(k = 1:M),
                 trControl  = trControl,
                 metric     = "ROC",
                 data       = TRAIN.smote)
    W <- c(W, fit[["results"]][, 2])   # column 2 of results is the CV ROC
    W <- matrix(W, nrow = M, ncol = i)
    J <- which.is.max(W[, i])          # k value with the highest CV ROC in this fold
    mod <- class::knn(cl = TRAIN.smote$Class,
                      test = TEST,
                      train = TRAIN.smote[, -9],   # drop the Class column
                      k = J,
                      prob = TRUE)
    # Note: the "prob" attribute of class::knn is the vote share of the winning class
    X <- roc(Q[s], attributes(mod)$prob, quiet = TRUE)
    AUC <- c(AUC, as.numeric(X$auc))
  }
  return(mean(AUC))
}
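
For comparison, the code that produced auc_knn_none is not shown above; a minimal sketch of how that baseline could be computed with the same fold structure (same libraries as above, with the SMOTE step simply omitted) might look like this:

AUC_KNN_NONE <- function(U, Q, k, M){
  folds <- createFolds(Q, k)
  AUC <- vector()
  for(i in 1:k){
    s <- data.frame(folds[i])[, 1]
    TRAIN <- data.frame(U[-s, ])
    TEST  <- data.frame(U[s, ])
    TRAIN$Class <- Q[-s]
    trControl <- trainControl(method = "cv", number = 5,
                              classProbs = TRUE,
                              summaryFunction = twoClassSummary)
    fit <- train(make.names(Class) ~ .,
                 method    = "knn",
                 tuneGrid  = expand.grid(k = 1:M),
                 trControl = trControl,
                 metric    = "ROC",
                 data      = TRAIN)
    J <- fit$results$k[which.max(fit$results$ROC)]   # k with the best CV ROC
    mod <- class::knn(cl = TRAIN$Class,
                      test = TEST,
                      train = TRAIN[, -ncol(TRAIN)],  # drop the Class column
                      k = J,
                      prob = TRUE)
    X <- roc(Q[s], attributes(mod)$prob, quiet = TRUE)
    AUC <- c(AUC, as.numeric(X$auc))
  }
  return(mean(AUC))
}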

The auc_knn_smote value I mentioned above was obtained with AUC_KNN_SMOTE by averaging over 1000 runs:

# Average the cross-validated AUC over 1000 repetitions
b <- 0
for(i in 1:1000){
  b <- b + AUC_KNN_SMOTE(U, Q, k = 5, M = 100)
}
auc_knn_smote <- b / 1000
# auc_knn_smote = 0.56676

Thank you for any help!

1 Answer

I think there is nothing wrong with your approach. It's the interpretation of the results that requires clarification.

As you have stated, on the initial imbalanced dataset you got an AUC score of 0.7062473. Then you applied the SMOTE data balancing algorithm and you got an AUC score of 0.56676. In both cases, 5-fold cross-validation was applied.

Explanation

  • The initial AUC score was higher because the model favored the majority class, which makes up most of the data.
  • To balance the dataset, an oversampling technique was applied. Let's briefly recall how oversampling works: it introduces artificial data points. Because these new points are generated from the existing ones, they cannot add much variance to the dataset; in most cases they are only slightly different from the originals. This introduces bias.
  • The train-test split aspect is not clear from your question. If the data was oversampled before the train-test split, that also introduces bias. It is important to perform the split before balancing the training set: you want your test set to be as unbiased as possible in order to get an objective evaluation of the model's performance. If balancing was performed before splitting, the model may have seen information from the test set during training, through the generated data points (see the short sketch after this list).
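
A rough illustration of that ordering (a sketch only, reusing the caret and DMwR calls already loaded in the question; the object names are illustrative):

set.seed(1)
idx       <- createDataPartition(Q, p = 0.8, list = FALSE)   # split first
train_set <- data.frame(U[idx, ],  Class = Q[idx])
test_set  <- data.frame(U[-idx, ], Class = Q[-idx])

# Balance only the training portion; the test portion is never resampled
train_bal <- SMOTE(Class ~ ., data = train_set,
                   perc.over = 100, perc.under = 200)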

Possible solution

  • Focus on removing the bias introduced by oversampling. One method is to resample the data (see the sketch after this list).
  • Remember, your goal should be a low-bias, low-variance model. This will help improve the performance evaluation metric.
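
As one concrete way to resample differently (a sketch; ROSE is already loaded in the question, and train_set is the hypothetical split from the sketch above):

# ROSE draws a synthetic, roughly balanced sample from the training data only
train_rose <- ROSE(Class ~ ., data = train_set, seed = 1)$data
table(train_rose$Class)    # inspect the new class balance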