Data leakage when feature scaling with K-fold cross validation in R


I am performing K-fold cross-validation to evaluate the performance of my SVM model. However, given the nature of the data, I want to apply feature scaling. Here is a snippet of the data:

# IMPORTING THE DATASET    
dataset <- read.csv("imported dataset.csv")


# ENCODING THE DEPENDENT VARIABLE AS A FACTOR  
dataset$Purchased <- factor(dataset$Purchased, levels = c(0, 1))


# DATASET
    Age EstimatedSalary Purchased
1  19           19000         0
2  35           20000         0
3  26           43000         0
4  27           57000         0
5  19           76000         0
6  27           58000         0

And here is the rest of the code:

# TRAIN TEST SPLIT
library(caTools)  # provides sample.split()
library(caret)    # provides trainControl() and train()

split = sample.split(dataset$Purchased, SplitRatio = 0.75)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)


# K-FOLD CV WITH FEATURE SCALING
trCtrl <- trainControl(method = "repeatedcv",
                       number = 10,  # 10-fold CV
                       repeats = 10,
                       savePredictions = TRUE)
model <- train(Purchased ~ .,
               data = training_set,
               method = "svmRadial",
               trControl = trCtrl,
               preProcess = c("center", "scale"))


I know that scaling the whole training set first and then running K-fold CV on it causes data leakage: the inner training and validation folds have been scaled together, so the scaling statistics already contain information from the validation rows and the CV performance estimate becomes overly optimistic.

I would like to know whether the preProcess option in the caret package scales the data in a way that avoids this, i.e. whether the inner training sets and validation sets are scaled separately.

There is 1 best solution below

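As you know, cross-validation repeatedly trains the model on part of the data and validates it on the held-out part. If the data was scaled before the folds were created, the scaling parameters (mean and standard deviation) were computed using the held-out rows as well, so the validation data is no longer independent of the training step. That is the data leakage.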
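To make the leak-free version concrete, here is a minimal base R sketch of scaling inside a single fold. The data frame `toy`, the helper `scale_fold()` and the choice of 5 folds are all hypothetical and only stand in for the Age and EstimatedSalary columns from the question. The point it illustrates is that the centers and standard deviations come from the training rows only and are then reused for the validation rows.

```r
# Hypothetical toy data standing in for the Age / EstimatedSalary columns.
set.seed(42)
toy <- data.frame(
  Age = round(runif(40, 18, 60)),
  EstimatedSalary = round(runif(40, 15000, 150000))
)

# Assign each row to one of k folds at random.
k <- 5
fold_id <- sample(rep(seq_len(k), length.out = nrow(toy)))

# Scale one fold without leakage: the statistics come from the training rows only.
scale_fold <- function(data, fold_id, fold) {
  train_rows <- data[fold_id != fold, , drop = FALSE]
  valid_rows <- data[fold_id == fold, , drop = FALSE]

  centers <- colMeans(train_rows)
  sds <- apply(train_rows, 2, sd)

  list(
    train = scale(train_rows, center = centers, scale = sds),
    valid = scale(valid_rows, center = centers, scale = sds),
    centers = centers,
    sds = sds
  )
}

scaled <- scale_fold(toy, fold_id, fold = 1)

# Training columns end up with mean 0 and sd 1. Validation columns generally
# do not, because they were transformed with the training fold's statistics.
print(round(colMeans(scaled$train), 10))
print(round(colMeans(scaled$valid), 3))
```

Repeating `scale_fold()` for every value of `fold` gives one independently scaled training/validation pair per fold, which is what a resampling-aware preprocessing step is expected to do for you.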

You can probably use a pipeline to solve your problem, passing the scaler as a preprocessing step of your pipeline so that it is refit on the training portion of every fold. You can see more in this article.
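In caret, the closest equivalent of a pipeline appears to be the `preProcess` argument of `train()`. As far as the caret documentation describes it, the requested transformations are estimated on the training portion of each resample and then applied to the held-out portion, rather than being computed once on the full data. The sketch below assumes that behaviour, assumes the caret and kernlab packages are installed, and uses a hypothetical simulated data frame `toy` in place of the real dataset.

```r
library(caret)  # "svmRadial" also needs the kernlab package to be installed

set.seed(123)

# Hypothetical data with the same columns as the dataset in the question.
n <- 200
toy <- data.frame(
  Age = round(runif(n, 18, 60)),
  EstimatedSalary = round(runif(n, 15000, 150000))
)
score <- toy$Age + toy$EstimatedSalary / 5000 + rnorm(n, sd = 5)
toy$Purchased <- factor(ifelse(score > 55, 1, 0), levels = c(0, 1))

# Hold out a test set; createDataPartition() keeps the class proportions.
in_train <- createDataPartition(toy$Purchased, p = 0.75, list = FALSE)
training_set <- toy[in_train, ]
test_set <- toy[-in_train, ]

# Plain 5-fold CV keeps the sketch fast; the question uses repeated 10-fold CV.
tr_ctrl <- trainControl(method = "cv", number = 5)

# The raw, unscaled training set goes in. Centering and scaling are requested
# through preProcess instead of being applied to the data beforehand.
model <- train(Purchased ~ .,
               data = training_set,
               method = "svmRadial",
               trControl = tr_ctrl,
               preProcess = c("center", "scale"))

# predict() on the train object reuses the stored centering and scaling,
# so the test set is also passed in unscaled.
test_pred <- predict(model, newdata = test_set)
print(model$results)
```

The design choice that matters here is that no data is scaled by hand before `train()` is called: neither the training set nor the test set.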