I have a project in my ML course about anomaly/novelty detection and decided to study the One-class SVM algorithm as described in this paper: http://research.microsoft.com/pubs/69731/tr-99-87.pdf. In the package e1071
in R there is an svm
function that seems to support one-class classification. However, when I try to use it the predictor always returns false (even on the training set, which is the weirdest thing). Here is my code :
library(e1071) # for svm classifier
library(IMIFA) # for USPS dataset
library(caret) # for confusion matrices
data(USPSdigits)
digits.train <- USPSdigits$train
digits.train <- digits.train[order(digits.train$V1), ]
digits.train$is.zero[digits.train$V1 == 0] <- "TRUE"
digits.train$is.zero[digits.train$V1 != 0] <- "FALSE"
digits.test <- USPSdigits$test
digits.test <- digits.test[order(digits.test$V1), ]
digits.test$is.zero[digits.test$V1 == 0] <- "TRUE"
digits.test$is.zero[digits.test$V1 != 0] <- "FALSE"
digits.train.features <- digits.train[digits.train$V1 == 0, -c(1, 258)]
digits.train.labels <- digits.train[digits.train$V1 == 0, 258]
digits.train.nu <- 0.5
digits.train.bandwith <- 0.5*256
digits.train.model <- svm(x = digits.train.features, type = 'one-classification', kernel = 'radial', nu = digits.train.nu, gamma = digits.train.bandwith)
digits.train.fitted <- predict(digits.train.model, digits.train.features)
digits.train.confusionMatrix <- table(Predicted = digits.train.fitted, Reference = digits.train.labels)
print(digits.train.confusionMatrix)
digits.test.features <- subset(digits.test, select = -c(is.zero, V1))
digits.test.labels <- digits.test$is.zero
digits.test.fitted <- predict(digits.train.model, digits.test.features)
digits.test.confusionMatrix <- table(Predicted = digits.test.fitted, Reference = digits.test.labels)
print(digits.test.confusionMatrix)
and my output is :
> print(digits.train.confusionMatrix)
Reference
Predicted TRUE
FALSE 1194
> print(digits.test.confusionMatrix)
Reference
Predicted FALSE TRUE
FALSE 1648 359
What am I doing wrong?
I created a working example. Adjust naming conventions to your own. I have used a very expressive naming convention to show exactly what I did.
Training on only 1 number results in a lot of columns with the same values. These return as errors in
svm
and should be removed. The functionnearZeroVar
from caret is an ideal function for this. If you ever use the recipes package, it is calledstep_nzv
.