Problem with ROC curves for an SVM on simulated data


I'm working on simulated data and running into a problem while trying to tune the SVM parameters.

library(e1071)  
library(ROCR)  
set.seed(10)  

# Function to generate data: class 1 inside the triangle below both
# lines x2 = 2*x1 and x2 = 2 - 2*x1, class -1 everywhere else
generate.data <- function(n){
  x2 <- runif(n)
  x1 <- runif(n)
  y <- as.factor(ifelse((x2 > 2*x1) | (x2 > (2 - 2*x1)), -1, 1))
  return(data.frame(x1, x2, y))
}

# Training set (n = 500) and test set (n = 200)
dtrain <- generate.data(500)  
dtest <- generate.data(200)  

I performed cross-validation on the training set and, with the radial kernel, obtained the parameters cost = 1000 and gamma = 0.1.

# Cross-validated grid search over cost and gamma
tune.out <- tune(svm, y ~ x1 + x2, data = dtrain, kernel = "radial",
                 ranges = list(cost = c(0.1, 1, 10, 100, 1000),
                               gamma = c(0.01, 0.1, 1, 10, 100)))

# Refit with the selected parameters (probability = TRUE for ROC curves later)
svmbestmod <- svm(y ~ x1 + x2, data = dtrain, kernel = "radial",
                  cost = 1000, gamma = 0.1, probability = TRUE)
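To double-check which parameters the cross-validation actually selected (rather than hard-coding them), the tune object can be inspected directly:

# Inspect the cross-validation results and the selected parameters
summary(tune.out)
tune.out$best.parameters   # should report cost = 1000, gamma = 0.1

(tune.out$best.model is the same model already refit on the training set, though without probability = TRUE.)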

When I predict on the test set I get 0 error, and I don't understand why.

# Predictions on the test set
yrad.test <- predict(svmbestmod, dtest)

# Confusion matrix
mc.rad <- table(dtest$y, yrad.test)
print(mc.rad)

# Misclassification error
err.rad <- 1 - sum(diag(mc.rad)) / sum(mc.rad)
print(err.rad)
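Since the title mentions ROC curves and ROCR is already loaded, here is a sketch of how a curve could be drawn from the fitted model; it relies on probability = TRUE in the fit and treats "1" as the positive class:

# Probability of class "1" for each test point
yprob <- attr(predict(svmbestmod, dtest, probability = TRUE),
              "probabilities")[, "1"]

# ROC curve and AUC via ROCR
pred.roc <- prediction(yprob, dtest$y)
perf.roc <- performance(pred.roc, "tpr", "fpr")
plot(perf.roc)
performance(pred.roc, "auc")@y.values[[1]]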

If someone could help me understand what's going wrong, that would be great.

Best answer:

I put 20,000 points in the test set to get enough misclassified examples.
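That is, presumably regenerating the test set and predictions with the same code as in the question:

# Larger test set so that some points fall near the decision boundary
dtest <- generate.data(20000)
yrad.test <- predict(svmbestmod, dtest)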

# First, isolate any misclassified points in the test set
library(dplyr)
errors <- cbind(dtest, yrad.test) %>% dplyr::filter(y != yrad.test)
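(The same filtering can also be done in base R, without dplyr:)

# Base-R equivalent of the dplyr filter above
errors <- subset(cbind(dtest, yrad.test), y != yrad.test)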

# Then plot all the points in the train set,
# coloured by their respective class, while misclassified
# entries in the test set are shown in black

library(ggplot2)
p <- ggplot2::ggplot(data = dtrain, aes(x1, x2)) +
  geom_point(aes(colour = factor(y))) +
  geom_point(data = errors, colour = "black")
p

[Plot: misclassified test-set points shown in black]

It seems to me that your data is completely separable: it is essentially too good to be true, so the model can make perfect predictions. You could add some noise to the formula that generates it; see the sketch below.
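For example, a minimal sketch that flips each label with some small probability (the 10% flip rate is arbitrary):

# Variant of generate.data that flips each label with probability p
generate.noisy.data <- function(n, p = 0.1){
  x2 <- runif(n)
  x1 <- runif(n)
  y <- ifelse((x2 > 2*x1) | (x2 > (2 - 2*x1)), -1, 1)
  flip <- runif(n) < p          # which labels to flip
  y[flip] <- -y[flip]
  data.frame(x1, x2, y = as.factor(y))
}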

Also, since your test data contains only 200 entries, it's quite possible that none of them falls close enough to the decision boundary to be misclassified. As mentioned, I had to generate a test set of 20,000 points to get the roughly 200 misclassified points you see in the plot.