I know that an SVM model needs preprocessing that converts categorical variables into dummy variables. However, when I use e1071's svm function to fit a model with unconverted data (see train and test), no error pops up. I am assuming the function converts them automatically.
However, when I use the converted data (see train2 and test2) to fit an SVM model, the function gives me a different result (as indicated, p1 and p2 are not the same).
Could anyone let me know what happens to the unconverted data? Does the function just ignore the categorical variables, or does something else happen?
library(e1071)
library(dummies)
set.seed(0)
x = data.frame(matrix(rnorm(200, 10, 10), ncol = 5)) #fake numerical predictors
cate = factor(sample(LETTERS[1:5], 40, replace=TRUE)) #fake categorical variables
y = rnorm(40, 50, 10) #fake response
data = cbind(y,cate,x)
ind = sample(40, 30, replace=FALSE)
train = data[ind, ]
test = data[-ind, ]
#without dummy
data = cbind(y,cate,x)
svm.model = svm(y~., train)
p1 = predict(svm.model, test)
#with dummy
train2 = cbind(train[,-2], dummy(train[,2]))
colnames(train2) = c('y', paste0('X',1:5), LETTERS[1:4])
test2 = cbind(test[,-2], dummy(test[,2]))
colnames(test2) = c('y', paste0('X',1:5), LETTERS[1:4])
svm.model2 = svm(y~., train2)
p2 = predict(svm.model2, test2)
What you're observing is indeed as you stated: factors are converted to dummy variables automatically. In fact, we can reproduce both svm.model (referred to below as svm.model1) and svm.model2 quite easily.
Note that I did not use svm(formula, data) but svm(x, y), passing a predictor matrix with manually created dummy columns. Now, which model did we actually recreate? Let's compare with p1 and p2.
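A minimal sketch of such a reproduction, under a few assumptions: the dummies are built with model.matrix rather than the dummies package, the toy data and the fixed 30/10 split only mimic the question's setup, and the names make_x, m_factor, m_dummy and m_xy are hypothetical helpers introduced for illustration.

```r
library(e1071)

# Toy data in the spirit of the question (assumption: not the asker's exact split)
set.seed(0)
d <- data.frame(
  y    = rnorm(40, 50, 10),
  cate = factor(rep(LETTERS[1:5], 8)),   # every level occurs in train and test
  matrix(rnorm(200, 10, 10), ncol = 5)   # becomes columns X1..X5
)
train <- d[1:30, ]
test  <- d[31:40, ]

# Formula interface on the raw factor (the role of svm.model1 / p1)
m_factor <- svm(y ~ ., data = train)
p1 <- predict(m_factor, test)

# Hypothetical helper: one 0/1 column per level, then the numeric predictors
make_x <- function(df) {
  cbind(model.matrix(~ cate - 1, data = df),
        as.matrix(df[, paste0("X", 1:5)]))
}
x_train <- make_x(train)
x_test  <- make_x(test)

# Formula interface on the dummified data (the role of svm.model2 / p2)
m_dummy <- svm(y ~ ., data = data.frame(y = train$y, x_train))
p2 <- predict(m_dummy, data.frame(y = test$y, x_test))

# Matrix interface svm(x, y) with the default scale = TRUE
m_xy <- svm(x_train, train$y)
p_xy <- predict(m_xy, x_test)

# Compare the matrix-interface predictions against both formula fits
all.equal(unname(p_xy), unname(p2))
all.equal(unname(p_xy), unname(p1))
```

The two all.equal calls show which of the formula fits the svm(x, y) model agrees with; all.equal returns TRUE on agreement and a description of the mismatch otherwise.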
It seems we've recreated model 2, with our manual dummies. The reason this reproduces svm.model2 and not svm.model1 is the scale parameter, as described in help(svm).
From this we can see that the difference (and the real issue) likely comes from svm not recognising manually created binary columns as dummies, and therefore scaling them like any other numeric column, while apparently being smart enough to leave them unscaled when it performs the conversion itself. We can test this theory by setting the scale parameter manually so that the dummy columns are not scaled.
So what we see is that:
1) svm, as stated, converts factors into dummy variables automatically.
2) It does not, however, check for dummies that are provided manually, which can cause unexpected behaviour if you create them yourself.
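The scale test described above could be sketched as follows. This is an illustration under assumptions rather than the exact code behind the conclusions: the toy data and fixed split only mimic the question, model.matrix stands in for the dummies package, and make_x, scale_flags and m_manual are hypothetical names.

```r
library(e1071)

# Toy data in the spirit of the question (assumption: not the asker's exact split)
set.seed(0)
d <- data.frame(
  y    = rnorm(40, 50, 10),
  cate = factor(rep(LETTERS[1:5], 8)),   # every level occurs in train and test
  matrix(rnorm(200, 10, 10), ncol = 5)   # becomes columns X1..X5
)
train <- d[1:30, ]
test  <- d[31:40, ]

# Reference fit: formula interface with the factor left as-is
m_factor <- svm(y ~ ., data = train)
p1 <- predict(m_factor, test)

# Hypothetical helper: five 0/1 dummy columns followed by X1..X5
make_x <- function(df) {
  cbind(model.matrix(~ cate - 1, data = df),
        as.matrix(df[, paste0("X", 1:5)]))
}
x_train <- make_x(train)
x_test  <- make_x(test)

# Scale only the numeric predictors; leave the five dummy columns untouched
scale_flags <- c(rep(FALSE, 5), rep(TRUE, 5))
m_manual <- svm(x_train, train$y, scale = scale_flags)
p_manual <- predict(m_manual, x_test)

# Compare against the formula fit on the raw factor
all.equal(unname(p_manual), unname(p1))
```

If the theory holds, switching scaling off for the dummy columns is what brings the svm(x, y) fit in line with the formula fit on the unconverted data.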