I have an issue with an SVM model trained for binary classification using Spark 2.0.0. I have followed the same logic using scikit-learn and MLlib, on the exact same dataset. For scikit-learn I have the following code:
from sklearn.svm import SVC

svc_model = SVC()
svc_model.fit(X_train, y_train)

print("supposed to be 1")
print(svc_model.predict([[15.0, 15.0, 0.0, 15.0, 15.0, 4.0, 12.0, 8.0, 0.0, 7.0]]))
print(svc_model.predict([[15.0, 15.0, 15.0, 7.0, 7.0, 15.0, 15.0, 0.0, 12.0, 15.0]]))
print(svc_model.predict([[15.0, 15.0, 7.0, 0.0, 7.0, 0.0, 15.0, 15.0, 15.0, 15.0]]))
print(svc_model.predict([[7.0, 0.0, 15.0, 15.0, 15.0, 15.0, 7.0, 7.0, 15.0, 15.0]]))
print("supposed to be 0")
print(svc_model.predict([[18.0, 15.0, 7.0, 7.0, 15.0, 0.0, 15.0, 15.0, 15.0, 15.0]]))
print(svc_model.predict([[11.0, 13.0, 7.0, 10.0, 7.0, 13.0, 7.0, 19.0, 7.0, 7.0]]))
print(svc_model.predict([[15.0, 15.0, 18.0, 7.0, 15.0, 15.0, 15.0, 18.0, 7.0, 15.0]]))
print(svc_model.predict([[15.0, 15.0, 8.0, 0.0, 0.0, 8.0, 15.0, 15.0, 15.0, 7.0]]))
and it returns:
supposed to be 1
[0]
[1]
[1]
[1]
supposed to be 0
[0]
[0]
[0]
[0]
For Spark I am doing:
from pyspark.mllib.classification import SVMWithSGD
from pyspark.mllib.linalg import Vectors

model_svm = SVMWithSGD.train(trainingData, iterations=100)

print("supposed to be 1")
print(model_svm.predict(Vectors.dense(15.0, 15.0, 0.0, 15.0, 15.0, 4.0, 12.0, 8.0, 0.0, 7.0)))
print(model_svm.predict(Vectors.dense(15.0, 15.0, 15.0, 7.0, 7.0, 15.0, 15.0, 0.0, 12.0, 15.0)))
print(model_svm.predict(Vectors.dense(15.0, 15.0, 7.0, 0.0, 7.0, 0.0, 15.0, 15.0, 15.0, 15.0)))
print(model_svm.predict(Vectors.dense(7.0, 0.0, 15.0, 15.0, 15.0, 15.0, 7.0, 7.0, 15.0, 15.0)))
print("supposed to be 0")
print(model_svm.predict(Vectors.dense(18.0, 15.0, 7.0, 7.0, 15.0, 0.0, 15.0, 15.0, 15.0, 15.0)))
print(model_svm.predict(Vectors.dense(11.0, 13.0, 7.0, 10.0, 7.0, 13.0, 7.0, 19.0, 7.0, 7.0)))
print(model_svm.predict(Vectors.dense(15.0, 15.0, 18.0, 7.0, 15.0, 15.0, 15.0, 18.0, 7.0, 15.0)))
print(model_svm.predict(Vectors.dense(15.0, 15.0, 8.0, 0.0, 0.0, 8.0, 15.0, 15.0, 15.0, 7.0)))
which returns:
supposed to be 1
1
1
1
1
supposed to be 0
1
1
1
1
I have tried to keep my positive and negative classes balanced; my test data contain 3521 records and my training data 8356 records. For the evaluation, cross-validation applied to the scikit-learn model gives 98% accuracy, while for Spark the area under ROC is 0.5, the area under PR is 0.74, and the training error is 0.47.
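For reference, this is roughly how I compute the Spark metrics (a sketch; it assumes testData is an RDD of LabeledPoint, and the variable names are illustrative):

from pyspark.mllib.evaluation import BinaryClassificationMetrics

# Pair each prediction (as a float score) with its true label.
score_and_labels = testData.map(
    lambda p: (float(model_svm.predict(p.features)), p.label))

metrics = BinaryClassificationMetrics(score_and_labels)
print("Area under ROC = %s" % metrics.areaUnderROC)
print("Area under PR = %s" % metrics.areaUnderPR)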
I have also tried clearing the threshold and setting it back to 0.5 (see the sketch below), but this did not produce any better results. Sometimes, when I change the train-test split, I get, for example, all zeros except for one correct prediction, or all ones except for one correct zero prediction. Does anyone know how to approach this problem?
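The threshold adjustment looked roughly like this (model_svm is the model trained above):

# With the threshold cleared, predict returns the raw margin
# instead of a 0/1 class label.
model_svm.clearThreshold()
print(model_svm.predict(Vectors.dense(15.0, 15.0, 0.0, 15.0, 15.0, 4.0, 12.0, 8.0, 0.0, 7.0)))

# Restore a decision threshold of 0.5 so predict returns class labels again.
model_svm.setThreshold(0.5)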
As I said, I have checked multiple times that the dataset is exactly the same in both cases.
You're using different classifiers, so you're getting different results. scikit-learn's SVC is, by default, an SVM with an RBF kernel; SVMWithSGD is an SVM with a linear kernel, trained using stochastic gradient descent. They are totally different models.
If you want to match the results, then I think the way to go is to use
sklearn.linear_model.SGDClassifier(loss='hinge')
on the scikit-learn side and try to match the other parameters (regularization, whether to fit an intercept, etc.), because the defaults are not the same.
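For example (a sketch; the parameter values below are illustrative assumptions about which knobs to align, not a guaranteed one-to-one mapping of the two libraries' defaults, so check the docs of the versions you use):

from sklearn.linear_model import SGDClassifier

# Linear SVM trained with SGD, analogous to Spark's SVMWithSGD.
sgd_model = SGDClassifier(
    loss='hinge',         # hinge loss = linear SVM
    penalty='l2',         # SVMWithSGD also uses L2 regularization by default
    fit_intercept=False,  # SVMWithSGD does not fit an intercept by default
    max_iter=100)         # roughly comparable to iterations=100 in Spark
sgd_model.fit(X_train, y_train)
print(sgd_model.predict([[15.0, 15.0, 0.0, 15.0, 15.0, 4.0, 12.0, 8.0, 0.0, 7.0]]))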