I have an issue with an SVM model trained for binary classification using Spark 2.0.0. I have followed the same logic using scikit-learn and MLlib, on the exact same dataset. For scikit-learn I have the following code:
from sklearn.svm import SVC

svc_model = SVC()
svc_model.fit(X_train, y_train)

print("supposed to be 1")
print(svc_model.predict([[15.0, 15.0, 0.0, 15.0, 15.0, 4.0, 12.0, 8.0, 0.0, 7.0]]))
print(svc_model.predict([[15.0, 15.0, 15.0, 7.0, 7.0, 15.0, 15.0, 0.0, 12.0, 15.0]]))
print(svc_model.predict([[15.0, 15.0, 7.0, 0.0, 7.0, 0.0, 15.0, 15.0, 15.0, 15.0]]))
print(svc_model.predict([[7.0, 0.0, 15.0, 15.0, 15.0, 15.0, 7.0, 7.0, 15.0, 15.0]]))
print("supposed to be 0")
print(svc_model.predict([[18.0, 15.0, 7.0, 7.0, 15.0, 0.0, 15.0, 15.0, 15.0, 15.0]]))
print(svc_model.predict([[11.0, 13.0, 7.0, 10.0, 7.0, 13.0, 7.0, 19.0, 7.0, 7.0]]))
print(svc_model.predict([[15.0, 15.0, 18.0, 7.0, 15.0, 15.0, 15.0, 18.0, 7.0, 15.0]]))
print(svc_model.predict([[15.0, 15.0, 8.0, 0.0, 0.0, 8.0, 15.0, 15.0, 15.0, 7.0]]))
and it returns:
supposed to be 1
[0]
[1]
[1]
[1]
supposed to be 0
[0]
[0]
[0]
[0]
For Spark I am doing:
from pyspark.mllib.classification import SVMWithSGD
from pyspark.mllib.linalg import Vectors

model_svm = SVMWithSGD.train(trainingData, iterations=100)

print("supposed to be 1")
print(model_svm.predict(Vectors.dense(15.0, 15.0, 0.0, 15.0, 15.0, 4.0, 12.0, 8.0, 0.0, 7.0)))
print(model_svm.predict(Vectors.dense(15.0, 15.0, 15.0, 7.0, 7.0, 15.0, 15.0, 0.0, 12.0, 15.0)))
print(model_svm.predict(Vectors.dense(15.0, 15.0, 7.0, 0.0, 7.0, 0.0, 15.0, 15.0, 15.0, 15.0)))
print(model_svm.predict(Vectors.dense(7.0, 0.0, 15.0, 15.0, 15.0, 15.0, 7.0, 7.0, 15.0, 15.0)))
print("supposed to be 0")
print(model_svm.predict(Vectors.dense(18.0, 15.0, 7.0, 7.0, 15.0, 0.0, 15.0, 15.0, 15.0, 15.0)))
print(model_svm.predict(Vectors.dense(11.0, 13.0, 7.0, 10.0, 7.0, 13.0, 7.0, 19.0, 7.0, 7.0)))
print(model_svm.predict(Vectors.dense(15.0, 15.0, 18.0, 7.0, 15.0, 15.0, 15.0, 18.0, 7.0, 15.0)))
print(model_svm.predict(Vectors.dense(15.0, 15.0, 8.0, 0.0, 0.0, 8.0, 15.0, 15.0, 15.0, 7.0)))
which returns:
supposed to be 1
1
1
1
1
supposed to be 0
1
1
1
1
I have tried to keep my positive and negative classes balanced; my test data contain 3521 records and my training data 8356 records. For the evaluation, cross-validation applied to the scikit-learn model gives 98% accuracy, while for Spark the area under ROC is 0.5, the area under PR is 0.74, and the training error is 0.47.
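For reference, this is roughly how I compute the Spark metrics (a sketch; it assumes testData is an RDD of LabeledPoint, and the variable names are illustrative):

from pyspark.mllib.evaluation import BinaryClassificationMetrics

# Pair each prediction (as a float score) with its true label.
score_and_labels = testData.map(
    lambda p: (float(model_svm.predict(p.features)), p.label))

metrics = BinaryClassificationMetrics(score_and_labels)
print("Area under ROC = %s" % metrics.areaUnderROC)
print("Area under PR = %s" % metrics.areaUnderPR)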
I have also tried clearing the threshold and setting it back to 0.5 (see the sketch below), but this did not produce any better results. Sometimes, when I change the train-test split, I get, for example, all zeros except for one correct prediction, or all ones except for one correct zero prediction. Does anyone know how to approach this problem?
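The threshold adjustment looked roughly like this (model_svm is the model trained above):

# With the threshold cleared, predict returns the raw margin
# instead of a 0/1 class label.
model_svm.clearThreshold()
print(model_svm.predict(Vectors.dense(15.0, 15.0, 0.0, 15.0, 15.0, 4.0, 12.0, 8.0, 0.0, 7.0)))

# Restore a decision threshold of 0.5 so predict returns class labels again.
model_svm.setThreshold(0.5)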
As I said, I have checked multiple times that the dataset is exactly the same in both cases.
You're using different classifiers, so you're getting different results. scikit-learn's SVC is, by default, an SVM with an RBF kernel; SVMWithSGD is an SVM with a linear kernel, trained using stochastic gradient descent. They are totally different models.
If you want to match the results, then I think the way to go is to use
sklearn.linear_model.SGDClassifier(loss='hinge')
on the scikit-learn side and try to match the other parameters (regularization, whether to fit an intercept, etc.), because the defaults are not the same.
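For example (a sketch; the parameter values below are illustrative assumptions about which knobs to align, not a guaranteed one-to-one mapping of the two libraries' defaults, so check the docs of the versions you use):

from sklearn.linear_model import SGDClassifier

# Linear SVM trained with SGD, analogous to Spark's SVMWithSGD.
sgd_model = SGDClassifier(
    loss='hinge',         # hinge loss = linear SVM
    penalty='l2',         # SVMWithSGD also uses L2 regularization by default
    fit_intercept=False,  # SVMWithSGD does not fit an intercept by default
    max_iter=100)         # roughly comparable to iterations=100 in Spark
sgd_model.fit(X_train, y_train)
print(sgd_model.predict([[15.0, 15.0, 0.0, 15.0, 15.0, 4.0, 12.0, 8.0, 0.0, 7.0]]))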