I want to use a sklearn model in pyspark.
I have trained a LightGBM model (through its sklearn API) that outputs probabilities for a propensity problem. I then converted this model to PMML format like this:
from sklearn2pmml import sklearn2pmml
sklearn2pmml(trained_model, 'prod_trained_model.pmml')
And then in pyspark, I read the PMML model like this:
from pypmml_spark import ScoreModel
model_pipeline = ScoreModel.fromFile('prod_trained_model.pmml')
Then I make predictions like this ('features_df' is a pyspark dataframe):
predictions_df = model_pipeline.transform(features_df)
Now the problem is that the PMML model's predictions do not match those of the original model: the predicted probabilities are shifted by 5% to 10%. Also, for around 5% of the rows in the input dataframe, the PMML model outputs NaN probabilities, whereas the original model predicts fine for those exact same rows.
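To make the mismatch concrete, the two probability columns can be compared side by side. A small sketch with made-up values (pandas is used here only for the comparison, not for scoring):

```python
import numpy as np
import pandas as pd

# Hypothetical probabilities from the two models (values are illustrative).
orig_probs = pd.Series([0.20, 0.55, 0.70, 0.10], name="original")
pmml_probs = pd.Series([0.25, 0.60, np.nan, 0.12], name="pmml")

# Fraction of rows where the PMML scorer returns NaN.
nan_rate = pmml_probs.isna().mean()

# Mean absolute probability shift on rows where both models produced a score.
shift = (pmml_probs - orig_probs).abs().dropna().mean()
```

Rows where the shift is large, or where the PMML output is NaN, are the ones to inspect for missing feature values.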
This is a common problem with missing / null values. PMML models do not process nulls the same way as your model does: LightGBM handles NaN natively during tree traversal, but the PMML representation may treat a missing input as invalid and produce a NaN score instead. This is why you should use .fillna() to impute missing values before scoring (and ideally apply the same imputation at training time, so the two models see identical inputs).
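A consistent way to apply this is to compute the fill values once from the training data and reuse the exact same values at scoring time. A minimal pandas sketch (column names are made up; the resulting dict can be passed straight to PySpark's DataFrame.fillna):

```python
import numpy as np
import pandas as pd

# Illustrative training features with missing values.
train = pd.DataFrame({
    "age":    [25.0, np.nan, 40.0],
    "income": [50_000.0, 62_000.0, np.nan],
})

# Derive per-column fill values from the training data (median here).
fill_values = train.median(numeric_only=True).to_dict()

# Impute before training ...
train_filled = train.fillna(fill_values)

# ... and reuse the exact same values before PMML scoring in PySpark:
# features_df = features_df.fillna(fill_values)
```

Using training-derived fill values on both sides keeps the sklearn model and the PMML scorer aligned, rather than imputing ad hoc in only one place.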