Error generating PMML pipeline for SKLearn Text classification Pipeline

I am trying to generate a PMML file with the sklearn2pmml library in Python for an SKLearn pipeline. The pipeline consists only of a CountVectorizer and an SVC model. It is a very simple pipeline, but I can't get it to export as a PMML file.

SKLearn pipeline:

Pipeline(memory=None,
         steps=[('vect',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, vocabulary=None)),
                ('model',
                 SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None,
                     coef0=0.0, decision_function_shape='ovr', degree=3,
                     gamma='auto', kernel='linear', max_iter=-1,
                     probability=True, random_state=None, shrinking=True,
                     tol=0.001, verbose=False))],
         verbose=False)

Script:

from sklearn2pmml import make_pmml_pipeline, sklearn2pmml

pmml_pipe = make_pmml_pipeline(sklearn_pipeline, 'text', 'label')
sklearn2pmml(pmml_pipe, 'outputs/pipeline.pmml')

Error:

Standard output is empty
Standard error:
Sep 25, 2020 3:51:45 PM org.jpmml.sklearn.Main run
INFO: Parsing PKL..
Sep 25, 2020 3:51:45 PM org.jpmml.sklearn.Main run
INFO: Parsed PKL in 217 ms.
Sep 25, 2020 3:51:45 PM org.jpmml.sklearn.Main run
INFO: Converting..
Sep 25, 2020 3:51:45 PM org.jpmml.sklearn.Main run
SEVERE: Failed to convert
java.lang.IllegalArgumentException: Attribute 'sklearn.feature_extraction.text.CountVectorizer.tokenizer' has a missing (None/null) value
    at org.jpmml.python.PythonObject.get(PythonObject.java:72)
    at sklearn.feature_extraction.text.CountVectorizer.getTokenizer(CountVectorizer.java:242)
    at sklearn.feature_extraction.text.CountVectorizer.encodeDefineFunction(CountVectorizer.java:147)
    at sklearn.feature_extraction.text.CountVectorizer.encodeFeatures(CountVectorizer.java:115)
    at sklearn.Transformer.encode(Transformer.java:60)
    at sklearn.Composite.encodeFeatures(Composite.java:119)
    at sklearn2pmml.pipeline.PMMLPipeline.encodePMML(PMMLPipeline.java:215)
    at org.jpmml.sklearn.Main.run(Main.java:233)
    at org.jpmml.sklearn.Main.main(Main.java:151)

Exception in thread "main" java.lang.IllegalArgumentException: Attribute 'sklearn.feature_extraction.text.CountVectorizer.tokenizer' has a missing (None/null) value
    at org.jpmml.python.PythonObject.get(PythonObject.java:72)
    at sklearn.feature_extraction.text.CountVectorizer.getTokenizer(CountVectorizer.java:242)
    at sklearn.feature_extraction.text.CountVectorizer.encodeDefineFunction(CountVectorizer.java:147)
    at sklearn.feature_extraction.text.CountVectorizer.encodeFeatures(CountVectorizer.java:115)
    at sklearn.Transformer.encode(Transformer.java:60)
    at sklearn.Composite.encodeFeatures(Composite.java:119)
    at sklearn2pmml.pipeline.PMMLPipeline.encodePMML(PMMLPipeline.java:215)
    at org.jpmml.sklearn.Main.run(Main.java:233)
    at org.jpmml.sklearn.Main.main(Main.java:151)

The JPMML-SkLearn conversion application has failed. The Java executable should have printed more information about the failure into its standard output and/or standard error streams

I'm not sure what I am doing wrong. Looking for a solution.

Answer:

You need to use a PMML-compatible text tokenizer.

Right now, you are tokenizing sentences using a free-form regex (CountVectorizer(tokenizer = None, token_pattern = ...)). The converter reads the tokenizer attribute and fails when it is None, as the stack trace shows. Switch to the sklearn2pmml.feature_extraction.text.Splitter tokenizer implementation (CountVectorizer(tokenizer = Splitter(), token_pattern = None)).

Working example in the SkLearn2PMML/JPMML-SkLearn integration testing suite: https://github.com/jpmml/jpmml-sklearn/blob/1.6.4/src/test/resources/main.py#L537
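
Switching tokenizers can change the vocabulary, so it is worth comparing the two on a sample sentence. The sketch below uses only the standard library: regex_tokens mirrors the default token_pattern shown in the question, while whitespace_tokens is a rough stand-in that assumes Splitter splits on whitespace. The real class may post-process tokens differently (for example, by trimming punctuation), so check against Splitter itself before relying on this.

```python
import re

TOKEN_PATTERN = r"(?u)\b\w\w+\b"  # CountVectorizer default, as shown in the question


def regex_tokens(text):
    """Tokens as produced by the default token_pattern (after lowercasing)."""
    return re.findall(TOKEN_PATTERN, text.lower())


def whitespace_tokens(text):
    """Hypothetical stand-in for a whitespace splitter; not the Splitter class itself."""
    return [tok for tok in re.split(r"\s+", text.lower()) if tok]


sample = "PMML export failed, see log a.txt"
print(regex_tokens(sample))       # ['pmml', 'export', 'failed', 'see', 'log', 'txt']
print(whitespace_tokens(sample))  # ['pmml', 'export', 'failed,', 'see', 'log', 'a.txt']
```

The regex drops punctuation and skips one-character tokens, whereas a plain whitespace split keeps 'failed,' and 'a.txt' intact, so a model trained on one tokenization should not be assumed to behave the same on the other.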