Error generating PMML pipeline for SKLearn Text classification Pipeline

Asked by Vinh Nguyen on 25 September 2020 at 19:58 (685 views)

I am trying to generate a PMML file with the sklearn2pmml library in Python for a scikit-learn pipeline. The pipeline consists only of a CountVectorizer and an SVC model. It is a very simple pipeline, but I can't get it to export as a PMML file.

SKLearn pipeline:

Pipeline(memory=None,
         steps=[('vect',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, vocabulary=None)),
                ('model',
                 SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None,
                     coef0=0.0, decision_function_shape='ovr', degree=3,
                     gamma='auto', kernel='linear', max_iter=-1,
                     probability=True, random_state=None, shrinking=True,
                     tol=0.001, verbose=False))],
         verbose=False)

Script:

from sklearn2pmml import make_pmml_pipeline, sklearn2pmml

pmml_pipe = make_pmml_pipeline(sklearn_pipeline, 'text', 'label')
sklearn2pmml(pmml_pipe, 'outputs/pipeline.pmml')

Error:

Standard output is empty
Standard error:
Sep 25, 2020 3:51:45 PM org.jpmml.sklearn.Main run
INFO: Parsing PKL..
Sep 25, 2020 3:51:45 PM org.jpmml.sklearn.Main run
INFO: Parsed PKL in 217 ms.
Sep 25, 2020 3:51:45 PM org.jpmml.sklearn.Main run
INFO: Converting..
Sep 25, 2020 3:51:45 PM org.jpmml.sklearn.Main run
SEVERE: Failed to convert
java.lang.IllegalArgumentException: Attribute 'sklearn.feature_extraction.text.CountVectorizer.tokenizer' has a missing (None/null) value
    at org.jpmml.python.PythonObject.get(PythonObject.java:72)
    at sklearn.feature_extraction.text.CountVectorizer.getTokenizer(CountVectorizer.java:242)
    at sklearn.feature_extraction.text.CountVectorizer.encodeDefineFunction(CountVectorizer.java:147)
    at sklearn.feature_extraction.text.CountVectorizer.encodeFeatures(CountVectorizer.java:115)
    at sklearn.Transformer.encode(Transformer.java:60)
    at sklearn.Composite.encodeFeatures(Composite.java:119)
    at sklearn2pmml.pipeline.PMMLPipeline.encodePMML(PMMLPipeline.java:215)
    at org.jpmml.sklearn.Main.run(Main.java:233)
    at org.jpmml.sklearn.Main.main(Main.java:151)

Exception in thread "main" java.lang.IllegalArgumentException: Attribute 'sklearn.feature_extraction.text.CountVectorizer.tokenizer' has a missing (None/null) value
    at org.jpmml.python.PythonObject.get(PythonObject.java:72)
    at sklearn.feature_extraction.text.CountVectorizer.getTokenizer(CountVectorizer.java:242)
    at sklearn.feature_extraction.text.CountVectorizer.encodeDefineFunction(CountVectorizer.java:147)
    at sklearn.feature_extraction.text.CountVectorizer.encodeFeatures(CountVectorizer.java:115)
    at sklearn.Transformer.encode(Transformer.java:60)
    at sklearn.Composite.encodeFeatures(Composite.java:119)
    at sklearn2pmml.pipeline.PMMLPipeline.encodePMML(PMMLPipeline.java:215)
    at org.jpmml.sklearn.Main.run(Main.java:233)
    at org.jpmml.sklearn.Main.main(Main.java:151)

The JPMML-SkLearn conversion application has failed. The Java executable should have printed more information about the failure into its standard output and/or standard error streams

I'm not sure what I am doing wrong. Looking for a solution.

**user1808924** · Answer 1 · 2020-09-26T07:08:54.680000

You need to use a PMML-compatible text tokenizer.

Right now, you are tokenizing sentences using a free-form regex (CountVectorizer(tokenizer = None, token_pattern = ...)). The converter asks the CountVectorizer for a tokenizer object, which is why it complains that the tokenizer attribute is None. You need to switch to the sklearn2pmml.feature_extraction.text.Splitter tokenizer implementation (CountVectorizer(tokenizer = Splitter(), token_pattern = None)).

A working example is in the SkLearn2PMML/JPMML-SkLearn integration testing suite: https://github.com/jpmml/jpmml-sklearn/blob/1.6.4/src/test/resources/main.py#L537
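
Switching from pattern matching to a split-based tokenizer can change which tokens end up in the vocabulary, so it may be worth comparing the two styles before retraining. The following is a minimal, stdlib-only sketch and not the sklearn2pmml implementation: `regex_tokens` mimics the default token_pattern from the question, while `split_tokens` is a hypothetical stand-in that splits on whitespace (`\s+`), which is assumed here to resemble Splitter's default behavior. Check the Splitter source for its exact rules.

```python
import re

# Default token_pattern, copied from the pipeline repr in the question.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

# Assumption: a split-based tokenizer separates tokens on runs of whitespace.
SEPARATOR = re.compile(r"\s+")


def regex_tokens(text):
    """Extract tokens by matching a pattern (words of 2+ word characters)."""
    return TOKEN_PATTERN.findall(text.lower())


def split_tokens(text):
    """Produce tokens by splitting on the separator and dropping empty strings."""
    return [token for token in SEPARATOR.split(text.lower()) if token]


sample = "PMML export failed, I think."
print(regex_tokens(sample))
print(split_tokens(sample))
```

In this sketch the pattern-based version drops punctuation and single-character words, while the plain whitespace split keeps both, so the two vocabularies will generally not be identical.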
