Text field concatenation in sklearn pipeline


I have a line-delimited JSON dataset whose fields may or may not be present in each record. A field can hold its text as a plain string, as a list of strings, or as a more complicated structure (a list of dicts).

E.g.:

{"yvalue":1.0,"field1":"Some text", "field2":"More Text", "field3": ["text","items","in","list"], "field4":[{"id":3,"name":"text"},{"id":4,"name":"text"}]}
{"yvalue":2.0,"field2":"More Text2", "field3": ["text2","items2","in2","list2"], "field4":[{"id":4,"name":"text"},{"id":4,"name":"text"}], "field5":"extra text"}
...

This dataset is needed as input for a sklearn pipeline.

First of all, I'm reading the file via pandas:

df = pandas.read_json(args.input_file, lines=True)

But I'd like to use a pipeline transformer like DataFrameMapper to concatenate all text fields (even the nested ones) into one large text field, taking into account that certain fields may be missing or may be part of nested structures.

The output would look something like:

yvalue | text
1.0 | Some text More Text text items in list text text
2.0 | More Text2 text2 items2 in2 list2 text text extra text

Of course I can use a custom transformer, but since I'm also interested in converting the pipeline to MLeap or PMML format, I'd rather avoid custom transformers as much as possible.

Is there a best practice, or even just an easy way, to do this without getting too hacky?


Update

Apparently what I want is a bit too much, so here is something simpler: is there a way to concatenate just 2 (or more) string-like fields using a transformer, the way it can be done in pandas?

df[['field1', 'field2']].apply(lambda x: ' '.join(x.astype(str)), axis=1)


There are 2 best solutions below


It's reasonable to keep this step inside a pipeline as a transformer, rather than doing it as pandas-only preprocessing, because that makes model interpretability (e.g. SHAP values) easier.

Assuming a dataframe X of text columns:


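    from sklearn.base import BaseEstimator, TransformerMixin


    class StringConcatTransformer(TransformerMixin, BaseEstimator):
        """Concatenate multiple string fields into a single field."""

        def __init__(self, missing_indicator=''):
            # Text substituted for missing values (NaN/None).
            self.missing_indicator = missing_indicator

        def fit(self, X, y=None, **fit_params):
            # Stateless: there is nothing to learn from the data.
            return self

        def transform(self, X, y=None):
            # Fill missing values, then join each row's values with a space.
            return X.fillna(self.missing_indicator).agg(' '.join, axis=1)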
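
As a usage sketch, the transformer above can be chained with a text vectorizer. The sample rows, the column names and the choice of `TfidfVectorizer` are assumptions for illustration only; the class is repeated so that the snippet runs on its own.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline


class StringConcatTransformer(TransformerMixin, BaseEstimator):
    """Concatenate multiple string fields into a single field."""

    def __init__(self, missing_indicator=''):
        self.missing_indicator = missing_indicator

    def fit(self, X, y=None, **fit_params):
        return self

    def transform(self, X, y=None):
        return X.fillna(self.missing_indicator).agg(' '.join, axis=1)


# Hypothetical sample data; field1 is missing in the second row.
X = pd.DataFrame({
    'field1': ['Some text', None],
    'field2': ['More Text', 'More Text2'],
})

# On its own, the transformer returns one string per row.
concatenated = StringConcatTransformer().fit_transform(X)

# The resulting Series is an iterable of strings, so it can feed a vectorizer.
pipeline = make_pipeline(StringConcatTransformer(), TfidfVectorizer())
features = pipeline.fit_transform(X)
```

Note that with an empty `missing_indicator` a missing field still contributes a separator, so the second row starts with a space. Most tokenizers ignore that, but it matters if the exact string is compared.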

Consider refactoring your data pre-processing. The Scikit-Learn pipeline is not the place for low-level data sanitization and preparation work, such as unpacking collections and conditionally concatenating text fields into a single text document.

This is a regular programming task, not a machine learning task. Therefore, you should use regular programming tools, not machine learning tools (e.g. Scikit-Learn transformers), to accomplish it. Neither PMML nor MLeap is suited for low-level text processing.
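
A minimal sketch of that kind of plain-Python preparation, run before any pipeline sees the data, could look like the following. The helper names `iter_text` and `to_document` are hypothetical, and the rule that only string values are kept (so numeric ids are dropped) is an assumption chosen to reproduce the expected output from the question.

```python
import json


def iter_text(value):
    """Yield every string found in a string, list or dict, in order."""
    if isinstance(value, str):
        yield value
    elif isinstance(value, list):
        for item in value:
            yield from iter_text(item)
    elif isinstance(value, dict):
        for item in value.values():
            yield from iter_text(item)
    # Numbers, booleans and None are ignored (assumption).


def to_document(record, target='yvalue'):
    """Split one parsed JSON record into (target, concatenated text)."""
    fields = {k: v for k, v in record.items() if k != target}
    return record.get(target), ' '.join(iter_text(fields))


# The two sample records from the question, one JSON document per line.
lines = [
    '{"yvalue":1.0,"field1":"Some text", "field2":"More Text", '
    '"field3": ["text","items","in","list"], '
    '"field4":[{"id":3,"name":"text"},{"id":4,"name":"text"}]}',
    '{"yvalue":2.0,"field2":"More Text2", '
    '"field3": ["text2","items2","in2","list2"], '
    '"field4":[{"id":4,"name":"text"},{"id":4,"name":"text"}], '
    '"field5":"extra text"}',
]

rows = [to_document(json.loads(line)) for line in lines]
```

The resulting list of `(yvalue, text)` pairs can then be loaded into a two-column dataframe, leaving the pipeline with a single text column that standard, convertible transformers can handle.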