Text field concatenation in sklearn pipeline

Question

804 Views Asked by Tom Lous At 17 August 2025 at 23:47

I have a multi-line JSON dataset whose records contain several fields that may or may not exist. The text in those fields can be a string, a list of strings, or a more complicated mapping (a list of dicts).

E.g.:

{"yvalue":1.0,"field1":"Some text", "field2":"More Text", "field3": ["text","items","in","list"], "field4":[{"id":3,"name":"text"},{"id":4,"name":"text"}]}
{"yvalue":2.0,"field2":"More Text2", "field3": ["text2","items2","in2","list2"], "field4":[{"id":4,"name":"text"},{"id":4,"name":"text"}], "field5":"extra text"}
...

This dataset is needed as input for a sklearn pipeline.

First of all, I'm reading the file via pandas:

df = pandas.read_json(args.input_file, lines=True)

But I'd like to use a pipeline transformer like DataFrameMapper to concatenate all text fields (even the nested ones) into one large text field, taking into account that certain fields may be missing, may be part of nested structures, etc.

The output would look something like:

yvalue | text

1.0 | Some text More Text text items in list text text

2.0 | More Text2 text2 items2 in2 list2 text text extra text

Of course I can use a custom transformer, but since I'm also interested in converting the pipeline to MLeap or PMML format, I'd rather refrain from using custom transformers as much as possible.

Is there a best practice, or even an easy way, to do this without getting too hacky?

Update

Apparently what I want is a bit too much, so maybe something easier: is there a way to concatenate just two (or more) string-like fields using a transformer, as in pandas:

df[['field1', 'field2']].apply(lambda x: ' '.join(x.astype(str)), axis=1)

There are 2 best solutions below

**Brian Bien** · Answer 1

It's reasonable to use pipelines and transformers for easier model interpretability (e.g. SHAP values), rather than doing all preprocessing in pandas alone.

Assuming a dataframe X of text columns:


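    from sklearn.base import BaseEstimator, TransformerMixin


    class StringConcatTransformer(TransformerMixin, BaseEstimator):
        """Concatenate multiple string fields into a single field."""

        def __init__(self, missing_indicator=''):
            # Value substituted for missing (NaN/None) cells before joining.
            self.missing_indicator = missing_indicator

        def fit(self, X, y=None, **fit_params):
            # Stateless: there is nothing to learn from the data.
            return self

        def transform(self, X, y=None):
            # Join each row's columns with a space; X must be a DataFrame of strings.
            return X.fillna(self.missing_indicator).agg(' '.join, axis=1)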
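
As a rough sketch of how such a concatenation step might sit in front of a text vectorizer, the snippet below expresses the same row-wise join with scikit-learn's built-in `FunctionTransformer` instead of a custom class. The helper name `concat_text_columns`, the two-column toy DataFrame, and the choice of `TfidfVectorizer` as the next step are assumptions for illustration, not part of the question. Whether this variant can be exported to PMML or MLeap is not addressed here.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer


def concat_text_columns(X):
    """Join all columns of a DataFrame row-wise into one string per row.

    Hypothetical helper: missing values become empty strings, so a row
    with a missing first column starts with a space.
    """
    return X.fillna("").astype(str).agg(" ".join, axis=1)


# Toy data modelled on the question's field1/field2 (assumed values).
df = pd.DataFrame({
    "field1": ["Some text", None],
    "field2": ["More Text", "More Text2"],
})

pipeline = make_pipeline(
    FunctionTransformer(concat_text_columns),  # DataFrame -> Series of strings
    TfidfVectorizer(),                         # Series of strings -> sparse matrix
)

features = pipeline.fit_transform(df[["field1", "field2"]])
```

`FunctionTransformer` does not validate its input by default, so the DataFrame reaches the helper unchanged and the resulting Series of strings is passed directly to the vectorizer.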

**user1808924** · Answer 2

Consider refactoring your data pre-processing. The Scikit-Learn pipeline is not the place for low-level data sanitization/preparation work such as unpacking collections and (conditionally) concatenating text fields into a text document.

This is a regular programming task, not a machine learning task. Therefore, you should use regular programming tools, not machine learning tools (e.g. Scikit-Learn transformers), to accomplish it. Neither PMML nor MLeap is suited for low-level text processing.
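
One way to read this advice is to flatten each JSON line into a single text field with plain Python before anything reaches the pipeline. The following is a minimal sketch under that assumption. The function names `collect_text` and `flatten_record` are hypothetical, and the rule that only string values count as text (so numeric ids are skipped) is an assumption based on the desired output shown in the question.

```python
import json


def collect_text(value):
    """Recursively yield every string inside a str, list, or dict value."""
    if isinstance(value, str):
        yield value
    elif isinstance(value, dict):
        for item in value.values():
            yield from collect_text(item)
    elif isinstance(value, list):
        for item in value:
            yield from collect_text(item)
    # Numbers, None, and any other types are ignored.


def flatten_record(line, target="yvalue"):
    """Parse one JSON line into {target: ..., "text": all strings joined}."""
    record = json.loads(line)
    y = record.pop(target, None)  # remove the target so it is not treated as text
    return {target: y, "text": " ".join(collect_text(record))}


# The two sample lines from the question.
lines = [
    '{"yvalue":1.0,"field1":"Some text", "field2":"More Text", '
    '"field3": ["text","items","in","list"], '
    '"field4":[{"id":3,"name":"text"},{"id":4,"name":"text"}]}',
    '{"yvalue":2.0,"field2":"More Text2", '
    '"field3": ["text2","items2","in2","list2"], '
    '"field4":[{"id":4,"name":"text"},{"id":4,"name":"text"}], '
    '"field5":"extra text"}',
]

rows = [flatten_record(line) for line in lines]
```

The resulting list of dicts can be handed to `pandas.DataFrame(rows)`, after which the pipeline only has to deal with a single `text` column.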
