I have a multi-line JSON (JSON Lines) dataset containing multiple fields that may or may not exist per record and can hold textual data as a string, a list of strings, or a more complicated mapping (a list of dicts), e.g.:
{"yvalue":1.0,"field1":"Some text", "field2":"More Text", "field3": ["text","items","in","list"], "field4":[{"id":3,"name":"text"},{"id":4,"name":"text"}]}
{"yvalue":2.0,"field2":"More Text2", "field3": ["text2","items2","in2","list2"], "field4":[{"id":4,"name":"text"},{"id":4,"name":"text"}], "field5":"extra text"}
...
This dataset is needed as input for an sklearn pipeline.
First of all I'm reading the file via pandas:
df = pandas.read_json(args.input_file, lines=True)
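For reference, here is a minimal, self-contained sketch of that read step using the two sample records above (the inline `raw` string stands in for `args.input_file`, which is an assumption for illustration):

```python
import io
import pandas as pd

# Two sample records in JSON Lines format (one JSON object per line),
# mirroring the dataset described above.
raw = (
    '{"yvalue":1.0,"field1":"Some text","field2":"More Text",'
    '"field3":["text","items","in","list"],'
    '"field4":[{"id":3,"name":"text"},{"id":4,"name":"text"}]}\n'
    '{"yvalue":2.0,"field2":"More Text2",'
    '"field3":["text2","items2","in2","list2"],'
    '"field4":[{"id":4,"name":"text"},{"id":4,"name":"text"}],'
    '"field5":"extra text"}\n'
)

# lines=True tells pandas to parse one JSON object per line.
df = pd.read_json(io.StringIO(raw), lines=True)
# Fields absent from a given line come back as NaN in that row,
# e.g. field1 in the second record and field5 in the first.
```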
But I'd like to use a pipeline transformer like DataFrameMapper
to concat all text fields (even the nested ones) into one huge text field, taking into account that certain fields may be missing or part of nested structures.
The output would look something like:
yvalue | text
1.0 | Some text More Text text items in list text text
2.0 | More Text2 text2 items2 in2 list2 text text extra text
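To make the desired output concrete, here is a pandas-only sketch of the flattening (not a pipeline transformer; `collect_text` and `concat_text_fields` are hypothetical helper names, not library functions):

```python
import pandas as pd

def collect_text(value):
    """Recursively gather all string leaves from a cell that may be a
    string, a list of strings, a list of dicts, or NaN."""
    if isinstance(value, str):
        return [value]
    if isinstance(value, list):
        out = []
        for item in value:
            out.extend(collect_text(item))
        return out
    if isinstance(value, dict):
        out = []
        for v in value.values():
            out.extend(collect_text(v))
        return out
    return []  # NaN, numbers, None, etc. contribute nothing

def concat_text_fields(df, text_cols):
    """Join every string found in text_cols into one space-separated field."""
    return df[text_cols].apply(
        lambda row: " ".join(s for cell in row for s in collect_text(cell)),
        axis=1,
    )
```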
Of course I could use a custom transformer, but since I'm also interested in converting the pipeline to MLeap or PMML format, I'd rather refrain from using custom transformers as much as possible.
Is there a best practice or even an easy way to do this without getting too hacky?
Update
Apparently what I want may be a bit too much to ask, so here is something simpler: is there a way to concat just two (or more) string-like fields using a transformer, the way I would in pandas:
df[['field1', 'field2']].apply(lambda x: ' '.join(x.astype(str)), axis=1)
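One way to express exactly that join inside a pipeline step is to wrap it in sklearn's `FunctionTransformer` (a sketch; note that from the point of view of MLeap/PMML export this is still custom Python code, so it only avoids writing a full transformer class, not the serialization problem):

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

def join_columns(X):
    # X arrives as a DataFrame slice; join each row's values with spaces
    # and return the result as a one-column DataFrame named "text".
    return X.apply(lambda row: " ".join(row.astype(str)), axis=1).to_frame("text")

# validate=False keeps sklearn from coercing the DataFrame to a numeric array.
concat = FunctionTransformer(join_columns, validate=False)
```

It can then be used like any other step, e.g. `concat.fit_transform(df[["field1", "field2"]])`.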
It's reasonable to use pipelines and transformers for easier model interpretability (e.g. SHAP values) rather than straight pandas-only preprocessing.
Assuming a dataframe `X` of text columns:
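A minimal sketch of that approach using only standard sklearn pieces (the `join_all_columns` helper and the sample `X` are assumptions for illustration): join all columns of `X` row-wise into one text field, then feed it into an ordinary text vectorizer.

```python
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

def join_all_columns(X):
    # Cast every cell to str and join each row with spaces,
    # yielding one string per row for the vectorizer.
    return X.astype(str).apply(" ".join, axis=1)

pipeline = make_pipeline(
    FunctionTransformer(join_all_columns, validate=False),
    TfidfVectorizer(),
)

X = pd.DataFrame({
    "field1": ["Some text", "More"],
    "field2": ["More Text", "Text2"],
})
features = pipeline.fit_transform(X)  # sparse TF-IDF matrix, one row per record
```

The same pipeline can then be extended with an estimator as a final step, keeping the whole preprocessing chain inside sklearn.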