How do I return a column of all matching terms or substrings found within a string? I suspect there's a way to do it with pl.any_horizontal(), as suggested in these comments, but I can't quite piece it together.
import re
import polars as pl

terms = ['a', 'This', 'e']
# Build a one-row frame, then collect the distinct regex matches per row
(pl.DataFrame({'col': ['This is a sentence']})
.with_columns(matched_terms = pl.col('col').map_elements(lambda x: list(set(re.findall('|'.join(terms), x)))))
)
The matched_terms column should return: ['a', 'This', 'e'] (in any order, since the matches pass through a set).
EDIT:
The winning solution here, .str.extract_all('|'.join(terms)).list.unique(), is different from this closely related question's winning solution, pl.col('col').str.split(' ').list.set_intersection(terms), because .set_intersection() only compares whole list elements, so it doesn't get substrings of list elements (such as partial, not full, words).
I've included the accompanying term-matching columns, but the each_term column with pl.col('a').str.extract_all('|'.join(terms)) was the best solution for me.