Remove punctuation from a pandas column but keep the original list-of-lists structure


I know how to do this for a single list in a cell, but I need to keep the list-of-lists structure, as in [["I","need","to","remove","punctuations","."],[...],[...]] -> [["I","need","to","remove","punctuations"],[...],[...]]

Every method I know flattens the result into this -> ["I","need","to","remove","punctuations",...]

data["clean_text"] = data["clean_text"].apply(lambda x: [', '.join([c for c in s if c not in string.punctuation]) for s in x])
data["clean_text"] = data["clean_text"].str.replace(r'[^\w\s]+', '')
...

What's the best way to do that?

Timeless On BEST ANSWER

Following your approach, I would just use a nested list comprehension wrapped in a helper function:

import string

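def clean_up(lst):
    # Filter inside each sublist so the nested structure is preserved.
    return [[w for w in sublist if w not in string.punctuation] for sublist in lst]

# Write the result to a new column so "text" stays available for comparison.
data["clean_text"] = [clean_up(x) for x in data["text"]]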

Output:

print(data) # -- with two different columns so we can see the difference

                                                                                                    text  \
0  [[I, need, to, remove, punctuations, .], [This, is, another, list, with, commas, ,, and, periods, .]]   

                                                                                     clean_text  
0  [[I, need, to, remove, punctuations], [This, is, another, list, with, commas, and, periods]]  
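
One caveat with the helper above: `w not in string.punctuation` is a substring test against one long string, so a multi-character token such as "..." is kept while an empty string is dropped. If that matters for your data, here is a minimal sketch of a stricter variant. The names `PUNCT`, `is_punct_token` and `clean_up_strict` are made up for illustration, and it assumes "punctuation" means tokens made only of ASCII punctuation characters:

```python
import string

# Set lookup makes the per-character check cheap.
PUNCT = set(string.punctuation)

def is_punct_token(token):
    """True if token is non-empty and made only of ASCII punctuation."""
    return bool(token) and all(ch in PUNCT for ch in token)

def clean_up_strict(list_of_lists):
    """Drop punctuation-only tokens from each sublist, keeping the nesting."""
    return [[tok for tok in sub if not is_punct_token(tok)] for sub in list_of_lists]

# Hypothetical cell value: one list of lists, as stored in a single row.
rows = [["I", "need", "to", "remove", "punctuations", "..."], ["Wait", "?!", "", "ok"]]
cleaned = clean_up_strict(rows)

# With a DataFrame it would be applied per cell, e.g.:
# data["clean_text"] = data["text"].apply(clean_up_strict)
```

Here "..." and "?!" are removed and the empty string is left alone, which is the opposite of what the substring test does with those three tokens.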
Ynjxsjmh On

If your DataFrame is not too big, you can explode the list of lists into one row per token, filter out the rows that contain punctuation, and finally group the rows back into the original nesting.

df_ = df[['clean_text']].copy()

out = (df_.assign(g1=range(len(df)))
       .explode('clean_text', ignore_index=True)
       .explode('clean_text')
       .loc[lambda d: ~d['clean_text'].isin([',', '.'])]  # remove possible punctuation
       .groupby(level=0).agg({'clean_text': list, 'g1': 'first'})
       .groupby('g1').agg({'clean_text': list}))
print(df_)

                                                   clean_text
0  [[I, need, to, remove, punctuations, .], [Play, games, .]]


print(out)

                                             clean_text
g1
0   [[I, need, to, remove, punctuations], [Play, games]]
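
For anyone who wants to run the explode/groupby approach end to end, here is a self-contained sketch. The sample `df` is hypothetical, and `string.punctuation` is swapped in for the hard-coded `[',', '.']` on the assumption that every single ASCII punctuation character should be removed:

```python
import string

import pandas as pd

# Hypothetical sample frame: each cell holds a list of token lists.
df = pd.DataFrame({
    "clean_text": [
        [["I", "need", "to", "remove", "punctuations", "."], ["Play", "games", "."]],
        [["Hello", ",", "world", "!"]],
    ]
})

punct = list(string.punctuation)  # single-character punctuation tokens

out = (
    df[["clean_text"]]
    .assign(g1=range(len(df)))                     # remember the original row
    .explode("clean_text", ignore_index=True)      # one row per sublist
    .explode("clean_text")                         # one row per token
    .loc[lambda d: ~d["clean_text"].isin(punct)]   # drop punctuation tokens
    .groupby(level=0).agg({"clean_text": list, "g1": "first"})  # rebuild sublists
    .groupby("g1").agg({"clean_text": list})       # rebuild original rows
)

result = out["clean_text"].tolist()
print(result)
```

Note that a sublist consisting only of punctuation disappears entirely with this approach, because none of its rows survive the filter, and an empty sublist explodes to NaN rather than staying empty.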