Removing specific words from pandas dataframe


Sample table:

a
ryzen cpu,ryzen 5 5600x,best,amd ryzen,sale
cpu,ryzen 9 7800x,available,computer for ryzen,new
import pandas as pd

df = pd.DataFrame({'a': ['ryzen cpu,ryzen 5 5600x,best,amd ryzen,sale',
                         'cpu,ryzen 9 7800x,available,computer for ryzen,new']})
from nltk.corpus import stopwords, wordnet
stop = stopwords.words('english')
b = ['best', 'sale', 'new', 'available']
c = stop + b
x = []
for i in df['a'].str.split(','):
    for j in i:
        if j not in b:
            x.append(j)
print(x)

I am trying to remove the stopwords and the other specific words mentioned above. The other words are getting removed, but the stopwords are not.

This is the output I am getting:

['ryzen cpu', 'ryzen 5 5600x', 'amd ryzen', 'cpu', 'ryzen 9 7800x', 'computer for ryzen']

I am also not able to get the result back into table format. I have tried the following list comprehension, but it is not working:

df['a'] = df['a'].apply(lambda x: ''.join([j for i in x.split(' , ') for j in i if j not in c]))
df['a']

The output it gives seems completely off (some of the letters are gone entirely: "ryzen" has become "rzen", "sale" has become "le", etc.):

a
rzen cpu,rzen 5 5600x,be, rzen,le
cpu,rzen 9 7800x,vlble,cpuer fr rzen,new

Can anyone please help me understand what exactly I am doing wrong, and how to proceed further with this?

The expected output looks something like this:

a
ryzen cpu,ryzen 5 5600x,amd ryzen
cpu,ryzen 9 7800x,computer ryzen
There are 3 answers below.

Answer from Maria K

You're not really using the stop words, only your additional words. Look at this line:

if j not in b:

Change b to c here.

It really, really helps debugging if you name your variables so that the names reflect what they contain.

As for the loops: this code is very hard to read, so let's break it down step by step.

df['a'] = df['a'].apply(lambda x: ''.join([j for i in x.split(' , ') for j in i if j not in c]))
  • i is a chunk of x (from the part for i in x.split(' , '))
  • j is a single character of i (for j in i)
  • each character j is then filtered (if j not in c)
  • the surviving characters are joined together
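To see why that mangles the words, here is a minimal sketch of the character-level filtering. NLTK's English stopword list contains single-letter entries such as 'a', 'i', 's' and 'y'; a small hard-coded stand-in is used here so the example is self-contained:

```python
# Hard-coded stand-in for the single-letter entries in NLTK's
# English stopword list ('a', 'i', 's', 'y', ...).
stop_letters = {'a', 'i', 's', 'y', 'd', 'm', 'o', 't'}

# Iterating over a string yields characters, so every letter is
# checked against the stop list individually:
filtered = ''.join(ch for ch in 'ryzen' if ch not in stop_letters)
print(filtered)  # rzen  ('y' is a stopword, so it is dropped)
```

The same effect turns "sale" into "le" ('s' and 'a' are both stopwords), which matches the garbled output in the question.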

That's how you end up with random characters. I think the right thing to do is to get rid of the lambda in apply, because there's too much logic in it and it's easy to get confused. It is better to write a separate function and apply it.

(By the way, don't split by " , ", split by ","!)

def filter_stop_words(values, stop_words=c):
    words = values.split(",")
    # Re-join with "," so the phrases stay separated.
    return ",".join(filter(lambda x: x not in stop_words, words))

df['a'].apply(filter_stop_words)
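Note that this filters whole comma-separated phrases, so a stopword embedded inside a phrase (the "for" in "computer for ryzen") survives. A sketch that also splits each phrase into words, using a small hard-coded stand-in for c:

```python
# Hard-coded stand-in for the combined list c (NLTK stopwords + extra words).
STOP_WORDS = {'best', 'sale', 'new', 'available', 'for'}

def filter_stop_words(value, stop_words=STOP_WORDS):
    kept = []
    for phrase in value.split(','):
        # Filter stopwords inside each phrase, not just whole-phrase matches.
        words = [w for w in phrase.split() if w not in stop_words]
        if words:
            kept.append(' '.join(words))
    return ','.join(kept)

print(filter_stop_words('cpu,ryzen 9 7800x,available,computer for ryzen,new'))
# cpu,ryzen 9 7800x,computer ryzen
```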
Answer from user19077881

You can use:

df['a'] = df['a'].str.replace(r'\b(?:' + '|'.join(c) + r')\b', '', regex=True).replace(',,', ',', regex=True)

The last replace removes the double comma (,,) left behind where an item was deleted.
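A fuller sketch of this regex approach, with word boundaries (so short stopwords like "a" don't match inside other words) and cleanup of leftover commas and spaces; the stop list here is a small hard-coded stand-in for c:

```python
import re
import pandas as pd

# Hard-coded stand-in for the combined list c (stopwords + extra words).
c = ['best', 'sale', 'new', 'available', 'for']
# \b boundaries keep the alternation from matching inside other words.
pattern = r'\b(?:' + '|'.join(map(re.escape, c)) + r')\b'

df = pd.DataFrame({'a': ['ryzen cpu,ryzen 5 5600x,best,amd ryzen,sale',
                         'cpu,ryzen 9 7800x,available,computer for ryzen,new']})

df['a'] = (df['a'].str.replace(pattern, '', regex=True)
                  .str.replace(r'\s+', ' ', regex=True)    # collapse doubled spaces
                  .str.replace(r',\s*,', ',', regex=True)  # collapse doubled commas
                  .str.strip(', '))                        # trim leading/trailing commas
print(df)
```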

Answer from Timus

You could do the following:

from nltk.corpus import stopwords

STOPS = set(stopwords.words('english') + ['best', 'sale', 'new', 'available'])
def remove_stops(words):
    if (words := [word for word in words if word not in STOPS]):
        return words

df['a'] = (df['a'].str.split(',').explode().str.split()
           .map(remove_stops).dropna()
           .str.join(' ').groupby(level=0).agg(','.join))

Result for your sample dataframe:

                                   a
0  ryzen cpu,ryzen 5 5600x,amd ryzen
1   cpu,ryzen 9 7800x,computer ryzen
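Why the map/dropna pair works: remove_stops falls through and returns None when every word of a phrase was a stopword, so those rows become NaN and dropna discards them before the groupby re-join. A minimal sketch of that behavior with a hard-coded stand-in for STOPS:

```python
# Hard-coded stand-in for STOPS.
STOPS = {'best', 'new', 'for'}

def remove_stops(words):
    # The walrus assignment keeps the filtered list only if it is non-empty;
    # otherwise the function implicitly returns None.
    if (words := [word for word in words if word not in STOPS]):
        return words

print(remove_stops(['computer', 'for', 'ryzen']))  # ['computer', 'ryzen']
print(remove_stops(['best']))                      # None
```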