Removing specific words from pandas dataframe


Sample table:

a
ryzen cpu,ryzen 5 5600x,best,amd ryzen,sale
cpu,ryzen 9 7800x,available,computer for ryzen,new
import pandas as pd

df = pd.DataFrame({'a': ['ryzen cpu,ryzen 5 5600x,best,amd ryzen,sale',
                         'cpu,ryzen 9 7800x,available,computer for ryzen,new']})
from nltk.corpus import stopwords, wordnet
stop = stopwords.words('english')
b = ['best', 'sale', 'new', 'available']
c = stop + b
x = []
for i in df['a'].str.split(','):
    for j in i:
        if j not in b:
            x.append(j)
print(x)

I am trying to remove the stopwords and the other specific words mentioned above. The other words are getting removed, but the stopwords are not.

This is the output I am getting:

['ryzen cpu', 'ryzen 5 5600x', 'amd ryzen', 'cpu', 'ryzen 9 7800x', 'computer for ryzen']

I am also not able to get the result back into table format. I have tried the following list comprehension, but it is not working:

df['a'] = df['a'].apply(lambda x: ''.join([j for i in x.split(' , ') for j in i if j not in c]))
df['a']

The output it gives seems completely off (some of the letters are gone entirely: "ryzen" has become "rzen", "sale" has become "le", etc.):

a
rzen cpu,rzen 5 5600x,be, rzen,le
cpu,rzen 9 7800x,vlble,cpuer fr rzen,new

Can anyone please help me understand what exactly I am doing wrong, and how to proceed further with this?

The expected output looks something like this:

a
ryzen cpu,ryzen 5 5600x,amd ryzen
cpu,ryzen 9 7800x,computer ryzen
There are 3 answers below.

Answer from Maria K

You're not really using the stop words, only your additional words. Look at this line:

if j not in b:

Change b to c here.

It really, really helps debugging if you name your variables so that the names reflect what they contain.

As for the loops: this code is very hard to read, so let's break it down step by step.

df['a'] = df['a'].apply(lambda x: ''.join([j for i in x.split(' , ') for j in i if j not in c]))
  • i is a chunk of x (from the part for i in x.split(' , '))
  • j is a single character of i (for j in i)
  • each character j is then filtered (if j not in c)
  • the surviving characters are joined together
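To see why that mangles the words, here is a minimal sketch of the character-level filtering. NLTK's English stopword list contains single-letter entries such as 'a', 'i', 's' and 'y'; a small hard-coded stand-in is used here so the example is self-contained:

```python
# Hard-coded stand-in for the single-letter entries in NLTK's
# English stopword list ('a', 'i', 's', 'y', ...).
stop_letters = {'a', 'i', 's', 'y', 'd', 'm', 'o', 't'}

# Iterating over a string yields characters, so every letter is
# checked against the stop list individually:
filtered = ''.join(ch for ch in 'ryzen' if ch not in stop_letters)
print(filtered)  # rzen  ('y' is a stopword, so it is dropped)
```

The same effect turns "sale" into "le" ('s' and 'a' are both stopwords), which matches the garbled output in the question.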

That's how you end up with random characters. I think the right thing to do is to get rid of the lambda in apply, because there's too much logic in it and it's easy to get confused. It is better to write a separate function and apply it.

(By the way, don't split by " , ", split by ","!)

def filter_stop_words(values, stop_words=c):
    words = values.split(",")
    # Re-join with "," so the phrases stay separated.
    return ",".join(filter(lambda x: x not in stop_words, words))

df['a'].apply(filter_stop_words)
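Note that this filters whole comma-separated phrases, so a stopword embedded inside a phrase (the "for" in "computer for ryzen") survives. A sketch that also splits each phrase into words, using a small hard-coded stand-in for c:

```python
# Hard-coded stand-in for the combined list c (NLTK stopwords + extra words).
STOP_WORDS = {'best', 'sale', 'new', 'available', 'for'}

def filter_stop_words(value, stop_words=STOP_WORDS):
    kept = []
    for phrase in value.split(','):
        # Filter stopwords inside each phrase, not just whole-phrase matches.
        words = [w for w in phrase.split() if w not in stop_words]
        if words:
            kept.append(' '.join(words))
    return ','.join(kept)

print(filter_stop_words('cpu,ryzen 9 7800x,available,computer for ryzen,new'))
# cpu,ryzen 9 7800x,computer ryzen
```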
Answer from user19077881

You can use:

df['a'] = df['a'].str.replace(r'\b(?:' + '|'.join(c) + r')\b', '', regex=True).replace(',,', ',', regex=True)

The last replace removes the double comma (,,) left behind where an item was deleted.
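A fuller sketch of this regex approach, with word boundaries (so short stopwords like "a" don't match inside other words) and cleanup of leftover commas and spaces; the stop list here is a small hard-coded stand-in for c:

```python
import re
import pandas as pd

# Hard-coded stand-in for the combined list c (stopwords + extra words).
c = ['best', 'sale', 'new', 'available', 'for']
# \b boundaries keep the alternation from matching inside other words.
pattern = r'\b(?:' + '|'.join(map(re.escape, c)) + r')\b'

df = pd.DataFrame({'a': ['ryzen cpu,ryzen 5 5600x,best,amd ryzen,sale',
                         'cpu,ryzen 9 7800x,available,computer for ryzen,new']})

df['a'] = (df['a'].str.replace(pattern, '', regex=True)
                  .str.replace(r'\s+', ' ', regex=True)    # collapse doubled spaces
                  .str.replace(r',\s*,', ',', regex=True)  # collapse doubled commas
                  .str.strip(', '))                        # trim leading/trailing commas
print(df)
```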

Answer from Timus

You could do the following:

from nltk.corpus import stopwords

STOPS = set(stopwords.words('english') + ['best', 'sale', 'new', 'available'])
def remove_stops(words):
    if (words := [word for word in words if word not in STOPS]):
        return words

df['a'] = (df['a'].str.split(',').explode().str.split()
           .map(remove_stops).dropna()
           .str.join(' ').groupby(level=0).agg(','.join))

Result for your sample dataframe:

                                   a
0  ryzen cpu,ryzen 5 5600x,amd ryzen
1   cpu,ryzen 9 7800x,computer ryzen
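Why the map/dropna pair works: remove_stops falls through and returns None when every word of a phrase was a stopword, so those rows become NaN and dropna discards them before the groupby re-join. A minimal sketch of that behavior with a hard-coded stand-in for STOPS:

```python
# Hard-coded stand-in for STOPS.
STOPS = {'best', 'new', 'for'}

def remove_stops(words):
    # The walrus assignment keeps the filtered list only if it is non-empty;
    # otherwise the function implicitly returns None.
    if (words := [word for word in words if word not in STOPS]):
        return words

print(remove_stops(['computer', 'for', 'ryzen']))  # ['computer', 'ryzen']
print(remove_stops(['best']))                      # None
```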