How to extract keywords using TFIDF for each row in python?


I have a column which has text only. I need to extract top keywords from each row using TFIDF.

Example Input:

df['Text']
'I live in India',
'My favourite colour is Red', 
'I Love Programming'

Expected output:

df['Text']                         df['Keywords']
'I live in India'                  'live','India'
'My favourite colour is Red'       'favourite','colour','red'
'I Love Programming'               'love','programming'

How do I get this? I tried the code below:

tfidf = TfidfVectorizer(max_features=300, ngram_range = (2,2))
Y = df['Text'].apply(lambda x: tfidf.fit_transform(x))

I am getting the following error: Iterable over raw text documents expected, string object received.

There are 3 best solutions below

1
On BEST ANSWER

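Try the code below if you want to tokenize your sentences and drop stop words:

import pandas as pd
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# Requires the NLTK tokenizer models and the stop-word corpus to be downloaded once (nltk.download)
df = pd.DataFrame({'Text':['I live in India', 'My favourite colour is Red', 'I Love Programming']})
# Split each sentence into word tokens
df['Keywords'] = df.Text.apply(lambda x: word_tokenize(x))
# A set gives fast membership checks
stops = set(stopwords.words('english'))
# Keep only the tokens that are not English stop words
df['Keywords'] = df['Keywords'].apply(lambda x: [item for item in x if item.lower() not in stops])
# Join the remaining tokens into one comma-separated string per row
df['Keywords'] = df['Keywords'].apply(', '.join)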

print(df)

                         Text                Keywords
0             I live in India             live, India
1  My favourite colour is Red  favourite, colour, Red
2          I Love Programming       Love, Programming
1
On

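TfidfVectorizer's fit_transform function expects an iterable (e.g. a set, a list, or a pandas Series) of sentences / documents to fit the TF-IDF scores on. Passing a single string, as happens inside your apply call, raises the error you see.

So what you should actually do is: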

Y = tfidf.fit_transform(df['Text'])
0
On

As some people have already pointed out, there are several issues with your code and approach. The first is that you should not use TF-IDF for this task (TF-IDF is not meant to be used on small corpora). You'll be better off using RAKE or flashtext's KeywordProcessor.

Another issue with your code is that you are trying to get 'unigrams' from your text, yet you have set ngram_range in your vectorizer to (2,2), meaning it will only find 'bigrams' (phrases consisting of two words).

If you insist on doing this with your chosen approach, you first need to split the sentences in df['Text'] so that there is one per row (you can use part of @ManojK's solution for that), then pass the text from each row as a list:

Y = df['Text'].apply(lambda x: tfidf.fit_transform([x]))

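However, if you want to extract the feature names (which are essentially your keywords), you'll need to write a function that calls get_feature_names() on the vectorizer after each fit, instead of the bare lambda. In scikit-learn 1.0 and later the method is get_feature_names_out(); get_feature_names() was removed in 1.2.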