How do you get the frequency of the terms generated by tfidf.get_feature_names_out()?

After fitting with tfidf, I'm looking at the features that were generated:

from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())

But I want to get the frequency of each term as well.

rickhg12hs On

One way to "count the number of sentences a particular word appears in" is to use sklearn.feature_extraction.text.CountVectorizer.

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

from sklearn.feature_extraction.text import CountVectorizer

# since we're counting sentences and not words, use binary=True
cv = CountVectorizer(binary=True)

X = cv.fit_transform(corpus)

print(cv.vocabulary_)  # all the words in the corpus with their column index
# {'this': 8, 'is': 3, 'the': 6, 'first': 2, 'document': 1, 'second': 5, 'and': 0, 'third': 7, 'one': 4}

# show presence (1) or absence (0), not counts, of each vocabulary word in each sentence (one row per sentence)
print(X.toarray())
# [[0 1 1 1 0 0 1 0 1]
#  [0 1 0 1 0 1 1 0 1]
#  [1 0 0 1 1 0 1 1 1]
#  [0 1 1 1 0 0 1 0 1]]

# So, for example the word "this" is at column index 8 in the matrix above

# How many sentences in the corpus have the word "this"?
print(X[:, cv.vocabulary_["this"]].sum())
# 4

# How many sentences in the corpus have the word "document"?
print(X[:, cv.vocabulary_["document"]].sum())
# 3