I found this answer on getting specific outputs from the model (How to get top n terms with highest tf-idf score - Big sparse matrix), and it was great. I would like to know how to turn the printed results into a DataFrame:
'''
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

corpus = [
    'I would like to check this document',
    'How about one more document',
    'Aim is to capture the key words from the corpus'
]

# TF-IDF: summed score per term across all documents
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(corpus)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
feature_array = vectorizer.get_feature_names_out()
top_n = 3
print('tf_idf scores: \n', sorted(list(zip(feature_array, X.sum(0).getA1())),
                                  key=lambda x: x[1], reverse=True)[:top_n])
# tf_idf scores :
# [('document', 1.4736296010332683), ('check', 0.6227660078332259), ('like', 0.6227660078332259)]

print('idf values: \n', sorted(list(zip(feature_array, vectorizer.idf_)),
                               key=lambda x: x[1], reverse=True)[:top_n])
# idf values:
# [('aim', 1.6931471805599454), ('capture', 1.6931471805599454), ('check', 1.6931471805599454)]

# Raw term counts across all documents
vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(corpus)
feature_array = vectorizer.get_feature_names_out()
print('Frequency: \n', sorted(list(zip(feature_array, X.sum(0).getA1())),
                              key=lambda x: x[1], reverse=True)[:top_n])
'''
Thanks in advance!
The following gives you a DataFrame with the tf_idf, idf and frequencies, sorted by the tf_idf statistic (descending). If you only want the top n words by the tf_idf statistic you can do: