I have a PoS-tagged column in which each word is labelled as a noun, adjective or verb. My current code extracts all the nouns and stores them in a new column of the dataframe:
import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize, sent_tokenize

nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

data = {'comments': ['Daniel is really cool', 'Daniel is the most amazing host!', 'Daniel is highly recommended']}
df = pd.DataFrame(data)
df['tokenized_text'] = df['comments'].apply(word_tokenize)
df['tagged'] = df['tokenized_text'].apply(pos_tag)
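def get_vocab(tagged_tokens):
    # tagged_tokens is one row of df['tagged']: a list of (word, pos) tuples
    nouns = []
    for (word, pos) in tagged_tokens:
        # Penn Treebank noun tags all start with "NN" (NN, NNS, NNP, NNPS)
        if pos.startswith("NN"):
            nouns.append(word)
    return nouns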
df["nouns"] = df["tagged"].apply(get_vocab)
However, I'd like to store the nouns in a single list rather than in the dataframe, and keep only the 100 most frequent ones. How would I go about this? My desired list would look like this:
['Daniel', 'is']
As this is a small example, the list would only have two entries, but my actual dataset has thousands of repeated nouns, so I'd like to keep only the 100 most common ones.
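One possible way to do this, as a minimal sketch: flatten the per-row noun lists into one stream and count them with `collections.Counter`, then take `most_common(100)`. The helper name `top_nouns` and the `sample_nouns` data are hypothetical; `sample_nouns` only stands in for `df["nouns"]` so the snippet runs without NLTK, and its contents are assumed rather than actual tagger output.

```python
from collections import Counter
from itertools import chain


def top_nouns(noun_lists, n=100):
    """Return the n most frequent words across an iterable of noun lists."""
    # Flatten the per-row lists into one stream of words and count them.
    counts = Counter(chain.from_iterable(noun_lists))
    # most_common(n) returns (word, count) pairs sorted by descending count;
    # keep only the words.
    return [word for word, _ in counts.most_common(n)]


# Hypothetical stand-in for df["nouns"] (assumed values, not real tagger output).
sample_nouns = [["Daniel"], ["Daniel", "host"], ["Daniel"]]

top = top_nouns(sample_nouns, n=100)
print(top)
```

With the dataframe from the question, the call would presumably be `top_nouns(df["nouns"])`, since a pandas Series of lists can be iterated row by row. If fewer than 100 distinct nouns exist, `most_common` simply returns all of them.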