Finding the X most frequent nouns in a part-of-speech (PoS) column in a dataframe


I have a PoS column that has labelled words as nouns, adjectives or verbs. My current code extracts all the noun words and stores them in a new column of the dataframe:

import pandas as pd
import nltk
from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

data = {'comments': ['Daniel is really cool',
                     'Daniel is the most amazing host!',
                     'Daniel is highly recommended']}
df = pd.DataFrame(data)

df['tokenized_text'] = df['comments'].apply(word_tokenize)
df['tagged'] = df['tokenized_text'].apply(pos_tag)

def get_vocab(tagged_tokens):
    # Collect every word whose PoS tag starts with "NN" (noun tags)
    nouns = []
    for word, pos in tagged_tokens:
        if pos.startswith("NN"):
            nouns.append(word)
    return nouns

df["nouns"] = df["tagged"].apply(get_vocab)
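For illustration, here is the NN-prefix filter applied to a hand-written tagged sentence (the tags below are written out by hand to match the Penn Treebank tagset that pos_tag uses, so no NLTK download is needed):

```python
# Hypothetical tagged output for "Daniel is the most amazing host!"
tagged = [('Daniel', 'NNP'), ('is', 'VBZ'), ('the', 'DT'),
          ('most', 'RBS'), ('amazing', 'JJ'), ('host', 'NN'), ('!', '.')]

# Keep only words whose tag starts with "NN" (NN, NNS, NNP, NNPS)
nouns = [word for word, pos in tagged if pos.startswith('NN')]
# nouns → ['Daniel', 'host']
```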

However, I'd like to store all the noun words in a single list instead of in the dataframe, and only include the 100 most frequent nouns. How would I go about this? My desired list would look like this:

['Daniel', 'is']

As this is a small example I'd only have those two, but my actual dataset contains thousands of repeated nouns, so I'd like to store only the 100 most common ones.
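One possible approach (a sketch, not necessarily the intended solution): flatten the per-row noun lists into one list, then use collections.Counter.most_common to keep the top 100. The nouns_per_row value below is a hand-written stand-in for the df["nouns"] column built above:

```python
from collections import Counter

# Hypothetical per-row noun lists, i.e. the values of df["nouns"] above
nouns_per_row = [['Daniel'], ['Daniel', 'host'], ['Daniel']]

# Flatten all rows into one list of nouns
all_nouns = [word for row in nouns_per_row for word in row]

# most_common(100) returns up to 100 (word, count) pairs, most frequent first;
# keep just the words
top_nouns = [word for word, _ in Counter(all_nouns).most_common(100)]
# top_nouns → ['Daniel', 'host']
```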
