I would like to create a list of semantic entities (nouns, verbs, punctuation, etc.) using POS tagging. I am currently running the following code:
import spacy
import pandas as pd

nlp = spacy.load('en_core_web_sm', disable=['ner', 'textcat'])

def fun(text):
    doc = nlp(text)
    pos = ""
    for token in doc:
        pos += token.pos_ + " "
    return pos

df['S'] = df.Text.apply(fun)
to create the structure of sentences. So, for example, if I have the column Text (see below), this code generates the column S, which contains all the information about the semantic structure:
Text S
0 “I will meet quite a few people, it’s well... PUNCT NOUN VERB VERB DET DET ADJ NOUN PUNCT PR...
1 Says “Cristiano Ronaldo’s family still owns”... VERB PUNCT PROPN PROPN PART NOUN ADV VERB PUNC...
2 Joe Biden plagiarized Donald Trump in his... PROPN PROPN VERB PROPN PROPN ADP DET PROP...
I am wondering if I can create a vocabulary of nouns, verbs, det, adj, ... by editing the code above or if I need to consider a different approach. To take all the entities (nouns, verbs, ...) in the dataframe, I would look at selecting only unique values, in order to create a list for each of them.
Example of output (it can be also in lists rather than in a dataframe)
PUNCT   NOUN     VERB          ....
“       I        will
,       people   meet
”       family   says
                 owns
                 plagiarized
You can try:
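A minimal sketch of one way to do this, assuming the `nlp` pipeline and the `Text` column from the question: every token is appended to a list keyed by its `token.pos_` tag. The helper name `collect_by_pos` is hypothetical, and `nlp` is passed in as an argument so the function does not depend on a particular model.

```python
from collections import defaultdict


def collect_by_pos(texts, nlp):
    """Group token texts by their coarse POS tag.

    texts: iterable of strings (for example df.Text)
    nlp:   a callable that returns tokens exposing .text and .pos_,
           such as a loaded spaCy pipeline
    """
    vocab = defaultdict(list)
    for text in texts:
        for token in nlp(text):
            vocab[token.pos_].append(token.text)
    return dict(vocab)
```

With the objects from the question this would be called as `vocab = collect_by_pos(df.Text, nlp)`; `vocab["VERB"]` then holds every verb token in the column, duplicates included.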
Note that you don't need the pandas dependency at all:
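Without pandas, the same grouping could be sketched over a plain list of strings. This version assumes a loaded spaCy pipeline is passed in and uses its `nlp.pipe` method to process the texts in batches; the function name `pos_vocabulary` is hypothetical.

```python
from collections import defaultdict


def pos_vocabulary(texts, nlp):
    """Map each POS tag to the list of tokens that received it.

    texts: a plain list (or any iterable) of strings, no DataFrame needed
    nlp:   a loaded spaCy pipeline, e.g. spacy.load("en_core_web_sm")
    """
    vocab = defaultdict(list)
    # nlp.pipe streams the texts through the pipeline in batches
    for doc in nlp.pipe(texts):
        for token in doc:
            vocab[token.pos_].append(token.text)
    return dict(vocab)
```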
These will collect all the tokens under their POS.
If you only need a list of unique tokens:
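One way to sketch this, assuming you already have a dict that maps each POS tag to a list of tokens (the name `unique_by_pos` is hypothetical): `dict.fromkeys` drops repeated tokens while keeping the order in which they first appeared.

```python
def unique_by_pos(vocab):
    """Return a copy of vocab with duplicate tokens removed per POS tag.

    vocab: dict mapping a POS tag to a list of token strings
    """
    # dict.fromkeys keeps the first occurrence of each token, in order
    return {pos: list(dict.fromkeys(tokens)) for pos, tokens in vocab.items()}
```

Tokens are compared as exact strings, so `Says` and `says` stay separate entries unless you lowercase them first.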