How to get consolidated words post tagging?


I am working on a dataset that requires extracting all the words that are adjectives, verbs, and adverbs from each sentence of a data frame column.

This is a sample I was working on to figure out how I could get the desired output.

import nltk

list1 = ['good', 'excellent', 'was', 'not']
# POS tags for adjectives, adverbs and a few verb forms
tags = {"JJ", "JJS", "RB", "RBR", "RBS", "VB", "VBN", "VBP"}

for i in list1:
    # tag each word on its own (so the tagger has no sentence context)
    x = nltk.pos_tag([i])
    if x[0][1] in tags:
        print(x)

The output it gives me is:

[('good','JJ')]
[('not','RB')] 

The output I need to get is something like this:

good not
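
Roughly, I imagine the consolidation step looking like the sketch below (keeping the same tag check as in my sample above), i.e. collecting the matching words and joining them with spaces:

import nltk

words = ['good', 'excellent', 'was', 'not']
tags = {"JJ", "JJS", "RB", "RBR", "RBS", "VB", "VBN", "VBP"}

matched = []
for w in words:
    token, tag = nltk.pos_tag([w])[0]   # tag each word on its own, as above
    if tag in tags:
        matched.append(token)

print(' '.join(matched))   # prints something like: good not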

Can anyone please help?


There is 1 solution below.


You'd have to be a little more specific about what you really want to extract, but here's an attempt.

It seems you're trying to extract verb phrases with adjectives/adverbs; if so, you can try:

from nltk import pos_tag, word_tokenize
from nltk import ngrams

text = "this is not good."
tagged_text = pos_tag(word_tokenize(text))

# adjective, adverb and verb tags (VBZ included so that "is" matches)
focus_tags = set(['JJ', 'JJS', 'RB', 'RBR', 'RBS', 'VB', 'VBZ', 'VBN', 'VBP'])

# look at every pair of adjacent tokens and keep the pairs
# where both tokens carry one of the focus tags
for (token1, tag1), (token2, tag2) in ngrams(tagged_text, 2):
    if tag1 in focus_tags and tag2 in focus_tags:
        print(token1 + ' ' + token2)

But that outputs: is not and not good!!

Hmmm, in that case, do you want to extract not good or is not good?

If it's the is not good trigram, then try:

for (token1, tag1), (token2, tag2), (token3, tag3) in ngrams(tagged_text, 3):
    if tag1 in focus_tags and tag2 in focus_tags and tag3 in focus_tags:
        print(token1 + ' ' + token2 + ' ' + token3)

What if I just want not good?

Maybe try removing the verbs? E.g.

from nltk import pos_tag, word_tokenize
from nltk import ngrams

text = "this is not good."
tagged_text = pos_tag(word_tokenize(text))

# only adjective and adverb tags this time (no verbs)
focus_tags = set(['JJ', 'JJS', 'RB', 'RBR', 'RBS'])

for (token1, tag1), (token2, tag2) in ngrams(tagged_text, 2):
    if tag1 in focus_tags and tag2 in focus_tags:
        print(token1 + ' ' + token2)
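
And since the original question mentions a data frame column, here's a rough sketch of how you could apply a unigram version of this per row and join the matching words into one string. The column names reviews and keywords are just placeholders, and you may want to tweak the tag set (e.g. add VBZ/VBD) depending on which verb forms you care about:

import pandas as pd
from nltk import pos_tag, word_tokenize

focus_tags = set(['JJ', 'JJS', 'RB', 'RBR', 'RBS', 'VB', 'VBN', 'VBP'])

def consolidate(sentence):
    # tag the whole sentence at once, then keep only the tokens
    # whose tag is in focus_tags and join them with spaces
    return ' '.join(token for token, tag in pos_tag(word_tokenize(sentence))
                    if tag in focus_tags)

df = pd.DataFrame({'reviews': ["this is not good.", "the food was excellent"]})
df['keywords'] = df['reviews'].apply(consolidate)
print(df)

Tagging the whole sentence at once also gives the tagger context, which usually produces better tags than tagging each word in isolation.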