NLTK Regex Chunker Not Processing multiple Grammar Rules in one command

Asked by user3778289 on 10 January 2018 at 11:30 (1.4k views)

I am trying to extract phrases from my corpus. For this I have defined two rules: one is a noun followed by one or more nouns, and the other is an adjective followed by a noun. If the same phrase is matched by both rules, the program should ignore the second match. The problem I am facing is that phrases are extracted by the first rule only, and the second rule is not being applied. Below is the code:

PATTERN = r"""
      NP: {<NN><NN>+}
      {<ADJ><NN>*}

       """
    MIN_FREQ = 1
    MIN_CVAL = -13 # lowest cval -13
    def __init__(self):
        corpus_root = os.path.abspath('../multiwords/test')
        self.corpus = nltk.corpus.reader.TaggedCorpusReader(corpus_root,'.*')
        self.word_count_by_document = None
        self.phrase_frequencies = None

    def calculate_phrase_frequencies(self):
        """
        extract the sentence chunks according to PATTERN and calculate
        the frequency of chunks with pos tags
        """
        # pdb.set_trace()
        chunk_freq_dict = defaultdict(int)
        chunker = nltk.RegexpParser(self.PATTERN)

        for sent in self.corpus.tagged_sents():
            sent = [s for s in sent if s[1] is not None]

            for chk in chunker.parse(sent).subtrees():
                if str(chk).startswith('(NP'):
                    phrase = chk.__unicode__()[4:-1]

                    if '\n' in phrase:
                        phrase = ' '.join(phrase.split())

                    just_phrase = ' '.join([w.rsplit('/', 1)[0] for w in phrase.split(' ')])
                    # print(just_phrase)
                    chunk_freq_dict[just_phrase] += 1

        self.phrase_frequencies = chunk_freq_dict
        # print(self.phrase_frequencies)


Snow bunting On 11 January 2018 at 23:00 BEST ANSWER

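First of all, Python is indentation-dependent, and leading spaces inside a multi-line string become part of the string. Indent the PATTERN assignment consistently with the rest of the class body, and align the patterns (brackets) visually so it is obvious that both rules belong to NP. As far as I know, NLTK strips the whitespace around each grammar line, so the spacing inside the string is mainly a readability issue.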

Moreover, I think you want <ADJ><NN>+ as your second pattern. + means one or more, whereas * means zero or more, so <ADJ><NN>* also chunks an adjective that has no noun after it.
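To make the difference between * and + concrete, here is a minimal sketch that runs both grammars on the toy sentence from this answer's example. The helper name np_phrases and the constants STAR_GRAMMAR and PLUS_GRAMMAR are made up for illustration, and the ADJ/NN tags are assumed to match what your corpus actually uses.

```python
import nltk

# Toy sentence from this answer's example; the tags are illustrative.
sentence = [('the', 'DT'), ('little', 'ADJ'), ('yellow', 'ADJ'),
            ('shepherd', 'NN'), ('dog', 'NN'), ('barked', 'VBD'), ('at', 'IN'),
            ('the', 'DT'), ('silly', 'ADJ'), ('cat', 'NN')]

# The question's grammar: the second rule allows zero nouns.
STAR_GRAMMAR = r"""
NP: {<NN><NN>+}
    {<ADJ><NN>*}
"""

# The suggested grammar: the second rule requires at least one noun.
PLUS_GRAMMAR = r"""
NP: {<NN><NN>+}
    {<ADJ><NN>+}
"""


def np_phrases(grammar, tagged_sentence):
    """Return the words of every NP chunk as strings (hypothetical helper)."""
    tree = nltk.RegexpParser(grammar).parse(tagged_sentence)
    return [' '.join(word for word, tag in subtree.leaves())
            for subtree in tree.subtrees(lambda t: t.label() == 'NP')]


star_phrases = np_phrases(STAR_GRAMMAR, sentence)
plus_phrases = np_phrases(PLUS_GRAMMAR, sentence)
print(star_phrases)  # should also contain lone adjectives such as 'little'
print(plus_phrases)  # should contain only adjective+noun and noun+noun chunks
```

If the * version fills your frequency table with single adjectives while the + version does not, the quantifier was at least part of the problem.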

I hope this solves the issue.

#!/usr/bin/env python
import nltk

PATTERN = r"""
NP: {<NN><NN>+}
    {<ADJ><NN>+}
"""

sentence = [('the', 'DT'), ('little', 'ADJ'), ('yellow', 'ADJ'),
            ('shepherd', 'NN'), ('dog', 'NN'), ('barked', 'VBD'), ('at', 'IN'),
            ('the', 'DT'), ('silly', 'ADJ'), ('cat', 'NN')]

cp = nltk.RegexpParser(PATTERN)
print(cp.parse(sentence))

Result:

(S
  the/DT
  little/ADJ
  yellow/ADJ
  (NP shepherd/NN dog/NN)
  barked/VBD
  at/IN
  the/DT
  (NP silly/ADJ cat/NN))
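
On the requirement that a phrase matched by both rules should be counted only once: the rules are applied in order and chunks cannot overlap, so a token already claimed by the first rule is not chunked again by the second. That is why "yellow" stays outside "shepherd dog" in the result above. The following is a rough sketch of the frequency count built on that behaviour, reading the words from each NP subtree with leaves() instead of slicing the printed tree. The function name count_np_phrases and the toy sentences are assumptions for illustration, standing in for self.corpus.tagged_sents().

```python
from collections import Counter

import nltk

GRAMMAR = r"""
NP: {<NN><NN>+}
    {<ADJ><NN>+}
"""


def count_np_phrases(tagged_sents, grammar=GRAMMAR):
    """Count NP chunks across tagged sentences (hypothetical helper)."""
    chunker = nltk.RegexpParser(grammar)
    counts = Counter()
    for sent in tagged_sents:
        # Drop tokens without a tag, as the question's code does.
        sent = [(word, tag) for word, tag in sent if tag is not None]
        tree = chunker.parse(sent)
        # Keep only NP subtrees; the root 'S' node is filtered out.
        for chunk in tree.subtrees(lambda t: t.label() == 'NP'):
            counts[' '.join(word for word, tag in chunk.leaves())] += 1
    return counts


# Toy data, not the asker's corpus.
toy_sents = [
    [('the', 'DT'), ('silly', 'ADJ'), ('cat', 'NN'), ('slept', 'VBD')],
    [('a', 'DT'), ('shepherd', 'NN'), ('dog', 'NN'), ('saw', 'VBD'),
     ('the', 'DT'), ('silly', 'ADJ'), ('cat', 'NN')],
]
phrase_counts = count_np_phrases(toy_sents)
print(phrase_counts)
```

Using label() and leaves() avoids depending on how the tree happens to be printed, which is what the [4:-1] slice in the question relies on.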

Reference: http://www.nltk.org/book/ch07.html
