Not condition in NLTK Regex Parser

Question

Asked by Ram G Athreya on 11 March 2017 at 04:14 (1.5k views)

I need to express a NOT condition in a grammar for NLTK's regex parser. I want to chunk sequences shaped like 'Coffee & TV', but not when a word tagged <IN> comes immediately before the sequence. For example, 'in London and Paris' should not be chunked by the parser.

My code is as follows:

grammar = r'''NP: {(^<IN>)<NNP>+<CC><NN.*>+}'''

I tried the grammar above, but it does not work. Could someone please tell me what I am doing wrong?

Example:

import nltk

def parse_sentence(sentence):
    # Tokenize and POS-tag the sentence, then chunk it with the grammar
    pos_sentence = nltk.pos_tag(nltk.word_tokenize(sentence))
    grammar = r'''NP: {<NNP>+<CC><NN.*>+}'''
    parser = nltk.RegexpParser(grammar)
    result = parser.parse(pos_sentence)
    print(result)

sentence1 = 'Who is the front man of the band that wrote Coffee & TV?'
parse_sentence(sentence1)

sentence2 = 'Who of those resting in Westminster Abbey wrote a book set in London and Paris?'
parse_sentence(sentence2)

Result for sentence 1 is:
(S
  Who/WP
  is/VBZ
  the/DT
  front/JJ
  man/NN
  of/IN
  the/DT
  band/NN
  that/WDT
  wrote/VBD
  (NP Coffee/NNP &/CC TV/NN)
  ?/.)

Result for sentence2 is:
(S
  Who/WP
  of/IN
  those/DT
  resting/VBG
  in/IN
  Westminster/NNP
  Abbey/NNP
  wrote/VBD
  a/DT
  book/NN
  set/VBN
  in/IN
  (NP London/NNP and/CC Paris/NNP)
  ?/.)

As can be seen, both 'Coffee & TV' in sentence1 and 'London and Paris' in sentence2 get chunked, although I do not want 'London and Paris' to be chunked. One way to achieve that is to ignore patterns that are preceded by an <IN> POS tag.

In a nutshell, I need to know how to add NOT (negation) conditions for POS tags in a regex parser's grammar. The standard regex syntax of '^' followed by the tag definition does not seem to work.

**alexis** · Answer 1 · 2017-03-11T09:50:10.663000

What you need is a "negative lookbehind" expression. Unfortunately, it doesn't work in the chunk parser, so I suspect that what you want cannot be specified as a chunking regexp.

Here is an ordinary negative lookbehind: match " Paris", but not if it is preceded by "and".

>>> import re
>>> re.findall(r"(?<!and) Paris", "Search in London and Paris etc.")
[]

Unfortunately, the corresponding lookbehind chunking rule does not work. NLTK's regexp engine rewrites the pattern you pass it in order to interpret the POS tags, and it gets confused by lookbehinds. (I'm guessing the < character in the lookbehind syntax is misinterpreted as a tag delimiter.)

>>> parser = nltk.RegexpParser(r"NP: {(?<!<IN>)<NNP>+<CC><NN.*>+}")
...
ValueError: Illegal chunk pattern: {(?<!<IN>)<NNP>+<CC><NN.*>+}
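
Since the chunk grammar rejects lookbehinds, one possible workaround is to run an ordinary re pattern over a string of tags yourself. The following is only a minimal sketch under that assumption and is not part of NLTK: tag_string, find_spans and PATTERN are hypothetical names, and the input is hand-tagged so no tagger is needed. The second lookbehind, (?<!<NNP>), is there so that a match cannot start in the middle of a run of <NNP> tags; without it, 'in New York and Paris' would still yield 'York and Paris'.

```python
import re

# Must not start right after <IN>, nor in the middle of a run of <NNP> tags.
PATTERN = re.compile(r"(?<!<IN>)(?<!<NNP>)(?:<NNP>)+<CC>(?:<NN[^>]*>)+")

def tag_string(tagged_tokens):
    """Encode the POS tags as one string, e.g. '<VBD><NNP><CC><NN>'."""
    return "".join("<%s>" % tag for _, tag in tagged_tokens)

def find_spans(tagged_tokens):
    """Return the word sequences whose tags match PATTERN."""
    encoded = tag_string(tagged_tokens)
    # Character offset at which each token's tag starts in the encoded string.
    starts, pos = [], 0
    for _, tag in tagged_tokens:
        starts.append(pos)
        pos += len(tag) + 2  # +2 for the surrounding '<' and '>'
    spans = []
    for match in PATTERN.finditer(encoded):
        first = starts.index(match.start())
        last = first + match.group().count("<")  # one '<' per matched tag
        spans.append(" ".join(word for word, _ in tagged_tokens[first:last]))
    return spans

coffee = [("wrote", "VBD"), ("Coffee", "NNP"), ("&", "CC"), ("TV", "NN")]
london = [("set", "VBN"), ("in", "IN"), ("London", "NNP"),
          ("and", "CC"), ("Paris", "NNP")]

print(find_spans(coffee))  # ['Coffee & TV']
print(find_spans(london))  # []
```

This assumes tags never contain '<' or '>', which holds for the Penn Treebank tags shown in the question.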

**Luda** · Answer 2 · 2017-07-05T00:45:29.733000

See section 2.5, "Chinking", in chapter 7 of the NLTK book:

"We can define a chink to be a sequence of tokens that is not included in a chunk"

http://www.nltk.org/book/ch07.html

Use inverted curly braces, }...{, to exclude tags from a chunk:

grammar = r"""
  NP:
    {<.*>+}          # Chunk everything
    }<VBD|IN>+{      # Chink sequences of VBD and IN
"""
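
To see what this grammar does, here is a small sketch that feeds it a hand-tagged sentence (my own example tokens, so no tagger model has to be downloaded) and collects the resulting NP chunks. The variable names tagged and chunks are just for illustration.

```python
import nltk

# Hand-tagged tokens (an assumption for the demo; normally nltk.pos_tag produces these).
tagged = [("the", "DT"), ("little", "JJ"), ("cat", "NN"), ("sat", "VBD"),
          ("on", "IN"), ("the", "DT"), ("mat", "NN")]

grammar = r"""
  NP:
    {<.*>+}          # Chunk everything
    }<VBD|IN>+{      # Chink sequences of VBD and IN
"""

parser = nltk.RegexpParser(grammar)
tree = parser.parse(tagged)

# Collect the words of every NP subtree that survived the chinking step.
chunks = [" ".join(word for word, tag in subtree.leaves())
          for subtree in tree.subtrees(lambda t: t.label() == "NP")]
print(chunks)  # ['the little cat', 'the mat']
```

The whole sentence is first chunked as one NP, and the chink rule then cuts "sat on" out of it, which leaves two separate chunks.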

**Alejandro García** · Answer 3 · 2020-11-30T22:21:40.877000

NLTK's tag chunking documentation is a bit confusing and not easy to find, so I struggled a lot to accomplish something similar.

Check the following links:

NLTK How To Chunk
nltk.chunk.regexp
NLTK book - Chapter 07 (see sections 2.3 to 2.5)

Following @Luda's answer, I found an easy solution:

1. Chunk what you want: <IN>*<other tags>. This creates chunks that start with zero or more <IN>-tagged words.
2. Chink <IN><other tags> from the previous chunk expression. This removes every chunk that starts with an <IN>-tagged word. (Note that the asterisk is gone.)

Example (taking @Ram G Athreya's question):

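import nltk

def parse_sentence(sentence):
    pos_sentence = nltk.pos_tag(nltk.word_tokenize(sentence))

    # First rule chunks (optionally with leading <IN>); second rule chinks
    # any chunk that actually starts with <IN>.
    grammar = r'''
        NP: {<IN>*<NNP>+<CC><NN.*>+}
            }<IN><NNP>+<CC><NN.*>+{
            '''
    parser = nltk.RegexpParser(grammar)
    result = parser.parse(pos_sentence)
    print(result)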

sentence1 = 'Who is the front man of the band that wrote Coffee & TV?'
parse_sentence(sentence1)

sentence2 = 'Who of those resting in Westminster Abbey wrote a book set in London and Paris?'
parse_sentence(sentence2)


 (S
  Who/WP
  is/VBZ
  the/DT
  front/JJ
  man/NN
  of/IN
  the/DT
  band/NN
  that/WDT
  wrote/VBD
  (NP Coffee/NNP &/CC TV/NN)
  ?/.)
(S
  Who/WP
  of/IN
  those/DT
  resting/VBG
  in/IN
  Westminster/NNP
  Abbey/NNP
  wrote/VBD
  a/DT
  book/NN
  set/VBN
  in/IN
  London/NNP
  and/CC
  Paris/NNP
  ?/.)

Now it chunks "Coffee & TV", but it doesn't chunk "London and Paris".
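
The same chunk-then-chink grammar can be checked without a tagger by passing hand-tagged tokens straight to the parser. This is a minimal sketch: np_chunks is a hypothetical helper, and the tags are copied from the outputs above rather than produced by nltk.pos_tag.

```python
import nltk

grammar = r'''
    NP: {<IN>*<NNP>+<CC><NN.*>+}
        }<IN><NNP>+<CC><NN.*>+{
'''
parser = nltk.RegexpParser(grammar)

def np_chunks(tagged_tokens):
    """Return the text of every NP chunk left after chinking."""
    tree = parser.parse(tagged_tokens)
    return [" ".join(word for word, tag in subtree.leaves())
            for subtree in tree.subtrees(lambda t: t.label() == "NP")]

# Tags taken from the parser output shown above.
coffee = [("wrote", "VBD"), ("Coffee", "NNP"), ("&", "CC"), ("TV", "NN")]
london = [("set", "VBN"), ("in", "IN"), ("London", "NNP"),
          ("and", "CC"), ("Paris", "NNP")]

print(np_chunks(coffee))  # ['Coffee & TV']
print(np_chunks(london))  # []
```

In the second case the chunk rule first groups "in London and Paris", and the chink rule then removes that entire span, so no NP remains.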

This trick is also useful for emulating lookbehind assertions. In ordinary regular expressions the syntax is (?<=...), but that conflicts with the < and > symbols used in chunk tag patterns.

So, to emulate a lookbehind, we could try the following:

1. Chunk what you want, with <IN> at the beginning followed by the other tags you want. This creates chunks that start with one or more <IN>-tagged words.
2. Chink the <IN> tag from the previous chunk expression. This removes all <IN>-tagged words from the chunks.

Example 2 - Chunk all words preceded by an <IN> tagged word:

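import nltk

def parse_sentence(sentence):
    pos_sentence = nltk.pos_tag(nltk.word_tokenize(sentence))

    # Chunk one or more <IN> plus the following token, then chink the <IN>
    # words so only the token after them stays chunked.
    grammar = r'''
        CHUNK: {<IN>+<.*>}
            }<IN>{
            '''
    parser = nltk.RegexpParser(grammar)
    result = parser.parse(pos_sentence)
    print(result)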

sentence1 = 'Who is the front man of the band that wrote Coffee & TV?'
parse_sentence(sentence1)

sentence2 = 'Who of those resting in Westminster Abbey wrote a book set in London and Paris?'
parse_sentence(sentence2)

(S
  Who/WP
  is/VBZ
  the/DT
  front/JJ
  man/NN
  of/IN
  (CHUNK the/DT)
  band/NN
  that/WDT
  wrote/VBD
  Coffee/NNP
  &/CC
  TV/NN
  ?/.)
(S
  Who/WP
  of/IN
  (CHUNK those/DT)
  resting/VBG
  in/IN
  (CHUNK Westminster/NNP)
  Abbey/NNP
  wrote/VBD
  a/DT
  book/NN
  set/VBN
  in/IN
  (CHUNK London/NNP)
  and/CC
  Paris/NNP
  ?/.)

As we can see, it chunked "the" from sentence1 and "those", "Westminster" and "London" from sentence2.
