I need to express a NOT condition in a grammar for NLTK's regex parser. I would like to chunk word sequences with a structure like 'Coffee & TV', but not when a word tagged <IN> comes immediately before the sequence. For example, 'in London and Paris' should not be chunked by the parser.
My code is as follows:
grammar = r'''NP: {(^<IN>)<NNP>+<CC><NN.*>+}'''
I tried the above grammar to solve the problem, but it is not working. Could someone please tell me what I am doing wrong?
Example:
import nltk

def parse_sentence(sentence):
    # Tokenize and POS-tag the sentence, then chunk it with the grammar
    pos_sentence = nltk.pos_tag(nltk.word_tokenize(sentence))
    grammar = r'''NP: {<NNP>+<CC><NN.*>+}'''
    parser = nltk.RegexpParser(grammar)
    result = parser.parse(pos_sentence)
    print(result)

sentence1 = 'Who is the front man of the band that wrote Coffee & TV?'
parse_sentence(sentence1)
sentence2 = 'Who of those resting in Westminster Abbey wrote a book set in London and Paris?'
parse_sentence(sentence2)
Result for sentence 1 is:
(S
  Who/WP
  is/VBZ
  the/DT
  front/JJ
  man/NN
  of/IN
  the/DT
  band/NN
  that/WDT
  wrote/VBD
  (NP Coffee/NNP &/CC TV/NN)
  ?/.)
Result for sentence2 is:
(S
  Who/WP
  of/IN
  those/DT
  resting/VBG
  in/IN
  Westminster/NNP
  Abbey/NNP
  wrote/VBD
  a/DT
  book/NN
  set/VBN
  in/IN
  (NP London/NNP and/CC Paris/NNP)
  ?/.)
As can be seen, in both sentence1 and sentence2 the phrases 'Coffee & TV' and 'London and Paris' get chunked as a group, although I do not wish to chunk 'London and Paris'. One way of doing that is to ignore patterns that are preceded by an <IN> POS tag.
In a nutshell, I need to know how to add NOT (negation) conditions for POS tags in a regex parser's grammar. The standard syntax of '^' followed by the tag definition does not seem to work.
What you need is a "negative lookbehind" expression. Unfortunately, it doesn't work in the chunk parser, so I suspect that what you want cannot be specified as a chunking regexp.
Here is an ordinary negative lookbehind: Match "Paris", but not if preceded by "and ".
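A minimal sketch of that with Python's standard re module (the sample strings are made up for illustration):

```python
import re

# Negative lookbehind: match "Paris" only when it is not preceded by "and ".
pattern = re.compile(r"(?<!and )Paris")

blocked = pattern.search("London and Paris")  # None: "and " comes right before
allowed = pattern.search("London or Paris")   # a match object for "Paris"

print(blocked)
print(allowed.group())
```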
Unfortunately, the corresponding lookbehind chunking rule does not work. NLTK's regexp engine tweaks the regexp you pass it in order to interpret the POS types, and it gets confused by lookbehinds. (I'm guessing the < character in the lookbehind syntax is misinterpreted as a tag delimiter.)
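One possible workaround, sketched here under my own assumptions rather than taken from NLTK: skip the chunk parser for this rule, join the POS tags into a string such as "<WP><IN>...", and run an ordinary regexp with the lookbehind over that string. The function name chunk_spans is hypothetical, and the extra (?<!<NNP>) lookbehind is my addition so that a match cannot start in the middle of a run of <NNP> tags that itself follows an <IN>. The tagged sentences are copied from the output in the question, so the sketch needs only the standard library.

```python
import re

# NNP+ CC NN.*+ over a tag string, but not right after <IN>.
# The second lookbehind keeps a match from starting partway through an
# <NNP> run (e.g. at "York" in "in New York and Paris").
PATTERN = re.compile(r"(?<!<IN>)(?<!<NNP>)(?:<NNP>)+<CC>(?:<NN[^>]*>)+")

def chunk_spans(tagged):
    """Return (start, end) token index pairs for each allowed chunk."""
    # One <...> group per token, so counting '<' converts offsets to indices.
    tag_string = "".join("<%s>" % tag for _, tag in tagged)
    spans = []
    for match in PATTERN.finditer(tag_string):
        start = tag_string.count("<", 0, match.start())
        end = start + match.group().count("<")
        spans.append((start, end))
    return spans

tagged1 = [("Who", "WP"), ("is", "VBZ"), ("the", "DT"), ("front", "JJ"),
           ("man", "NN"), ("of", "IN"), ("the", "DT"), ("band", "NN"),
           ("that", "WDT"), ("wrote", "VBD"), ("Coffee", "NNP"),
           ("&", "CC"), ("TV", "NN"), ("?", ".")]
tagged2 = [("Who", "WP"), ("of", "IN"), ("those", "DT"), ("resting", "VBG"),
           ("in", "IN"), ("Westminster", "NNP"), ("Abbey", "NNP"),
           ("wrote", "VBD"), ("a", "DT"), ("book", "NN"), ("set", "VBN"),
           ("in", "IN"), ("London", "NNP"), ("and", "CC"),
           ("Paris", "NNP"), ("?", ".")]

for tagged in (tagged1, tagged2):
    for start, end in chunk_spans(tagged):
        print(" ".join(word for word, _ in tagged[start:end]))
```

With these inputs only 'Coffee & TV' is reported; 'London and Paris' is skipped because its first tag directly follows an <IN>. The spans could then be used to build nltk.Tree nodes by hand if a chunk tree is needed.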