I need to create a NOT condition as part of my grammar in NLTK's regex parser. I would like to chunk word sequences with a structure like 'Coffee & Tea', but a sequence should not be chunked if a word tagged <IN> comes immediately before it. For example, 'in London and Paris' should not be chunked by the parser.
My code is as follows:
    grammar = r'''NP: {(^<IN>)<NNP>+<CC><NN.*>+}'''
I tried the above grammar to solve the problem, but it is not working. Could someone please tell me what I am doing wrong?
Example:
    import nltk

    def parse_sentence(sentence):
        # Tokenize and POS-tag the sentence, then chunk it with the grammar
        pos_sentence = nltk.pos_tag(nltk.word_tokenize(sentence))
        grammar = r'''NP: {<NNP>+<CC><NN.*>+}'''
        parser = nltk.RegexpParser(grammar)
        result = parser.parse(pos_sentence)
        print(result)

    sentence1 = 'Who is the front man of the band that wrote Coffee & TV?'
    parse_sentence(sentence1)
    sentence2 = 'Who of those resting in Westminster Abbey wrote a book set in London and Paris?'
    parse_sentence(sentence2)
Result for sentence1 is:
    (S
      Who/WP
      is/VBZ
      the/DT
      front/JJ
      man/NN
      of/IN
      the/DT
      band/NN
      that/WDT
      wrote/VBD
      (NP Coffee/NNP &/CC TV/NN)
      ?/.)
Result for sentence2 is:
    (S
      Who/WP
      of/IN
      those/DT
      resting/VBG
      in/IN
      Westminster/NNP
      Abbey/NNP
      wrote/VBD
      a/DT
      book/NN
      set/VBN
      in/IN
      (NP London/NNP and/CC Paris/NNP)
      ?/.)
As can be seen, in both sentence1 and sentence2 the phrases 'Coffee & TV' and 'London and Paris' get chunked as a group, although I do not wish to chunk 'London and Paris'. One way of doing that is to ignore those patterns which are preceded by an <IN> POS tag.
In a nutshell, I need to know how to add NOT (negation) conditions for POS tags in a regex parser's grammar. The standard syntax of using '^' followed by the tag definition does not seem to work.
NLTK's tag chunking documentation is a bit confusing and not easy to find, so I struggled a lot to accomplish something similar.
Check the following links:
Following @Luda's answer, I found an easy solution:
Example (taking @Ram G Athreya's question):
Now it chunks "Coffee & TV" but it doesn't chunk "London and Paris".
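A minimal sketch of a grammar that behaves this way, assuming the trick is to chunk first and then chink (a `}...{` rule removes tokens from an existing chunk). The first rule also accepts an optional leading <IN>, and the second rule chinks every chunk that does start with <IN>. The sentences are pre-tagged with the tags shown in the question so that no tagger model is needed, and the helper name `np_chunks` is made up for illustration.

```python
import nltk

# Tokens pre-tagged with the tags shown in the question (assumption: the
# tagger output is the same as above), so nltk.pos_tag is not needed here.
sentence1 = [("Who", "WP"), ("is", "VBZ"), ("the", "DT"), ("front", "JJ"),
             ("man", "NN"), ("of", "IN"), ("the", "DT"), ("band", "NN"),
             ("that", "WDT"), ("wrote", "VBD"), ("Coffee", "NNP"),
             ("&", "CC"), ("TV", "NN"), ("?", ".")]
sentence2 = [("Who", "WP"), ("of", "IN"), ("those", "DT"), ("resting", "VBG"),
             ("in", "IN"), ("Westminster", "NNP"), ("Abbey", "NNP"),
             ("wrote", "VBD"), ("a", "DT"), ("book", "NN"), ("set", "VBN"),
             ("in", "IN"), ("London", "NNP"), ("and", "CC"),
             ("Paris", "NNP"), ("?", ".")]

# Rule 1 (chunk): the wanted pattern, optionally preceded by <IN>.
# Rule 2 (chink): remove the whole sequence again if it starts with <IN>.
grammar = r'''
NP: {<IN>?<NNP>+<CC><NN.*>+}
    }<IN><NNP>+<CC><NN.*>+{
'''
parser = nltk.RegexpParser(grammar)


def np_chunks(tagged):
    """Return the text of every NP chunk found in a tagged sentence."""
    tree = parser.parse(tagged)
    return [" ".join(word for word, tag in subtree.leaves())
            for subtree in tree.subtrees() if subtree.label() == "NP"]


print(np_chunks(sentence1))
print(np_chunks(sentence2))
```

With these tags, the first call should list "Coffee & TV" and the second should come back empty, because the chunk that starts with "in" is chinked away entirely.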
Moreover, this is useful for building lookbehind assertions. In a regular expression a lookbehind is normally written with ?<= , but that syntax conflicts with the < and > symbols that delimit tags in the chunk grammar's tag patterns.
So, in order to build a lookbehind, we could try the following:
Example 2 - Chunk all words preceded by an <IN> tagged word:
As we can see, it chunked "the" from sentence1, and "those", "Westminster" and "London" from sentence2.
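One way to get that result, sketched under the same chunk-then-chink assumption: chunk each <IN> together with the word that follows it, then chink the <IN> so that only the following word stays inside the chunk. The label CHUNK and the helper name `chunks_after_in` are made up for illustration, and the sentences are again pre-tagged with the tags from the question.

```python
import nltk

# Pre-tagged tokens, copied from the tagger output shown in the question.
sentence1 = [("Who", "WP"), ("is", "VBZ"), ("the", "DT"), ("front", "JJ"),
             ("man", "NN"), ("of", "IN"), ("the", "DT"), ("band", "NN"),
             ("that", "WDT"), ("wrote", "VBD"), ("Coffee", "NNP"),
             ("&", "CC"), ("TV", "NN"), ("?", ".")]
sentence2 = [("Who", "WP"), ("of", "IN"), ("those", "DT"), ("resting", "VBG"),
             ("in", "IN"), ("Westminster", "NNP"), ("Abbey", "NNP"),
             ("wrote", "VBD"), ("a", "DT"), ("book", "NN"), ("set", "VBN"),
             ("in", "IN"), ("London", "NNP"), ("and", "CC"),
             ("Paris", "NNP"), ("?", ".")]

# Rule 1 (chunk): one or more <IN> tags plus the single token after them.
# Rule 2 (chink): strip the <IN> tags, leaving only the token that followed.
grammar = r'''
CHUNK: {<IN>+<.*>}
       }<IN>{
'''
parser = nltk.RegexpParser(grammar)


def chunks_after_in(tagged):
    """Return the words that directly follow an <IN>-tagged word."""
    tree = parser.parse(tagged)
    return [" ".join(word for word, tag in subtree.leaves())
            for subtree in tree.subtrees() if subtree.label() == "CHUNK"]


print(chunks_after_in(sentence1))
print(chunks_after_in(sentence2))
```

The chink rule plays the role of the lookbehind here: the <IN> has to be present for the chunk rule to match, but it does not end up in the result.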