I have written the following regex pattern to tag certain phrases:
pattern = """
P2: {<JJ>+ <RB>? <JJ>* <NN>+ <VB>* <JJ>*}
P1: {<JJ>? <NN>+ <CC>? <NN>* <VB>? <RB>* <JJ>+}
P3: {<NP1><IN><NP2>}
P4: {<NP2><IN><NP1>}
"""
This pattern would correctly tag a phrase such as:
a = 'The pizza was good but pasta was bad'
and give the desired output with 2 phrases:
- pizza was good
- pasta was bad
However, if my sentence is something like:
a = 'The pizza was awesome and brilliant'
it matches only the phrase:
'pizza was awesome'
instead of the desired:
'pizza was awesome and brilliant'
How do I incorporate the regex pattern for my second example as well?
Firstly, let's take a look at the POS tags that NLTK gives:
(Note: The above are the outputs from NLTK v3.1 pos_tag; older versions might differ.)
What you want to capture is essentially:
So let's catch them with these patterns:
So that's "cheating" by hardcoding!!!
Let's go back to the POS patterns:
They can be simplified to:
So you can use the optional operators in the regex, e.g.:
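As a minimal sketch of the optional-operator idea, the grammar below makes a trailing conjunction-plus-adjective group optional. The tags are written by hand and are an assumption about what a Penn Treebank style tagger would return, and the chunk label PHRASE is a made-up name, so treat this as an illustration rather than the exact pattern from the answer:

```python
from nltk import RegexpParser

# Hand-written tags (an assumption, not real tagger output).
awesome = [("The", "DT"), ("pizza", "NN"), ("was", "VBD"),
           ("awesome", "JJ"), ("and", "CC"), ("brilliant", "JJ")]
good_bad = [("The", "DT"), ("pizza", "NN"), ("was", "VBD"), ("good", "JJ"),
            ("but", "CC"), ("pasta", "NN"), ("was", "VBD"), ("bad", "JJ")]

# PHRASE is a hypothetical label; (<CC><JJ>)? makes "and brilliant" optional.
parser = RegexpParser("PHRASE: {<NN><VBD><JJ>(<CC><JJ>)?}")

def phrases(tagged):
    """Join the words of every PHRASE chunk found in the tagged tokens."""
    tree = parser.parse(tagged)
    chunks = tree.subtrees(lambda t: t.label() == "PHRASE")
    return [" ".join(word for word, _ in chunk.leaves()) for chunk in chunks]

print(phrases(awesome))   # ['pizza was awesome and brilliant']
print(phrases(good_bad))  # ['pizza was good', 'pasta was bad']
```

Because the optional group only matches a conjunction that is directly followed by an adjective, the "but" in the first example still splits the sentence into two phrases.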
Most probably you're using the old tagger, which is why your patterns are different, but I guess you can see how to capture the phrases you need using the example above.
The steps are:
- tag the sentence with pos_tag
- chunk the tagged tokens with RegexpParser
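Put together, the two steps might look like the sketch below. The function name extract_phrases, the GRAMMAR constant, the PHRASE label and the optional tagger argument are all assumptions made for illustration. The default path calls word_tokenize and pos_tag, which need NLTK's tokenizer and tagger data to be installed, and the tags you get back depend on your NLTK version:

```python
from nltk import RegexpParser, pos_tag, word_tokenize

# Hypothetical grammar: noun(s), a verb, an adjective, then any number of
# "conjunction + adjective" continuations.
GRAMMAR = "PHRASE: {<NN.*>+<VB.*><JJ>(<CC><JJ>)*}"

def extract_phrases(sentence, tagger=None):
    """Return the text of every PHRASE chunk in the sentence."""
    if tagger is None:
        # Step 1: tokenize and POS-tag (requires NLTK data to be downloaded).
        tagged = pos_tag(word_tokenize(sentence))
    else:
        # A caller-supplied tagger, e.g. an older model or a test stub.
        tagged = tagger(sentence)
    # Step 2: chunk the tagged tokens with the grammar.
    tree = RegexpParser(GRAMMAR).parse(tagged)
    chunks = tree.subtrees(lambda t: t.label() == "PHRASE")
    return [" ".join(word for word, _ in chunk.leaves()) for chunk in chunks]
```

If your tagger produces different tags for these sentences, print the output of pos_tag first and adjust the grammar to match what you actually see.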