I am trying to extract some information from text using NLTK chunking.
Here is my input:
The stocks show 67% rise, last year it was 12% fall
I want to capture
67% rise
and 12% fall
POS tagging the above sentence gives
('The', 'DT'), ('stocks', 'NNS'), ('show', 'VBP'), ('67', 'CD'), ('%', 'NN'), ('rise', 'NN'), (',', ','), ('last', 'JJ'), ('year', 'NN'), ('it', 'PRP'), ('was', 'VBD'), ('12', 'CD'), ('%', 'NN'), ('fall', 'NN')
Now, I came up with a simple rule
Stat: {<CD><NN>(<NN>+|<VBN>|<JJ>)?}
which works well and captures
('67', 'CD'), ('%', 'NN'), ('rise', 'NN')
('12', 'CD'), ('%', 'NN'), ('fall', 'NN')
but in my data set, I have phrases like
5 million dollars
which is
('5', 'CD'), ('man', 'NN'), ('stock', 'NN')
and is also incorrectly captured. So I thought of including the % sign in my rule
Stat: {<CD><%>(<NN>+|<VBN>|<JJ>)?}
but this rule no longer matches anything. How do I escape or include % in my chunk rule?
Update
What I do not understand is that I can match other special characters. For example, if I have a rule such as
XYZ:{<:>}
this matches every : in the input. So all I am trying to do is
XYZ:{<%>}
and this does not work. I have tried to escape the % with
XYZ:{<\%>}
but this does not work either. I also tried \\ to no avail. I do not want to modify the input string, because once I have a match I want to find the indices of the matched strings. Modifying the input would throw off those indices unless I apply a reverse transformation first.
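Chunk rules are matched against the POS tags, not against the tokens themselves. <:> works because the tagger gives a colon the tag ":", but in your tagged output % is tagged NN, so there is no "%" tag for <%> to match, and escaping it changes nothing. Use a pattern like <CD><NN>+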
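If the rule has to require the percent sign specifically, one option is to retag the % tokens after POS tagging and leave the words alone, so token indices stay valid. The following is only a minimal sketch under that assumption: the tag name PCT and the helpers retag_percent and chunk_spans are made up for illustration and are not part of NLTK, and the tagged tokens are copied from the question so no tagger is needed.

```python
from nltk import RegexpParser
from nltk.tree import Tree

# Tagged tokens copied from the question (no tagger or corpus download needed).
tagged = [
    ('The', 'DT'), ('stocks', 'NNS'), ('show', 'VBP'), ('67', 'CD'),
    ('%', 'NN'), ('rise', 'NN'), (',', ','), ('last', 'JJ'), ('year', 'NN'),
    ('it', 'PRP'), ('was', 'VBD'), ('12', 'CD'), ('%', 'NN'), ('fall', 'NN'),
]


def retag_percent(tagged_tokens, new_tag="PCT"):
    """Give every '%' token its own (hypothetical) tag; words are untouched."""
    return [(word, new_tag if word == "%" else tag)
            for word, tag in tagged_tokens]


def chunk_spans(tree, label):
    """Return (start, end) token indices, end exclusive, of chunks with `label`."""
    spans, pos = [], 0
    for node in tree:
        if isinstance(node, Tree):
            size = len(node.leaves())
            if node.label() == label:
                spans.append((pos, pos + size))
            pos += size
        else:
            pos += 1  # a plain (word, tag) tuple outside any chunk
    return spans


# <%> finds nothing: rules are matched against tags, and '%' is tagged NN here.
no_match = chunk_spans(RegexpParser("XYZ: {<%>}").parse(tagged), "XYZ")

# With a dedicated PCT tag, the rule can insist on the percent sign.
parser = RegexpParser("Stat: {<CD><PCT><NN>+}")
tree = parser.parse(retag_percent(tagged))
spans = chunk_spans(tree, "Stat")
matches = [[word for word, _ in tagged[start:end]] for start, end in spans]

print(no_match)
print(spans)
print(matches)
```

The spans are token positions in the original tagged list, so they can be used to look the matches up in the unmodified input. A sequence like ('5', 'CD'), ('man', 'NN'), ('stock', 'NN') is no longer chunked, because it contains no PCT tag.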