NLTK Chunk Parser: How to escape special characters

I am trying to extract some information from text using NLTK chunking.

Here is my input

The stocks show 67% rise, last year it was 12% fall

I want to capture

67% rise and 12% fall

POS Tagging the above sentence shows

('The', 'DT'), ('stocks', 'NNS'), ('show', 'VBP'), ('67', 'CD'), ('%', 'NN'), ('rise', 'NN'), (',', ','), ('last', 'JJ'), ('year', 'NN'), ('it', 'PRP'), ('was', 'VBD'), ('12', 'CD'), ('%', 'NN'), ('fall', 'NN')

Now, I came up with a simple rule

Stat: {<CD><NN>(<NN>+|<VBN>|<JJ>)?}

which works well and captures

('67', 'CD'), ('%', 'NN'), ('rise', 'NN')

('12', 'CD'), ('%', 'NN'), ('fall', 'NN')

but in my data set, I have stuff like

5 million dollars

which is

('5', 'CD'), ('man', 'NN'), ('stock', 'NN')

and is also incorrectly captured. So I thought of including the % sign in my rule

Stat: {<CD><%>(<NN>+|<VBN>|<JJ>)?}

but this rule does not match anything now. How do I escape/include % in my chunk rule?

Update

So, what I do not understand is that I can match other special characters. For example, if I have a rule such as

XYZ:{<:>}

this matches all the :s in the input. So all I am trying to do is

XYZ:{<%>}

and this does not work. I have tried to escape the % by

XYZ:{<\%>}

but this does not work either. I tried \\ as well, to no avail. I really do not want to modify the input string because, once I have a match, I want to find the indices of the matched strings. If I modify the input string, that will throw off my indices unless I do a reverse transformation first.

There are 2 best solutions below

Use a pattern like <CD><NN>+

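Well, since it's a regular expression, you could just escape it. Bear in mind that the rule is matched against the POS tags, not the words, so this only matches if the tagger emits % as a tag; in the tagger output in the question, % is tagged NN.

Stat: {<CD><\%>(<NN>+|<VBN>|<JJ>)?}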

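You could also keep a list of keywords you want to replace, so that your chunk rules don't become excessively long.

e.g.

s = '56% rise and 75% fall'
# (needle, replacement) pairs; extend the list as needed
gen_replacements = [('%', 'PERCENTAGE'), ('perc.', 'PERCENTAGE')]
for ndl, rpl in gen_replacements:
    # pad the replacement with spaces so it becomes a separate token
    s = s.replace(ndl, ' %s ' % rpl)

Note that the replaced token then also has to be given the tag PERCENTAGE after tagging, since the rule matches tags:

Stat: {<CD><PERCENTAGE>(<NN>+|<VBN>|<JJ>)?}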
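
If you would rather not touch the input string at all (so the token indices stay valid), one option is to retag the % token after POS tagging instead of replacing text. This is only a minimal sketch: it reuses the tagged tokens from the question, and the tag name PERCENT and the helper find_stats are made up for illustration; they are not part of NLTK.

```python
from nltk.chunk import RegexpParser
from nltk.tree import Tree

# 'PERCENT' is a made-up tag name; any name the tagger never emits will do.
PARSER = RegexpParser('Stat: {<CD><PERCENT>(<NN>+|<VBN>|<JJ>)?}')


def find_stats(tagged):
    """Return (start, end, words) for every Stat chunk, indexed by token."""
    # Chunk rules match on tags, so give the '%' token its own tag.
    retagged = [(w, 'PERCENT') if w == '%' else (w, t) for w, t in tagged]
    tree = PARSER.parse(retagged)

    matches, index = [], 0
    for node in tree:
        if isinstance(node, Tree):
            # A chunk: record the token span it covers.
            words = [w for w, _ in node.leaves()]
            matches.append((index, index + len(words), words))
            index += len(words)
        else:
            # A plain (word, tag) token outside any chunk.
            index += 1
    return matches


# Tagger output copied from the question.
tagged = [('The', 'DT'), ('stocks', 'NNS'), ('show', 'VBP'), ('67', 'CD'),
          ('%', 'NN'), ('rise', 'NN'), (',', ','), ('last', 'JJ'),
          ('year', 'NN'), ('it', 'PRP'), ('was', 'VBD'), ('12', 'CD'),
          ('%', 'NN'), ('fall', 'NN')]

print(find_stats(tagged))
# [(3, 6, ['67', '%', 'rise']), (11, 14, ['12', '%', 'fall'])]

# The CD NN NN example from the question is no longer captured.
print(find_stats([('5', 'CD'), ('man', 'NN'), ('stock', 'NN')]))
# []
```

The spans are token indices into the tagged list, with the end index exclusive. Mapping them back to character offsets in the original string would need the tokenizer's offsets, which this sketch does not cover.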