I'm trying to parse sentences in Python. For any sentence I get, I should take only the words that appear after the word 'say' or 'ask' (if neither word appears, I should take the whole sentence). I simply did it with regular expressions:
sen = re.search('(?s)(?<=say|Say).*$', current_game_row["sentence"], re.M | re.I)
(this is only for 'say', but adding 'ask' is not a problem...)
The problem is that if I get a sentence with punctuation such as a comma or colon (, :) after the word 'say', the match includes it too. Someone suggested that I use nltk tokenization to handle this, but I'm new to Python and don't understand how to use it. I see that nltk has RegexpParser, but I'm not sure how to use it. Please help me :-)
** I forgot to mention that I also want to recognize 'said', 'asked', etc., and I don't want to match words that merely contain 'say' or 'ask' (I'm not sure there are such words...). In addition, if there are multiple occurrences of 'say' or 'ask', I only want to catch the first one in the sentence. **
Everything after a Keyword
We can deal with the unwanted punctuation by using `\W` to eat up any non-word characters (punctuation and whitespace) that directly follow the keyword.
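A minimal sketch of this idea follows. The pattern, the list of verb forms, and the helper name `text_after_keyword` are assumptions made for illustration, not the code from the original answer. It uses `\b` so that words merely containing 'say' or 'ask' are skipped, `\W*` to swallow punctuation after the keyword, and `search` so that only the first keyword in the sentence counts.

```python
import re

# Assumed list of verb forms; extend it as needed.
KEYWORD = re.compile(
    r"\b(?:say|says|said|saying|ask|asks|asked|asking)\b"  # whole words only
    r"\W*"    # skip punctuation/whitespace right after the keyword
    r"(.*)",  # capture the rest of the text
    re.I | re.S,
)


def text_after_keyword(sentence):
    """Return the text after the first say/ask form, or the whole sentence."""
    match = KEYWORD.search(sentence)  # search() stops at the first match
    return match.group(1) if match else sentence


print(text_after_keyword("He said: hello, world"))
print(text_after_keyword("An essay about tasks"))
```

Because `re.S` is set, a sentence containing newlines is captured through to its end rather than only to the end of the first line.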
case-sensitive: You already pass the case-insensitive flag `re.I`, so we can remove the `Say` permutation.
multi-line: You have the `re.M` option, which directs `^` to match not only at the start of your string but also right after every `\n` within that string. We can drop this since we do not need to use `^`.
dot-matches-all: You have `(?s)`, which directs `.` to match everything, including `\n`. This is the same as applying the `re.S` flag.
I'm not sure what the net effect of having both `re.M` and `re.S` is. I think your sentence might be a text blob with newlines inside, so I removed `re.M` and kept `(?s)`, written as the `re.S` flag.
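For what it's worth, the two flags appear to be independent in Python's `re` module: `re.M` only changes where `^` and `$` may match, while `re.S` only changes what `.` matches. The short experiment below, using a made-up two-line string, is one way to check this yourself.

```python
import re

text = "first line\nsecond line"  # sample input, invented for illustration

# re.M affects only the anchors ^ and $.
starts_default = re.findall(r"^\w+", text)
starts_multiline = re.findall(r"^\w+", text, re.M)

# re.S affects only the dot.
dot_default = re.search(r"first.*", text).group()
dot_dotall = re.search(r"first.*", text, re.S).group()

# The inline (?s) prefix behaves like passing re.S.
dot_inline = re.search(r"(?s)first.*", text).group()

# With both flags, a greedy .* still runs to the end of the string.
both_flags = re.search(r"first.*$", text, re.M | re.S).group()

print(starts_default, starts_multiline)
print(repr(dot_default), repr(dot_dotall), repr(both_flags))
```

Since the pattern in the question uses a greedy `.*` followed by `$`, dropping `re.M` should not change what it captures.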