Regular expression to match all non-alphanumerics except apostrophes in contractions

I'm trying to tokenize a string of English text such that I can get a sequence of the words without any punctuation, but at the same time I want to leave contractions (like don't and won't) and possessive nouns (like Steve's and Drew's) intact. I'm trying to pull this off using regular expressions, but I'm still new to them.

Basically, I want a regular expression that will match all sequences of non-alphanumeric characters except for apostrophes which are surrounded by alphanumeric characters such as in the examples mentioned previously. Is it possible to do this with regular expressions?

There are 2 best solutions below

I don't understand what your regex is trying to match, but I think this will match what you want:

(?i)(?<=^|\s)([a-z]+('[a-z]*)?|'[a-z]+)(?=\s|$)

This matches "words" that optionally end with an apostrophe followed by zero or more letters, as well as words that begin with an apostrophe followed by one or more letters. It covers the following edge cases:

  • Thing
  • Jack's
  • Ross'
  • 'tis
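
To try these cases out, here is a minimal Python sketch. Python's built-in `re` module rejects the variable-width lookbehind `(?<=^|\s)`, so this version writes it as the equivalent `(?<!\S)`; other engines may accept the original as is. The name `find_words` and the sample strings are only for illustration.

```python
import re

# Same idea as the pattern above. The lookbehind (?<=^|\s) is rewritten as
# (?<!\S) ("not preceded by a non-space"), which Python's re module accepts.
# Groups are non-capturing so that findall returns whole matches.
WORD = re.compile(r"(?<!\S)(?:[a-z]+(?:'[a-z]*)?|'[a-z]+)(?=\s|$)", re.IGNORECASE)

def find_words(text):
    """Return whitespace-delimited words, keeping their apostrophes."""
    return WORD.findall(text)

print(find_words("Thing Jack's Ross' 'tis"))
# ['Thing', "Jack's", "Ross'", "'tis"]
```

Note that the pattern only accepts words bounded by whitespace or the ends of the string, so a word that touches other punctuation is skipped entirely: `find_words("don't stop, Steve's here.")` returns only `["don't", "Steve's"]`. If the input contains punctuation, it would need to be stripped first or the lookarounds loosened.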

Your question is not very clear to me, but if I interpreted it correctly, the following regex should do the job:

\b[\w']+\b

regex101 demo
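
As a rough illustration of how this pattern behaves, here is a small Python sketch using `re.findall`. The function name `tokenize` and the sample sentences are assumptions made for the example, not part of the original answer.

```python
import re

# \b anchors each match at a word boundary; [\w']+ takes runs of word
# characters (letters, digits, underscore) and apostrophes.
TOKEN = re.compile(r"\b[\w']+\b")

def tokenize(text):
    """Return the words in text, dropping surrounding punctuation."""
    return TOKEN.findall(text)

print(tokenize("Don't touch Steve's hat, okay?"))
# ["Don't", 'touch', "Steve's", 'hat', 'okay']
```

Two things to keep in mind: `\w` also matches digits and underscores, and because every match must start and end at a word boundary, leading and trailing apostrophes are dropped, so `tokenize("Ross' 'tis")` gives `['Ross', 'tis']`. Apostrophes inside a word, as in contractions and most possessives, are kept.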