Regular expression for tokenization in Java


I'm trying to tokenize the following text:

"I am a cat, Cat Cat ah. Are you a C.A.T?"

But I have to follow these rules:

1. Two or more words separated by whitespace, all of which begin with a capital letter, must be preserved as a single token.

2. An acronym should be preserved as a single token, with or without its periods (e.g. C.A.T can result in CAT or C.A.T).

The correct output should be like this: I, am, a, cat, Cat Cat, Are, you, C.A.T.

How can I write regular expressions that cover these cases if I want to use Matcher in Java to do the work?
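For reference, here is a minimal sketch of one way the two rules could be expressed as a single alternation and run through Matcher. The class name `Tokenizer`, the method `tokenize`, and the exact patterns are assumptions for illustration, not a known-good answer. It assumes plain ASCII letters and that an acronym is a run of single capital letters joined by periods. The order of the alternatives matters: the acronym pattern is tried first, then the capitalized sequence, then any other word.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; names and patterns are illustrative assumptions.
class Tokenizer {
    private static final Pattern TOKEN = Pattern.compile(
          "[A-Z](?:\\.[A-Z])+"                  // rule 2: acronym such as C.A.T
        + "|[A-Z][a-z]*(?:\\s+[A-Z][a-z]*)+"    // rule 1: two or more capitalized words
        + "|[A-Za-z]+");                        // fallback: any other single word

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        // find() skips punctuation and whitespace that no alternative matches
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("I am a cat, Cat Cat ah. Are you a C.A.T?"));
        // prints [I, am, a, cat, Cat Cat, ah, Are, you, a, C.A.T]
    }
}
```

Note that this sketch emits every word, so "ah" and the second "a" appear in the result. If the acronym should come out as CAT instead, calling `replace(".", "")` on tokens that contain a period would be one option. Edge cases such as a capitalized word directly followed by an acronym ("Cat C.A.T") are not handled and would need extra work, for example a lookahead.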

