Hi all, I'm trying to tokenize the following text:
"I am a cat, Cat Cat ah. Are you a C.A.T?"
But I have to follow these rules:
1. Two or more words separated by whitespace, all of which begin with a capital letter, must be preserved as a single token.
2. An acronym must be preserved as a single token, with or without its periods (e.g. C.A.T can result in CAT or C.A.T).
The correct output should look like this: I, am, a, cat, Cat Cat, Are, you, C.A.T.
How can I write a regular expression that covers these cases if I want to use Matcher in Java to do the work?
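To make the two rules concrete, here is a rough sketch of the kind of thing I have in mind, not a finished solution. The class name CapitalizedTokenizer, the method tokenize and the stripDots flag are names I made up for illustration. It assumes plain ASCII capitals ([A-Z]) and that an acronym is single capital letters joined by periods; I have only checked it against the sample sentence. The alternatives are ordered from most to least specific, because the regex engine takes the first alternative that matches at a given position.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CapitalizedTokenizer {
    // Alternatives are tried left to right, so the specific cases come first.
    private static final Pattern TOKEN = Pattern.compile(
          "\\b((?:[A-Z]\\.)+[A-Z])\\b"        // group 1: acronym such as C.A.T (rule 2)
        + "|\\b[A-Z]\\w*(?:\\s+[A-Z]\\w*)+"   // two or more capitalized words (rule 1)
        + "|\\w+");                           // any other single word

    // stripDots is a hypothetical switch: true turns C.A.T into CAT.
    static List<String> tokenize(String text, boolean stripDots) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            String token = m.group();
            // group(1) is non-null only when the acronym alternative matched.
            if (stripDots && m.group(1) != null) {
                token = token.replace(".", "");
            }
            tokens.add(token);
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("I am a cat, Cat Cat ah. Are you a C.A.T?", false));
    }
}
```

On the sample sentence this prints [I, am, a, cat, Cat Cat, ah, Are, you, a, C.A.T]. Note that it also emits "ah" and the second "a", which the expected output listed above leaves out, so an extra filtering step would be needed if those are meant to be dropped.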