I'm splitting a string on all punctuation and whitespace characters. Rather than build a complicated (?) regex to match what C# considers "punctuation" and "whitespace" characters, I'm using the char.IsPunctuation and char.IsWhiteSpace methods to get the characters from the string that are punctuation/whitespace.
Basically, this is what I'm doing - building an array of punctuation and whitespace characters, which I later use to split the string.
return text.Where(c => char.IsPunctuation(c) || char.IsWhiteSpace(c))
    .Distinct()
    .ToArray();
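For context, here is a minimal sketch of how that array might then be used for the split. The class and method names (SplitSketch, GetSeparators, SplitWords) are made up for illustration, and passing StringSplitOptions.RemoveEmptyEntries is an assumption, since the snippet above only shows how the separator array is built.

```csharp
using System;
using System.Linq;

public static class SplitSketch
{
    // Collects the distinct punctuation/whitespace characters that occur in the text.
    public static char[] GetSeparators(string text)
    {
        return text.Where(c => char.IsPunctuation(c) || char.IsWhiteSpace(c))
                   .Distinct()
                   .ToArray();
    }

    // Splits on those characters. RemoveEmptyEntries (an assumption) drops the
    // empty strings that appear between adjacent separators such as ", ".
    public static string[] SplitWords(string text)
    {
        return text.Split(GetSeparators(text), StringSplitOptions.RemoveEmptyEntries);
    }

    public static void Main()
    {
        foreach (var word in SplitWords("Hello, world! How are you?"))
            Console.WriteLine(word);
    }
}
```

With the sample input in Main, this should print Hello, world, How, are and you on separate lines. Note that the text is scanned twice: once by LINQ to find the separators and once by Split.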
I did it this way originally because I couldn't find a static list or array of the chars that C# considers punctuation or whitespace. The MSDN documentation for char.IsPunctuation lists the Unicode code points it considers punctuation, but my question is: does that list exist anywhere in the .NET code, so that I could reference it instead of building it from the input string every time?
Instead of using String.Split with an endless list of characters that you first have to determine with LINQ, which is really not efficient, you could take a different approach: use a StringBuilder and enumerate the characters just once. For an example, see this demo: https://dotnetfiddle.net/3eLI9d
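The single-pass idea could look roughly like the sketch below. This is not the code from the linked demo, just one way to implement the approach described; the class and method names are hypothetical, and it assumes empty entries between consecutive separators should be skipped.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class SinglePassSplitSketch
{
    // Walks the text once, collecting runs of non-separator characters as words.
    public static List<string> SplitOnPunctuationAndWhiteSpace(string text)
    {
        var words = new List<string>();
        var current = new StringBuilder();

        foreach (char c in text)
        {
            if (char.IsPunctuation(c) || char.IsWhiteSpace(c))
            {
                // A separator ends the current word, if one has been started.
                if (current.Length > 0)
                {
                    words.Add(current.ToString());
                    current.Clear();
                }
            }
            else
            {
                current.Append(c);
            }
        }

        // Flush the last word when the text does not end with a separator.
        if (current.Length > 0)
            words.Add(current.ToString());

        return words;
    }

    public static void Main()
    {
        foreach (var word in SplitOnPunctuationAndWhiteSpace("Hello, world! How are you?"))
            Console.WriteLine(word);
    }
}
```

Because each character is classified as it is read, no separator array has to be built first, and the question of whether a static list of punctuation characters exists does not come up.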