National characters in Java replaceAll


I have a big problem with replacing some characters in Java. I would like to remove all characters that are not letters, numbers, or special national characters such as "ę" and "ą". When I use replaceAll("\\W", " "), the national characters are removed as well.

Example string: "Jest źle, ale będzie lepiej."

How it's replaced: "Jest le ale b dzie lepiej "

How it should be: "Jest źle ale będzie lepiej "

Sorry for my not very good English :)


There is 1 best solution below


Your English is better than Java's Polish. Java's regex does not speak Polish: by default, \w matches only the ASCII letters a-z and A-Z, plus digits and the underscore (GREP was obviously designed by programmers), so \W treats "ź" and "ę" as non-word characters and replaces them too. That's fair, actually: a "normal" character in one language is "weird" in another.

You can list the few extra non-ASCII characters you want to keep in a custom negated character class:

replaceAll("[^\\wźę ]", " ");

(you should add the other accented characters as well, and perhaps remove non-Polish characters such as Q and X).
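
For completeness, here is a minimal sketch of that idea with the full set of Polish diacritics (ą ć ę ł ń ó ś ź ż, in both cases), next to a different technique that is not part of the suggestion above: a negated class built from the Unicode properties \p{L} (any letter) and \p{N} (any digit). The class name PolishCleaner and the method names keepPolish and keepUnicode are made up for illustration. The letters are written as Unicode escapes only so that the file compiles whatever the source encoding is; with a UTF-8 build you can type them directly.

```java
import java.util.regex.Pattern;

public class PolishCleaner {
    // The nine Polish diacritic letters, lower case then upper case,
    // written as Unicode escapes so the source file stays pure ASCII.
    private static final String POLISH =
            "\u0105\u0107\u0119\u0142\u0144\u00F3\u015B\u017A\u017C"
          + "\u0104\u0106\u0118\u0141\u0143\u00D3\u015A\u0179\u017B";

    // Explicit list: ASCII word characters, the Polish letters, and the space.
    private static final Pattern NOT_POLISH =
            Pattern.compile("[^\\w" + POLISH + " ]");

    // Alternative: any Unicode letter or digit, whatever the language.
    private static final Pattern NOT_UNICODE =
            Pattern.compile("[^\\p{L}\\p{N} ]");

    // Replaces every character outside the explicit list with a space.
    public static String keepPolish(String s) {
        return NOT_POLISH.matcher(s).replaceAll(" ");
    }

    // Replaces every character that is not a letter, digit, or space with a space.
    public static String keepUnicode(String s) {
        return NOT_UNICODE.matcher(s).replaceAll(" ");
    }

    public static void main(String[] args) {
        // The sample sentence from the question, with the two accented
        // letters given as Unicode escapes.
        String input = "Jest \u017Ale, ale b\u0119dzie lepiej.";
        System.out.println(keepPolish(input));
        System.out.println(keepUnicode(input));
    }
}
```

Both methods turn the comma and the full stop into spaces and leave "ź" and "ę" alone, so the result has two spaces where ", " used to be. They differ on other input: keepPolish keeps the underscore (it is part of \w) but drops letters such as "ü", while keepUnicode keeps every letter from every alphabet and drops the underscore.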