My goal is to protect my web site from attacks by creating a strict whitelist of allowed characters for any and all POST data received from the client side.
This is a piece of cake when staying within ASCII characters. Something like:
if (preg_match('/[^a-zA-Z0-9]/', $stringToTest))
{
    // Battle stations!!
}
However, I need to be able to allow any and all UTF-8 characters, especially Asian character sets like Japanese, Chinese, and Korean. But I don't want to exclude anybody with wacky characters, like Arabic or Russian, or whatever. One world, one love! ;)
How can I allow people to input the characters of their native language while excluding the nasties used in evil scripts, like *, ?, angle brackets, and so on?
\w will give you word characters (letters, digits, and underscores), which is probably what you're after, and \s matches whitespace, e.g.
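A minimal sketch of such a check, assuming PHP's preg_match with the /u modifier (which treats the pattern and subject as UTF-8 and lets \w and \s match non-ASCII letters and spaces). The helper name isWordsAndSpaces is hypothetical, not something from the question:

```php
<?php
// Hypothetical helper: true when $input consists only of word
// characters (\w) and whitespace (\s), judged with Unicode semantics.
function isWordsAndSpaces(string $input): bool
{
    // \A and \z anchor the match to the whole string; + rejects "".
    // With /u, preg_match returns false on invalid UTF-8, so the strict
    // comparison with 1 also rejects malformed input.
    return preg_match('/\A[\w\s]+\z/u', $input) === 1;
}

var_dump(isWordsAndSpaces('こんにちは 世界'));            // letters and a space
var_dump(isWordsAndSpaces('<script>alert(1)</script>')); // has < > ( ) /
```

Anchoring the whole string is a design choice for whitelisting: the pattern in the question looks for any forbidden character, while this one only passes input made up entirely of allowed characters.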
regular-expressions.info is an excellent reference for this stuff - here and here are a couple of relevant pages :)
edit: some more clarification needed, sorry!
here's what I usually use for CJK:
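As an illustration only (an assumed pattern, not necessarily the one meant above), a CJK whitelist could be sketched with PCRE's Unicode script properties. The helper name isCjkText and the choice of scripts are assumptions:

```php
<?php
// Hypothetical helper: true when $input contains only Han, Hiragana,
// Katakana, or Hangul characters plus whitespace.
function isCjkText(string $input): bool
{
    // \p{Han}, \p{Hiragana}, \p{Katakana} and \p{Hangul} are Unicode
    // script properties; the /u modifier is required for them to work
    // on UTF-8 input.
    $pattern = '/\A[\p{Han}\p{Hiragana}\p{Katakana}\p{Hangul}\s]+\z/u';
    return preg_match($pattern, $input) === 1;
}

var_dump(isCjkText('日本語のテキスト')); // Han, Hiragana and Katakana
var_dump(isCjkText('한국어'));           // Hangul
var_dump(isCjkText('hello'));            // Latin letters are not listed
```

Script properties are stricter than \w: characters shared between scripts, such as CJK punctuation or the Katakana long-vowel mark, belong to the Common script and would need to be added to the class explicitly.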
To get everything that could be a problem for escaping and other black-hat stuff, use:
/[^\p{Punctuation}]/ ( == /[^\p{P}]/ )
or
/[^\041-\176]/ ( == /[^!-~]/ ; the escapes are octal, so ! is \041 and ~ is \176)
another good link
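A minimal sketch of turning the punctuation property into a blacklist-style check, assuming PHP's preg_match with the /u modifier. The helper name is hypothetical, and \p{S} is an addition not in the patterns above: < and > are classified as math symbols (category Sm), so \p{P} on its own does not match them:

```php
<?php
// Hypothetical helper: true when $input contains at least one
// punctuation (\p{P}) or symbol (\p{S}) character.
function containsPunctuationOrSymbol(string $input): bool
{
    // Unanchored on purpose: a single offending character is enough.
    return preg_match('/[\p{P}\p{S}]/u', $input) === 1;
}

var_dump(containsPunctuationOrSymbol('a<b'));        // < is a symbol (Sm)
var_dump(containsPunctuationOrSymbol('what?'));      // ? is punctuation (Po)
var_dump(containsPunctuationOrSymbol('Привет мир')); // letters and a space only
```

Note that this also flags ordinary punctuation such as full stops and apostrophes, so it suits fields like usernames better than free text.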