My goal is to protect my web site from attacks by creating a strict whitelist of allowed characters for any and all POST data received from the client side.
This is a piece of cake when staying within ASCII characters. Something like:
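// True when the string contains anything other than ASCII letters and digits.
// Note: the range must be a-zA-Z; A-z would also let [ \ ] ^ _ ` through.
if (preg_match('/[^a-zA-Z0-9]/', $stringToTest))
{
    // Battle stations!!
}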
However, I need to be able to allow any and all UTF-8 characters, especially Asian character sets like Japanese, Chinese, and Korean. But I don't want to exclude anybody with wacky characters, like Arabic or Russian, or whatever. One world, one love! ;)
How can I allow people to input the characters of their native language while excluding the nasties used in evil scripts, like *, ?, angle brackets, and so on?
\w will give you word characters (letters, digits, and underscores), which is probably what you're after, and \s matches whitespace.
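A minimal sketch of how \w and \s might be combined into a whitelist check in PHP. The function name isPlainText is hypothetical, and the sketch assumes a PHP build whose PCRE has Unicode property support, so that the /u modifier makes \w cover letters and digits from any script rather than only ASCII.

```php
<?php
// Hypothetical helper: true when $input consists only of word characters
// (letters, digits, underscore) and whitespace, in any script.
function isPlainText(string $input): bool
{
    // /u treats both pattern and subject as UTF-8. preg_match() returns
    // false for invalid UTF-8, which the strict comparison also rejects.
    return preg_match('/^[\w\s]+$/u', $input) === 1;
}

var_dump(isPlainText('Привет мир'));       // Cyrillic letters and a space
var_dump(isPlainText('日本語のテキスト')); // Japanese
var_dump(isPlainText('<script>'));         // contains angle brackets
```

Scripts that rely on combining marks may need \p{M} added to the class, since \w on its own is not expected to match them.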
regular-expressions.info is an excellent reference for this stuff - here and here are a couple of relevant pages :)
edit: some more clarification needed, sorry!
here's what I usually use for CJK:
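The pattern that followed here is not preserved, so the following is only a sketch of one common approach and not the answerer's original: matching on Unicode script properties. The function name isCjkText is hypothetical, and the choice of Han, Hiragana, Katakana, and Hangul as the allowed scripts is an assumption.

```php
<?php
// Hypothetical helper: true when $input contains only characters from the
// Han, Hiragana, Katakana, or Hangul scripts, plus whitespace.
function isCjkText(string $input): bool
{
    // \p{Han} etc. are Unicode script properties; /u enables UTF-8 mode.
    $pattern = '/^[\p{Han}\p{Hiragana}\p{Katakana}\p{Hangul}\s]+$/u';
    return preg_match($pattern, $input) === 1;
}

var_dump(isCjkText('日本語'));            // Han
var_dump(isCjkText('ひらがな カタカナ')); // Hiragana, a space, Katakana
var_dump(isCjkText('한국어'));            // Hangul
var_dump(isCjkText('hello'));             // Latin letters are not in the class
```

CJK punctuation such as 。 and 、 is not part of these script classes and would have to be allowed separately if needed.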
To get everything that could be a problem for escaping and other black-hat stuff, use:
/[^\p{Punctuation}]/
( == /[^\p{P}]/ )
or
/[^\x21-\x7E]/
( == /[^!-~]/ )
another good link
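As a rough sketch of how the punctuation property might be applied to the original question, the check below flags any input containing punctuation or symbols. The function name containsPunctuationOrSymbols is hypothetical, and adding \p{S} is an assumption on my part: in Unicode, * and ? are punctuation (P), but angle brackets are classed as math symbols (S), so \p{P} alone would let them through. The short form \p{P} is used because it is the spelling PCRE documents.

```php
<?php
// Hypothetical helper: true when $input contains any Unicode punctuation
// (\p{P}) or symbol (\p{S}) character.
function containsPunctuationOrSymbols(string $input): bool
{
    $result = preg_match('/[\p{P}\p{S}]/u', $input);
    // 1 means a match was found; false means the input was not valid
    // UTF-8, which is treated as suspicious as well.
    return $result !== 0;
}

var_dump(preg_match('/\p{P}/u', '*'));  // asterisk is punctuation
var_dump(preg_match('/\p{P}/u', '<'));  // angle bracket is a symbol, not punctuation
var_dump(containsPunctuationOrSymbols('<b>'));
var_dump(containsPunctuationOrSymbols('こんにちは'));
```

Letters from any script pass this check untouched, so it can sit alongside a whitelist instead of replacing it.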