How do I blacklist characters in a UTF-8 string?


I have an HTML text input where users can write in a name for themselves. The name is just a user-friendly display name; it's not used to identify the user in the database or for anything on the back end.

I want to allow UTF-8 characters, so that people can input characters of their native language, whether it's Chinese or Swedish or whatever.

However, I want to blacklist certain characters, like <, >, [, ], ?, *, and so on, to stop any potential script kiddies trying to exploit the input to make an SQL injection or whatever.

I thought this would be straightforward code that there would be lots of examples of on the web, but the answer, if it's out there, is buried among examples of how to use a whitelist to validate email addresses (only English alphanumeric characters, no Asian or other language specific characters), or, oddly enough, how to stop key presses for certain characters entirely.

I don't want to stop key presses entirely, as I think that might confuse the user in my case. Instead I'll output an error saying they can't use character X if they input a blacklisted one.

So, for a guy like me who sucks totally at regex, is there a straightforward way of blacklisting characters in JavaScript?

I would also go for a whitelist solution if it didn't inhibit the ability for users to put in the funky characters from whatever language they're using.


There are 2 best solutions below

0
On BEST ANSWER

To start, you would want to clean the input data on both the client side and the server side. Anyone clever enough to be creating attacks will be clever enough to disable JavaScript long enough to get the data they want into your forms.

As for a JavaScript regex to prevent entry, there are lots of questions on SO that talk about this.

//some technical stuff

javascript regexp remove all special characters

Does this set of regular expressions FULLY protect against cross site scripting?

//simple js example to stop entry of unwanted characters.

http://www.sitepoint.com/forums/showthread.php?t=142118

From what I've read, the consensus seems to be that you need to whitelist rather than blacklist. Perhaps someone from SO with more experience can point you towards the best way to handle your use case.
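A whitelist does not have to shut out non-English names. As a rough sketch, assuming a runtime that supports Unicode property escapes (the `u` flag with `\p{...}`, available in ES2018 and later), the pattern below accepts letters, combining marks and digits from any script plus a few separators. The names `DISPLAY_NAME_PATTERN` and `isValidDisplayName` are made up for this example, and the set of allowed separators is an assumption you would adjust to your own needs.

```javascript
// Hypothetical whitelist: letters (\p{L}), combining marks (\p{M}) and
// digits (\p{N}) from any script, plus space, dot, apostrophe and hyphen.
const DISPLAY_NAME_PATTERN = /^[\p{L}\p{M}\p{N} .'-]+$/u;

// Hypothetical helper: returns true only if every character is whitelisted.
function isValidDisplayName(name) {
    if (typeof name !== 'string') {
        return false;
    }
    // Normalize so composed and decomposed forms are handled the same way.
    return DISPLAY_NAME_PATTERN.test(name.normalize('NFC'));
}

console.log(isValidDisplayName('Åsa Öberg'));  // true
console.log(isValidDisplayName('王小明'));      // true
console.log(isValidDisplayName('<script>'));   // false
```

The same check would still need to be repeated on the server, since anything running in the browser can be bypassed.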

1
On

It would really depend on how you're doing validation in general (I like jQuery validate as a framework), but the basic check might look something like:

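// Set up a handler function. When it is assigned as an event handler,
// `this` refers to the input element.
function checkBlacklist() {
    // Test the value against a character class of disallowed characters.
    // Inside [...] the brackets are escaped; * and ? are literal there
    // and need no backslash.
    if (/[\[\]<>*?]/.test(this.value)) {
        // take evasive action of some kind, e.g. show an error message
    }
}
// Assign it to your input (assumes an element with id "mynameinput")
document.getElementById('mynameinput').onchange = checkBlacklist;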
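
To tell the user which character they can't use, as the question asks, the handler above could build its message from the characters that actually matched. This is only a sketch: `findBlacklistedChars` and `blacklistMessage` are hypothetical helper names, and the wording of the message is a placeholder.

```javascript
// Hypothetical helper: returns the distinct blacklisted characters found
// in `value`, in order of first appearance.
function findBlacklistedChars(value) {
    // The g flag makes match() return every occurrence, or null if none.
    const matches = value.match(/[\[\]<>*?]/g) || [];
    return [...new Set(matches)];
}

// Hypothetical helper: builds the text to display next to the input,
// or an empty string when the value is acceptable.
function blacklistMessage(value) {
    const bad = findBlacklistedChars(value);
    return bad.length === 0
        ? ''
        : 'These characters are not allowed: ' + bad.join(' ');
}

console.log(blacklistMessage('Bob <b>*'));  // These characters are not allowed: < > *
console.log(blacklistMessage('Åsa'));       // prints an empty line
```

Inside `checkBlacklist` you could then call `blacklistMessage(this.value)` and write the result into whatever element you use for error text.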

But, as Mike Daniels noted, don't do this if your main concern is with script kiddies. Anyone who wants to mess with your input form can easily circumvent JavaScript validation. Consider the JavaScript validation a nice UI feature: it gives your user helpful feedback without making them reload the page. But it's not security of any kind. You have to check and sanitize the input on the server side no matter what JavaScript you use for validation.
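
As one illustration of the server-side part, assuming the back end also runs JavaScript (Node.js, say), the display name could be HTML-escaped at the point where it is written into a page, so that characters like < and > show up as text instead of markup. `escapeHtml` is a hypothetical helper written for this sketch, not part of any library. For the SQL injection worry, parameterized queries in your database driver are the usual defence rather than character filtering.

```javascript
// Hypothetical server-side helper: replaces the characters that are
// significant in HTML with their entity equivalents.
function escapeHtml(text) {
    const replacements = {
        '&': '&amp;',
        '<': '&lt;',
        '>': '&gt;',
        '"': '&quot;',
        "'": '&#39;'
    };
    // Each matched character is looked up in the table above.
    return String(text).replace(/[&<>"']/g, (ch) => replacements[ch]);
}

console.log(escapeHtml('<b>Bob</b>'));  // &lt;b&gt;Bob&lt;/b&gt;
console.log(escapeHtml('王小明'));       // 王小明
```

Escaping on output like this leaves names in any language untouched while neutralizing markup.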