RegExp: check for at least one (unicode) character above code point 0x7f


I'm trying to test whether a string contains at least one Unicode character above code point 0x7f (i.e. a non-ASCII character).

I've tried the following ideas (and a few others), but they don't seem to work:

var rx:RegExp;

rx = /[^\\x00-\\x7f]/; // negate ascii code point 0 to 127
trace( rx.test( '\u0080' ) ); // true (expected true)
trace( rx.test( 'b' ) ); // true (expected false)

rx = /[^\u0000-\u007f]/; // negate unicode code point 0 to 127
trace( rx.test( '\u0080' ) ); // false (expected true)
trace( rx.test( 'b' ) ); // false (expected false)

Can somebody help me understand why this is not working as expected and how to do it properly?


There are 2 best solutions below


I'm not sure whether AS3 supports Unicode in RegExp the way Python, for example, does. I can suggest the following solution, which does what you want without a regex, though I expect it to be slow for long strings.

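function containsUnicode(text:String):Boolean
{
    // Scan every UTF-16 code unit; anything above 127 (0x7F) is non-ASCII.
    for (var i:int = text.length - 1; i >= 0; i--)
    {
        if (text.charCodeAt(i) > 127)
            return true;
    }

    // Only ASCII characters were found.
    return false;
}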
2
On
/[^\\x00-\\x7f]/;

The double backslash means a literal backslash, so you are matching against a negated character class that excludes backslash, x, 0, every character from 0 through backslash, x, 7 and f. The letter b is not in that set, which is why rx.test( 'b' ) returns true.
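To make that reading concrete, here is a minimal sketch in browser-style JavaScript (not ActionScript) that probes the doubled-backslash pattern one character at a time. The helper name isExcluded is hypothetical and exists only for this illustration.

```javascript
// The pattern from the question, with its doubled backslashes left in place.
const doubled = /[^\\x00-\\x7f]/;

// Hypothetical helper: true if `ch` is one of the characters the
// negated class excludes (i.e. the regex does NOT match it).
function isExcluded(ch) {
  return !doubled.test(ch);
}

console.log(isExcluded('\\'));     // true: the literal backslash
console.log(isExcluded('x'));      // true: listed explicitly
console.log(isExcluded('5'));      // true: inside the range '0' (0x30) to '\' (0x5C)
console.log(isExcluded('A'));      // true: also inside that range
console.log(isExcluded('b'));      // false: 0x62 lies outside the set, so the regex matches 'b'
console.log(isExcluded('\u0080')); // false: it matches, but only by coincidence
```

So the pattern is valid, just not the class that was intended: it accidentally excludes digits and uppercase letters rather than the ASCII range.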

You would only use double-backslashes if the regex were in a string literal (as in new RegExp('[^\\x00-\\x7F]')); pretty much the entire purpose of the regex literal syntax /.../ is to allow you to type backslash-heavy expressions without the extra layer of escaping.

'foo'.search(/[^\x00-\x7F]/)!==-1  // false
'bär'.search(/[^\x00-\x7F]/)!==-1  // true
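As a rough illustration of the literal-versus-string point, the following JavaScript sketch builds the same pattern both ways and compares them. The names fromLiteral, fromString and hasNonAscii are made up for this example, and the behaviour shown is that of a standard JavaScript engine, not necessarily ActionScript.

```javascript
// Regex literal: one backslash per escape.
const fromLiteral = /[^\x00-\x7F]/;

// String literal: each backslash is doubled so that one survives string parsing.
const fromString = new RegExp('[^\\x00-\\x7F]');

// Both constructions end up with the same pattern text.
console.log(fromLiteral.source === fromString.source); // true

// Hypothetical helper wrapping the string-built regex.
function hasNonAscii(text) {
  return fromString.test(text);
}

console.log(hasNonAscii('foo')); // false
console.log(hasNonAscii('bär')); // true
```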

However:

rx = /[^\u0000-\u007f]/; // negate unicode code point 0 to 127
trace( rx.test( '\u0080' ) ); // false (expected true)

That test returns true for me in browser JavaScript. If it does not in ActionScript, that would appear to be a bug in which ActionScript departs from the ECMAScript standard.