Check if string is in the BMP range


I was looking for a proper way in PHP to detect whether a string contains only characters from the BMP (Basic Multilingual Plane), but I found nothing. Even mb_check_encoding and mb_detect_encoding do not help in this particular case.

So I wrote my own code:

<?php

function is_bmp($string) {
    $str_ar = mb_str_split($string);
    foreach ($str_ar as $char) {
        /*Check if there's any character's code point outside the BMP range*/
        if (mb_ord($char) > 0xFFFF)
            return false;
    }
    return true;
}

/*String containing non-BMP Unicode characters*/
$string = "blah blah \u{1F600}"; // U+1F600 lies outside the BMP
var_dump(is_bmp($string));
?>

Outputs:

bool(false)

Now my question is:

Is there a better approach, and are there any flaws in mine?


daxim
var_dump(
    !preg_match('/[^\x0-\x{ffff}]/u', 'blah blah')
);

AterLux

If you have a correctly UTF-8 encoded input string, you can just check its bytes to find out whether it contains any symbols outside the BMP.

In other words, you need to detect whether the input string contains any symbol whose code point is greater than 0xFFFF (i.e. needs more than 16 bits).

Note on how UTF-8 encoding works:

  • Code points 0 through 0x7F are encoded as is, in one byte.
  • All other code points start with a lead byte in the range 0xC0...0xFF (0xC2...0xF4 in valid UTF-8), which also encodes how many additional bytes follow. The additional bytes are in the range 0x80...0xBF.

To encode code points 0x10000 and greater, UTF-8 requires a sequence of 4 bytes, and the first byte of that sequence is 0xF0 or greater. Code points below 0x10000 need at most 3 bytes, and none of those bytes is 0xF0 or greater.
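
As a quick check of the rule above, here is a minimal sketch that prints the encoded length and lead byte of a few sample code points. The sample set and the lead_byte() helper are illustrative only (neither comes from PHP nor from the original answer), and the "\u{...}" escapes assume PHP 7.0 or later.

```php
<?php
// Hypothetical helper: numeric value of the first byte of a UTF-8 sequence.
function lead_byte(string $char): int {
    return ord($char[0]);
}

// A few sample code points written with PHP 7 Unicode escapes.
$samples = [
    'U+0041'  => "\u{41}",     // ASCII letter
    'U+00E9'  => "\u{E9}",     // 2-byte sequence
    'U+20AC'  => "\u{20AC}",   // 3-byte sequence, still in the BMP
    'U+FFFD'  => "\u{FFFD}",   // near the top of the BMP
    'U+10000' => "\u{10000}",  // first code point outside the BMP
    'U+1F600' => "\u{1F600}",  // emoji, outside the BMP
];

foreach ($samples as $label => $char) {
    // strlen() counts bytes, not characters, which is what we want here.
    printf("%-8s %d byte(s), lead byte 0x%02X\n", $label, strlen($char), lead_byte($char));
}
```

Only the last two samples should report 4 bytes and a lead byte of 0xF0.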

In short, the task reduces to one question: does the binary representation of the string contain any byte in the range 0xF0...0xFF?

function is_bmp($string) {
   // A match means a byte 0xF0...0xFF is present, i.e. a non-BMP character.
   return preg_match('#[\xF0-\xFF]#', $string) === 0;
}

Or, even simpler (but probably slower), you can use PCRE's ability to work with UTF-8 sequences (the PCRE_UTF8 option, enabled with the u modifier):

function is_bmp($string) {
   // A match means a code point above U+FFFF is present, i.e. a non-BMP character.
   return preg_match('#[^\x00-\x{FFFF}]#u', $string) === 0;
}
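
A small cross-check of the two variants, assuming both are meant to return true when every character is in the BMP (so a regex match means "not BMP"). The names is_bmp_bytes() and is_bmp_pcre() and the sample strings are illustrative only, and the "\u{...}" escapes assume PHP 7.0 or later.

```php
<?php
// Byte-level variant: a 4-byte UTF-8 sequence always starts with a byte >= 0xF0.
function is_bmp_bytes(string $s): bool {
    return preg_match('#[\xF0-\xFF]#', $s) === 0;
}

// UTF-8 mode variant: look for any code point above U+FFFF.
// preg_match() returns false (not 0) when it fails, so the strict comparison
// makes this function report false in that case as well.
function is_bmp_pcre(string $s): bool {
    return preg_match('#[^\x00-\x{FFFF}]#u', $s) === 0;
}

// Sample inputs paired with the expected answer.
$cases = [
    'plain ASCII'    => ['blah blah', true],
    'BMP, non-ASCII' => ["caf\u{E9} \u{20AC}", true],
    'emoji'          => ["blah \u{1F600}", false],
    'empty string'   => ['', true],
];

foreach ($cases as $label => [$input, $expected]) {
    printf(
        "%-15s bytes=%s pcre=%s expected=%s\n",
        $label,
        var_export(is_bmp_bytes($input), true),
        var_export(is_bmp_pcre($input), true),
        var_export($expected, true)
    );
}
```

Both variants rely on the input being valid UTF-8. For arbitrary binary data the byte-level check can report false merely because a stray byte happens to fall in 0xF0...0xFF.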