Regex returning inexplicable results (to me)

48 Views Asked by At

I want to return entries from a table that match the format:

prefix + optional spaces + Thai digit

Testing using ยก as the prefix I use the following SQL

SELECT term
FROM entries
WHERE term REGEXP "^ยก[\s]*[๐-๙]+$"

This returns 9 entries, 4 of which don't have the correct prefix, and none of them ends in a digit.

ยกนะ
ยกบัตร
ยกมือ
ยกยอ
ยกยอด
ยกหยิบ
ยมทูต
ยมนา
ยมบาล
ยมล

It doesn't return

ยก ๑
ยก ๒

which I know are in the database and are the entries I want.

I'm very new to all this. What am I doing wrong?

FWIW, this is against a MySQL database and everything is in Unicode.

Thanks

2

There are 2 best solutions below

1
On

As quoted from the MySQL docs:

The REGEXP and RLIKE operators work in byte-wise fashion, so they are not multi-byte safe and may produce unexpected results with multi-byte character sets. In addition, these operators compare characters by their byte values and accented characters may not compare as equal even if a given collation treats them as equal.

Doesn't seem like MySQL's REGEXP can handle the [๐-๙] range correctly due to the above.

0
On

I use utf8_general_ci and try.I matched

ยกนะ 

with "^ยก[\s]*[๐-๙]+$" but did't matched ยก ๑.So I change the regexp to

"^ยก[ ]*[๐-๙]+$"

,and it can match

ยกนะ 
ยก ๑

Maybe the problem is character encoding.