Unable to differentiate between words consisting of different Unicode letters in mysql and phpMyAdmin

60 Views Asked by At

I am inserting a word using the following sql query from a php script.

$sql = "INSERT IGNORE into cb_words (word, user_id) VALUES('$word', $user_id)";

The problem arises in this particular scenario - The cb_words table contains words in Bangla language.

  1. word - আর which encoded to Unicode format is %u0986%u09B0
  2. Another word - আঁর which encoded to Unicode format is %u0986%u0981%u09B0

As one may notice from the Unicode encoding for both the words they are almost similar with an additional letter in the second word.

Now, the table - cb_words already contains the second word, but when I am trying to insert the first word it fails. It does not insert the word and simply says 0 rows inserted.

INSERT IGNORE into cb_words (word, user_id) VALUES('আর', 2)

When tried from PHPMyAdmin Insert query result

Shockingly enough, when I search in the table for the word as -

SELECT * FROM `cb_words` where `cb_words`.`word` = 'আর'

The returned result is -

The result of the SELECT SQL statement

N.B: The table only has the word (2) as described above. and I am trying to insert (1)

So, despite the fact, these two words are not 100% identical, why does mysql fail to differentiate between the two?

I am adding the table structure in case it may be necessary: Table structure for cb_words

2

There are 2 best solutions below

3
On BEST ANSWER

Insert being ignored, is at my guess because your user_id is set to unique? can't tell from your screenshots and you're inserting with IGNORE supressing any errors. Remove the unique index from user_id.

You can change your collation to utf8_bin.

Created your table, locally:

table entries result from your select

similar question with some interesting reading: How can I enter data using non English (Bangla) language into this database table?

0
On
mysql> select 'আর'='আর' COLLATE utf8mb4_0900_ai_ci;
+-------------------------------------------------+
| 'আর'='আঁর' COLLATE utf8mb4_0900_ai_ci           |
+-------------------------------------------------+
|                                               1 |
+-------------------------------------------------+

That is, declare the column to have COLLATE utf8mb4_0900_ai_ci;

The two item you mention have these HEX values: E0A686 E0A6B0 and E0A686 E0A681 E0A6B0

E0A681 is a non-spacing-modifier "BENGALI SIGN CANDRABINDU". When collating with "Accent Insensitive" and (or?) "Case Insensitive", the two are considered equal.

If you want them to be treated as not-equal, use a different collation.