Okay, not the clearest title ever; feel free to improve.
I have a table representing thousands of linguistic forms. Many of these make heavy use of diacritics, so all of aha, áha̱ and ā̧́ḫà̀ may appear. The table (and the database) uses UTF-8 as character set and utf8mb4_unicode_520_ci
as the default collation scheme, since searching should be case- and diacritic-agnostic (so searching for aha should bring up all three). These forms have all been entered in manually by human beings, though, so there are inevitably duplicates.
I’m currently trying to get a list of exactly identical forms in order to get rid of duplicates (manually – each token would have to be checked before being removed), but in this case I need to search in a diacritic-aware manner – that is, given the three tokens listed above, I would expect a search to yield no results, since they are three different forms because of the diacritics.
I figured this should be a fairly easy task; just do:
SELECT token FROM table GROUP BY token HAVING COUNT(token) > 1 COLLATE utf8mb4_bin
But alas, that does not work. Instead, it gives me an error message that “COLLATION utf8mb4_bin is not valid for CHARACTER SET latin1”. I should note that I have absolutely nothing Latin-1 anywhere – no character sets, no collations, no server charsets, nothing. There are also no stored procedures or anything else where Latin-1 might creep in.
No, this is because of this bug, which is apparently fixed from 5.7 onwards; see the description at the bottom:
For constructs such as ORDER BY numeric_expr COLLATE collation_name, the character set of the expression was treated as latin1, which resulted in an error if the collation specified after COLLATE is incompatible with latin1. Now when a numeric expression is implicitly cast to a character expression in the presence of COLLATE, the character set used is the one associated with the named collation.
Unfortunately, I’m on 5.6, and I don’t have the option of upgrading (annoyingly). Converting the data to Latin-1 is also not an option, nor is changing the collation on the table.
Is there a way to run my query or an equivalent one yielding the result set I’m after, without getting the collation error?
Comes back with
aha i
without any complaints about numeric stuff.
I ran it on 5.6.46, 5.7.26, 8.0.16 and several MariaDB versions.
What I am doing differently than your case?
When adding an explicate
COLLATE
clause, put it on the component of the query that needs it. (COLLATE
does not apply to the query as a whole; different parts can be collated differently.)