Can we have performance issues when using mixed table charsets in MySQL or in Postgres?

I'm working on a database whose tables use different charsets. Since it's a big database, I was wondering whether this could lead to performance issues. Granted, the usual value comparison a DB does is the JOIN, and ours are done comparing integers, but are there any other performance problems we could run into when tables use different charsets, other than the extra space some charsets take?

MySQL:
For zip_code (postal_code), stored as a string (CHAR or VARCHAR), most charsets work equally well. However, when JOINing on such a column, the collation must be the same.
- If it is the same, an index on that column can be used.
- If it is not, then the index is useless, and the query must scan the entire table.
Since the collation includes the charset, that forces the charset to be the same, also.
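As a sketch of that JOIN case (the table and column names here are invented, and the collations are just examples of a mismatch):

```sql
-- Two hypothetical tables whose postal_code columns use different
-- character sets and therefore different collations.
CREATE TABLE customers (
    id          INT PRIMARY KEY,
    postal_code VARCHAR(10)
        CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci,
    KEY idx_postal (postal_code)
);

CREATE TABLE delivery_zones (
    id          INT PRIMARY KEY,
    postal_code VARCHAR(10)
        CHARACTER SET latin1 COLLATE latin1_swedish_ci,
    KEY idx_postal (postal_code)
);

-- Joining on postal_code forces one side to be converted to the other's
-- charset/collation, so (as described above) the index on that side is
-- useless and the join has to scan the table.
EXPLAIN
SELECT c.id, z.id
FROM customers AS c
JOIN delivery_zones AS z ON z.postal_code = c.postal_code;
```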
The choice of collation is rather minor. However, if there can be letters in the string (postal_code, country_code, etc.), you need to decide whether to force the tables (and user queries) to use a particular case.
- Collation ..._bin treats cases as different: 'de' won't match 'DE' (for Germany).
- Collation ..._ci is "Case Insensitive", so those would match.
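A quick sketch of that difference (it assumes a utf8mb4 connection; the specific collation names are just examples of a _ci and a _bin collation):

```sql
-- Case-insensitive vs. binary collation on the same pair of strings.
SELECT 'de' = 'DE' COLLATE utf8mb4_general_ci AS ci_match,   -- 1: they match
       'de' = 'DE' COLLATE utf8mb4_bin        AS bin_match;  -- 0: they don't
```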
If you do string comparisons with incompatible collations, those comparisons cannot use an index on the string column. I've seen this happen when doing a JOIN on a string column where the joined tables had different collations (and if they have different character sets, their collations are necessarily different too).
But you said your joins are on integer columns, not string columns. So joins shouldn't be a problem in your case.
You can also have performance problems when doing lookups against string columns if your table character set doesn't match your session character set.
Example: My table is defined with utf8mb4, but I set my session to utf8, so string literals will be utf8. Seems like a harmless change, right?
Mostly it is: the utf8 string 'abc123' has a clear way to be promoted to utf8mb4 to match the column it is compared to, so the lookup can still use an index on that column.
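A minimal sketch of that case (the table mytable, its column str_col, and its index are invented names; the outcomes in the comments are the behaviour described in this answer, not captured output):

```sql
SET NAMES utf8;  -- string literals from this session are now utf8 (utf8mb3)

-- str_col is an indexed utf8mb4 column. The utf8 literal is promoted to
-- utf8mb4, so EXPLAIN should show an indexed lookup (type: ref),
-- not a full table scan.
EXPLAIN SELECT * FROM mytable WHERE str_col = 'abc123';
```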
But if I force a specific collation that utf8mb4 does not support, the query has to do a table scan and compare rows one by one instead of an indexed lookup:
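Continuing the same sketch (utf8_general_ci is just one example of a collation that belongs to a charset other than utf8mb4):

```sql
-- Forcing a utf8 (utf8mb3) collation on the comparison means the indexed
-- utf8mb4 column can no longer be matched through its index; per the
-- description above, EXPLAIN shows a full table scan (type: ALL).
EXPLAIN SELECT * FROM mytable
WHERE str_col = 'abc123' COLLATE utf8_general_ci;
```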
There's a difference between implicit collations and explicit collations. Suppose I set my session to use something that doesn't have a clear path to utf8mb4:
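For example (latin1 here is only an illustration of such a session character set):

```sql
SET NAMES latin1;  -- literals are now latin1, not a utf8 subset of utf8mb4

-- No explicit COLLATE: under the implicit coercion rules the literal is
-- still converted to the column's utf8mb4 collation, so the indexed
-- lookup keeps working.
EXPLAIN SELECT * FROM mytable WHERE str_col = 'abc123';
```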
So far so good, but if I am explicit about the collation:
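A sketch of the explicit case (latin1_bin is only an example; the exact behaviour depends on the version and the collations involved):

```sql
-- An explicit COLLATE clause takes precedence over the column's collation,
-- so the comparison can no longer go through the index on str_col; MySQL
-- either falls back to comparing rows one by one or rejects the query with
-- an "Illegal mix of collations" error, depending on the collations used.
EXPLAIN SELECT * FROM mytable
WHERE str_col = 'abc123' COLLATE latin1_bin;
```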
The bottom line is that you should use the same character set and collation to make your life easier. Use it for all tables and for the session too.
In these modern times, it's hard to think of a reason to use anything other than utf8mb4.
P.S. Space shouldn't be a problem. UTF-8 character sets allow multibyte characters, but they don't expand characters that fit in a single byte: UTF-8 is a variable-width encoding, so characters in the ASCII range (0-127) are still stored as one byte each. Read the UTF-8 article on Wikipedia for details; it has a nice explanation.
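A quick way to see this from MySQL (the literals are arbitrary; LENGTH counts bytes, CHAR_LENGTH counts characters):

```sql
-- ASCII text costs one byte per character even in utf8mb4;
-- only characters outside the ASCII range need more bytes.
SELECT LENGTH(CONVERT('abc123' USING utf8mb4))      AS ascii_bytes,     -- 6
       CHAR_LENGTH(CONVERT('abc123' USING utf8mb4)) AS ascii_chars,     -- 6
       LENGTH(CONVERT('café' USING utf8mb4))        AS accented_bytes,  -- 5 ('é' is 2 bytes)
       CHAR_LENGTH(CONVERT('café' USING utf8mb4))   AS accented_chars;  -- 4
```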