How to display Arabic or Spanish Mojibake characters correctly on the UI

238 Views Asked by At

I have a web-based project written in Perl and MySQL. we use template toolkit for rendering the HTML page. we have written a custom wrapper on DBI class to interact with MySQL.

Since it is a legacy project, most of our tables are latin1 charset. Recently we started supporting Multi-language UI ( Spanish and Arabic ). Because of this, Users are entering the data in their native languages. The data is not encoded and it is saved into the DB. Now we have a ETL process which moves data from our primary DB to star schema DB which are UTF8 tables. This DB is used only for reports. Now in one of the reports we are getting unreadable characters.

enter image description here

I am changing the query to retrieve this data as

SET NAMES latin1;
select 
CONVERT(CAST(CONVERT( label USING  latin1) AS BINARY) USING utf8) AS label
from report_table
where report_id = '999';

I am using sqlyug to execute the query. If I do not set the SET NAMES latin1;, then the value is still not readable. But when I am trying the same change in the code level, I am not able to get the application display the data correctly. In our custom DB wrapper we are not setting any particular settings.

we are not setting SET character_set_results. we are not even setting enable_utf8 => 0,

I am not sure what needs to be done.
I know it is a better solution to correct the data in the primary DB before ETL but it is a long term plan. I need some solution which can fix the issue temporarily and not block the customers.

Let me know if you need any more information.

2

There are 2 best solutions below

0
Loukan ElKadi On

Try enabling UTF-8 for the DB connection as such: $dbh->{'mysql_enable_utf8'} = 1;

As I remember, using SET NAMES is not the most recommended way to do this, but in case you want to use it, should you try SET NAMES utf8 instead of latin1

0
Rick James On

A little bit shorter:

CONVERT(BINARY('انننانعن') USING utf8) --> انننانعن 

But... You are asking for more trouble if you don't stop and get everything into utf8 now.

  • Bytes in client;
  • Connection SET NAMES utf8mb4 or by other means -- for both writing and reading;
  • Column / table declared CHARACTER SET utf8mb4;
  • HTML meta tag with charset=UTF-8.

Mojibake usually occurs from

  • The bytes you have in the client are correctly encoded in utf8 (good).
  • You connected with SET NAMES latin1 (or set_charset('latin1') or ...), probably by default. (It should have been utf8mb4.)
  • The column in the tables may or may not have been CHARACTER SET utf8mb4, but it should have been that.

If you need to fix the data it takes a "2-step ALTER", something like

ALTER TABLE Tbl MODIFY COLUMN col VARBINARY(...) ...;
ALTER TABLE Tbl MODIFY COLUMN col VARCHAR(...) ... CHARACTER SET utf8mb4 ...;