I received a string with a broken encoding from another piece of software. I would like to fix it in JavaScript, but I feel I am missing something.
Here's an example of a broken string: Détecté à lors ôù
And the expected output would be: Détecté àlors ôùi
I don't know the encoding used to send me the string.
My idea is to use the TextDecoder API: convert the string to bytes, then re-encode it as UTF-8 or UTF-16.
Here's the code I used to try to detect the charset:
const str = 'Détecté àlors ôùi';
const str2 = 'Détecté à lors ôù';
const charsets = [
'utf-8',
"ibm866",
"iso-8859-2",
"iso-8859-3",
"iso-8859-4",
"iso-8859-5",
"iso-8859-6",
"iso-8859-7",
"iso-8859-8",
"iso-8859-8-i",
"iso-8859-10",
"iso-8859-13",
"iso-8859-14",
"iso-8859-15",
"iso-8859-16",
"koi8-r",
"koi8-u",
"macintosh",
"windows-874",
"windows-1250",
"windows-1251",
"windows-1252",
"windows-1253",
"windows-1254",
"windows-1255",
"windows-1256",
"windows-1257",
"windows-1258",
"x-mac-cyrillic",
"gbk",
"gb18030",
"hz-gb-2312",
"big5",
"euc-jp",
"iso-2022-jp",
"shift-jis",
"euc-kr",
"iso-2022-kr",
"utf-16be",
"utf-16le",
"iso-2022-cn"
];
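const encoder = new TextEncoder(); // TextEncoder always produces UTF-8 bytes
const view = encoder.encode(str2);
console.log('__________________');
charsets.forEach((charset) => {
  try {
    // fatal and ignoreBOM are constructor options, not decode() options
    const decoder = new TextDecoder(charset, {
      fatal: false,
      ignoreBOM: true,
    });
    const fixedStr = decoder.decode(view);
    console.log(charset, fixedStr);
  } catch (e) {
    // The constructor throws a RangeError for unsupported labels
    console.log(charset, 'invalid');
  }
});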
(the code can be tested here: https://jsfiddle.net/tashebwj/ )
The output is the following:
__________________
?editor_console=true:57 utf-8 Détecté à lors ôù
?editor_console=true:57 ibm866 D├Г┬йtect├Г┬й ├Г┬аlors ├Г┬┤├Г┬╣
?editor_console=true:57 iso-8859-2 DĂŠtectĂŠ Ă lors Ă´Ăš
?editor_console=true:57 iso-8859-3 D�Âİtect�Âİ � lors �´�Âı
?editor_console=true:57 iso-8859-4 DÊtectÊ àlors ôÚ
?editor_console=true:57 iso-8859-5 DУТЉtectУТЉ УТ lors УТДУТЙ
?editor_console=true:57 iso-8859-6 Dأآ�tectأآ� أآ lors أآ�أآ�
?editor_console=true:57 iso-8859-7 DΓΒ©tectΓΒ© ΓΒ lors ΓΒ΄ΓΒΉ
?editor_console=true:57 iso-8859-8 D��©tect��© �� lors ��´��¹
?editor_console=true:57 iso-8859-8-i D��©tect��© �� lors ��´��¹
?editor_console=true:57 iso-8859-10 DÃÂĐtectÃÂĐ ÃÂ lors ÃÂīÃÂđ
?editor_console=true:57 iso-8859-13 DĆĀ©tectĆĀ© ĆĀ lors ĆĀ“ĆĀ¹
?editor_console=true:57 iso-8859-14 Détecté àlors ÃÂṀÃÂṗ
?editor_console=true:57 iso-8859-15 Détecté àlors ÃŽù
?editor_console=true:57 iso-8859-16 DĂ©tectĂ© Ă lors ĂÂŽĂÂč
?editor_console=true:57 koi8-r Dц┐б╘tectц┐б╘ ц┐б═lors ц┐б╢ц┐б╧
?editor_console=true:57 koi8-u Dц┐б╘tectц┐б╘ ц┐б═lors ц┐бЄц┐б╧
?editor_console=true:57 macintosh Détecté àlors ôù
?editor_console=true:57 windows-874 Dรยฉtectรยฉ รย lors รยดรยน
?editor_console=true:57 windows-1250 DĂ©tectĂ© Ă lors Ă´ĂÂą
?editor_console=true:57 windows-1251 DГѓВ©tectГѓВ© ГѓВ lors ГѓВґГѓВ№
?editor_console=true:57 windows-1252 Détecté àlors ôù
?editor_console=true:57 windows-1253 Détecté àlors ôù
?editor_console=true:57 windows-1254 Détecté àlors ôù
?editor_console=true:57 windows-1255 Dֳƒֲ©tectֳƒֲ© ֳƒֲ lors ֳƒֲ´ֳƒֲ¹
?editor_console=true:57 windows-1256 Dأƒآ©tectأƒآ© أƒآ lors أƒآ´أƒآ¹
?editor_console=true:57 windows-1257 DĆĀ©tectĆĀ© ĆĀ lors ĆĀ´ĆĀ¹
?editor_console=true:57 windows-1258 DĂƒÂ©tectĂƒÂ© ĂƒÂ lors ĂƒÂ´ĂƒÂ¹
?editor_console=true:57 x-mac-cyrillic D√Г¬©tect√Г¬© √Г¬†lors √Г¬і√Г¬є
?editor_console=true:57 gbk D脙漏tect脙漏 脙聽lors 脙麓脙鹿
?editor_console=true:57 gb18030 D脙漏tect脙漏 脙聽lors 脙麓脙鹿
?editor_console=true:57 hz-gb-2312 invalid
?editor_console=true:57 big5 D�穢tect�穢 ��饊ors �織�繒
?editor_console=true:57 euc-jp D�息tect�息 ��lors �卒�孫
?editor_console=true:57 iso-2022-jp D����tect���� ����lors ��������
?editor_console=true:57 shift-jis Dテδゥtectテδゥ テδ�lors テδエテδケ
?editor_console=true:57 euc-kr D횄짤tect횄짤 횄혻lors 횄쨈횄쨔
?editor_console=true:57 iso-2022-kr invalid
?editor_console=true:57 utf-16be 䓃菂ꥴ散瓃菂ꤠ쎃슠汯牳菂듃菂�
?editor_console=true:57 utf-16le 썄슃璩捥썴슃₩菃ꃂ潬獲쌠슃쎴슃�
?editor_console=true:57 iso-2022-cn invalid
Why does this method not work? Is it possible to fix the string with this method, or in another way?
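One direction that might be worth exploring, as a rough sketch only: TextEncoder can only emit UTF-8, so encoding the broken string and decoding it with another charset adds a layer of garbling instead of removing one. The sketch below goes the other way. It assumes (this is a guess, not something the sender confirmed) that the string is UTF-8 that was misread as windows-1252, possibly more than once. It maps each character back to its windows-1252 byte and decodes those bytes as UTF-8, repeating while that succeeds. `undoOneLayer` and `fixMojibake` are hypothetical helper names, and the byte table is hard-coded from the WHATWG windows-1252 index.

```javascript
// Code points that windows-1252 assigns to bytes 0x80..0x9F (WHATWG mapping);
// every other byte maps to the code point with the same value.
const CP1252_HIGH = [
  0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
  0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
  0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
  0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178,
];

// Reverse lookup: code point -> windows-1252 byte.
const codePointToByte = new Map();
for (let byte = 0; byte < 256; byte++) {
  const cp = byte >= 0x80 && byte <= 0x9f ? CP1252_HIGH[byte - 0x80] : byte;
  codePointToByte.set(cp, byte);
}

// Hypothetical helper: undo one layer of "UTF-8 read as windows-1252".
// Returns null when the string cannot be such a misreading.
function undoOneLayer(text) {
  const bytes = [];
  for (const ch of text) {
    const byte = codePointToByte.get(ch.codePointAt(0));
    if (byte === undefined) return null; // not representable in windows-1252
    bytes.push(byte);
  }
  try {
    // fatal: true makes invalid UTF-8 throw instead of yielding U+FFFD
    return new TextDecoder('utf-8', { fatal: true }).decode(Uint8Array.from(bytes));
  } catch (e) {
    return null; // the bytes are not valid UTF-8
  }
}

// Hypothetical helper: peel layers until nothing more can be undone.
function fixMojibake(text, maxLayers = 5) {
  let current = text;
  for (let i = 0; i < maxLayers; i++) {
    const next = undoOneLayer(current);
    if (next === null || next === current) break;
    current = next;
  }
  return current;
}

// Doubly garbled "Détecté", written with escapes to avoid copy-paste damage.
console.log(fixMojibake('D\u00C3\u0192\u00C2\u00A9tect\u00C3\u0192\u00C2\u00A9')); // Détecté
```

One caveat with this sketch: a garbled "à" contains a non-breaking space (U+00A0). If that character was turned into a regular space somewhere along the way, the bytes are no longer valid UTF-8 and the helper returns the input unchanged, so the space would have to be restored to U+00A0 first.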