How to fix the encoding of a string in JavaScript


I received a broken string from another piece of software. I would like to fix its encoding in JavaScript, but I feel I am missing something.

Here's an example of the broken string: Détecté Ã lors ôù
And the expected output would be: Détecté àlors ôùi

I don't know the encoding used to send me the string.

My idea is to use the TextDecoder API: convert the string to bytes, then re-encode it as UTF-8 or UTF-16.

Here's the piece of code I used to detect the charset used:

const str = 'Détecté àlors ôùi';          // expected text
const str2 = 'Détecté Ã\u00A0lors ôù'; // broken text; \u00A0 is the non-breaking space after Ã

const charsets = [
  'utf-8',
  "ibm866",
  "iso-8859-2",
  "iso-8859-3",
  "iso-8859-4",
  "iso-8859-5",
  "iso-8859-6",
  "iso-8859-7",
  "iso-8859-8",
  "iso-8859-8-i",
  "iso-8859-10",
  "iso-8859-13",
  "iso-8859-14",
  "iso-8859-15",
  "iso-8859-16",
  "koi8-r",
  "koi8-u",
  "macintosh",
  "windows-874",
  "windows-1250",
  "windows-1251",
  "windows-1252",
  "windows-1253",
  "windows-1254",
  "windows-1255",
  "windows-1256",
  "windows-1257",
  "windows-1258",
  "x-mac-cyrillic",
  "gbk",
  "gb18030",
  "hz-gb-2312",
  "big5",
  "euc-jp",
  "iso-2022-jp",
  "shift-jis",
  "euc-kr",
  "iso-2022-kr",
  "utf-16be",
  "utf-16le",
  "iso-2022-cn"
];

const encoder = new TextEncoder();
const view = encoder.encode(str2);

console.log('__________________')

charsets.forEach((charset) => {
  try {
    // fatal and ignoreBOM are constructor options; decode() only accepts { stream }
    const decoder = new TextDecoder(charset, {
      fatal: false,
      ignoreBOM: true,
    });
    const fixedStr = decoder.decode(view);

    console.log(charset, fixedStr);
  } catch (e) {
    console.log(charset, 'invalid');
  }
})

(The code can be tested here: https://jsfiddle.net/tashebwj/)

The output is the following:

__________________
?editor_console=true:57 utf-8 Détecté àlors ôù
?editor_console=true:57 ibm866 D├Г┬йtect├Г┬й ├Г┬аlors ├Г┬┤├Г┬╣
?editor_console=true:57 iso-8859-2 DĂŠtectĂŠ Ă lors Ă´Ăš
?editor_console=true:57 iso-8859-3 D�Âİtect�Âİ � lors �´�Âı
?editor_console=true:57 iso-8859-4 DÊtectÊ àlors ôÚ
?editor_console=true:57 iso-8859-5 DУТЉtectУТЉ УТ lors УТДУТЙ
?editor_console=true:57 iso-8859-6 Dأآ�tectأآ� أآ lors أآ�أآ�
?editor_console=true:57 iso-8859-7 DΓΒ©tectΓΒ© ΓΒ lors ΓΒ΄ΓΒΉ
?editor_console=true:57 iso-8859-8 D��©tect��© �� lors ��´��¹
?editor_console=true:57 iso-8859-8-i D��©tect��© �� lors ��´��¹
?editor_console=true:57 iso-8859-10 DÃÂĐtectÃÂĐ Ã lors ÃÂīÃÂđ
?editor_console=true:57 iso-8859-13 DĆĀ©tectĆĀ© ĆĀ lors ĆĀ“ĆĀ¹
?editor_console=true:57 iso-8859-14 Détecté àlors ÃÂṀÃÂṗ
?editor_console=true:57 iso-8859-15 Détecté àlors ÃŽù
?editor_console=true:57 iso-8859-16 DĂ©tectĂ© Ă lors ĂÂŽĂÂč
?editor_console=true:57 koi8-r Dц┐б╘tectц┐б╘ ц┐б═lors ц┐б╢ц┐б╧
?editor_console=true:57 koi8-u Dц┐б╘tectц┐б╘ ц┐б═lors ц┐бЄц┐б╧
?editor_console=true:57 macintosh Détecté àlors ôù
?editor_console=true:57 windows-874 Dรยฉtectรยฉ รย lors รยดรยน
?editor_console=true:57 windows-1250 DĂ©tectĂ© Ă lors Ă´ĂÂą
?editor_console=true:57 windows-1251 DГѓВ©tectГѓВ© ГѓВ lors ГѓВґГѓВ№
?editor_console=true:57 windows-1252 Détecté àlors ôù
?editor_console=true:57 windows-1253 Détecté àlors ôù
?editor_console=true:57 windows-1254 Détecté àlors ôù
?editor_console=true:57 windows-1255 Dֳƒֲ©tectֳƒֲ© ֳƒֲ lors ֳƒֲ´ֳƒֲ¹
?editor_console=true:57 windows-1256 Dأƒآ©tectأƒآ© أƒآ lors أƒآ´أƒآ¹
?editor_console=true:57 windows-1257 DĆĀ©tectĆĀ© ĆĀ lors ĆĀ´ĆĀ¹
?editor_console=true:57 windows-1258 DĂƒÂ©tectĂƒÂ© ĂƒÂ lors ĂƒÂ´ĂƒÂ¹
?editor_console=true:57 x-mac-cyrillic D√Г¬©tect√Г¬© √Г¬†lors √Г¬і√Г¬є
?editor_console=true:57 gbk D脙漏tect脙漏 脙聽lors 脙麓脙鹿
?editor_console=true:57 gb18030 D脙漏tect脙漏 脙聽lors 脙麓脙鹿
?editor_console=true:57 hz-gb-2312 invalid
?editor_console=true:57 big5 D�穢tect�穢 ��饊ors �織�繒
?editor_console=true:57 euc-jp D�息tect�息 ��lors �卒�孫
?editor_console=true:57 iso-2022-jp D����tect���� ����lors ��������
?editor_console=true:57 shift-jis Dテδゥtectテδゥ テδ�lors テδエテδケ
?editor_console=true:57 euc-kr D횄짤tect횄짤 횄혻lors 횄쨈횄쨔
?editor_console=true:57 iso-2022-kr invalid
?editor_console=true:57 utf-16be 䓃菂ꥴ散瓃菂ꤠ쎃슠汯牳⃃菂듃菂�
?editor_console=true:57 utf-16le 썄슃璩捥썴슃₩菃ꃂ潬獲쌠슃쎴슃�
?editor_console=true:57 iso-2022-cn invalid

Why does this method not work? Is it possible to fix the string with this method, or in another way?
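
A likely reason, offered as a sketch rather than a definitive answer: TextEncoder always produces UTF-8, so encoding the broken string returns the UTF-8 bytes of the wrong characters, not the bytes the other software originally sent. That is consistent with the windows-1252 line of the output, which shows one more layer of damage instead of the fix. Assuming the text was UTF-8 that got decoded as windows-1252 somewhere along the way (an assumption, since the real source encoding is unknown), each character has to be mapped back to its windows-1252 byte by hand before decoding as UTF-8. The names CP1252_HIGH, cp1252Byte, undoOnce and fixMojibake below are hypothetical helpers, not part of any library.

```javascript
// Code points that windows-1252 assigns to bytes 0x80..0x9F.
// The five unassigned bytes map to the matching C1 control, as browsers do.
const CP1252_HIGH = [
  0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
  0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
  0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
  0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178,
];

// Reverse lookup: code point -> windows-1252 byte.
const cp1252Byte = new Map();
for (let b = 0; b < 0x80; b++) cp1252Byte.set(b, b);          // ASCII
CP1252_HIGH.forEach((cp, i) => cp1252Byte.set(cp, 0x80 + i)); // 0x80..0x9F
for (let b = 0xA0; b <= 0xFF; b++) cp1252Byte.set(b, b);      // same as Latin-1

// Undo one round of "UTF-8 bytes decoded as windows-1252".
// Returns null when the text cannot be the result of that mistake.
function undoOnce(text) {
  const bytes = [];
  for (const ch of text) {
    const b = cp1252Byte.get(ch.codePointAt(0));
    if (b === undefined) return null; // character does not exist in windows-1252
    bytes.push(b);
  }
  try {
    // fatal: true makes invalid UTF-8 throw instead of yielding U+FFFD.
    return new TextDecoder('utf-8', { fatal: true }).decode(Uint8Array.from(bytes));
  } catch (e) {
    return null;
  }
}

// Repeat while another round still decodes cleanly (handles double encoding).
function fixMojibake(text, maxRounds = 3) {
  let current = text;
  for (let i = 0; i < maxRounds; i++) {
    const next = undoOnce(current);
    if (next === null || next === current) break;
    current = next;
  }
  return current;
}

const broken = 'Détecté Ã\u00A0lors ôù'; // \u00A0: non-breaking space
console.log(fixMojibake(broken)); // Détecté àlors ôù
```

The loop stops as soon as another round would produce invalid UTF-8, so correctly encoded accented text is returned unchanged. It can still misfire on text that legitimately contains sequences such as "Ã©", so it is probably safer to apply it only to strings known to come from the faulty source.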
