iconv //TRANSLIT does not transliterate accented characters from UTF-8 to US-ASCII


I need to eliminate special characters in a large .xml file. So, I need a file to go from UTF-8 to US-ASCII. I believe I should be able to use iconv to do this with the following command:

iconv -f UTF-8 -t US-ASCII//TRANSLIT//IGNORE sample1.xml -o sample2.xml

Here are a few lines of the input file:

  • ...from regjsparser’s AST...
  • ...returning “symbol” for...
  • ...foo-bar → fooBar...
  • ...André Cruz...
  • ...Kat Marchán...

And here is the output of those snippets:

  • ...from regjsparser's AST... (replaced RIGHT SINGLE QUOTATION MARK with APOSTROPHE)
  • ...returning "symbol" for... (replaced LEFT/RIGHT DOUBLE QUOTATION MARKS with plain QUOTATION MARKS)
  • ...foo-bar -> fooBar... (replaced RIGHTWARDS ARROW with HYPHEN-MINUS and GREATER-THAN SIGN)
  • ...Andr? Cruz... (failed to replace E WITH ACUTE / U+00E9 with a plain e)
  • ...Kat March?n... (failed to replace A WITH ACUTE / U+00E1 with a plain a)

Clearly the tool is working, because it replaces some of the characters, but it never replaces accented letters. These files are software bill-of-materials (BOM) files generated by CycloneDX, so they should be plain UTF-8 to begin with. The iconv installed on the machine is the one shipped with glibc 2.31 on Debian.

I have no idea why it is struggling with accented chars.

EDIT: Here is the printout of the locale and locale -a commands. Not sure if these values are relevant to this problem or not.

locale

+ locale
locale: Cannot set LC_CTYPE to default locale: No such file or directory
locale: Cannot set LC_MESSAGES to default locale: No such file or directory
locale: Cannot set LC_ALL to default locale: No such file or directory
LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

locale -a

+ locale -a
locale: Cannot set LC_CTYPE to default locale: No such file or directory
locale: Cannot set LC_MESSAGES to default locale: No such file or directory
locale: Cannot set LC_COLLATE to default locale: No such file or directory
C
C.UTF-8
POSIX

I'm struggling to understand what these LC values mean and how they work.

There are 1 best solutions below

Achiral

I fixed this by running:

export LC_ALL="C.UTF-8"
iconv -f UTF-8 -t US-ASCII//TRANSLIT//IGNORE sample1.xml -o sample2.xml

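It seems that the locale variables default to en_US.UTF-8 even though that locale does not exist on the machine, which is what the "Cannot set LC_CTYPE to default locale" warnings are reporting. With no usable locale, iconv apparently falls back to the plain C locale, whose transliteration rules cover characters such as curly quotes and arrows but not accented letters. So you need to run locale -a to see which locales you actually have and choose one that closely fits your needs. It seems that almost anything labeled xx.UTF-8 will work for transliteration purposes.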
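The "run locale -a and pick one" step could be scripted roughly as follows. This is only a sketch: pick_utf8_locale is a hypothetical helper name, the candidate list is an assumption about which names locale -a might report, and a short sample string is piped through stdin in place of the real sample1.xml.

```shell
#!/bin/sh
# Hypothetical helper: print the name of an installed UTF-8 locale,
# preferring C.UTF-8, or return non-zero if none is found.
pick_utf8_locale() {
    # Discard the "Cannot set LC_..." warnings; only the list matters.
    available=$(locale -a 2>/dev/null)

    # Assumed spellings; different glibc versions list "C.UTF-8" or "C.utf8".
    for candidate in C.UTF-8 C.utf8 en_US.UTF-8 en_US.utf8; do
        if printf '%s\n' "$available" | grep -qxF "$candidate"; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done

    # Otherwise take the first locale whose name ends in utf8 / UTF-8.
    fallback=$(printf '%s\n' "$available" | grep -i 'utf-*8$' | head -n 1)
    [ -n "$fallback" ] && printf '%s\n' "$fallback"
}

if utf8_locale=$(pick_utf8_locale); then
    # \303\251 is the UTF-8 encoding of U+00E9 (e with acute).
    printf 'Andr\303\251 Cruz\n' |
        LC_ALL="$utf8_locale" iconv -f UTF-8 -t US-ASCII//TRANSLIT
else
    echo "no UTF-8 locale installed" >&2
fi
```

If the accented letter still comes out as ?, the chosen locale probably lacks the transliteration rules, and another entry from locale -a may be worth trying.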

The exported value applies only to the current shell session and has to be set again every time you start a new one. If you want to set the locale values permanently, you will need to do something like this: https://www.tecmint.com/set-system-locales-in-linux/
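If the locale is only needed for the conversion itself, an alternative to export is to put the assignment in front of the command, so it applies to that one process only. The sketch below just demonstrates the scoping; it assumes a POSIX sh and uses C.UTF-8 because that is what locale -a listed in the question.

```shell
#!/bin/sh
# Start from a known state so the comparison below is meaningful.
unset LC_ALL

# Prefix form: LC_ALL is placed in the environment of this one child only.
child_value=$(LC_ALL=C.UTF-8 sh -c 'printf "%s" "$LC_ALL"')

# The calling shell is unaffected, unlike after `export LC_ALL=...`.
parent_value=${LC_ALL-unset}

echo "child process sees LC_ALL=$child_value"
echo "calling shell sees LC_ALL=$parent_value"

# Applied to the conversion from the question, this would be:
#   LC_ALL=C.UTF-8 iconv -f UTF-8 -t US-ASCII//TRANSLIT//IGNORE sample1.xml -o sample2.xml
```

This keeps the rest of the session on whatever locale it already had, which may matter if other tools depend on it.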