I need to eliminate special characters from a large .xml file, so I need to convert the file from UTF-8 to US-ASCII. I believe I should be able to use iconv to do this with the following command:
iconv -f UTF-8 -t US-ASCII//TRANSLIT//IGNORE sample1.xml -o sample2.xml
Here are a few lines of the input file:
- ...from regjsparser’s AST...
- ...returning “symbol” for...
- ...foo-bar→fooBar...
- ...André Cruz...
- ...Kat Marchán...
And here is the output of those snippets:
- ...from regjsparser's AST... (replaced RIGHT SINGLE QUOTE with APOSTROPHE)
- ...returning "symbol" for... (replaced LEFT/RIGHT DOUBLE QUOTES with regular QUOTES)
- ...foo-bar->fooBar... (replaced RIGHTWARDS ARROW with DASH and GREATER THAN)
- ...Andr? Cruz... (failed to identify/replace ACUTE E / U+00E9 with regular E)
- ...Kat March?n... (failed to identify/replace ACUTE A / U+00E1 with regular A)
Clearly the tool is working, because it replaces some of the characters, but it never replaces accented letters.
These files are BOM (bill of materials) files generated by CycloneDX, so they should be UTF-8 encoded to begin with.
The iconv installed on the machine comes from Debian's glibc 2.31.
I have no idea why it is struggling with accented characters.
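Two quick checks can rule out a malformed input file before looking elsewhere. This is only a sketch: the function names are made up for illustration, and it assumes a grep that supports -a (GNU and BSD grep do).

```shell
# Hypothetical helper names; not part of iconv or CycloneDX.

# Succeeds only if every byte sequence in the file decodes as UTF-8.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

# Prints, with line numbers, each line holding a byte that is neither
# printable ASCII nor whitespace. Every byte of a multi-byte UTF-8
# character falls in that set. LC_ALL=C forces byte-wise matching, so the
# result does not depend on the (possibly broken) session locale.
list_non_ascii_lines() {
    LC_ALL=C grep -an '[^[:print:][:space:]]' "$1"
}
```

For the file in question this could be run as: is_valid_utf8 sample1.xml && list_non_ascii_lines sample1.xml. If the first check passes, the input encoding is probably not the culprit.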
EDIT: Here is the printout of the locale and locale -a commands. Not sure if these values are relevant to this problem or not.
locale
+ locale
locale: Cannot set LC_CTYPE to default locale: No such file or directory
locale: Cannot set LC_MESSAGES to default locale: No such file or directory
locale: Cannot set LC_ALL to default locale: No such file or directory
LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
locale -a
+ locale -a
locale: Cannot set LC_CTYPE to default locale: No such file or directory
locale: Cannot set LC_MESSAGES to default locale: No such file or directory
locale: Cannot set LC_COLLATE to default locale: No such file or directory
C
C.UTF-8
POSIX
I'm struggling to understand what these LC values mean and how they work.
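As a rough guide to how these variables interact: LC_ALL, when set, overrides everything; otherwise the specific LC_ variable for a category applies; otherwise LANG; otherwise the POSIX default. In the locale printout, values shown in double quotes are implied by LANG rather than set explicitly. The "Cannot set ..." warnings suggest that en_US.UTF-8 is requested but not installed, which matches locale -a listing only C, C.UTF-8 and POSIX. The sketch below mimics that precedence in plain shell; effective_locale is a hypothetical helper, not a standard command.

```shell
# Hypothetical helper: reports which value would govern one locale category,
# following the precedence LC_ALL > LC_<category> > LANG > POSIX default.
# It only inspects the variables; it does not check that the locale is installed.
effective_locale() {
    case $1 in
        ''|*[!A-Z_]*) echo "usage: effective_locale LC_CTYPE" >&2; return 2 ;;
    esac
    eval "category_value=\${$1:-}"   # $1 was validated above, so eval is safe here
    if [ -n "${LC_ALL:-}" ]; then
        printf '%s\n' "$LC_ALL"
    elif [ -n "$category_value" ]; then
        printf '%s\n' "$category_value"
    elif [ -n "${LANG:-}" ]; then
        printf '%s\n' "$LANG"
    else
        printf 'POSIX\n'
    fi
}
```

With the values from the printout above (LC_ALL empty, LC_CTYPE only implied, LANG=en_US.UTF-8), effective_locale LC_CTYPE prints en_US.UTF-8, which is a locale the machine does not actually have.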
Fixed this by exporting a locale value that actually exists on the machine.
It seems that the locale parameters default to en_US.UTF-8, even if that locale does not exist on the machine. So you need to run locale -a to determine what options you have and choose one that closely fits your needs. It seems that almost anything labelled xx.UTF-8 will work for transliteration purposes.
I've read that this exported value applies only to your current session and needs to be set again every time you start a new session. If you want to set the locale values permanently, you will need to do something like this: https://www.tecmint.com/set-system-locales-in-linux/
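The exact command is not shown above, so the following is only a sketch of what such a fix could look like. It assumes LC_ALL is the variable being set, and it uses C.UTF-8 in the example because that is the only UTF-8 locale in the locale -a output from the question. Both function names are hypothetical.

```shell
# Hypothetical helpers; the variable (LC_ALL) and the locale (C.UTF-8) are
# assumptions, since the original command is not shown.

# Prints the first installed locale whose name ends in UTF-8 or utf8, if any.
first_utf8_locale() {
    locale -a 2>/dev/null | grep -i 'utf-\{0,1\}8$' | head -n 1
}

# Converts UTF-8 input to ASCII with transliteration under a given locale.
# Usage: to_ascii LOCALE INPUT OUTPUT
to_ascii() {
    LC_ALL=$1 iconv -f UTF-8 -t ASCII//TRANSLIT "$2" > "$3"
}

# Example (not run here):
#   to_ascii C.UTF-8 sample1.xml sample2.xml
```

Setting LC_ALL only on the iconv invocation keeps the change local to that one command. Exporting it instead (export LC_ALL=C.UTF-8) would affect the rest of the current shell session, but no later sessions.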