I have a big text file with different entries; some are plain ASCII, some are UTF-8, and some look like double-encoded UTF-8.
Here's the content of the file as cat shows it:
'Böker'
'für'
And here's what less would show:
'BÃ<U+0083>¶ker'
'für'
This is what I would like to get (clean ISO-8859-1):
'Böker'
'für'
Using iconv --from-code=UTF-8 --to-code=ISO-8859-1, this is the result:
'Böker'
'für'
Using iconv --from-code=UTF-8 --to-code=ISO-8859-1 twice (with the same parameters) gives the correct ö, but converts the ü as well (output from less):
'Böker'
'f<FC>r'
One approach would be to test, in bash, which encoding each string is currently in. I searched quite a lot for this, but couldn't find a suitable answer.
Another approach would be a program that converts the strings directly to the correct encoding, but I couldn't find another program besides iconv, and since <FC> is a perfectly valid character in ISO-8859-1, neither using "-c" nor appending "//IGNORE" to the --to-code changes the output.
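For reference, these are roughly the invocations I tried (data.txt stands in for my real file):

    # one pass
    iconv --from-code=UTF-8 --to-code=ISO-8859-1 data.txt > pass1.txt
    # a second pass with the same parameters
    iconv --from-code=UTF-8 --to-code=ISO-8859-1 pass1.txt > pass2.txt
    # neither of these changes the output
    iconv -c --from-code=UTF-8 --to-code=ISO-8859-1 data.txt
    iconv --from-code=UTF-8 --to-code=ISO-8859-1//IGNORE data.txt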
It's impossible to solve this in a general way (what if both 'Böker' and 'Böker' could be valid input?), but usually you can find a heuristic that works for your data. Since you seem to have only or mostly German-language strings, the problematic characters are ÄÖÜäöüß. One approach would be to search every entry for these characters in ISO-8859-1, UTF-8 and double-encoded UTF-8. If a match is found, simply assume that this is the correct encoding. If you're using bash, you can grep for the byte sequences using the $'\xnn' syntax. You only have to make sure that grep uses the C locale. Here's an example for the character ö (output from a UTF-8 console):
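(A sketch: the byte sequences below are the standard encodings of ö, namely 0xF6 in ISO-8859-1, 0xC3 0xB6 in UTF-8, and 0xC3 0x83 0xC2 0xB6 once the UTF-8 bytes have been encoded a second time.)

    # double-encoded UTF-8 ö, which matches the mojibake entry:
    $ echo 'Böker' | LC_ALL=C grep $'\xc3\x83\xc2\xb6'
    Böker
    # plain UTF-8 ö:
    $ echo 'Böker' | LC_ALL=C grep $'\xc3\xb6'
    Böker
    # ISO-8859-1 ö is the single byte 0xf6; grep -c just counts the matching lines:
    $ printf 'B\xf6ker\n' | LC_ALL=C grep -c $'\xf6'
    1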
But it's probably easier to solve this with a scripting language like Perl or Python.
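That said, a rough bash sketch of the whole detect-and-convert loop could look like the following (it only handles ö and ü, input.txt and output.txt are placeholder names, and extending the patterns to all of ÄÖÜäöüß is mechanical):

    # Classify each line by the encoding of its ö/ü and convert it to ISO-8859-1.
    while IFS= read -r line; do
        if printf '%s\n' "$line" | LC_ALL=C grep -qE $'\xc3\x83\xc2\xb6|\xc3\x83\xc2\xbc'; then
            # double-encoded UTF-8: convert twice
            printf '%s\n' "$line" | iconv -f UTF-8 -t ISO-8859-1 | iconv -f UTF-8 -t ISO-8859-1
        elif printf '%s\n' "$line" | LC_ALL=C grep -qE $'\xc3\xb6|\xc3\xbc'; then
            # plain UTF-8: convert once
            printf '%s\n' "$line" | iconv -f UTF-8 -t ISO-8859-1
        else
            # plain ASCII (or already ISO-8859-1): copy as-is
            printf '%s\n' "$line"
        fi
    done < input.txt > output.txt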