Convert between different Unicode encodings / Test a string's encoding in bash


I have a big text file with different entries: some are plain ASCII, some are UTF-8, and some look like double-encoded UTF-8.

Here's the content of the file as cat shows it:

'Böker'
'für'

And here's what less would show:

'BÃ<U+0083>¶ker'
'für'

This is what I would like to get (clean ISO-8859-1):

'Böker'
'für'

Using iconv --from-code=UTF-8 --to-code=ISO-8859-1, this is the result:

'Böker'
'für'

Using iconv --from-code=UTF-8 --to-code=ISO-8859-1 twice (with the same parameters) gives the correct ö, but it converts the ü as well (output from less):

'Böker'
'f<FC>r'

One approach would be to test, in bash, which Unicode encoding each string is currently in. I searched quite a lot for this but couldn't find a suitable answer.

Another approach would be a program that converts the strings directly to the correct encoding, but I couldn't find anything besides iconv, and since <FC> is a perfectly valid character in ISO-8859-1, neither using "-c" nor appending "//IGNORE" to the --to-code changes the output.

1 Answer

It's impossible to solve this in a general way (what if both 'Böker' and 'BÃ¶ker' could be valid input?), but usually you can find a heuristic that works for your data. Since you seem to have only or mostly German-language strings, the problematic characters are ÄÖÜäöüß. One approach would be to search every entry for these characters in ISO-8859-1, UTF-8 and double-encoded UTF-8. If a match is found, simply assume that this is the correct encoding.
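
To work out the byte sequences to search for, one way is to generate them with iconv and check the result with od (a sketch for ö, using -f/-t as the short forms of --from-code/--to-code; the other characters work the same way):

$ printf '\xF6' | iconv -f ISO-8859-1 -t UTF-8 | od -An -tx1
 c3 b6
$ printf '\xF6' | iconv -f ISO-8859-1 -t UTF-8 | iconv -f ISO-8859-1 -t UTF-8 | od -An -tx1
 c3 83 c2 b6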

If you're using bash, you can grep for the byte sequences using the $'\xnn' syntax. You only have to make sure that grep uses the C locale. Here's an example for the character ö (output from a UTF-8 console):

$ cat test.txt
B▒ker ISO-8859-1
Böker UTF-8
BÃ¶ker Double encoded UTF-8
$ LC_ALL=C grep $'\xF6' test.txt
B▒ker ISO-8859-1
$ LC_ALL=C grep $'\xC3\xB6' test.txt
Böker UTF-8
$ LC_ALL=C grep $'\xC3\x83\xC2\xB6' test.txt
BÃ¶ker Double encoded UTF-8
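
Building on that, here is a minimal sketch of how those grep tests could drive iconv to normalize a whole file to ISO-8859-1. It assumes one entry per line, that the only non-ASCII characters are the German ones listed above, and the file names entries.txt and entries_iso.txt are just placeholders:

#!/bin/bash
# Sketch: bring every line of entries.txt into ISO-8859-1.
# Assumes each line uses a single encoding and that the only non-ASCII
# characters are ÄÖÜäöüß.
while IFS= read -r line; do
    if LC_ALL=C grep -q $'\xC3\x83\xC2' <<< "$line"; then
        # Double encoded UTF-8 (e.g. C3 83 C2 B6 for ö): undo both layers.
        iconv -f UTF-8 -t ISO-8859-1 <<< "$line" | iconv -f UTF-8 -t ISO-8859-1
    elif LC_ALL=C grep -q $'\xC3' <<< "$line"; then
        # Single UTF-8 (e.g. C3 B6 for ö): one conversion is enough.
        iconv -f UTF-8 -t ISO-8859-1 <<< "$line"
    else
        # Plain ASCII or already ISO-8859-1: leave the line as is.
        printf '%s\n' "$line"
    fi
done < entries.txt > entries_iso.txt

The $'\xC3\x83\xC2' test only checks for the common prefix of the double-encoded German characters, so, as said above, this is a heuristic: a line that genuinely contains something like 'Ã¶' will be converted too.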

But it's probably easier to solve this with a scripting language like Perl or Python.