I'm trying to figure out how to decode some corrupt characters I have in a spreadsheet. There is a list of website titles: some in English, some in Greek, some in other languages. For example, the Greek phrase ΕΛΛΗΝΙΚΑ ΝΕΑ ΤΩΡΑ
shows as ŒïŒõŒõŒóŒùŒôŒöŒë ŒùŒïŒë Œ§Œ©Œ°Œë
. So the whitespace is OK, but the actual letters have gone all wrong.
I have noticed that letters got converted to pairs of symbols:
Ε - Œï
Λ - Œõ
And so on. So it's almost always Œ and then some other symbol after it.
I went further, removed the repeated Œ and checked the difference in character codes between the actual phrase and what was left of the corrupted phrase: ord('Ε') - ord('ï')
and so on. The difference is almost the same all the time:
678
678
678
676
676
677
676
678
0 (this is a whitespace)
676
678
678
0 (this is a whitespace)
765
768
753
678
I have manually decoded some of the other letters from other titles:
Greek
Œë Α
Œî Δ
Œï Ε
Œõ Λ
Œó Η
Œô Ι
Œö Κ
Œù Ν
Œ° Ρ
Œ§ Τ
Œ© Ω
Œµ ε
Œª λ
œÑ τ
ŒØ ί
Œø ο
œÑ τ
œâ ω
ŒΩ ν
Symbols
‚Äò ‘
‚Äô ’
‚Ä¶ …
‚Ä† †
‚Äú “
Other
√© é
It's good that I have a translation for this phrase, but there are a couple of others I don't have a translation for. I would be glad of any kind of advice, because searching around StackOverflow didn't show me anything related.
It's a character encoding issue. The string appears to be UTF-8 text that was read as Mac OS Roman (figured it out by educated guesses on this site). The IANA name for this encoding is macintosh, and its Windows code page number is 10000.
A Python function can decode such macintosh strings to utf-8: encode the garbled text back to bytes as macintosh, then decode those bytes as utf-8.
My best guess is that your spreadsheet was saved on a Mac computer, or perhaps saved using some Macintosh-based setting.
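A minimal sketch of such a macintosh-to-utf-8 decoding function, assuming Python 3 and its built-in mac_roman codec (macintosh is an alias for it). The name fix_mojibake is hypothetical, and returning the input unchanged on failure is a design assumption so that titles that were never garbled pass through untouched:

```python
def fix_mojibake(text: str) -> str:
    """Undo UTF-8 text that was mistakenly decoded as Mac OS Roman."""
    try:
        # Recover the original bytes, then decode them as the UTF-8 they really were.
        return text.encode("mac_roman").decode("utf-8")
    except UnicodeError:
        # Not this kind of mojibake (e.g. already-correct Greek): leave it unchanged.
        return text


print(fix_mojibake("ŒïŒõŒõŒóŒùŒôŒöŒë ŒùŒïŒë Œ§Œ©Œ°Œë"))  # garbled phrase from the question
print(fix_mojibake("√©"))                            # the "Other" row of the table
print(fix_mojibake("Plain English title"))           # ASCII passes through as-is
```

Catching UnicodeError covers both failure modes: characters that do not exist in Mac OS Roman (the encode step fails) and byte sequences that are not valid UTF-8 (the decode step fails).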
See also this issue: What encoding does MAC Excel use?
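To see why each Greek letter turned into a pair of symbols, the corruption can be reproduced in the other direction. This is a sketch under the same assumption (UTF-8 bytes read as Mac OS Roman); garble is a hypothetical helper name, and bytes.hex with a separator needs Python 3.8 or later:

```python
def garble(text: str) -> str:
    """Reproduce the corruption: read UTF-8 bytes as if they were Mac OS Roman."""
    return text.encode("utf-8").decode("mac_roman")


for char in "ΕΛτ’":
    raw = char.encode("utf-8")
    # Show the character, its UTF-8 bytes, and how Mac OS Roman displays them.
    print(char, raw.hex(" "), garble(char))
# Ε ce 95 Œï
# Λ ce 9b Œõ
# τ cf 84 œÑ
# ’ e2 80 99 ‚Äô
```

Greek letters take two bytes in UTF-8 with a lead byte of 0xCE or 0xCF, which Mac OS Roman displays as Œ or œ. Punctuation such as ’ takes three bytes, hence the three-symbol groups in the Symbols list.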