I'm trying to figure out how to decode some corrupt characters I have in a spreadsheet. There is a list of website titles: some in English, some in Greek, some in other languages. For example, the Greek phrase ΕΛΛΗΝΙΚΑ ΝΕΑ ΤΩΡΑ shows as ŒïŒõŒõŒóŒùŒôŒöŒë ŒùŒïŒë Œ§Œ©Œ°Œë. So the whitespace is OK, but the actual letters have gone all wrong.
I have noticed that letters got converted to pairs of symbols:
Ε - Œï
Λ - Œõ
And so on. So it's almost always Œ and then some other symbol after it.
I went further, removed the repeated letter and checked the difference between the character codes of the actual phrase and what was left of the corrupted phrase: ord('ï') - ord('Ε') and so on (absolute values shown). The difference is almost the same all the time:
678
678
678
676
676
677
676
678
0 (this is a whitespace)
676
678
678
0 (this is a whitespace)
765
768
753
678
I have manually decoded some of the other letters from other titles:
Greek
Œë Α
Œî Δ
Œï Ε
Œõ Λ
Œó Η
Œô Ι
Œö Κ
Œù Ν
Œ° Ρ
Œ§ Τ
Œ© Ω
Œµ ε
Œª λ
œÑ τ
ŒØ ί
Œø ο
œÑ τ
œâ ω
ŒΩ ν
Symbols
‚Äò ‘
‚Äô ’
‚Ä¶ …
‚Ä† †
‚Äú “
Other
√© é
It's good I have a translation for this phrase, but there are a couple of others I don't have translation for. I would be glad to see any kind of advice because searching around StackOverflow didn't show me anything related.
It's a character encoding issue. The text was originally UTF-8, but its bytes appear to have been read as Mac OS Roman (figured it out by educated guesses on this site). That is why every Greek letter, which takes two bytes in UTF-8, turned into a pair of symbols. The IANA name for this encoding is macintosh, and its Windows code page number is 10000.
My best guess is that your spreadsheet was saved on a Mac computer, or perhaps saved using some Macintosh-based setting.
To repair a string in Python, encode it back to bytes as macintosh and then decode those bytes as utf-8.
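The encode-then-decode round trip can be sketched as below. This is a minimal sketch rather than a definitive implementation: the function name fix_mojibake is made up for illustration, and the code assumes every damaged cell was corrupted the same way (UTF-8 bytes read as Mac OS Roman). Python's mac_roman codec is the same encoding that IANA calls macintosh. Titles that do not fit the pattern are returned unchanged.

```python
import unicodedata


def fix_mojibake(text: str) -> str:
    """Undo UTF-8 text that was mistakenly decoded as Mac OS Roman.

    Hypothetical helper name; assumes the corruption described above.
    """
    # Compose accented letters first (e.g. "i" + combining diaeresis -> "ï"),
    # because Mac OS Roman only contains precomposed characters.
    composed = unicodedata.normalize("NFC", text)
    try:
        # Step 1: recover the original bytes by encoding as Mac OS Roman.
        # Step 2: read those bytes as what they really were, UTF-8.
        return composed.encode("mac_roman").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The string does not match this kind of corruption
        # (for example, a title that was never damaged): leave it alone.
        return text


if __name__ == "__main__":
    print(fix_mojibake("ŒïŒõŒõŒóŒùŒôŒöŒë ŒùŒïŒë Œ§Œ©Œ°Œë"))
```

Run on the phrase from the question, this should print ΕΛΛΗΝΙΚΑ ΝΕΑ ΤΩΡΑ. Plain English titles pass through untouched, because ASCII characters have the same bytes in both encodings.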
See also this issue: What encoding does MAC Excel use?