Why does UTF-7 interpret umlauts correctly and UTF-8 not?


I have a Base64 string which I want to decode to a UTF-8 string, like this:

byte[] encodedDataAsBytes = System.Convert.FromBase64String(vcard);
return Encoding.UTF8.GetString(encodedDataAsBytes);

This is because umlauts in the string need to be displayed correctly. The problem I face is that when I use UTF-8 as the encoding, the umlauts are NOT handled correctly. But when I use UTF-7

return Encoding.UTF7.GetString(encodedDataAsBytes);

everything works fine.

Why is that? Shouldn't UTF-8 be able to handle umlauts?


There are 2 answers below.

Answer 1:

Your vcard is UTF-7 encoded.

This is why Encoding.UTF7.GetString(encodedDataAsBytes); gives you the right result.

Once the text has been encoded to bytes, you can't simply pick a different encoding to decode them; the decoding has to match the encoding that produced the bytes.

To use UTF-8 you would need access to wherever vcard gets its value, so that the string is encoded as UTF-8 (rather than UTF-7) in the first place.
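
Here is a minimal sketch of that mismatch (the class name and the string literal "Grün" are just illustrative; note that Encoding.UTF7 is marked obsolete as of .NET 5). Only the matching encoding restores the umlaut:

// Demonstrates that bytes must be decoded with the same encoding
// that produced them.
using System;
using System.Text;

class EncodingMismatchDemo
{
    static void Main()
    {
        string original = "Grün";

        // Simulate the producer: encode with UTF-7, then Base64,
        // as the vcard producer apparently does.
        byte[] utf7Bytes = Encoding.UTF7.GetBytes(original);
        string vcard = Convert.ToBase64String(utf7Bytes);

        byte[] encodedDataAsBytes = Convert.FromBase64String(vcard);

        // The matching encoding restores the umlaut: Grün
        Console.WriteLine(Encoding.UTF7.GetString(encodedDataAsBytes));

        // UTF-8 leaves the raw UTF-7 escape visible instead: Gr+APw-n
        Console.WriteLine(Encoding.UTF8.GetString(encodedDataAsBytes));
    }
}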

Answer 2:

I had a similar problem. In my case, I used JavaScript's btoa() to encode a filename to Base64 in the web UI and sent it over to the server. On the server side (.NET Core), I used the code below to decode it back to a string filename.

// Note: encodedFilename is the result of btoa() from the client web UI.
var raw = Convert.FromBase64String(encodedFilename);
var filename = Encoding.UTF8.GetString(raw);

It failed to decode ä. It did work when I used Encoding.UTF7.GetString(), but I don't think that is the right solution. The cause is a mismatch of encode/decode types: btoa() is binary-to-ASCII, meaning it encodes each character as a single byte, so ä becomes the Latin-1 byte 0xE4, which is not valid UTF-8. What I really needed was b64EncodeUnicode(), which converts the string to UTF-8 bytes before calling btoa():

function b64EncodeUnicode(str) {
    // Percent-encode the string as UTF-8 (e.g. "ä" -> "%C3%A4"), then turn
    // each %XX escape back into the raw byte it names, so that btoa()
    // receives a binary string of UTF-8 bytes.
    return btoa(encodeURIComponent(str).replace(/%([0-9A-F]{2})/g, function(match, p1) {
        return String.fromCharCode('0x' + p1);
    }));
}

Code Reference: https://developer.mozilla.org/en-US/docs/Glossary/Base64
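
With that client-side change, the original server-side code works unchanged. A short sketch, assuming the client sent b64EncodeUnicode("ä") (the literal "w6Q=" below is exactly that: Base64 over the UTF-8 bytes 0xC3 0xA4):

using System;
using System.Text;

class DecodeFilenameDemo
{
    static void Main()
    {
        // "w6Q=" is what b64EncodeUnicode("ä") returns on the client.
        string encodedFilename = "w6Q=";

        var raw = Convert.FromBase64String(encodedFilename);
        var filename = Encoding.UTF8.GetString(raw);

        Console.WriteLine(filename); // ä
    }
}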