Encoding.UTF7.GetBytes does not reverse Encoding.UTF7.GetString()


I guess I'm missing something fundamental, but I'm really confused by this one, and searching hasn't turned up anything.

I have the following...

byte[] bytes1;
string string1;
byte[] bytes2;

Then I do the following

bytes1 = new byte[] { 64, 55, 121, 54, 36, 72, 101, 118, 38, 40, 100, 114, 33, 110, 85, 94, 112, 80, 163, 36, 84, 103, 58, 126 };
string1 = System.Text.Encoding.UTF7.GetString(bytes1);
bytes2 = System.Text.Encoding.UTF7.GetBytes(string1);

bytes2 ends up as 54 bytes instead of 24, and the values are completely different.

Of course this code is pointless on its own, but I added it while diagnosing why the bytes I get from Encoding.UTF7.GetString are not the bytes I expect, and I've narrowed my problem down to this round trip.

Now I'm confused. I know that without specifying an encoding the bytes you get from a string can't be relied on to be any particular sequence, but I am specifying an encoding and I still get this difference.
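For what it's worth, here is a minimal standalone repro of the asymmetry (a sketch, assuming classic .NET Framework, where Encoding.UTF7 carries no obsolete warning):

using System;
using System.Text;

class Repro
{
    static void Main()
    {
        // Arbitrary bytes, including 163, which is outside the ASCII range.
        byte[] original = { 64, 163, 126 };

        string text = Encoding.UTF7.GetString(original);
        byte[] reencoded = Encoding.UTF7.GetBytes(text);

        Console.WriteLine(BitConverter.ToString(original));            // 40-A3-7E
        Console.WriteLine(BitConverter.ToString(reencoded));           // a longer, different sequence
        Console.WriteLine(Encoding.UTF7.GetString(reencoded) == text); // True: the text round-trips
    }
}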

Can anyone enlighten me to what I'm missing?

EDIT: The conclusion is that it isn't UTF-7. The original byte array is written to a varbinary column in a database by an application programmed in a high-level language. I have no control over how that language encodes the original strings to varbinary. I'm trying to read and handle those values in a small C# add-on to the main app, which is where I hit this problem. The other encodings I've tried also don't give the right results.


There are 3 answers below.

BEST ANSWER

UTF-7 (7-bit Unicode Transformation Format) is a variable-length character encoding that was proposed for representing Unicode text using a stream of ASCII characters. (C) Wikipedia

Your byte array contains sequences that are not valid UTF-7. For example, the value 163 doesn't fit in 7 bits, so it can't appear in a valid UTF-7 stream; .NET's lenient decoder maps it to U+00A3 ('£') anyway, and re-encoding that character produces an escape sequence rather than the original byte.
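A quick check of this (a sketch; assumes using System; and using System.Text;):

byte[] input = { 163 };                                  // not valid UTF-7 input
string decoded = Encoding.UTF7.GetString(input);         // "£" (U+00A3): the decoder is lenient
byte[] reencoded = Encoding.UTF7.GetBytes(decoded);      // an escape sequence, not byte 163
Console.WriteLine(Encoding.ASCII.GetString(reencoded));  // +AKM-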

ANSWER

What you're seeing is two different ways of encoding the same text in UTF-7.

Your original text is:

@7y6$Hev&(dr!nU^pP£$Tg:~

The ASCII rendering of bytes2 is:

+AEA-7y6+ACQ-Hev+ACY-(dr+ACE-nU+AF4-pP+AKMAJA-Tg:+AH4-

In other words, .NET's encoder emits only the "direct characters" (letters, digits, and the few symbols in the set ' ( ) , - . / : ?) literally, and escapes everything else, including the optional direct characters, as +...- sequences. That's more escaping than strictly necessary, but it's still valid UTF-7.

From the UTF-7 wikipedia entry:

Some characters can be represented directly as single ASCII bytes. The first group is known as "direct characters" and contains 62 alphanumeric characters and 9 symbols: ' ( ) , - . / : ?. The direct characters are safe to include literally. The other main group, known as "optional direct characters", contains all other printable characters in the range U+0020–U+007E except ~ \ + and space. Using the optional direct characters reduces size and enhances human readability but also increases the chance of breakage by things like badly designed mail gateways and may require extra escaping when used in encoded words for header fields.
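You can confirm that the two byte sequences spell the same text (a sketch, reusing bytes1 and bytes2 from the question; assumes using System; and using System.Text;):

string s1 = Encoding.UTF7.GetString(bytes1); // decode the original 24 bytes
string s2 = Encoding.UTF7.GetString(bytes2); // decode the re-encoded 54 bytes
Console.WriteLine(s1 == s2);                 // True: identical text either way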

ANSWER

It wasn't UTF-7, and I had made errors in reaching the conclusion that it was. Thanks to everyone who pointed this out.

I have spoken to someone who works for the vendor of the high-level language the main part of the application is written in (and who happened to be in our building today).

He couldn't tell me what encoding it uses between the entered string and the varbinary, but he was able to tell me there is a way to force Unicode. As this option is new in both applications, I know no production data has been written the old way, so I will update both sides to use Unicode encoding for this process. It all seems to be working so far.
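For the record, this is roughly what both sides do now (a sketch; the sample text is from the question, and all names here are mine, not the real application's):

using System;
using System.Text;

class UnicodeRoundTrip
{
    static void Main()
    {
        string original = "@7y6$Hev&(dr!nU^pP£$Tg:~";

        // What the main application now writes into the varbinary column:
        // UTF-16LE, which is what Encoding.Unicode produces.
        byte[] stored = Encoding.Unicode.GetBytes(original);

        // What the C# add-on reads back from the column.
        string readBack = Encoding.Unicode.GetString(stored);

        Console.WriteLine(readBack == original); // True: an exact round trip
    }
}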