When decoding a byte array into a string using the .NET ASCIIEncoding class, do I need to write some code to detect and remove the byte order mark, or is it possible to tell ASCIIEncoding not to decode the byte order mark into the string?
Here's my problem. When I do this:
string someString = System.Text.ASCIIEncoding.Default.GetString(someByteArray);
someString will look like this:
<?xml version="1.0"?>.......
Then when I call this:
XElement.Parse(someString)
an exception is thrown because of the first three bytes: EF BB BF - the UTF8 byte order mark. So I thought that if I specify UTF8 encoding, rather than Default, like this:
System.Text.ASCIIEncoding.UTF8.GetString(someByteArray)
ASCIIEncoding would not attempt to decode the byte order mark into the string. But when I copy the returned string into Notepad++, I can see a ? character in front of the XML tag. So now the byte order mark is being decoded into a single garbage character. What is the best way to stop the byte order mark from being decoded in this case?
Please don't use ASCIIEncoding.UTF8.GetString(someByteArray). That's really just Encoding.UTF8.GetString(someByteArray). It's not using ASCIIEncoding
at all. It just looks like it in your source code.
Fundamentally, the problem is that your file is UTF-8, not ASCII. That's why it's got a UTF-8 byte order mark. I strongly suggest that you use Encoding.UTF8 to read the UTF-8 file, one way or the other.
If you read the file with File.ReadAllText, I suspect it'll remove the BOM automatically. Or you could just trim it afterwards, before calling XElement.Parse. Using the wrong encoding (either ASCII or Encoding.Default) is not the right approach. Likewise it's not a garbage character. It's a perfectly useful character, giving a very strong indication that it really is a UTF-8 file; it's just that you don't want it in this particular context. "Garbage" gives the impression that it's corrupt data which shouldn't be present in the file, and that's definitely not the case.
Another approach would be to avoid converting it into text at all. For example:
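A minimal sketch of loading the XML straight from the bytes, assuming they are already in an array like someByteArray from the question. The class name, the LoadFromBytes helper and the sample data are made up for illustration. XElement.Load(Stream) needs .NET 4 or later.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml.Linq;

static class BomStreamExample
{
    // Hypothetical helper: parse XML directly from bytes, leaving encoding
    // detection (including the BOM) to the XML reader.
    public static XElement LoadFromBytes(byte[] bytes)
    {
        using (var stream = new MemoryStream(bytes))
        {
            return XElement.Load(stream);
        }
    }

    static void Main()
    {
        // Sample data standing in for someByteArray: UTF-8 BOM, then the XML.
        byte[] bom = { 0xEF, 0xBB, 0xBF };
        byte[] xml = Encoding.UTF8.GetBytes("<?xml version=\"1.0\"?><root>hello</root>");
        byte[] someByteArray = new byte[bom.Length + xml.Length];
        bom.CopyTo(someByteArray, 0);
        xml.CopyTo(someByteArray, bom.Length);

        XElement element = LoadFromBytes(someByteArray);
        Console.WriteLine(element.Name); // root
    }
}
```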
That way the encoding will be auto-detected.
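If you do need the string, the trimming option mentioned earlier could look roughly like this. It's only a sketch: the class name, the DecodeWithoutBom helper and the sample bytes are invented for the example. It relies on Encoding.UTF8.GetString keeping the BOM as a single leading U+FEFF character.

```csharp
using System;
using System.Text;
using System.Xml.Linq;

static class BomTrimExample
{
    // Hypothetical helper: decode as UTF-8, then drop a leading U+FEFF if present.
    public static string DecodeWithoutBom(byte[] bytes)
    {
        string text = Encoding.UTF8.GetString(bytes);
        // The three BOM bytes EF BB BF decode to the single character U+FEFF.
        if (text.Length > 0 && text[0] == '\uFEFF')
        {
            text = text.Substring(1);
        }
        return text;
    }

    static void Main()
    {
        // Sample data standing in for someByteArray: UTF-8 BOM, then the XML.
        byte[] bom = { 0xEF, 0xBB, 0xBF };
        byte[] xml = Encoding.UTF8.GetBytes("<?xml version=\"1.0\"?><root>hello</root>");
        byte[] someByteArray = new byte[bom.Length + xml.Length];
        bom.CopyTo(someByteArray, 0);
        xml.CopyTo(someByteArray, bom.Length);

        string someString = DecodeWithoutBom(someByteArray);
        XElement element = XElement.Parse(someString);
        Console.WriteLine(element.Value); // hello
    }
}
```

Bytes without a BOM pass through unchanged, so the helper is safe to call either way.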