Encoding-free String class for handling bytes? (Or alternative approach)


I have an application converted from Python 2 (where strings are essentially lists of bytes) and I'm using a string as a convenient byte buffer.

I am rewriting some of this code in the Boo language (Python-like syntax, runs on .NET) and am finding that the strings have an intrinsic encoding type, such as ASCII, UTF-8, etc. Most of the information dealing with bytes refers to arrays of bytes, which are (apparently) fixed length, making them quite awkward to work with.

I can obviously get bytes from a string, but at the risk of expanding some characters into multiple bytes, or discarding/altering bytes above 127, etc. This is fine and I fully understand the reasons for this - but what would be handy for me is either (a) an encoding that guarantees no conversion or discarding of characters so that I can use a string as a convenient byte buffer, or (b) some sort of ByteString class that gives the convenience of the string class. (Ideally the latter as it seems less of a hack.) Does either of these already exist? (Or are they trivial to implement?)

I am aware of System.IO.MemoryStream, but the prospect of creating one of those each time and then having to make a System.IO.StreamReader at the end just to get access to ReadToEnd() doesn't seem very efficient, and this is in performance-sensitive code.

(I hope nobody minds that I tagged this as C# as I felt the answers would likely apply there also, and that C# users might have a good idea of the possible solutions.)

EDIT: I've also just discovered System.Text.StringBuilder - again, is there such a thing for bytes?


There are 2 best solutions below

8
On BEST ANSWER

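Use the Latin-1 (ISO-8859-1) encoding as described in this answer. It maps each byte value 0-255 to the Unicode code point with the same value, so bytes in the range 128-255 pass through unchanged, which is useful when you want to round-trip bytes to chars.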
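As a rough illustration of the round trip, here is a minimal sketch. The class and method names (Latin1RoundTrip, ToBufferString, ToBytes) are made up for this example; only Encoding.GetEncoding, GetString and GetBytes are framework calls.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of .NET: wraps the ISO-8859-1 encoding
// so a string can stand in for a byte buffer.
public static class Latin1RoundTrip
{
    // "ISO-8859-1" is the Latin-1 encoding (code page 28591).
    private static readonly Encoding Latin1 = Encoding.GetEncoding("ISO-8859-1");

    // Each byte becomes the char with the same numeric value.
    public static string ToBufferString(byte[] bytes)
    {
        return Latin1.GetString(bytes);
    }

    // Each char in the range 0-255 becomes the byte with the same value.
    // Chars above 255 cannot be represented and will not round-trip.
    public static byte[] ToBytes(string s)
    {
        return Latin1.GetBytes(s);
    }
}
```

This only stays lossless as long as nothing puts a char above U+00FF into the string, so it is best kept to code that treats the string purely as a container.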

UPDATE

Or if you want to manipulate bytes directly, use List<byte>:

List<byte> result = ...
...
// Add a byte at the end
result.Add(b);
// Add a collection of bytes at the end
byte[] bytesToAppend = ...
result.AddRange(bytesToAppend);
// Insert a collection of bytes at any position
byte[] bytesToInsert = ...
int insertIndex = ...
result.InsertRange(insertIndex, bytesToInsert);
// Remove a range of bytes
result.RemoveRange(index, count);
... etc ...

I've also just discovered System.Text.StringBuilder - again, is there such a thing for bytes?

The StringBuilder class is needed because regular strings are immutable, and a List<byte> gives you everything you might expect from a "StringBuilder for bytes".
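For a self-contained version of the operations above, something like the following sketch should work. The class name ByteListDemo and the byte values are arbitrary choices for illustration; ToArray() at the end is how you would get a plain byte[] back out.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical demo class: uses List<byte> as a growable byte buffer.
public static class ByteListDemo
{
    public static byte[] Build()
    {
        var buffer = new List<byte>();

        // Append a single byte: 01
        buffer.Add(0x01);

        // Append several bytes: 01 02 03 FF
        buffer.AddRange(new byte[] { 0x02, 0x03, 0xFF });

        // Insert two bytes at index 1: 01 AA BB 02 03 FF
        buffer.InsertRange(1, new byte[] { 0xAA, 0xBB });

        // Remove one byte starting at index 0: AA BB 02 03 FF
        buffer.RemoveRange(0, 1);

        // Copy the contents into a fixed-length array for APIs that need byte[].
        return buffer.ToArray();
    }
}
```

Note that InsertRange and RemoveRange shift the elements after the affected position, so in performance-sensitive code it is worth preferring appends where possible.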

9
On

I would suggest that you use MemoryStream and retrieve the end result with its GetBuffer() method (or ToArray(), which copies only the bytes actually written; the array returned by GetBuffer() can be longer than the data, so use it together with Length). Strings are fixed length and immutable, so adding or replacing a single byte in a string requires copying the whole thing into a new string, which is quite slow. To avoid this you would need a StringBuilder, which allocates memory and grows its capacity when needed, but then you might just as well use MemoryStream, which does a similar thing for bytes.
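A minimal sketch of that approach, with no StreamReader involved, might look like this. MemoryStreamDemo and its method names are invented for the example. It shows both ways of getting the data out: ToArray() returns a copy of exactly the bytes written, while GetBuffer() returns the internal array, of which only the first Length bytes are valid.

```csharp
using System;
using System.IO;

// Hypothetical demo class: uses MemoryStream as an append-only byte buffer.
public static class MemoryStreamDemo
{
    // Returns a copy containing exactly the bytes that were written.
    public static byte[] BuildCopy()
    {
        using (var ms = new MemoryStream())
        {
            ms.WriteByte(0x01);
            byte[] chunk = { 0x02, 0x03, 0xFF };
            ms.Write(chunk, 0, chunk.Length);
            return ms.ToArray();
        }
    }

    // Avoids the copy: hands back the internal buffer plus the valid length.
    // The returned array may be longer than validLength.
    public static byte[] BuildNoCopy(out int validLength)
    {
        var ms = new MemoryStream();
        ms.WriteByte(0x01);
        byte[] chunk = { 0x02, 0x03, 0xFF };
        ms.Write(chunk, 0, chunk.Length);
        validLength = (int)ms.Length;
        return ms.GetBuffer();
    }
}
```

GetBuffer() only works on streams that own an exposable buffer, which is the case for the parameterless constructor used here.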

Each element in a string is a char, which is two bytes because .NET strings are always UTF-16 in memory, so you will also waste memory if you store only one byte in each element.
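To put a number on it, here is a small sketch comparing the payload sizes. CharSizeDemo is a made-up name, and the figures deliberately ignore per-object overhead, which both strings and arrays carry.

```csharp
using System;

// Hypothetical helper: compares raw payload size of n bytes stored
// one-per-char in a string versus in a byte array (object headers ignored).
public static class CharSizeDemo
{
    // A char is a UTF-16 code unit, so sizeof(char) is 2.
    public static int PayloadAsString(int n)
    {
        return n * sizeof(char);
    }

    // sizeof(byte) is 1.
    public static int PayloadAsByteArray(int n)
    {
        return n * sizeof(byte);
    }
}
```

So a 1000-byte buffer held as a string occupies roughly 2000 bytes of character data, against 1000 for a byte[].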