.NET uses UTF-16 to represent strings, which usually means 2 bytes per character. Many debugging tricks (including my own answers) use the output of `!do` to get the address of the first character and then `String.Length * 2` to calculate the end address of the string.
Some examples where this can be useful:

- `du` to dump the string, because `!do` will not dump the complete string
- `.writemem` to write strings to a file so that they can be processed by other tools
- `s` to search for strings containing specific substrings
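The range calculation behind those commands is simple; here is a sketch of it in Python, where the address and length are made-up example values (not from a real `!do` dump):

```python
# Hypothetical example values, as !do would print them
first_char = 0x02C83C7C   # address of the first character (made up for this example)
length = 11               # String.Length as reported by !do

end = first_char + length * 2   # each UTF-16 code unit is 2 bytes

# e.g. a range usable with "du <start> <end>" (address of the last code unit)
print(f"du {first_char:x} {end - 2:x}")
```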
However, UTF-16 also has 4-byte characters (U+10000 to U+10FFFF), which might screw up everything:

- the string length is counted in characters, and a 4-byte character is probably counted as only 1 character, so any `length * 2` calculation would be incorrect
- `du` might stop early at characters whose encoding ends in `00 00`
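The first worry is real in languages whose length counts code points. Python is one of them, so it makes a handy illustration of how a `length * 2` calculation could go wrong:

```python
s = "a" + "\U0001F600"                   # 'a' plus one 4-byte (astral) character
assert len(s) == 2                       # counted as 2 characters (code points)
assert len(s.encode("utf-16-le")) == 6   # but occupies 6 bytes in UTF-16
# A naive len(s) * 2 == 4 would miss the last 2 bytes of the string.
```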
So, how safe is it to use such scripts when debugging .NET applications in WinDbg?
Short version: yes, it is safe to do string range calculations in WinDbg using `String.Length`, and it is safe to use `du` to dump them.

## UTF-16 4-byte characters ending in 00 00

The Unicode specification defines that the first 6 bits of the first code unit (the high surrogate) are 110110 and the first 6 bits of the second code unit (the low surrogate) are 110111. The first nibble (4 bits) of each code unit is therefore always `D`, so a 4-byte UTF-16 character always looks like `D? ?? D? ??` (or `?? D? ?? D?` in little-endian memory). Both code units are in the range 0xD800 to 0xDFFF, so a surrogate pair can never contain the terminator `00 00`.
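This is easy to verify; a quick Python check (using 😀 as an arbitrary example of an astral character) shows the surrogate byte pattern:

```python
s = "\U0001F600"             # 😀 (U+1F600), encoded as the surrogate pair D83D DE00
b = s.encode("utf-16-le")    # little-endian, as stored in memory on Windows
assert b.hex() == "3dd800de"                  # every second byte is in D8..DF
assert b[1] >> 4 == 0xD and b[3] >> 4 == 0xD  # the first nibble is always D
assert b"\x00\x00" not in b                   # no null terminator inside the pair
```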
Therefore it is safe to use `du` commands on UTF-16 strings.

## Using string.Length for calculating the range

Before answering my own question, I wanted to try the behavior in C# and therefore asked how to create 4-byte characters in C#. Unexpectedly, that already pointed me to the answer: `string.Length` is the length of the string in UTF-16 code units, not in Unicode characters. Since each code unit is exactly 2 bytes, `Length * 2` is always the correct byte count, even for strings containing surrogate pairs. To get the number of actual Unicode characters, use the `System.Globalization.StringInfo` class.
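The distinction can be mimicked in Python (note the caveat that Python's own `len()` counts code points, unlike .NET's `Length`):

```python
s = "\U0001F600"                 # 😀, a single character outside the BMP
utf16 = s.encode("utf-16-le")
code_units = len(utf16) // 2     # UTF-16 code units: what .NET string.Length reports
assert code_units == 2           # so Length * 2 = 4 bytes, the full encoded size
assert len(s) == 1               # Unicode characters: what StringInfo would count
```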