Strange behavior of move with strings

2k Views Asked by At

I am testing some enhanced string related functions with which I am trying to use move as a way to copy strings around for faster, more efficient use without delving into pointers.

While testing a function for making a delimited string from a TStringList, I encountered a strange issue. The compiler referenced the bytes contained through the index when it was empty and when a string was added to it through move, index referenced the characters contained.

Here is a small downsized barebone code sample:-

unit UI;

interface

uses
  System.SysUtils, System.Types, System.UITypes, System.Rtti, System.Classes,
  System.Variants, FMX.Types, FMX.Controls, FMX.Forms, FMX.Dialogs, FMX.Layouts,
  FMX.Memo;

type
  TForm1 = class(TForm)
    Results: TMemo;
    procedure FormCreate(Sender: TObject);
  end;

var
  Form1: TForm1;

implementation

{$R *.fmx}

function  StringListToDelimitedString
          ( const AStringList: TStringList; const ADelimiter: String ): String;
var
  Str           : String;
  Temp1         : NativeInt;
  Temp2         : NativeInt;
  DelimiterSize : Byte;

begin

  Result        := ' ';
  Temp1         := 0;
  DelimiterSize := Length ( ADelimiter ) * 2;

  for Str in AStringList do
    Temp1 := Temp1 + Length ( Str );

  SetLength ( Result, Temp1 );
  Temp1     := 1;

  for Str in AStringList do
  begin

    Temp2 := Length ( Str ) * 2;

    // Here Index references bytes in Result
    Move  ( Str [1],        Result [Temp1], Temp2 );

    // From here the index seems to address characters instead of bytes in Result
    Temp1 := Temp1 + Temp2;
    Move  ( ADelimiter [1], Result [Temp1], DelimiterSize );    
    Temp1 := Temp1 + DelimiterSize;

  end;

end;

procedure TForm1.FormCreate(Sender: TObject);
var
  StrList : TStringList;
  Str     : String;

begin

  // Test 1 : StringListToDelimitedString

  StrList := TStringList.Create;
  Str     := '';

  StrList.Add ( 'Hello1' );
  StrList.Add ( 'Hello2' );
  StrList.Add ( 'Hello3' );
  StrList.Add ( 'Hello4' );

  Str := StringListToDelimitedString ( StrList, ';' );
  Results.Lines.Add ( Str ); 
  StrList.Free;

end;

end.

Please devise a solution and if possible, some explanation. Alternatives are welcome too.

2

There are 2 best solutions below

6
On BEST ANSWER

Let's look at the crucial bit of code:

// Here Index references bytes in Result
Move  ( Str [1],        Result [Temp1], Temp2 );

// From here the index seems to address characters instead of bytes in Result
Temp1 := Temp1 + Temp2;
Move  ( ADelimiter [1], Result [Temp1], DelimiterSize );    

Now, some explanations. When you index a string, you are always indexing characters. You are never indexing bytes. It looks to me as though you wish to index bytes. In which case using the string index operator makes life hard. So I suggest that you index bytes as follows.

First of all initialise Temp1 to 0 rather than 1 since we will be using zero-based indexing.

When you need to index Result using a zero-based byte index, do so like this:

PByte(Result)[Temp1]

So your code becomes:

Temp1 := 0;
for Str in AStringList do
begin
  Temp2 := Length(Str)*2;
  Move(Str[1], PByte(Result)[Temp1], Temp2);
  Temp1 := Temp1 + Temp2;
  Move(ADelimiter[1], PByte(Result)[Temp1], DelimiterSize);    
  Temp1 := Temp1 + DelimiterSize;
end;

In fact I think I'd write it like this, avoiding all string indexing:

Temp1 := 0;
for Str in AStringList do
begin
  Temp2 := Length(Str)*2;
  Move(Pointer(Str)^, PByte(Result)[Temp1], Temp2);
  Temp1 := Temp1 + Temp2;
  Move(Pointer(ADelimiter)^, PByte(Result)[Temp1], DelimiterSize);    
  Temp1 := Temp1 + DelimiterSize;
end;

I'd suggest better names than Temp1 and Temp2. I also question the use of NativeInt here. I'd normally expect to see Integer. Not least because a Delphi string is indexed by signed 32 bit values. You cannot have a string with length greater than 2GB.

Note also that you are not allocating enough memory. You forgot to account for the length of the delimiter. Fix that and your function looks like this:

function StringListToDelimitedString(const AStringList: TStringList;
  const ADelimiter: String): String;
var
  Str: String;
  Temp1: Integer;
  Temp2: Integer;
  DelimiterSize: Integer;
begin
  Temp1 := 0;
  DelimiterSize := Length(ADelimiter) * SizeOf(Char);

  for Str in AStringList do
    inc(Temp1, Length(Str) + DelimiterSize);

  SetLength(Result, Temp1);
  Temp1 := 0;
  for Str in AStringList do
  begin
    Temp2 := Length(Str) * SizeOf(Char);
    Move(Pointer(Str)^, PByte(Result)[Temp1], Temp2);
    inc(Temp1, Temp2);
    Move(Pointer(ADelimiter)^, PByte(Result)[Temp1], DelimiterSize);
    inc(Temp1, DelimiterSize);
  end;
end;

If you want to avoid pointers, then write it like this:

function StringListToDelimitedString(const AStringList: TStringList;
  const ADelimiter: String): String;
var
  Str: String;
  StrLen: Integer;
  ResultLen: Integer;
  DelimiterLen: Integer;
  ResultIndex: Integer;
begin
  DelimiterLen := Length(ADelimiter);

  ResultLen := 0;
  for Str in AStringList do
    inc(ResultLen, Length(Str) + DelimiterLen);

  SetLength(Result, ResultLen);

  ResultIndex := 1;
  for Str in AStringList do
  begin
    StrLen := Length(Str);
    Move(Pointer(Str)^, Result[ResultIndex], StrLen*SizeOf(Char));
    inc(ResultIndex, StrLen);
    Move(Pointer(ADelimiter)^, Result[ResultIndex], DelimiterLen*SizeOf(Char));
    inc(ResultIndex, DelimiterLen);
  end;
end;
17
On

System.Move works with untyped pointers and counter of bytes. System.Copy and SysUtils.StrLCopy work with strings (Pascal strings and C strings respectively) and counter of chars. But char and byte are different types, so when you move from string/char context to pointers/bytes context - you should re-calculate length in chars to length in bytes. By the way, same is about indices, Result [Temp1] calculates in characters, not in bytes. And always did.

Correct solution is not mixing citizens of different planets. If you want pointers - use pointers. If you want characters and strings - use characters and strings. But do not mix them! divide and conquer and always separate and make clear when you're using raw piinters and where you use typed strings! Otherwise you're misleading yourself;

function  StringListToDelimitedString
          ( const AStringList: TStrings; const ADelimiter: String ): String;
var
  Str           : array of String;
  Lengths       : array of Integer;
  Temp1         : NativeInt;
  Count, TotalChars : Integer;

  PtrDestination: PByte;
  PCurStr: ^String;
  CurLen: Integer;

  Procedure  Add1(const Source: string);
  var count: integer; // all context is in bytes, not chars here!
      Ptr1, Ptr2: PByte; 
  begin
    if Source = '' then exit;
    Ptr1 := @Source[ 1 ];
    Ptr2 := @Source[ Length(Source)+1 ];
    count := ptr2 - ptr1;

    Move( Source[1], PtrDestination^, count);
    Inc(PtrDestination, count);
  end;

begin // here all context is in chars and typed strings, not bytes
  Count := AStringList.Count;
  if Count <= 0 then exit('');

  SetLength(Str, Count); SetLength(Lengths, Count);
  TotalChars := 0; 
  for Temp1 := 0 to Count - 1 do begin
      PCurStr  := @Str[ Temp1 ]; 
      PCurStr^ := AStringList[ Temp1 ]; // caching content, avoiding extra .Get(I) calls
      CurLen := Length ( PCurStr^ ); // caching length, avoind extra function calls
      Lengths[ Temp1 ] := CurLen;
      Inc(TotalChars,  CurLen);
  end;

  SetLength ( Result, TotalChars + ( Count-1 )*Length( ADelimiter ) );

  PtrDestination := Pointer(Result[1]); 
  // Calls UniqueString to get a safe pointer - but only once 

  for Temp1 := Low(Str) to High(Str) do
  begin
    Add1( Str[ Temp1 ] );
    Dec( Count );
    if Count > 0 // not last string yet
       then Add1( Delimeter );
  end;
end;

Now, the correct solution i believe would be stopping inventing bicycles and using ready-made and tested libraryes, for example.

 Str := JclStringList().Add(['Hello1','Hello2','Hello3','Hello4']).Join(';');

Or, if you really need to add a delimiter PAST THE LAST string (which is usually carefully avoided) then

 Str := JclStringList().Add(['Hello1','Hello2','Hello3','Hello4', '']).Join(';');

The original claim of squeezing single percents of CPU power just does not hold with the original code. The illusion of fast pointer operations is just shadowed by a suboptimal code caring not about performance at all.


function  StringListToDelimitedString
          ( const AStringList: TStringList; const ADelimiter: String ): String;

TStringList is a class. Class instance creation and deletion are expensive (slow) operations. Delphi made a flexible framework of those classes - but the speed suffers. So if you want to get few extra percents of speed and pay with sacrificing reliability and flexibility - the don't use classes.

DelimiterSize : Byte;

It should be NativeInt instead just as the rest of muneric variables there. You think you just saved few bytes - but you forced CPU to use non-native datatype and insert typecasts every now and then. It is nothing but an explicitly introduced delay. Ironically, you even did not saved those bytes, cause Delphi would just pad three bytes more to allocate next variable on 32-bits boundary. That is a typical "memory alignment" optimization.

Result        := ' ';

This value would never be used. So it is just a loss of time.

for Str in AStringList do

This construction, requiring instantiating TInterfacedObject and calling its virtual methods and then reference-counting it with global locking - is expensive (slow) operation. And twice slow in multithreading taskloads. If you need to squeeze few percetns of speed - you should avoid loosing tens of percents on for-in loops. Those hi-level loops are handy and reliable and flexible - but they pay with speed for that.

 for Str in AStringList do

Then You do it twice. But you DON'T KNOW how it that stringlist implemented. How efficiently does it gets the string ? It may even pass messages to another process like TMemo.Lines do! So you should minimize all accesses to that class and its multitude of internal virtual members. Cache all the strings ONCE in some local variable, do not fetch TWICE every of those!

 Move  ( Str [1],        Result [Temp1], Temp2 );

Now we come to a really interesting question - is there even hypothetical place to gain any speed advantage by usiong pointers and bytes? Open CPU window and look how that line is actually implemented!

Strings are reference-counted! When you do Str2 := Str1; no data is copied but only pointers do. But as you start accessing the real memory buffer inside the string - that Str[1] expression - the compiler can not more count references, so Delphi is forced here to decrease reference coutner to ONE SINGLE. That is, Delphi is forced here to call UniqueString over Str and over Result; the System.UniqueString checks the refcounter and if it is >1 makes a special local copy of string (copying all the data into a newly allocated special buffer). Then you do a Move - just like Delphi RTL does itself. I cannto get where any advantage of speed may come from?

 Move  ( ADelimiter [1], Result [Temp1], DelimiterSize )

And here the same operations are done yet again. And they are costly operations! At least an extra procedure is called, at worst the new buffer is allocated and all the content being copied.


Resume:

  1. The boundary between reference-counted strings and raw pointers is a costly one and every time you cross it - you force Delphi to pay a price.

  2. Mixing those boundaries in the same code make the price paid again and again and again. It also confuses yourself where your counters and indices refer to bytes and where they refer to chars.

  3. Delphi optimized casual string operations for years. And did a pretty good job there. Outperforming Delphi is possible - but you would need to understand in very very fine details - up to each CPU assembler instruction - what goes under the curtains of Pascal sources in your program. That is a dirty and tedious work. There will be no mopre that luxury of using those reliable and flexible things as for-in loop and TStrings classes.

  4. In the end you would most probably get a few percents of speed gain, that no one would ever notice. But you will pay for that with a code much harder to understand, write, read and test. Will those few percents of speed worth unmaintainable code ? I doubt it.

So unless you are forced to do so, my advice is to skip wasting your time and just do a usual Str := JclStringList().Add(['Hello1','Hello2','Hello3','Hello4']).Join(';'); Reliability and flexibility is almost always more preferable than sheer speed.

And sorry to tell that, while i do not know a lot about speed optimizations, i easily saw a speed-damaging issues in your code, that you intended to be faster than Delphi itself. My experience is miles and miles away of even trying to outperform Delphi in strings field. And i do not think you have any chances other but wasting a lot of time to finally get performance worse than stock one.