Copying of data in a tight loop where the function "Move" can not be used

737 Views Asked by At

In one of my software projects I am in "the need of" the best possible speed for a tight loop which copies indexed data from an array into a different part of the same array.

Please consider this freepascal code:

TYPE
  tIndex : ARRAY OF BYTE;
  tValues: ARRAY OF CARDINAL;

VAR
  Index : tIndex;
  Values: tValues;

CONST
  AnyCardinalSize = 65535;   

PROCEDURE InitializeArrays;
  BEGIN
    SetLength(Index, AnyCardinalSize);
    SetLength(Values,AnyCardinalSize);
  END;

PROCEDURE MoveData(CONST StartPosition,EndPosition,TargetPosition : CARDINAL);
  VAR {Note: TargetPosition is ALWAYS larger than EndPosition.}
    x, InsertOffset : CARDINAL;
  BEGIN
    InsertOffset := TargetPosition-StartPosition;
    FOR x := StartPosition TO EndPosition DO BEGIN
      Values[x+InsertOffset] := Values[Index[x]];
    END;
  END;

My routine MoveData work as intended, but I would like to investigate the possibility of replacing it with assembler code in order to gain a little bit of speed, if at all possible.

If somebody has examples of how this can be done in assembler language then I would be most grateful. I did some ASM on the Commodore 64, but that is a lifetime ago.

2

There are 2 best solutions below

4
On BEST ANSWER

Most optimizations come from hoisting out invariant parts of code in the expression, and then performing strength reduction to decrease the number of variants. Learn to optimize Pascal before you go to assembler, even if only because disassembled Pascal gives you a skeleton for further assembler optimization and can act as comparison when testing the assembler code.

To better identify constant parts, we convert to a 0 based for loop with n=0..endposition-startposition, the expression then becomes:

FOR n := 0 TO EndPosition-StartPosition  DO 
  BEGIN
    Values[n+InsertOffset+StartPosition ] := Values[Index[n+StartPosition]];
  END;

Then we factor out the constant parts from the parts that are loop invariant, and introduce pointer syntax. T is the type of the values array, and PT is a pointer to it. Similarly, TI is the type of the index array and PTI is the pointer to it.

var StartValue : PT;
    StartIndex : PTI;  

StartValue:=@Values[InsertOffset+StartPosition]
StartIndex:=@Index[StartPosition];

FOR n := 0 TO EndPosition-StartPosition  DO 
    Startvalue[n]:=Values[StartIndex[n]]; // slightly depended on FPC dialect mode

Now, startvalue[n] is calculated as addressof(startvalue[0])+n*sizeof(T); However if we memorize the value of the address in the previous loop (in Startvalue) then the difference becomes PT:=PT+sizeof(T);

We transformed a multiplication into an add, but more importantly, we eliminated one variable from the expression (instead of startvalue and x (or n), we now only have startvalue to remember). This is called strength reduction, and we can also apply this to the index array:

FOR n := 0 TO EndPosition-StartPosition  DO 
  begin
    Startvalue^:=Values[StartIndex^];
    inc(StartValue);   // inc increments with element size and we declared as PT, so this means inc(startvalue,sizeof(T));
    inc(StartIndex);
  end;

Unfortunately this doesn't eliminate the need for the values array altogether, but this loop has three values used in the loop (startvalue,startindex and values[]), while the original had many (index, values,x, startposition,insertoffset). In addition both have something to identify the end of iteration.

In our case that could be when startvalue reaches the end of the array, but for that we have to use a while:

 var StartValue,EndValue : PT;
     StartIndex : PTI;  

 StartValue:=@Values[InsertOffset+StartPosition]
 EndValue:=@Values[InsertOffset+EndPosition]
 StartIndex:=@Index[StartPosition];

 while (StartValue<=Endvalue) do
    begin
      Startvalue^:=Values[StartIndex^];
      inc(StartValue);   // inc increments with element size and we declared as PT, so this means inc(pbyte(startvalue),sizeof(T));
      inc(StartIndex);
    end;

In theory compilers can do these optimizations. C compilers often do when properly instructed, but most Pascal compilers are not yet up to that level. There however is no reason why they couldn't.

The best way to start assembler is to first optimize Pascal a lot and try to comprehend the generated code.

I've been doing image analysis in Delphi for years now, and going to assembler is only to manually correct the compiler not getting it (e.g. not hoisting or picking a suboptimal invariant), or when register allocation goes awol.

The only other case is whole image operations in SSE, but that doesn't apply here.

5
On

Firstly, before continuing, I think you would have to establish that the copying of data in this way is indeed the bottleneck and that there is no alternative strategy available that removes the need for copying altogether.

If you are certain that you need to have a faster copying routine, it is worth noting that these sort of optimisations can be very tricky and depend on hardware as well as the memory layout/alignment/data organisation and size. Without considerable experience or time, it is highly unlikely you would come up with a superior solution that is worth the effort.

However, optimised memory copy/move routines are available. In particular, you might want to look at Agner Fog's asmlib, which contains optimised mem copy/move routines that can be called from compiled code.