In one of my software projects I am in "the need of" the best possible speed for a tight loop which copies indexed data from an array into a different part of the same array.
Please consider this freepascal code:
TYPE
tIndex : ARRAY OF BYTE;
tValues: ARRAY OF CARDINAL;
VAR
Index : tIndex;
Values: tValues;
CONST
AnyCardinalSize = 65535;
PROCEDURE InitializeArrays;
BEGIN
SetLength(Index, AnyCardinalSize);
SetLength(Values,AnyCardinalSize);
END;
PROCEDURE MoveData(CONST StartPosition,EndPosition,TargetPosition : CARDINAL);
VAR {Note: TargetPosition is ALWAYS larger than EndPosition.}
x, InsertOffset : CARDINAL;
BEGIN
InsertOffset := TargetPosition-StartPosition;
FOR x := StartPosition TO EndPosition DO BEGIN
Values[x+InsertOffset] := Values[Index[x]];
END;
END;
My routine MoveData
work as intended, but I would like to investigate the possibility of replacing it with assembler code in order to gain a little bit of speed, if at all possible.
If somebody has examples of how this can be done in assembler language then I would be most grateful. I did some ASM on the Commodore 64, but that is a lifetime ago.
Most optimizations come from hoisting out invariant parts of code in the expression, and then performing strength reduction to decrease the number of variants. Learn to optimize Pascal before you go to assembler, even if only because disassembled Pascal gives you a skeleton for further assembler optimization and can act as comparison when testing the assembler code.
To better identify constant parts, we convert to a 0 based for loop with n=0..endposition-startposition, the expression then becomes:
Then we factor out the constant parts from the parts that are loop invariant, and introduce pointer syntax. T is the type of the values array, and PT is a pointer to it. Similarly, TI is the type of the index array and PTI is the pointer to it.
Now, startvalue[n] is calculated as addressof(startvalue[0])+n*sizeof(T); However if we memorize the value of the address in the previous loop (in Startvalue) then the difference becomes PT:=PT+sizeof(T);
We transformed a multiplication into an add, but more importantly, we eliminated one variable from the expression (instead of startvalue and x (or n), we now only have startvalue to remember). This is called strength reduction, and we can also apply this to the index array:
Unfortunately this doesn't eliminate the need for the values array altogether, but this loop has three values used in the loop (startvalue,startindex and values[]), while the original had many (index, values,x, startposition,insertoffset). In addition both have something to identify the end of iteration.
In our case that could be when startvalue reaches the end of the array, but for that we have to use a while:
In theory compilers can do these optimizations. C compilers often do when properly instructed, but most Pascal compilers are not yet up to that level. There however is no reason why they couldn't.
The best way to start assembler is to first optimize Pascal a lot and try to comprehend the generated code.
I've been doing image analysis in Delphi for years now, and going to assembler is only to manually correct the compiler not getting it (e.g. not hoisting or picking a suboptimal invariant), or when register allocation goes awol.
The only other case is whole image operations in SSE, but that doesn't apply here.