I am trying to get into optimization and I wanted to ask if I have got the difference between loop vectorization and interleaving right, using these two examples:
void add_arrays(int* a, int* b, int n) {
    for (int i = 0; i < n; ++i) {
        a[i] += b[i];
    }
}
void add_arrays(int* a, int* b, int n) {
    #pragma clang loop vectorize_width(64)
    for (int i = 0; i < n; ++i) {
        a[i] += b[i];
    }
}
When I compile the first one with Clang 5.0 using the flags -O3 -march=core-avx2, the optimization remarks output (like the one from Compiler Explorer) gives me
Passed - vectorized loop (vectorization width: 8, interleaved count: 4)
while for the second one, in which I instructed the compiler to vectorize with a width greater than the one AVX2 natively provides for 32-bit integers (8 lanes), I get
Analysis - the cost-model indicates that interleaving is not beneficial
Passed - vectorized loop (vectorization width: 64, interleaved count: 1)
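For reference, the same remarks can also be printed locally with Clang's -Rpass flags (the command line below is my own illustration, with a hypothetical file name):
clang -O3 -march=core-avx2 -Rpass=loop-vectorize -Rpass-analysis=loop-vectorize -c add_arrays.c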
In order to reproduce the results, I'm linking the Compiler Explorer pages:
If I understand what's going on correctly: when a loop is just vectorized, it means that vector instructions will be executed by the CPU, while if it is also interleaved with a certain degree n, these instructions will additionally be executed in parallel, like in the first code snippet. So when I try to vectorize with a big vectorization width, this is no longer optimal, since it would require too many resources to run width-64 vector instructions in parallel. Is that right (I also see that there are more vector instructions without interleaving)? Or are there more subtleties?
No. This is significantly more complex than that. To see why, we need to understand how modern mainstream processors execute instructions.
Modern mainstream processors are superscalar: they can decode, schedule and execute multiple instructions (of one thread) in parallel on a single core (not to mention these steps are pipelined). More specifically, instructions are decoded into micro-operations (µops), and the µops are then scheduled onto multiple execution units called ports. For example, let's focus on the i5-9600KF CPU (Intel Coffee Lake architecture), which has 4 ALUs, 2 load ports, 1 store port, and 3 ports capable of executing integer AVX2 additions. Ports 0, 1 and 5 can execute both scalar operations and SIMD ones, but not simultaneously. This means this CPU can load 2 values from memory, add them, and store the result in parallel, assuming there are no dependencies (which is the case here).
On this CPU (like on most modern mainstream processors), the instructions of the loop are first decoded by the front-end and then put into a µop cache (so they are not re-decoded over and over again). The µops of the loop (possibly covering multiple iterations) are then sent to the back-end part of the CPU, which schedules them onto the available ports. The flow of µops between units is regulated by (bounded) queues. The CPU scheduler is smart enough to detect the dependencies between µops and to schedule them on ports only when their dependencies are fulfilled. Registers can even be renamed so as to increase the amount of parallelism (e.g. by removing some false dependencies). This is actually even a bit more complex than that, but the point is that µops coming from multiple iterations of a loop can be executed in parallel (as long as they are independent).
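This is what the interleaving reported for the first snippet exploits: with "vectorization width: 8, interleaved count: 4", each loop iteration carries four independent 8-lane load/add/store chains that the scheduler can overlap. Here is a rough intrinsics sketch of that shape (my own illustration of the idea, not Clang's actual output; remainder handling is simplified):

#include <immintrin.h>

void add_arrays_sketch(int* a, int* b, int n) {
    int i = 0;
    for (; i + 32 <= n; i += 32) {
        /* four independent 8-lane chains per iteration */
        __m256i a0 = _mm256_loadu_si256((const __m256i*)(a + i));
        __m256i a1 = _mm256_loadu_si256((const __m256i*)(a + i + 8));
        __m256i a2 = _mm256_loadu_si256((const __m256i*)(a + i + 16));
        __m256i a3 = _mm256_loadu_si256((const __m256i*)(a + i + 24));
        __m256i b0 = _mm256_loadu_si256((const __m256i*)(b + i));
        __m256i b1 = _mm256_loadu_si256((const __m256i*)(b + i + 8));
        __m256i b2 = _mm256_loadu_si256((const __m256i*)(b + i + 16));
        __m256i b3 = _mm256_loadu_si256((const __m256i*)(b + i + 24));
        _mm256_storeu_si256((__m256i*)(a + i),      _mm256_add_epi32(a0, b0));
        _mm256_storeu_si256((__m256i*)(a + i + 8),  _mm256_add_epi32(a1, b1));
        _mm256_storeu_si256((__m256i*)(a + i + 16), _mm256_add_epi32(a2, b2));
        _mm256_storeu_si256((__m256i*)(a + i + 24), _mm256_add_epi32(a3, b3));
    }
    for (; i < n; ++i)   /* scalar remainder */
        a[i] += b[i];
}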
As a result, the target CPU can for example do the following steps in parallel in only 1 cycle (of throughput):
- load the two values of item i+2 from memory;
- add the two values of item i+1;
- store the result of item i in memory.

In practice, one needs to consider the latency of each instruction. This is the job of the µop scheduler. It is very hard for humans to guess how instructions will be scheduled and executed on non-trivial loops, not to mention this is very dependent on the target CPU architecture. That being said, there are tools for that (e.g. LLVM-MCA, uiCA). This is what compilers (like Clang) often do to evaluate the cost of generated assembly code and produce an efficient one.
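For instance, one could dump the assembly and feed it to LLVM-MCA as sketched below (the command lines and file names are my own illustration; -mcpu=skylake is an assumption, since LLVM models Coffee Lake with its Skylake scheduling model):
clang -O3 -march=core-avx2 -S add_arrays.c -o add_arrays.s
llvm-mca -mcpu=skylake -timeline add_arrays.s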
The first code is already pretty well optimized: it is unrolled, so most CPUs should be bound by the back-end (saturated), not the front-end. In fact, the uiCA tool reports the first code saturates the load/store ports of my i5-9600KF CPU. This makes sense: each 8-lane vector iteration needs 2 loads and 1 store, and with 2 load ports and 1 store port the CPU can sustain at most one such iteration per cycle, no matter how the µops are ordered. This means it is already optimal (it might not be on other CPUs, but it looks like it is on all relatively recent Intel architectures, from Sandy Bridge (2011-2013) to Rocket Lake (2021-2024)). Thus, the second code should not be faster. Changing the order of the instructions should not impact the performance of the code (at least not regarding the use of in-core resources, though possibly the memory subsystem). I think this is what the Clang optimizer reports here.
Note that using too high a SIMD width puts a lot of pressure on the SIMD registers. Indeed, the number of AVX2 registers available in the ISA is limited, as is the number of physical SIMD registers. When there are not enough SIMD registers available, the compiler needs to temporarily store some values in memory (and reload them later), which is expensive, especially in this code. This is called register spilling. In such a case, it can be useful to reorder instructions so as to reduce the register pressure. In practice, Clang seems smart enough not to generate such code in the first place here (i.e. with a SIMD width of 64, Clang decides not to unroll the loop further). This typically happens for more complex codes computing temporary values (i.e. requiring more registers per item).
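To make the register accounting concrete: with AVX2, a width-64 vector of int is 64 × 32 = 2048 bits, i.e. 8 physical 256-bit (ymm) registers per logical operand, while the ISA only exposes 16 ymm registers. Here is a minimal sketch of one width-64 add (my own illustration, not Clang's actual output):

#include <immintrin.h>

/* One width-64 add: each logical operand spans 8 ymm chunks. In this
   simple code the chunk registers die right after the store, so even
   width 64 fits in the 16 architectural ymm registers; a loop body
   keeping more temporaries alive per item would be forced to spill. */
void add64_sketch(int* a, const int* b) {
    for (int k = 0; k < 8; ++k) {
        __m256i va = _mm256_loadu_si256((const __m256i*)(a + 8 * k));
        __m256i vb = _mm256_loadu_si256((const __m256i*)(b + 8 * k));
        _mm256_storeu_si256((__m256i*)(a + 8 * k), _mm256_add_epi32(va, vb));
    }
}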