I am trying to get into optimization and I wanted to ask if I have got the difference between loop vectorization and interleaving right, using these two examples:
void add_arrays(int* a, int* b, int n) {
    for (int i = 0; i < n; ++i) {
        a[i] += b[i];
    }
}
void add_arrays(int* a, int* b, int n) {
    #pragma clang loop vectorize_width(64)
    for (int i = 0; i < n; ++i) {
        a[i] += b[i];
    }
}
When I compile the first one with Clang 5.0 using the flags -O3 -march=core-avx2, the optimization remarks output (like the one from Compiler Explorer) gives me
Passed - vectorized loop (vectorization width: 8, interleaved count: 4)
while for the second one, in which I instructed the compiler to vectorize with a width greater than the one AVX2 natively provides for 32-bit integers (8 lanes), I get
Analysis - the cost-model indicates that interleaving is not beneficial
Passed - vectorized loop (vectorization width: 64, interleaved count: 1)
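For reference, the same remarks can also be printed locally with Clang's -Rpass flags (the command line below is my own illustration, with a hypothetical file name):
clang -O3 -march=core-avx2 -Rpass=loop-vectorize -Rpass-analysis=loop-vectorize -c add_arrays.c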
In order to reproduce the results, I'm linking the Compiler Explorer pages:
If I understand what's going on correctly: when a loop is just vectorized, it means that vector instructions will be executed by the CPU, while if it is also interleaved with a certain degree n, these instructions will additionally be executed in parallel, like in the first code snippet. So when I try to vectorize with a big vectorization width, this is no longer optimal, since it would require too many resources to run width-64 vector instructions in parallel. Is that right (I also see that there are more vector instructions without interleaving)? Or are there more subtleties?
No. This is significantly more complex than that. To see why, we need to understand how modern mainstream processors execute instructions.
Modern mainstream processors are superscalar: they can decode, schedule and execute multiple instructions (of one thread) in parallel on a single core (not to mention these steps are pipelined). More specifically, instructions are decoded into micro-operations (µops), and the µops are then scheduled onto multiple execution units called ports. For example, let's focus on the i5-9600KF CPU (Intel Coffee Lake architecture), which has 4 ALUs, 2 load ports, 1 store port, and 3 ports capable of executing integer AVX2 additions. Ports 0, 1 and 5 can execute both scalar operations and SIMD ones, but not simultaneously. This means this CPU can load 2 values from memory, add them, and store the result in parallel, assuming there are no dependencies (which is the case here).
On this CPU (like on most modern mainstream processors), the instructions of the loop are first decoded by the front-end and then put into a µop cache (so they are not re-decoded over and over again). The µops of the loop (possibly covering multiple iterations) are then sent to the back-end part of the CPU, which schedules them onto the available ports. The flow of µops between units is regulated by (bounded) queues. The CPU scheduler is smart enough to detect the dependencies between µops and to schedule them on ports only when their dependencies are fulfilled. Registers can even be renamed so as to increase the amount of parallelism (e.g. by removing some false dependencies). This is actually even a bit more complex than that, but the point is that µops coming from multiple iterations of a loop can be executed in parallel (as long as they are independent).
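This is what the interleaving reported for the first snippet exploits: with "vectorization width: 8, interleaved count: 4", each loop iteration carries four independent 8-lane load/add/store chains that the scheduler can overlap. Here is a rough intrinsics sketch of that shape (my own illustration of the idea, not Clang's actual output; remainder handling is simplified):

#include <immintrin.h>

void add_arrays_sketch(int* a, int* b, int n) {
    int i = 0;
    for (; i + 32 <= n; i += 32) {
        /* four independent 8-lane chains per iteration */
        __m256i a0 = _mm256_loadu_si256((const __m256i*)(a + i));
        __m256i a1 = _mm256_loadu_si256((const __m256i*)(a + i + 8));
        __m256i a2 = _mm256_loadu_si256((const __m256i*)(a + i + 16));
        __m256i a3 = _mm256_loadu_si256((const __m256i*)(a + i + 24));
        __m256i b0 = _mm256_loadu_si256((const __m256i*)(b + i));
        __m256i b1 = _mm256_loadu_si256((const __m256i*)(b + i + 8));
        __m256i b2 = _mm256_loadu_si256((const __m256i*)(b + i + 16));
        __m256i b3 = _mm256_loadu_si256((const __m256i*)(b + i + 24));
        _mm256_storeu_si256((__m256i*)(a + i),      _mm256_add_epi32(a0, b0));
        _mm256_storeu_si256((__m256i*)(a + i + 8),  _mm256_add_epi32(a1, b1));
        _mm256_storeu_si256((__m256i*)(a + i + 16), _mm256_add_epi32(a2, b2));
        _mm256_storeu_si256((__m256i*)(a + i + 24), _mm256_add_epi32(a3, b3));
    }
    for (; i < n; ++i)   /* scalar remainder */
        a[i] += b[i];
}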
As a result, the target CPU can for example do the following steps in parallel in only 1 cycle (of throughput):
- load the two values of item i+2 from memory;
- add the two values of item i+1;
- store the result of item i in memory.

In practice, one needs to consider the latency of each instruction. This is the job of the µop scheduler. It is very hard for humans to guess how instructions will be scheduled and executed on non-trivial loops, not to mention this is very dependent on the target CPU architecture. That being said, there are tools for that (e.g. LLVM-MCA, uiCA). This is what compilers (like Clang) often do to evaluate the cost of generated assembly code and produce an efficient one.
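For instance, one could dump the assembly and feed it to LLVM-MCA as sketched below (the command lines and file names are my own illustration; -mcpu=skylake is an assumption, since LLVM models Coffee Lake with its Skylake scheduling model):
clang -O3 -march=core-avx2 -S add_arrays.c -o add_arrays.s
llvm-mca -mcpu=skylake -timeline add_arrays.s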
The first code is already pretty well optimized: it is unrolled, so most CPUs should be bound by the back-end (saturated), not the front-end. In fact, the uiCA tool reports the first code saturates the load/store ports of my i5-9600KF CPU. This makes sense: each 8-lane vector iteration needs 2 loads and 1 store, and with 2 load ports and 1 store port the CPU can sustain at most one such iteration per cycle, no matter how the µops are ordered. This means it is already optimal (it might not be on other CPUs, but it looks like it is on all relatively recent Intel architectures, from Sandy Bridge (2011-2013) to Rocket Lake (2021-2024)). Thus, the second code should not be faster. Changing the order of the instructions should not impact the performance of the code (at least not regarding the use of in-core resources, though possibly the memory subsystem). I think this is what the Clang optimizer reports here.
Note that using too high a SIMD width puts a lot of pressure on the SIMD registers. Indeed, the number of AVX2 registers available in the ISA is limited, as is the number of physical SIMD registers. When there are not enough SIMD registers available, the compiler needs to temporarily store some values in memory (and reload them later), which is expensive, especially in this code. This is called register spilling. In such a case, it can be useful to reorder instructions so as to reduce the register pressure. In practice, Clang seems smart enough not to generate such code in the first place here (i.e. with a SIMD width of 64, Clang decides not to unroll the loop further). This typically happens for more complex codes computing temporary values (i.e. requiring more registers per item).
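To make the register accounting concrete: with AVX2, a width-64 vector of int is 64 × 32 = 2048 bits, i.e. 8 physical 256-bit (ymm) registers per logical operand, while the ISA only exposes 16 ymm registers. Here is a minimal sketch of one width-64 add (my own illustration, not Clang's actual output):

#include <immintrin.h>

/* One width-64 add: each logical operand spans 8 ymm chunks. In this
   simple code the chunk registers die right after the store, so even
   width 64 fits in the 16 architectural ymm registers; a loop body
   keeping more temporaries alive per item would be forced to spill. */
void add64_sketch(int* a, const int* b) {
    for (int k = 0; k < 8; ++k) {
        __m256i va = _mm256_loadu_si256((const __m256i*)(a + 8 * k));
        __m256i vb = _mm256_loadu_si256((const __m256i*)(b + 8 * k));
        _mm256_storeu_si256((__m256i*)(a + 8 * k), _mm256_add_epi32(va, vb));
    }
}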