Using Google Benchmark to benchmark the following functions:
int sumElements(int *arr, int count)
{
    int sum = 0;
    for (int i = 0; i < count; ++i) {
        sum += arr[i];
    }
    return sum;
}
int sumElementsUnrolled(int *arr, int count)
{
    int sumA = 0;
    int sumB = 0;
    for (int i = 0; i < count; i += 2) {
        sumA += arr[i];
        sumB += arr[i + 1];
    }
    return sumA + sumB;
}
Benchmark results:
--------------------------------------------------------------
Benchmark Time CPU Iterations
--------------------------------------------------------------
sumElements 965 ns 963 ns 723250
sumElementsUnrolled 641 ns 641 ns 1091406
I tried to benchmark these functions using google benchmark, and got a noticeable benefit at -O0 and -O1. I learned in several places that this is likely due to breaking dependency chains, which, if I understand correctly, leads to fewer pipeline stalls as the CPU is executing my instructions. Looking at the generated assembly at -O1, this makes sense: the unrolled version uses different registers for each sum variable, and the dependency chain is broken.
Questions:
- Is this explanation correct, or am I way off?
- If so, how do I measure pipeline stalls here? Running perf stat shows a much higher rate of stalled-cycles-backend for the unrolled version, but I don't really know if that's related.
- If not, then what is the cause of the speedup? What does breaking a dependency chain do at the CPU level?
The benchmarking code is very simple:
#include <cstdlib>
#include <benchmark/benchmark.h>

#define SIZE 4096

int *initArray(int size)
{
    int *arr = (int*)malloc(size * sizeof(int));
    for (int i = 0; i < size; ++i) {
        arr[i] = rand();
    }
    return arr;
}

void doBenchmark(benchmark::State &s)
{
    int *arr = initArray(SIZE);
    for (auto _ : s) {
        int sum = sumElements(arr, SIZE);
        benchmark::DoNotOptimize(sum);
    }
    free(arr);
}
BENCHMARK(doBenchmark);
I tried identifying the cause of the speedup using perf stat. But I am not sure how to interpret the results.
TL:DR: As 273K said, you made the manual optimization with -O1, and your manual unrolling got in the compiler's way at -O2 and -O3 (mainly because of auto-vectorization).

In the actual compiler-generated asm (not the C++ source), one of the tradeoffs of loop unrolling is a larger I-cache footprint. You can't measure the cost of that with a microbenchmark where only the unrolled loop runs; microbenchmarks in repeat loops make complicated implementations of loops like this or
memcpy look good. The actual cost depends on I-cache pressure in whatever program this is part of.

Another cost is that you might have more cleanup work to do for an odd count, if your function needs to be correct for all possible count values (unlike your test case). E.g. an unroll of 16 makes the worst-case cleanup 15 elements; depending on how you do the cleanup, that could be worse for smallish problems that also only run a few iterations of the unrolled loop.

Yes, at
-O0, where store/reload (store-forwarding latency) bottlenecks are a huge problem, manual unrolling helps for that reason. (Benchmarking at -O0 is normally pointless; the bottlenecks are different from those in normal builds. In this case I think you realize that, although you didn't use register int for any of your variables to let a -O0 build still keep them in registers.)

And at -O1, it maybe allows two scalar adds per clock instead of 1, or at least bottlenecks on front-end throughput instead of on 1 loop iteration per clock from loop-branch throughput.

But at
-O2 / -O3, when auto-vectorization is enabled for clang, manual unrolling usually makes compilers do worse when auto-vectorizing. E.g. they'll do stupid things like shuffling into vectors of odd/even elements to vectorize sumA and sumB separately. They're not smart, just complex pieces of machinery.

Optimal asm for this on x86-64 involves SSE2 paddd (or AVX2 vpaddd to do 32 bytes at once). Unrolling that loop by 2 can help if the data is hot in L1d cache, but that would mean a scalar unroll of 8 or 16. (And like I said, compilers aren't good at rolling up scalar code into SIMD, especially with multiple accumulators, although sometimes it works if the scalar unroll factor matches the vector width. I'd recommend keeping your loops simple.)

Unlike GCC, clang already unrolls tiny loops by default, including when auto-vectorizing, sometimes with multiple accumulators.
Clang 16 -O2 (Godbolt) made this asm for the inner loop of the non-unrolled version. It's 6 uops (assuming macro-fusion of the cmp/jcc, which will happen on Intel and AMD CPUs: https://agner.org/optimize/), so Alder Lake can run it at one iteration per clock cycle, only doing 2 loads per clock out of the 3 it's capable of (https://uops.info/).

Further unrolling, and/or using AVX with non-indexed addressing modes or aligned pointers to allow single-uop memory-source [v]paddd, could reduce overhead enough that we bottleneck on 2 or 3 loads + vector adds per 4 or 5 uops, letting Haswell / Ice Lake / Zen bottleneck on L1d load throughput. (SIMD vector-integer add throughput is at least as high as vector load throughput on all those CPUs, and paddd xmm0, [rdi] is single-uop on all of them, but vpaddd xmm0, xmm0, [rdi + rsi] is not.)

See also: What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?
But to get a compiler to generate further-unrolled asm, you'd want to use intrinsics like _mm_load_si128 (an aligned load, which can fold into a memory source operand for legacy-SSE paddd, unlike the _mm_loadu_si128 unaligned load). Or maybe use profile-guided optimization so the compiler knows this loop is truly hot and will optimize it even more aggressively.

As I guessed, the asm for the manually-unrolled version involves
shufps instructions. It loads 4 vectors and uses 4x shufps as a 2-input shuffle to get the odd vs. even elements from pairs of vectors, feeding 4x paddd instructions.

Also note that your unrolled version is only equivalent for even loop counts, since you didn't write cleanup code for the odd element, and i < count will read past the end of the array instead of stopping early. In the general case, i < (count - (unroll_factor - 1)) will stop after the last full iteration, leaving some elements left over instead of going too far. Notice how that expression simplifies to i < count for an unroll_factor of 1; that's helpful for remembering it.

See also: How to remove "noise" from GCC/clang assembly output? and Matt Godbolt's CppCon2017 talk "What Has My Compiler Done for Me Lately? Unbolting the Compiler's Lid"