I've written some code for profiling small functions. At a high level, it:
- Sets the thread affinity to only one core and the thread priority to maximum.
- Computes statistics by doing the following 100 times:
- Estimate the latency of a function that does nothing.
- Estimate the latency of the test function.
- Subtract the first from the second to remove the function-call overhead, thereby roughly getting the cost of the test function's contents.
To estimate the latency of a function, it:
- Invalidates caches (this is difficult to actually do in user mode, but I allocate and write a buffer the size of the L3 cache, which should maybe help).
- Yields the thread, so that the profile loop suffers as few context switches as possible.
- Gets the current time from a `std::chrono::high_resolution_clock` (which seems to compile to `system_clock`).
- Runs the profile loop 100,000,000 times, calling the tested function within.
- Gets the current time from a `std::chrono::high_resolution_clock` again and subtracts to get the latency.
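Put together, each latency estimate looks roughly like this (a sketch of the structure, not my exact code; `invalidate_caches` stands in for the L3-sized buffer write described above, and the affinity/priority setup is omitted):

```c++
#include <chrono>
#include <cstddef>
#include <ratio>
#include <sched.h>
#include <vector>

// Stand-in for the cache-thrashing step: write one byte per cache line
// of an L3-sized buffer (the size here is a guess; adjust for the CPU).
static void invalidate_caches() {
    static std::vector<char> buf(12 * 1024 * 1024);
    for (std::size_t i = 0; i < buf.size(); i += 64)
        buf[i]++;
}

// Returns the mean nanoseconds per call, including loop overhead;
// the empty-function estimate is subtracted from this by the caller.
static double estimate_latency(float (*fn)(float)) {
    const long iters = 100000000;
    invalidate_caches();
    sched_yield();  // hope to start at the beginning of a timeslice
    auto t0 = std::chrono::high_resolution_clock::now();
    for (long i = 0; i < iters; ++i)
        fn(1.25f);  // indirect call keeps the compiler from inlining fn
    auto t1 = std::chrono::high_resolution_clock::now();
    return std::chrono::duration<double, std::nano>(t1 - t0).count() / iters;
}
```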
Because individual instructions matter at this level, at all points we have to write very careful code to ensure that the compiler doesn't elide, inline, cache, or otherwise treat the functions differently. I have manually validated the generated assembly in various test cases, including the one I present below.
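For illustration, this is the kind of barrier I mean (a representative trick, not necessarily verbatim from my code): laundering the function pointer through an empty asm statement so the compiler cannot inline or constant-fold the call.

```c++
// Representative compiler barrier (GCC/Clang): after this, the compiler
// must assume fn could point to any function, so the call through the
// returned pointer stays an indirect call.
template <typename F>
F* opaque(F* fn) {
    asm volatile("" : "+r"(fn));
    return fn;
}

// usage: float y = opaque(&sqrt_sseinstr)(2.0f);
```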
I am getting extremely low (sub-nanosecond) latencies reported in some cases. I have tried everything I can think of to account for this, but cannot find an error.
I am looking for an explanation accounting for this behavior. Why are my profiled functions taking so little time?
Let's take the example of computing a square root for `float`.
The function signature is `float (*)(float)`, and the empty function is trivial:
```
empty_function(float):
    ret
```
Let's compute the square root by using the `sqrtss` instruction, and by the multiplication-by-reciprocal-square-root hack. I.e., the tested functions are:
```
sqrt_sseinstr(float):
    sqrtss  xmm0, xmm0
    ret

sqrt_rcpsseinstr(float):
    movaps  xmm1, xmm0
    rsqrtss xmm1, xmm0
    mulss   xmm0, xmm1
    ret
```
Here's the profile loop. Again, this same code is called with the empty function and with the test functions:
```
double profile(float):
    ...
    mov    rbp, rdi
    push   rbx
    mov    ebx, 0x5f5e100
    call   1c20 <invalidate_caches()>
    call   1110 <sched_yield()>
    call   1050 <std::chrono::high_resolution_clock::now()>
    mov    r12, rax
    xchg   ax, ax
15b0:
    movss  xmm0, DWORD PTR [rip+0xba4]
    call   rbp
    sub    rbx, 0x1
    jne    15b0 <double profile(float)+0x20>
    call   1050 <std::chrono::high_resolution_clock::now()>
    ...
```
The timing result for `sqrt_sseinstr(float)` on my Intel 990X is 3.60±0.13 nanoseconds. At this processor's rated 3.46 GHz, that works out to 12.45±0.44 cycles. This seems pretty spot-on, given that the docs put the latency of `sqrtss` at around 13 cycles (it isn't listed for this processor's Nehalem architecture, but it seems likely to also be around 13 cycles).
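(For the conversion: 3.60 ns × 3.46 cycles/ns ≈ 12.45 cycles, since at 3.46 GHz one cycle takes about 0.289 ns.)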
The timing result for `sqrt_rcpsseinstr(float)` is stranger: 0.01±0.07 nanoseconds (or 0.02±0.24 cycles). This is flatly implausible unless another effect is going on.
I thought perhaps the processor is able to hide the latency of the tested function, partially or even entirely, because the tested function uses different instruction ports (i.e. superscalarity is hiding something)? I tried to analyze this by hand, but didn't get very far because I didn't really know what I was doing.
(Note: I cleaned up some of the assembly notation for your convenience. An unedited objdump of the whole program, which includes several other variants, is here, and I am temporarily hosting the binary here (x86-64 SSE2+, Linux).)
The question, again: why are some profiled functions producing implausibly small values? And if it is a higher-order effect, what explains it?
Your benchmarking approach is fundamentally wrong, and your "careful code" is bogus.
First, emptying the cache is bogus. Not only will it quickly be repopulated with the required data, but the examples you have posted also have very little memory interaction (the only cache accesses are by `call`/`ret` and a load we'll get to).

Second, yielding before the benchmarking loop is bogus. You iterate 100,000,000 times, which even on a reasonably fast modern processor will take longer than typical scheduling clock interrupts on a stock operating system. If, on the other hand, you disable scheduling clock interrupts, then yielding before the benchmark doesn't do anything.
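(Ballpark: even at ~1 ns per iteration, 100,000,000 iterations take on the order of 0.1 seconds, while a stock kernel's scheduler tick fires every 1-10 ms, so the loop will be preempted many times regardless of the yield.)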
Now that the useless incidental complexity is out of the way, on to the fundamental misunderstanding of modern CPUs:
You expect `loop_time_gross / loop_count` to be the time spent in each loop iteration. This is wrong: modern CPUs do not execute instructions one after the other, sequentially. Modern CPUs pipeline, predict branches, execute multiple instructions in parallel, and (reasonably fast CPUs) out of order.

So after the first handful of iterations of the benchmarking loop, all branches are perfectly predicted for the next almost 100,000,000 iterations. This enables the CPU to speculate. Effectively, the conditional branch in the benchmarking loop goes away, as does most of the cost of the indirect call. Effectively, the CPU can unroll the loop; the instruction stream it actually executes looks something like this (a schematic reconstruction from the profile loop above):
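```
# schematic, sqrtss version
movss  xmm0, DWORD PTR [rip+0xba4]
sqrtss xmm0, xmm0
movss  xmm0, DWORD PTR [rip+0xba4]
sqrtss xmm0, xmm0
movss  xmm0, DWORD PTR [rip+0xba4]
sqrtss xmm0, xmm0
...
```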
or, for the other loop:
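```
# schematic, rsqrtss version
movss   xmm0, DWORD PTR [rip+0xba4]
movaps  xmm1, xmm0
rsqrtss xmm1, xmm0
mulss   xmm0, xmm1
movss   xmm0, DWORD PTR [rip+0xba4]
movaps  xmm1, xmm0
rsqrtss xmm1, xmm0
mulss   xmm0, xmm1
...
```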
Notably, the load of `number` is always from the same address (thus quickly cached), and it overwrites the just-computed value, breaking the dependency chain.

To be fair, the loop-counter instructions `sub rbx, 0x1` / `jne` are still executed, but the only resources they take from the floating-point code are decode/micro-op cache and execution ports. Notably, while the integer loop code has a dependency chain (ensuring a minimum execution time), the floating-point code does not carry a dependency on it. Furthermore, the floating-point code consists of many mutually totally independent short dependency chains.
Where you expect the CPU to execute instructions sequentially, the CPU can instead execute them in parallel.
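The difference between latency-bound and throughput-bound code is easy to demonstrate in isolation. Here is a small self-contained sketch (mine, not from the question); timing the two loops separately should show the second finishing far faster per multiply than the first:

```c++
#include <cstdio>

// Sketch: one long dependency chain vs. several independent ones.
// Compile with optimizations but without auto-vectorization
// (e.g. -O2 -fno-tree-vectorize) to observe the effect.
int main() {
    const int n = 100000000;
    const float k = 0.9999999f;

    // Latency-bound: every multiply needs the previous result, so this
    // loop runs at roughly (mulss latency) cycles per iteration.
    float x = 1.0f;
    for (int i = 0; i < n; ++i)
        x *= k;

    // Throughput-bound: four independent chains can overlap in the
    // pipeline, approaching (mulss reciprocal throughput) per multiply.
    float a = 1.0f, b = 1.0f, c = 1.0f, d = 1.0f;
    for (int i = 0; i < n; ++i) {
        a *= k; b *= k; c *= k; d *= k;
    }

    std::printf("%g %g\n", x, a + b + c + d);  // keep the results live
    return 0;
}
```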
A small look at https://agner.org/optimize/instruction_tables.pdf reveals why this parallel execution doesn't work for `sqrtss` on Nehalem: the tables list its reciprocal throughput as equal to its latency, i.e., the instruction cannot be pipelined and only runs on one execution port.

In contrast, `movaps`, `rsqrtss`, and `mulss` each have a latency of only a few cycles and a reciprocal throughput of at most 2. The maximum reciprocal throughput of the dependency chain is therefore 2, so you can expect the code to finish executing one dependency chain every 2 cycles in the steady state. At this point, the execution time of the floating-point part of the benchmarking loop is less than or equal to the loop overhead and is overlapped with it, so your naive approach of subtracting the loop overhead leads to nonsensical results.
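(Concretely: 2 cycles / 3.46 GHz ≈ 0.58 ns of floating-point work per iteration. If the empty-function loop costs at least that much per iteration, the subtraction is just measuring noise, which is consistent with the reported 0.01±0.07 ns.)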
If you wanted to do this properly, you would ensure that separate loop iterations depend on each other, for example by changing your benchmarking loop so that each iteration consumes the previous one's result. A sketch, with `benchmarked_function` and `INITIAL_VALUE` as placeholders:
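```c++
// Serialized benchmark loop: out-of-order execution cannot hide the
// function's latency, because each call waits for the previous result.
float run_chained(float (*benchmarked_function)(float)) {
    const float INITIAL_VALUE = 1.25f;  // placeholder input
    float x = INITIAL_VALUE;
    for (long i = 0; i < 100000000; ++i)
        x = benchmarked_function(x);
    return x;  // keep x live so the loop is not discarded
}
```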
Obviously you will not benchmark the same input this way, unless `INITIAL_VALUE` is a fixed point of `benchmarked_function()`. However, you can arrange for it to be a fixed point of an expanded function by computing `float diff = INITIAL_VALUE - benchmarked_function(INITIAL_VALUE);` and then making the loop compute `diff + benchmarked_function(x)` instead, with relatively minor overhead. You should then ensure that floating-point errors do not accumulate to significantly change the value passed to `benchmarked_function()`. In sketch form (same placeholders as above):
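```c++
// Fixed-point variant: by construction, diff + benchmarked_function(INITIAL_VALUE)
// equals INITIAL_VALUE, so x stays (up to rounding) at INITIAL_VALUE on every
// iteration while the loop-carried dependency is preserved.
float run_fixed_point(float (*benchmarked_function)(float)) {
    const float INITIAL_VALUE = 1.25f;  // placeholder input
    const float diff = INITIAL_VALUE - benchmarked_function(INITIAL_VALUE);
    float x = INITIAL_VALUE;
    for (long i = 0; i < 100000000; ++i)
        x = diff + benchmarked_function(x);  // one extra add in the chain
    return x;
}
```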