Suppose we have several repetitions of the same asm block containing RDTSC, such as:
volatile size_t tick1;
asm ( "rdtsc\n" // Returns the time in EDX:EAX.
"shl $32, %%rdx\n" // Shift the upper bits left.
"or %%rdx, %q0" // 'Or' in the lower bits.
: "=a" (tick1)
:
: "rdx");
this_thread::sleep_for(1s);
volatile size_t tick2;
asm ( "rdtsc\n" // clang's optimizer just thinks this asm yields
"shl $32, %%rdx\n" // the same bits as above, so it just loads
"or %%rdx, %q0" // the result to qword ptr [rsp + 8]
: "=a" (tick2) //
: // mov qword ptr [rsp + 8], rbx
: "rdx");
printf("tick2 - tick1 diff : %zu cycles\n", tick2 - tick1);
printf("CPU Clock Speed : %.2f GHz\n\n", (double) (tick2 - tick1) / 1'000'000'000.);
Clang++'s optimizer (even at `-O1`) assumes those two asm blocks yield the same value:
tick2 - tick1 diff : 0 cycles
CPU Clock Speed : 0.00 GHz
tick1 : bd806adf8b2
this_thread::sleep_for(1s)
tick2 : bd806adf8b2
With Clang's optimizer turned off, the second block yields advancing ticks as expected:
tick2 - tick1 diff : 2900160778 cycles
CPU Clock Speed : 2.90 GHz
tick1 : 14ab6ab3391c
this_thread::sleep_for(1s)
tick2 : 14ac17902a26
At first, GCC's g++ "seems" not to be affected by this:
tick2 - tick1 diff : 2900226898 cycles
CPU Clock Speed : 2.90 GHz
tick1 : 20e40010d8a8
this_thread::sleep_for(1s)
tick2 : 20e4aceecbfa
[LIVE]
However, let's add tick3 with exactly the same asm right after tick2:
volatile size_t tick1;
asm ( "rdtsc\n" // Returns the time in EDX:EAX.
"shl $32, %%rdx\n" // Shift the upper bits left.
"or %%rdx, %q0" // 'Or' in the lower bits.
: "=a" (tick1)
:
: "rdx");
this_thread::sleep_for(1s);
volatile size_t tick2;
asm ( "rdtsc\n" // clang's optimizer just thinks this asm yields
"shl $32, %%rdx\n" // the same bits as above, so it just loads
"or %%rdx, %q0" // the result to qword ptr [rsp + 8]
: "=a" (tick2) //
: // mov qword ptr [rsp + 8], rbx
: "rdx");
volatile size_t tick3;
asm ( "rdtsc\n"
"shl $32, %%rdx\n"
"or %%rdx, %q0"
: "=a" (tick3)
:
: "rdx");
It turns out that GCC assumes tick3's asm must produce the same value as tick2's because there are "obviously" no external side effects, so it just reloads the value from tick2. The result is wrong here, but the compiler has a strong point: nothing in the asm statement tells it otherwise.
tick2 - tick1 diff : 2900209182 cycles
CPU Clock Speed : 2.90 GHz
tick1 : 5670bd15088e
this_thread::sleep_for(1s)
tick2 : 567169f2b6ac
tick3 : 567169f2b6ac
[LIVE]
In C mode, the optimizers of both GCC and Clang are affected by this.
In other words, even at `-O1` both optimize out the repeated asm blocks containing rdtsc:
tick2 - tick1 diff : 0 cycles
CPU Clock Speed : 0.00 GHz
tick1 : 324ab8f5dd2a
thrd_sleep(&(struct timespec){.tv_sec=1}, nullptr)
tick2 : 324ab8f5dd2a
tick3_rdx : 324b65d3368c
[LIVE]
It turns out that all optimizers can do common-subexpression elimination on identical non-volatile asm statements, so an asm statement for RDTSC needs to be volatile.
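A minimal sketch of that fix, assuming x86-64 and GCC or Clang extended asm. The helper names `read_tsc` and `ticks_during_sleep` are hypothetical and only serve the illustration; the asm body is the same as in the snippets above, with `volatile` added.

```cpp
#include <chrono>
#include <cstdint>
#include <thread>

// Hypothetical helper: reads the time-stamp counter.
// `volatile` marks the asm as having effects beyond its outputs, so the
// optimizer may not merge or drop identical statements.
static inline std::uint64_t read_tsc() {
    std::uint64_t ticks;
    asm volatile("rdtsc\n\t"           // Returns the time in EDX:EAX.
                 "shl $32, %%rdx\n\t"  // Shift the upper bits left.
                 "or %%rdx, %q0"       // 'Or' them into the lower bits.
                 : "=a"(ticks)
                 :
                 : "rdx", "cc");
    return ticks;
}

// Hypothetical helper: ticks elapsed while the thread sleeps for `d`.
static std::uint64_t ticks_during_sleep(std::chrono::milliseconds d) {
    const std::uint64_t start = read_tsc();
    std::this_thread::sleep_for(d);
    const std::uint64_t end = read_tsc();  // a real second read, not a reload
    return end - start;
}
```

Calling `ticks_during_sleep(std::chrono::milliseconds(1000))` should then report a non-zero tick count at any optimization level, in both C++ and (with the equivalent C code) C mode.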
Inline assembly is not covered by the C++ standard, so I'm not sure what your definition of "legal" is here. The behavior you are seeing makes sense to me though, because you are running inline assembly for its side effects (i.e. your assembly doesn't implement a pure function) and you forgot to use the `volatile` keyword. From the GCC inline assembly documentation:
Also:
If you insert the `volatile` keyword immediately after `asm`, the problem goes away.
P.S. Instead of using inline assembly, just include `x86intrin.h` and then use the `__rdtsc()` function.
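As a rough sketch of the intrinsic route, assuming GCC or Clang on x86 where `x86intrin.h` provides `__rdtsc()`. The wrapper name `tsc_ticks_while_sleeping` is hypothetical and not part of any library.

```cpp
#include <x86intrin.h>  // GCC/Clang header that provides __rdtsc()

#include <chrono>
#include <cstdint>
#include <thread>

// Hypothetical helper: TSC ticks elapsed while the thread sleeps for `d`.
// The compiler does not treat __rdtsc() as a pure function, so the two
// reads are not merged into one.
static std::uint64_t tsc_ticks_while_sleeping(std::chrono::milliseconds d) {
    const std::uint64_t start = __rdtsc();
    std::this_thread::sleep_for(d);
    const std::uint64_t end = __rdtsc();
    return end - start;
}
```

With a one-second sleep, dividing the returned value by 1'000'000'000. gives the same rough GHz estimate as the `printf` in the question, without any hand-written asm constraints to get wrong.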