Do the Intel C++ compiler and/or GCC support the following Intel intrinsics, like MSVC has since 2012 / 2013?
#include <immintrin.h> // for the following intrinsics
int _rdrand16_step(uint16_t*);
int _rdrand32_step(uint32_t*);
int _rdrand64_step(uint64_t*);
int _rdseed16_step(uint16_t*);
int _rdseed32_step(uint32_t*);
int _rdseed64_step(uint64_t*);
And if these intrinsics are supported, since which version are they supported (with compile-time-constant please)?
All the major compilers support Intel's intrinsics for rdrand and rdseed via <immintrin.h>. Somewhat recent versions of some compilers are needed for rdseed, e.g. GCC9 (2019) or clang7 (2018), although those have been stable for a good while by now. If you'd rather use an older compiler, or not enable ISA-extension options like -march=skylake, a library wrapper function (see Footnote 1) instead of the intrinsic is a good choice. (Inline asm is not necessary; I wouldn't recommend it unless you want to play with it.)
Some compilers define __RDRND__ when the instruction is enabled at compile-time. GCC/clang have done so since they supported the intrinsic at all, but ICC only much later (19.0). And with ICC, -march=ivybridge doesn't imply -mrdrnd or define __RDRND__ until 2021.1.
ICX is LLVM-based and behaves like clang.
MSVC doesn't define any macros; its handling of intrinsics is designed around runtime feature detection only, unlike gcc/clang where the easy way is compile-time CPU feature options.
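As a concrete reference point for the notes that follow, here is a minimal sketch of a retry-loop wrapper for GCC/clang on x86-64. The names cpu_has_rdrand and rdrand64 are invented for this illustration. The per-function target attribute is there only so the sketch builds without -mrdrnd or a -march= option; if you compile with those options you can drop it.

```c
#include <cpuid.h>      // GCC/clang helper header: __get_cpuid()
#include <immintrin.h>  // _rdrand64_step()
#include <stdint.h>

// Runtime check: CPUID leaf 1, ECX bit 30 reports RDRAND support.
static inline int cpu_has_rdrand(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;                     // CPUID leaf 1 not available
    return (int)((ecx >> 30) & 1u);
}

// Retry until the instruction reports success (the intrinsic returns 1).
// Only call this after cpu_has_rdrand() returned 1, or when the build
// baseline is already known to include RDRAND.
__attribute__((target("rdrnd")))
static inline uint64_t rdrand64(void)
{
    unsigned long long ret;           // must match the intrinsic's pointer type
    do {} while (!_rdrand64_step(&ret));
    return ret;
}
```

A caller would typically check cpu_has_rdrand() once at startup and fall back to another entropy source if it returns 0.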
Why do{}while() instead of while(){}? Turns out ICC compiles to a less-dumb loop with do{}while(), not uselessly peeling a first iteration. Other compilers don't benefit from that hand-holding, and it's not a correctness problem for ICC.
Why unsigned long long instead of uint64_t? The type has to agree with the pointer type expected by the intrinsic, or C and especially C++ compilers will complain, regardless of the object-representations being identical (64-bit unsigned). On Linux for example, uint64_t is unsigned long, but GCC/clang's immintrin.h define int _rdrand64_step(unsigned long long*), same as on Windows. So you always need unsigned long long ret with GCC/clang. MSVC is a non-problem as it can (AFAIK) only target Windows, where unsigned long long is the only 64-bit unsigned type.
But ICC defines the intrinsic as taking unsigned long* when compiling for GNU/Linux, according to my testing on https://godbolt.org/. So to be portable to ICC, you actually need #ifdef __INTEL_COMPILER; even in C++ I don't know a way to use auto or other type-deduction to declare a variable that matches it.
Compiler versions to support intrinsics
Tested on Godbolt; its earliest version of MSVC is 2015, and ICC 2013, so I can't go back any further. Support for _rdrand16_step / 32 / 64 was introduced at the same time in any given compiler. 64 requires 64-bit mode.
Only fragments of the version table survive:
rdrand: ... -mrdrnd defining __RDRND__. 2021.1 for -march=ivybridge to enable -mrdrnd
rdseed: (... -mrdrnd and -mrdseed options)
The earliest GCC and clang versions don't recognize
-march=ivybridge, only -mrdrnd. (GCC 4.9 and clang 3.6 for Ivy Bridge, not that you specifically want to use Ivy Bridge if modern CPUs are more relevant. So use a non-ancient compiler and set a CPU option appropriate for CPUs you actually care about, or at least a -mtune= with a more recent CPU.)
Intel's new oneAPI / ICX compilers all support rdrand / rdseed, and are based on LLVM internals so they work similarly to clang for CPU options. (ICX doesn't define __INTEL_COMPILER, which is good because it's different from ICC.)
GCC and clang only let you use intrinsics for instructions you've told the compiler the target supports. Use
-march=native if compiling for your own machine, or use -march=skylake or something to enable all the ISA extensions for the CPU you're targeting. But if you need your program to run on old CPUs and only use RDRAND or RDSEED after runtime detection, only those functions need __attribute__((target("rdrnd"))) or rdseed, and they won't be able to inline into functions with different target options. Or using a separately-compiled library would be easier (see Footnote 1).
-mrdrnd: enabled by -march=ivybridge or -march=znver1 (or bdver4 Excavator APUs) and later
-mrdseed: enabled by -march=broadwell or -march=znver1 or later
Normally if you're going to enable one CPU feature, it makes sense to enable others that CPUs of that generation will have, and to set tuning options. But rdrand isn't something the compiler will use on its own (unlike BMI2 shlx for more efficient variable-count shifts, or AVX/SSE for auto-vectorization and array/struct copying and init). So enabling -mrdrnd globally likely won't make your program crash on pre-Ivy Bridge CPUs, if you check CPU features and don't actually run code that uses _rdrand64_step on CPUs without the feature.
But if you are only going to run your code on some specific kind of CPU or later,
gcc -O3 -march=haswell is a good choice. (-march also implies -mtune=haswell, and tuning for Ivy Bridge specifically is not what you want for modern CPUs. You could use -march=ivybridge -mtune=skylake to set an older baseline of CPU features, but still tune for newer CPUs.)
Wrappers that compile everywhere
This is valid C++ and C. For C, you probably want static inline instead of inline so you don't need to manually instantiate an extern inline version in a .c in case a debug build decided not to inline. (Or use __attribute__((always_inline)) in GNU C.)
The 64-bit versions are only defined for x86-64 targets, because asm instructions can only use 64-bit operand-size in 64-bit mode. I didn't #ifdef __RDRND__ or #if defined(__i386__)||defined(__x86_64__), on the assumption that you'd only include this for x86(-64) builds at all, not cluttering the ifdefs more than necessary. It does only define the rdseed wrappers if that's enabled at compile time, or for MSVC where there's no way to enable them or to detect it.
There are some commented __attribute__((target("rdseed"))) examples you can uncomment if you want to do it that way instead of compiler options.
rdrand16 / rdseed16 are intentionally omitted as not being normally useful. rdrand runs the same speed for different operand-sizes, and even pulls the same amount of data from the CPU's internal RNG buffer, optionally throwing away part of it for you.
The fact that Intel's intrinsics API is supported at all implies that
unsigned int is a 32-bit type, regardless of whether uint32_t is defined as unsigned int or unsigned long, if any compilers do that.
On the Godbolt compiler explorer we can see how these compile. Clang and MSVC do what we'd expect, just a 2-instruction loop until rdrand leaves CF=1.
Unfortunately GCC is not so good; even current GCC12.1 makes weird asm:
ICC makes the same asm as long as we use a do{}while() retry loop; with a while() {} it's even worse, doing an rdrand and checking before entering the loop for the first time.
Footnote 1: rdrand / rdseed library wrappers
librdrand or Intel's libdrng have wrapper functions with retry loops like I showed, and ones that fill a buffer of bytes or array of uint32_t* or uint64_t*. (Consistently taking uint64_t*, no unsigned long long* on some targets.)
A library is also a good choice if you're doing runtime CPU feature detection, so you don't have to mess around with
__attribute__((target)) stuff. However you do it, that limits inlining of a function using the intrinsics anyway, so a small static library is equivalent.
libdrng also provides RdRand_isSupported() and RdSeed_isSupported(), so you don't need to do your own CPUID check.
But if you're going to build with -march= something newer than Ivy Bridge / Broadwell or Excavator / Zen1 anyway, inlining a 2-instruction retry loop (like clang compiles it to) is about the same code-size as a function call-site, but doesn't clobber any registers. rdrand is quite slow so that's probably not a big deal, but it also means no extra library dependency.
Performance / internals of rdrand / rdseed
For more details about the HW internals on Intel (not AMD's version), see Intel's docs. For the actual TRNG logic, see Understanding Intel's Ivy Bridge Random Number Generator - it's a metastable latch that settles to 0 or 1 due to thermal noise. Or at least Intel says it is; it's basically impossible to truly verify where the rdrand bits actually come from in a CPU you bought. Worst case, still much better than nothing if you're mixing it with other entropy sources, like Linux does for /dev/random.
For more on the fact that there's a buffer that cores pull from, see some SO answers from the engineer who designed the hardware and wrote librdrand, such as this and this about its exhaustion / performance characteristics on Ivy Bridge, the first generation to feature it.
Infinite retry count?
The asm instructions set the carry flag (CF) = 1 in FLAGS on success, when they put a random number in the destination register. Otherwise CF=0 and the output register = 0. You're intended to call it in a retry loop; that's (I assume) why the intrinsic has the word step in the name: it's one step of generating a single random number.
In theory, a microcode update could change things so it always indicates failure, e.g. if a problem is discovered in some CPU model that makes the RNG untrustworthy (by the standards of the CPU vendor). The hardware RNG also has some self-diagnostics, so it's in theory possible for a CPU to decide that the RNG is broken and not produce any outputs. I haven't heard of any CPUs ever doing this, but I haven't gone looking. And a future microcode update is always possible.
Either of these could lead to an infinite retry loop. That's not great, but unless you want to write a bunch of code to report on that situation, it's at least an observable behaviour that users could potentially deal with in the unlikely event it ever happened.
But occasional temporary failure is normal and expected, and must be handled. Preferably by retrying without telling the user about it.
If there wasn't a random number ready in its buffer, the CPU can report failure instead of stalling this core for potentially even longer. That design choice might be related to interrupt latency, or just keeping it simpler without having to build retrying into the microcode.
Ivy Bridge can't pull data from the DRNG faster than it can keep up, according to the designer, even with all cores looping rdrand, but later CPUs can. Therefore it is important to actually retry.
@jww has had some experience with deploying rdrand in libcrypto++, and found that with a retry count set too low, there were reports of occasional spurious failure. He's had good results from infinite retries, which is why I chose that for this answer. (I suspect he would have heard reports from users with broken CPUs that always fail, if that was a thing.)
Intel's library functions that include a retry loop take a retry count. That's likely to handle the permanent-failure case which, as I said, I don't think happens in any real CPUs yet. Without a limited retry count, you'd loop forever.
An infinite retry count allows a simple API returning the number by value, without silly limitations like OpenSSL's functions that use 0 as an error return: they can't randomly generate a 0!
If you did want a finite retry count, I'd suggest very high. Like maybe 1 million, so it takes maybe half a second or a second of spinning to give up on a broken CPU, with negligible chance of having one thread starve that long if it's repeatedly unlucky in contending for access to the internal queue.
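If you do prefer a bounded loop, a sketch could look like the following. The name rdrand64_bounded and its signature are invented for this example (this is not Intel's library API), and it assumes GCC/clang on x86-64. The target attribute is only there so it builds without -mrdrnd.

```c
#include <immintrin.h>  // _rdrand64_step()

// Try up to max_tries times. Returns 1 and stores a random number in *out
// on success, or returns 0 if every attempt reported failure.
// The caller is responsible for checking that the CPU has RDRAND at all.
__attribute__((target("rdrnd")))
static inline int rdrand64_bounded(unsigned long long *out, unsigned long max_tries)
{
    while (max_tries-- > 0) {
        if (_rdrand64_step(out))
            return 1;      // success: *out holds the result
    }
    return 0;              // gave up; treat *out as invalid
}
```

A call such as rdrand64_bounded(&x, 1000000UL) matches the "very high" count suggested above. The price is that every caller now has to handle the failure return, which is exactly the API complication the infinite-retry version avoids.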
https://uops.info/ measured a throughput of one per 3554 cycles on Skylake, one per 1352 on Alder Lake P-cores, and one per 1230 on E-cores. One per 1809 cycles on Zen2. The Skylake version ran thousands of uops; the others were in the low double digits. Ivy Bridge had 110 cycle throughput, but on Haswell it was already up to 2436 cycles, though still with a double-digit number of uops.
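To put those cycle counts in perspective, here is a rough conversion to bandwidth. It assumes 64-bit rdrand (8 bytes per result), a single core, and a hypothetical 4 GHz core clock that is not part of the measurements; the function name is made up for this sketch.

```c
// Bytes per second one core can pull, given the reciprocal throughput in
// core clock cycles per result and the core clock in Hz.
static inline double rdrand_bytes_per_second(double cycles_per_result,
                                             double bytes_per_result,
                                             double core_hz)
{
    return core_hz / cycles_per_result * bytes_per_result;
}
```

With the Skylake figure, rdrand_bytes_per_second(3554, 8, 4e9) comes out at about 9.0 MB/s. The Ivy Bridge figure of 110 cycles would give about 291 MB/s at the same hypothetical clock.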
These abysmal performance numbers on recent Intel CPUs are probably due to microcode updates to work around problems that weren't anticipated when the HW was designed. Agner Fog measured one per 460 cycle throughput for rdrand and rdseed on Skylake when it was new, each costing 16 uops. The thousands of uops are probably extra buffer flushing hooked into the microcode for those instructions by recent updates. Agner measured Haswell at 17 uops, 320 cycles when it was new. See RdRand Performance As Bad As ~3% Original Speed With CrossTalk/SRBDS Mitigation on Phoronix:
Locking the memory bus sounds like it could hurt performance even of other cores, if it's like cache-line splits for locked instructions.
(Those cycle numbers are core clock cycle counts; if the DRNG doesn't run on the same clock as the core, those might vary by CPU model. I wonder if uops.info's testing is running rdrand on multiple cores of the same hardware, since Coffee Lake is twice the uops as Skylake, and 1.4x as many cycles per random number. Unless that's just higher clocks leading to more microcode retries?)