SSE2 has instructions for converting vectors between single-precision floats and 32-bit integers.
_mm_cvtps_epi32()
_mm_cvtepi32_ps()
But there are no equivalents for double-precision and 64-bit integers. In other words, they are missing:
_mm_cvtpd_epi64()
_mm_cvtepi64_pd()
It seems that AVX doesn't have them either.
What is the most efficient way to simulate these intrinsics?
There's no single instruction until AVX512, which added conversion to/from 64-bit integers, signed or unsigned. (Also support for conversion to/from 32-bit unsigned.) See intrinsics like _mm512_cvtpd_epi64 and the narrower AVX512VL versions, like _mm256_cvtpd_epi64.
If you only have AVX2 or less, you'll need tricks like below for packed-conversion. (For scalar, x86-64 has scalar int64_t <-> double or float from SSE2, but scalar uint64_t <-> FP requires tricks until AVX512 adds unsigned conversions. Scalar 32-bit unsigned can be done by zero-extending to 64-bit signed.)
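As a baseline for comparison, the scalar route mentioned above can be applied lane by lane. This is only a sketch, assuming an x86-64 target (the 64-bit scalar conversions are not available in 32-bit mode). The helper names are hypothetical and are not part of any intrinsics header.

```c
#include <emmintrin.h> /* SSE2 */

/* Hypothetical helper: packed double -> int64_t, one scalar conversion per lane.
   Rounds according to the current rounding mode. x86-64 only. */
static inline __m128i cvtpd_epi64_by_lane(__m128d x) {
    long long lo = _mm_cvtsd_si64(x);                      /* lane 0 */
    long long hi = _mm_cvtsd_si64(_mm_unpackhi_pd(x, x));  /* lane 1 */
    return _mm_set_epi64x(hi, lo);
}

/* Hypothetical helper: packed int64_t -> double, one scalar conversion per lane. */
static inline __m128d cvtepi64_pd_by_lane(__m128i x) {
    __m128d lo = _mm_cvtsi64_sd(_mm_setzero_pd(), _mm_cvtsi128_si64(x));
    __m128d hi = _mm_cvtsi64_sd(_mm_setzero_pd(),
                                _mm_cvtsi128_si64(_mm_unpackhi_epi64(x, x)));
    return _mm_unpacklo_pd(lo, hi);                        /* [lo, hi] */
}
```

This handles the full range and follows the hardware's own behavior for NaN and out-of-range inputs, but it moves every element through a general-purpose register, which is the cost the tricks below try to avoid.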
If you're willing to cut corners, double <-> int64 conversions can be done in only two instructions, provided that:
You don't care about infinity or NaN.
For double <-> int64_t, you only care about values in the range [-2^51, 2^51].
For double <-> uint64_t, you only care about values in the range [0, 2^52).
.double -> uint64_t
double -> int64_t
uint64_t -> double
int64_t -> double
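A minimal SSE2 sketch of the four range-limited conversions listed above. It assumes the magic constant 2^52 for the unsigned case and 2^52 + 2^51 for the signed case (the latter is one way to absorb the two's complement sign, chosen for this sketch). The function names are hypothetical.

```c
#include <emmintrin.h> /* SSE2 */

/* double -> uint64_t. Only valid for inputs in [0, 2^52). */
static inline __m128i double_to_uint64_limited(__m128d x) {
    const __m128d magic = _mm_set1_pd(0x1p52);
    x = _mm_add_pd(x, magic);            /* integer now sits in the mantissa */
    return _mm_xor_si128(_mm_castpd_si128(x), _mm_castpd_si128(magic));
}

/* double -> int64_t. Only valid for inputs in [-2^51, 2^51]. */
static inline __m128i double_to_int64_limited(__m128d x) {
    const __m128d magic = _mm_set1_pd(0x1p52 + 0x1p51);
    x = _mm_add_pd(x, magic);
    /* Integer subtract (not XOR) so that negative inputs borrow correctly. */
    return _mm_sub_epi64(_mm_castpd_si128(x), _mm_castpd_si128(magic));
}

/* uint64_t -> double. Only valid for inputs in [0, 2^52). */
static inline __m128d uint64_to_double_limited(__m128i x) {
    const __m128d magic = _mm_set1_pd(0x1p52);
    x = _mm_or_si128(x, _mm_castpd_si128(magic));   /* attach the exponent of 2^52 */
    return _mm_sub_pd(_mm_castsi128_pd(x), magic);  /* remove the 2^52 offset */
}

/* int64_t -> double. Only valid for inputs in [-2^51, 2^51]. */
static inline __m128d int64_to_double_limited(__m128i x) {
    const __m128d magic = _mm_set1_pd(0x1p52 + 0x1p51);
    x = _mm_add_epi64(x, _mm_castpd_si128(magic));
    return _mm_sub_pd(_mm_castsi128_pd(x), magic);
}
```

Each function is one floating-point operation plus one integer operation, not counting the load of the constant. Inputs outside the stated ranges silently produce wrong results.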
Rounding Behavior:
double -> uint64_t
conversion, rounding works correctly following the current rounding mode. (which is usually round-to-even)double -> int64_t
conversion, rounding will follow the current rounding mode for all modes except truncation. If the current rounding mode is truncation (round towards zero), it will actually round towards negative infinity.How does it work?
Despite this trick being only 2 instructions, it's not entirely self-explanatory.
The key is to recognize that for double-precision floating-point, values in the range [2^52, 2^53) have the "binary place" just below the lowest bit of the mantissa. In other words, if you zero out the exponent and sign bits, the mantissa becomes precisely the integer representation.
To convert x from double -> uint64_t, you add the magic number M, which is the floating-point value of 2^52. This puts x into the "normalized" range of [2^52, 2^53) and conveniently rounds away the fractional part bits.
Now all that's left is to remove the upper 12 bits. This is easily done by masking them out. The fastest way is to recognize that those upper 12 bits are identical to those of M. So rather than introducing an additional mask constant, we can simply subtract or XOR by M. XOR has more throughput.
Converting from uint64_t -> double is simply the reverse of this process. You add back the exponent bits of M. Then un-normalize the number by subtracting M in floating-point.
The signed integer conversions are slightly trickier since you need to deal with the 2's complement sign-extension. I'll leave those as an exercise for the reader.
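To make the bit-level argument concrete, here is a scalar model of the unsigned case in plain C. It is only an illustration of the explanation above, assuming IEEE 754 binary64 doubles and the default round-to-nearest mode. The helper names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret the bits of a double as a 64-bit integer (no value conversion). */
static inline uint64_t bits_of(double d) {
    uint64_t u;
    memcpy(&u, &d, sizeof u);
    return u;
}

/* Reinterpret a 64-bit integer as the bits of a double. */
static inline double double_from_bits(uint64_t u) {
    double d;
    memcpy(&d, &u, sizeof d);
    return d;
}

/* Scalar model of double -> uint64_t, for x in [0, 2^52). */
static inline uint64_t model_double_to_uint64(double x) {
    const double M = 0x1p52;             /* bit pattern 0x4330000000000000 */
    return bits_of(x + M) ^ bits_of(M);  /* XOR clears the upper 12 bits */
}

/* Scalar model of uint64_t -> double, for n in [0, 2^52). */
static inline double model_uint64_to_double(uint64_t n) {
    const double M = 0x1p52;
    return double_from_bits(n | bits_of(M)) - M;  /* add exponent, then un-normalize */
}
```

For example, 2^52 + 5 has the bit pattern 0x4330000000000005: the upper 12 bits (sign and exponent) match those of M, and the low 52 bits are the integer 5.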
Related: A fast method to round a double to a 32-bit int explained
Full Range int64 -> double:
After many years, I finally had a need for this.
uint64_t -> double
int64_t -> double
uint64_t -> double
int64_t -> double
These work for the entire 64-bit range and are correctly rounded to the current rounding behavior.
These are similar to wim's answer below, but with more abusive optimizations. As such, deciphering these will also be left as an exercise to the reader.
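As a rough illustration of how a full-range conversion can be built from the same magic-number idea, here is an SSE2-only sketch that splits each integer into a high and a low part, converts both exactly, and lets a single final addition do the rounding. It is not the optimized version this answer refers to. The function names, the split points (32 bits for unsigned, 48 bits for signed), and the constants are choices made for this sketch.

```c
#include <emmintrin.h> /* SSE2 */

/* uint64_t -> double over the full range: x = hi * 2^32 + lo. */
static inline __m128d uint64_to_double_full_sketch(__m128i x) {
    const __m128i lo32 = _mm_set1_epi64x(0xFFFFFFFFLL);
    /* 2^52 + lo (the unit in the last place of 2^52 is 1) */
    __m128i xL = _mm_or_si128(_mm_and_si128(x, lo32),
                              _mm_castpd_si128(_mm_set1_pd(0x1p52)));
    /* 2^84 + hi * 2^32 (the unit in the last place of 2^84 is 2^32) */
    __m128i xH = _mm_or_si128(_mm_srli_epi64(x, 32),
                              _mm_castpd_si128(_mm_set1_pd(0x1p84)));
    /* Exact: hi * 2^32 - 2^52 */
    __m128d f = _mm_sub_pd(_mm_castsi128_pd(xH), _mm_set1_pd(0x1p84 + 0x1p52));
    /* The only rounding step: (hi * 2^32 - 2^52) + (2^52 + lo) */
    return _mm_add_pd(f, _mm_castsi128_pd(xL));
}

/* int64_t -> double over the full range: x = top16 * 2^48 + low48,
   where top16 is the sign-extended upper 16 bits. */
static inline __m128d int64_to_double_full_sketch(__m128i x) {
    const __m128i lo48 = _mm_set1_epi64x(0x0000FFFFFFFFFFFFLL);
    const __m128i hi32 = _mm_set1_epi64x(~0xFFFFFFFFLL);
    /* 2^52 + low48 */
    __m128i xL = _mm_or_si128(_mm_and_si128(x, lo48),
                              _mm_castpd_si128(_mm_set1_pd(0x1p52)));
    /* SSE2 has no 64-bit arithmetic shift, so shift the 32-bit halves and keep
       only the upper one: the lane then holds top16 * 2^32 as an integer. */
    __m128i xH = _mm_and_si128(_mm_srai_epi32(x, 16), hi32);
    /* 3 * 2^67 + top16 * 2^48 (the unit in the last place there is 2^16) */
    xH = _mm_add_epi64(xH, _mm_castpd_si128(_mm_set1_pd(0x1p68 + 0x1p67)));
    /* Exact: top16 * 2^48 - 2^52 */
    __m128d f = _mm_sub_pd(_mm_castsi128_pd(xH),
                           _mm_set1_pd(0x1p68 + 0x1p67 + 0x1p52));
    return _mm_add_pd(f, _mm_castsi128_pd(xL));
}
```

Because every intermediate step is exact, the result is rounded once, by the final addition, in whatever rounding mode is active.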