C++ Portable Floating-Point Bit Representation?

Is there a C++ Standards-compliant way of determining the structure of a 'float', 'double', and 'long double' at compile time (or at run time, as an alternative)?

If I assume std::numeric_limits< T >::is_iec559 == true and std::numeric_limits< T >::radix == 2, I suspect this is possible via the following rules:

  • the first X bits are the significand.
  • the next Y bits are the exponent.
  • the last bit is the sign bit.

with expressions vaguely like the following:

  • size_t num_significand_bits = std::numeric_limits< T >::digits;
  • size_t num_exponent_bits = log2( 2 * std::numeric_limits< T >::max_exponent );
  • size_t num_sign_bits = 1u;

except I know

  • std::numeric_limits< T >::digits includes the "integer bit", whether or not the format actually explicitly represents it, so I don't know how to programmatically detect and adjust for this.
  • I'm guessing std::numeric_limits< T >::max_exponent is always 2^(num_exponent_bits)/2.
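
In code, my guesses so far amount to this minimal sketch (assuming an implicit integer bit, which the x87 80-bit extended format notably lacks):

#include <cmath>   // log2
#include <cstddef> // size_t
#include <limits>  // numeric_limits

template< typename T >
void guess_layout( std::size_t & num_significand_bits,
                   std::size_t & num_exponent_bits )
{
    // digits counts the implicit integer bit, so drop it for the stored field
    num_significand_bits = std::numeric_limits< T >::digits - 1;
    // if max_exponent == 2^(num_exponent_bits - 1), taking a log recovers the width
    num_exponent_bits = static_cast< std::size_t >(
        std::log2( 2.0 * std::numeric_limits< T >::max_exponent ) );
    // the sign is assumed to be a single bit
}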

Background: I'm trying to overcome two issues portably:

  • set/get which bits are in the significand.
  • determine where the end of 'long double' is so I know not to read the implicit padding bits that'll have uninitialized memory.

There are 3 answers below.

BEST ANSWER

In short, no. If std::numeric_limits<T>::is_iec559 is true, then you know the format of T, more or less: you still have to determine the byte order. For anything else, all bets are off. (The other formats I know of that are still in use aren't even base 2: IBM mainframes use base 16, for example.) The "standard" arrangement of an IEC 559 floating point value has the sign in the high order bit, then the exponent, and the mantissa in the low order bits; if you can successfully view it as a uint64_t, for example (via memcpy, reinterpret_cast, or a union; memcpy is guaranteed to work, but may be less efficient than the other two), then:

for double:

#include <cstdint> // uint64_t
#include <cstring> // memcpy

uint64_t tmp;
memcpy( &tmp, &theDouble, sizeof( double ) );
bool     isNeg = (tmp & 0x8000000000000000) != 0;
int      exp   = (int)( (tmp & 0x7FF0000000000000) >> 52 ) - 1022 - 53;
uint64_t mant  = (tmp & 0x000FFFFFFFFFFFFF) | 0x0010000000000000; // normal values only

for float:

uint32_t tmp;
memcpy( &tmp, &theFloat, sizeof( float ) );
bool     isNeg = (tmp & 0x80000000) != 0;
int      exp   = (int)( (tmp & 0x7F800000) >> 23 ) - 126 - 24;
uint32_t mant  = (tmp & 0x007FFFFF) | 0x00800000; // normal values only
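
As a quick sanity check of the decoding (a sketch, assuming a little-endian IEC 559 double and a normal, non-zero value):

#include <cmath>   // ldexp
#include <cstdint>
#include <cstdio>
#include <cstring>

int main()
{
    double theDouble = 3.14159;

    uint64_t tmp;
    std::memcpy( &tmp, &theDouble, sizeof( double ) );
    bool     isNeg = (tmp & 0x8000000000000000) != 0;
    int      exp   = (int)( (tmp & 0x7FF0000000000000) >> 52 ) - 1022 - 53;
    uint64_t mant  = (tmp & 0x000FFFFFFFFFFFFF) | 0x0010000000000000;

    // mant * 2^exp reproduces the magnitude exactly for normal doubles
    double rebuilt = std::ldexp( static_cast< double >( mant ), exp );
    std::printf( "round-trip ok: %d\n", ( isNeg ? -rebuilt : rebuilt ) == theDouble );
}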

With regard to long double, it's worse, because different compilers treat it differently, even on the same machine. Nominally it's ten bytes, but for alignment reasons it may in fact occupy 12 or 16, or it may just be a synonym for double. If it's more than 10 bytes, I think you can count on the value being packed into the first 10 bytes, so that &myLongDouble gives the address of the 10-byte value. But generally speaking, I'd avoid long double.
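
If you must handle it, a sketch of comparing two long double values while skipping any padding (this assumes the x87 little-endian layout, where the ten value bytes come first and the rest is padding):

#include <cstring> // memcmp

// Only the first 10 bytes carry the value; any bytes beyond are padding.
bool same_long_double_bits( long double a, long double b )
{
    return std::memcmp( &a, &b, 10 ) == 0;
}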

ANSWER

I would say that the only truly portable way is to store the number as a string; that does not rely on interpreting bit patterns at all.

Even if you know how many bits something occupies, that doesn't mean it has the same representation: is the exponent zero-based or biased? Is there an invisible 1 at the front of the mantissa? The same applies to all of the other parts of the number. And it gets even worse for BCD-encoded or "hexadecimal" floats, which are available on some architectures...
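
To make the string idea concrete, a minimal sketch (the names are mine) that serializes with enough decimal digits to round-trip on the same implementation:

#include <cstdio>  // snprintf
#include <cstdlib> // strtod
#include <limits>  // numeric_limits
#include <string>

std::string to_text( double d )
{
    char buf[ 64 ];
    // max_digits10 guarantees a lossless text -> binary round trip
    std::snprintf( buf, sizeof buf, "%.*g",
                   std::numeric_limits< double >::max_digits10, d );
    return buf;
}

double from_text( std::string const & s )
{
    return std::strtod( s.c_str(), nullptr );
}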

If you are worried about uninitialized bits in a structure (class, array, etc), then use memset to set the entire structure to zero [or some other known value].
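
For example:

#include <cstring> // memset

struct Record
{
    long double x; // may contain padding bytes on some ABIs
};

Record make_record( long double v )
{
    Record r;
    std::memset( &r, 0, sizeof r ); // every byte, including padding, now holds zero
    r.x = v;
    return r;
}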

ANSWER

For posterity, this is what I ended up doing.

To generate and test for my IEEE-754 signaling-NaN values, I use this pattern for 'float' and 'double'.

#include <cstddef> // size_t
#include <cstdint> // uint32_t, uint64_t
#include <limits> // numeric_limits

// Note: reading the inactive member of a union is technically undefined
// behaviour in standard C++, though most compilers support it; memcpy is
// the strictly conforming alternative.
union IEEE754_Float_Union
{
    float value;
    uint32_t bits;
};

float generate_IEEE754_float()
{
    IEEE754_Float_Union u = { -std::numeric_limits< float >::signaling_NaN() };
    size_t const num_significand_bits_to_set = std::numeric_limits< float >::digits
                                               - 1 // implicit "integer-bit"
                                               - 1; // the "signaling-bit"
    u.bits |= ( static_cast< uint32_t >( 1 ) << num_significand_bits_to_set ) - 1;
    return u.value;
}

bool test_IEEE754_float( float const& a_r_val )
{
    IEEE754_Float_Union const u = { a_r_val };
    IEEE754_Float_Union const expected_u = { generate_IEEE754_float() };
    return u.bits == expected_u.bits;
}

For 'long double', I use the 'double' functions with casting. Specifically, I generate the 'double' value and cast it to 'long double' before it's returned, and I test the 'long double' by casting it to 'double' and then testing that value. My idea is that, while the 'long double' format can vary, casting a 'double' to 'long double' and later casting it back to 'double' should be consistent (i.e. not lose any information).
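
In code, those wrappers look roughly like this, where generate_IEEE754_double() and test_IEEE754_double() are the 'double' analogues (not shown above) of the 'float' functions:

// Assumed to exist, following the same pattern as the float versions above:
double generate_IEEE754_double();
bool test_IEEE754_double( double const& a_r_val );

long double generate_IEEE754_long_double()
{
    // widen the double bit pattern; the format conversion is done by the compiler
    return static_cast< long double >( generate_IEEE754_double() );
}

bool test_IEEE754_long_double( long double const& a_r_val )
{
    // narrow back to double, then compare bit patterns at double precision
    return test_IEEE754_double( static_cast< double >( a_r_val ) );
}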