How to convert a float to a half type and the other way around in C

1.1k Views Asked by At

How can I convert a float (float32) to a half (float16) and the other way around in C while accounting for edge cases like NaN, Infinity etc.

I don't need arithmetic because I just need the types in order to fulfill the requirement of supporting them. So the half type can just be a uint16_t or corresponding typedef. I only found ways to do it in C++ or some that didn't account for edge cases like NaN.

I need to convert a float into a half type that can be represented as a simple uint16_t, this uint16_t should just contain the binary representation of the half since I won't do arithmetic on it. I require this so I can fulfill a library requirement. I can't use existing implementations because they are build as shared libraries (also mostly for C++) which I can't use in this case. Also the GCC/Clang __fp16 and _Float16 won't work because I compile my code very to web assembly that will run in an isolated environment and as such can't use native dependent code (No WASI) (And EMCC throws an error when using _FloatXX types).

3

There are 3 best solutions below

2
On

You have a number of options:

  1. Use an existing impl e.g. the one from Industrial Light & Magic. There are a number of other impl as well.
  2. Use an intrinsic e.g. for Intel CPU's you have _mm_cvtps_ph and _mm_cvtph_ps which can convert upto 4 values at a time.
  3. Write your own using knowledge of the IEEE 754 float format and that used by half.

Edit: Since you mainly just want to convert back-and-forth, the two functions to look at in the ILM code are: Float to half: Line 85 and half to float: line 62

0
On

Below I am showing an ISO-C99 implementation of float to half conversion that has been tested exhaustively. The following assumptions apply: float maps to IEEE-754 binary32 format while half maps to IEEE-754 binary16 format; both floating-point and integer data types use the same endianness when stored; conversion to a narrower floating-point type should utilize the rounding mode to-nearest-or-even.

As a golden reference the test framework uses the x86-64 instruction set extension F16C, introduced in 2011 to support half precision (FP16) as a storage type. IEEE-754 NaN handling contains some architecture specific elements, and the float2half_rn() function below was designed to mimic the x86-64 behavior. Adjustments, for example switching to the use of a single canonical NaN encoding, are trivial.

The code below is derived from code that I previously published under a BSD license here. I used the Intel Compiler Version 13.1.3.198 Build 20130607 to build this code and ran the exhaustive test on an IvyBridge CPU.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include "immintrin.h"

uint32_t float_as_uint32 (float a)
{
    uint32_t r;
    memcpy (&r, &a, sizeof r);
    return r;
}

uint16_t float2half_rn (float a)
{
    uint32_t ia = float_as_uint32 (a);
    uint16_t ir;

    ir = (ia >> 16) & 0x8000;
    if ((ia & 0x7f800000) == 0x7f800000) {
        if ((ia & 0x7fffffff) == 0x7f800000) {
            ir |= 0x7c00; /* infinity */
        } else {
            ir |= 0x7e00 | ((ia >> (24 - 11)) & 0x1ff); /* NaN, quietened */
        }
    } else if ((ia & 0x7f800000) >= 0x33000000) {
        int shift = (int)((ia >> 23) & 0xff) - 127;
        if (shift > 15) {
            ir |= 0x7c00; /* infinity */
        } else {
            ia = (ia & 0x007fffff) | 0x00800000; /* extract mantissa */
            if (shift < -14) { /* denormal */  
                ir |= ia >> (-1 - shift);
                ia = ia << (32 - (-1 - shift));
            } else { /* normal */
                ir |= ia >> (24 - 11);
                ia = ia << (32 - (24 - 11));
                ir = ir + ((14 + shift) << 10);
            }
            /* IEEE-754 round to nearest of even */
            if ((ia > 0x80000000) || ((ia == 0x80000000) && (ir & 1))) {
                ir++;
            }
        }
    }
    return ir;
}

uint16_t float2half_rn_ref (float a)
{
    __m128 pa = _mm_set_ps1 (a);
    __m128i r16 = _mm_cvtps_ph (pa, _MM_FROUND_TO_NEAREST_INT);
    uint16_t res;
    memcpy (&res, &r16, sizeof res);
    return res;
}

float uint32_as_float (uint32_t a)
{
    float r;
    memcpy (&r, &a, sizeof r);
    return r;
}

int main (void)
{
    float arg;
    uint16_t resi, refi;
    uint32_t argi = 0;
    do {
        arg = uint32_as_float (argi);
        refi = float2half_rn_ref (arg);
        resi = float2half_rn (arg);
        if (resi != refi) {
            printf ("error @ %15.8e (%08x): resi=%04x  refi=%04x\n", 
                    arg, argi, resi, refi);
            return EXIT_FAILURE;
        }
        argi++;
        if ((argi & 0xffffff) == 0) printf ("\r%08x", argi);
    } while (argi);
    return EXIT_SUCCESS;
}

0
On

Code for the 16-bit to 32-bit conversion is here.

Below is quickly written tested code for 32-bit to 16-bit conversion, largely based on the algorithm here. I do not have time to document it properly at the moment, and it can be improved. The tests do not check handling of the payload bits (in the significand field) of NaNs.

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>


//  Provide bit N (equal to 2**N), for 0 <= N < 32.
#define Bit(N)  ((uint32_t) 1 << (N))

//  Provide a mask of N bits, for 0 <= N < 32.
#define Mask(N) (((uint32_t) 1 << (N)) - 1)


/*  Convert a 32-bit floating-point number (with 1 sign bit, 8 exponent field
    bits, and 23 significand field bits) to a 16-bit floating-point number
    (with 1 sign bit, 5 exponent field bits, and 10 significand field bits).
*/
uint16_t ConvertF32ToF16(uint32_t x)
{
    uint32_t SignBit          = x >> 31;
    uint32_t ExponentField    = x >> 23 & Mask( 8);
    uint32_t SignificandField = x       & Mask(23);

    //  All-bits-set in the exponent field means infinity or NaN.
    if (ExponentField == Mask(8))
    {
        //  Zero significand field means infinity.
        if (SignificandField == 0)
            //  Return infinity with copied sign bit.
            return SignBit << 15 | Mask(5) << 10;

        //  Otherwise, we have a NaN.
        else
        {
            //  Truncate significand field.
            SignificandField >>= 23-10;

            /*  If truncated field is zero, turn on a bit so it indicates a
                NaN, not an infinity.
            */
            if (SignificandField == 0)
                SignificandField = 1;

            return SignBit << 15 | Mask(5) << 10 | SignificandField;
        }
    }

    //  All-bits-zero in the exponent field indicates a subnormal number.
    else if (ExponentField == 0)
    {
        /*  All subnormal 32-bit floating-point numbers are too small to
            convert to anything but zero in the 16-bit format.  Return zero
            with the sign bit copied.
        */
        return SignBit << 15;
    }

    //  Everything else is a normal number.
    else
    {
        //  Convert the exponent from the 32-bit format to the 16-bit format.
        int Exponent = (int) ExponentField - 127 + 15;

        //  Small numbers convert to zero.
        if (Exponent < -11)
            return SignBit << 15;

        /*  Decode the significand field.  Note the radix point is between
            bits 23 and 22.
        */
        uint32_t Significand = Bit(23) | SignificandField;

        //  Process values that are subnormal in the 16-bit format.
        if (Exponent < 1)
        {
            /*  Below, we shift the significand to where it will be when the
                exponent is increased to the minimum exponent of the 16-bit
                format.  Here, we move the bits that will shift out to the
                left, to isolate them.
            */
            uint32_t T = Significand << 32 - (1 - Exponent) - 13;
            Significand >>= 1 - Exponent + 13;
            Exponent = 1;

            /*  If the removed bits are greater than half the low bit, we round
                up.
            */
            if (Bit(31) < T)
                ++Significand;

            /*  Otherwise, if the removed bits equal half the low bit, we round
                to make the low bit of the significand even.
            */
            if (Bit(31) == T)
                Significand += Significand & 1;

            //  That could carry to significand to 2.
            if (Bit(10) <= Significand)
                return SignBit << 15 | 1 << 10 | 0;

            return SignBit << 15 | 0 << 10 | (Significand & Mask(10));
        }

        uint32_t T = Significand & Mask(13);
        if (Bit(12) < T || Bit(12) == T && (Significand & Bit(13)))
            Significand += Bit(13);
        Significand >>= 13;
        if (Bit(11) <= Significand)
        {
            ++Exponent;
            Significand >>= 1;
        }

        if (31 <= Exponent)
            return SignBit << 15 | Mask(5) << 10;

        return SignBit << 15 | Exponent << 10 | (Significand & Mask(10));
    }
}


#include <stdlib.h>


static void Test(float x)
{
    uint16_t y0 = (union { _Float16 f; uint16_t u; }) { x } .u;

    uint32_t xu = (union { float f; uint32_t u; }) { x } .u;
    uint16_t y1 = ConvertF32ToF16(xu);

    if (y0 == y1) return;

    printf("%a -> 0x%04hx but expected 0x%04hx.\n", x, y1, y0);
    exit(EXIT_FAILURE);
}


#include <math.h>


int main(void)
{
    Test(-NAN);
    Test(+NAN);
    for (float x = -INFINITY; x < INFINITY; x = nexttowardf(x, INFINITY))
        Test(x);
    Test(INFINITY);
}