Atomic addition to floating point values in OpenCL for NVIDIA GPUs?

1k Views Asked by At

The OpenCL 3.0 specification does not seem to have intrinsics/builtins for atomic addition to floating-point values, only for integral values (and that seems to have been the case in OpenCL 1.x and 2.x as well). CUDA, however, has offered floating-point atomics for a while now:

float  atomicAdd(float*  address, float  val); // since Fermi
double atomicAdd(double* address, double val); // since Pascal
__half atomicAdd(__half *address, __half val); // ?

Naturally, any straightforward atomic operation can be simulated with compare-and-exchange, and this is available in OpenCL. But my questions are:

  1. Does NVIDIA expose floating-point atomics in OpenCL somehow? e.g. via a vendor extension? using pragmas? implicitly?
  2. Is there a more efficient mechanism than simulation with compare-exchange, which I could consider as a substitute for floating-point atomics? For NVIDIA GPUs or generally?
2

There are 2 best solutions below

0
On BEST ANSWER

As @ProjectPhysX implied in their answer, when you compile OpenCL with NVIDIA's driver, it accepts inline PTX assembly (which is of course not at all part of OpenCL nor a recognized vendor extension). This lets you basically do anything CUDA offers you - in OpenCL; and that includes atomically adding to floating point values.

So, here are wrapper functions for atomically adding to single-precision (32-bit) floating point values in global and in local memory:

float atomic_add_float_global(__global float* p, float val)
{
    float prev;
    asm volatile(
        "atom.global.add.f32 %0, [%1], %2;" 
        : "=f"(prev) 
        : "l"(p) , "f"(val) 
        : "memory" 
    );
    return prev;
}

float atomic_add_float_local(__local float* p, float val)
{
    float prev;
    // Remember "local" in OpenCL means the same as "shared" in CUDA.
    asm volatile(
        "atom.shared.add.f32 %0, [%1], %2;"
        : "=f"(prev) 
        : "l"(p) , "f"(val) 
        : "memory" 
    );
    return prev;
}

One could also perhaps tweak this by checking whether the OpenCL driver is NVIDIA's, in which case the inline assembly is used, or non-NVIDIA, in which the atomic-compare-exchange implementation is used.

1
On

Native foating-point atomics are a much desired extension for OpenCL 3.0. As of right now, they are still not available.

  1. The only possible way would be to use inline PTX.
  2. No. The implementation with atomic compare-exchange for FP32 and FP64 is currently state-of-the-art and there is no known better way.

UPDATE June 2022: Floating-point atomics are being added to the OpenCL 3.0 standard but adoption by hardware vendors might still take some time.