On GPU, is it possible to get more flops by combining double and float operations?


If a GPU can do N1 single-precision operations per second and N2 double-precision operations per second, is it possible, by mixing (independent) single- and double-precision operations, to achieve N1+N2 total operations per second, or at least something larger than both N1 and N2?

On Intel/AMD CPUs I am fairly sure this is not possible, because double- and single-precision operations share at least some execution resources. But I have no idea whether this is true for modern NVIDIA or AMD GPUs.
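To make the question concrete, here is a hypothetical microbenchmark sketch in CUDA. The kernel name, constants, and iteration count are illustrative; the idea is simply to keep one FP32 and one FP64 dependency chain alive in the same threads, with no data dependency between them, so the scheduler is free to issue them to different execution units.

```
// Sketch: two independent dependency chains, one FP32 and one FP64.
// If the hardware can co-issue them, this kernel should run in less
// time than a pure-FP32 run plus a pure-FP64 run of the same length.
__global__ void mixed_fma(float *out_f, double *out_d, int iters)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float  f = 1.0f + tid;   // FP32 chain
    double d = 1.0  + tid;   // FP64 chain, independent of f
    for (int i = 0; i < iters; ++i) {
        f = fmaf(f, 1.000001f, 0.5f);  // maps to FP32 FMA units
        d = fma (d, 1.000001 , 0.5 );  // maps to FP64 FMA units
    }
    out_f[tid] = f;   // write back so the compiler keeps both chains
    out_d[tid] = d;
}
```

Comparing its runtime against single-precision-only and double-precision-only variants of the same loop gives an empirical answer for a given GPU.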


This question has been partly touched upon in a SuperUser question, whose accepted answer links to a fair number of external sources, including two talks on mixed-precision arithmetic (this and this). Both investigate mixed precision from a correctness standpoint and do not seem to be primarily motivated by performance.

Extending upon that, parametric code that can conditionally switch parts of its calculation to reduced precision (as opposed to the classic approach of doing everything in double) where applicable can yield benefits on both modern AMD and NVIDIA GPUs (Intel has yet to reveal such details about its upcoming GPUs). Data dependencies between subsequent operations play an important role in whether operations can be co-issued.
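One common way to write such precision-parametric code in CUDA is to template the compute type, so the same kernel can be instantiated at whichever precision a given stage of the pipeline tolerates. This is a sketch; the kernel and its launch parameters are illustrative, not taken from the answer above.

```
// Sketch of precision-parametric device code: the compute type is a
// template parameter, so one kernel body serves both FP32 and FP64.
template <typename T>
__global__ void axpy(int n, T a, const T *x, T *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];   // instantiated as FP32 or FP64 math
}

// Instantiations chosen per accuracy requirement (illustrative):
//   axpy<float> <<<grid, block>>>(n, 2.0f, xf, yf);  // reduced precision
//   axpy<double><<<grid, block>>>(n, 2.0,  xd, yd);  // full precision
```

Templating keeps the precision decision at the call site, which makes it easy to experiment with which stages can safely be demoted to float.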

  • NVIDIA has been using separate FP32 and FP64 units in its Streaming Multiprocessors (see e.g. NVIDIA Ampere Architecture In-Depth). Every architecture is slightly different; Volta (GV100) is known to use different dispatch ports for various op types, including FP32 and FP64, which use different ports and can therefore be co-issued. Developer guides usually mention only the mutual exclusivity of various op types, not the number of dispatch ports or which op types they serve. The Nsight documentation and the pipeline-utilization profiling counters of the various compute capabilities (as mentioned in the linked forum answer) may help tune code in this regard.
  • The AMD CDNA Whitepaper details that there, too, dedicated hardware elements exist for processing vector math and matrix math. (CDNA is AMD Instinct MI100 and up, gfx908 in ISA terms.) FP64 operations are processed by the VALU, while certain FP32 ops can also be processed by the Matrix ALU. To see which instructions map to which hardware units, refer to the CDNA ISA Reference Guide.

In both cases, writing the code in such a fashion is necessary but not sufficient: ultimately one is at the mercy of the compiler to emit ISA which the hardware (or, in NVIDIA's case, the driver) then processes in such a fashion that the proper operations are actually co-issued. Profilers are invaluable for finding out whether the magic really did happen under the hood.
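Short of full profiler sessions, a first-order check is a wall-clock comparison using CUDA events. This sketch assumes hypothetical kernels (`mixed_kernel`, plus pure-FP32 and pure-FP64 variants) and elides error checking.

```
// Sketch: time the mixed kernel with CUDA events. If co-issue (or at
// least overlap) occurs, its time should be well below the sum of the
// pure-FP32 and pure-FP64 kernel times measured the same way.
cudaEvent_t beg, end;
cudaEventCreate(&beg);
cudaEventCreate(&end);

cudaEventRecord(beg);
mixed_kernel<<<grid, block>>>(/* ... */);  // hypothetical kernel
cudaEventRecord(end);
cudaEventSynchronize(end);

float ms_mixed = 0.0f;
cudaEventElapsedTime(&ms_mixed, beg, end);
// Repeat for the pure-FP32 and pure-FP64 kernels and compare.
```

For a definitive answer, the per-pipeline utilization counters in Nsight Compute show directly whether the FP32 and FP64 pipes were busy simultaneously.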

That said, even if co-issue does not happen, FP32 units consume less energy while operating (fewer bits means less work) and therefore generate less heat, allowing the GPU to maintain boost clocks for longer. Mild performance gains may thus be observed regardless of architectural subtleties, simply by not using extra resources when they are not strictly necessary.