I want to use a SASS instruction which (AFAICT) is not available via a PTX instruction as of CUDA 12.4. Namely, suppose it is `HMMA.16816.F16` - a warp-wide matrix-multiply-and-add, on half-precision data, with shape M=16, N=8, K=16 (IIANM).
The CUDA PTX ISA guide for CUDA 12.4 indicates in Section 9.7.13.3 that at FP16 precision, we only have PTX WMMA instructions with shape (M, N, K) being one of (16, 16, 16), (32, 8, 16), or (8, 32, 16) - nothing smaller. But Section 9.7.13.1 says that smaller matrix shapes - (16, 8, 16), (16, 8, 8), and (8, 8, 4) - are supported.
Trying to use the intrinsics corresponding to these smaller shapes, e.g. `__hmma_m16n8k16_ld_a`, results in an error:

```
mma-smaller.hpp(86): error: identifier "__hmma_m16n8k16_ld_a" is undefined
      __hmma_m16n8k16_ld_a((int*)&a, (const int*)p, ldm, 0);
      ^
```
So are these shapes supported in PTX, or are they not?
Note: I'm using an Ampere GPU.
TL;DR: You can issue such a SASS instruction through appropriate choice of a PTX-level `mma` instruction (not `wmma`), but there is no corresponding C++ intrinsic documented to do that, AFAIK, at this time.

Longer: Let's start with some general background to disentangle some of these ideas. The `mma` class of instructions is there primarily to exercise tensorcore units, which provide hardware-accelerated matrix-matrix multiply operations.
In PTX, there are three relevant instruction families: `wmma.mma`, `mma`, and `wgmma.mma_async`. The `wmma.mma` instructions are distinguished by the fact that they also have corresponding matrix load and store instructions - they do not expose the per-thread register storage footprint directly. The `mma` instructions, on the other hand, take PTX register input/output directly.

In the CUDA C++ programming guide, only `wmma`-style operations are documented. That is a subset of possible tensorcore ops, the subset corresponding to the PTX `wmma.mma` instructions, and it is likewise distinguished by the fact that matrix load/store functions are used, not direct register manipulation; the multiply op itself is e.g. `wmma::mma_sync(...)`. There are no intrinsics documented in the C++ programming guide that look like `__hmma_m16n8k16_ld_a`.
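For contrast, here is a minimal sketch of that documented `wmma` path (16x16x16, fp16 inputs, fp32 accumulate - note there is no 16x8x16 fragment shape to instantiate here; kernel and variable names are my own choices):

```
#include <mma.h>
using namespace nvcuda;

// Fragment storage is opaque: loads/stores go through wmma:: helpers,
// never through explicit per-thread registers.
__global__ void wmma_16x16x16(const half *a, const half *b, float *d) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> fa;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> fb;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;
    wmma::fill_fragment(acc, 0.0f);      // C = 0
    wmma::load_matrix_sync(fa, a, 16);   // leading dimension 16
    wmma::load_matrix_sync(fb, b, 16);
    wmma::mma_sync(acc, fa, fb, acc);    // acc = fa * fb + acc
    wmma::store_matrix_sync(d, acc, 16, wmma::mem_row_major);
}
```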
Yes, you can issue a 16x8x16 (M,N,K) 16-bit floating point tensorcore op using PTX. It cannot be directly done using a (documented) C++ intrinsic, and in PTX I wouldn't use a `wmma.mma` instruction for it; I would use the `mma` PTX instruction `mma.m16n8k16`. A detailed description for this, with the PTX register layout, is here. An instruction-skeleton example is given here. The "Target ISA Notes" section later in that link provides hardware support info. Of note: `.f16` type `mma` with the `.m16n8k16` shape requires `sm_80` or higher, so an Ampere GPU qualifies. Here is a complete example (a modification of what I depicted here):
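(The kernel name, the all-ones test data, and the host scaffolding are arbitrary choices; the inline-PTX statement is the essential part. Compile for `sm_80` or newer, e.g. `nvcc -arch=sm_86`.)

```
#include <cuda_fp16.h>
#include <cstdio>
#include <cstring>

// One warp. Per the PTX register layout for mma.m16n8k16 with f32 accumulate,
// each thread holds A: 8 halves (4 x .b32), B: 4 halves (2 x .b32),
// and C/D: 4 floats each.
__global__ void mma_fp16_acc_fp32(float *out) {
    float c[4] = {0.f, 0.f, 0.f, 0.f};
    float d[4];
    half  a[8] = {1., 1., 1., 1., 1., 1., 1., 1.};
    half  b[4] = {1., 1., 1., 1.};
    unsigned const *A = reinterpret_cast<unsigned const *>(&a);
    unsigned const *B = reinterpret_cast<unsigned const *>(&b);
    asm("mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32 "
        "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%10,%11,%12,%13};\n"
        : "=f"(d[0]), "=f"(d[1]), "=f"(d[2]), "=f"(d[3])
        : "r"(A[0]), "r"(A[1]), "r"(A[2]), "r"(A[3]),
          "r"(B[0]), "r"(B[1]),
          "f"(c[0]), "f"(c[1]), "f"(c[2]), "f"(c[3]));
    memcpy(out + threadIdx.x * 4, d, 4 * sizeof(float));
}

int main() {
    float *d_out, h_out[32 * 4];
    cudaMalloc(&d_out, sizeof(h_out));
    mma_fp16_acc_fp32<<<1, 32>>>(d_out);
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    for (int i = 0; i < 32 * 4; i++) printf("%f\n", h_out[i]);  // expect 16.0
    return 0;
}
```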
As indicated in the link I provided, these tensorcore ops compute:
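`D = A*B + C`

where `A` is M x K (16x16 here), `B` is K x N (16x8), and `C` and `D` are M x N (16x8).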
In the above example, I have chosen to use/declare A and B as 16-bit floating point, whereas C and D are 32-bit floating point.
If we disassemble the above built code (e.g. with `cuobjdump -sass`), we observe the following, indicating the SASS-level tensorcore op in use (trimmed to the relevant line; exact register assignments may vary with compiler version):
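```
$ cuobjdump -sass ./a.out | grep HMMA
        HMMA.16816.F32 R4, R12, R4, RZ ;
```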
The indicated tensorcore SASS instruction is `HMMA.16816.F32 R4, R12, R4, RZ`.
If you want to see `HMMA.16816.F16`, then switch the C and D matrices to 16-bit float, and modify the PTX instruction accordingly. Something like this (a sketch of just the modified kernel; with f16 accumulate, C and D become four halves packed into two 32-bit registers each):
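```
__global__ void mma_fp16_acc_fp16(half *out) {
    half c[4] = {0., 0., 0., 0.};
    half d[4];
    half a[8] = {1., 1., 1., 1., 1., 1., 1., 1.};
    half b[4] = {1., 1., 1., 1.};
    unsigned const *A = reinterpret_cast<unsigned const *>(&a);
    unsigned const *B = reinterpret_cast<unsigned const *>(&b);
    unsigned const *C = reinterpret_cast<unsigned const *>(&c);
    unsigned       *D = reinterpret_cast<unsigned *>(&d);
    asm("mma.sync.aligned.m16n8k16.row.col.f16.f16.f16.f16 "
        "{%0,%1}, {%2,%3,%4,%5}, {%6,%7}, {%8,%9};\n"
        : "=r"(D[0]), "=r"(D[1])
        : "r"(A[0]), "r"(A[1]), "r"(A[2]), "r"(A[3]),
          "r"(B[0]), "r"(B[1]),
          "r"(C[0]), "r"(C[1]));
    memcpy(out + threadIdx.x * 4, d, 4 * sizeof(half));
}
```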
(I have removed non-essential lines due to hitting the character limit in my answer).