After reading the answer to this question:
Does PTX (8.4) not cover smaller-shape WMMA instructions?
and re-reading the section of the PTX ISA reference distinguishing WMMA from MMA instructions, I wonder - why the distinction?
That is,
- Why do some of the instructions get the `w` prefix? After all, some of the non-w MMA operations are warp-wide...
- Why don't we just have `mma.load` and `mma.store` warp-wide instructions which can take care of loading data into registers?
- Why is there no coverage by intrinsics and templates (e.g. `fragment<...>`) of all of the matrix-multiply-add-related PTX instructions?