I am profiling a CUDA kernel with nvprof, with PC sampling enabled, to understand some latency problems I am having. The GPU I am using is a P100 (compute capability 6.0).
PC sampling reports that a DFMA is frequently stalled on memory dependencies. The SASS code for the DFMA is as follows:
DFMA R22, R4, R8, R22
My take on the problem is that R8 has to be loaded by an LDG.E.CI.64 that has a very high miss rate in L2.
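To make the pattern concrete, here is a minimal sketch of the kind of code I mean. It is not my actual kernel: the names `gather_fma`, `a`, `x`, `idx` and `out`, and the sizes in `main`, are made up for illustration and are not tuned to reproduce the L2 miss rate. A scattered read-only double load feeds directly into a double-precision FMA, which I would expect to compile to an LDG.E.CI.64 followed by a DFMA on sm_60.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: gather x through an index array, accumulate with FMA.
__global__ void gather_fma(const double* __restrict__ a,
                           const double* __restrict__ x,
                           const int* __restrict__ idx,
                           double* __restrict__ out, int n, int k)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    double acc = 0.0;
    for (int j = 0; j < k; ++j) {
        // Scattered load through the read-only data path.
        double xv = __ldg(&x[idx[i * k + j]]);
        // The FMA cannot issue until xv has arrived in its register.
        acc = fma(a[i * k + j], xv, acc);
    }
    out[i] = acc;
}

int main()
{
    const int n = 1 << 16, k = 8;   // arbitrary illustrative sizes
    double *a, *x, *out;
    int *idx;
    cudaMallocManaged(&a, sizeof(double) * n * k);
    cudaMallocManaged(&x, sizeof(double) * n);
    cudaMallocManaged(&idx, sizeof(int) * n * k);
    cudaMallocManaged(&out, sizeof(double) * n);

    for (int i = 0; i < n; ++i) x[i] = 1.0;
    for (int i = 0; i < n * k; ++i) {
        a[i] = 1.0;
        idx[i] = (int)((i * 7919LL) % n);   // scatter the accesses across x
    }

    gather_fma<<<(n + 255) / 256, 256>>>(a, x, idx, out, n, k);
    cudaDeviceSynchronize();
    printf("out[0] = %f\n", out[0]);

    cudaFree(a); cudaFree(x); cudaFree(idx); cudaFree(out);
    return 0;
}
```

Something like this can be built with `nvcc -arch=sm_60` and the generated SASS inspected with `cuobjdump -sass` to check which load instruction the compiler actually emits.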
The definition of a memory dependency stall is "A load/store cannot be made because the required resources are not available or are fully utilized, or too many requests of a given type are outstanding."
What confuses me is that a DFMA is not a load/store operation. If I am right that the stall is caused by the data not yet being available in R8, then it should be reported as an execution dependency. What does a memory dependency stall on a DFMA mean?