nvidia-smi only exposes a few metrics for measuring GPU utilization. The most prominent one, utilization.gpu,
is the percentage of time over the past sample period during which one or more kernels were executing on the GPU. That means a value of 100% does not indicate "full" GPU usage at all: a single kernel keeping just one SM busy for the whole interval already reports 100%.
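For reference, here is a minimal sketch of how that same counter can be polled programmatically via NVML, assuming the nvidia-ml-py package (imported as pynvml) is installed:

    import time
    import pynvml

    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU 0

    for _ in range(5):
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        # util.gpu: % of the sample period with at least one kernel resident
        # util.memory: % of the sample period with memory being read or written
        print(f"gpu={util.gpu}%  mem={util.memory}%")
        time.sleep(1)

    pynvml.nvmlShutdown()

This reports exactly the same coarse numbers as nvidia-smi, which is the core of my problem.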
Nsight Compute, on the other hand, provides many detailed metrics, but I found it to run very slowly on even small neural networks - profiling a whole training run doesn't seem to be its intended use case. Another option seems to be DLProf, but it again only reports rather coarse metrics such as "GPU Utilization" and "Tensor Core Efficiency", whose exact definitions I could not find.
Is there another tool (or parameter) that provides more detailed GPU usage metrics?
Have you considered trying DCGM (NVIDIA Data Center GPU Manager)? https://developer.nvidia.com/dcgm
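Its profiling fields (SM activity, SM occupancy, tensor-pipe activity, DRAM activity, etc.) are much closer to what you are asking for than utilization.gpu. As a rough, untested sketch of how you might stream them from Python via the dcgmi CLI (the specific field IDs below are assumptions; list the ones available on your install with dcgmi dmon -l):

    import subprocess

    # Assumed profiling field IDs: 1002 = SM activity,
    # 1004 = tensor-pipe activity, 1005 = DRAM activity.
    fields = "1002,1004,1005"

    # Sample every 1000 ms, 10 samples total (requires a running DCGM host engine).
    subprocess.run(
        ["dcgmi", "dmon", "-e", fields, "-d", "1000", "-c", "10"],
        check=True,
    )

DCGM also ships Python bindings if you want the values in-process rather than parsing CLI output.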