I'm trying to profile several programs, including some of the CUDA samples, with nvprof on a machine with two Nvidia Quadro Q1000 GPUs.
For 5_Domain_Specific/MonteCarloMultiGPU, nvprof reports the strange behavior shown in the figure:
Leaving aside the reported execution time (the real run takes about 1 second), I expected the two kernels to execute at almost the same time, not one at the beginning and the other at the end.
The CUDA Toolkit version is 12.2, and the GPUs have CUDA compute capability 6.1.

