CUDA measurement of a loop


I launch a very simple kernel <<<1,512>>> on a CUDA Fermi GPU.

__global__ void kernel(){
    int x1, x2;

    x1 = 5;
    x2 = 1;

    for (int k = 0; k <= 1000000; k++)
    {
        x1 += x2;
    }
}

The kernel is very simple: it does 10^6 additions and does not transfer anything back to global memory. The result is correct, i.e. after the loop x1 (in all of its 512 thread instances) contains 10^6 + 5.

I am trying to measure the execution time of the kernel, using both Visual Studio Parallel Nsight and nvvp. Nsight measures 2.5 microseconds and nvvp measures 4 microseconds.

The issue is the following: I can greatly increase the size of the loop, e.g. to 10^8, and the time remains constant. The same happens if I decrease the loop size a lot. Why does this happen?

Please note that if I use shared memory or global memory inside the loop, the measurements do reflect the work being performed (i.e. the time is proportional to the loop size).
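For reference, here is a minimal sketch of how the same launch could also be timed host-side with CUDA events, independently of Nsight and nvvp (this is a test harness of my own, not part of the original program; error checking is omitted):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void kernel()
{
    int x1 = 5, x2 = 1;

    for (int k = 0; k <= 1000000; k++) {
        x1 += x2;
    }
}

int main()
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    kernel<<<1, 512>>>();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);              // wait for the kernel and the stop event

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds
    printf("kernel time: %f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}

Since this times exactly the same kernel, it should report the same constant, loop-size-independent number that Nsight and nvvp show.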


There are 2 answers below.

BEST ANSWER

As noted, CUDA compiler optimisation is very aggressive at removing dead code. Because neither x1 nor x2 contributes to a value which is written to memory, they and the loop can be removed. The compiler will also pre-calculate any results which can be deduced at compile time, so if all the constants in the loop are known to the compiler, it can compute the final result and replace the loop with a constant.
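In effect, after dead code elimination the original kernel compiles down to something equivalent to an empty kernel, which is why the measured time never changes with the loop bound (a rough illustration, not actual compiler output):

// Roughly what the optimiser leaves of the original kernel:
// no loads, no stores, nothing observable, so nothing to execute.
__global__ void kernel()
{
}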

To get around both of these problems, rewrite your code like this:

__global__ 
void kernel(int *out, int x0, bool flag)
{
    int x1 = x0, x2 = 1;

    for (int k=0; k<=1000000; k++) {
       x1+=x2;
    }

    if (flag) out[threadIdx.x + blockIdx.x*blockDim.x] = x1;
}

and then run it like this:

kernel<<<1,512>>>((int *)0, 5, false);

By passing the initial value of x1 as an argument to the kernel, you ensure that the loop result isn't available to the compiler. The flag makes the memory store conditional, and the memory store makes the whole calculation unsafe to remove. As long as the flag is set to false at runtime, no store is performed, so it doesn't affect the timing of the loop.
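For completeness, a hypothetical verification run could launch the same kernel once with flag set to true and a real output buffer, so the store actually happens and the loop result can be checked on the host (this sketch assumes the kernel definition above is in the same file; error checking omitted):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int *d_out = 0;
    cudaMalloc(&d_out, 512 * sizeof(int));

    // flag = true forces the store, so the compiler must keep the loop.
    kernel<<<1, 512>>>(d_out, 5, true);

    int h_out[512];
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);

    // Each thread's value is x0 plus one increment per loop iteration.
    printf("thread 0 result: %d\n", h_out[0]);

    cudaFree(d_out);
    return 0;
}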

SECOND ANSWER

Because the compiler eliminates dead code paths. Your code doesn't actually do anything observable. Look at the assembly.

If you are actually seeing the final value, then the compiler may simply have optimized out the loop, since it can compute the result at compile time.

When you write out the register contents to shared memory, the compiler cannot guarantee that the result will not be used, and hence the value will actually be computed. In other words, the value you compute must eventually be used somewhere or written to memory; otherwise its computation will be dropped.
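As a rough sketch of that point (my own illustration, not code from the answer; kernel_shared and the out buffer of at least 512 ints are hypothetical), a variant of the question's kernel that stages the per-thread result in shared memory and then copies it out to global memory makes the computation observable, so the loop survives and the measured time again scales with the loop size, matching what the question reports for the shared/global memory case:

__global__ void kernel_shared(int *out)
{
    __shared__ int buf[512];

    int x1 = 5, x2 = 1;

    for (int k = 0; k <= 1000000; k++) {
        x1 += x2;
    }

    // Store each thread's result to shared memory, then copy it back out
    // to global memory; the visible store prevents the loop from being dropped.
    buf[threadIdx.x] = x1;
    __syncthreads();

    out[threadIdx.x] = buf[threadIdx.x];
}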