I was working on a large cuda kernel and I noticed that the kernel was using 43 registers per thread. In order to understand what was going on, I wrote a smaller program to figure out register usage. I noticed that whenever I use if
, register usage goes up. The small code is as follows:
#include <limits.h>
#include <stdio.h>
#include <fstream>
#include <iostream>
#include <cstdlib>
#include <stdint.h>
using namespace std;
__global__ void test_ifs(unsigned int* result){
unsigned int k = 0;
for(int j=0;j<MAX_COMP;j++){
//if(j <= threadIdx.x%MAX_COMP){
k += j;
//}
}
result[threadIdx.x] = k;
}
int main(){
unsigned int* result;
cudaError_t e1 = cudaMalloc((void**) &result, THREADSPERBLOCK*sizeof(unsigned int));
if(e1 == cudaSuccess){
test_ifs<<<1, THREADSPERBLOCK>>>(result);
cudaError_t e2 = cudaGetLastError();
if(e2 == cudaSuccess){
}
else{
cout << "kernel failed to launch" << endl;
}
}
else{
cout << "Failed to allocate results memory" << endl;
}
}
When I compile this code, each thread uses 5 registers
ptxas info : Compiling entry function '_Z8test_ifsPj' for 'sm_20'
ptxas info : Function properties for _Z8test_ifsPj
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 5 registers, 40 bytes cmem[0]
But, if I uncomment if
, each thread uses 8 registers. Can anyone please explain to me what is going on?
ptxas info : Compiling entry function '_Z8test_ifsPj' for 'sm_20'
ptxas info : Function properties for _Z8test_ifsPj
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 8 registers, 40 bytes cmem[0]
The behaviour you are seeing in this example is due to compiler optimization. In the simple loop case, the result of the loop can be calculated at compile time and the whole loop code replaced by a constant, while in the loop containing an if statement case, the result of the loop depends on a variable whose value is not known to the compiler, and the loop must stay.
To prove this is the case, let's look at an only slightly modified version of your kernel:
and the PTX it emits. For the first case:
and the second case the loop remains intact:
You should not conclude that differences will always be due to compiler optimisation, because that depends strongly on the code and the compiler. But in this case, that is the difference.