CUDA kernel with single branch runs 1.5x faster than kernel without branch

702 Views Asked by At

I've got a strange performance inversion on filter kernel with and without branching. Kernel with branching runs ~1.5x faster than the kernel without branching.

Basically I need to sort a bunch of radiance rays then apply interaction kernels. Since there are a lot of accompanying data, I can't use something like thrust::sort_by_key() many times.

Idea of the algorithm:

  1. Run a loop for all possible interaction types (which is five)
  2. At every cycle a warp thread votes for its interaction type
  3. After loop completion every warp thread knows about another threads with the same interaction type
  4. Threads elect they leader (per interaction type)
  5. Leader updates interactions offsets table using atomicAdd
  6. Each thread writes its data to corresponding offset

I used techniques described in this Nvidia post https://devblogs.nvidia.com/parallelforall/cuda-pro-tip-optimized-filtering-warp-aggregated-atomics/

My first kernel contains a branch inside loop and runs for ~5ms:

int active;
int leader;
int warp_progress;
for (int i = 0; i != hit_interaction_count; ++i)
{
  if (i == decision)
  {
    active = __ballot(1);
    leader = __ffs(active) - 1;
    warp_progress = __popc(active);
  }
}

My second kernel use lookup table of two elements, use no branching and runs for ~8ms:

int active = 0;
for (int i = 0; i != hit_interaction_count; ++i)
{
  const int masks[2] = { 0, ~0 };
  int mask = masks[i == decision];
  active |= (mask & __ballot(mask));
}
int leader = __ffs(active) - 1;
int warp_progress = __popc(active);

Common part:

int warp_offset;
if (lane_id() == leader)
  warp_offset = atomicAdd(&interactions_offsets[decision], warp_progress);
warp_offset = warp_broadcast(warp_offset, leader);
...copy data here...

How can that be? Is there any way to implement such filter kernel so it will run faster than branching one?

UPD: Complete source code can be found in filter_kernel cuda_equation/radiance_cuda.cu at https://bitbucket.org/radiosity/engine/src

1

There are 1 best solutions below

2
On

I think this is CPU programmer brain deformation. On CPU I expect performance boost because of eliminated branch and branch misprediction penalty.

But there is no branch prediction on GPU and no penalty, so only instructions count matters.

First I need to rewrite code to the simple one.

With branch:

int active;
for (int i = 0; i != hit_interaction_count; ++i)
    if (i == decision)
        active = __ballot(1);

Without branch:

int active = 0;
for (int i = 0; i != hit_interaction_count; ++i)
{
  int mask = 0 - (i == decision);
  active |= (mask & __ballot(mask));
}

In first version there are ~3 operations: compare, if and __ballot(). In second version there are ~5 operations: compare, make mask, __ballot(), & and |=. And there are ~15 ops in common code.

Both loops runs for 5 cycles. It total 35 ops in first, and 45 ops in second. This calculation can explain performance degradation.