Force all threads in a work group to execute the same if/else branch

143 Views Asked by At

I would like to use the local/shared memory optimization to reduce global memory access, so I basically have this function

float __attribute__((always_inline)) test_unoptimized(const global float* data, ...) {
    // ...
    for(uint j=0; j<def_data_length; j++) {
        const float x = data[j];
        // do sime computation with x, like finding the minimum value ...
    }
    // ...
    return x_min;
}

and do the usual local/shared memory optimization on it:

float __attribute__((always_inline)) test_optimized(const global float* data, ...) {
    // ...
    const uint lid = get_local_id(0); // shared memory optimization (only works with first ray)
    local float cache_x[def_ws];
    for(uint j=0; j<def_data_length; j+=def_ws) {
        cache_x[lid] = data[j+lid];
        barrier(CLK_LOCAL_MEM_FENCE);
        #pragma unroll
        for(uint k=0; k<min(def_ws, def_data_length-j); k++) {
            const float x = cache_x[k];
            // do sime computation with x, like finding the minimum value ...
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    // ...
    return x_min;
}

Now the difficulty is that test_optimized is called in the kernel only in one of two possible if/else branches. If only some threads in a workgroup execute the else-branch, all other threads must not choose the if-branch for the local memory optimization in test_optimized to work. So I created a workaround: The condition for each thread in the workgroup is atomic_or-ed into an integer and then the integer, which is the same for all threads, is checked for branching. This ensures that, if 1 or more threads in the thread block choose the else-branch, all the others do too.

kernel void test_kernel(const global float* data, global float* result...) {
    const uint n = get_global_id(0);
    
    // ...
    const bool condition = ...; // here I get some condition based on the thread ID n and global data

    local uint condition_any; // make sure all threads within a workgroup are in the if/else part
    condition_any = 0u;
    barrier(CLK_LOCAL_MEM_FENCE);
    atomic_or(&condition_any, condition);
    barrier(CLK_LOCAL_MEM_FENCE);

    if(condition_any==0u) {
        // if-part is very short
        result = 0;
        return;
    } else {
        // else-part calls test_optimized function
        const float x_min = test_optimized(data, ...);
        result = condition ? x_min : 0;
    }
}

The above code works flawlessly and is about 25% faster than with the test_unoptimized function. But atomically jamming a bit into the same local memory from all threads in the workgroup seems a bit like a hack to me and it only runs efficiently for small workgroup size (def_ws) 32, 64 or 128, but not 256 or greater.

Is this trick used in other codes and does it have a name? If not: Is there a better way to do it?

1

There are 1 best solutions below

0
On BEST ANSWER

With OpenCL 1.2 or older, I don't think there's a way to do this any faster. (I'm not aware of any relevant vendor extensions, but check your implementation's list for anything promising.)

With OpenCL 2.0+, you can use workgroup functions, in this case specifically work_group_any() for this sort of thing.