OpenCL parallel buffer compaction barrier issue


As a school project, four of us are working on a parallel raytracer in OpenCL. It's our first project using OpenCL, so we may have some misunderstandings about it.

We're trying to implement parallel buffer compaction to remove finished rays, or rays that didn't collide with anything, so the next iteration has less data to process. Basically, we have a buffer of as many s_ray_states as are needed for rendering: we trace them, gather collision data, compact the buffer so it contains only the rays that collided with an object, then shade them.

So we have a buffer uint *prefix_sum which contains the index to which each s_ray_state must be moved in the buffer s_ray_state *ray_states, in order to reduce the number of rays sent to the shading kernel and to the next iterations of the trace/shade kernels.

Sadly, the ray_sort kernel below doesn't seem to be working right. We verified the input prefix_sum data, which is 100% correct, and the same goes for the ray_states buffer, but we're getting unwanted data in the output.

We're launching a single work group (global work size = local work size), and the rays are always moved to a smaller index in the buffer than their original one. We've put in barriers and are using the s_ray_state *tmp buffer to prevent parallel executions from writing over each other's data, but it doesn't seem to work; even when removing the barriers we get the same result.

We've been working on this for 4 days and have already asked other students for help, but no one seems able to figure out what's wrong. We may not understand barriers/memory fences well enough to be sure this can actually work.

We already tried having a single work item in a single work group sort the whole array, which works and even gives better performance.

Is the code below supposed to work? From our understanding of OpenCL it should, and we did a lot of research, but never really got a clear answer.

kernel void ray_sort(
    global read_only uint *prefix_sum,
    global read_write struct s_ray_state *ray_states,
    global read_only uint *ray_states_size,
    local read_write struct s_ray_state *tmp
)
{
    int l_size = get_local_size(0);
    int l_id = get_local_id(0);
    int group_id = -1;
    int group_nb = *ray_states_size / l_size;
    int state_id;

    while (++group_id < group_nb)
    {
        state_id = group_id * l_size + l_id;
        tmp[l_id] = ray_states[state_id];
        barrier(CLK_LOCAL_MEM_FENCE);
        if (did_hit(tmp[l_id]))
            ray_states[prefix_sum[state_id]] = tmp[l_id];
        barrier(CLK_GLOBAL_MEM_FENCE);
    }
}

ray_states length is ray_states_size

prefix_sum contains the index to which each ray_states element must be moved

tmp is a local buffer of size local_work_size

local_work_size = global_work_size

did_hit() returns 1 if the ray hits an object, 0 otherwise

We're expecting the ray_states elements to be moved to the indices contained in prefix_sum

Example: Each ray_states[id] gets moved to prefix_sum[id] index in ray_states

prefix_sum: 0 | 0 | 1 | 1 | 2 | 3 | 3 | 3 | 4

did_hit(ray_states[id]): 0 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 0

did_hit(output[id]): 1 | 1 | 1 | 1 | X | X | X | X | X

Xs can be anything

1 Answer

I could be completely off here, but it seems to me that with did_hit(ray_states[state_id]) you are reading the same piece of global memory that you put into the local memory buffer tmp just 2 lines above. That wouldn't be a problem, except that you are using that buffer for both input and output.

The way I see it, what actually happens on the hardware is this:

    tmp[l_id] = ray_states[state_id];
    tmp[l_id] = ray_states[state_id];
    tmp[l_id] = ray_states[state_id];
    tmp[l_id] = ray_states[state_id];
    tmp[l_id] = ray_states[state_id];

       ... local-work-size times

    barrier(CLK_LOCAL_MEM_FENCE);

    if (did_hit(ray_states[state_id]))
        ray_states[prefix_sum[state_id]] = tmp[l_id];
    if (did_hit(ray_states[state_id]))
        ray_states[prefix_sum[state_id]] = tmp[l_id];
    if (did_hit(ray_states[state_id]))
        ray_states[prefix_sum[state_id]] = tmp[l_id];
    if (did_hit(ray_states[state_id]))
        ray_states[prefix_sum[state_id]] = tmp[l_id];

      ... again local-work-size times

Considering that work-item execution order is undefined (the hardware may choose any order it wants), this will lead to random results. Can you try this instead:

    if (did_hit(tmp[l_id]))
        ray_states[prefix_sum[state_id]] = tmp[l_id];

BTW, if ray_states_size is just a simple integer, you can pass it directly by making the argument "uint ray_states_size". No need to fool around with buffers there.

EDIT1: my suggestion will only work if prefix_sum[state_id] doesn't have any duplicates within each local-work-size worth of ids; otherwise there will still be a data race. So e.g. if prefix_sum[state_id] is 0 for both state_ids 1 and 3, and your local WG size is >= 4, there will be a data race.

Also, is there some really good reason you must use the same buffer for input and output? It seems to me it'd be a lot less complicated if you had separate in/out buffers.

EDIT2: I just noticed you said "the rays are always moved in the buffer to a smaller index than their original" (sorry, I missed it). That's good, but not enough: are they always moved to a smaller index than any other ray's index in the same local WG? If yes, good, but there's still the other data race I mentioned.