As a school project, we're 4 working on a parallel raytracer with OpenCL. It's our first project using OpenCL, so we might have some incomprehensions about it.
We're trying to implement parallel buffer compaction to remove finished rays, or rays that didn't collide with anything so the next iteration has less data to process.
Basically, we've got a buffer of as many s_ray_states
as needed for rendering, tracing them, getting collision data, compacting the buffer so there'd be only rays which collided with an object inside it, then shading them.
So we have a buffer uint *prefix_sum
which contains the indices at which each s_ray_state
must be moved to in the buffer s_ray_state *ray_states
in order to reduce the number of rays which are sent to the shading kernel, and the next iterations of the trace/shade kernels.
Sadly, the ray_sort
kernel below doesn't seem to be working right, we verified the input prefix_sum
data, which is 100% correct, same for ray_states
buffer, but we're getting unwanted data in the output.
We're launching a single workgroup ( global work size = local work size ), the rays are always moved in the buffer to a smaller index than their original. We've put barriers, and are using the s_ray_state *tmp
buffer to prevent parallel executions to write on each-other's data, but it doesn't seem to work, even when removing the barriers we'd get the same result.
The both of us have been working on it for 4 days and have already asked for help from other students, but no-one seems to be able to figure out what's wrong. We may not be understanding barriers / mem fences enough to be sure this can in fact work.
We already tried making a single work item in a single work group sort the whole array, which works, and even gives better performance.
Is the code below supposed to be working ? With our understanding of OpenCL, it should be working, and we did a lot of research, but never really got any clear answer..
kernel void ray_sort(
global read_only uint *prefix_sum,
global read_write struct s_ray_state *ray_states,
global read_only uint *ray_states_size,
local read_write struct s_ray_state *tmp
)
{
int l_size = get_local_size(0);
int l_id = get_local_id(0);
int group_id = -1;
int group_nb = *ray_states_size / l_size;
int state_id;
while (++group_id < group_nb)
{
state_id = group_id * l_size + l_id;
tmp[l_id] = ray_states[state_id];
barrier(CLK_LOCAL_MEM_FENCE);
if (did_hit(tmp[l_id]))
ray_states[prefix_sum[state_id]] = tmp[l_id];
barrier(CLK_GLOBAL_MEM_FENCE);
}
}
ray_states
length is ray_states_size
prefix_sum
contains indices at which each ray_states
element must be moved to
tmp
is a local buffer of size local_work_size
local_work_size
= global_work_size
did_hit()
returns 1 if the ray hits an object, 0 otherwise
We're expecting the ray_states
elements to be moved to the indices contained in prefix_sum
Example: Each ray_states[id]
gets moved to prefix_sum[id]
index in
ray_states
prefix_sum: 0 | 0 | 1 | 1 | 2 | 3 | 3 | 3 | 4
did_hit(ray_states[id]): 0 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 0
did_hit(output[id]): 1 | 1 | 1 | 1 | X | X | X | X | X
X
s can be anything
I could be completely off here, but it seems to me with
did_hit(ray_states[state_id])
you are reading the same piece of global memory which you put into a local memory buffer tmp just 2 lines above. Which wouldn't be a problem, except you are using that buffer for both input and output.The way i see it, what actually happens on the hardware is this:
Considering WItem parallel execution order is undefined (hardware may chose any order it wants), this will lead to random results. Can you try this instead:
BTW if
ray_states_size
is just a simple integer, you can pass it directly, by making the argument "uint ray_states_size". No need to fool around with buffers there.EDIT1: my suggestion will only work if
prefix_sum[state_id]
doesn't have any duplicates in every local-work-size worth of ids, otherwise there will still be a data race. So e.g. if for bothstate_id
-s 1 and 3 theprefix_sum[state_id]
array has 0, and your local WG size is >= 4, there will be a data race.Also, is there some really good reason you must use the same buffer for input and output ? it seems to me it'd be a lot less complicated if you had separate in/out buffers.
EDIT2: i just noticed you said "the rays are always moved in the buffer to a smaller index than their original" (sorry i missed it). That's good, but not enough - are they always moved to a smaller index then any other ray's index in the same local WG ? if yes, good, but there's still the other data race i mentioned.