Can I do random writes from a kernel without worrying about synchronization issues?


Consider a simple depth-of-field filter (my actual use case is similar). It loops over the image and scatters every pixel over a circular neighborhood around it. The radius of the neighborhood depends on the depth of the pixel: the closer it is to the focal plane, the smaller the radius.

Note that I said "scatters" and not "gathers". In simpler image processing applications, you normally use the "gather" technique to perform a uniform Gaussian blur: you loop over the neighborhood of each pixel and "gather" the nearby values into a weighted average. This works fine in that case, but if you make the blur kernel vary between pixels while still gathering, you get a somewhat unrealistic effect. Such "space-variant filtering" scenarios are where scattering and gathering give different results.

To be clear: the scatter algo is something like this:

init resultImage to black
loop over sourceImage
    var c = fetch current pixel from sourceImage
    var toAdd = c * weight // weight < 1
    loop over circular neighbourhood of current sourcepixel
        add toAdd to current neighbor from resultImage

My question is: if I do a direct translation of this pseudocode to OpenCL, will there be synchronization issues due to different work-items simultaneously writing to the same output pixel?

Does the answer vary depending on whether I'm using Buffers or Images?

The course I'm reading suggests that there will be synchronization issues. On the other hand, I read the source of Mandelbulber 1.21-2, which does a straightforward OpenCL DOF just like my pseudocode above, and it seems to work fine.

(The relevant code is in mandelbulber-opencl-1.21-2.orig/usr/share/cl/cl_DOF.cl; here it is:)

//*********************************************************
//                   MANDELBULBER
// kernel for DOF effect
// 
//
// author: Krzysztof Marczak
// contact: [email protected]
// licence: GNU GPL v3.0
//
//*********************************************************

typedef struct
{
    int width;
    int height;
    float focus;
    float radius;
} sParamsDOF;

typedef struct
{
    float z;
    int i;
} sSortZ;

//------------------ MAIN RENDER FUNCTION --------------------
kernel void DOF(__global ushort4 *in_image, __global ushort4 *out_image, __global sSortZ *zBuffer, sParamsDOF p)
{
    const unsigned int i = get_global_id(0);

    uint index = p.height * p.width - i - 1;
    int ii = zBuffer[index].i;

    int2 scr = (int2){ii % p.width, ii / p.width};
    float z = zBuffer[index].z;
    float blur = fabs(z - p.focus) / z * p.radius;
    blur = min(blur, 500.0f);
    float4 center = convert_float4(in_image[scr.x + scr.y * p.width]);
    float factor = blur * blur * sqrt(blur)* M_PI_F/3.0f;
    int blurInt = (int)blur;

    int2 scr2;
    int2 start = (int2){scr.x - blurInt, scr.y - blurInt};
    start = max(start, 0);
    int2 end = (int2){scr.x + blurInt, scr.y + blurInt};
    end = min(end, (int2){p.width - 1, p.height - 1});

    for (scr2.y = start.y; scr2.y <= end.y; scr2.y++)
    {
        for(scr2.x = start.x; scr2.x <= end.x; scr2.x++)
        {
            float2 d = convert_float2(scr - scr2); // int2 does not convert implicitly to float2
            float r = length(d);
            float op = (blur - r) / factor;
            op = clamp(op, 0.0f, 1.0f);
            float opN = 1.0f - op;
            uint address = scr2.x + scr2.y * p.width;
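            // Non-atomic read-modify-write: if another work-item writes this
            // address between this read and the write below, one update is lost.
            float4 old = convert_float4(out_image[address]);
            out_image[address] = convert_ushort4(opN * old + op * center);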
        }
    }
}

Answer:


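No, you can't do that without worrying about synchronization. If two work-items scatter to the same location without synchronization, you have a race condition and won't get correct results. The Mandelbulber kernel has exactly this problem: reading out_image[address], blending, and writing the result back is not atomic, so when two work-items hit the same pixel at the same time one of the updates can be lost. In a soft blur such errors may be hard to spot, which could explain why it seems to work. The answer is the same for buffers and images. With buffers you could use atomics, but they can slow down your code, especially when there is contention (but even when not). AFAIK, read/write images don't have atomic operations, and in OpenCL 1.x a kernel can't read and write the same image at all: image arguments are either __read_only or __write_only (__read_write images were added in OpenCL 2.0).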