How to change the value of CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE on an OpenCL Mali platform?

I am developing OpenCL code for several devices. At the moment I am working with a Rockchip RK3588 (OpenCL device: Mali-G610 r0p0). The algorithm was originally written in CUDA, where the warp size is 32. In OpenCL the corresponding value is called the "sub-group size" (the number of work-items that execute together at the same time). This value can also be obtained from CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE.

For example, on an Intel GPU I can set this value with __attribute__((intel_reqd_sub_group_size(32))). On the Mali-G610 r0p0, however, I get CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE = 16, and the program does not work correctly. I need to change this value to 32.

clinfo returned the following information:
................
Preferred work group size multiple (device) 16
Preferred work group size multiple (kernel) 16
Max sub-groups per work group 64
................

Perhaps someone can help me with this?

There is 1 best solution below

You're interpreting the value incorrectly. The "Preferred Work Group Size Multiple" is just that: a preference. The OpenCL API is telling you that it prefers work-group sizes that are multiples of 16, but you can specify other sizes if you absolutely need to. Luckily for you, the size you need is 32, which is itself a multiple of 16, so the device is perfectly happy with it.
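To make the "multiple of 16" point concrete, here is a minimal sketch of the check. The helper name matchesPreferredMultiple is hypothetical and not part of the OpenCL API; the values 32 and 16 are the ones from the question.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not an OpenCL function): returns true when the chosen
// local work size is a whole multiple of the value the runtime reports for
// CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE.
bool matchesPreferredMultiple(std::size_t localSize, std::size_t preferredMultiple) {
    assert(preferredMultiple > 0); // a reported multiple of 0 would be meaningless
    return localSize > 0 && localSize % preferredMultiple == 0;
}
```

With the numbers from the question, matchesPreferredMultiple(32, 16) is true, while a size such as 24 would not follow the preference (it may still be accepted, it is just not what the device prefers).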

What you actually need to do is specify the work-group size explicitly when you enqueue the kernel on the GPU. If you don't specify it, the implementation chooses a size itself. Its choices are usually pretty good, but if your algorithm depends on how the work groups themselves are laid out (very common with reduction algorithms), you might have to specify the size manually.

When you submit your kernel, you are probably using code that resembles this:

size_t globalWorkSize[] {512};
cl_event event;

clEnqueueNDRangeKernel(
    command_queue,
    kernel,
    1, //work_dim
    nullptr, //global_work_offset
    globalWorkSize,
    nullptr, //local_work_size
    0, //num_events_in_wait_list
    nullptr, //event_wait_list
    &event
);

You need to pass an argument that explicitly specifies the work-group size instead of nullptr for local_work_size:

size_t globalWorkSize[] {512};
size_t localWorkSize[] {32}; //Will create 16 work groups
cl_event event;

clEnqueueNDRangeKernel(
    command_queue,
    kernel,
    1, //work_dim
    nullptr, //global_work_offset
    globalWorkSize,
    localWorkSize,
    0, //num_events_in_wait_list
    nullptr, //event_wait_list
    &event
);
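
One thing to watch once the local size is explicit: OpenCL 1.x requires the global work size to be evenly divisible by the local work size, otherwise clEnqueueNDRangeKernel fails with CL_INVALID_WORK_GROUP_SIZE (newer versions can relax this, but not every implementation does). A common workaround is to round the global size up and have the kernel skip the extra work-items. The sketch below shows only the host-side arithmetic; the helper names roundUpGlobalSize and workGroupCount are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: rounds the number of items to process up to the next
// multiple of localSize, so the result can be used as global_work_size.
std::size_t roundUpGlobalSize(std::size_t itemCount, std::size_t localSize) {
    assert(localSize > 0);
    return ((itemCount + localSize - 1) / localSize) * localSize;
}

// Hypothetical helper: number of work groups the runtime will launch for a
// global size that is already a multiple of localSize.
std::size_t workGroupCount(std::size_t globalSize, std::size_t localSize) {
    assert(localSize > 0 && globalSize % localSize == 0);
    return globalSize / localSize;
}
```

For the sizes used above, workGroupCount(512, 32) gives the 16 work groups mentioned in the comment. If the real item count were 500, roundUpGlobalSize(500, 32) would give 512, and the kernel would need a guard such as if (get_global_id(0) >= n) return; so that the 12 padding work-items do nothing.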