pyopencl crashes randomly after enqueueing a kernel on device

42 Views Asked by At

I am trying to perform grayscale ( a random operation I arbitrarily choose ) to perform in a heterogenous manner. If I have 100 pictures, i want to split them across all the available devices in the computer so that they will finish at the same time thus producing a faster implementation compared to using a single device.

The kernel code I am trying to run is this:

// GRAYSCALE = """
__kernel void grayscale(__global uchar *input,
                        __global uchar *output,
                        uint width,
                        ulong size,
                        const ulong chunk_size)
{
    ulong gid_column = get_global_id(0);
    ulong gid_row = get_global_id(1);
    ulong pixel;
    float luma,r,g,b;
    while(1){};
    for(ulong i = 0; i<chunk_size;i++){
        pixel = ((gid_column*chunk_size)+(gid_row*width)+i)*3;
        if(pixel<size-3){
            r = input[pixel+0];
            g = input[pixel+1];
            b = input[pixel+2];
            luma = 0.299f * b + 0.587f * g + 0.114f * r;
            output[pixel+0] = luma;
            output[pixel+1] = luma;
            output[pixel+2] = luma;
        }
        else break;
    } 
}
///    """

The while(1){}; is added intentionally because I want to see how scheduler will work, and I need the device to spend more time on the task.

This is where the issue occurs:


    def grayscale(self, images):
        result = []
        device_queues = {}
        device_queues_status = {}
        DEBUG_PROCESED_TASKS = {}
        selected_device = None

        for device in self.opencl_devices:
            device_queues[device] = []
            device_queues_status[device] = 0
            DEBUG_PROCESED_TASKS[device.device_name] = 0

        for image in images:
            for device in device_queues:
                device_queues_status[
                    device] = OpenCLFunctions.Pictures.get_work_amount(
                        device_queues[device])
            selected_device = min(device_queues_status,
                                  key=device_queues_status.get)
            ev = selected_device._grayscale(image)
            device_queues[selected_device].append(ev)
            DEBUG_PROCESED_TASKS[selected_device.device_name] += 1
        print(DEBUG_PROCESED_TASKS)


    def _grayscale(self, input_image):
        ENQUEUE_PROCESS_ASYNC = opencl_device.enqueue_nd_range_kernel
        ENQUEUE_COPY_ASYNC = opencl_device.enqueue_copy
        OPENCL_BUFFER = opencl_device.Buffer

        program = self.kernels["_grayscale"].grayscale
        buffers = OpenCLBuffer(
            OPENCL_BUFFER(self.context,
                          opencl_device.mem_flags.READ_WRITE
                          | opencl_device.mem_flags.ALLOC_HOST_PTR,
                          size=input_image.size),
            OPENCL_BUFFER(self.context,
                          opencl_device.mem_flags.READ_WRITE
                          | opencl_device.mem_flags.ALLOC_HOST_PTR,
                          size=input_image.size))
        input_copy = ENQUEUE_COPY_ASYNC(self.queue,
                                        buffers.opencl_buffer_input,
                                        input_image,
                                        is_blocking=False)
        program.set_args(
            buffers.opencl_buffer_input, buffers.opencl_buffer_output,
            opencl_device_numpy.uint32(input_image.shape[0]),
            opencl_device_numpy.uint64(buffers.size),
            opencl_device_numpy.uint64(self.CHUNK_PROCESSING_SIZE))
        max_work_group_size_kernel = program.get_work_group_info(
            opencl_device.kernel_work_group_info.WORK_GROUP_SIZE, self.device)
        prefered_local_size = program.get_work_group_info(
            opencl_device.kernel_work_group_info.
            PREFERRED_WORK_GROUP_SIZE_MULTIPLE, self.device)
        global_size = OpenCLFunctions.Pictures._get_global_size_picture(
            input_image.shape[0], input_image.shape[1],
            self.CHUNK_PROCESSING_SIZE)
        local_size, global_size = OpenCLFunctions.OpenCLScheduler._get_optimal_local_global_size(
            global_size, max_work_group_size_kernel,
            self.device.max_work_group_size, self.device.max_work_item_sizes,
            prefered_local_size)
        try:
            ev = ENQUEUE_PROCESS_ASYNC(self.queue,
                                       program,
                                       global_size,
                                       local_size,
                                       wait_for=[input_copy])
            return [ev, buffers.opencl_buffer_output]
        except Exception as e:
            print(e)

Basically, this function will assign something to process to the first device. The next round it comes, it will assign a new task to the least busy device. The problem it is after the function being called here ev = selected_device._grayscale(image) the debugger is no longer able to view the variables, and the main thread from python hangs. it does work a bit after enqueueing the kernel, but it acts sporadically.

Is there something that maybe I am forgetting?

P.S: This is just for research and experimenting, I just want to understand this particular issue.

Tried:

  • prevent the python garbage collector from collecting the events
  • use a long for loop instead of a while(1){};
  • add to watch the variables I am following so that I can catch the event being busy after it gets to scheduler.
  • using smaller data

expectation: What I am expecting is to see a task being busy (because of

while(1){}; 

on the GPU, the scheduler sees it is busy, then it assigns the next task to other device (CPU or iGPU)

0

There are 0 best solutions below