I am trying to run a grayscale conversion (an operation I chose arbitrarily) in a heterogeneous manner. If I have 100 pictures, I want to split them across all the available devices in the computer so that they all finish at the same time, producing a faster implementation than using a single device.
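For reference, opencl_device in the snippets below is essentially pyopencl (and opencl_device_numpy is numpy). The per-device setup is roughly this, simplified (one context and one command queue per device so each device can be fed independently, not my exact wrapper classes):

import pyopencl as cl

# one context and one command queue per device, so every device
# can be fed work independently of the others
devices = [d for p in cl.get_platforms() for d in p.get_devices()]
contexts = {d: cl.Context(devices=[d]) for d in devices}
queues = {d: cl.CommandQueue(contexts[d]) for d in devices}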
The kernel code I am trying to run is this:
__kernel void grayscale(__global uchar *input,
                        __global uchar *output,
                        uint width,
                        ulong size,
                        const ulong chunk_size)
{
    ulong gid_column = get_global_id(0);
    ulong gid_row = get_global_id(1);
    ulong pixel;
    float luma, r, g, b;

    while(1){};  /* intentional: keeps the device busy for the scheduling test */

    /* each work item converts chunk_size consecutive pixels of its row */
    for (ulong i = 0; i < chunk_size; i++) {
        pixel = ((gid_column * chunk_size) + (gid_row * width) + i) * 3;
        if (pixel < size - 3) {
            r = input[pixel + 0];
            g = input[pixel + 1];
            b = input[pixel + 2];
            /* weighted sum of the three channels, written back to all three */
            luma = 0.299f * b + 0.587f * g + 0.114f * r;
            output[pixel + 0] = luma;
            output[pixel + 1] = luma;
            output[pixel + 2] = luma;
        }
        else break;
    }
}
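With that indexing, each work item in dimension 0 handles chunk_size consecutive pixels of one row, and dimension 1 selects the row, so the global size I launch is roughly (ceil(width / chunk_size), height). A simplified sketch of what my _get_global_size_picture helper computes (illustrative, not the exact code):

import math

def get_global_size_picture(width, height, chunk_size):
    # dimension 0: one work item per chunk of `chunk_size` pixels in a row
    # dimension 1: one work item per image row
    return (math.ceil(width / chunk_size), height)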
The while(1){}; is added intentionally: I want to see how the scheduler behaves, and I need the device to spend more time on the task.
This is where the issue occurs:
def grayscale(self, images):
    result = []
    device_queues = {}
    device_queues_status = {}
    DEBUG_PROCESED_TASKS = {}
    selected_device = None

    # one pending-task list per available OpenCL device
    for device in self.opencl_devices:
        device_queues[device] = []
        device_queues_status[device] = 0
        DEBUG_PROCESED_TASKS[device.device_name] = 0

    for image in images:
        # refresh the pending-work estimate for every device
        for device in device_queues:
            device_queues_status[device] = OpenCLFunctions.Pictures.get_work_amount(
                device_queues[device])
        # hand the next image to the least busy device
        selected_device = min(device_queues_status,
                              key=device_queues_status.get)
        ev = selected_device._grayscale(image)
        device_queues[selected_device].append(ev)
        DEBUG_PROCESED_TASKS[selected_device.device_name] += 1

    print(DEBUG_PROCESED_TASKS)
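get_work_amount is meant to be a non-blocking estimate of how much work is still pending on a device. A minimal sketch of the idea behind it (not my exact implementation), assuming the queue entries are the [event, output_buffer] pairs that _grayscale returns:

import pyopencl as cl

def get_work_amount(pending):
    # count the enqueued kernels that have not completed yet, without blocking
    busy = 0
    for ev, _out_buf in pending:
        status = ev.get_info(cl.event_info.COMMAND_EXECUTION_STATUS)
        if status != cl.command_execution_status.COMPLETE:
            busy += 1
    return busy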
def _grayscale(self, input_image):
    ENQUEUE_PROCESS_ASYNC = opencl_device.enqueue_nd_range_kernel
    ENQUEUE_COPY_ASYNC = opencl_device.enqueue_copy
    OPENCL_BUFFER = opencl_device.Buffer
    program = self.kernels["_grayscale"].grayscale

    # input and output buffers, both read/write and host-allocated
    buffers = OpenCLBuffer(
        OPENCL_BUFFER(self.context,
                      opencl_device.mem_flags.READ_WRITE
                      | opencl_device.mem_flags.ALLOC_HOST_PTR,
                      size=input_image.size),
        OPENCL_BUFFER(self.context,
                      opencl_device.mem_flags.READ_WRITE
                      | opencl_device.mem_flags.ALLOC_HOST_PTR,
                      size=input_image.size))

    # non-blocking host-to-device copy of the image
    input_copy = ENQUEUE_COPY_ASYNC(self.queue,
                                    buffers.opencl_buffer_input,
                                    input_image,
                                    is_blocking=False)

    program.set_args(
        buffers.opencl_buffer_input, buffers.opencl_buffer_output,
        opencl_device_numpy.uint32(input_image.shape[0]),
        opencl_device_numpy.uint64(buffers.size),
        opencl_device_numpy.uint64(self.CHUNK_PROCESSING_SIZE))

    # query the kernel's work-group limits on this device
    max_work_group_size_kernel = program.get_work_group_info(
        opencl_device.kernel_work_group_info.WORK_GROUP_SIZE, self.device)
    prefered_local_size = program.get_work_group_info(
        opencl_device.kernel_work_group_info.
        PREFERRED_WORK_GROUP_SIZE_MULTIPLE, self.device)

    global_size = OpenCLFunctions.Pictures._get_global_size_picture(
        input_image.shape[0], input_image.shape[1],
        self.CHUNK_PROCESSING_SIZE)
    local_size, global_size = OpenCLFunctions.OpenCLScheduler._get_optimal_local_global_size(
        global_size, max_work_group_size_kernel,
        self.device.max_work_group_size, self.device.max_work_item_sizes,
        prefered_local_size)

    # enqueue the kernel asynchronously, ordered after the input copy
    try:
        ev = ENQUEUE_PROCESS_ASYNC(self.queue,
                                   program,
                                   global_size,
                                   local_size,
                                   wait_for=[input_copy])
        return [ev, buffers.opencl_buffer_output]
    except Exception as e:
        print(e)
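For completeness, reading a result back once the returned event completes would look roughly like this (a simplified sketch, not the code that is failing; the name read_back is just for illustration):

import numpy as np
import pyopencl as cl

def read_back(queue, task, image_shape):
    # `task` is the [event, output_buffer] pair returned by _grayscale
    ev, out_buf = task
    host_out = np.empty(image_shape, dtype=np.uint8)
    # blocking copy ordered after the kernel event; with the while(1){}
    # still in the kernel this would of course never return
    cl.enqueue_copy(queue, host_out, out_buf, wait_for=[ev], is_blocking=True)
    return host_out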
Basically, the grayscale function assigns the first image to the first device; on each following round it assigns a new task to the least busy device. The problem is that after the call ev = selected_device._grayscale(image), the debugger is no longer able to view the variables and the main Python thread hangs. It does work a bit after enqueueing the kernel, but it behaves sporadically.
Is there something I might be forgetting?
P.S.: This is just for research and experimentation; I just want to understand this particular issue.
Tried:
- preventing the Python garbage collector from collecting the events
- using a long for loop instead of the while(1){};
- adding the variables I am following to the debugger's watch list so that I can catch the event being busy after it reaches the scheduler
- using smaller data
Expectation: a task stays busy on the GPU (because of the while(1){};), the scheduler sees that the device is busy, and it assigns the next task to another device (CPU or iGPU).