In CUDA, to cover multiple blocks and thus increase the range of indices for arrays, we do something like this:
Host side code:
dim3 dimgrid(9,1); // 9 blocks will be launched
dim3 dimBlock(16,1); // each block has 16 threads, so the total number of threads in
// the grid is 16 x 9 = 144.
Device side code:
...
...
idx = blockIdx.x * blockDim.x + threadIdx.x; // idx ranges from 0 to 143
a[idx]=a[idx]*a[idx];
...
...
What is the equivalent in OpenCL for achieving the above?
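On the host, when you enqueue your kernel using clEnqueueNDRangeKernel, you have to specify the global and local work size. For instance, the CUDA launch above corresponds to a global work size of 144 (9 x 16) and a local work size of 16. Note that the global work size counts work-items, not work-groups.
In your kernel, use get_global_size(dim), get_local_size(dim), get_global_id(dim) and get_local_id(dim) to retrieve the global and local work sizes and indices respectively, where dim is 0 for x, 1 for y and 2 for z.
The equivalent of your idx will thus be simply:
size_t idx = get_global_id(0);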
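To make the mapping concrete, here is a minimal plain-C sketch. It does not call the OpenCL runtime: `ndrange_from_cuda_launch` and `emulate_square` are hypothetical helpers invented for illustration, and the kernel name `square`, as well as the `queue` and `kernel` objects mentioned in the comments, are assumptions. The loops only emulate what a 1-D NDRange with no global offset would hand to each work-item.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of OpenCL: converts a 1-D CUDA launch
 * configuration into the two sizes clEnqueueNDRangeKernel expects. */
typedef struct {
    size_t global; /* total work-items     (CUDA: gridDim.x * blockDim.x) */
    size_t local;  /* work-items per group (CUDA: blockDim.x)             */
} ndrange_1d;

static ndrange_1d ndrange_from_cuda_launch(size_t blocks, size_t threads_per_block)
{
    ndrange_1d r;
    assert(threads_per_block > 0);
    r.local  = threads_per_block;
    r.global = blocks * threads_per_block;
    return r;
}

/* With an existing command queue and kernel (both assumed here), the
 * host-side launch would look roughly like:
 *
 *   ndrange_1d r = ndrange_from_cuda_launch(9, 16);
 *   clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
 *                          &r.global, &r.local, 0, NULL, NULL);
 *
 * and the OpenCL C kernel being emulated below would be:
 *
 *   __kernel void square(__global float *a) {
 *       size_t idx = get_global_id(0);
 *       a[idx] = a[idx] * a[idx];
 *   }
 */

/* Hypothetical emulation of that kernel over a 1-D NDRange, no global offset. */
static void emulate_square(float *a, size_t global_size, size_t local_size)
{
    assert(local_size > 0 && global_size % local_size == 0);
    size_t num_groups = global_size / local_size;
    for (size_t group = 0; group < num_groups; ++group) {     /* get_group_id(0), CUDA blockIdx.x  */
        for (size_t local = 0; local < local_size; ++local) { /* get_local_id(0), CUDA threadIdx.x */
            /* Same value get_global_id(0) would return for this work-item. */
            size_t idx = group * local_size + local;
            a[idx] = a[idx] * a[idx];
        }
    }
}
```

The inner expression is the CUDA formula from the question, which is why a single call to get_global_id(0) can replace it in the kernel. In OpenCL 1.x the global work size must also be a multiple of the local work size, which 144 and 16 satisfy.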
See the OpenCL Reference Pages.