My CUDA kernel generates data that is fed back to the host at the end of each block's execution.
The skeleton is as follows.
host_data, where the data is written, is allocated as host-mapped memory.
host_data_count is also mapped memory; it holds the number of elements produced.
The GPU I'm using is a GTX 580 (Fermi architecture, compute capability 2.0).
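For reference, the allocation is along these lines (a sketch; N is the buffer capacity, error checking omitted, and the dev_* alias names are just for illustration):

```cuda
int *host_data, *host_data_count;   // host-side pointers
int *dev_data, *dev_data_count;     // device-side aliases passed to the kernel

cudaSetDeviceFlags(cudaDeviceMapHost);  // must precede context creation
cudaHostAlloc(&host_data, N * sizeof(int), cudaHostAllocMapped);
cudaHostAlloc(&host_data_count, sizeof(int), cudaHostAllocMapped);
*host_data_count = 0;
cudaHostGetDevicePointer(&dev_data, host_data, 0);
cudaHostGetDevicePointer(&dev_data_count, host_data_count, 0);
```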
__global__ void kernel(int *host_data, int *host_data_count)
{
    __shared__ int shd_data[1024];
    __shared__ int shd_cnt;
    int i;

    if (threadIdx.x == 0)
        shd_cnt = 0;
    __syncthreads();

    while ( ... )
    {
        if (something happens)
        {
            i = atomicAdd(&shd_cnt, 1);
            shd_data[i] = d;
        }
    }
    __syncthreads();

    if (threadIdx.x == 0)
    {
        i = atomicAdd(host_data_count, shd_cnt);
        memcpy(&host_data[i], shd_data, shd_cnt * sizeof(int));
    }
}
What am I missing in this kernel code?
Can anybody help?
It's hard to tell what you are missing, because you didn't actually state the problem you are facing. I see a few possibilities, but it depends on the actual implementation of some of your higher-level concepts.
- host_data_count is of type int* (or similar?). It points either to global memory, or to host memory through mapped pinned memory. I would strongly suggest using global memory for the sake of speed.
- If host_data_count is pinned memory, keep in mind that the atomic operations are atomic only within the GPU. If, in the meantime, the CPU does something with the value, that may break the atomicity. You will most likely need to synchronize the host thread after the kernel call and before reading/using the value; kernel launches are always asynchronous.
- memcpy in device code? I assume you implemented it yourself, right? Are you copying the memory using a single thread, or the whole block? Using the whole block will be faster, but then you need to call the function outside the if (threadIdx.x == 0) branch, and the variable i must be shared.
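A block-wide version of the final copy-out might look like this (a sketch, with the offset promoted to a shared variable as suggested; here it is renamed base for clarity):

```cuda
__shared__ int base;               // shared so every thread sees the same offset
__syncthreads();                   // shd_data/shd_cnt writes must be complete
if (threadIdx.x == 0)
    base = atomicAdd(host_data_count, shd_cnt);
__syncthreads();                   // make base visible to the whole block
// every thread copies a strided subset; consecutive threads hit
// consecutive addresses, so the writes coalesce
for (int j = threadIdx.x; j < shd_cnt; j += blockDim.x)
    host_data[base + j] = shd_data[j];
```

On the host side, the count and data are only safe to read after the kernel has finished, e.g. call cudaDeviceSynchronize() after the launch and before dereferencing host_data_count.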