My CUDA kernel generates data that is fed back to the host at the end of each block's execution.
The skeleton is as follows.
host_data, where the data is written, is allocated as host-mapped memory.
host_data_count is also mapped memory; it holds the number of elements produced.
The GPU I'm using is a GTX 580 (Fermi architecture, compute capability 2.0).
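For reference, the allocation is along these lines (a sketch; N is the buffer capacity, error checking omitted, and the dev_* alias names are just for illustration):

```cuda
int *host_data, *host_data_count;   // host-side pointers
int *dev_data, *dev_data_count;     // device-side aliases passed to the kernel

cudaSetDeviceFlags(cudaDeviceMapHost);  // must precede context creation
cudaHostAlloc(&host_data, N * sizeof(int), cudaHostAllocMapped);
cudaHostAlloc(&host_data_count, sizeof(int), cudaHostAllocMapped);
*host_data_count = 0;
cudaHostGetDevicePointer(&dev_data, host_data, 0);
cudaHostGetDevicePointer(&dev_data_count, host_data_count, 0);
```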
__global__ void kernel(int *host_data, int *host_data_count)
{
    __shared__ int shd_data[1024];
    __shared__ int shd_cnt;
    int i;

    if (threadIdx.x == 0)
        shd_cnt = 0;
    __syncthreads();

    while ( ... )
    {
        if (something happens)
        {
            i = atomicAdd(&shd_cnt, 1);
            shd_data[i] = d;
        }
    }
    __syncthreads();

    if (threadIdx.x == 0)
    {
        i = atomicAdd(host_data_count, shd_cnt);
        memcpy(&host_data[i], shd_data, shd_cnt * sizeof(int));
    }
}
What am I missing in this kernel code?
Can anybody help?
It's hard to tell what you are missing, because you didn't actually state the problem you are facing. I see a few possibilities, but it depends on the actual implementation of some of your higher-level concepts.
- host_data_count is of type int* (or similar?). It points either to global memory, or to host memory through mapped pinned memory. I would strongly suggest using global memory for the sake of speed.
- If host_data_count is pinned memory, keep in mind that the atomic operations are atomic only within the GPU. If, in the meantime, the CPU does something with the value, that may break the atomicity. You will most likely need to synchronize the host thread after the kernel call and before reading/using the value; kernel launches are always asynchronous.
- memcpy in device code? I assume you implemented it yourself, right? Are you copying the memory using a single thread, or the whole block? Using the whole block will be faster, but then you need to call the function outside the if (threadIdx.x == 0) branch, and the variable i must be shared.
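A block-wide version of the final copy-out might look like this (a sketch, with the offset promoted to a shared variable as suggested; here it is renamed base for clarity):

```cuda
__shared__ int base;               // shared so every thread sees the same offset
__syncthreads();                   // shd_data/shd_cnt writes must be complete
if (threadIdx.x == 0)
    base = atomicAdd(host_data_count, shd_cnt);
__syncthreads();                   // make base visible to the whole block
// every thread copies a strided subset; consecutive threads hit
// consecutive addresses, so the writes coalesce
for (int j = threadIdx.x; j < shd_cnt; j += blockDim.x)
    host_data[base + j] = shd_data[j];
```

On the host side, the count and data are only safe to read after the kernel has finished, e.g. call cudaDeviceSynchronize() after the launch and before dereferencing host_data_count.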