I have read the CUDA programming guide, but I missed one thing. Let's say I have an array of 32-bit ints in global memory and I want to copy it to shared memory with coalesced access. The global array has indices 0 to 1023, and let's say I have 4 blocks, each with 256 threads.
__shared__ int sData[256];
In which of the following cases is the access coalesced?
1.
sData[threadIdx.x] = gData[blockIdx.x * blockDim.x + threadIdx.x];
Addresses 0 to 255 in global memory are read by block 0, with each warp of 32 threads reading 32 consecutive addresses, so this is coalesced, right?
2.
sData[threadIdx.x] = gData[blockIdx.x * blockDim.x + threadIdx.x + someIndex];
If someIndex is not a multiple of 32, is the access not coalesced because the addresses are misaligned? Is that correct?
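For reference, here is the complete kernel I have in mind for case 1 (just a sketch; the kernel name copyToShared is made up, and gData is assumed to be a 1024-element device array):

__global__ void copyToShared(const int *gData)
{
    __shared__ int sData[256];

    // Each block copies its own contiguous 256-element tile:
    // block 0 reads indices 0..255, block 1 reads 256..511, and so on.
    // Consecutive threads in a warp read consecutive 4-byte words.
    sData[threadIdx.x] = gData[blockIdx.x * blockDim.x + threadIdx.x];

    __syncthreads();

    // ... work on sData ...
}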
The rules for which accesses can be coalesced are somewhat complicated and they have changed over time. Each new CUDA architecture is more flexible in what it can coalesce. I would say not to worry about it at first. Instead, do the memory accesses in whatever way is the most convenient and then see what the CUDA profiler says.
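For example, here is a small self-contained sketch you could build and run under the profiler, once with offset = 0 and once with offset = 1 (the kernel name copyWithOffset and the driver code are just for illustration, not from your question):

#include <cuda_runtime.h>

// Each block copies a 256-element tile into shared memory, optionally
// shifted by offset elements, then writes it back out so the compiler
// cannot remove the load being measured.
__global__ void copyWithOffset(const int *gData, int *gOut, int offset)
{
    __shared__ int sData[256];

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    sData[threadIdx.x] = gData[i + offset];   // the global load of interest
    __syncthreads();

    gOut[i] = sData[threadIdx.x];
}

int main()
{
    const int n = 1024;
    int *gData, *gOut;
    cudaMalloc(&gData, (n + 64) * sizeof(int));  // extra slack so a small offset stays in bounds
    cudaMalloc(&gOut, n * sizeof(int));
    cudaMemset(gData, 0, (n + 64) * sizeof(int));

    copyWithOffset<<<4, 256>>>(gData, gOut, 0);  // aligned, contiguous loads
    copyWithOffset<<<4, 256>>>(gData, gOut, 1);  // shifted by one element
    cudaDeviceSynchronize();

    cudaFree(gData);
    cudaFree(gOut);
    return 0;
}

On toolkits that still ship nvprof, something like nvprof --metrics gld_efficiency,gld_transactions_per_request ./a.out shows the difference per launch; newer toolkits report the equivalent sector and transaction counts in Nsight Compute. On recent architectures the offset-by-one case typically just costs one extra memory transaction per warp rather than falling back to fully uncoalesced loads, which is exactly why measuring beats reasoning from the old rules.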