I have read the CUDA programming guide, but I missed one thing. Let's say I have an array of 32-bit ints in global memory and I want to copy it to shared memory with coalesced access. The global array has indexes from 0 to 1023, and let's say I have 4 blocks, each with 256 threads.
__shared__ int sData[256];
When is coalesced access performed?
1.
sData[threadIdx.x] = gData[threadIdx.x * blockIdx.x+gridDim.x*blockIdx.y];
Addresses 0 to 255 in global memory are read, 32 at a time by each warp of threads, so the access is coalesced here?
2.
sData[threadIdx.x] = gData[threadIdx.x * blockIdx.x+gridDim.x*blockIdx.y + someIndex];
If someIndex is not a multiple of 32, is the access not coalesced because of misaligned addresses? Is that correct?
Your indexing in case 1 is wrong (or intentionally so strange that it seems wrong): in some blocks every thread accesses the same element, so there is no way the access can be coalesced in those blocks.
Proof:
Example: with your launch configuration (4 blocks of 256 threads in a 1D grid, so blockIdx.y = 0 and the gridDim.x * blockIdx.y term is 0), the index threadIdx.x * blockIdx.x works out to:
block 0: every thread reads gData[0] (one element broadcast to all threads)
block 1: thread i reads gData[i] (consecutive elements, coalesced)
block 2: thread i reads gData[2*i] (stride of two, not coalesced)
block 3: thread i reads gData[3*i] (stride of three, not coalesced)
So it's a "luck" game whether a block is coalesced, so in general: no.
But the rules for coalesced memory reads are not as strict on newer CUDA hardware as they used to be.
Still, for compatibility you should try to optimise your kernels for the lowest compute capability you need to support, if possible.
Here is a nice source:
http://mc.stanford.edu/cgi-bin/images/0/0a/M02_4.pdf