In my program I use shared memory to do prefetching of data. A 2D block of threads, dimentions 8 by 4 (32), gets 8 * 4 * 8 * sizeof(float4) bytes of shared memory. Each thread copies 8 float4s in a loop:
inline __device__ void pack(const float4 *g_src, float4 *s_dst, const unsigned int w, const unsigned int d) {
uint2 indx = { blockIdx.x * blockDim.x + threadIdx.x, blockIdx.y * blockDim.y + threadIdx.y };
uint2 sindx = { threadIdx.x, threadIdx.y };
int i;
for (i = 0; i < d; ++i) s_dst[(sindx.y * blockDim.x + sindx.x) * d + i] = g_src[(w * indx.y + indx.x) * d + i];
}
where w is set to width of the global memory buffer (in number of float4s) and d is set to 8 (number of float4s copied).
Can such configuration and further usage of the memory, lead to bank conflicts, or will broadcasting be applied? Will this be a case also when threads copy only, say 5 float4s, not 8?
MK
P.S. Same topic in Nvidia forum
During prefetching phase bank conflicts will occur. E.g. threads within first warp with IDs (computed as
threadIdx.x + threadIdx.y * blockDim.x) 0, 4, 8, ... 28 access same bank. You can see it as thread (0,0) and thread (4,0) foriequals 0 accesss_dst[0]ands_dst[32]belonging to the same bank.If bank conflicts occur during further usage depends on the scheme you will access
s_dst.Broadcast mechanism is applied only when threads simultaneously read the same address.
How many bank conflicts occur depends on the value of
d. Ifd mod 32 == 1there won't be any conflicts.EDIT: IMHO the best way to avoid bank conflicts in prefetching phase, specially if
dis changing, is to equaly split the work among the warps. Lets say you need to prefetchnvalues to shared memory,w_idis ID of warp andl_idis ID of thread within warp (from 0 to 31). Than prefetching should look like this:But this helps only to avoid bank conflicts during prefetching. As I have already said to avoid conflicts during further usage depends on the scheme you will access
s_dst.