Shared memory configuration for prefetching

Question

Shared memory configuration for prefetching

607 Views Asked by Dorota Kadłubowska At 08 November 2025 at 14:53

In my program I use shared memory to do prefetching of data. A 2D block of threads, dimentions 8 by 4 (32), gets 8 * 4 * 8 * sizeof(float4) bytes of shared memory. Each thread copies 8 float4s in a loop:

inline __device__ void pack(const float4 *g_src, float4 *s_dst, const unsigned int w, const unsigned int d) {
    uint2 indx = { blockIdx.x * blockDim.x + threadIdx.x, blockIdx.y * blockDim.y + threadIdx.y };
    uint2 sindx = { threadIdx.x, threadIdx.y };
    int i;

    for (i = 0; i < d; ++i) s_dst[(sindx.y * blockDim.x + sindx.x) * d + i] = g_src[(w * indx.y + indx.x) * d + i];
}

where w is set to width of the global memory buffer (in number of float4s) and d is set to 8 (number of float4s copied).

Can such configuration and further usage of the memory, lead to bank conflicts, or will broadcasting be applied? Will this be a case also when threads copy only, say 5 float4s, not 8?

MK

P.S. Same topic in Nvidia forum

Original Q&A

There are 1 best solutions below

**stuhlo** · Accepted Answer

During prefetching phase bank conflicts will occur. E.g. threads within first warp with IDs (computed as threadIdx.x + threadIdx.y * blockDim.x) 0, 4, 8, ... 28 access same bank. You can see it as thread (0,0) and thread (4,0) for i equals 0 access s_dst[0] and s_dst[32] belonging to the same bank.

If bank conflicts occur during further usage depends on the scheme you will access s_dst.

Broadcast mechanism is applied only when threads simultaneously read the same address.

How many bank conflicts occur depends on the value of d. If d mod 32 == 1 there won't be any conflicts.

EDIT: IMHO the best way to avoid bank conflicts in prefetching phase, specially if d is changing, is to equaly split the work among the warps. Lets say you need to prefetch n values to shared memory, w_id is ID of warp and l_id is ID of thread within warp (from 0 to 31). Than prefetching should look like this:

for(int i = l_id + w_id*WARP_SIZE; i < n; i += WARP_SIZE*COUNT_OF_WARPS_IN_BLOCK)
{
    s_dst[i] = ...;
}

But this helps only to avoid bank conflicts during prefetching. As I have already said to avoid conflicts during further usage depends on the scheme you will access s_dst.

Shared memory configuration for prefetching

There are 1 best solutions below

Related Questions in CUDA

Related Questions in NVIDIA

Related Questions in GPU-SHARED-MEMORY

Related Questions in BANK-CONFLICT

Trending Questions

Popular # Hahtags

Popular Questions