I was trying to reproduce a bank conflict scenario (minimal working example here) and decided to perform a benchmark when a warp (32 threads) access 32 integers of size 32-bits each in the following 2 scenarios:
- When there is no bank conflict (
offset=1
) - When there is a bank conflict (
offset=32
, all threads are accessingbank 0
)
Here is a sample of the code (only the kernel):
__global__ void kernel(int offset) {
__shared__ uint32_t shared_memory[MEMORY_SIZE];
// init shared memory
if (threadIdx.x == 0) {
for (int i = 0; i < MEMORY_SIZE; i++)
shared_memory[i] = i;
}
__syncthreads();
uint32_t index = threadIdx.x * offset;
// 2048 / 32 = 64
for (int i = 0; i < 64; i++)
{
shared_memory[index] += index * 10;
index += 32;
index %= MEMORY_SIZE;
__syncthreads();
}
}
I expected the version with offset=32
to run slower than the one with offset=1
as access should be serialized but found out that they have similar output time. How is that possible ?
You have only 1 working warp, so biggest problem with your performance is that each (or most) GPU command awaits for finishing previous one. This hides most shared memory conflicts slowdown. You also have a lot of work per each shared memory access. How many small commands there are in cosf? Try simple integer arithmetics instead.