I read in the post "In CUDA, what is memory coalescing, and how is it achieved?" that memory coalescing means consecutive threads access consecutive memory addresses at the same time, so I ran a simple experiment (code at the end of the post). Basically:
- I have an array in global memory (1024 floats) and 256 threads, so each thread is in charge of 4 elements of the array.
- coalesced_access: after processing its current element, each thread advances its index by 256, so on every iteration consecutive threads access consecutive global memory addresses.
- non_coalesced_access: each thread accesses 4 consecutive addresses (e.g., thread 0 accesses 0, 1, 2, 3), so for consecutive threads the memory accesses are not consecutive. (A small sketch of the two patterns follows this list.)
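To make the two patterns concrete, here is a small host-only sketch I wrote (an illustration, not part of the experiment) that prints the byte offsets the first warp (threads 0-31) touches on loop iteration 0 under each scheme:

#include <cstdio>

int main() {
    const int i = 0;  // first loop iteration
    for (int tid = 0; tid < 32; ++tid) {
        int coalesced_byte = (i * 256 + tid) * 4;  // index = tid, then index += 256
        int strided_byte   = (4 * tid + i) * 4;    // index = 4 * tid, then index + i
        printf("tid %2d: coalesced byte %4d, non-coalesced byte %4d\n",
               tid, coalesced_byte, strided_byte);
    }
    return 0;
}

The coalesced pattern covers bytes 0-127 (one contiguous 128-byte stretch per warp), while the non-coalesced pattern touches bytes 0, 16, 32, ..., 496.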
The code itself is straightforward (at the end of the post). I figured coalesced_access might be faster since the memory access is coalesced, but might it also have more cache misses, since the four addresses a single thread accesses are not close together?
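To illustrate that hunch, here is another quick sketch (again my own illustration) printing the byte offsets a single thread touches across its four iterations under each kernel:

#include <cstdio>

int main() {
    const int tid = 0;  // look at a single thread
    printf("coalesced_access, thread %d touches bytes: ", tid);
    for (int i = 0; i < 4; ++i)
        printf("%d ", (i * 256 + tid) * 4);  // 0 1024 2048 3072
    printf("\nnon_coalesced_access, thread %d touches bytes: ", tid);
    for (int i = 0; i < 4; ++i)
        printf("%d ", (4 * tid + i) * 4);    // 0 4 8 12
    printf("\n");
    return 0;
}

So per thread, the coalesced kernel's accesses are 1 KB apart, while the non-coalesced kernel stays within a 16-byte span; that spread is where I suspected the extra cache misses might come from.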
I am not familiar with the caching mechanism in CUDA, but I did get the following profiling results from ncu (Nsight Compute):
coalesced_access:

    Metric                    Unit            Value
    ------------------------  -------------  ------
    DRAM Frequency            cycle/nsecond    4.76
    SM Frequency              cycle/usecond  973.50
    Elapsed Cycles            cycle           3,219
    Memory [%]                %                0.78
    DRAM Throughput           %                0.49
    Duration                  usecond          3.26
    L1/TEX Cache Throughput   %               47.67
    L2 Cache Throughput       %                0.78
    SM Active Cycles          cycle           23.25
    Compute (SM) [%]          %                0.25
non_coalesced_access:

    Metric                    Unit            Value
    ------------------------  -------------  ------
    DRAM Frequency            cycle/nsecond    5.21
    SM Frequency              cycle/nsecond    1.07
    Elapsed Cycles            cycle           5,274
    Memory [%]                %                1.14
    DRAM Throughput           %                0.68
    Duration                  usecond          4.86
    L1/TEX Cache Throughput   %               69.64
    L2 Cache Throughput       %                1.14
    SM Active Cycles          cycle           52.39
    Compute (SM) [%]          %                 0.1
It seems that coalesced_access spends fewer cycles but is indeed worse in terms of cache performance, since its L1/L2 cache throughput is lower. So the key questions are:
- Is there a trade-off between cache performance and memory coalescing that we need to consider carefully? Can memory coalescing hurt cache behavior badly enough to actually slow a program down?
- How do the L1/L2 caches on a GPU work? Do they work in a similar way to the ones in a CPU?
Here is the code:
#include <cstdio>
#include <cstdlib>

__host__ static void CheckCudaErrorAux(const char *, unsigned, const char *, cudaError_t);
#define CUDA_CHECK_RETURN(value) CheckCudaErrorAux(__FILE__, __LINE__, #value, value)

__host__ static void CheckCudaErrorAux(const char *file, unsigned line, const char *statement, cudaError_t err) {
    if (err == cudaSuccess)
        return;
    printf("%s returned %s(%d) at %s:%u\n", statement, cudaGetErrorString(err), err, file, line);
    exit(1);
}
// Each thread starts at threadIdx.x and strides by 256, so on every
// iteration threads 0..255 touch 256 consecutive floats.
__global__ void coalesced_access(float* data) {
    int index = threadIdx.x;
    #pragma unroll
    for (int i = 0; i < 4; i++, index += 256)
        ++data[index];
}
// Each thread owns 4 consecutive floats (thread 0: 0-3, thread 1: 4-7, ...),
// so consecutive threads do not touch consecutive addresses.
__global__ void non_coalesced_access(float* data) {
    int index = 4 * threadIdx.x;
    #pragma unroll
    for (int i = 0; i < 4; i++)
        ++data[index + i];
}
int main() {
    float *g_data1, *g_data2;
    CUDA_CHECK_RETURN(cudaMalloc(&g_data1, sizeof(float) * 1024));
    CUDA_CHECK_RETURN(cudaMalloc(&g_data2, sizeof(float) * 1024));
    // one block of 256 threads; each kernel touches all 1024 floats
    coalesced_access<<<1, 256>>>(g_data1);
    CUDA_CHECK_RETURN(cudaDeviceSynchronize());
    non_coalesced_access<<<1, 256>>>(g_data2);
    CUDA_CHECK_RETURN(cudaDeviceSynchronize());
    CUDA_CHECK_RETURN(cudaFree(g_data1));
    CUDA_CHECK_RETURN(cudaFree(g_data2));
    return 0;
}
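For what it's worth, here is a small timing helper one could bolt onto the code above (just a sketch using CUDA events; the warm-up launch and the helper name time_kernel are my own additions, and it relies on nvcc accepting a launch through a pointer to a __global__ function):

// Time a kernel with CUDA events rather than relying only on ncu's
// Duration metric. Launch once as a warm-up, then time a second launch.
static float time_kernel(void (*kernel)(float*), float* data) {
    cudaEvent_t start, stop;
    CUDA_CHECK_RETURN(cudaEventCreate(&start));
    CUDA_CHECK_RETURN(cudaEventCreate(&stop));
    kernel<<<1, 256>>>(data);  // warm-up launch
    CUDA_CHECK_RETURN(cudaEventRecord(start));
    kernel<<<1, 256>>>(data);
    CUDA_CHECK_RETURN(cudaEventRecord(stop));
    CUDA_CHECK_RETURN(cudaEventSynchronize(stop));
    float ms = 0.0f;
    CUDA_CHECK_RETURN(cudaEventElapsedTime(&ms, start, stop));
    CUDA_CHECK_RETURN(cudaEventDestroy(start));
    CUDA_CHECK_RETURN(cudaEventDestroy(stop));
    return ms;
}

// e.g., inside main(), after the cudaMalloc calls:
//   printf("coalesced: %f ms\n", time_kernel(coalesced_access, g_data1));
//   printf("non-coalesced: %f ms\n", time_kernel(non_coalesced_access, g_data2));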