CUDA coalesced access acceleration and cache throughput


I read in the post In CUDA, what is memory coalescing, and how is it achieved? that memory coalescing means consecutive threads access consecutive memory addresses at the same time, so I performed a simple experiment (see the code at the end of the post). Basically:

  • I have an array of 1024 floats in global memory and 256 threads, so each thread is in charge of 4 elements of the array.
  • coalesced_access: after processing the current address, each thread's next address is offset by 256. This way, consecutive threads access consecutive global memory addresses (the sketch after this list works out the byte spans).
  • non_coalesced_access: each thread accesses 4 consecutive addresses (e.g., thread 0 accesses 0, 1, 2, 3), so for consecutive threads the memory accesses are not consecutive.
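
To make the two patterns concrete, here is a minimal host-side sketch (plain C++; the kernels themselves are at the end of the post, this only does the index arithmetic) that computes the byte span the 32 threads of warp 0 touch on each loop iteration under both indexing schemes:

#include <cstdio>

int main() {
    // Warp 0 is threads 0..31; sizeof(float) == 4 bytes.
    for (int i = 0; i < 4; i++) {
        // coalesced_access: thread t touches data[t + 256*i]
        int c_first = (0 + 256 * i) * 4;
        int c_last  = (31 + 256 * i) * 4 + 3;
        // non_coalesced_access: thread t touches data[4*t + i]
        int n_first = (4 * 0 + i) * 4;
        int n_last  = (4 * 31 + i) * 4 + 3;
        printf("iter %d: coalesced span %3d B, non-coalesced span %3d B\n",
               i, c_last - c_first + 1, n_last - n_first + 1);
    }
    return 0;
}

Every iteration, the coalesced kernel's warp covers exactly one 128-byte span, while the non-coalesced kernel's warp spreads over about 500 bytes, i.e. four 128-byte lines, so it needs roughly 4x the memory transactions per warp-wide access.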

The code itself is straightforward (at the end of the post). I figured coalesced_access might be faster since its memory accesses are coalesced, but that it might also incur more cache misses, since the four addresses each thread accesses are far apart.

I am not familiar with the caching mechanism in CUDA, but here are the profiling results I got (from ncu):

  • coalesced_access:
DRAM Frequency                           cycle/nsecond                           4.76
SM Frequency                             cycle/usecond                         973.50
Elapsed Cycles                                   cycle                          3,219
Memory [%]                                           %                           0.78
DRAM Throughput                                      %                           0.49
Duration                                       usecond                           3.26
L1/TEX Cache Throughput                              %                          47.67
L2 Cache Throughput                                  %                           0.78
SM Active Cycles                                 cycle                          23.25
Compute (SM) [%]                                     %                           0.25
  • non_coalesced_access:
DRAM Frequency                           cycle/nsecond                           5.21
SM Frequency                             cycle/nsecond                           1.07
Elapsed Cycles                                   cycle                          5,274
Memory [%]                                           %                           1.14
DRAM Throughput                                      %                           0.68
Duration                                       usecond                           4.86
L1/TEX Cache Throughput                              %                          69.64
L2 Cache Throughput                                  %                           1.14
SM Active Cycles                                 cycle                          52.39
Compute (SM) [%]                                     %                           0.1
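
(Both kernels were profiled the same way; assuming the source is saved as coalesce.cu — the file name is hypothetical — a build-and-profile run along these lines should print comparable sections. Note that ncu reports these throughput figures as a percentage of each unit's peak.)

nvcc -O2 coalesce.cu -o coalesce
ncu ./coalesce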

It seems that coalesced_access spends fewer cycles but looks worse in terms of cache utilization, since its L1/L2 cache throughput is lower. So the key questions are:

  • Is there a "cache performance" vs. "coalescing" trade-off we need to consider carefully? Can memory coalescing severely hurt cache efficiency and actually slow a program down?
  • How do the L1/L2 caches work in a GPU? Do they work similarly to those in a CPU?

Here is the code:

#include <cstdio>
#include <cstdlib>

__host__ static void CheckCudaErrorAux(const char *, unsigned, const char *, cudaError_t);
#define CUDA_CHECK_RETURN(value) CheckCudaErrorAux(__FILE__, __LINE__, #value, value)

// Prints the failing statement with its CUDA error string, then aborts.
__host__ static void CheckCudaErrorAux(const char *file, unsigned line, const char *statement, cudaError_t err) {
    if (err == cudaSuccess)
        return;
    printf("%s returned %s(%d) at %s:%u\n", statement, cudaGetErrorString(err), err, file, line);
    exit(1);
}

// Each thread increments 4 elements at stride 256: on every iteration,
// consecutive threads touch consecutive addresses (coalesced).
__global__ void coalesced_access(float* data) {
    int index = threadIdx.x;
    #pragma unroll
    for (int i = 0; i < 4; i++, index += 256)
        ++data[index];
}

// Each thread increments 4 consecutive elements: on every iteration,
// consecutive threads are 4 floats apart (non-coalesced).
__global__ void non_coalesced_access(float* data) {
    int index = 4 * threadIdx.x;
    #pragma unroll
    for (int i = 0; i < 4; i++)
        ++data[index + i];
}

int main() {
    float *g_data1, *g_data2;
    CUDA_CHECK_RETURN(cudaMalloc(&g_data1, sizeof(float) * 1024));
    CUDA_CHECK_RETURN(cudaMalloc(&g_data2, sizeof(float) * 1024));

    // One block of 256 threads; each thread handles 4 of the 1024 floats.
    coalesced_access<<<1, 256>>>(g_data1);
    CUDA_CHECK_RETURN(cudaDeviceSynchronize());

    non_coalesced_access<<<1, 256>>>(g_data2);
    CUDA_CHECK_RETURN(cudaDeviceSynchronize());

    CUDA_CHECK_RETURN(cudaFree(g_data1));
    CUDA_CHECK_RETURN(cudaFree(g_data2));
    return 0;
}