Strategy for minimizing bank conflicts for 64-bit thread-separate shared memory


Suppose I have a full warp of threads in a CUDA block, and each of these threads is intended to work with N elements of type T, residing in shared memory (so we have warp_size * N = 32*N elements in total). The different threads never access each other's data. (Well, they do, but at a later stage which we don't care about here.) This access is to happen in a loop such as the following:

for(int i = 0; i < big_number; i++) {
    auto thread_idx = determine_thread_index_into_its_own_array();
    T value = calculate_value();
    write_to_own_shmem(thread_idx, value);
}

Now, the different threads may each have a different index, or identical ones - I'm not making any assumptions either way. But I do want to minimize shared memory bank conflicts.

If sizeof(T) == 4, then this is easy-peasy: just place all of thread i's data at shared memory indices i, 32+i, 64+i, 96+i, etc. This puts all of thread i's data in the same bank, which is also distinct from the other lanes' banks. Great.
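
In code, that placement looks something like the following (just a sketch; the kernel and array names are mine, and I'm assuming one warp per block):

    // Sketch of the 4-byte case: element idx of lane i goes to shmem[idx * 32 + i],
    // so all of lane i's N elements live in bank i, distinct from every other lane's bank.
    constexpr int warp_size = 32;

    template <typename T, int N>
    __global__ void per_lane_data_kernel() {
        static_assert(sizeof(T) == 4, "this layout targets 4-byte elements");
        __shared__ T shmem[N * warp_size];
        int lane = threadIdx.x % warp_size;
        for (int idx = 0; idx < N; idx++) {
            shmem[idx * warp_size + lane] = T{};  // stand-in for calculate_value()
        }
        // ... later stages use shmem ...
    }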

But now - what if sizeof(T) == 8? How should I place my data and access it so as to minimize bank conflicts (without any knowledge about the indices)?

Note: Assume T is plain-old-data. You may even assume it's a number if that makes your answer simpler.

There are 2 answers below.

Accepted answer

tl;dr: Use the same kind of interleaving as for 32-bit values.

On post-Kepler microarchitectures (up to Volta), the best we can theoretically get is 2 shared memory transactions for a full warp reading one 64-bit value per lane (since a single transaction provides at most 32 bits to each lane).

This is achievable in practice by the placement pattern analogous to the one OP described for 32-bit data. That is, for T* arr, have lane i read the idx'th element as arr[idx * 32 + i]. This will compile so that two transactions occur:

  1. The lower 16 lanes obtain their data from the first 32*4 = 128 bytes of the accessed region (utilizing all 32 banks)
  2. The upper 16 lanes obtain their data from the next 128 bytes (again utilizing all 32 banks)

So the GPU is smarter/more flexible than a naive per-lane 4-byte fetch would suggest. In particular, it can do better than the simplistic "break up T into halves" idea proposed in the earlier answer below.
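
In code, this might look like the following (a sketch only; double stands in for an 8-byte T, N and the function name are illustrative, and I assume a single warp per block):

    constexpr int N = 4;  // illustrative per-thread element count

    __global__ void interleaved_64bit(double* out) {
        __shared__ double arr[N * 32];            // double as a stand-in for an 8-byte T
        int lane = threadIdx.x % 32;
        for (int idx = 0; idx < N; idx++)
            arr[idx * 32 + lane] = lane + idx;    // writes use the same interleaved pattern
        double acc = 0;
        for (int idx = 0; idx < N; idx++)
            // A warp-wide read of element idx spans 32 * 8 = 256 contiguous bytes; the
            // hardware serves it as two 128-byte transactions, each covering all 32 banks.
            acc += arr[idx * 32 + lane];
        out[threadIdx.x] = acc;
    }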

(This answer is based on @RobertCrovella's comments.)

Earlier answer

On Kepler GPUs, this had a simple solution: Just change the bank size! Kepler supported setting the shared memory bank size to 8 instead of 4, dynamically. But alas, that feature is not available in later microarchitectures (e.g. Maxwell, Pascal).
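
On Kepler, that setting was made from the host side, roughly like so (a sketch; error checking omitted):

    #include <cuda_runtime.h>

    void prefer_8_byte_banks() {
        // Kepler-era knob: make each shared memory bank 8 bytes wide, so that
        // consecutive 64-bit words fall into consecutive banks. On later
        // architectures the bank size is fixed at 4 bytes and this has no effect.
        cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte);
    }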

Now, here's an ugly and sub-optimal answer for more recent CUDA microarchitectures: Reduce the 64-bit case to the 32-bit case.

  • Instead of each thread storing N values of type T, it stores 2N 32-bit values, each consecutive pair holding the low and the high 32 bits of one T.
  • To access a 64-bit value, 2 half-T accesses are made, and the T is re-composed with something like

    uint64_t joined =
        (static_cast<uint64_t>(upper_half) << 32)
        | static_cast<uint64_t>(lower_half);
    auto& my_t_value = reinterpret_cast<T&>(joined);

    (where upper_half and lower_half are the two uint32_t values read from shared memory) - and the same in reverse when writing, as sketched below.
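
A sketch of the write side (the names shmem32, lane, idx and write_as_halves are mine, not from the question):

    #include <cstdint>
    #include <cstring>

    // Split a 64-bit T into two 32-bit halves and store them at consecutive
    // per-thread element indices (2*idx and 2*idx + 1), using the same
    // interleaved layout as in the 32-bit case.
    template <typename T>
    __device__ void write_as_halves(uint32_t* shmem32, int lane, int idx, T value) {
        static_assert(sizeof(T) == 8, "expects a 64-bit T");
        uint64_t bits;
        memcpy(&bits, &value, sizeof(bits));                // safe type-pun
        uint32_t lower_half = static_cast<uint32_t>(bits);
        uint32_t upper_half = static_cast<uint32_t>(bits >> 32);
        shmem32[(2 * idx)     * 32 + lane] = lower_half;
        shmem32[(2 * idx + 1) * 32 + lane] = upper_half;
    }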

As the comments suggest, it is better to make full 64-bit accesses, as described in the accepted answer above.