Let A
be a properly aligned array of 32-bit integers in shared memory.
If a single warp tries to fetch elements of A
at random, what is the expected number of bank conflicts?
In other words:
__shared__ int A[N]; //N is some big constant integer
...
int v = A[ random(0..N-1) ]; // <-- expected number of bank conflicts here?
Please assume Tesla or Fermi architecture. I don't want to dwell into 32-bit vs 64-bit bank configurations of Kepler. Also, for simplicity, let us assume that all the random numbers are different (thus no broadcast mechanism).
My gut feeling suggests a number somewhere between 4 and 6, but I would like to find some mathematical evaluation of it.
I believe the problem can be abstracted out from CUDA and presented as a math problem. I searched it as an extension to Birthday Paradox, but I found really scary formulas there and didn't find a final formula. I hope there is a simpler way...
I assume fermi 32-bank shared memory where each 4 consequent bytes are stored in consequent banks. Using following code:
I got following output:
We can come to conclude that 3 to 4 way conflict is the most likely with random access. You can tune the run with different
N
(number of elements in array),NBANK
(number of banks in shared memory),WARPSIZE
(warp size of machine), andNSAMPLE
(number of random shared memory accesses generated to evaluate the model).