Let A be a properly aligned array of 32-bit integers in shared memory.
If a single warp tries to fetch elements of A at random, what is the expected number of bank conflicts?
In other words:
__shared__ int A[N]; //N is some big constant integer
...
int v = A[ random(0..N-1) ]; // <-- expected number of bank conflicts here?
Please assume Tesla or Fermi architecture. I don't want to dwell into 32-bit vs 64-bit bank configurations of Kepler. Also, for simplicity, let us assume that all the random numbers are different (thus no broadcast mechanism).
My gut feeling suggests a number somewhere between 4 and 6, but I would like to find some mathematical evaluation of it.
I believe the problem can be abstracted out from CUDA and presented as a math problem. I searched it as an extension to Birthday Paradox, but I found really scary formulas there and didn't find a final formula. I hope there is a simpler way...
I assume fermi 32-bank shared memory where each 4 consequent bytes are stored in consequent banks. Using following code:
I got following output:
We can come to conclude that 3 to 4 way conflict is the most likely with random access. You can tune the run with different
N(number of elements in array),NBANK(number of banks in shared memory),WARPSIZE(warp size of machine), andNSAMPLE(number of random shared memory accesses generated to evaluate the model).