Expected number of bank conflicts in shared memory at random access

Question

Expected number of bank conflicts in shared memory at random access

738 Views Asked by CygnusX1 At 18 August 2025 at 07:47

Let A be a properly aligned array of 32-bit integers in shared memory.

If a single warp tries to fetch elements of A at random, what is the expected number of bank conflicts?

In other words:

__shared__ int A[N];          //N is some big constant integer
...
int v = A[ random(0..N-1) ];  // <-- expected number of bank conflicts here?

Please assume Tesla or Fermi architecture. I don't want to dwell into 32-bit vs 64-bit bank configurations of Kepler. Also, for simplicity, let us assume that all the random numbers are different (thus no broadcast mechanism).

My gut feeling suggests a number somewhere between 4 and 6, but I would like to find some mathematical evaluation of it.

I believe the problem can be abstracted out from CUDA and presented as a math problem. I searched it as an extension to Birthday Paradox, but I found really scary formulas there and didn't find a final formula. I hope there is a simpler way...

Original Q&A

There are 3 best solutions below

harrism On 12 October 2012 at 00:28

I'll try a math answer, although I don't have it quite right yet.

You basically want to know, given random 32-bit word indexing within a warp into an aligned __shared__ array, "what is the expected value of the maximum number of addresses within a warp that map to a single bank?"

If I consider the problem similar to hashing, then it relates to the expected maximum number of items that will hash to a single location, and this document shows an upper bound on that number of O(log n / log log n) for hashing n items into n buckets. (The math is pretty hairy!).

For n = 32, that works out to about 2.788 (using natural log). That’s fine, but here I modified ahmad's program a bit to empirically calculate the expected maximum (also simplified the code and modified names and such for clarity and fixed some bugs).

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <algorithm>
#define NBANK 32
#define WARPSIZE 32
#define NSAMPLE 100000

int main(){  
    int i=0,j=0;

    int *bank=(int*)malloc(sizeof(int)*NBANK);
    int *randomNumber=(int*)malloc(sizeof(int)*WARPSIZE);
    int *maxCount=(int*)malloc(sizeof(int)*(NBANK+1));
    memset(maxCount, 0, sizeof(int)*(NBANK+1));

    for (int i=0; i<NSAMPLE; ++i) {
        // generate a sample warp shared memory access
        for(j=0; j<WARPSIZE; j++){
            randomNumber[j]=rand()%NBANK;
        }

        // check the bank conflict
        memset(bank, 0, sizeof(int)*NBANK);
        int max_bank_conflict=0;
        for(j=0; j<WARPSIZE; j++){
            bank[randomNumber[j]]++;       
        }

        for(j=0; j<WARPSIZE; j++) 
            max_bank_conflict = std::max<int>(max_bank_conflict, bank[j]);

        // store statistic
        maxCount[max_bank_conflict]++;
    }

    // report statistic
    printf("Max conflict degree %% (%d random samples)\n", NSAMPLE);
    float expected = 0;
    for(i=1; i<NBANK+1; i++) {
        float prob = maxCount[i]/(float)NSAMPLE;
        printf("%02d -> %6.4f\n", i, prob);
        expected += prob * i;
    }
    printf("Expected maximum bank conflict degree = %6.4f\n", expected);
    return 0;
}

Using the percentages found in the program as probabilities, the expected maximum value is the sum of products sum(i * probability(i)), for i from 1 to 32. I compute the expected value to be 3.529 (matches ahmad's data). It’s not super far off, but the 2.788 is supposed to be an upper bound. Since the upper bound is given in big-O notation, I guess there’s a constant factor left out. But that's currently as far as I've gotten.

Open questions: Is that constant factor enough to explain it? Is it possible to compute the constant factor for n = 32? It would be interesting to reconcile these, and/or to find a closed form solution for the expected maximum bank conflict degree with 32 banks and 32 parallel threads.

This is a very useful topic, since it can help in modeling and predicting performance when shared memory addressing is effectively random.

A. Webb On 18 October 2012 at 14:56

In math, this is thought of as a "balls in bins" problem - 32 balls are randomly dropped into 32 bins. You can enumerate the possible patterns and calculate their probabilities to determine the distribution. A naive approach will not work though as the number of patterns is huge: (63!)/(32!)(31!) is "almost" a quintillion.

It is possible to tackle though if you build up the solution recursively and use conditional probabilities.

Look for a paper called "The exact distribution of the maximum, minimum and the range of Multinomial/Dirichlet and Multivariate Hypergeometric frequencies" by Charles J. Corrado.

In the following, we start at leftmost bucket and calculate the probabilities for each number of balls that could have fallen into it. Then we move one to the right and determine the conditional probabilities of each number of balls that could be in that bucket given the number of balls and buckets already used.

Apologies for the VBA code, but VBA was all I had available when motivated to answer :).

Function nCr#(ByVal n#, ByVal r#)
    Static combin#()
    Static size#
    Dim i#, j#

    If n = r Then
        nCr = 1
        Exit Function
    End If

    If n > size Then
        ReDim combin(0 To n, 0 To n)
        combin(0, 0) = 1
        For i = 1 To n
            combin(i, 0) = 1
            For j = 1 To i
                combin(i, j) = combin(i - 1, j - 1) + combin(i - 1, j)
            Next
        Next
        size = n
    End If

    nCr = combin(n, r)
End Function

Function p_binom#(n#, r#, p#)
    p_binom = nCr(n, r) * p ^ r * (1 - p) ^ (n - r)
End Function

Function p_next_bucket_balls#(balls#, balls_used#, total_balls#, _
  bucket#, total_buckets#, bucket_capacity#)

    If balls > bucket_capacity Then
        p_next_bucket_balls = 0
    Else
        p_next_bucket_balls = p_binom(total_balls - balls_used, balls, 1 / (total_buckets - bucket + 1))
    End If
End Function

Function p_capped_buckets#(n#, cap#)
    Dim p_prior, p_update
    Dim bucket#, balls#, prior_balls#

    ReDim p_prior(0 To n)
    ReDim p_update(0 To n)

    p_prior(0) = 1

    For bucket = 1 To n
        For balls = 0 To n
            p_update(balls) = 0
            For prior_balls = 0 To balls
                p_update(balls) = p_update(balls) + p_prior(prior_balls) * _
                   p_next_bucket_balls(balls - prior_balls, prior_balls, n, bucket, n, cap)
            Next
        Next
        p_prior = p_update
    Next

    p_capped_buckets = p_update(n)
End Function

Function expected_max_buckets#(n#)
    Dim cap#

    For cap = 0 To n
        expected_max_buckets = expected_max_buckets + (1 - p_capped_buckets(n, cap))
    Next
End Function

Sub test32()
    Dim p_cumm#(0 To 32)
    Dim cap#

    For cap# = 0 To 32
        p_cumm(cap) = p_capped_buckets(32, cap)
    Next

    For cap = 1 To 32
        Debug.Print "    ", cap, Format(p_cumm(cap) - p_cumm(cap - 1), "0.000000")
    Next
End Sub

For 32 balls and buckets, I get an expected maximum number of balls in the buckets of about 3.532941.

Output to compare to ahmad's:

           1            0.000000
           2            0.029273
           3            0.516311
           4            0.361736
           5            0.079307
           6            0.011800
           7            0.001417
           8            0.000143
           9            0.000012
           10           0.000001
           11           0.000000
           12           0.000000
           13           0.000000
           14           0.000000
           15           0.000000
           16           0.000000
           17           0.000000
           18           0.000000
           19           0.000000
           20           0.000000
           21           0.000000
           22           0.000000
           23           0.000000
           24           0.000000
           25           0.000000
           26           0.000000
           27           0.000000
           28           0.000000
           29           0.000000
           30           0.000000
           31           0.000000
           32           0.000000

**lashgar** · Accepted Answer

I assume fermi 32-bank shared memory where each 4 consequent bytes are stored in consequent banks. Using following code:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#define NBANK 32
#define N 7823
#define WARPSIZE 32


#define NSAMPLE 10000

int main(){
    srand ( time(NULL) );

    int i=0,j=0;
    int *conflictCheck=NULL;
    int *randomNumber=NULL;
    int *statisticCheck=NULL;

    conflictCheck=(int*)malloc(sizeof(int)*NBANK);
    randomNumber=(int*)malloc(sizeof(int)*WARPSIZE);
    statisticCheck=(int*)malloc(sizeof(int)*(NBANK+1));
    while(i<NSAMPLE){
        // generate a sample warp shared memory access
        for(j=0; j<WARPSIZE; j++){
            randomNumber[j]=rand()%NBANK;
        }
        // check the bank conflict
        memset(conflictCheck, 0, sizeof(int)*NBANK);
        int max_bank_conflict=0;
        for(j=0; j<WARPSIZE; j++){
            conflictCheck[randomNumber[j]]++;
            max_bank_conflict = max_bank_conflict<conflictCheck[randomNumber[j]]? conflictCheck[randomNumber[j]]: max_bank_conflict;
        }
        // store statistic
        statisticCheck[max_bank_conflict]++;

        // next iter
        i++;
    }
    // report statistic
    printf("Over %d random shared memory access, there found following precentages of bank conflicts\n");
    for(i=0; i<NBANK+1; i++){
        //
        printf("%d -> %6.4f\n",i,statisticCheck[i]/(float)NSAMPLE);
    }
    return 0;
}

I got following output:

Over 0 random shared memory access, there found following precentages of bank conflicts
0 -> 0.0000
1 -> 0.0000
2 -> 0.0281
3 -> 0.5205
4 -> 0.3605
5 -> 0.0780
6 -> 0.0106
7 -> 0.0022
8 -> 0.0001
9 -> 0.0000
10 -> 0.0000
11 -> 0.0000
12 -> 0.0000
13 -> 0.0000
14 -> 0.0000
15 -> 0.0000
16 -> 0.0000
17 -> 0.0000
18 -> 0.0000
19 -> 0.0000
20 -> 0.0000
21 -> 0.0000
22 -> 0.0000
23 -> 0.0000
24 -> 0.0000
25 -> 0.0000
26 -> 0.0000
27 -> 0.0000
28 -> 0.0000
29 -> 0.0000
30 -> 0.0000
31 -> 0.0000
32 -> 0.0000

We can come to conclude that 3 to 4 way conflict is the most likely with random access. You can tune the run with different N (number of elements in array), NBANK (number of banks in shared memory), WARPSIZE (warp size of machine), and NSAMPLE (number of random shared memory accesses generated to evaluate the model).

Expected number of bank conflicts in shared memory at random access

There are 3 best solutions below

Related Questions in CUDA

Related Questions in GPU-SHARED-MEMORY

Related Questions in BANK-CONFLICT

Trending Questions

Popular # Hahtags

Popular Questions