Why is my inter-socket MPI send bandwidth larger than the memory bandwidth?


I did an osu_mbw_mr test (from the OSU microbenchmarks) on a Haswell node of Cori at NERSC and got some strange results I could not explain.

The node has two sockets, each with a 16-core Intel Xeon E5-2698 v3 processor. The two processors are connected by QPI. Details of the node and the CPU can be found here and here.

If I am correct, the maximum memory bandwidth of the node is 68 GB/s x 2 CPUs = 136 GB/s, and the maximum QPI bandwidth is 9.6 GT/s x 2 links x 2 bytes/link = 38.4 GB/s bidirectional. I also measured the memory bandwidth with STREAM. The copy bandwidth is about 110 GB/s, which is close to the theoretical value. That is great.
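For reference, the STREAM copy test is essentially the following loop; this is a minimal sketch rather than the actual STREAM source, and the array size and timing harness are my assumptions:

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N (1L << 27)  /* ~134M doubles per array (~1 GB), far larger than LLC */

int main(void) {
    double *a = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));

    /* First touch: initialize in parallel so pages land in the NUMA
       domain of the thread that will later access them. */
    #pragma omp parallel for
    for (long i = 0; i < N; i++) { a[i] = 1.0; c[i] = 0.0; }

    double t = omp_get_wtime();
    #pragma omp parallel for
    for (long i = 0; i < N; i++)
        c[i] = a[i];
    t = omp_get_wtime() - t;

    /* STREAM counts copy traffic as 2 * N * 8 bytes (one read + one write). */
    printf("Copy bandwidth: %.1f GB/s\n", 2.0 * N * sizeof(double) / t / 1e9);
    free(a); free(c);
    return 0;
}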

I ran osu_mbw_mr with 32 MPI ranks on one node and placed the first 16 ranks on socket 0 and the next 16 ranks on socket 1.

In osu_mbw_mr, each rank allocates a send buffer (s_buf) and a receive buffer (r_buf) and then initializes them (so I assume the buffers have affinity to their NUMA domain through first touch). With 32 ranks, ranks 0~15 each send a fixed number of messages (the window size) back-to-back to their paired receiving ranks, i.e., ranks 16~31. I used Cray MPICH. No matter how MPI is implemented, the net effect should be copying data from s_buf (across the QPI links) to r_buf.
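The sender side of that pattern looks roughly like the sketch below. This is my simplification, not the actual OSU source; WINDOW_SIZE, the tags, and the ack are illustrative. The point to notice is that every send in the window reads from the same s_buf:

#include <mpi.h>

#define WINDOW_SIZE 64

/* Sketch of the multi-bandwidth sender: post a window of non-blocking
   sends, all from the SAME s_buf, then wait for a short ack from the
   paired receiver before starting the next window. */
void sender_loop(char *s_buf, int size, int pair, int iters) {
    MPI_Request req[WINDOW_SIZE];
    char ack;
    for (int it = 0; it < iters; it++) {
        for (int w = 0; w < WINDOW_SIZE; w++)
            MPI_Isend(s_buf, size, MPI_CHAR, pair, 100,
                      MPI_COMM_WORLD, &req[w]);
        MPI_Waitall(WINDOW_SIZE, req, MPI_STATUSES_IGNORE);
        /* Receiver acknowledges the whole window with a 1-byte message. */
        MPI_Recv(&ack, 1, MPI_CHAR, pair, 101,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
}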

My test results follow. I don't understand why the bandwidths for message sizes of 8K, 16K, etc. are so large, and why they suddenly drop at 2MB messages. The bandwidths are larger than the QPI bandwidth, and even larger than the DRAM bandwidth. In theory, the bandwidth should be bounded by half of the bidirectional QPI bandwidth (19.2 GB/s), since data is sent unidirectionally from socket 0 to socket 1.

What is wrong? Thanks.

# OSU MPI Multiple Bandwidth / Message Rate Test v5.4.0
# [ pairs: 16 ] [ window size: 64 ]
# Size                  MB/s        Messages/s
1                      47.55       47550478.99
2                      94.74       47371180.52
4                     192.20       48048858.02
8                     389.46       48683010.22
16                    767.81       47988126.30
32                   1527.34       47729482.30
64                   2139.12       33423707.44
128                  4010.11       31328973.47
256                  7749.86       30272897.24
512                 13507.57       26381964.28
1024                15918.48       15545388.20
2048                19846.84        9690838.02
4096                21718.65        5302404.21
8192               146607.66       17896442.75
16384              183905.06       11224674.34
32768              240191.47        7330061.88
65536              280938.91        4286787.57
131072             238150.74        1816945.97
262144             156911.43         598569.60
524288             156919.72         299300.61
1048576            143541.91         136892.24
2097152             28835.20          13749.69
4194304             26170.38           6239.50

As a comment reminded me, the OSU microbenchmarks reuse the same send buffer across sends, so the data was basically served from cache. This time I used the Intel MPI Benchmarks (IMB), which have an option to send off-cache data (a sketch of the idea follows the table below). I ran them on the same machine with

srun -n 32 -c 2 -m block:block --cpu_bind=cores,verbose ./IMB-MPI1 Uniband -off_cache 40,64

and got these numbers, which, as expected, fall below the memory bandwidth.

#---------------------------------------------------
# Benchmarking Uniband
# #processes = 32
#---------------------------------------------------
       #bytes #repetitions   Mbytes/sec      Msg/sec
            0         1000         0.00     56794458
            1         1000        49.89     49892748
            2         1000        99.96     49980418
            4         1000       199.34     49834857
            8         1000       399.30     49912461
           16         1000       803.53     50220613
           32         1000      1598.35     49948450
           64         1000      2212.19     34565472
          128         1000      4135.43     32308048
          256         1000      7715.76     30139698
          512         1000     12773.43     24948113
         1024         1000     16440.25     16054932
         2048         1000     19674.01      9606451
         4096         1000     21574.97      5267326
         8192         1000     92699.99     11315916
        16384         1000     90449.54      5520602
        32768         1000     27340.68       834371
        65536          640     25626.04       391022
       131072          320     25848.76       197210
       262144          160     25939.50        98951
       524288           80     25939.48        49476
      1048576           40     25909.70        24709
      2097152           20     25915.54        12357
      4194304           10     25949.97         6187
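For completeness, the idea behind -off_cache 40,64 (40 MB of last-level cache, 64-byte cache lines) is to rotate the send offset through a pool larger than the cache, so each send reads cold data. Below is a minimal sketch of that rotation under my reading of the option; the pool sizing and advance logic are my assumptions, not IMB's actual source:

#include <mpi.h>
#include <stdlib.h>

/* Sketch of the off-cache idea: allocate a pool larger than the LLC and
   advance the send pointer by the message size each iteration, so
   successive sends touch data already evicted from cache. */
void off_cache_sends(int size, int pair, int iters) {
    size_t cache_bytes = 40UL * 1024 * 1024;  /* from -off_cache 40,... */
    size_t pool_bytes  = 2 * cache_bytes;     /* > LLC, so reads miss cache */
    char  *pool = malloc(pool_bytes);
    size_t off  = 0;

    for (int it = 0; it < iters; it++) {
        MPI_Send(pool + off, size, MPI_CHAR, pair, 0, MPI_COMM_WORLD);
        off += size;                          /* rotate through the pool */
        if (off + size > pool_bytes) off = 0;
    }
    free(pool);
}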