I ran the osu_mbw_mr test (from the OSU Micro-Benchmarks) on a Haswell node of Cori at NERSC and got some strange results that I could not explain.
The node has two sockets, each with a 16-core Intel Xeon E5-2698 v3 processor. The two processors are connected by QPI. Details of the node and the CPU can be found here and here.
If I am correct, the maximal memory bandwidth of the node is 68 GB/s x 2 CPUs = 136 GB/s, and the maximal QPI bandwidth is 9.6 GT/s x 2 links x 2 bytes/link = 38.4 GB/s bidirectional. I also measured the memory bandwidth with STREAM: the copy bandwidth is about 110 GB/s, which is close to the theoretical peak. That is great.
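For reference, the copy kernel I have in mind is something like the following. This is a minimal sketch in the spirit of STREAM's Copy kernel, not the official benchmark; the array size and the ~40 MB LLC figure are my assumptions.

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define N (1L << 27)              /* 128M doubles = 1 GiB per array, far beyond the ~40 MB LLC */

int main(void)
{
    double *a = malloc(N * sizeof *a);
    double *c = malloc(N * sizeof *c);

    #pragma omp parallel for      /* parallel first touch spreads pages over both sockets
                                     when the threads span them */
    for (long i = 0; i < N; i++) { a[i] = 1.0; c[i] = 0.0; }

    double t0 = omp_get_wtime();
    #pragma omp parallel for      /* the Copy kernel: one read + one write per element */
    for (long i = 0; i < N; i++) c[i] = a[i];
    double t = omp_get_wtime() - t0;

    /* STREAM-style accounting: 16 bytes of traffic per element (read a, write c) */
    printf("copy: %.1f GB/s (check: c[1] = %.1f)\n", 16.0 * N / t / 1e9, c[1]);
    free(a); free(c);
    return 0;
}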
I ran osu_mbw_mr with 32 MPI ranks on one node and placed the first 16 ranks on socket 0 and the next 16 ranks on socket 1.
In osu_mbw_mr, each rank allocates a send buffer (s_buf) and a receive buffer (r_buf) and then initializes them, so I assume the buffers get affinity to their NUMA domain through first touch. With 32 ranks, ranks 0-15 each send a fixed number of messages (the window size) back-to-back to their paired receiving ranks, i.e., ranks 16-31. I used Cray MPICH. I think that no matter how MPI is implemented, the net effect is copying data from s_buf to r_buf across the QPI links, roughly as in the sketch below.
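Here is a minimal sketch of the communication pattern as I read it (my own simplification, not the actual osu_mbw_mr source); WINDOW matches the reported window size of 64, and the sender/receiver split matches my rank placement:

#include <mpi.h>
#include <stdlib.h>
#include <string.h>

#define WINDOW 64

/* Ranks 0..pairs-1 are senders (socket 0), ranks pairs..2*pairs-1 are
 * receivers (socket 1). Each rank allocates and touches its own buffers,
 * so s_buf lives on socket 0 and r_buf on socket 1 via first touch. */
double run_pair(int rank, int pairs, int size, int iters)
{
    char *s_buf = malloc(size), *r_buf = malloc(size), ack;
    MPI_Request req[WINDOW];
    memset(s_buf, 'a', size);
    memset(r_buf, 'b', size);

    double t0 = MPI_Wtime();
    if (rank < pairs) {                        /* sender */
        int peer = rank + pairs;
        for (int i = 0; i < iters; i++) {
            for (int w = 0; w < WINDOW; w++)   /* a window of back-to-back sends */
                MPI_Isend(s_buf, size, MPI_CHAR, peer, 100, MPI_COMM_WORLD, &req[w]);
            MPI_Waitall(WINDOW, req, MPI_STATUSES_IGNORE);
            MPI_Recv(&ack, 1, MPI_CHAR, peer, 101, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
    } else {                                   /* receiver */
        int peer = rank - pairs;
        for (int i = 0; i < iters; i++) {
            for (int w = 0; w < WINDOW; w++)
                MPI_Irecv(r_buf, size, MPI_CHAR, peer, 100, MPI_COMM_WORLD, &req[w]);
            MPI_Waitall(WINDOW, req, MPI_STATUSES_IGNORE);
            MPI_Send(&ack, 1, MPI_CHAR, peer, 101, MPI_COMM_WORLD);
        }
    }
    double t = MPI_Wtime() - t0;
    free(s_buf); free(r_buf);
    /* aggregate MB/s ~= pairs * iters * WINDOW * size / max_sender_time / 1e6 */
    return t;
}

Note that the same s_buf is sent over and over within each window, which becomes relevant further down.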
The following is my test result. I don't understand why the bandwidths for message sizes of 8 KB, 16 KB, etc. are so large, and why they suddenly drop at 2 MB messages. The bandwidths are larger than the QPI bandwidth, and even larger than the DRAM bandwidth. By my reasoning, the bandwidth should be bounded by half of the QPI bandwidth (19.2 GB/s), since we send data unidirectionally from socket 0 to socket 1.
What is wrong? Thanks.
# OSU MPI Multiple Bandwidth / Message Rate Test v5.4.0
# [ pairs: 16 ] [ window size: 64 ]
# Size MB/s Messages/s
1 47.55 47550478.99
2 94.74 47371180.52
4 192.20 48048858.02
8 389.46 48683010.22
16 767.81 47988126.30
32 1527.34 47729482.30
64 2139.12 33423707.44
128 4010.11 31328973.47
256 7749.86 30272897.24
512 13507.57 26381964.28
1024 15918.48 15545388.20
2048 19846.84 9690838.02
4096 21718.65 5302404.21
8192 146607.66 17896442.75
16384 183905.06 11224674.34
32768 240191.47 7330061.88
65536 280938.91 4286787.57
131072 238150.74 1816945.97
262144 156911.43 598569.60
524288 156919.72 299300.61
1048576 143541.91 136892.24
2097152 28835.20 13749.69
4194304 26170.38 6239.50
As one comment reminded me, the OSU micro-benchmark reuses the same send buffer for every send, so the data was basically sitting in cache. This time I used the Intel MPI Benchmarks (IMB), which have an option to send off-cache data; my understanding of what that option changes is sketched below.
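This is a rough illustration of the off-cache idea as I understand it from the option's description, my own sketch rather than IMB's actual source; the pool size and the ~40 MB LLC figure are assumptions matching the "-off_cache 40,64" arguments used in the command below.

#include <mpi.h>
#include <stddef.h>

/* Instead of reusing one cache-hot s_buf, stride the send pointer through a
 * pool much larger than the last-level cache, so each message is read from
 * DRAM rather than from cache. */
void send_off_cache(char *pool, size_t pool_bytes, int msg_size,
                    int peer, int nmsgs)
{
    size_t offset = 0;
    for (int i = 0; i < nmsgs; i++) {
        MPI_Send(pool + offset, msg_size, MPI_CHAR, peer, 0, MPI_COMM_WORLD);
        offset += (size_t)msg_size;               /* next message comes from cold memory */
        if (offset + (size_t)msg_size > pool_bytes)
            offset = 0;                           /* wrap; the pool is sized well beyond the ~40 MB LLC */
    }
}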
I ran it on the same machine with

srun -n 32 -c 2 -m block:block --cpu_bind=cores,verbose ./IMB-MPI1 Uniband -off_cache 40,64

and got these numbers, which, as expected, fell below the memory bandwidth.
#---------------------------------------------------
# Benchmarking Uniband
# #processes = 32
#---------------------------------------------------
#bytes #repetitions Mbytes/sec Msg/sec
0 1000 0.00 56794458
1 1000 49.89 49892748
2 1000 99.96 49980418
4 1000 199.34 49834857
8 1000 399.30 49912461
16 1000 803.53 50220613
32 1000 1598.35 49948450
64 1000 2212.19 34565472
128 1000 4135.43 32308048
256 1000 7715.76 30139698
512 1000 12773.43 24948113
1024 1000 16440.25 16054932
2048 1000 19674.01 9606451
4096 1000 21574.97 5267326
8192 1000 92699.99 11315916
16384 1000 90449.54 5520602
32768 1000 27340.68 834371
65536 640 25626.04 391022
131072 320 25848.76 197210
262144 160 25939.50 98951
524288 80 25939.48 49476
1048576 40 25909.70 24709
2097152 20 25915.54 12357
4194304 10 25949.97 6187