I am wondering why the R3 protocol performs so well when I use many different buffers, enough to exhaust the registration cache. Does it not need to pin and unpin the buffers provided for sending, or if it does, how does it hide this overhead? Is it always a good choice to stick with the R3 protocol?
The plot linked at the bottom shows my observation. I used 2 nodes sending and receiving in parallel. The x-axis denotes the number of buffers n (each 1MB) used for sending. The main loop looks like this:
// Take time (start timer)
for (i = 0; i < 20; i++) {
    for (a = 0; a < n; a++) IRecv(rec_buffer[a])
    for (a = 0; a < n; a++) ISend(send_buffer[a])
    waitForAllRecv()
    waitForAllSend()
}
// Take time again and plot the elapsed time
See plot at: http://i47.tinypic.com/2vkn6ty.jpg
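
To compare the runs for different n more easily, the elapsed time of the loop can be converted into a per-direction bandwidth. The following is only a minimal sketch: the function name `bandwidth_mib_per_s` is hypothetical (it is not part of any MPI library), and it assumes that "1MB" means 2^20 bytes and that each node sends n buffers per iteration, as in the loop above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: converts the measured loop time into MiB/s
 * sent by one node. Assumes every iteration sends n_buffers buffers
 * of buf_bytes bytes each. */
double bandwidth_mib_per_s(size_t n_buffers, size_t buf_bytes,
                           size_t iterations, double elapsed_s)
{
    assert(elapsed_s > 0.0);  /* guard against division by zero */

    /* Total payload sent by one node over the whole timed loop. */
    double total_bytes = (double)n_buffers * (double)buf_bytes
                       * (double)iterations;

    return total_bytes / elapsed_s / (1024.0 * 1024.0);
}
```

For example, with n = 4 buffers of 1048576 bytes, 20 iterations and a measured time of 0.5 s, this gives 160 MiB/s per direction.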