CUDA bandwidthTest to get attainable peak


I want to know how well my CUDA kernels utilise memory bandwidth. I run them on a Tesla K40c with ECC on. Is the result reported by the bandwidthTest utility a good approximation of the attainable peak? If not, how would one go about writing a similar test to find the peak bandwidth?

I mean device memory bandwidth.


The source code for bandwidthTest is included with the CUDA SDK, so you can review it directly. The example measures transfer rates in three directions: host to device, device to host, and device to device (copying memory within the card).

These are real memory transfers, but the test benefits from several favourable conditions:

  1. Medium to large memory transfers. If you are doing tons of tiny transfers you will pay a high penalty in overhead and this will reduce your transfer rates.
  2. Pinned memory. The bandwidthTest uses pinned memory so that the transfers can be as fast as possible. You may or may not have this option.
  3. Sustained read/write of memory. As I recall, bandwidthTest issues a number of transfers that can be queued up back to back, so any startup delays or anomalies are smoothed out. Your own code may have to follow a transfer-work-work-transfer pattern, which can add delays between transfers. Improvements in memory transfers from CUDA 5 may help mitigate this.
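To see how much the second and third points matter on a given machine, here is a minimal sketch (not the bandwidthTest source itself) that times repeated host-to-device copies from pageable and from pinned memory using CUDA events. The 32 MiB transfer size, the repetition count, and the helper names `gbPerSec` and `timeH2D` are my own assumptions for illustration.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                  \
    do {                                                             \
        cudaError_t err_ = (call);                                   \
        if (err_ != cudaSuccess) {                                   \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,       \
                    cudaGetErrorString(err_));                       \
            exit(EXIT_FAILURE);                                      \
        }                                                            \
    } while (0)

// Convert bytes moved and elapsed milliseconds to GB/s (1 GB = 1e9 bytes).
static double gbPerSec(double bytes, double ms) {
    return bytes / (ms * 1.0e-3) / 1.0e9;
}

// Time `reps` back-to-back host-to-device copies and return the average rate.
static double timeH2D(void *dst, const void *src, size_t bytes, int reps) {
    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));
    // One untimed copy so that startup costs are not measured.
    CHECK(cudaMemcpy(dst, src, bytes, cudaMemcpyHostToDevice));
    CHECK(cudaEventRecord(start));
    for (int i = 0; i < reps; ++i)
        CHECK(cudaMemcpy(dst, src, bytes, cudaMemcpyHostToDevice));
    CHECK(cudaEventRecord(stop));
    CHECK(cudaEventSynchronize(stop));
    float ms = 0.0f;
    CHECK(cudaEventElapsedTime(&ms, start, stop));
    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    return gbPerSec((double)bytes * reps, ms);
}

int main() {
    const size_t bytes = (size_t)32 << 20;  // 32 MiB per transfer (assumed size)
    const int reps = 10;                    // assumed repetition count

    void *dev = nullptr, *pinned = nullptr;
    void *pageable = malloc(bytes);
    if (!pageable) return EXIT_FAILURE;
    CHECK(cudaMalloc(&dev, bytes));
    CHECK(cudaMallocHost(&pinned, bytes));  // page-locked host memory

    // Touch both host buffers so their pages are actually allocated.
    memset(pageable, 1, bytes);
    memset(pinned, 1, bytes);

    printf("pageable H2D: %.2f GB/s\n", timeH2D(dev, pageable, bytes, reps));
    printf("pinned   H2D: %.2f GB/s\n", timeH2D(dev, pinned, bytes, reps));

    CHECK(cudaFreeHost(pinned));
    CHECK(cudaFree(dev));
    free(pageable);
    return 0;
}
```

Shrinking `bytes` to a few kilobytes should show the per-transfer overhead from the first point as well.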

Doing real work in a kernel while memory transfers are in flight will likely reduce the transfer rate you observe. However, you can use the bandwidthTest code as a guide for improving your own transfers. Consider pinned memory, asynchronous transfers, or the newer methods that share memory between host and device and do not require explicit transfers of data. Also keep in mind that bandwidthTest only measures bulk copies and does not really measure things like on-chip shared memory.

The final performance will depend greatly on the kernel and the count and size of the memory transfers you are performing.
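For the device memory figure the question asks about, one way to write a similar test is to time a plain copy kernel and a device-to-device `cudaMemcpy`, then compare both against a theoretical figure derived from the device attributes. The following is a rough sketch under my own assumptions: the buffer size, launch configuration, and the helper names `gbPerSec`, `theoreticalPeakGBs`, and `copyKernel` are illustrative, and the factor of 2 in the theoretical peak assumes double-data-rate memory. A copy both reads and writes every byte, so the effective traffic is counted as twice the buffer size.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                  \
    do {                                                             \
        cudaError_t err_ = (call);                                   \
        if (err_ != cudaSuccess) {                                   \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,       \
                    cudaGetErrorString(err_));                       \
            exit(EXIT_FAILURE);                                      \
        }                                                            \
    } while (0)

// Convert bytes moved and elapsed milliseconds to GB/s (1 GB = 1e9 bytes).
static double gbPerSec(double bytes, double ms) {
    return bytes / (ms * 1.0e-3) / 1.0e9;
}

// Theoretical peak in GB/s from memory clock (kHz) and bus width (bits).
// The factor 2 assumes double-data-rate memory.
static double theoreticalPeakGBs(int clockKHz, int busBits) {
    return 2.0 * clockKHz * 1.0e3 * (busBits / 8.0) / 1.0e9;
}

// Grid-stride copy with 16-byte loads and stores.
__global__ void copyKernel(const float4 *__restrict__ in,
                           float4 *__restrict__ out, size_t n) {
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += stride)
        out[i] = in[i];
}

int main() {
    const size_t n = (size_t)16 << 20;        // 16 Mi float4 = 256 MiB per buffer
    const size_t bytes = n * sizeof(float4);
    const int reps = 20, threads = 256, blocks = 1024;  // assumed configuration

    float4 *in = nullptr, *out = nullptr;
    CHECK(cudaMalloc(&in, bytes));
    CHECK(cudaMalloc(&out, bytes));
    CHECK(cudaMemset(in, 0, bytes));
    CHECK(cudaMemset(out, 0, bytes));

    int clockKHz = 0, busBits = 0;
    CHECK(cudaDeviceGetAttribute(&clockKHz, cudaDevAttrMemoryClockRate, 0));
    CHECK(cudaDeviceGetAttribute(&busBits, cudaDevAttrGlobalMemoryBusWidth, 0));

    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));
    float ms = 0.0f;

    // Copy kernel: one untimed warm-up launch, then `reps` timed launches.
    copyKernel<<<blocks, threads>>>(in, out, n);
    CHECK(cudaGetLastError());
    CHECK(cudaEventRecord(start));
    for (int i = 0; i < reps; ++i)
        copyKernel<<<blocks, threads>>>(in, out, n);
    CHECK(cudaEventRecord(stop));
    CHECK(cudaEventSynchronize(stop));
    CHECK(cudaGetLastError());
    CHECK(cudaEventElapsedTime(&ms, start, stop));
    double kernelGBs = gbPerSec(2.0 * bytes * reps, ms);  // read + write

    // Device-to-device cudaMemcpy for comparison.
    CHECK(cudaMemcpy(out, in, bytes, cudaMemcpyDeviceToDevice));
    CHECK(cudaEventRecord(start));
    for (int i = 0; i < reps; ++i)
        CHECK(cudaMemcpy(out, in, bytes, cudaMemcpyDeviceToDevice));
    CHECK(cudaEventRecord(stop));
    CHECK(cudaEventSynchronize(stop));
    CHECK(cudaEventElapsedTime(&ms, start, stop));
    double memcpyGBs = gbPerSec(2.0 * bytes * reps, ms);

    printf("theoretical peak : %.1f GB/s\n", theoreticalPeakGBs(clockKHz, busBits));
    printf("copy kernel      : %.1f GB/s\n", kernelGBs);
    printf("cudaMemcpy D2D   : %.1f GB/s\n", memcpyGBs);

    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    CHECK(cudaFree(in));
    CHECK(cudaFree(out));
    return 0;
}
```

The measured copy rate, rather than the theoretical figure, is probably the more useful yardstick for your own kernels, since it already reflects effects such as ECC on the actual card.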