What stride should I use for matrices in CUDA for the fastest possible speed?

152 Views Asked by meisel At 07 December 2023 at 17:01

I'm working with matrices that range in size from 2,000x2,000 up to 5,000x5,000, doing operations such as multiplication and QR decomposition. I'm curious whether, for example, I should align the stride to 64 for all matrices for best performance. Also, should I avoid strides that are a multiple of some page size because of cache associativity, or does that not apply to GPU memory?


Homer512 On 07 December 2023 at 20:08 BEST ANSWER

I imagine most people trust cudaMallocPitch or cudaMalloc3D to provide the proper alignment, as this is their stated purpose. While the runtime documentation does not state it explicitly, they align to cudaDeviceProp::textureAlignment (512 bytes on current hardware). NPP's allocator functions seem to use different alignment strategies (or at least did so in the past); see "How does CUDA's nppiMalloc... function guarantee alignment?" for some discussion of that.

The lack of a pitched allocation function for the stream-ordered memory allocator suggests that alignment may not be as relevant today. Or it might be an oversight in the API, who knows?

What we do know from different parts of the programming guide is that:

access to global memory happens in naturally aligned 32-, 64-, or 128-byte memory transactions
all allocators align the start of an allocation to at least 256 bytes
memcpy_async requires 16-byte alignment for best performance

The best practices guide simply recommends 32-byte-aligned memory transactions.
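To make the transaction point concrete, here is a minimal host-side sketch that counts how many naturally aligned segments one warp-wide read overlaps. The helper name segments_touched is hypothetical, and the model (32 threads each reading one consecutive 4-byte float) is an assumption for illustration, not a description of what any particular GPU does internally.

```cpp
#include <cstddef>

// Hypothetical helper: number of naturally aligned segments of `segment_bytes`
// that the byte range [start_byte, start_byte + length_bytes) overlaps.
// Assumes length_bytes > 0 and segment_bytes > 0.
constexpr std::size_t segments_touched(std::size_t start_byte,
                                       std::size_t length_bytes,
                                       std::size_t segment_bytes) {
    return (start_byte + length_bytes - 1) / segment_bytes
         - start_byte / segment_bytes + 1;
}

// Assumed access pattern: a warp of 32 threads reading consecutive floats.
constexpr std::size_t kWarpBytes = 32 * sizeof(float);

// A row that starts on a 32-byte boundary needs the minimum number of segments.
static_assert(segments_touched(0, kWarpBytes, 32) == kWarpBytes / 32,
              "aligned read touches the minimum number of 32-byte segments");

// Shifting the row start by one float spills into one extra segment.
static_assert(segments_touched(sizeof(float), kWarpBytes, 32) == kWarpBytes / 32 + 1,
              "misaligned read touches one extra 32-byte segment");
```

Under this model, any stride that is a multiple of the segment size keeps every row start as well aligned as the first row, because the allocators already align the base pointer to at least 256 bytes.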

I'm not aware of a list of cache parameters for each generation. Turing's L2 is 4 MiB, 16-way set associative with 64-byte lines, and the memory pages are 2 MiB. If I did the math right, each way covers 4 MiB / 16 = 256 KiB, so addresses that are 256 KiB apart map to the same set, which means an alignment of 256 KiB would be pathological. With these numbers I'd imagine you could start seeing effects with an alignment of 16 KiB or more, but I'm not aware of any official guidance on the subject.
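The arithmetic above can be checked with a small host-side sketch. It uses the Turing L2 figures quoted in the answer (not an official specification) and assumes a plain modulo set index; real hardware may hash addresses, in which case these numbers would not apply. All names are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Cache parameters as quoted in the answer for Turing's L2 (assumed values).
constexpr std::size_t kCacheBytes = 4u * 1024 * 1024;  // 4 MiB
constexpr std::size_t kWays = 16;
constexpr std::size_t kLineBytes = 64;

// Bytes covered by one way; under modulo indexing, addresses this far apart
// land in the same set.
constexpr std::size_t way_bytes() { return kCacheBytes / kWays; }
constexpr std::size_t num_sets() { return way_bytes() / kLineBytes; }

// Set index of a byte address, ASSUMING simple modulo indexing (no hashing).
constexpr std::size_t set_index(std::uint64_t addr) {
    return static_cast<std::size_t>((addr / kLineBytes) % num_sets());
}

// Counts how many distinct sets are hit by the first byte of `rows` rows
// that are `stride_bytes` apart, e.g. when walking down a matrix column.
std::size_t distinct_sets(std::size_t stride_bytes, std::size_t rows) {
    std::vector<bool> seen(num_sets(), false);
    std::size_t count = 0;
    for (std::size_t r = 0; r < rows; ++r) {
        const std::size_t s =
            set_index(static_cast<std::uint64_t>(r) * stride_bytes);
        if (!seen[s]) {
            seen[s] = true;
            ++count;
        }
    }
    return count;
}

static_assert(way_bytes() == 256u * 1024, "one way covers 256 KiB");
static_assert(num_sets() == 4096, "4096 sets of 16 lines each");
```

Under these assumptions, a 256 KiB stride sends every row start to a single set, while a 16 KiB stride spreads them over only 16 of the 4096 sets, which is the kind of effect the answer speculates about.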

Personally, I stick with the pitched allocators. When I don't use them, I use the texture alignment, except for smaller line sizes, where I just round up to the next power of 2 so as not to waste so much memory (unless I plan to use texture binding).
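A minimal sketch of that rule of thumb as host-side arithmetic follows. The function names are hypothetical, "smaller line sizes" is interpreted here as rows shorter than the texture alignment, and 512 is just the figure quoted above; real code would read cudaDeviceProp::textureAlignment from the device instead of hard-coding it.

```cpp
#include <cstddef>

// Rounds n up to the next multiple of align (align > 0).
constexpr std::size_t round_up(std::size_t n, std::size_t align) {
    return (n + align - 1) / align * align;
}

// Smallest power of two that is >= n (returns 1 for n <= 1).
constexpr std::size_t next_pow2(std::size_t n) {
    std::size_t p = 1;
    while (p < n) p <<= 1;
    return p;
}

// Hypothetical policy: rows shorter than the texture alignment are padded to
// the next power of two; longer rows are padded to a multiple of the alignment.
// The 512-byte default is the value quoted in the answer, not a queried one.
constexpr std::size_t choose_pitch_bytes(std::size_t row_bytes,
                                         std::size_t texture_alignment = 512) {
    return row_bytes < texture_alignment
               ? next_pow2(row_bytes)
               : round_up(row_bytes, texture_alignment);
}

// A 2,000-column float matrix: 8,000 bytes per row pads to 8,192.
static_assert(choose_pitch_bytes(2000 * sizeof(float)) == 8192, "float row");
// A short 100-byte row pads to 128 rather than 512.
static_assert(choose_pitch_bytes(100) == 128, "short row");
```

The returned value is a pitch in bytes; dividing by the element size gives the leading dimension to pass to routines that take a stride in elements.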
