I sometimes see the following shared memory declaration in CUDA kernels, and I am not sure what it means:
extern __shared__ T shmem[][SZ];
with SZ being a compile-time constant. The kernel is launched as:
kernel<<<grid, block, shared_memory_size>>>()
My questions are:
- Why could it be useful to have a mixture of dynamic and static shared memory (apart from simplifying the programmer's address computations)?
- What is the total size of the shared memory buffer shmem? Is it shared_memory_size * SZ?
- Related to Q2: Suppose I can compute shared_memory_size at compile time. How would I have to rewrite the shared memory declaration for it to be static?
__shared__ T shmem[SZ*shared_memory_size]?
I believe address computation is the only reason. Of course this also factors into static analysis etc., but that's really all it is: making your work easier.
Of course, if you make SZ a runtime variable even though it could be a compile-time constant, the computational cost also goes up, because the GPU then has to do full integer multiplication for its address generation instead of optimizing it into cheaper operations such as bit shifts.

No, the kernel launch parameter is in bytes. So the size along the outer dimension would be
shared_memory_size / (SZ * sizeof(T))
To put it the other way around, if you want shmem[N][SZ] at runtime, then
shared_memory_size = N * SZ * sizeof(T)

You mean like this? The proper syntax would be
__shared__ T shmem[N][SZ];
with N a compile-time constant. Note how it is no longer declared extern.

Footnote: In terms of nomenclature, I would not call this mixing static and dynamic shared memory. I believe most people would interpret mixing as using __shared__ T and extern __shared__ T in the same kernel.