I'm trying to develop a small program with CUDA, but since it was SLOW I made some tests and googled a bit. I found out that while single variables are by default stored within the local thread memory, arrays usually aren't. I suppose that's why it takes so much time to execute. Now I wonder: since local thread memory should be at least of 16KB and since my arrays are just like 52 chars long, is there any way (syntax please :) ) to store them in local memory?
Shouldn't it be something like:
__global__ my_kernel(int a)
{
__local__ unsigned char p[50];
}
All you need is this:
The compiler will automatically spill this to thread local memory if it needs to. But be aware that local memory is stored in SDRAM off the GPU, and it is as slow as global memory. So if you are hoping that this will yield a performance improvement, it might be that you are in for a disappointment.....