I'm trying to develop a small program with CUDA, but since it was SLOW I ran some tests and googled a bit. I found out that while single variables are stored in the thread's local memory by default, arrays usually aren't. I suppose that's why it takes so much time to execute. Now I wonder: since the thread's local memory should be at least 16 KB and my arrays are only about 52 chars long, is there any way (syntax please :) ) to store them in local memory?
Shouldn't it be something like:
__global__ void my_kernel(int a)
{
__local__ unsigned char p[50];
}
You are mixing up local and register memory space.
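Single variables and fixed-size arrays are automatically placed in registers on the chip, at almost no cost for reads and writes, as long as the compiler can resolve every array index at compile time (for example, in fully unrolled loops).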
If a thread needs more registers than are available to it, or if an array is indexed with a value only known at run time, the data is stored in local memory instead.
Local memory resides in the global (device) memory space, so it has the same high latency and low bandwidth for reads and writes, although newer GPUs cache accesses to it.
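As for syntax: as far as I know CUDA has no `__local__` qualifier, and the closest explicit on-chip option is `__shared__`. Here is a minimal sketch of the difference, assuming a single block of 64 threads. The kernel names, the `run` helper, and the index pattern are made up for illustration. The first kernel indexes a per-thread array with a runtime value, which usually forces it into local memory. The second gives each thread its own row of shared memory instead.

```cuda
#include <cassert>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int N = 52;        // array length taken from the question
constexpr int THREADS = 64;  // hypothetical block size for this sketch

// Per-thread array indexed with a runtime value. Registers cannot be
// addressed dynamically, so the compiler usually places p in local memory.
__global__ void lookup_local(const int *idx, int *out)
{
    unsigned char p[N];
    for (int i = 0; i < N; ++i)
        p[i] = (unsigned char)(2 * i);
    int t = threadIdx.x;
    out[t] = p[idx[t] % N];
}

// Same lookup, but each thread works on its own row of on-chip shared memory.
__global__ void lookup_shared(const int *idx, int *out)
{
    __shared__ unsigned char p[THREADS][N];
    int t = threadIdx.x;
    for (int i = 0; i < N; ++i)
        p[t][i] = (unsigned char)(2 * i);
    out[t] = p[t][idx[t] % N];
}

// Hypothetical host helper: runs one kernel on a single block and returns
// the per-thread results. Error checking is omitted to keep the sketch short.
std::vector<int> run(bool use_shared)
{
    std::vector<int> h_idx(THREADS), h_out(THREADS);
    for (int t = 0; t < THREADS; ++t)
        h_idx[t] = 3 * t;  // arbitrary non-negative indices

    int *d_idx = nullptr, *d_out = nullptr;
    cudaMalloc(&d_idx, THREADS * sizeof(int));
    cudaMalloc(&d_out, THREADS * sizeof(int));
    cudaMemcpy(d_idx, h_idx.data(), THREADS * sizeof(int), cudaMemcpyHostToDevice);

    if (use_shared)
        lookup_shared<<<1, THREADS>>>(d_idx, d_out);
    else
        lookup_local<<<1, THREADS>>>(d_idx, d_out);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(h_out.data(), d_out, THREADS * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_idx);
    cudaFree(d_out);
    return h_out;
}

int main()
{
    std::vector<int> a = run(false), b = run(true);
    assert(a == b);  // both placements must compute the same values
    printf("both kernels agree on %d results\n", THREADS);
    return 0;
}
```

To see where the compiler actually put the array, you can build with `nvcc -Xptxas -v`, which reports register usage plus stack frame and spill bytes per kernel. Whether the shared version is faster depends on your access pattern and GPU, so it is worth measuring both.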