I have a kernel that uses a lot of registers and spills heavily into local memory:
4688 bytes stack frame, 4688 bytes spill stores, 11068 bytes spill loads
ptxas info : Used 255 registers, 348 bytes cmem[0], 56 bytes cmem[2]
Since the spillage seems quite high, I believe it goes past the L1 and perhaps even the L2 cache. Given that local memory is private to each thread, how are accesses to local memory coalesced by the compiler? Is this memory read in 128-byte transactions like global memory? With this amount of spillage I get low memory bandwidth utilisation (about 50%), whereas similar kernels without the spillage reach up to 80% of the peak memory bandwidth.
EDIT
I've extracted some more metrics with the nvprof tool. If I understand the technique mentioned here correctly, then I have a significant amount of memory traffic due to register spilling:

4 * (L1 local load hits + misses) / (sum of read sector queries across the 4 L2 subpartitions)
= (4 * (45936 + 4278911)) / (5425005 + 5430832 + 5442361 + 5429185)
≈ 79.6%

Could somebody verify whether I am right here?
Invocations    Event Name                               Min          Max          Avg
Device "Tesla K40c (0)"
    Kernel: mulgg(double const *, double*, int, int, int)
         30    l2_subp0_total_read_sector_queries       5419871      5429821      5425005
         30    l2_subp1_total_read_sector_queries       5426715      5435344      5430832
         30    l2_subp2_total_read_sector_queries       5438339      5446012      5442361
         30    l2_subp3_total_read_sector_queries       5425556      5434009      5429185
         30    l2_subp0_total_write_sector_queries      2748989      2749159      2749093
         30    l2_subp1_total_write_sector_queries      2748424      2748562      2748487
         30    l2_subp2_total_write_sector_queries      2750131      2750287      2750205
         30    l2_subp3_total_write_sector_queries      2749187      2749389      2749278
         30    l1_local_load_hit                          45718        46097        45936
         30    l1_local_load_miss                       4278748      4279071      4278911
         30    l1_local_store_hit                             0            1            0
         30    l1_local_store_miss                      1830664      1830664      1830664
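For completeness, here is the same calculation as a small stand-alone snippet (the constants are simply the Avg values from the table above; the factor of 4 is my assumption, taken from the presentation, that one 128-byte L1 line corresponds to four 32-byte L2 sectors):

#include <cstdio>

int main()
{
    // Avg event values reported by nvprof for the mulgg kernel (see table above).
    const double l1_local_load_hit  = 45936.0;
    const double l1_local_load_miss = 4278911.0;
    const double l2_read_sectors    = 5425005.0 + 5430832.0 + 5442361.0 + 5429185.0;

    // One 128-byte L1 line maps to four 32-byte L2 sectors, hence the factor of 4
    // (my reading of the register-spilling presentation).
    const double lmem_fraction = 4.0 * (l1_local_load_hit + l1_local_load_miss) / l2_read_sectors;

    std::printf("L2 read traffic due to local memory: %.1f%%\n", 100.0 * lmem_fraction);
    return 0;
}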
EDIT
I've realised that the transactions are 128 bytes, not 128 bits as I was thinking.
According to Local Memory and Register Spilling, the impact of register spills on performance entails more than just the coalescing decided at compile time; more importantly, reads and writes to L2 cache are already quite expensive and you want to avoid them.
The presentation suggests that, using a profiler, you can count at run time the number of L2 queries caused by local memory (LMEM) accesses, check whether they account for a major share of all L2 queries, and if so shift the shared memory / L1 split in favour of L1 with a single host call, for example cudaDeviceSetCacheConfig( cudaFuncCachePreferL1 ); (a sketch of that call is below).
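A minimal sketch of that call (the kernel here is just a placeholder, not the asker's mulgg kernel):

#include <cuda_runtime.h>

// Placeholder kernel standing in for a register-heavy, spilling kernel.
__global__ void spilling_kernel(const double *in, double *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0;
}

int main()
{
    // Prefer a larger L1 cache over shared memory for subsequent launches
    // on this device, so that more of the spilled registers stay in L1.
    cudaDeviceSetCacheConfig(cudaFuncCachePreferL1);

    // Or restrict the preference to a single kernel:
    cudaFuncSetCacheConfig(spilling_kernel, cudaFuncCachePreferL1);

    // ... allocate buffers and launch spilling_kernel<<<grid, block>>>(...) as usual.
    return 0;
}

Whether the larger L1 actually helps depends on how much of the spill traffic currently misses L1, which is exactly what the LMEM ratio computed above is meant to tell you.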
Hope this helps.