CUDA Local memory register spilling overhead


I have a kernel which uses a lot of registers and spills them into local memory heavily.

    4688 bytes stack frame, 4688 bytes spill stores, 11068 bytes spill loads
ptxas info    : Used 255 registers, 348 bytes cmem[0], 56 bytes cmem[2]

Since the spillage seems quite high, I believe it overflows the L1 and possibly even the L2 cache. Since local memory is private to each thread, how are accesses to local memory coalesced by the compiler? Is this memory read in 128-byte transactions like global memory? With this amount of spillage I am getting low memory bandwidth utilisation (50%). Similar kernels without the spillage obtain up to 80% of the peak memory bandwidth.

EDIT I've extracted some more metrics with the nvprof tool. If I understand the technique mentioned here correctly, then I have a significant amount of memory traffic due to register spilling (4 * (L1 local load hits + misses) / sum of reads across the 4 L2 sub-partitions = (4 * (45936 + 4278911)) / (5425005 + 5430832 + 5442361 + 5429185) = 79.6%). Could somebody verify whether I am right here?

Invocations                                Event Name         Min         Max         Avg
Device "Tesla K40c (0)"
Kernel: mulgg(double const *, double*, int, int, int)
     30        l2_subp0_total_read_sector_queries     5419871     5429821     5425005
     30        l2_subp1_total_read_sector_queries     5426715     5435344     5430832
     30        l2_subp2_total_read_sector_queries     5438339     5446012     5442361
     30        l2_subp3_total_read_sector_queries     5425556     5434009     5429185
     30       l2_subp0_total_write_sector_queries     2748989     2749159     2749093
     30       l2_subp1_total_write_sector_queries     2748424     2748562     2748487
     30       l2_subp2_total_write_sector_queries     2750131     2750287     2750205
     30       l2_subp3_total_write_sector_queries     2749187     2749389     2749278
     30                         l1_local_load_hit       45718       46097       45936
     30                        l1_local_load_miss     4278748     4279071     4278911
     30                        l1_local_store_hit           0           1           0
     30                       l1_local_store_miss     1830664     1830664     1830664

EDIT

I've realised that the transactions are 128 bytes, not 128 bits, as I was thinking.


According to Local Memory and Register Spilling, the impact of register spills on performance entails more than just the coalescing decided at compile time; more importantly, reads and writes to the L2 cache are already quite expensive, and you want to avoid them.

The presentation suggests that, using a profiler, you can count at run time the number of L2 queries caused by local memory (LMEM) accesses, check whether they make up a major share of all L2 queries, and then shift the shared-memory/L1 split in favour of L1 with a single host call, for example cudaDeviceSetCacheConfig( cudaFuncCachePreferL1 );
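A minimal host-side sketch of that call is below. The kernel signature is copied from the nvprof output in the question; the parameter names and the empty body are placeholders:

```cuda
#include <cuda_runtime.h>

// Stub standing in for the spilling kernel from the question; body omitted.
__global__ void mulgg(double const *in, double *out, int m, int n, int k) { }

int main() {
    // Device-wide: prefer the larger L1 split (48 KB L1 / 16 KB shared on
    // Kepler), so more spilled registers are serviced by L1 instead of L2/DRAM.
    cudaDeviceSetCacheConfig(cudaFuncCachePreferL1);

    // Or scope the preference to the one spilling kernel only:
    cudaFuncSetCacheConfig(mulgg, cudaFuncCachePreferL1);

    return 0;
}
```

Note that the preference is a hint: the driver may ignore it if a kernel's shared memory usage requires the larger shared-memory split.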

Hope this helps.