CUDA Kepler: not enough ALUs

According to the Kepler whitepaper, the warp size for a Kepler-based GPU is 32, and each multiprocessor contains 4 warp schedulers, each of which selects two independent instructions from a chosen warp. This means that each clock cycle, 32*4*2 = 256 calculations are to be performed, but a multiprocessor only contains 192 ALUs. How are these calculations performed then?

Best answer

The actual whitepaper wording is as follows:

The SMX schedules threads in groups of 32 parallel threads called warps. Each SMX features four warp schedulers and eight instruction dispatch units, allowing four warps to be issued and executed concurrently. Kepler’s quad warp scheduler selects four warps, and two independent instructions per warp can be dispatched each cycle.

The interpretation is that in any given cycle, at most 4 warps can be scheduled. For each of those 4 warps, up to 2 independent instructions can be dispatched. Note that "can be dispatched" is not the same as "will be dispatched".
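
The arithmetic behind this reading can be written out as a small sketch. The constants come from the whitepaper passage quoted above; the identifiers (`kWarpSize`, `peakWarpInstructionsPerCycle`, and so on) are made up for illustration and are not part of any CUDA API.

```cpp
// Peak dispatch arithmetic for one Kepler SMX, using the figures quoted
// from the whitepaper. All identifiers here are illustrative only.
constexpr int kWarpSize = 32;             // threads per warp
constexpr int kWarpSchedulers = 4;        // warp schedulers per SMX
constexpr int kDispatchPerScheduler = 2;  // dispatch units per scheduler

// Upper bound on warp-level instructions dispatched in one cycle.
constexpr int peakWarpInstructionsPerCycle() {
    return kWarpSchedulers * kDispatchPerScheduler;
}

// The same upper bound expressed as per-thread operations.
constexpr int peakThreadOpsPerCycle() {
    return peakWarpInstructionsPerCycle() * kWarpSize;
}

static_assert(peakWarpInstructionsPerCycle() == 8,
              "4 schedulers x 2 dispatch units");
static_assert(peakThreadOpsPerCycle() == 256,
              "an upper bound on dispatch, not a guaranteed rate");
```

The 256 figure from the question is therefore a ceiling on what the schedulers may dispatch, not a count of operations that must complete every cycle.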

The 192 ALUs you are referring to are the single-precision floating-point arithmetic units (SP units for the purpose of this discussion). However, there are other functional units in the SM(X), such as double-precision floating-point arithmetic units (DP units), load/store units (LD/ST units), and others. Refer to the diagram on page 8 of the whitepaper linked above. If all of the instructions in a given cycle used the SP units, then 8 instructions could not be scheduled; at most 6 (32x6=192) could be. However, if the instruction mix contains independent instructions of different types (e.g. loads, stores, SP ops, etc.), then the limit of 192 SP units will not necessarily be the determining factor in how many instructions actually get scheduled in any given cycle.
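
A toy capacity model makes the SP-only limit and the effect of a mixed instruction stream concrete. The 192 SP units and the 8-instruction dispatch ceiling come from the discussion above. Everything else is an assumption for illustration: the names are hypothetical, each warp instruction is taken to occupy 32 lanes of one unit type for one cycle, and the non-SP units are assumed never to be the bottleneck, which real hardware does not guarantee.

```cpp
#include <algorithm>

constexpr int kWarpSize = 32;            // threads per warp
constexpr int kMaxDispatchPerCycle = 8;  // 4 schedulers x 2 dispatch units
constexpr int kSpUnits = 192;            // SP units per SMX

// Warp instructions that are ready to issue in one cycle, by unit type.
struct InstructionMix {
    int sp;     // single-precision arithmetic instructions
    int other;  // instructions using other units (loads, stores, DP ops, ...)
};

// How many SP warp instructions the SP units can absorb in one cycle.
constexpr int spWarpInstructionLimit() {
    return kSpUnits / kWarpSize;  // 192 / 32 = 6
}

// Toy model: number of warp instructions from `mix` that could issue.
// ASSUMPTION: only the SP units and the dispatch ceiling limit issue;
// non-SP units are treated as unlimited, which is a simplification.
int issuable(const InstructionMix& mix) {
    const int sp = std::min(mix.sp, spWarpInstructionLimit());
    return std::min(sp + mix.other, kMaxDispatchPerCycle);
}
```

Under these assumptions, `issuable({8, 0})` yields 6 because the SP units run out, while `issuable({6, 2})` yields 8 because two of the instructions go to other units.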

The bottom line is that 8 instructions (2 inst/scheduler x 4 schedulers) per cycle is the maximum possible instruction issue rate per SM(X). Real-world code does not necessarily achieve this. It's entirely possible that in a given cycle no instructions can be issued at all, due to stall or starvation conditions.