I just learned (from Why only one of the warps is executed by a SM in cuda?) that Kepler GPUs can actually execute instructions from several (apparently 4) warps at once.
Can a shared memory bank also serve four requests at once? If not, that would mean that bank conflicts can occur between threads of different warps that happen to be executed concurrently, even though there are no bank conflicts within any of the individual warps, right? Is there any information on this?
Compute capability 3.x devices (Kepler) have 4 warp schedulers per SM. On each cycle, each warp scheduler selects a warp and issues 1-2 instructions from it. However, the SM has only one load/store unit (LSU) servicing L1 and shared memory requests, so at most 1 of the up to 8 issued instructions can be dispatched to the LSU per cycle. Bank conflicts between warps therefore cannot occur: each shared memory request the LSU services comes from a single warp.