The description of the RESOURCE_STALLS.RS hardware performance event for Intel Broadwell is the following:
This event counts stall cycles caused by absence of eligible entries in the reservation station (RS). This may result from RS overflow, or from RS deallocation because of the RS array Write Port allocation scheme (each RS entry has two write ports instead of four. As a result, empty entries could not be used, although RS is not really full). This counts cycles that the pipeline backend blocked uop delivery from the front end.
This basically says that there are two situations where the RS stall event occurs:
- When all of the eligible entries of the RS are occupied and the allocator is not stalled.
- When "RS deallocation" occurs because there are only two write ports, and the allocator is not stalled.
What does "eligible" mean in the first situation? Does this mean that not all entries can be occupied by all kinds of uops? Because my understanding is that in modern microarchitectures any entry can be used by any kind of uop. Also what is RS array Write Port allocation scheme and how does it cause RS stalls even when not all entries are occupied? Does this mean that there were four write ports in Haswell but now there are only two in Broadwell? Do either of these two situations apply to Skylake or Haswell even though the manual does not explicitly say so?
Yes, it is possible for RESOURCE_STALLS to indicate a full RS before the RS is completely full. As the RS becomes full, allocation of new uops into the RS becomes less ideal until at some point it may stall out entirely, even though some entries remain.
Furthermore, not all RS entries are available to all instructions. For example, on Haswell, I observe that only 30-32 of the 60 RS entries are available to loads: these entries may be special in that they support uop replay, for example. On Skylake, the situation is different: no single type of instruction can use the entire RS. Rather, the "97 entry" RS is actually made up of a 64-entry RS for ALU ops and a 33-entry RS for load ops. So the entire 97 entries of the RS(es) will rarely be full, unless by some coincidence both fill up at exactly the same moment.
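As a rough illustration of the Skylake case, here is a toy model of a split RS. It is only a sketch: the class name SplitRS and its methods are made up for this example, the 64/33 capacities are the figures quoted above, and real allocation is certainly more involved than a pair of counters. It shows how a load can fail to allocate while most of the combined 97 entries are still empty.

```python
from dataclasses import dataclass


@dataclass
class SplitRS:
    """Toy model (hypothetical) of an RS split into two independent pools."""
    alu_capacity: int = 64   # figure quoted above for Skylake ALU ops
    load_capacity: int = 33  # figure quoted above for Skylake load ops
    alu_used: int = 0
    load_used: int = 0

    def try_allocate(self, kind: str) -> bool:
        """Return True if a uop of this kind gets an entry, False on a stall."""
        if kind == "alu":
            if self.alu_used >= self.alu_capacity:
                return False
            self.alu_used += 1
            return True
        if kind == "load":
            if self.load_used >= self.load_capacity:
                return False
            self.load_used += 1
            return True
        raise ValueError(f"unknown uop kind: {kind!r}")

    @property
    def occupied(self) -> int:
        return self.alu_used + self.load_used

    @property
    def capacity(self) -> int:
        return self.alu_capacity + self.load_capacity


rs = SplitRS()
# Try to allocate 40 loads with nothing ever leaving the RS.
accepted = sum(rs.try_allocate("load") for _ in range(40))
stalled = not rs.try_allocate("load")
print(f"loads accepted: {accepted}, load stalled: {stalled}, "
      f"occupied: {rs.occupied}/{rs.capacity}")
# prints: loads accepted: 33, load stalled: True, occupied: 33/97
```

In this model an ALU uop would still allocate at that point, which is why a single "RS full" counter cannot describe both pools.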
The RESOURCE_STALLS.RS event (umask 0x4) only triggers when the "ALU" part of the RS is full (or full enough that an op can't allocate). For the load RS (which overlaps with the ALU RS in Haswell but not Skylake), the corresponding event has umask 0x40. You can use it with perf as cpu/event=0xa2,umask=0x40,name=resource_stalls_memrs_full/. Although the events are not documented for Skylake, they seem to work fine (although events with umasks 0x10 through 0x80 behave very differently from what is documented for Sandy Bridge). Future Intel chips are likely to have even finer-grained reservation stations.