In the case that a load overlaps two earlier stores (and the load is not fully contained in the oldest store), can modern Intel or AMD x86 implementations forward from both stores to satisfy the load?
For example, consider the following sequence:
mov [rdx + 0], eax
mov [rdx + 2], eax
mov ax, [rdx + 1]
The final 2-byte load takes its second byte from the immediate preceding store, but its first byte from the store before that. Can this load be store-forwarded, or does it need to wait until both prior stores commit to L1?
Note that by store-forwarding here I'm including any mechanism that can satisfy the reads from stores still in the store buffer, rather than waiting them to commit to L1, even if it is a slower path than the best case "forwards from a single store" case.
No.
At least, not on Haswell, Broadwell or Skylake processors. On other Intel processors, the restrictions are either similar (Sandy Bridge, Ivy Bridge) or even tighter (Nehalem, Westmere, Pentium Pro/II/II/4). On AMD, similar limitations apply.
From Agner Fog's excellent optimization manuals:
Haswell/Broadwell
Emphasis added
Skylake
Emphasis added
In General:
A common point across microarchitectures that Agner Fog's document points out is that store forwarding is more likely to happen if the write was aligned and the reads are halves or quarters of the written value.
A Test
A test with the following tight loop:
Shows that the
ld_blocks.store_forwardPMU counter does indeed increment. This event is documented as follows:This indicates that the store-forwarding does indeed fail when a read only partially overlaps the most recent earlier store (even if it is fully contained when even earlier stores are considered).