How does a multiprocessor with write buffers maintain sequential consistency? As far as I know, in a uniprocessor, if the buffer is FIFO and reads of a location whose write to main memory is still pending are served from the buffer, consistency is maintained. But how does this work in a multiprocessor? I think that if a processor puts a store in its buffer, another processor can't read it, and that this breaks sequential consistency.
How does it work in a multithreaded environment with a write buffer per thread? Does that also break sequential consistency?
Sequential consistency with store buffers in a multiprocessor?
Asked by joanlopez

I'm assuming x86 here.
The store in the store buffer is not in itself the problem. If, for example, a CPU only did stores and the stores all left the store buffer in order, the behavior would be exactly the same as that of a processor without a store buffer. SC does not require the real-time order to be preserved.
And as you already indicated, a processor sees its own stores in the store buffer in order. SC is violated when a store is followed by a load from a different address.
So imagine
A=1
r1=B
Without a store buffer, the store to A would first be written to cache/memory, and only then would B be read from cache/memory.
With a store buffer, however, the load of B can overtake the store to A: the load reads from cache/memory before the store to A has been written to cache/memory.
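To make the overtaking concrete, here is a small deterministic sketch of it. This is a toy model and not how real hardware is built: the names `Cpu`, `Memory`, and `run_store_buffering` are made up for illustration, and I've added a second CPU that runs the mirrored sequence (B=1; r2=A) so the SC violation becomes visible.

```cpp
#include <cassert>
#include <deque>
#include <map>
#include <string>
#include <utility>

// Toy model (an assumption for illustration, not real hardware):
// shared memory plus one FIFO store buffer per CPU.
using Memory = std::map<std::string, int>;

struct Cpu {
    std::deque<std::pair<std::string, int>> store_buffer;  // oldest entry at the front

    // A store only enters the local buffer; other CPUs cannot see it yet.
    void store(const std::string& addr, int value) {
        store_buffer.emplace_back(addr, value);
    }

    // A load checks the local buffer first (newest entry wins), then memory.
    // Locations that were never written read as 0.
    int load(const Memory& mem, const std::string& addr) const {
        for (auto it = store_buffer.rbegin(); it != store_buffer.rend(); ++it)
            if (it->first == addr) return it->second;
        auto found = mem.find(addr);
        return found == mem.end() ? 0 : found->second;
    }

    // Draining commits the buffered stores to memory in FIFO order.
    void drain(Memory& mem) {
        while (!store_buffer.empty()) {
            mem[store_buffer.front().first] = store_buffer.front().second;
            store_buffer.pop_front();
        }
    }
};

// CPU0 runs A=1; r1=B and CPU1 runs B=1; r2=A.
// Both loads execute before either store buffer has drained.
std::pair<int, int> run_store_buffering() {
    Memory mem;
    Cpu cpu0, cpu1;
    cpu0.store("A", 1);
    cpu1.store("B", 1);
    int r1 = cpu0.load(mem, "B");  // cannot see cpu1's buffered store
    int r2 = cpu1.load(mem, "A");  // cannot see cpu0's buffered store
    cpu0.drain(mem);
    cpu1.drain(mem);
    assert(mem["A"] == 1 && mem["B"] == 1);  // both stores do reach memory eventually
    return {r1, r2};
}
```

In this model both loads return 0. No sequentially consistent interleaving of the four operations allows that, because whichever store comes first in the total order would have to be visible to the other CPU's load.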
The typical example of SC breaking with store buffers is Dekker's algorithm.
lock_a=1
while(lock_b==1){
    if(turn == b){
        lock_a=0
        while(lock_b==1);
        lock_a=1
    }
}
At the top you can see a store of lock_a=1 followed by a load of lock_b. Because of the store buffer, these two can be reordered, and as a consequence both threads could enter the critical section.
One way to solve this is to add a [StoreLoad] fence between the store and the load, which prevents later loads from executing until the store buffer has been drained. This restores SC.
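As a rough sketch of the fence in portable code, here is the store-then-load pattern on two real threads using C++11 atomics. The function name `count_both_zero` and the variables are my own; `std::atomic_thread_fence(std::memory_order_seq_cst)` plays the role of the [StoreLoad] fence, and on x86 compilers typically lower it to `MFENCE` or a locked instruction.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Runs the store-then-load pattern on two threads `iterations` times and
// returns how often both loads observed 0, the outcome that SC forbids.
int count_both_zero(int iterations) {
    int both_zero = 0;
    for (int i = 0; i < iterations; ++i) {
        std::atomic<int> a{0}, b{0};
        int r1 = -1, r2 = -1;  // -1 means "load has not run yet"

        std::thread t1([&] {
            a.store(1, std::memory_order_relaxed);
            // StoreLoad fence: the load below may not pass the store above.
            std::atomic_thread_fence(std::memory_order_seq_cst);
            r1 = b.load(std::memory_order_relaxed);
        });
        std::thread t2([&] {
            b.store(1, std::memory_order_relaxed);
            std::atomic_thread_fence(std::memory_order_seq_cst);
            r2 = a.load(std::memory_order_relaxed);
        });
        t1.join();
        t2.join();

        assert(r1 != -1 && r2 != -1);  // both loads ran
        if (r1 == 0 && r2 == 0) ++both_zero;
    }
    return both_zero;
}
```

With both fences in place the C++ memory model rules out the case where both loads see 0, so `count_both_zero` should always return 0. If you remove the fences, that outcome becomes allowed, although how often you actually observe it depends on the hardware and on timing.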
Note 1: store buffers are per CPU; not per thread.
Note 2: store (and load) buffers are before the cache.
You referred to:
Typically, a CPU only sees the random access; the fact that memory buses are accessed sequentially is hidden from the CPU itself, so from the CPU's point of view there is no FIFO involved here.
In modern SMP machines, there are so-called snoop control units that watch memory transfers and invalidate the cached copy of RAM when necessary. So there is dedicated hardware to keep data in sync. This doesn't mean the data is always in sync: there is always more than one way to end up with stale data (for example, by having already loaded a memory value into a register before the other CPU core changed it), but that is what you were getting at.
Also, multiple threads are basically a software concept. So if you need to synchronize software FIFOs, you will need to use proper locking mechanisms.
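For the software side, a minimal sketch of what "proper locking" could look like for a FIFO shared between threads is shown here. The class `LockedFifo` and the helper `produce_from_two_threads` are hypothetical names for this example, not from any library; the only real APIs used are `std::mutex`, `std::lock_guard`, `std::queue`, and `std::thread`.

```cpp
#include <cassert>
#include <mutex>
#include <queue>
#include <thread>

// Hypothetical minimal FIFO: every access takes the same mutex, so pushes
// and pops from different threads never interleave inside the queue.
class LockedFifo {
public:
    void push(int value) {
        std::lock_guard<std::mutex> guard(mutex_);
        items_.push(value);
    }

    // Returns false if the FIFO is empty; otherwise writes the oldest item to `out`.
    bool pop(int& out) {
        std::lock_guard<std::mutex> guard(mutex_);
        if (items_.empty()) return false;
        out = items_.front();
        items_.pop();
        return true;
    }

private:
    std::mutex mutex_;
    std::queue<int> items_;
};

// Two producers push concurrently; returns how many items arrived in total.
int produce_from_two_threads(int per_thread) {
    assert(per_thread >= 0);
    LockedFifo fifo;
    auto producer = [&] {
        for (int i = 0; i < per_thread; ++i) fifo.push(i);
    };
    std::thread t1(producer), t2(producer);
    t1.join();
    t2.join();

    int count = 0, value = 0;
    while (fifo.pop(value)) ++count;
    return count;
}
```

Locking and unlocking the mutex also provides the memory ordering between threads, so no explicit fences are needed in code like this.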