When I run this program I get the output 10, which seems impossible to me. I'm running this on an x86_64 Core i3 under Ubuntu.
If the output is 10, then c must be 1 and d must be 0.
Also, in thread t[0] we stored 1 into c, so c must be equal to b, which was set to 1 by thread t[1]. And a is already 1 at that point, since a=1 occurs before c=b. So when thread t[1] stores d, it should store 1, because a=1.
- Can the output 10 happen with memory_order_seq_cst? I tried inserting an atomic_thread_fence(seq_cst) in both threads, between the first line (the variable = 1 store) and the second line (the second store, whose value is printed later), but I still got 10. Uncommenting both fences doesn't change anything either. I tried both g++ and clang++; both give the same result.
#include <thread>
#include <cstdio>
#include <atomic>
using namespace std;

atomic<int> a, b, c, d;

void foo() {
    a.store(1, memory_order_seq_cst);
    // atomic_thread_fence(memory_order_seq_cst);
    c.store(b, memory_order_seq_cst);   // c = the value of b this thread sees now
}

void bar() {
    b.store(1, memory_order_seq_cst);
    // atomic_thread_fence(memory_order_seq_cst);
    d.store(a, memory_order_seq_cst);   // d = the value of a this thread sees now
}

int main() {
    thread t[2];
    t[0] = thread(foo);
    t[1] = thread(bar);
    t[0].join();
    t[1].join();
    printf("%d%d\n", c.load(memory_order_seq_cst), d.load(memory_order_seq_cst));
}
bash$ while [ true ]; do ./a.out | grep "10" ; done
10
10
10
10
10 (c=1, d=0) is easily explained:
bar happened to run first, and finished before foo read b. Quirks of inter-core communication to get threads started on different cores mean it's easily possible for this to happen even though thread(foo) ran first in the main thread. e.g. maybe an interrupt arrived at the core the OS chose for foo, delaying it from actually getting into that code.¹

Remember that seq_cst only guarantees that some total order exists for all seq_cst operations, one that is compatible with the sequenced-before order within each thread (and with any other happens-before relationships established by other factors). So the following order of atomic operations is possible, without even breaking out the a.load² in bar separately from the d.store of the resulting int temporary:

- b = 1 (bar)
- d = a, reading a == 0 (bar)
- a = 1 (foo)
- c = b, reading b == 1 (foo)

atomic_thread_fence(seq_cst) has no impact anywhere because all your operations are already seq_cst. A fence basically just stops reordering of this thread's operations; it doesn't wait for or sync with fences in other threads.

(Only a load that sees a value stored by another thread can create synchronization. But such a load doesn't wait for the other store; it has no way of knowing there is another store. If you want to keep loading until you see the value you expect, you have to write a spin-wait loop, like the sketch below.)
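For example, here's a minimal sketch of such a spin-wait, reusing the globals from your program (the function name bar_spinwait is hypothetical, not from your code). This version keeps re-loading a until it actually observes foo's store, so it always stores d = 1, at the cost of spinning forever if foo never runs:

#include <atomic>
extern std::atomic<int> a, b, d;   // the globals from the question

void bar_spinwait() {
    b.store(1, std::memory_order_seq_cst);
    // Spin until this thread actually sees foo's a = 1.
    while (a.load(std::memory_order_seq_cst) == 0) {
        // busy-wait; the repeated load is what eventually sees the other store
    }
    d.store(1, std::memory_order_seq_cst);   // guaranteed: we just saw a == 1
}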
Footnote 1: Since all your atomic vars are probably in the same cache line, even if execution did reach the top of foo and bar at the same time on two different cores, false sharing is likely to let both operations from one thread happen while the other core is still waiting to get exclusive ownership of the line. Although seq_cst stores are slow enough (on x86 at least) that hardware fairness mechanisms might relinquish exclusive ownership after committing the first store of 1. Anyway, there are lots of ways for both operations in one thread to happen before the other thread's, producing 10 or 01. It's even possible to get 11, if b=1 and then a=1 both happen before either load; using seq_cst does stop the hardware from doing the load early (before the store is globally visible), so 11 is very possible.
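If you wanted to test the false-sharing guess experimentally, one possible (untested) tweak is to force each atomic onto its own cache line; note this only changes timing, never the set of allowed outcomes:

#include <atomic>

// Hypothetical replacement for "atomic<int> a, b, c, d;" from the question.
// Assumes a 64-byte cache line, typical for x86; C++17's
// std::hardware_destructive_interference_size (from <new>) is the portable choice.
struct alignas(64) Padded {
    std::atomic<int> v{0};
};
Padded a, b, c, d;   // access as a.v, b.v, etc.; each now in its own line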
Footnote 2: The lvalue-to-rvalue evaluation of bare a uses the overloaded operator int() conversion, which is equivalent to a.load(seq_cst). The operations from foo could happen between that load and the d.store that gets the temporary value from it. d.store(a) is not an atomic copy; it's equivalent to int tmp = a; d.store(tmp);. (That isn't necessary to explain your observations, though.)
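Written out with the load made explicit (same behaviour as bar's second line, just de-sugared):

// What d.store(a, memory_order_seq_cst) does, step by step:
int tmp = a.load(std::memory_order_seq_cst);  // bare 'a' invokes operator int()
d.store(tmp, std::memory_order_seq_cst);
// Both of foo's stores can become globally visible between these two statements.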