A full/general memory barrier is one where all the LOAD and STORE operations specified before the barrier will appear to happen before all the LOAD and STORE operations specified after the barrier with respect to the other components of the system.
According to cppreference, memory_order_seq_cst is equivalent to memory_order_acq_rel plus a single total modification order on all operations so tagged. But as far as I know, neither an acquire fence nor a release fence in C++11 enforces #StoreLoad (load-after-store) ordering. A release fence requires that no earlier read/write be reordered past any later write; an acquire fence requires that no later read/write be reordered before any earlier read. Please correct me if I am wrong. ;)
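To make the acquire/release-fence semantics above concrete, here is a sketch (my own example, not from any reference) of the classic message-passing pattern: fence-based synchronization from [atomics.fences], which gives the same happens-before edge as a store(release)/load(acquire) pair:

```cpp
#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> data{0};   // payload, written before the flag
std::atomic<int> ready{0};  // flag

void producer() {
    data.store(42, std::memory_order_relaxed);
    // Release fence: the store to `data` cannot be reordered past
    // any later store, in particular the store to `ready`.
    std::atomic_thread_fence(std::memory_order_release);
    ready.store(1, std::memory_order_relaxed);
}

void consumer() {
    while (ready.load(std::memory_order_relaxed) == 0) {}  // spin on the flag
    // Acquire fence: the load of `data` below cannot be reordered
    // before the load of `ready` above.
    std::atomic_thread_fence(std::memory_order_acquire);
    assert(data.load(std::memory_order_relaxed) == 42);  // guaranteed
}
```

Note that both fences here only constrain store-store and load-load ordering; neither says anything about a store followed by a load, which is exactly the #StoreLoad case in question.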
To give an example:
atomic<int> x;
atomic<int> y;
y.store(1, memory_order_relaxed); //(1)
atomic_thread_fence(memory_order_seq_cst); //(2)
x.load(memory_order_relaxed); //(3)
Is an optimizing compiler allowed to reorder instruction (3) to before (1), so that it effectively looks like:
x.load(memory_order_relaxed); //(3)
y.store(1, memory_order_relaxed); //(1)
atomic_thread_fence(memory_order_seq_cst); //(2)
If this is a valid transformation, then it proves that atomic_thread_fence(memory_order_seq_cst) does not necessarily encompass the full semantics of a full barrier.
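For context, a sketch (my own, not from any reference) of the Dekker-style pattern where the #StoreLoad guarantee of a seq_cst fence actually becomes observable: each thread stores its own flag, fences, then loads the other thread's flag. With seq_cst fences the outcome r1 == 0 && r2 == 0 is forbidden by the single total order on seq_cst operations; with only acquire/release fences it would be allowed:

```cpp
#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> x{0}, y{0};
int r1 = -1, r2 = -1;  // results observed by each thread

void thread_a() {
    y.store(1, std::memory_order_relaxed);                // (1)
    std::atomic_thread_fence(std::memory_order_seq_cst);  // (2)
    r1 = x.load(std::memory_order_relaxed);               // (3)
}

void thread_b() {
    // Mirror image of thread_a, on the other pair of variables.
    x.store(1, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);
    r2 = y.load(std::memory_order_relaxed);
}
```

One of the two fences must come first in the total order of seq_cst operations, so at least one thread is guaranteed to observe the other's store; this is exactly the ordering that acquire and release fences alone cannot provide.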
C++ fences are not direct equivalents of CPU fence instructions, though they may well be implemented as such. C++ fences are part of the C++ memory model, which is all about visibility and ordering constraints.
Given that processors typically reorder reads and writes, and cache values locally before they are made available to other cores or processors, the order in which effects become visible to other processors is not usually predictable.
When thinking about these semantics, it is therefore important to consider what it is that you are trying to prevent.
Let's assume that the code is mapped to machine instructions as written, (1) then (2) then (3), and these instructions guarantee that (1) is globally visible before (3) is executed.
The whole purpose of the snippet is to communicate with another thread. You cannot guarantee that the other thread is running on any processor at the time this snippet executes on ours. Therefore the whole snippet may run uninterrupted, and (3) will still read whatever value was in x when (1) was executed. In that case, the result is indistinguishable from an execution order of (3) (1) (2). So: yes, this is an allowed optimization, because you cannot tell the difference.