I have read about std::memory_order in C++ and understood partially. But I still had some doubts around it.
- Explanation on
std::memory_order_acquiresays that, no reads or writes in the current thread can be reordered before this load. Does that mean compiler and cpu is not allowed to move any instruction present below theacquirestatement, above it?
auto y = x.load(std::memory_order_acquire);
z = a; // is it leagal to execute loading of shared `b` above acquire? (I feel no)
b = 2; // is it leagal to execute storing of shared `a` above acquire? (I feel yes)
I can reason out why it is illegal for executing loads before acquire. But why it is illegal for stores?
- Is it illegal to skip a useless load or store from
atomicobjects? Since they are notvolatile, and as I know only volatile has this requirement.
auto y = x.load(std::memory_order_acquire); // `y` is never used
return;
This optimization is not happening even with relaxed memory order.
- Is compiler allowed to move instructions present above
acquirestatement, below it?
z = a; // is it leagal to execute loading of shared `b` below acquire? (I feel yes)
b = 2; // is it leagal to execute storing of shared `a` below acquire? (I feel yes)
auto y = x.load(std::memory_order_acquire);
- Can two loads or stores be reordered without crossing
acquireboundary?
auto y = x.load(std::memory_order_acquire);
a = p; // can this move below the below line?
b = q; // shared `a` and `b`
Similar and corresponding 4 doubts with release semantics also.
Related to 2nd and 3rd question, why no compiler is optimizing f(), as aggressive as g() in below code?
#include <atomic>
int a, b;
void dummy(int*);
void f(std::atomic<int> &x) {
int z;
z = a; // loading shared `a` before acquire
b = 2; // storing shared `b` before acquire
auto y = x.load(std::memory_order_acquire);
z = a; // loading shared `a` after acquire
b = 2; // storing shared `b` after acquire
dummy(&z);
}
void g(int &x) {
int z;
z = a;
b = 2;
auto y = x;
z = a;
b = 2;
dummy(&z);
}
f(std::atomic<int>&):
sub rsp, 24
mov eax, DWORD PTR a[rip]
mov DWORD PTR b[rip], 2
mov DWORD PTR [rsp+12], eax
mov eax, DWORD PTR [rdi]
lea rdi, [rsp+12]
mov DWORD PTR b[rip], 2
mov eax, DWORD PTR a[rip]
mov DWORD PTR [rsp+12], eax
call dummy(int*)
add rsp, 24
ret
g(int&):
sub rsp, 24
mov eax, DWORD PTR a[rip]
mov DWORD PTR b[rip], 2
lea rdi, [rsp+12]
mov DWORD PTR [rsp+12], eax
call dummy(int*)
add rsp, 24
ret
b:
.zero 4
a:
.zero 4
Q1
Generally, yes. Any load or store that follows (in program order) an acquire load, must not become visible before it.
Here is an example where it matters:
Try on godbolt
This program is free of UB and must print either
0ortoo soon. If thewriterstore of 42 toxcould be reordered before the load offinished, then it would be possible that thereaderload ofxreturns 42 and thewriterload offinishedreturnstrue, in which case the program would improperly print42.Q2
Yes, it would be okay for a compiler to delete the atomic load whose value is never used, since there is no way for a conforming program to detect whether the load was done or not. However, current compilers generally don't do such optimizations. Partly out of an abundance of caution, because optimizations on atomics are hard to get right and bugs can be very subtle. It may also be partly to support programmers writing implementation-dependent code, that is able to detect via non-standard features whether the load was done.
Q3
Yes, this reordering is perfectly legal, and real-world architectures will do it. An acquire barrier is only one way.
Q4
Yes, this is also legal. If
a,bare not atomic, and some other thread is reading them concurrently, then the code has a data race and is UB, so it is okay if the other thread observes the writes having happened in the wrong order (or summons nasal demons). (If they are atomic and you are doing relaxed stores, then you can't get nasal demons, but you can still observe the stores out of order; there is no happens-before relationship mandating the contrary.)Optimization comparison
Your
fversusgexamples is not really a fair comparison: ing, the load of the non-atomic variablexhas no side effects and its value is not used, so the compiler omits it altogether. As mentioned above, the compiler doesn't omit the unnecessary atomic load ofxinf.As to why the compilers don't sink the first accesses to
aandbpast the acquire load: I believe it's simply a missed optimization. Again, most compilers deliberately don't try to do all possible legal optimizations with atomics. However, you could note that on ARM64 for instance, the load ofxinfcompiles toldar, which the CPU can definitely reorder with earlier plain loads and stores