is std::atomic::fetch_add a serializing operation on x86-64?


Considering the following code:

std::atomic<int> counter;

/* otherStuff 1 */
counter.fetch_add(1, std::memory_order_relaxed);
/* otherStuff 2 */

Is there an instruction in x86-64 (say, on architectures less than 5 years old) that would allow otherStuff 1 and otherStuff 2 to be reordered across the fetch_add, or is it always going to be serializing?

EDIT:

It looks like this is summarized by "is lock add a memory barrier on x86?" and it seems it is not, though I am not sure where to find a reference for that.

3 Answers

BEST ANSWER (12 votes)

First let's look at what the compiler is allowed to do when using std::memory_order_relaxed.
If there are no dependencies between otherStuff 1/2 and the atomic operation, it can certainly reorder the statements. For example (with a and g as global variables):

std::atomic<int> a;
int g;

g = 3;
a.fetch_add(1, std::memory_order_relaxed);
g += 12;

clang++ generates the following assembly:

lock   addl $0x1,0x2009f5(%rip)        # 0x601040 <a>
movl   $0xf,0x2009e7(%rip)             # 0x60103c <g>

Here clang took the liberty of reordering g = 3 with the atomic fetch_add operation, which is a legitimate transformation.

When using std::memory_order_seq_cst, the compiler output becomes:

movl   $0x3,0x2009f2(%rip)        # 0x60103c <g>
lock   addl $0x1,0x2009eb(%rip)   # 0x601040 <a>
addl   $0xc,0x2009e0(%rip)        # 0x60103c <g>

Reordering of statements does not take place because the compiler is not allowed to do that. A sequentially consistent read-modify-write (RMW) operation is both a release and an acquire operation, so no (visible) reordering of statements is allowed at either the compiler or the CPU level.

Your question is whether, on x86-64, std::atomic::fetch_add with relaxed ordering is a serializing operation.
The answer is: yes, if you do not take compiler reordering into account.

On the X86 architecture, an RMW operation always flushes the store buffer and therefore is effectively a serializing and sequentially consistent operation.

You can say that, on an X86 CPU, each RMW operation:

  • is a release operation for memory operations that precede it and is an acquire operation for memory operations that follow it.
  • becomes visible in a single total order observed by all threads.
ANSWER (0 votes)

The target architecture

On the X86 architecture, an RMW operation always flushes the store buffer and therefore is effectively a serializing and sequentially consistent operation.

I wish people would stop saying that.

That statement doesn't even make sense: there is no such thing as a "sequentially consistent operation", because sequential consistency isn't a property of any single operation. A sequentially consistent execution is one whose end result could have been produced by some interleaving of the operations.

What can be said about these RMW operations:

  • all operations before the RMW have to be globally visible before either the R or the W of the RMW is visible,
  • and no operation after the RMW is visible before the RMW is visible.

That is, the part before, the RMW itself, and the part after run sequentially. In other words, there is a full fence before and after the RMW.

Whether that results in a sequential execution for the complete program depends on the nature of all globally visible operations of the program.

Visibility vs. execution ordering

That's in terms of visibility. I have no idea whether these processors try to speculatively execute code after the RMW, subject to the correctness requirement that operations are rolled back if there is a conflict with a side effect of a parallel execution (these details tend to differ between vendors and generations within a given family, unless clearly specified).

The answer to your question could be different whether

  • you need to guarantee correctness of the set of side effects (as in the sequential-consistency requirement),
  • or guarantee that benchmarks are reliable,
  • or guarantee something, independently of the CPU version, about the results of comparing the timing of different executions (for a given CPU).

High level languages vs. CPU features

The question title is "is std::atomic::fetch_add a serializing operation on x86-64?" of the general form:

"does OP provide guarantees P on ARCH"

where

  • OP is a high level operation in a high level language
  • P is the desired property
  • ARCH is a specific CPU or compiler target

As a rule, the canonical answer is: the question doesn't make sense, since OP is high level and target independent. There is a low-level/high-level mismatch here.

The compiler is bound by the language standard (or rather its most reasonable interpretation), by documented extensions, by history... not by the standard for the target architecture, unless the feature is a low-level, transparent feature of the high-level language.

The canonical way to get low-level semantics in C/C++ is to use volatile objects and volatile operations.

Here you must use a volatile std::atomic<int> to even be able to ask a meaningful question about architectural guarantees.

Present code generation

The meaningful variant of your question would use this code:

volatile std::atomic<int> counter;

/* otherStuff 1 */
counter.fetch_add(1, std::memory_order_relaxed);

That statement will generate an atomic RMW operation which in that case "is a serializing operation" on the CPU: all operations performed before it, in assembly code, are complete before the RMW starts, and all operations following the RMW wait until the RMW completes before starting (in terms of visibility).

And then you would need to learn about the unpleasantness of volatile semantics: volatile applies only to the volatile operations themselves, so you would still not get general guarantees about other operations.

There is no guarantee that high-level C++ operations before the volatile RMW operation are sequenced before it in the assembly code. You would need a "compiler barrier" for that. Such barriers are not portable. (And not needed here, as it's a silly approach anyway.)

But then if you want that guarantee, you can just use:

  • a release operation: to ensure that previous globally visible operations are complete
  • an acquire operation: to ensure that following globally visible operations do not start before
  • an RMW operation on an object that is visible to multiple threads.

So why not make your RMW operation acq_rel? Then it wouldn't even need to be volatile.

Possible RMW variants in a processor family

Is there an instruction in x86-64 (say less than 5 years old architectures) that would

Potential variants of the instruction set are another sub-question. Vendors can introduce new instructions, along with ways to test for their availability at runtime, and compilers can even generate code to detect their availability.

Any new RMW feature that followed the existing tradition (1) of strong ordering of ordinary memory operations in that family would have to respect:

  • Total Store Order: all store operations are ordered, implicitly fenced; in other words, there is a store buffer, strictly for non-speculative store operations, in each core, which is not reordered and not shared between cores;
  • each store is a release operation (for previous normal memory operations);
  • loads that are started speculatively are completed in order, and at completion are validated: any early load from a location that was since clobbered in the cache is cancelled and the computation is restarted with the recent value;
  • loads are acquire operations.

Then any new (but traditional) RMW operation must be both an acquire operation and a release operation.

(Examples for potential imaginary RMW operation to be added in the future are xmult and xdiv.)

But that's futurology, and adding less-ordered instructions in the future wouldn't violate any security invariants, except potentially against timing-based, side-channel, Spectre-like attacks, which we don't know how to model and reason about in general anyway.

The problem with these questions, even about the present, is that a proof of absence would be required, and for that we would need to know about each variant of a CPU family. That is not always doable; it is also unnecessary if you use the proper ordering in the high-level code, and useless if you don't.

(1) Traditions for guarantees of memory operations are guidelines in the CPU design, not guarantees about any future operation: by definition, operations that don't yet exist have no guaranteed semantics, beside the guarantees of memory integrity, that is, the guarantee that no future operation will break the privilege and security guarantees previously established (no unprivileged instruction created in the future can access an unmapped memory address...).

ANSWER (8 votes)

When using std::memory_order_relaxed, the only guarantee is that the operation is atomic. Anything around the operation can be reordered at will by either the compiler or the CPU.

From https://en.cppreference.com/w/cpp/atomic/memory_order:

Relaxed operation: there are no synchronization or ordering constraints imposed on other reads or writes, only this operation's atomicity is guaranteed (see Relaxed ordering below)