I would like to cobble together a uint64 atomic counter from atomic uint32s. The counter has a single writer and multiple readers. The writer is a signal handler so it must not block.
My idea is to use a generation count with the low bit as a read lock. The reader retries until the generation count is stable across the read, and the low bit is unset.
Is the following code correct in design and use of memory ordering? Is there a better way?
using namespace std;
class counter {
atomic<uint32_t> lo_{};
atomic<uint32_t> hi_{};
atomic<uint32_t> gen_{};
uint64_t read() const {
auto acquire = memory_order_acquire;
uint32_t lo, hi, gen1, gen2;
do {
gen1 = gen_.load(acquire);
lo = lo_.load(acquire);
hi = hi_.load(acquire);
gen2 = gen_.load(acquire);
} while (gen1 != gen2 || (gen1 & 1));
return (uint64_t(hi) << 32) | lo;
}
void increment() {
auto release = memory_order_release;
gen_.fetch_add(1, release);
uint32_t newlo = 1 + lo_.fetch_add(1, release);
if (newlo == 0) {
hi_.fetch_add(1, release);
}
gen_.fetch_add(1, release);
}
};
edit: whoops, fixed auto acquire = memory_order_release;
This is a known pattern, called a SeqLock. https://en.wikipedia.org/wiki/Seqlock. (With the simplification that there's only one writer so no extra support for excluding simultaneous writers is needed.) It's not lock-free; a writer sleeping at the wrong time will leave readers spinning until the writer finishes. But in the common case where that doesn't happen, it has excellent performance with no contention between readers which are truly read-only.
You don't need or want the increment of the payload to use atomic RMW operations. (Unless you're on a system that can cheaply do a 64-bit atomic add or load, then do that instead of a SeqLock).
You can just load both halves with atomic 32-bit loads, increment it, and atomically store the result. (With cheap
relaxed
orrelease
memory order for the payload, and using arelease
store for the 2nd sequence counter update, what you're calling the "generation" counter).Similarly the sequence counter also doesn't need to be an atomic RMW. (Unless you're using it as a spinlock with multiple writers)
The single writer only needs pure loads and pure stores with only
release
ordering, which are (much) cheaper than atomic RMW, or stores withseq_cst
ordering:The ordering of the stores in those 3 bullet points are the only thing that matters. A write fence after the first store could be good, because we don't really want the cost of making both stores of both halves of the value
release
, on CPUs where that's more expensive thanrelaxed
.Unfortunately to satisfy C++ rules, the
value
has to beatomic<T>
, which makes it inconvenient to get the compiler to generate the most efficient code possible for loading both halves. e.g. ARMldrd
orldp
/stp
load-pair aren't guaranteed atomic until ARMv8.4a, but that doesn't matter. (And compilers often won't optimize two separate atomic 32-bit loads into one wider load.)Values other threads read while the sequence-counter is odd are irrelevant, but we'd like to avoid undefined behaviour. Maybe we could use a union of a
volatile uint64_t
and anatomic<uint64_t>
I wrote this C++
SeqLock<class T>
template for another question I didn't finish writing an answer for (figuring out which versions of ARM have 64-bit atomic load and store).This tries to check if the target already supports lock-free atomic operations on
atomic<T>
to stop you from using this when it's pointless. (Disable that for testing purposed by definingIGNORE_SIZECHECK
.) TODO: transparently fall back to doing that, maybe with a template specialization, instead of using astatic_assert
.I provided an
inc()
function forT
that supports a++
operator. TODO would be anapply()
that accepts a lambda to do something to aT
, and store the result between sequence counter updates.It compiles to the asm we want on the Godbolt compiler explorer for ARM, and other ISAs. At least for int64_t; larger struct types may be copied less efficiently because of cumbersome
volatile
rules.It uses non-atomic
volatile T data
for the shared data. This is technically data-race undefined behaviour, but all compilers we use in practice were fine with pre-C++11 multi-threaded access tovolatile
objects. And pre-C++11, people even depended on atomicity for some sizes. We do not, we check the counter and only use the value we read if there were no concurrent writes. (That's the whole point of a SeqLock.)One problem with
volatile T data
is that in ISO C++,T foo = data
won't compile for struct objects unless you provide a copy-constructor from avolatile
object, likeThis is really annoying for us, because we don't care about the details of how memory is read, just that multiple reads aren't optimized into one.
volatile
is really the wrong tool here, and plainT data
with sufficient fencing to ensure that the read actually happens between the reads of the atomic counter would be better. e.g. we could do that in GNU C with aasm("":::"memory");
compiler barrier against reordering before/after the accesses. That would let the compiler copy larger objects with SIMD vectors, or whatever, which it won't do with separatevolatile
accesses.I think
std::atomic_thread_fence(mo_acquire)
would also be a sufficient barrier, but I'm not 100% sure.In ISO C, you can copy a
volatile
aggregate (struct), and the compiler will emit whatever asm it normally would to copy that many bytes. But in C++, we can't have nice things apparently.Related: single-core systems with a writer in an interrupt handler
In an embedded system with one core, and some variables that are only updated by interrupt handlers, you may have a writer that can interrupt the reader but not vice versa. That allows some cheaper variations that use the value itself to detect torn reads.
See Reading a 64 bit variable that is updated by an ISR, especially for a monotonic counter Brendan's suggestion of reading the most significant-half first, then the low half, then the most-significant half again. If it matches, your read wasn't torn in a way that matters. (A write that didn't change the high half isn't a problem even if it interrupted the reader to change the low half right before or after the reader read it.)
Or in general, re-read the whole value until you see the same value twice in a row.
Neither of these techniques are SMP-safe: the read retry only guards against torn reads, not torn writes if the writer stored the halves separately. That's why a SeqLock uses a 3rd atomic integer as a sequence counter. They would work in any case where the writer is atomic wrt. the reader, but the reader isn't atomic. Interrupt handler vs. main code is one such case, or signal handler is equivalent.
You could potentially use the low half of a monotonic counter as a sequence number, if you don't mind incrementing by 2 instead of 1. (Perhaps requiring readers to do a 64-bit right shift by 1 to recover the actual number. So that's not good.)