Let's say I have a struct `Foo` such that

```cpp
struct alignas(64) Foo {
    std::atomic<int> value;
    Foo* other;
};
```

Then, if I have an array `Foo array[2048];` of `Foo`s: the array is already initialized, but it has some dirty values inside, and I want to reset it back to the zeroed state.
If I wanted to reset this array to zeroed `Foo`s without violating the standard, there would be an efficient way to do it in C (if `Foo`'s `value` were of type `volatile int` rather than `std::atomic<int>`): `memset`.

However, what is a safe, yet efficient, way to do it in C++?
The object representation of a null pointer is not guaranteed to be the bit-pattern `0`, so even in C it's not fully portable to use `memset` instead of the assignment `ptr = 0;`. But all modern mainstream systems use `0` as the bit-pattern for null pointers. In the rest of this answer, I'll assume that you only care about optimizing for such platforms.

`volatile int` is also not guaranteed by the standards to be thread-safe in C or C++ (although it does work on most compilers as a legacy way to get something like an atomic with `memory_order_relaxed`). `memset` on a struct containing `volatile` members doesn't respect their volatility, so your proposed C equivalent was potentially not even safe in practice, let alone guaranteed by anything. The C equivalent of `std::atomic<int>` / `std::atomic_int` is `_Atomic int` / `atomic_int`. C11 doesn't have a portable `atomic_ref`, only compiler-specific stuff like GNU C `__atomic_load_n` etc.

To allow efficient zeroing during single-threaded phases of your program, consider using C++20 `std::atomic_ref` on a plain `int` member of your struct instead of a `std::atomic<int>` member. This makes your struct have only primitive types as members, making it safe to use `memset` (or `std::fill`) as long as no other threads are reading or writing simultaneously.

It would still be data-race UB to `memset` a struct while any other threads could be reading or writing it, so this only helps if you have phases of your program where synchronization guarantees that no other threads will be accessing your array, unless you want to rely on non-portable behaviour (discussed in the second half of this answer).
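A minimal sketch of that approach, assuming a C++20 compiler and a platform where all-zero bytes are a valid null `Foo*` (the function names here are just for illustration):

```cpp
#include <atomic>
#include <cstring>

// Variant of the struct from the question: plain int instead of
// std::atomic<int>, accessed through std::atomic_ref (C++20) when needed.
struct alignas(64) Foo {
    int  value;
    Foo* other;
};

Foo array[2048];

// During concurrent phases, go through atomic_ref for thread-safe access.
void concurrent_bump(Foo& f) {
    std::atomic_ref<int>(f.value).fetch_add(1, std::memory_order_relaxed);
}

// During a phase where synchronization guarantees no other thread touches the
// array, Foo is all primitive types, so memset (or std::fill) is fine.
void reset_quiesced() {
    std::memset(array, 0, sizeof(array));
}
```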
One thing that would make it faster is not padding your structs to 4x the size. Your struct would be 16 bytes (on a typical 64-bit architecture) without `alignas(64)`, or 8 bytes total with 32-bit pointers. It can make sense to align the whole array of structs by 64, but aligning every individual struct puts each one in a separate cache line. Perhaps you're doing that to avoid false sharing? That does make it slower to zero them, since it's more cache lines to write, so test to see if (and how much) it speeds up your program on various hardware to have each pointer+counter pair in its own cache line.

With 3/4 of the space being padding (assuming 16-byte `Foo` objects), `memset` on the whole array would typically be doing at least 2 stores per struct (x86-64 with AVX2 for 32-byte stores). Worse on AArch64 without SVE (16-byte vectors), much worse on RISC-V32 without vector extensions (just 4-byte scalar stores, AFAIK).
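For comparison, a hypothetical unpadded layout that aligns only the array rather than every element might look like this (sizes assume a typical 64-bit ABI):

```cpp
// Hypothetical unpadded layout: 16 bytes per Foo on a typical 64-bit ABI.
// Aligning the array (not each element) still starts it on a cache-line
// boundary, but elements share cache lines.
struct Foo {
    int  value;   // or std::atomic<int>, as in the question
    Foo* other;
};

alignas(64) Foo array[2048];

static_assert(sizeof(Foo) == 16, "holds on common 64-bit ABIs, not guaranteed");
```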
So if you are going to use this much padding, it's not bad to just loop manually and do normal assignments to the two members of each struct. If this has to be thread-safe (so you can't just access a plain `int`), use `memory_order_relaxed` for the atomic member, unless you need something stronger. You certainly don't want to be draining the store buffer for every struct, which is what would happen on x86 with `arr[i].value = 0;` (which defaults to `seq_cst`), so use `arr[i].value.store(0, std::memory_order_relaxed)`.
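A sketch of that loop, assuming the original struct from the question with a `std::atomic<int>` member (`reset_loop` is just an illustrative name):

```cpp
#include <atomic>
#include <cstddef>

// Assumes the question's Foo: std::atomic<int> value; Foo* other;
void reset_loop(Foo* arr, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        // relaxed: just publishing zeros, no ordering needed, and it avoids
        // the store-buffer drain a default seq_cst store costs on x86
        arr[i].value.store(0, std::memory_order_relaxed);
        arr[i].other = nullptr;
    }
}
```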
Looping manually would result in two stores per struct when you really only need one 16-byte store. Compilers don't optimize atomics, so they won't turn `.value.store(0, relaxed); .other = nullptr;` into a single 16-byte store, unfortunately. Even without an `atomic` member, GCC has a missed optimization (bug 82142) where it avoids writing padding, stopping it from coalescing stores when doing struct assignment. Using `size_t value` or `ptrdiff_t value` could avoid that.

With no atomic members, store coalescing should result in a loop that does one 16-byte store (or 8-byte in 32-bit code) per 64-byte struct. Unfortunately GCC fails at that for x86-64, but clang succeeds. GCC gets it right for AArch64.
Godbolt - GCC for x86-64 fails to coalesce the two 8-byte stores even in the non-atomic version where that would be possible. But GCC for AArch64 does use scalar `stp xzr, xzr, [x0]` for it, which is also a 16-byte store (of the zero register twice). Most microarchitectures run it as a single 16-byte store, at least on ARMv8.4 and later where it's guaranteed atomic, so it's efficient.

Clang compiles these to asm at least as good as `memset`, with no branching to handle tiny sizes, just an unrolled loop. The non-atomic version only does one `movaps xmmword ptr [rax], xmm0` per struct; the `atomic_ref` loop does separate stores for each member. Neither spends instructions storing the padding like `memset` on the whole array would.
On real hardware, the `atomic_ref` version would also be safe with 16-byte `movaps` stores, but hardware vendors don't guarantee it. See Per-element atomicity of vector load/store and gather/scatter? - it's not plausible that a 16-byte aligned store could have tearing at 8-byte boundaries on x86, especially since 8-byte atomicity is guaranteed.

On x86 with AVX, 16-byte stores are guaranteed atomic, so it would actually be fully safe for GCC and clang to coalesce the atomic 8-byte store to `.value` with the non-atomic 8-byte store to `.other`. But compilers aren't that smart, treating `atomic` a lot like `volatile`.

It's frustrating to know that most (all?) hardware can do stuff C++ won't let you do portably, but such is life when writing portable programs.
You could manually vectorize with SIMD intrinsics like `_mm_store_si128`. Or modern compilers will inline a 16-byte `memset` as a single instruction, but there's no guarantee of that. Or use `offsetof(Foo, other) + sizeof(Foo::other)` as the size for each `memset`, to only write the parts of the struct that contain the non-padding data. `offsetof` isn't guaranteed on structs that aren't "standard layout", but C++17 makes that "conditionally supported" instead of UB. But of course `memset` on objects being read+written by other threads is data-race UB, so I don't recommend that unless you want to carefully check your code for every compiler version, and for any change to where this zeroing gets inlined into, to make sure it always compiles to safe asm.
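Rough sketches of those two options; the intrinsic version is x86-specific, and both carry the data-race and `offsetof` caveats just mentioned:

```cpp
#include <cstddef>
#include <cstring>
#include <immintrin.h>   // x86 only, for _mm_store_si128

void reset_simd(Foo* arr, std::size_t n) {
    const __m128i zero = _mm_setzero_si128();
    for (std::size_t i = 0; i < n; ++i) {
        // One aligned 16-byte store covering value + padding + other.
        // Not guaranteed element-atomic by ISO C++ or by vendors, as discussed.
        _mm_store_si128(reinterpret_cast<__m128i*>(&arr[i]), zero);
    }
}

void reset_memset_per_struct(Foo* arr, std::size_t n) {
    // Only write up to the end of .other, skipping the trailing padding.
    const std::size_t bytes = offsetof(Foo, other) + sizeof(Foo::other);
    for (std::size_t i = 0; i < n; ++i)
        std::memset(&arr[i], 0, bytes);
}
```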
(`std::fill` wouldn't be usable this way, since it takes an iterator range of the same type. You can't pass it a `ptrdiff_t *` as the start and a `Foo **` as the end iterator. You could pass it `Foo *` iterators and fill whole `Foo` objects, but then it might do a 64-byte struct assignment, unless the compiler decided to skip storing the padding. If you care a lot about what asm you're getting, it's probably not a good choice.)
**With `std::atomic<int>`, not plain `int` + `atomic_ref`**

In this case, `std::fill` won't compile because `std::atomic` is not trivially copyable (deleted copy constructor). In C++, struct assignment goes per-member; `memset` will bypass that and let you do things you're not "supposed" to do.

If no other threads are running, `memset` on a `std::atomic<int>` (or a struct containing one) works in practice on mainstream C++ implementations, because `std::atomic<int>` is lock-free and its object representation is the same as an `int`'s. But I wouldn't recommend writing code this way, since ISO C++ doesn't guarantee it.
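So, during a phase where synchronization guarantees no other thread can touch the array, the works-in-practice (but not ISO-guaranteed) version is simply a sketch like this:

```cpp
#include <cstring>

// Non-portable in theory, works in practice on mainstream implementations:
// std::atomic<int> is lock-free with the same object representation as int,
// and all-zero bytes are a null Foo*. Only call while no other thread can
// be accessing the array.
void reset_single_threaded(Foo (&arr)[2048]) {
    std::memset(arr, 0, sizeof(arr));
}
```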
On GCC/Clang with libstdc++ or libc++, `std::atomic<int>` will have a plain `int` member, and the member functions are wrappers for compiler built-ins like GNU C `__atomic_load_n` and `__atomic_fetch_add`. So there'd be no UB in using `memset` to change the bytes of the object representation. But again, ISO C++ doesn't guarantee anything about the internals of `std::atomic`, so this would be relying on implementation details.
For larger non-lock-free atomic objects like `atomic<big_struct>`, some compilers (like MSVC: Godbolt) include a spinlock inside the `std::atomic` object. IDK if any compilers include a full `std::mutex`, which shouldn't be zeroed even if you know it's unlocked (no concurrent readers or writers). (Most other compilers use a separate hash table of spinlocks, and people have said MSVC is planning that change for their next C++ ABI break.)

MSVC uses zero as the unlocked state, so static zero-initialized storage is usable directly, but in theory an implementation could have 0 meaning locked, so `memset` would create a deadlock.
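If you do rely on the memset trick, a couple of compile-time guards can at least catch the embedded-lock case; this is a defensive sketch, not something the standard promises is sufficient:

```cpp
#include <atomic>

// Defensive checks before using the memset trick: a non-lock-free atomic may
// embed a spinlock whose unlocked state isn't all-zero bytes. These asserts
// catch that case on the implementations you build for, but ISO C++ still
// doesn't guarantee the object representation.
static_assert(std::atomic<int>::is_always_lock_free,
              "memset-based reset assumes lock-free std::atomic<int>");
static_assert(sizeof(std::atomic<int>) == sizeof(int),
              "memset-based reset assumes no extra state inside the atomic");
```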
Zeroing a spinlock while there are concurrent readers or writers could let some threads see torn values and maybe cause some memory_order weirdness.
It's still data-race UB to `memset` while other threads might be reading or writing the object. In practice it will probably just work, since any decent `memset` will work internally as if storing in chunks at least as wide as an `int`, especially on a large aligned array, and in asm a `relaxed` atomic store doesn't need any special instructions on mainstream platforms.
But without actual guarantees, there's a lot of "this probably won't break" being relied on, so I wouldn't recommend it for portable code. Only consider `memset` while other threads are running if you need to squeeze every drop of performance out of something for a specific compiler / standard library and architecture, and are prepared to verify your assumptions by checking the asm of the standard-library implementation, or checking what gets inlined into your code.

And as I said earlier, padding your structs to one per cache line means only 1/4 or 1/8th of the bytes actually need to get stored, so an optimized libc `memset` isn't such an advantage. If you only aligned your structs by their size, `memset` could be a lot better, since it's able to use 32-byte stores even if the rest of your program isn't compiled to depend on x86 AVX or ARM SVE or whatever.
On some microarchitectures, storing every byte of a cache line could have advantages in allowing the CPU to just invalidate other copies of the cache line without doing a Read For Ownership. Normally a store needs to merge some bytes into the old value of a cache line, but storing every byte of it can avoid that. But a CPU would have to detect that across multiple stores, not just by looking at the first store to a line, unless you're using AVX-512 to store a whole cache line at once. (And even then IDK if x86 CPUs do anything special for aligned `vmovdqa64 [mem], zmm`, but they might.) See also Enhanced REP MOVSB for memcpy for more about no-RFO store protocols on x86 CPUs. It's normally most relevant for larger arrays that don't fit in cache, but could be relevant here. NT stores like `movntps` / `_mm_stream_ps` also avoid RFOs, but they fully evict from cache, including from L3, so that makes the next read slower. (Shared L3 cache is normally a backstop for coherency traffic.)