I have a few questions regarding memory barriers.
Say I have the following C code (it will be run from both C++ and C code, so atomics are not possible) that writes an array into another one. Multiple threads may call thread_func(), and I want to make sure that my_str is returned only after it has been fully initialized. In this case, it is a given that the last byte of the buffer can't be 0, so checking that the last byte is not 0 should suffice.
Due to reordering by compiler/CPU, this can be a problem as the last byte might get written before previous bytes, causing my_str to be returned with a partially copied buffer. So to get around this, I want to use a memory barrier. A mutex will work of course, but would be too heavy for my uses.
Keep in mind that all threads will call thread_func() with the same input, so even if multiple threads call init() a couple of times, it's OK as long as in the end, thread_func() returns a valid my_str, and that all subsequent calls after initialization return my_str directly.
Please tell me if all the following different code approaches work, or if there could be issues in some scenarios as aside from getting the solution to the problem, I'd like to get some more information regarding memory barriers.
1. `__sync_bool_compare_and_swap` on last byte. If I understand correctly, any memory store/load would not be reordered, not just the one for the particular variable that is sent to the command. Is that correct? if so, I would expect this to work as all previous writes of the previous bytes should be made before the barrier moves on.

   ```c
   #define STR_LEN 100
   static uint8_t my_str[STR_LEN] = {0};

   static void init(uint8_t input_buf[STR_LEN])
   {
       for (int i = 0; i < STR_LEN - 1; ++i) {
           my_str[i] = input_buf[i];
       }
       __sync_bool_compare_and_swap(my_str, 0, input_buf[STR_LEN - 1]);
   }

   const char * thread_func(char input_buf[STR_LEN])
   {
       if (my_str[STR_LEN - 1] == 0) {
           init(input_buf);
       }
       return my_str;
   }
   ```

2. `__sync_bool_compare_and_swap` on each write. I would expect this to work as well, but to be slower than the first one.

   ```c
   static void init(char input_buf[STR_LEN])
   {
       for (int i = 0; i < STR_LEN; ++i) {
           __sync_bool_compare_and_swap(my_str + i, 0, input_buf[i]);
       }
   }
   ```

3. `__sync_synchronize` before each byte copy. I would expect this to work as well, but is this slower or faster than (2)? `__sync_bool_compare_and_swap` is supposed to be a full barrier as well, so which would be preferable?

   ```c
   static void init(char input_buf[STR_LEN])
   {
       for (int i = 0; i < STR_LEN; ++i) {
           __sync_synchronize();
           my_str[i] = input_buf[i];
       }
   }
   ```

4. `__sync_synchronize` by condition. As I understand it, `__sync_synchronize` is both a HW and SW memory barrier. As such, since the compiler can't tell the value of `use_sync` it shouldn't reorder. And the HW reordering will be done only if `use_sync` is true. is that correct?

   ```c
   static void init(char input_buf[STR_LEN], bool use_sync)
   {
       for (int i = 0; i < STR_LEN; ++i) {
           if (use_sync) {
               __sync_synchronize();
           }
           my_str[i] = input_buf[i];
       }
   }
   ```
GNU C legacy `__sync` builtins are not recommended for new code, as the manual says.

Use the `__atomic` builtins, which can take a memory-order parameter like C11 stdatomic. But they're still builtins and still work on plain types not declared `_Atomic`, so using them is like C++20 `std::atomic_ref`. In C++20, use `std::atomic_ref<unsigned char>(my_str[STR_LEN - 1])`, but C doesn't provide an equivalent so you'd have to use compiler builtins to hand-roll it.

Just do the last store separately with a release store in the writer, not an RMW, and definitely not a full memory barrier (`__sync_synchronize()`) between every byte!!! That's way slower than necessary, and defeats any optimization to use `memcpy`. Also, you need the store of the final byte to be at least `RELEASE`, not a plain store, so readers can synchronize with it. See also *Who's afraid of a big bad optimizing compiler?* re: how exactly compilers can break your code if you try to hand-roll lockless code with just barriers, not atomic loads or stores. (It's written for Linux kernel code, where a macro would use `*(volatile char*)` to hand-roll something close to `__atomic_store_n` with `__ATOMIC_RELAXED`.)

So something like
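A minimal sketch of such a writer, assuming GCC or Clang `__atomic` builtins and the `my_str` / `STR_LEN` names from the question. Using `memcpy` for the bulk copy is an assumption here, not something the question's code did:

```c
#include <stdint.h>
#include <string.h>

#define STR_LEN 100

/* Zero-initialized: a last byte of 0 means "not published yet". */
static uint8_t my_str[STR_LEN];

/* Copy everything except the last byte with plain stores, then publish
   the last byte with a release store. The release ordering keeps the
   earlier plain stores from becoming visible after the final byte to a
   reader that uses an acquire load on that byte. */
static void init(const uint8_t input_buf[STR_LEN])
{
    memcpy(my_str, input_buf, STR_LEN - 1);
    __atomic_store_n(&my_str[STR_LEN - 1], input_buf[STR_LEN - 1],
                     __ATOMIC_RELEASE);
}
```

Only one store is atomic, so the compiler stays free to vectorize or inline the bulk copy.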
The `if (my_str[STR_LEN - 1] == 0)` load in `thread_func` is of course data-race UB when there are concurrent writers.

For safety it needs to be an acquire load, like `__atomic_load_n(&my_str[STR_LEN - 1], __ATOMIC_ACQUIRE) == 0`, since you need a thread that loads a non-`0` value to also see all other stores by another thread that ran `init()`. (Which did a release-store to that location, creating acquire/release synchronization and guaranteeing a happens-before relationship between these threads.)

See https://preshing.com/20120913/acquire-and-release-semantics/
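Put together, the reader side might look like the sketch below. It assumes GCC or Clang `__atomic` builtins, and it declares `my_str` as `char` (an assumption, to match the `const char *` return type). It only shows the acquire/release pairing; it does nothing about two threads running `init()` at the same time.

```c
#include <string.h>

#define STR_LEN 100

/* The last byte stays 0 until init() publishes it. */
static char my_str[STR_LEN];

/* Writer: plain copy of the body, then a release store of the last byte. */
static void init(const char input_buf[STR_LEN])
{
    memcpy(my_str, input_buf, STR_LEN - 1);
    __atomic_store_n(&my_str[STR_LEN - 1], input_buf[STR_LEN - 1],
                     __ATOMIC_RELEASE);
}

/* Reader: the acquire load pairs with the release store in init(), so a
   thread that sees a non-zero last byte also sees the bytes before it. */
const char *thread_func(const char input_buf[STR_LEN])
{
    if (__atomic_load_n(&my_str[STR_LEN - 1], __ATOMIC_ACQUIRE) == 0) {
        init(input_buf);
    }
    return my_str;
}
```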
Writing the same value non-atomically is also UB in ISO C and ISO C++. See Race Condition with writing same value in C++? and others.
But in practice it should be fine except with `clang -fsanitize=thread`. In theory a DeathStation 9000 could implement non-atomic stores by storing `value+1` and then subtracting `1`, so temporarily there'd be a different value in memory. But AFAIK there aren't real compilers that do that. I'd have a look at the generated asm on any new compiler / ISA combination you're trying, just to make sure.

It would be hard to test; the init stuff can only race once per program invocation. But there's no fully safe way to do it that doesn't totally suck for performance, AFAIK. Perhaps doing the init with a cast to `_Atomic unsigned char*` or `typedef _Atomic unsigned long __attribute__((may_alias)) aliasing_atomic_ulong;` as a building block for a manual copy loop?

Bonus question: `if(use_sync) __sync_synchronize()` inside the loop.

Optimization is possible to asm that works something like `if(use_sync) { slow barrier loop } else { no-barrier loop }`. This is called "loop unswitching": making two loops and branching once to decide which to run, instead of every iteration. GCC has been able to do that optimization (in some cases) since 3.4. So that defeats your attempt to take advantage of how the compiler would compile to trick it into doing more ordering than the source actually requires.

As for the hardware barrier only executing when `use_sync` is true: yes, that part is correct.
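To picture loop unswitching, the sketch below writes out by hand the two forms a compiler may treat as equivalent. The function names are hypothetical, and the second function is only an illustration of the transformation, not actual compiler output:

```c
#include <stdbool.h>

#define STR_LEN 100

static char my_str[STR_LEN];

/* As written in the question: the branch is tested on every iteration. */
static void init_branch_in_loop(const char input_buf[STR_LEN], bool use_sync)
{
    for (int i = 0; i < STR_LEN; ++i) {
        if (use_sync) {
            __sync_synchronize();
        }
        my_str[i] = input_buf[i];
    }
}

/* Hand-unswitched form: branch once, then run one of two loops. */
static void init_unswitched(const char input_buf[STR_LEN], bool use_sync)
{
    if (use_sync) {
        for (int i = 0; i < STR_LEN; ++i) {
            __sync_synchronize();
            my_str[i] = input_buf[i];
        }
    } else {
        /* No barrier in this loop, so nothing stops the compiler from
           reordering or merging these stores, e.g. into a memcpy. */
        for (int i = 0; i < STR_LEN; ++i) {
            my_str[i] = input_buf[i];
        }
    }
}
```

In the `use_sync == false` path the barrier is never in the loop at all, so its mere presence in the source imposes no ordering on that path.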
Also, inlining and constant-propagation of `use_sync` could easily defeat this, unless `use_sync` was a `volatile` global or something. At that point you might as well just make a separate `_Atomic unsigned char array_init_done` flag / guard variable.

And you can use it for mutual exclusion by having threads try to set it to `1` with `int old = guard.exchange(1)`, with the winner of the race being the one to run `init` while the other threads spin-wait (or C++20 `.wait(1)`) for the guard variable to become `2` or `-1` or something, which the winner of the race will set after finishing `init`.

Have a look at the asm GCC makes for non-constant-initialized `static` local vars; they check a guard variable with an acquire load, only doing locking to have one thread do the run_once init stuff and the others wait for that result. IIRC there's a Q&A about doing that yourself with atomics.
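A rough sketch of such a guard variable in GNU C, using `__atomic` builtins on a plain `unsigned char` so the same source compiles as C or C++. The name `guard` and the state values 0 / 1 / 2 are assumptions for illustration. It also swaps in a compare-and-swap from 0 to 1 in place of the unconditional exchange mentioned above, so that a thread arriving late cannot overwrite the "done" value:

```c
#include <string.h>

#define STR_LEN 100

static char my_str[STR_LEN];

/* Hypothetical guard states: 0 = untouched, 1 = init running, 2 = done. */
static unsigned char guard;

const char *thread_func(const char input_buf[STR_LEN])
{
    /* Fast path: an acquire load that pairs with the release store below. */
    if (__atomic_load_n(&guard, __ATOMIC_ACQUIRE) == 2) {
        return my_str;
    }

    unsigned char expected = 0;
    if (__atomic_compare_exchange_n(&guard, &expected, 1, 0,
                                    __ATOMIC_ACQUIRE, __ATOMIC_ACQUIRE)) {
        /* This thread won the race: it is the only one that writes my_str. */
        memcpy(my_str, input_buf, STR_LEN);
        __atomic_store_n(&guard, 2, __ATOMIC_RELEASE);
    } else {
        /* Another thread is (or was) initializing: spin until it is done. */
        while (__atomic_load_n(&guard, __ATOMIC_ACQUIRE) != 2) {
            /* busy-wait; a real implementation might yield or sleep here */
        }
    }
    return my_str;
}
```

Because only the winner ever writes `my_str`, this also avoids the concurrent same-value writes discussed earlier, at the cost of losers spinning during the one-time init.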