Why does this C++ code crash with an apparent memory ordering race condition?

Why does this kind of code crash (very occasionally!) on x64?

class Object
{
   public:
   void manipulate();       // reads and writes object
   bool m_finished = false; // note this is a regular bool and NOT atomic
};

Thread 1

object->manipulate();
object->m_finished = true;
// no further access of object

Thread 2

if (object->m_finished)
   delete object;

In two separate applications I saw crashes due to this pattern (this is not the exact code, just an example of a common pattern).

The crash takes the form of heap corruption where it looks very much like the memory occupied by 'object' was modified after it was freed. It's as if Thread 1 did:

object->m_finished = true;
object->manipulate();

and the object got deleted between the setting of m_finished and manipulate().

In the first app, the fix was to put a mutex around both the setting and the testing of m_finished (actually it already had a mutex around the setting part). It involved an OS bug in aio_error and is described in this question I asked.

The second time I saw this, it was fixed by changing m_finished to a std::atomic.
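
Roughly speaking, the fixed pattern looked like this (a simplified sketch, not the exact code; the producer/consumer function names are only for illustration, and release/acquire is the minimum ordering needed here, though the default seq_cst store/load of a std::atomic also works):

#include <atomic>

class Object
{
   public:
   void manipulate();                   // reads and writes object
   std::atomic<bool> m_finished{false}; // now atomic
};

// Thread 1
void producer(Object* object)
{
   object->manipulate();
   // release store: everything manipulate() wrote is visible to a thread
   // that performs an acquire load and sees 'true'
   object->m_finished.store(true, std::memory_order_release);
}

// Thread 2
void consumer(Object* object)
{
   // acquire load: pairs with the release store above
   if (object->m_finished.load(std::memory_order_acquire))
      delete object;
}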

In both cases it was an extremely rare heisencrash that would disappear with any attempt to use Valgrind, ASan, etc. My assumption is that it was due to the CPU re-ordering the writes in object->manipulate() relative to the delete object.

Note that I am 95% sure there was no compiler-level re-ordering going on, as the various methods involved were virtual functions, etc.

I get that it's good to use atomics when doing inter-thread signalling (and I promise I will forever more). As far as I understand, x86 (Intel and AMD; we run AMD) does not allow store-store re-ordering. So if the deleting thread witnesses m_finished == true, all the preceding stores performed in object->manipulate() should have been committed to memory, right?

I understand that in general the behaviour of this code is undefined. But specifically: why, on x64, does a mutex or atomic for m_finished appear to be required, when the CPU specs guarantee the ordering of stores and loads?

2 Answers

Answer from doron:

Modern processors have multiple cores, and each core has its own L1 cache. When memory is written and then re-read, this may happen well before the write makes its way down to main memory; that is the main point of caches.

When memory is shared between cores, a write by one core may not be seen by another core until later. This can effectively make reads and writes appear reordered between cores, which is exactly the problem you describe.

Therefore, when sharing memory between cores (threads), you need to add memory barriers to ensure that the states of the two caches agree. This is what std::atomic does.
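
For example, the flag signalling from the question can be written with explicit fences around relaxed atomic accesses, which makes the barrier visible (a sketch under the question's pattern, not the asker's actual code):

#include <atomic>

class Object { public: void manipulate(); };

std::atomic<bool> finished{false};

// Thread 1
void writer(Object* object)
{
    object->manipulate();
    // release fence: the writes above cannot be reordered past the flag store
    std::atomic_thread_fence(std::memory_order_release);
    finished.store(true, std::memory_order_relaxed);
}

// Thread 2
void reader(Object* object)
{
    if (finished.load(std::memory_order_relaxed)) {
        // acquire fence: pairs with the release fence in writer()
        std::atomic_thread_fence(std::memory_order_acquire);
        delete object;
    }
}

A release/acquire (or default seq_cst) store and load on the atomic itself folds the barrier into the access.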

ARM has a very relaxed memory model, meaning that memory barriers are needed in all scenarios. x86/amd64 makes stronger guarantees, but still provides the LOCK instruction prefix and the MFENCE instruction to deal with pathological cases like the one you observed.

EDIT: If you take a look at the spec linked below, it seems amd64 does a good job of enforcing ordering, except in one case: loads may be reordered with older stores to a different location.

https://www.cs.cmu.edu/~410-f10/doc/Intel_Reordering_318147.pdf
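
That exception is the classic store-then-load pattern from the linked document. As a sketch (a hypothetical illustration, not the asker's code), both r1 and r2 can end up 0 on x86/amd64 unless a full barrier (MFENCE, or a seq_cst atomic operation) separates each store from the following load:

#include <atomic>
#include <thread>

std::atomic<int> x{0}, y{0};
int r1, r2;

void thread_a()
{
    x.store(1, std::memory_order_relaxed);
    r1 = y.load(std::memory_order_relaxed); // may complete before the store to x is visible
}

void thread_b()
{
    y.store(1, std::memory_order_relaxed);
    r2 = x.load(std::memory_order_relaxed); // likewise for the store to y
}

int main()
{
    std::thread a(thread_a), b(thread_b);
    a.join();
    b.join();
    // Possible outcome: r1 == 0 && r2 == 0 (the store-load reordering case).
}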

Answer from Angelicos Phosphoros:

I think the most probable reason for the crash in your code is compiler optimization. Does your code crash in builds without optimization (e.g. with -O0)?

For example, let's look at the code below on Godbolt.

#include <atomic>

class Object
{
public:
   std::atomic<bool> m_finished_atomic = false;
   bool m_finished_non_atomic = false;
};

void delete_without_sync(Object* ob){
    while (true){
        if (ob->m_finished_non_atomic){
            delete ob;
            break;
        }
    }
}

void delete_with_sync(Object* ob){
    while (true){
        if (ob->m_finished_atomic.load(std::memory_order::relaxed)){
            delete ob;
            break;
        }
    }
}

As you can see, both functions do the same thing, and std::memory_order::relaxed adds no extra instructions on an x86_64 CPU. But for the compiler, these functions are VERY different. Let's look at the assembly:

delete_without_sync(Object*):        # @delete_without_sync(Object*)
        jmp     operator delete(void*)@PLT                      # TAILCALL
delete_with_sync(Object*):           # @delete_with_sync(Object*)
.LBB1_1:                                # =>This Inner Loop Header: Depth=1
        movzx   eax, byte ptr [rdi]
        test    al, 1
        je      .LBB1_1
        jmp     operator delete(void*)@PLT                      # TAILCALL

As you can see, delete_without_sync was optimized by the compiler into a direct, unconditional delete of our object, without any waiting, while delete_with_sync repeatedly tests the variable and deletes the object only once it is true. So why?

From the compiler's point of view, in delete_without_sync the field m_finished_non_atomic cannot change: since there is no synchronization, the compiler assumes it is not shared between threads. So the compiler converts the code to something like this:

void delete_without_sync(Object* ob){
    if (ob->m_finished_non_atomic) {
       delete ob;
       return;
    }
    while (true) {}
}

Then it sees an infinite loop without any side effects. Such loops are undefined behaviour, so the compiler may assume they are never reached. Therefore the field m_finished_non_atomic must have been true on entry, and the compiler removes even the read of it.

And now we have a use-after-free caused by the data race.

In delete_with_sync, the compiler cannot assume that the variable never changes, so it generates the "honest" loop we asked for.

In a large project there can be a lot of things happening around such code that are hard to keep track of. Maybe there was some loop waiting for the boolean flag to change, or some other piece of undefined behaviour that kicked in because the compiler assumed the variable could not change. A compiler tracks these things far more relentlessly than a person can.

Another thing that could have happened is that the compiler simply moved object->m_finished = true before the manipulate() call, because it noticed that manipulate() never reads or writes that flag.

In conclusion: I think it is compiler optimizations that trigger this behaviour here, not the CPU itself. This is a reason why you should always use proper synchronization at some point: to prevent such miscompilations. The C++ memory model is much more relaxed than actual CPUs precisely to make such optimizations available.