Why does this kind of code crash (very occasionally!) on x64?
class Object
{
public:
    void manipulate();        // reads and writes the object
    bool m_finished = false;  // note this is a regular bool and NOT atomic
};
Thread 1
    object->manipulate();
    object->m_finished = true;
    // no further access of object

Thread 2
    if (object->m_finished)
        delete object;
In two separate applications I saw crashes due to this pattern (this is not the exact code, just an example of a common pattern).
The crash takes the form of heap corruption, where it looks very much like the memory occupied by 'object' was modified after it was freed. It's as if Thread 1 had done:
    object->m_finished = true;
    object->manipulate();
and the object got deleted between the setting of m_finished and the call to manipulate().
In the first app, the fix was to put a mutex around both the setting and the testing of m_finished (it already had a mutex around the setting part). That case involved an OS bug in aio_error and is described in this question I asked.
The second time I saw this, it was fixed by changing m_finished to a std::atomic.
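For illustration, here is a minimal sketch of what the atomic fix looks like; the explicit release/acquire orderings are my choice for the example, not necessarily what the real code used (the default seq_cst store/load would also work):

#include <atomic>

class Object
{
public:
    void manipulate();                    // reads and writes the object
    std::atomic<bool> m_finished{false};  // now an atomic flag
};

Thread 1
    object->manipulate();
    object->m_finished.store(true, std::memory_order_release); // earlier writes become visible before the flag

Thread 2
    if (object->m_finished.load(std::memory_order_acquire))    // pairs with the release store
        delete object;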
In both cases it was an extremely rare heisencrash that would disappear with any attempt to use Valgrind, ASan, etc. My assumption is that it was caused by the CPU re-ordering the stores performed inside object->manipulate() with respect to the delete object. Note that I am 95% sure there was no compiler-level re-ordering going on, as the various methods involved were virtual functions etc.
I get that it's good to use atomics when doing inter-thread signalling (and I promise I will forever more). But as far as I understand, x86 (Intel and AMD; we run AMD) does not allow store-store re-ordering. So if the deleting thread witnesses m_finished == true, all the preceding stores performed in object->manipulate() should already be globally visible, right?
I understand that in general the behaviour of this code is undefined. But specifically, why does it appear that, on x64, a mutex or atomic for m_finished is required when the CPU specification guarantees the ordering of stores and loads?
Modern processors have multiple cores, and each core has its own L1 cache. When memory is written and then re-read, this can happen entirely within the cache, well before the data ever reaches main memory; that is the whole point of caches.
When memory is shared between cores, a write by one core may not be seen by another core until later. The practical effect is that reads and writes can appear reordered between cores, which is exactly the problem you describe.
Therefore, when sharing memory between cores (threads), you need memory barriers to enforce the order in which one core's writes become visible to the others. This is what std::atomic does.
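As a rough sketch of what "a memory barrier" means in C++ terms (an illustrative example, not the code from the question), explicit fences paired with an atomic flag make the required ordering visible:

#include <atomic>

std::atomic<bool> finished{false};  // illustrative flag; still atomic, but accessed with relaxed ordering

// Writer core
// ... write to the shared object ...
std::atomic_thread_fence(std::memory_order_release);    // barrier: the writes above become visible before the flag
finished.store(true, std::memory_order_relaxed);

// Reader core
if (finished.load(std::memory_order_relaxed))
{
    std::atomic_thread_fence(std::memory_order_acquire); // barrier: reads below see the writer's earlier writes
    // now safe to read or free the shared object
}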
ARM has a very relaxed memory model, meaning that memory barriers are needed in essentially every such scenario. x86/amd64 makes much stronger guarantees, but it still provides the LOCK instruction prefix and the MFENCE instruction to deal with the pathological cases, like the one you observed.
EDIT: If you take a look at the spec linked below, it seems amd64 does a good job of enforcing proper ordering except in one case: loads may be reordered with older stores to a different location.
https://www.cs.cmu.edu/~410-f10/doc/Intel_Reordering_318147.pdf
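That one allowed reordering is exactly what the classic store-buffering (Dekker-style) litmus test exposes. A sketch, using relaxed atomics to stand in for plain accesses:

#include <atomic>

std::atomic<int> x{0}, y{0};
int r1, r2;

// Thread A
x.store(1, std::memory_order_relaxed);
r1 = y.load(std::memory_order_relaxed);  // may complete before A's store is globally visible

// Thread B
y.store(1, std::memory_order_relaxed);
r2 = x.load(std::memory_order_relaxed);  // likewise for B

// On x86/amd64 the outcome r1 == 0 && r2 == 0 is possible, because each thread's load
// can execute before its own store has drained from the store buffer. Using the default
// memory_order_seq_cst forces a full barrier (a LOCKed instruction or MFENCE) and rules
// that outcome out.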