Java instruction reordering and CPU memory reordering

535 Views Asked by At

This is a follow up question to

How to demonstrate Java instruction reordering problems?

There are many articles and blogs referring to Java and JVM instruction reordering which may lead to counter-intuitive results in user operations.

When I asked for a demonstration of Java instruction reordering causing unexpected results, several comments were made to the effect that a more general area of concern is memory reordering, and that it would be difficult to demonstrate on an x86 CPU.

Is instruction reordering just a part of a bigger issue of memory reordering, compiler optimizations and memory models? Are these issues really unique to the Java compiler and the JVM? Are they specific to certain CPU types?

1

There are 1 best solutions below

13
On

Memory reordering is possible without compile-time reordering of operations in source vs. asm. The order of memory operations (loads and stores) to coherent shared cache (i.e. memory) done by a CPU running a thread is also separate from the order it executes those instructions in.

Executing a load is accessing cache (or the store buffer), but executing" a store in a modern CPU is separate from its value actually being visible to other cores (commit from store buffer to L1d cache). Executing a store is really just writing the address and data into the store buffer; commit isn't allowed until after the store has retired, thus is known to be non-speculative, i.e. definitely happening.

Describing memory reordering as "instruction reordering" is misleading. You can get memory reordering even on a CPU that does in-order execution of asm instructions (as long as it has some mechanisms to find memory-level parallelism and let memory operations complete out of order in some ways), even if asm instruction order matches source order. Thus that term wrongly implies that merely having plain load and store instructions in the right order (in asm) would be useful for anything related to memory order; it isn't, at least on non-x86 CPUs. It's also weird because instructions have effects on registers (at least loads, and on some ISAs with post-increment addressing modes, stores can, too).

It's convenient to talk about something like StoreLoad reordering as x = 1 "happening" after a tmp = y load, but the thing to talk about is when the effects happen (for loads) or are visible to other cores (for stores) in relation to other operations by this thread. But when writing Java or C++ source code, it makes little sense to care whether that happened at compile time or run-time, or how that source turned into one or more instructions. Also, Java source doesn't have instructions, it has statements.

Perhaps the term could make sense to describe compile-time reordering between bytecode instructions in a .class vs. JIT compiler-generate native machine code, but if so then it's a mis-use to use it for memory reordering in general, not just compile/JIT-time reordering excluding run-time reordering. It's not super helpful to highlight just compile-time reordering, unless you have signal handlers (like POSIX) or an equivalent that runs asynchronously in the context of an existing thread.


This effect is not unique to Java at all. (Although I hope this weird use of "instruction reordering" terminology is!) It's very much the same as C++ (and I think C# and Rust for example, probably most other languages that want to normally compile efficiently, and require special stuff in the source to specify when you want your memory operations ordered wrt. each other, and promptly visible to other threads). https://preshing.com/20120625/memory-ordering-at-compile-time/
C++ defines even less than Java about access to non-atomic<> variables without synchronization to ensure that there's never a write in parallel with anything else (undefined behaviour1).

And even present in assembly language, where by definition there's no reordering between source and machine code. All SMP CPUs except a few ancient ones like 80386 also do memory-reordering at run-time, so lack of instruction reordering doesn't gain you anything, especially on machines with a "weak" memory model (most modern CPUs other than x86): https://preshing.com/20120930/weak-vs-strong-memory-models/ - x86 is "strongly ordered", but not SC: it's program-order plus a store buffer with store forwarding. So if you want to actually demo the breakage from insufficient ordering in Java on x86, it's either going to be compile-time reordering or lack of sequential consistency via StoreLoad reordering or store-buffer effects. Other unsafe code like the accepted answer on your previous question that might happen to work on x86 will fail on weakly-ordered CPUs like ARM.

(Fun fact: modern x86 CPUs aggressively execute loads out of order, but check to make sure they were "allowed" to do that according to x86's strongly-ordered memory model, i.e. that the cache line they loaded from is still readable, otherwise roll back the CPU state to before that: machine_clears.memory_ordering perf event. So they maintain the illusion of obeying the strong x86 memory-ordering rules. Other ISAs have weaker orders and can just aggressively execute loads out of order without later checks.)

Some CPU memory models even allow different threads to disagree about the order of stores done by two other threads. So the C++ memory model allows that, too, so extra barriers on PowerPC are only needed for sequential consistency (atomic with memory_order_seq_cst, like Java volatile) not acquire/release or weaker orders.

Related:

Footnote 1: C++ UB means not just an unpredictable value loaded, but that the ISO C++ standard has nothing to say about what can/can't happen in the whole program at any time before or after UB is encountered. In practice for memory ordering, the consequences are often predictable (for experts who are used to looking at compiler-generated asm) depending on the target machine and optimization level, e.g. hoisting loads out of loops breaking spin-wait loops that fail to use atomic. But of course you're totally at the mercy of whatever the compiler happens to do when your program contains UB, not at all something you can rely on.


Caches are coherent, despite common misconceptions

However, all real-world systems that Java or C++ run multiple threads across do have coherent caches; seeing stale data indefinitely in a loop is a result of compilers keeping values in registers (which are thread-private), not of CPU caches not being visible to each other. This is what makes C++ volatile work in practice for multithreading (but don't actually do that because C++11 std::atomic made it obsolete).

Effects like never seeing a flag variable change are due to compilers optimizing global variables into registers, not instruction reordering or cpu caching. You could say the compiler is "caching" a value in a register, but you can choose other wording that's less likely to confuse people that don't already understand thread-private registers vs. coherent caches.


Footnote 2: When comparing Java and C++, also note that C++ volatile doesn't guarantee anything about memory ordering, and in fact in ISO C++ it's undefined behaviour for multiple threads to be writing the same object at the same time even with volatile. Use std::memory_order_relaxed if you want inter-thread visibility without ordering wrt. surrounding code.

(Java volatile is like C++ std::atomic<T> with the default std::memory_order_seq_cst, and AFAIK Java provides no way to relax that to do more efficient atomic stores, even though most algorithms only need acquire/release semantics for their pure-loads and pure-stores, which x86 can do for free. Draining the store buffer for sequential consistency costs extra. Not much compared to inter-thread latency, but significant for per-thread throughput, and a big deal if the same thread is doing a bunch of stuff to the same data without contention from other threads.)