As I understand it from the ARM Cortex-A57 and Cortex-A78 TRMs, micro-ops can be issued out of order to one of several execution pipelines.
As far as I understand, this is instruction reordering of independent instructions.
Memory access reordering means that observers and slaves in a system may observe memory accesses in a different order from program order. This could mean one of the following:
1 - The CPU reordered the memory-access micro-ops when issuing them to the load and store pipelines; the interconnect (ACE/CHI) did not do any reordering.
2 - The CPU issued the micro-ops in program order, but the interconnect (ACE/CHI) reordered them.
Is my understanding correct? If so, does a barrier instruction halt the CPU pipeline by stopping further instruction issue, or does the interconnect throttle the CPU master interface until the barrier response is received?
I asked on the ARM blog but have not received a response so far.
EDIT 1
As Peter suggested, I want to state the following preconditions for my question:
1 - A multi-cluster ARM SoC along with other ACE masters such as a DMA engine, iGPU, etc.
2 - The question covers inner-shareable as well as outer-shareable memory (e.g. memory accessed by threads running in different CPU clusters).
3 - The question covers Cacheable (which Peter has clarified to a great extent) and Non-Cacheable Normal memory, as I want to understand how the observation of memory accesses by other observers relates to ordering inside an out-of-order CPU pipeline such as the ARM Cortex-A78.
Memory reordering (of access to globally-visible cache state) happens inside the CPU core, not the interconnect. A barrier instruction doesn't send any messages to other cores.
(At least not dmb ish. I don't know about outer-shareable / non-cache coherent stuff, but those barriers might just order things wrt. cache-control instructions that you also need in those cases. The A32/T32 and A64 docs sound to me like even for stronger orders, it's still just about waiting for completion of things that were already going to happen because of other instructions, including loads or stores. There are probably more detailed docs somewhere, but maybe an ARM expert can shed some more light on this with another answer if this answer is missing anything important.)
Issuing a load micro-op to an execution unit attempts to read from cache right then. But issuing a store just copies the data+address to the store buffer. Memory reordering (of their accesses to coherent shared cache) happens inside each core, by various mechanisms including the store buffer and hit-under-miss non-blocking caches.
Out-of-order execution is one significant mechanism for LoadLoad reordering (if load addresses are ready in a different order), but all major kinds of memory reordering can happen on an in-order pipeline, due to cache-miss loads and a store buffer. (StoreStore reordering is also possible if the store buffer allows out-of-order commit of stores, which an ARM core normally would, since the memory model doesn't guarantee StoreStore ordering.)
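To make the store-buffer effect concrete, here is a minimal single-threaded sketch in C++ of a toy model: two "cores" that each buffer stores privately before committing them to shared memory. Everything in it (ToyCore, run_litmus, the drain() stand-in for a barrier) is a hypothetical name made up for illustration, and a real core is far more complicated. It only shows that StoreLoad reordering needs nothing more than a store buffer, with each core still executing its own instructions in program order.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <deque>
#include <utility>

// Toy model only: two "cores" share a small memory, and each core has a
// private FIFO store buffer. All names here are hypothetical.
using Memory = std::array<int, 2>;
constexpr std::size_t X = 0;
constexpr std::size_t Y = 1;

struct ToyCore {
    Memory& mem;
    std::deque<std::pair<std::size_t, int>> store_buffer;  // (address, value)

    explicit ToyCore(Memory& m) : mem(m) {}

    // A store only enters the private store buffer; other cores can't see it yet.
    void store(std::size_t addr, int value) {
        assert(addr < mem.size());
        store_buffer.emplace_back(addr, value);
    }

    // A load reads right away: it forwards from this core's newest matching
    // buffered store if there is one, otherwise it reads shared memory.
    int load(std::size_t addr) const {
        for (auto it = store_buffer.rbegin(); it != store_buffer.rend(); ++it) {
            if (it->first == addr) return it->second;
        }
        return mem[addr];
    }

    // Stand-in for a full barrier: commit every buffered store before continuing.
    void drain() {
        while (!store_buffer.empty()) {
            mem[store_buffer.front().first] = store_buffer.front().second;
            store_buffer.pop_front();
        }
    }
};

// Store-buffer litmus test, each core running in program order:
//   core 0: x = 1; r0 = y;        core 1: y = 1; r1 = x;
std::pair<int, int> run_litmus(bool use_barrier) {
    Memory mem{0, 0};
    ToyCore core0(mem), core1(mem);
    core0.store(X, 1);
    core1.store(Y, 1);
    if (use_barrier) {  // barrier between each core's store and its load
        core0.drain();
        core1.drain();
    }
    int r0 = core0.load(Y);
    int r1 = core1.load(X);
    return {r0, r1};
}
```

In this model, run_litmus(false) returns {0, 0}: both loads read shared memory while the other core's store is still sitting in its store buffer. run_litmus(true) returns {1, 1}, because each core waits for its own buffer to drain before loading. Nothing in the model is sent to the other core.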
My understanding is that interconnects generally don't introduce reordering themselves. So memory barriers just have to make things inside this core wait until earlier loads have completed and/or the store buffer drains.
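From the software side, that local waiting is all a barrier needs to provide. The sketch below is a standard message-passing pattern in portable C++, not anything from the ARM docs. The function name message_passing_demo and its variables are made up for illustration. On AArch64, compilers typically implement these fences with dmb ish (release) and dmb ishld (acquire), but that mapping is an assumption about common toolchains rather than a guarantee.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper: publish `value` from one thread to another and return
// what the consumer thread read.
int message_passing_demo(int value) {
    int payload = 0;                 // plain (non-atomic) data
    std::atomic<bool> ready{false};  // flag that publishes the data
    int seen = -1;

    std::thread producer([&] {
        payload = value;
        // Keeps the payload store ordered before the flag store.
        std::atomic_thread_fence(std::memory_order_release);
        ready.store(true, std::memory_order_relaxed);
    });

    std::thread consumer([&] {
        while (!ready.load(std::memory_order_relaxed)) {
            std::this_thread::yield();  // spin until the flag becomes visible
        }
        // Keeps the flag load ordered before the payload load.
        std::atomic_thread_fence(std::memory_order_acquire);
        seen = payload;
    });

    producer.join();
    consumer.join();
    assert(ready.load(std::memory_order_relaxed));  // sanity check: flag was published
    return seen;
}
```

Calling message_passing_demo(42) returns 42. Each fence only constrains the order in which its own thread's accesses become visible or are performed, and neither thread signals the other beyond the ordinary store to the flag.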
See also:
https://preshing.com/20120710/memory-barriers-are-like-source-control-operations/ - analogies for memory reorderings in terms of out-of-order accesses to a coherent shared state.
Can a speculatively executed CPU branch contain opcodes that access RAM? - store buffers decouple execution from committing stores to cache, allowing speculative execution among other benefits.
How does memory reordering help processors and compilers? - CPUs want to load early and store late.
Does a memory barrier ensure that the cache coherence has been completed? - No, cache stays coherent all the time, memory barriers just order the global visibility of this core's memory operations. Not just their execution order (in terms of actually running on execution units).
How is load->store reordering possible with in-order commit?
Can CPU Out-of-Order-Execution cause memory reordering?