Core A writes value x to its store buffer, waits for invalidation acks, and then flushes x to the cache. Does it wait for only one ack, or for acks from all other cores? And how does it know how many acks to expect across all the CPUs?
When does the CPU flush a value from the store buffer to the L1 cache?
Asked by Pengcheng
It isn't clear to me what you mean by "invalid ack", but let's assume you mean a snoop/invalidation originating from another core which is requesting ownership of the same line.
In this case, the stores in the store buffer are generally free to ignore such invalidations from other cores, since stores in the store buffer are not yet globally visible. A store only becomes globally visible when it commits to L1 at some point after it has retired. At that point1 the cache controller will make an RFO (request for ownership) for the associated line if it isn't already in the cache, and it is essentially at this point that the store becomes globally visible. The L1 cache controller doesn't need to know how many other invalidations are in flight, because they are mediated by some higher-level component in the system as part of the MESI protocol, and once the core gets the line in the E state, it is guaranteed to be the exclusive owner.
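The behavior described above can be sketched as a toy model (this is an illustration of the idea, not any real CPU's design): incoming invalidations hit the cache but leave the store buffer alone, and committing a store issues a single RFO to a mediating component before writing to L1.

```python
# Toy model: stores sit in a store buffer and ignore incoming invalidations;
# only at commit time does the core issue an RFO and write to its L1 line.

class Line:
    def __init__(self):
        self.state = "I"   # MESI state as seen by this core
        self.value = 0

class Core:
    def __init__(self, bus):
        self.bus = bus
        self.line = Line()
        self.store_buffer = []           # pending, not yet visible stores

    def store(self, value):
        self.store_buffer.append(value)  # buffered, not globally visible

    def invalidate(self):
        # A snoop from another core drops our cached copy, but the
        # store buffer is untouched: those stores are not visible yet.
        self.line.state = "I"

    def commit_one(self):
        value = self.store_buffer.pop(0)
        if self.line.state not in ("E", "M"):
            self.bus.rfo(self)           # one request; the bus handles the rest
        self.line.value = value
        self.line.state = "M"            # store is now globally visible

class Bus:
    """Stands in for the higher-level component that mediates ownership."""
    def __init__(self):
        self.cores = []

    def rfo(self, requester):
        for c in self.cores:
            if c is not requester:
                c.invalidate()
        requester.line.state = "E"       # requester is now exclusive owner

bus = Bus()
a, b = Core(bus), Core(bus)
bus.cores = [a, b]

a.store(42)          # sits in A's store buffer
b.invalidate()       # snoop hits A's cache state, not its store buffer
assert a.store_buffer == [42]
a.commit_one()       # RFO, then write to L1
assert a.line.state == "M" and a.line.value == 42
```

Note how the core never counts acks itself: it makes one RFO call and the bus object (standing in for the L3 or system agent) takes care of invalidating everyone else.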
In short, invalidations from other cores have little effect on stores in the store buffer2, since those stores become globally visible at a single point based on an RFO request. It is loads that have already executed that are more likely to be affected by invalidation activity on another core, especially on strongly ordered platforms such as x86, which doesn't allow visible load-load reordering. The so-called MOB (memory order buffer) on x86, for example, is responsible for tracking whether invalidations potentially break the ordering rules.
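The load-tracking idea can be sketched as well (a simplified assumption-laden model, not Intel's actual MOB design): a load that has executed but not yet retired records its address, and an invalidation hitting that address before retirement signals a potential ordering violation that forces a replay.

```python
# Hypothetical sketch of the idea behind a memory order buffer (MOB):
# executed-but-unretired loads are tracked so that a snoop invalidation
# arriving before retirement can trigger a replay, preserving the
# illusion of in-order loads on a strongly ordered machine.

class MOB:
    def __init__(self):
        self.inflight = {}      # addr -> value, for executed, unretired loads

    def execute_load(self, addr, memory):
        self.inflight[addr] = memory[addr]
        return self.inflight[addr]

    def snoop_invalidate(self, addr):
        # Another core took the line: any unretired load from it is suspect.
        return addr in self.inflight   # True => replay ("machine clear") needed

    def retire_load(self, addr):
        self.inflight.pop(addr, None)

mem = {0x10: 1}
mob = MOB()
mob.execute_load(0x10, mem)
assert mob.snoop_invalidate(0x10)      # ordering violation: must replay
mob.retire_load(0x10)
assert not mob.snoop_invalidate(0x10)  # retired loads are safe
```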
RFO Response
Perhaps the "acks" you were talking about are the responses from other cores to the writing core's request to obtain or upgrade its ownership of the line so that it can write to it: i.e., invalidating copies of the line in the other CPUs and so on.
This is commonly known as issuing an RFO, which, when successful, leaves the line in the E state in the requesting core.
Most CPU coherence designs are layered, with a variety of different agents working together to ensure coherency. In practice, this means that a core doesn't need to wait for up to N-1 "acks" from the other N-1 cores on an N-core system, but rather just a single reply from a higher-level component, which is itself in charge of sending requests to, and collecting responses from, the other cores.
One example would be a single-socket multi-core CPU with private L1 and L2 caches and a shared L3. A core might send its RFO down to the L3, which might send invalidate requests to all other cores, wait for their responses, and then acknowledge the RFO to the requesting core. Alternatively, the L3 may store some bits indicating which cores could possibly have a copy of the line, in which case it only needs to send requests to those cores (the role the L3 plays in that case is sometimes referred to as a snoop filter).
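The snoop-filter variant can be sketched like this (a hedged toy model of the idea, not any specific Intel design): the shared L3 tracks which cores may hold each line, so an RFO only generates invalidations to those cores rather than to all N-1 of them.

```python
# Toy snoop filter: the L3 keeps a per-line set of cores that may hold
# a copy, so an RFO invalidates only those cores, not everyone.

class SnoopFilterL3:
    def __init__(self):
        self.presence = {}          # line addr -> set of core ids that may hold it

    def record_fill(self, addr, core):
        # A core brought this line into its private cache.
        self.presence.setdefault(addr, set()).add(core)

    def rfo(self, addr, requester):
        # Invalidate only cores that might hold the line, then record
        # the requester as the sole owner. Returns who was snooped.
        targets = self.presence.get(addr, set()) - {requester}
        self.presence[addr] = {requester}
        return sorted(targets)

l3 = SnoopFilterL3()
l3.record_fill(0x40, core=0)
l3.record_fill(0x40, core=2)
assert l3.rfo(0x40, requester=1) == [0, 2]   # only cores 0 and 2 snooped
assert l3.rfo(0x40, requester=1) == []       # core 1 is now sole owner
```

From the requesting core's point of view, both variants look the same: one request down to the L3, one acknowledgment back.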
Since all communication between agents passes through the L3, it is able to keep everything consistent. In a multi-socket system, things get more complicated: the L3 on the local socket may again receive the request and pass it over to the other socket to do the same type of invalidation there. Again, there might be a snoop filter, or other mechanisms may exist, and the behavior may even be configurable!
For example, in Intel's Broadwell Xeon architecture, there are fully four different configurable snoop modes, each with different performance tradeoffs. The rest of that document goes into some detail about how the various modes work.
So I guess the short answer is "it's complicated and depends on the detailed design and possibly even user-configurable settings".
1 Or potentially at some earlier point since an optimized implementation might "look ahead" in the store buffer and issue RFOs (so-called "RFO prefetches") for upcoming stores even before they become the most senior store.
2 Invalidations may, however, complicate the RFO prefetches mentioned in the first footnote, since they mean there is a window where the line can be "stolen back" by another core, making the RFO prefetch wasted work. A sophisticated implementation may have a predictor that varies RFO prefetch aggressiveness based on monitoring whether this occurs.
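One way such a predictor could work (purely an illustrative assumption; no specific CPU is documented to do exactly this) is a saturating counter that backs off RFO prefetching when prefetched lines keep getting stolen back before their stores commit:

```python
# Illustrative saturating-counter predictor: throttle RFO prefetching
# when recent prefetched lines were stolen back before the store committed.

class RFOPrefetchPredictor:
    def __init__(self, threshold=2, maximum=3):
        self.counter = 0            # counts recent wasted prefetches
        self.threshold = threshold
        self.maximum = maximum

    def should_prefetch(self):
        return self.counter < self.threshold

    def line_stolen_back(self):
        # A prefetched line was lost to another core before commit.
        self.counter = min(self.counter + 1, self.maximum)

    def prefetch_useful(self):
        # The line survived until the store committed.
        self.counter = max(self.counter - 1, 0)

p = RFOPrefetchPredictor()
assert p.should_prefetch()
p.line_stolen_back(); p.line_stolen_back()
assert not p.should_prefetch()      # back off under contention
p.prefetch_useful()
assert p.should_prefetch()          # contention eased, resume prefetching
```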