What is the purpose of the `_mm_clevict` intrinsic and the corresponding clevict0/clevict1 instructions?


Intel® Intrinsics Guide says about _mm_clevict:

void _mm_clevict (const void * ptr, int level)
#include <immintrin.h>
Instruction: clevict0 m8
             clevict1 m8
CPUID Flags: KNCNI

Evicts the cache line containing the address ptr from cache level level (can be either 0 or 1).
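
For concreteness, a call looks something like this (a minimal sketch; the function name is made up, and it only compiles with a KNC toolchain such as icc -mmic, since KNCNI is not supported anywhere else):

#include <immintrin.h>

/* Minimal KNC-only sketch: evict the cache line containing p,
   first from the first-level cache (CLEVICT0), then from the
   second-level cache (CLEVICT1). */
void discard_line(const void *p)
{
    _mm_clevict(p, 0);   /* level 0 -> CLEVICT0 */
    _mm_clevict(p, 1);   /* level 1 -> CLEVICT1 */
}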

What could be the purpose of this operation? Is it different from _mm_cldemote?

John D McCalpin (accepted answer):

As far as I can tell, these instructions were added to the 1st generation Xeon Phi (Knights Corner, KNC) processors to help deal with some very specific performance issues for data motion through the cache hierarchy. It has been quite a while since I looked at the details, but my recollection is that there were some performance problems associated with cache victims, and that throughput was improved if the no-longer-needed lines were evicted from the caches before the cache miss that would cause an eviction.

Idea (1): This might have been due to memory bank conflicts on dirty evictions. E.g., consider what would happen if the address mapping made it too likely that the new item being loaded would be located in a DRAM bank that conflicted with the victim to be discarded. If there were not enough write buffers at the memory controller, the writeback might have to be committed to DRAM before the DRAM could switch banks to service the read. (Newer processors have lots and lots of write buffers in the memory controller, so this is not a problem, but this could have been a problem for KNC.)

Idea (2): Another possibility is that the cache victim processing could delay the read of the new value because of serialization at the Duplicate Tag Directories (DTDs). The coherence protocol was clearly a bit of a "hack" (so that Intel could use the existing P54C with minimal changes), but the high-level documentation Intel provided was not enough to understand the performance implications of some of the implementation details.

The CLEVICT instructions were "local" -- only the core executing the instruction performed the eviction. Dirty cache lines would be written out and locally invalidated, but the invalidation request would not be transmitted to other cores. The instruction set architecture documentation does not comment on whether the CLEVICT instruction results in an update message from the core to the DTD. (This would be necessary for idea (2) to make any change in performance.)

The CLDEMOTE instruction appears to be intended to reduce the latency of cache-to-cache transfers in producer-consumer situations. From the instruction description: "This may accelerate subsequent accesses to the line by other cores in the same coherence domain, especially if the line was written by the core that demotes the line." This is very similar to my patent https://patents.google.com/patent/US8099557B2/ "Push for sharing instruction" (developed while I was at AMD).
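
For comparison, a CLDEMOTE producer might look like the following minimal sketch (assuming a CPU and compiler with CLDEMOTE support, e.g. gcc -mcldemote; the buffer and flag names are made up for illustration):

#include <immintrin.h>
#include <stdatomic.h>

/* Hypothetical producer: fill one 64-byte cache line, then hint the
   hardware to demote the freshly written (dirty) line toward a shared
   cache level, so a consumer on another core gets a faster
   cache-to-cache transfer when it polls the flag and reads the data. */
void produce(int *shared_buf, atomic_int *ready)
{
    for (int i = 0; i < 16; i++)        /* 16 * 4 bytes = one 64-byte line */
        shared_buf[i] = i;
    _mm_cldemote(shared_buf);           /* demote the just-written line */
    atomic_store_explicit(ready, 1, memory_order_release);
}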

Peter Cordes:

Note that it's KNCNI, Knights Corner New Instructions, so that's the first-gen Xeon Phi compute cards, before Knights Landing. The Xeon Phi line evolved out of a GPU design, so it's maybe not surprising to see cache-control instructions.

Perhaps also relevant for interfacing with the host system, since the compute card's caches are not coherent with the host CPUs' caches. Although they might be coherent with PCIe access to the device's memory, just like x86 in general has cache-coherent DMA. (Also, evicting from only one cache level might still leave dirty data in the other if the outer cache isn't inclusive. If any manual coherence were needed before host reads of device memory, something like clflush would more likely be used.)
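
If such a manual flush were ever needed, it would look more like this minimal sketch with the ordinary CLFLUSH intrinsic (the function, buffer name, and fixed 64-byte line size are assumptions for illustration, not something from KNC documentation):

#include <immintrin.h>
#include <stddef.h>

/* Hypothetical: push every (possibly dirty) line of a buffer out to
   memory before an external reader looks at it, using plain CLFLUSH
   rather than CLEVICT. */
void flush_buffer(const void *buf, size_t nbytes)
{
    const char *p = (const char *)buf;
    for (size_t off = 0; off < nbytes; off += 64)   /* 64-byte cache lines */
        _mm_clflush(p + off);
    _mm_mfence();   /* order the flushes before whatever signals the reader */
}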

I don't know exactly why KNC had it, but there's no reason to ever expect it to appear in mainstream x86 CPUs. Not even KNL had KNCNI; KNL has AVX-512F + ER + PF instead. KNCNI was a total dead-end instruction-set extension that isn't present in any later CPUs.


It might well be a similar idea to cldemote when used on dirty data, but on clean data it would let you discard data after you've finished reading it. (Recall that KNC was fully in-order, based on the P54C (Pentium) dual-issue in-order microarchitecture, so you actually know in program order when you're done accessing a cache line, unlike KNL, which was based on Silvermont.)
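
A rough sketch of that read-and-discard pattern (KNC-only; the function is hypothetical and assumes data is 64-byte aligned):

#include <immintrin.h>
#include <stddef.h>

/* Hypothetical KNC-only sketch: stream through a large read-only array,
   evicting each 64-byte line from L1 and L2 once it has been consumed,
   so clean data that will never be touched again doesn't stick around
   to become a victim later. */
long sum_and_discard(const int *data, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += data[i];
        if ((i + 1) % 16 == 0) {          /* finished a 64-byte line (16 ints) */
            _mm_clevict(&data[i], 0);     /* evict from L1 */
            _mm_clevict(&data[i], 1);     /* evict from L2 */
        }
    }
    return sum;
}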

Managing cache by manually evicting data you know you don't need to read anymore is my best guess.