On Skylake (SKL) why are there L2 writebacks in a read-only workload that exceeds the L3 size?


Consider the following simple code:

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

#include <err.h>

int cpu_ms() {
    return (int)(clock() * 1000 / CLOCKS_PER_SEC);
}

int main(int argc, char** argv) {
    if (argc < 2) errx(EXIT_FAILURE, "provide the array size in KB on the command line");

    size_t size = atol(argv[1]) * 1024;
    unsigned char *p = malloc(size);
    if (!p) errx(EXIT_FAILURE, "malloc of %zu bytes failed", size);

    int fill = argv[2] ? argv[2][0] : 'x'; 
    memset(p, fill, size);

    int startms = cpu_ms();
    printf("allocated %zu bytes at %p and set it to %d in %d ms\n", size, p, fill, startms);

    // wait until 500ms has elapsed from start, so that perf gets the read phase
    while (cpu_ms() - startms < 500) {}
    startms = cpu_ms();

    // we start measuring with perf here
    unsigned char sum = 0;
    for (size_t off = 0; off < 64; off++) {
        for (size_t i = 0; i < size; i += 64) {
            sum += p[i + off];
        }
    }

    int delta = cpu_ms() - startms;
    printf("sum was %u in %d ms \n", sum, delta);

    return EXIT_SUCCESS;
}

This allocates an array of size bytes (which is passed in on the command line, in KiB), sets all bytes to the same value (the memset call), and finally loops over the array in a read-only manner, striding by one cache line (64 bytes), and repeats this 64 times so that each byte is accessed once.

If we turn prefetching off[1], we expect this to hit 100% in a given level of cache if size fits in that level, and to mostly miss at that level otherwise.

I'm interested in two events, l2_lines_out.silent and l2_lines_out.non_silent (and also l2_trans.l2_wb, but its values end up identical to non_silent), which count lines evicted from L2 silently and non-silently, respectively.

If we run this from 16 KiB up through 1 GiB, and measure these two events (plus l2_lines_in.all) for the final loop only, we get:

[Figure: L2 lines in/out]

The y-axis here is the number of events, normalized to the number of accesses in the loop. For example, the 16 KiB test allocates a 16 KiB region, and makes 16,384 accesses to that region, and so a value of 0.5 means that on average 0.5 counts of the given event occurred per access.

The l2_lines_in.all behaves almost as we'd expect. It starts off around zero and when the size exceeds the L2 size it goes up to 1.0 and stays there: every access brings in a line.

The other two lines behave weirdly. In the region where the test fits in the L3 (but not in the L2), the evictions are nearly all silent. However, as soon as the region grows beyond the L3 and spills into main memory, the evictions are all non-silent.

What explains this behavior? It's hard to understand why the evictions from L2 would depend on whether the underlying region fits in the L3.

If you do stores instead of loads, almost everything is a non-silent writeback, as expected, since the updated values have to be propagated to the outer caches:

[Figure: stores]

We can also take a look at what level of the cache the accesses are hitting in, using the mem_inst_retired.l1_hit and related events:

[Figure: cache hit ratios]

If you ignore the L1 hit counters, which seem impossibly high at a couple of points (more than one L1 hit per access?), the results look more or less as expected: mostly L2 hits when the region fits cleanly in L2, mostly L3 hits for the L3 region (up to 6 MiB on my CPU), and then misses to DRAM thereafter.

You can find the code on GitHub. The details on building and running can be found in the README file.

I observed this behavior on my Skylake client i7-6700HQ CPU. The same effect seems not to exist on Haswell[2]. On Skylake-X the behavior is totally different, as expected, since the L3 cache design has changed to be something like a victim cache for the L2.


[1] You can do it on recent Intel with wrmsr -a 0x1a4 "$((2#1111))". In fact, the graph is almost exactly the same with prefetch on, so turning it off is mostly just to eliminate a confounding factor.

[2] See the comments for more details, but briefly: l2_lines_out.(non_)silent doesn't exist there, but l2_lines_out.demand_(clean|dirty) does, which seems to have a similar definition. More importantly, l2_trans.l2_wb, which mostly mirrors non_silent on Skylake, also exists on Haswell and appears to mirror demand_dirty, and it too does not exhibit the effect on Haswell.
