Why does a cache hit take more time than a cache miss?

I want to flush cache lines in C or C++. My code is as follows, and my gcc version is 9.3.0.

#include <chrono>
#include <cmath>
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <iostream>

#include <x86intrin.h>


void clear_cache(char *addr, int size) {
        int round = (size-1) / 64 + 1;
        for (int i = 0; i < round; i++) {
                _mm_clflush(addr);
                addr += 64;
        }
}

float time_test_cache_hit(char * mem_block, size_t len) {
        char tmp;
        std::chrono::high_resolution_clock::time_point tmp1, tmp2;
        tmp1 = std::chrono::high_resolution_clock::now();
        for (int idx = 0; idx < len; idx ++) {
                tmp = *(mem_block + idx);
        }
        tmp2 = std::chrono::high_resolution_clock::now();
        return ((std::chrono::duration<float>)(tmp2 - tmp1)).count();
}

float time_test_cache_miss(char * mem_block, size_t len) {
        char tmp;

        clear_cache(mem_block, sizeof(char)*len);

        std::chrono::high_resolution_clock::time_point tmp1, tmp2;
        tmp1 = std::chrono::high_resolution_clock::now();
        for (int idx = 0; idx < len; idx ++) {
                tmp = *(mem_block + idx);
        }
        tmp2 = std::chrono::high_resolution_clock::now();
        return ((std::chrono::duration<float>)(tmp2 - tmp1)).count();
}


uint64_t tsc_test_cache_hit(char * mem_block, size_t len) {
        char tmp;

        for (int i = 0; i < len; i ++) {
                *(mem_block + i) = i;
        }

        uint64_t time1, time2;
         time1 = __rdtsc();
        for (int idx = 0; idx < len; idx ++) {
                tmp = *(mem_block + idx);
        }
         time2 = __rdtsc();
        return time2 - time1;
}

uint64_t tsc_test_cache_miss(char * mem_block, size_t len) {
        char tmp;

        clear_cache(mem_block, sizeof(char)*len);

        uint64_t time1, time2;
         time1 = __rdtsc();
        for (int idx = 0; idx < len; idx ++) {
                tmp = *(mem_block + idx);
        }
         time2 = __rdtsc();
        return time2 - time1;
}

int main(int argc, char ** argv) 
{

        int len = 100;

        char* mem_block = (char*) malloc(sizeof(char)*len);

        for (int i = 0; i < len; i ++) {
                *(mem_block + i) = i;
        }

        std::cout << "cache hit time: " << time_test_cache_hit(mem_block, len) << "\ncache miss time: " << time_test_cache_miss(mem_block, len) << std::endl;
        std::cout << "cache hit tsc: " << tsc_test_cache_hit(mem_block, len) << "\ncache miss tsc: " << tsc_test_cache_miss(mem_block, len) << std::endl;

        free(mem_block);

        return 0;
}

My cache line size is 64 bytes. My CPU cache sizes are as follows:

L1d cache:                       1.3 MiB
L1i cache:                       1.3 MiB
L2 cache:                        40 MiB
L3 cache:                        55 MiB

I ran this code many times, and the results seem quite weird:

cache hit time: 1.224e-06
cache miss time: 9.07e-07
cache hit tsc: 1672
cache miss tsc: 2114

If I change the clear_cache function as follows:

void clear_cache(char *addr, int len) {
        // Flush byte by byte: each cache line gets flushed many times over.
        for (int idx = 0; idx < len; idx ++) {
                _mm_clflush(addr + idx);
        }
}

and change the call from clear_cache(mem_block, sizeof(char)*len) to clear_cache(mem_block, len), the program's behavior seems to make sense. The results look like this:

cache hit time: 1.195e-06
cache miss time: 3.411e-06
cache hit tsc: 1580
cache miss tsc: 8896

As shown above, my original code does not give a sensible wall-clock result: the cache hit takes more time than the cache miss. Yet both versions show that a cache miss takes more TSC (time stamp counter) ticks than a cache hit, although the miss TSC values differ greatly between the two versions. Why does the code behave this way? Is there something wrong with my code? Many thanks for the help.

There is 1 best solution below

I think you are trying to microbenchmark something with the wrong tools. The code you are measuring runs too quickly to be timed with system clocks; you should use other tools (details: here). The overhead of the measurement itself, plus possible OS-related overheads, is of the same order of magnitude as the runtime of your code.
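To make the point about measurement overhead concrete, here is a minimal sketch, not a full benchmarking harness. The helper names (timer_overhead_ns, sum_bytes, ns_per_pass) are hypothetical and not part of the question's code. It estimates the cost of two back-to-back clock reads, reads the buffer through a volatile pointer so the compiler cannot drop the loop as dead code, and times many passes inside one start/stop pair so the clock overhead is amortized.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <limits>

using Clock = std::chrono::steady_clock;

// Smallest observed gap, in nanoseconds, between two back-to-back clock reads.
// A single start/stop measurement cannot resolve anything shorter than this.
inline std::int64_t timer_overhead_ns(int reps) {
    assert(reps > 0);
    std::int64_t best = std::numeric_limits<std::int64_t>::max();
    for (int r = 0; r < reps; ++r) {
        auto t0 = Clock::now();
        auto t1 = Clock::now();
        auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
        best = std::min<std::int64_t>(best, ns);
    }
    return best;
}

// Reads every byte through a volatile pointer, so the loads must really happen.
// (An optimizing compiler may delete a plain `tmp = buf[i]` loop whose result is unused.)
inline unsigned sum_bytes(const char* buf, std::size_t len) {
    const volatile char* p = buf;
    unsigned sum = 0;
    for (std::size_t i = 0; i < len; ++i) {
        sum += static_cast<unsigned char>(p[i]);
    }
    return sum;
}

// Average nanoseconds per pass over the buffer. All passes share one
// start/stop pair, so the clock overhead is divided by `passes`.
inline double ns_per_pass(const char* buf, std::size_t len, int passes) {
    assert(passes > 0);
    unsigned sink = 0;
    auto t0 = Clock::now();
    for (int k = 0; k < passes; ++k) {
        sink += sum_bytes(buf, len);
    }
    auto t1 = Clock::now();
    volatile unsigned keep = sink;  // keep the accumulated result observable
    (void)keep;
    return std::chrono::duration<double, std::nano>(t1 - t0).count() / passes;
}
```

Comparing timer_overhead_ns(1000) with ns_per_pass(mem_block, 100, 1000) shows whether one timed pass over 100 bytes is even distinguishable from the cost of reading the clock. Note that this amortized loop only measures the cache-hit case, because every pass after the first finds the data already cached; timing the miss case would need a flush before each pass plus separate accounting for the cost of the flush.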