Microbenchmarking C code and cache effects


I'm on macOS Sonoma on an M1 Pro, using clang 15 (clang-1500.0.40.1) without any (explicit) optimization when compiling. (EDIT: With -O3 there are no meaningful differences I can observe between the various versions. The last version seems to be slightly faster compared to previous runs of the same version, though.)

I wrote a small function to shuffle an array and I would like to test its performance. In order to do so, I wrote the following function:

void test_performance() {
    // initialize only once to measure performance
    int ns[] = {1, 2, 3, 4, 5};
    int ops = 1E6;
    int start = clock();
    for (int i = 0; i < ops; i++) {
        shuffle(ns, 5);
    }
    int end = clock();
    double s_total = ((double) (end - start)) / CLOCKS_PER_SEC;
    double ns_op = s_total / ops * 1E9;
    printf("test_performance: %fns/op (%fs total) \n", ns_op, s_total);
}

On my machine, this yields ~30ns/op.

I then realized that the loop overhead was included in the measurement. I thought it was probably negligible, but figured I could get rid of that small amount of noise, so I changed the code to call clock() inside the loop and add only the time actually spent in shuffle to an accumulator, resulting in the following code.

void test_performance() {
    // initialize only once to measure performance
    int ns[] = {1, 2, 3, 4, 5};
    int ops = 1E6;
    int start, end, ticks = 0;
    for (int i = 0; i < ops; i++) {
        start = clock();
        shuffle(ns, 5);
        end = clock();
        ticks += (end - start);
    }
    double s_total = ((double) ticks) / CLOCKS_PER_SEC;
    double ns_op = s_total / ops * 1E9;
    printf("test_performance: %fns/op (%fs total) \n", ns_op, s_total);
}

The program overall ran visibly slower, which is not that surprising, given that there are now a couple million more calls to clock(). What I did find surprising, however, is that the reported time per operation increased by an order of magnitude, to around ~330ns/op.

First question: what could cause this massive difference?

My best guess is that this has to do with modern hardware and caching, and the fact that clock() possibly involves a system call: the code for shuffle would always be in a CPU cache in one case and constantly evicted in the other. But I'm not really sure whether that's the case or if there's something more obvious I'm missing.

Second question: if that's the case, what are good practices for microbenchmarking in C and evaluating the impact of caching? I'm primarily interested in writing relatively robust benchmarks without requiring specialized tooling, but I'd also be happy to hear about microbenchmarking tools used by the C community.

EDIT

Interestingly, I tried to use clock_gettime, which also lets you explicitly ask for a monotonically increasing time, ending up with the following:

void test_performance() {
    struct timespec start_time, end_time;
    // initialize only once to measure performance
    int ns[] = {1, 2, 3, 4, 5};
    int ops = 1E6;
    double s_total = 0;
    for (int i = 0; i < ops; i++) {
        if (clock_gettime(CLOCK_MONOTONIC, &start_time) != 0) {
            perror("clock_gettime");
            return;
        }
        shuffle(ns, 5);
        if (clock_gettime(CLOCK_MONOTONIC, &end_time) != 0) {
            perror("clock_gettime");
            return;
        }
        s_total += (end_time.tv_sec - start_time.tv_sec) + (end_time.tv_nsec - start_time.tv_nsec) / 1.0e9;
    }
    double ns_op = s_total / ops * 1E9;
    printf("test_performance: %fns/op (%fs total) \n", ns_op, s_total);
}

Interestingly, this code reported times of around ~25ns/op. When I put the time measurements around the whole loop instead of measuring individual calls, the reported times went back to ~30ns/op, as in my first experiment using clock().

I'd love to understand why clock_gettime doesn't seem to cause the same penalty as clock().

2 Answers

Answer 1

First question: what could cause this massive difference?

Hard to tell, because benchmarking with optimizations disabled makes little to no sense; you would have to look at the generated machine code to understand things better.

Performing a system call and breaking the loop twice per iteration means executing a lot of extra instructions you don't actually want to time, which can easily trash the processor pipeline and therefore hurt performance significantly. A compiler might or might not be able to reorder instructions to mitigate this. In your case it probably didn't (which makes sense, since you told it you don't care about optimizations).

It could also cause more rescheduling by the OS, because syscall entry/exit are common scheduler entry points where your task can be set aside for others to run, without needing kernel preemption (as would be required to interrupt a tight userspace loop with no syscalls).

I'd love to understand why clock_gettime doesn't seem to cause the same penalty as clock().

See the answer here: Does gettimeofday() on macOS use a system call?

clock_gettime(CLOCK_MONOTONIC) is likely just a wrapper around gettimeofday, which takes a fast path via userspace and (most of the time) does not need a context switch. This is a common optimization for simple, frequently used syscalls.

I don't know much about how macOS implements its "commpage" used to perform some common syscalls without context switch, but I suppose it's pretty similar to how Linux implements the "vDSO".

Answer 2

Possibility: truncated clock measurement and/or int overflow.

int start = clock(); may truncate the measurement if clock_t is wider than int. Since the 2nd version takes many more measurements than the 1st, the odds of a pair of readings (one truncated, the other not) producing a wrong difference increase.

The difference and the running sum may also overflow. Use a wide type.

Use:

long long ticks = 0;
...
  clock_t start = clock();
  ...
  clock_t end = clock();
  ticks += (long long) end - start;
...