Background:
I wrote a function in C and compiled it with arm-none-eabi-gcc (7-2018-q2-update).
The generated assembly code for the loop body looks like it should take 20 cycles per iteration,
including 2 wait states for load operations accessing constant data from non-volatile program memory.
However, the datasheet says the NVM controller cache on my MCU has a guaranteed miss penalty of zero, so I don't see why it can't prefetch the data for the two NVM load operations. I therefore expect the loop to take 18 cycles per iteration.
Unfortunately, the measured performance is quite different from the expected performance.
If I change int8_t increment and int16_t patch_data_i so that both are int32_t,
then GCC generates effectively the same instructions in a slightly different order.
Let's call this version (b).
The interesting thing is that version (a) takes 21 cycles per iteration and version (b) takes 20 cycles per iteration!
This performance difference is highly repeatable. I have measured it precisely by varying the number of iterations over (5, 6, 7, 8) for both versions. Timing measurements are from a Tektronix 465 oscilloscope at a fixed B sweep setting:
T(a)[min, max, avg] = (21.0, 22.0, 21.3) c @ 48 MHz.
T(b)[min, max, avg] = (20.0, 21.0, 20.3) c @ 48 MHz.
(Performance of this loop body is crucial: it executes 8 iterations, and this function is called once every 2000 clock cycles. For my application, even this single-cycle difference amounts to roughly 0.4 percent of total CPU time.)
My question has 4 parts:
- What is going on here?
- Why is it that version (a) takes 21 cycles and version (b) takes 20 cycles?
- Why don't both versions take 18 cycles?
- Is there any possible way to accurately predict the access latency for RAM and NVM on an Atmel SAMD21 microcontroller, other than trying random permutations of assembly operations and measuring everything on an oscilloscope?
(An answer to any one of these four parts would be greatly appreciated.)
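On the last part: the Cortex-M0+ has no DWT cycle counter, but its 24-bit SysTick down-counter can time short sequences in software instead of on a scope. This is only a sketch: SYSTICK_VAL below is a stand-in variable for the memory-mapped SysTick current-value register (0xE000E018), so the wraparound arithmetic can be shown on its own.

```c
#include <stdint.h>

/* Stand-in for the memory-mapped SysTick current-value register
   (*(volatile uint32_t *)0xE000E018 on a real Cortex-M0+). */
static uint32_t SYSTICK_VAL;

/* SysTick counts DOWN and wraps at 24 bits, so elapsed cycles are
   (start - end) modulo 2^24. Valid only for intervals shorter than
   one reload period. */
static uint32_t elapsed_cycles(uint32_t start, uint32_t end)
{
    return (start - end) & 0x00FFFFFFu;
}
```

On the target, one would set the reload value to 0x00FFFFFF, read SysTick->VAL before and after the measured code, and pass the two readings to elapsed_cycles(); the subtraction order handles the down-counting and the mask handles a single wrap.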
Source code (version a)
__attribute__((used))
void enc_calc_transition(struct enc *enc, uint16_t old_state, uint16_t new_state)
{
uint32_t transitions = enc_interleave_states(old_state, new_state);
size_t j = 0;
for (size_t i = 0; i < 8; i++, j += 4) {
const size_t transition = (transitions >> j) & 0xf;
const int8_t increment = enc_increment_for_transition[transition];
int16_t patch_data_i = enc->patch_data[i];
patch_data_i += increment;
size_t patch_data_i_plus_one = patch_data_i + 1;
patch_data_i = enc_constrain_8x258[patch_data_i_plus_one];
enc->patch_data[i] = patch_data_i;
}
}
Source code (version b)
__attribute__((used))
void enc_calc_transition(struct enc *enc, uint16_t old_state, uint16_t new_state)
{
uint32_t transitions = enc_interleave_states(old_state, new_state);
size_t j = 0;
for (size_t i = 0; i < 8; i++, j += 4) {
const size_t transition = (transitions >> j) & 0xf;
const int32_t increment = enc_increment_for_transition[transition];
int32_t patch_data_i = enc->patch_data[i];
patch_data_i += increment;
size_t patch_data_i_plus_one = patch_data_i + 1;
patch_data_i = enc_constrain_8x258[patch_data_i_plus_one];
enc->patch_data[i] = patch_data_i;
}
}
Generated assembly (version a)
cyc addr code instr fields
x 894e: 2200 movs r2, #0
x 8950: 250f movs r5, #15
x 8952: 4f09 ldr r7, [pc, #36] ; (8978)
x 8954: 4e09 ldr r6, [pc, #36] ; (897c)
x 8956: 4460 add r0, ip
1 8958: 000b movs r3, r1
1 895a: 40d3 lsrs r3, r2
1 895c: 402b ands r3, r5
2 895e: 7804 ldrb r4, [r0, #0]
2 8960: 56f3 ldrsb r3, [r6, r3]
1 8962: 3204 adds r2, #4
1 8964: 191b adds r3, r3, r4
1 8966: 18fb adds r3, r7, r3
2 8968: 785b ldrb r3, [r3, #1]
2 896a: 7003 strb r3, [r0, #0]
1 896c: 3001 adds r0, #1
1 896e: 2a20 cmp r2, #32
2 8970: d1f2 bne.n 8958 <enc_calc_transition+0x38>
18
x 8972: bdf0 pop {r4, r5, r6, r7, pc}
x 8974: 000090a8 ; <enc_expand_16x256>
x 8978: 00008fa4 ; <enc_constrain_8x258>
x 897c: 00008f94 ; <enc_increment_for_transition> [signed, 8x16]
instruction cycles:
movs lsrs ands ldrb ldrsb adds adds adds ldrb strb adds cmp bne
= 1 + 1 + 1 + 2 + 2 + 1 + 1 + 1 + 2 + 2 + 1 + 1 + 2
= 18
Generated assembly (version b)
cyc addr code instr fields
x 894e: 2200 movs r2, #0
x 8950: 250f movs r5, #15
x 8952: 4f09 ldr r7, [pc, #36] ; (8978)
x 8954: 4e09 ldr r6, [pc, #36] ; (897c)
x 8956: 4460 add r0, ip
1 8958: 0021 movs r1, r4
1 895a: 40d1 lsrs r1, r2
2 895c: 7803 ldrb r3, [r0, #0]
1 895e: 4029 ands r1, r5
2 8960: 5671 ldrsb r1, [r6, r1]
1 8962: 18fb adds r3, r7, r3
1 8964: 185b adds r3, r3, r1
2 8966: 785b ldrb r3, [r3, #1]
1 8968: 3204 adds r2, #4
2 896a: 7003 strb r3, [r0, #0]
1 896c: 3001 adds r0, #1
1 896e: 2a20 cmp r2, #32
2 8970: d1f2 bne.n 8958
18
x 8972: bdf0 pop {r4, r5, r6, r7, pc}
x 8974: 000090a8 ; <enc_expand_16x256>
x 8978: 00008fa4 ; <enc_constrain_8x258>
x 897c: 00008f94 ; <enc_increment_for_transition> [signed, 8x16]
instruction cycles:
movs lsrs ldrb ands ldrsb adds adds ldrb adds strb adds cmp bne
= 1 + 1 + 2 + 1 + 2 + 1 + 1 + 2 + 1 + 2 + 1 + 1 + 2
= 18
My interpretation of the generated assembly (version a)
I have written out my interpretation of the generated assembly for each version.
This section might be unnecessary, but I thought I might as well include it, since it helped me understand the differences between (a) and (b).
As above, the portions before and after the loop are identical.
The only significant difference I can see is that the two versions execute the same instructions in a slightly different order.
In particular, version (b) (which takes 20 cycles per iteration),
has zero instances of consecutive load/store operations,
zero instances of consecutive load/load operations,
and zero instances of consecutive store/store operations.
(The documented number of wait states for each load operation is commented in brackets: 1 wait state would be indicated by // ^ ldrb [1].)
r2 size_t j = 0;
r5 uint32_t mask_0xf = 0xf;
r7 uint8_t *constrain = &enc_constrain_8x258[0]; // 0x8fa4
r6 uint8_t *increment_for_transition =
&enc_increment_for_transition[0]; // 0x8f94
r0 uint8_t *patch_data = &enc->patch_data[0]
do {
r3 uint32_t _transitions = transitions;
r3 uint32_t transitions_shifted = _transitions >> j;
r3 size_t transition = transitions_shifted & mask_0xf;
r4 int16_t patch_data_i = *(patch_data + 0);
// ^ ldrb [0]
r3 int8_t _increment = *(increment_for_transition + transition);
// ^ ldrsb [1]
j += 4;
r3 int16_t inc_plus_pdata = _increment + patch_data_i;
r3 uint8_t *constrain_plus_inc_plus_pdata =
constrain + inc_plus_pdata;
r3 uint8_t constrained_pdata = *(constrain_plus_inc_plus_pdata + 1);
// ^ ldrb [1]
*(patch_data + 0) = constrained_pdata;
// ^ strb [0]
patch_data++;
} while (j < 32);
My interpretation of the generated assembly (version b)
r2 size_t j = 0;
r5 uint32_t mask_0xf = 0xf;
r7 uint8_t *constrain = &enc_constrain_8x258[0]; // 0x8fa4
r6 uint8_t *increment_for_transition =
&enc_increment_for_transition[0]; // 0x8f94
r0 uint8_t *patch_data = &enc->patch_data[0]
do {
r1 uint32_t _transitions = transitions;
r1 uint32_t transitions_shifted = _transitions >> j;
r3 int32_t patch_data_i = *(patch_data + 0);
// ^ ldrb [0]
r1 size_t transition = transitions_shifted & mask_0xf;
r1 int32_t _increment = *(increment_for_transition + transition);
// ^ ldrsb [1]
r3 uint8_t *constrain_plus_pdata = constrain + patch_data_i;
r3 uint8_t *constrain_plus_pdata_plus_inc =
constrain_plus_pdata + _increment;
r3 uint8_t constrained_pdata = *(constrain_plus_pdata_plus_inc + 1);
// ^ ldrb [1]
j += 4;
*(patch_data + 0) = constrained_pdata;
// ^ strb [0]
patch_data++;
} while (j < 32);
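For reference, here is a host-runnable reconstruction of what enc_constrain_8x258 presumably is, given the +1 index bias visible in both interpretations (the ldrb r3, [r3, #1] in the assembly): a 258-entry saturation table covering patch_data_i + increment + 1 in [0, 257], which implies increments in [-1, +1]. The table contents are an assumption, not taken from the original source.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical reconstruction of the 258-entry clamp table:
   256 byte values plus one guard entry on each side, indexed with
   the +1 bias seen in the generated `ldrb r3, [r3, #1]`. */
static uint8_t enc_constrain_8x258[258];

static void build_constrain(void)
{
    for (int i = 0; i < 258; i++) {
        int v = i - 1; /* undo the +1 bias; v spans [-1, 256] */
        enc_constrain_8x258[i] = (uint8_t)(v < 0 ? 0 : v > 255 ? 255 : v);
    }
}
```

With such a table, enc_constrain_8x258[patch_data_i + increment + 1] clamps the updated value into [0, 255] without any branches, which matches the branch-free loop body GCC emitted.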
Platform information
The microcontroller is the Atmel/Microchip ATSAMD21G18A.
The architecture is ARMv6-M.
The microarchitecture is ARM Cortex-M0+.
The master clock frequency of my MCU core is 48 MHz.
At 48 MHz, the SAMD21 [non-volatile] program memory requires 1 wait state if the cache is disabled.
At 48 MHz, the SAMD21 SRAM requires zero wait states.
However, I don't see any reason why executing the code from RAM would be faster. I believe the NVM data path is separate from the RAM data path, so instruction fetches from NVM should never contend with data fetches from RAM (I'm not 100% sure about this, but I think it's true). Therefore, if the NVM controller cache is working as documented, running this loop from NVM should almost certainly be faster than running it from RAM.
- The SAMD21 has a 64-byte cache for accesses to non-volatile memory.
- The NVM controller cache "is a direct-mapped cache that implements 8 lines of 64 bits (i.e., 64 bytes)."
- The NVM controller cache is enabled, in NO_MISS_PENALTY mode.
- This is the datasheet description of NO_MISS_PENALTY mode: "The NVM controller (cache system) does not insert wait states on a cache miss. Gives the best system performance."
- The datasheet does not provide any more information about NO_MISS_PENALTY mode.
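Assuming the usual layout for a direct-mapped cache with 8 lines of 64 bits (8 bytes per line, 64 bytes total), the line index would be address bits [5:3]. This is an inference from the datasheet's wording, not something it states:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical line-index computation for a direct-mapped cache of
   8 lines x 8 bytes: bits [2:0] select the byte within a line and
   bits [5:3] select the line. */
static uint32_t nvm_cache_line(uint32_t addr)
{
    return (addr >> 3) & 0x7u;
}
```

Under this assumed layout, the increment table at 0x8f94 starts on line 2 and the constrain table at 0x8fa4 starts on line 4, so one could at least reason about which lookups share a cache line.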
Cortex-M0+ uses a von Neumann architecture: instruction fetches always contend with data accesses, whether in zero-wait-state SRAM or in flash.