Background:
I wrote a function in C and compiled it with arm-none-eabi-gcc (7-2018-q2-update).
The generated assembly code for the loop body looks like it should take 20 cycles per iteration,
including 2 wait states for load operations accessing constant data from non-volatile program memory.
However, the datasheet says the NVM controller cache on my MCU has a guaranteed miss penalty of zero, so I don't see why it can't prefetch the data for the two NVM load operations. I therefore expect the loop to take 18 cycles per iteration.
Unfortunately, the measured performance is quite different from the expected performance.
If I change int8_t increment and int16_t patch_data_i so that both are int32_t,
then GCC generates effectively the same instructions in a slightly different order.
Let's call this version (b).
The interesting thing is that version (a) takes 21 cycles per iteration and version (b) takes 20 cycles per iteration!
This performance difference is highly repeatable. I have measured it precisely by varying the number of iterations over (5, 6, 7, 8) for both versions. Timing measurements are from a Tektronix 465 oscilloscope at a fixed B sweep setting:
T(a)[min, max, avg] = (21.0, 22.0, 21.3) c @ 48 MHz.
T(b)[min, max, avg] = (20.0, 21.0, 20.3) c @ 48 MHz.
(Performance of this loop body is crucial: it executes 8 iterations, and this function is called once every 2000 clock cycles. For my application, even this single-cycle difference amounts to roughly 0.4 percent of total CPU time.)
My question has 4 parts:
- What is going on here?
- Why is it that version (a) takes 21 cycles and version (b) takes 20 cycles?
- Why don't both versions take 18 cycles?
- Is there any possible way to accurately predict the access latency for RAM and NVM on an Atmel SAMD21 microcontroller, other than trying random permutations of assembly operations and measuring everything on an oscilloscope?
(An answer to any one of these four parts would be greatly appreciated.)
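On the last part: the Cortex-M0+ has no DWT cycle counter, but its 24-bit SysTick down-counter can time short sequences in software instead of on a scope. This is only a sketch: SYSTICK_VAL below is a stand-in variable for the memory-mapped SysTick current-value register (0xE000E018), so the wraparound arithmetic can be shown on its own.

```c
#include <stdint.h>

/* Stand-in for the memory-mapped SysTick current-value register
   (*(volatile uint32_t *)0xE000E018 on a real Cortex-M0+). */
static uint32_t SYSTICK_VAL;

/* SysTick counts DOWN and wraps at 24 bits, so elapsed cycles are
   (start - end) modulo 2^24. Valid only for intervals shorter than
   one reload period. */
static uint32_t elapsed_cycles(uint32_t start, uint32_t end)
{
    return (start - end) & 0x00FFFFFFu;
}
```

On the target, one would set the reload value to 0x00FFFFFF, read SysTick->VAL before and after the measured code, and pass the two readings to elapsed_cycles(); the subtraction order handles the down-counting and the mask handles a single wrap.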
Source code (version a)
__attribute__((used))
void enc_calc_transition(struct enc *enc, uint16_t old_state, uint16_t new_state)
{
uint32_t transitions = enc_interleave_states(old_state, new_state);
size_t j = 0;
for (size_t i = 0; i < 8; i++, j += 4) {
const size_t transition = (transitions >> j) & 0xf;
const int8_t increment = enc_increment_for_transition[transition];
int16_t patch_data_i = enc->patch_data[i];
patch_data_i += increment;
size_t patch_data_i_plus_one = patch_data_i + 1;
patch_data_i = enc_constrain_8x258[patch_data_i_plus_one];
enc->patch_data[i] = patch_data_i;
}
}
Source code (version b)
__attribute__((used))
void enc_calc_transition(struct enc *enc, uint16_t old_state, uint16_t new_state)
{
uint32_t transitions = enc_interleave_states(old_state, new_state);
size_t j = 0;
for (size_t i = 0; i < 8; i++, j += 4) {
const size_t transition = (transitions >> j) & 0xf;
const int32_t increment = enc_increment_for_transition[transition];
int32_t patch_data_i = enc->patch_data[i];
patch_data_i += increment;
size_t patch_data_i_plus_one = patch_data_i + 1;
patch_data_i = enc_constrain_8x258[patch_data_i_plus_one];
enc->patch_data[i] = patch_data_i;
}
}
Generated assembly (version a)
cyc addr code instr fields
x 894e: 2200 movs r2, #0
x 8950: 250f movs r5, #15
x 8952: 4f09 ldr r7, [pc, #36] ; (8978)
x 8954: 4e09 ldr r6, [pc, #36] ; (897c)
x 8956: 4460 add r0, ip
1 8958: 000b movs r3, r1
1 895a: 40d3 lsrs r3, r2
1 895c: 402b ands r3, r5
2 895e: 7804 ldrb r4, [r0, #0]
2 8960: 56f3 ldrsb r3, [r6, r3]
1 8962: 3204 adds r2, #4
1 8964: 191b adds r3, r3, r4
1 8966: 18fb adds r3, r7, r3
2 8968: 785b ldrb r3, [r3, #1]
2 896a: 7003 strb r3, [r0, #0]
1 896c: 3001 adds r0, #1
1 896e: 2a20 cmp r2, #32
2 8970: d1f2 bne.n 8958 <enc_calc_transition+0x38>
18
x 8972: bdf0 pop {r4, r5, r6, r7, pc}
x 8974: 000090a8 ; <enc_expand_16x256>
x 8978: 00008fa4 ; <enc_constrain_8x258>
x 897c: 00008f94 ; <enc_increment_for_transition> [signed, 8x16]
instruction cycles:
movs lsrs ands ldrb ldrsb adds adds adds ldrb strb adds cmp bne
= 1 + 1 + 1 + 2 + 2 + 1 + 1 + 1 + 2 + 2 + 1 + 1 + 2
= 18
Generated assembly (version b)
cyc addr code instr fields
x 894e: 2200 movs r2, #0
x 8950: 250f movs r5, #15
x 8952: 4f09 ldr r7, [pc, #36] ; (8978)
x 8954: 4e09 ldr r6, [pc, #36] ; (897c)
x 8956: 4460 add r0, ip
1 8958: 0021 movs r1, r4
1 895a: 40d1 lsrs r1, r2
2 895c: 7803 ldrb r3, [r0, #0]
1 895e: 4029 ands r1, r5
2 8960: 5671 ldrsb r1, [r6, r1]
1 8962: 18fb adds r3, r7, r3
1 8964: 185b adds r3, r3, r1
2 8966: 785b ldrb r3, [r3, #1]
1 8968: 3204 adds r2, #4
2 896a: 7003 strb r3, [r0, #0]
1 896c: 3001 adds r0, #1
1 896e: 2a20 cmp r2, #32
2 8970: d1f2 bne.n 8958
18
x 8972: bdf0 pop {r4, r5, r6, r7, pc}
x 8974: 000090a8 ; <enc_expand_16x256>
x 8978: 00008fa4 ; <enc_constrain_8x258>
x 897c: 00008f94 ; <enc_increment_for_transition> [signed, 8x16]
instruction cycles:
movs lsrs ldrb ands ldrsb adds adds ldrb adds strb adds cmp bne
= 1 + 1 + 2 + 1 + 2 + 1 + 1 + 2 + 1 + 2 + 1 + 1 + 2
= 18
My interpretation of the generated assembly (version a)
I have written out my interpretation of the generated assembly for each version.
This section might be unnecessary, but I thought I might as well include it, since it helped me understand the differences between (a) and (b).
As above, the portions before and after the loop are identical.
The only significant difference I can see is that the two versions execute the same instructions in a slightly different order.
In particular, version (b) (which takes 20 cycles per iteration),
has zero instances of consecutive load/store operations,
zero instances of consecutive load/load operations,
and zero instances of consecutive store/store operations.
(The documented number of wait states for each load operation is commented in brackets: 1 wait state would be indicated by // ^ ldrb [1].)
r2 size_t j = 0;
r5 uint32_t mask_0xf = 0xf;
r7 uint8_t *constrain = &enc_constrain_8x258[0]; // 0x8fa4
r6 uint8_t *increment_for_transition =
&enc_increment_for_transition[0]; // 0x8f94
r0 uint8_t *patch_data = &enc->patch_data[0]
do {
r3 uint32_t _transitions = transitions;
r3 uint32_t transitions_shifted = _transitions >> j;
r3 size_t transition = transitions_shifted & mask_0xf;
r4 int16_t patch_data_i = *(patch_data + 0);
// ^ ldrb [0]
r3 int8_t _increment = *(increment_for_transition + transition);
// ^ ldrsb [1]
j += 4;
r3 int16_t inc_plus_pdata = _increment + patch_data_i;
r3 uint8_t *constrain_plus_inc_plus_pdata =
constrain + inc_plus_pdata;
r3 uint8_t constrained_pdata = *(constrain_plus_inc_plus_pdata + 1);
// ^ ldrb [1]
*(patch_data + 0) = constrained_pdata;
// ^ strb [0]
patch_data++;
} while (j < 32);
My interpretation of the generated assembly (version b)
r2 size_t j = 0;
r5 uint32_t mask_0xf = 0xf;
r7 uint8_t *constrain = &enc_constrain_8x258[0]; // 0x8fa4
r6 uint8_t *increment_for_transition =
&enc_increment_for_transition[0]; // 0x8f94
r0 uint8_t *patch_data = &enc->patch_data[0]
do {
r1 uint32_t _transitions = transitions;
r1 uint32_t transitions_shifted = _transitions >> j;
r3 int32_t patch_data_i = *(patch_data + 0);
// ^ ldrb [0]
r1 size_t transition = transitions_shifted & mask_0xf;
r1 int32_t _increment = *(increment_for_transition + transition);
// ^ ldrsb [1]
r3 uint8_t *constrain_plus_pdata = constrain + patch_data_i;
r3 uint8_t *constrain_plus_pdata_plus_inc =
constrain_plus_pdata + _increment;
r3 uint8_t constrained_pdata = *(constrain_plus_pdata_plus_inc + 1);
// ^ ldrb [1]
j += 4;
*(patch_data + 0) = constrained_pdata;
// ^ strb [0]
patch_data++;
} while (j < 32);
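For reference, here is a host-runnable reconstruction of what enc_constrain_8x258 presumably is, given the +1 index bias visible in both interpretations (the ldrb r3, [r3, #1] in the assembly): a 258-entry saturation table covering patch_data_i + increment + 1 in [0, 257], which implies increments in [-1, +1]. The table contents are an assumption, not taken from the original source.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical reconstruction of the 258-entry clamp table:
   256 byte values plus one guard entry on each side, indexed with
   the +1 bias seen in the generated `ldrb r3, [r3, #1]`. */
static uint8_t enc_constrain_8x258[258];

static void build_constrain(void)
{
    for (int i = 0; i < 258; i++) {
        int v = i - 1; /* undo the +1 bias; v spans [-1, 256] */
        enc_constrain_8x258[i] = (uint8_t)(v < 0 ? 0 : v > 255 ? 255 : v);
    }
}
```

With such a table, enc_constrain_8x258[patch_data_i + increment + 1] clamps the updated value into [0, 255] without any branches, which matches the branch-free loop body GCC emitted.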
Platform information
The microcontroller is the Atmel/Microchip ATSAMD21G18A.
The architecture is ARMv6-M.
The microarchitecture is ARM Cortex-M0+.
The master clock frequency of my MCU core is 48 MHz.
At 48 MHz, the SAMD21 [non-volatile] program memory requires 1 wait state if the cache is disabled.
At 48 MHz, the SAMD21 SRAM requires zero wait states.
However, I don't see any reason why executing the code from RAM would be faster. I believe the NVM data path is separate from the RAM data path, so instruction fetches from NVM should never contend with data fetches from RAM (I'm not 100% sure about this, but I think it's true). Therefore, if the NVM controller cache is working as documented, running this loop from NVM should almost certainly be faster than running it from RAM.
- The SAMD21 has a 64-byte cache for accesses to non-volatile memory.
- The NVM controller cache "is a direct-mapped cache that implements 8 lines of 64 bits (i.e., 64 bytes)."
- The NVM controller cache is enabled, in NO_MISS_PENALTY mode.
- This is the datasheet description of NO_MISS_PENALTY mode: "The NVM controller (cache system) does not insert wait states on a cache miss. Gives the best system performance."
- The datasheet does not provide any more information about NO_MISS_PENALTY mode.
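Assuming the usual layout for a direct-mapped cache with 8 lines of 64 bits (8 bytes per line, 64 bytes total), the line index would be address bits [5:3]. This is an inference from the datasheet's wording, not something it states:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical line-index computation for a direct-mapped cache of
   8 lines x 8 bytes: bits [2:0] select the byte within a line and
   bits [5:3] select the line. */
static uint32_t nvm_cache_line(uint32_t addr)
{
    return (addr >> 3) & 0x7u;
}
```

Under this assumed layout, the increment table at 0x8f94 starts on line 2 and the constrain table at 0x8fa4 starts on line 4, so one could at least reason about which lookups share a cache line.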
Cortex-M0+ uses a von Neumann architecture: instruction fetches always contend with data accesses, whether in zero-wait-state SRAM or in flash.