If I'm not wrong, `ldrb r3, [r1], #1` will take 3 instruction cycles, and similarly `add r4, r1, #2` will take 1 instruction cycle (leaving interlock delays aside here), but I'm confused about how many cycles `cmp r4, r3` takes.
Note: this is ARM assembly with ARM9TDMI pipeline timings.
Your question is similar to your classmate's and uses similar code.
The loop core is,
The `ldrb`, `eor r3,r3,r2` sequence is an interlock similar to figure 7-2 and requires two interlock cycles. `str` and `cmp` are single cycles. `bne` is three cycles. See section 2.2 for the pipeline stages. The loop takes approximately nine cycles. Section and figure numbers are from the ARM9TDMI TRM.
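To make the tally explicit, here is a rough sketch that just adds up the per-instruction figures quoted above. The single issue cycle for `ldrb` and the single cycle for `eor` are my assumptions (consistent with the 3 cycles for `ldrb` and 1 cycle for `add` mentioned in the question); the dictionary keys are only labels, not real assembler syntax.

```python
# Per-iteration cycle tally for the byte loop, using the figures quoted above.
# Assumptions: ldrb issues in 1 cycle and eor takes 1 cycle; the interlock,
# str, cmp and bne figures are the ones stated in the text.
loop_cycles = {
    "ldrb": 1,
    "ldrb -> eor interlock": 2,
    "eor": 1,
    "str": 1,
    "cmp": 1,
    "bne (taken)": 3,
}

total = sum(loop_cycles.values())
print(total)  # 9
```

This only reproduces the approximate figure; it does not model the pipeline itself.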
This is 12 cycles for a transfer of 32 bytes, against roughly nine cycles per byte for the loop above, so it is approximately 24 times as fast. Using R4 first is beneficial as per figure 7-4.
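The "24 times" figure is just the ratio of the two cycle counts over the same 32 bytes. A quick back-of-the-envelope check, assuming the byte loop really does cost about nine cycles per byte (the variable names are only illustrative):

```python
# Compare cycles needed to move 32 bytes with each approach.
BYTE_LOOP_CYCLES_PER_BYTE = 9   # approximate figure for the byte-at-a-time loop
BLOCK_CYCLES = 12               # cycles quoted for the 32-byte transfer
BLOCK_BYTES = 32

byte_loop_cycles = BYTE_LOOP_CYCLES_PER_BYTE * BLOCK_BYTES  # 288 cycles
speedup = byte_loop_cycles / BLOCK_CYCLES

print(speedup)  # 24.0
```

Both inputs are approximate, so treat the ratio as a ballpark rather than a measured result.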
This alternate loop will take even longer at 13 cycles.
This modification gives 9 cycles, the same as gcc,
However, it is one more instruction.