I would like to know the peak FLOPs per cycle for the ARM1176JZF-S core in the Raspberry Pi 1 and the Cortex-A7 cores in the Raspberry Pi 2.
From the ARM1176JZF-S Technical Reference Manual, it seems that VFPv2 can do one SP MAC every clock cycle and one DP MAC every other clock cycle. In addition, there are three pipelines that can operate in parallel: a MAC pipeline (FMAC), a division and sqrt pipeline (DS), and a load/store pipeline (LS). Based on this, it appears that the ARM1176JZF-S in the Raspberry Pi 1 can do at least the following (counting the FMAC pipeline alone):
- 1 DP FLOP/cycle: one MAC/2 cycles
- 2 SP FLOPs/cycle: one MAC/cycle
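To make the arithmetic behind these two figures explicit, here is a minimal Python sketch. It assumes a MAC counts as 2 FLOPs (one multiply plus one add); the helper name `flops_per_cycle` is made up for illustration and is not from any ARM document.

```python
def flops_per_cycle(flops_per_instruction, cycles_per_instruction):
    """Peak FLOPs/cycle for one instruction type issued back to back."""
    return flops_per_instruction / cycles_per_instruction

MAC_FLOPS = 2  # assumption: one multiply-accumulate counts as 2 FLOPs

arm1176_sp = flops_per_cycle(MAC_FLOPS, 1)  # one SP MAC every cycle
arm1176_dp = flops_per_cycle(MAC_FLOPS, 2)  # one DP MAC every other cycle

print(f"ARM1176JZF-S SP: {arm1176_sp} FLOPs/cycle")
print(f"ARM1176JZF-S DP: {arm1176_dp} FLOPs/cycle")
```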
Wikipedia claims the Raspberry Pi 1 achieves 0.041 DP GFLOPS. Dividing by 0.700 GHz gives less than 0.06 DP FLOPs/cycle. That's about 17 times less than my estimate of 1 DP FLOP/cycle.
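The conversion from the quoted GFLOPS figure to FLOPs/cycle, and the size of the gap, can be checked with a few lines of Python. The constants are the numbers quoted in this question; the variable names are only illustrative.

```python
WIKIPEDIA_DP_GFLOPS = 0.041         # figure quoted from Wikipedia
CLOCK_GHZ = 0.700                   # Raspberry Pi 1 clock used above
ESTIMATED_DP_FLOPS_PER_CYCLE = 1.0  # my estimate from the FMAC pipeline

# GFLOPS divided by GHz leaves FLOPs per cycle.
quoted_flops_per_cycle = WIKIPEDIA_DP_GFLOPS / CLOCK_GHZ
gap = ESTIMATED_DP_FLOPS_PER_CYCLE / quoted_flops_per_cycle

print(f"quoted:  {quoted_flops_per_cycle:.4f} DP FLOPs/cycle")
print(f"gap:     {gap:.1f}x below the estimate")
```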
So what's the correct answer?
For the Cortex-A7 processor in the Raspberry Pi 2 I believe it's the same as the Cortex-A9. The FLOPs/cycle/core for the Cortex-A9 is:
- 1.5 DP FLOPs/cycle: scalar addition + scalar multiplication every other cycle
- 4 SP FLOPs/cycle: 4-wide NEON addition every other cycle + 4-wide NEON multiplication every other cycle.
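As a rough sketch of how the two Cortex-A9 figures add up, the snippet below sums the contribution of each instruction stream. It rests on my reading of the list: for DP, one scalar add every cycle overlapped with one scalar multiply every other cycle; for SP, a 4-wide add and a 4-wide multiply, each every other cycle. The helper `peak_flops_per_cycle` is hypothetical.

```python
def peak_flops_per_cycle(streams):
    """Sum FLOPs/cycle over instruction streams assumed to overlap fully.

    Each stream is (flops_per_instruction, cycles_between_issues).
    """
    return sum(flops / cycles for flops, cycles in streams)

# Assumed DP mix: scalar add every cycle + scalar multiply every other cycle.
a9_dp = peak_flops_per_cycle([(1, 1), (1, 2)])

# Assumed SP mix: 4-wide NEON add and 4-wide NEON multiply, each every other cycle.
a9_sp = peak_flops_per_cycle([(4, 2), (4, 2)])

print(f"Cortex-A9 DP: {a9_dp} FLOPs/cycle")
print(f"Cortex-A9 SP: {a9_sp} FLOPs/cycle")
```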
Is the FLOPs/cycle/core for the Raspberry Pi 2 the same as for the Cortex-A9? If not, what is the correct answer?
Edit:
The main differences between the Cortex-A9 and Cortex-A7 (when it comes to peak FLOPs/cycle) are:
- the Cortex-A9 is dual-issue (two instructions per clock) and the Cortex-A7 is only partially dual-issue: "the A7 cannot dual-issue floating point or NEON instructions."
- the Cortex-A9 is an out-of-order (OoO) processor and the Cortex-A7 is not.
I'm not sure why OoO execution would affect the peak FLOPS. The lack of dual issue for floating point certainly should: I think that would cut the peak FLOPS in half.
Edit: based on the table at http://hardwarebug.org/2014/05/15/cortex-a7-instruction-cycle-timings/ that Stephen Canon gave in a comment, here are my new peak FLOPs for the Cortex-A7:
- 0.5 DP FLOPs/cycle: one VMLA.F64 (VFP) every four cycles.
- 1.0 DP FLOPs/cycle: one VADD.F64 (VFP) every cycle.
- 2.0 SP FLOPs/cycle: one VMLA.F32 (VFP) every cycle.
- 2.0 SP FLOPs/cycle: one VMLA.F32 (NEON) on two 32-bit floats every other cycle.
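For comparison with measured benchmark numbers, these per-cycle figures can be scaled to MFLOPS per core. The Python sketch below assumes a 900 MHz Raspberry Pi 2 clock, counts a VMLA as 2 FLOPs per lane, and takes the issue intervals from the list above; it is arithmetic only, not a measurement.

```python
CLOCK_MHZ = 900  # assumption: Raspberry Pi 2 clock

# instruction -> (FLOPs per instruction, cycles between issues), per the list above
cortex_a7 = {
    "VMLA.F64 (VFP)": (2, 4),
    "VADD.F64 (VFP)": (1, 1),
    "VMLA.F32 (VFP)": (2, 1),
    "VMLA.F32 (NEON, 2 lanes)": (4, 2),
}

# FLOPs/cycle times MHz gives MFLOPS for a single core.
peak_mflops = {
    name: flops / cycles * CLOCK_MHZ
    for name, (flops, cycles) in cortex_a7.items()
}

for name, mflops in peak_mflops.items():
    print(f"{name:26s} {mflops / CLOCK_MHZ:.1f} FLOPs/cycle  {mflops:6.0f} MFLOPS/core")
```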
Example 1: The compiled code MP-MFLOPSPiNeon obtains >647 MFLOPS (data words 3.2k to 3.2M) on a 900 MHz Raspberry Pi 2. The disassembly seems to be the same without threading. The compile/link command used and the C code for 32 operations per data word are below. [Someone might suggest faster compile options.]
The disassembly follows and is complex. Note the highlighted fused multiply-accumulate or multiply-subtract instructions, together with an excessive number of loads.
Example 2: Using NEON intrinsic functions (written before I knew of the fused instructions) gives >700 MFLOPS. First, the C code:
Next is the disassembly, again with excessive load instructions.