This might look similar to "ARM and NEON can work in parallel?", but it's not. I have a different issue (possibly a problem with my understanding):
In the protocol stack, the checksum is currently computed on the GPP. I'm now handing that task over to NEON as part of a function:
Here is the checksum function that I have written for NEON, posted on Stack Overflow: Checksum code implementation for Neon in Intrinsics
Now, suppose this function is called from Linux:
ip_csum(){
…
…
csum = do_csum(); // function call from ARM
…
…
}
do_csum(){
…
…
//NEON optimised code
…
…
// returns the final checksum to ip_csum/Linux/ARM
}
In this case, what happens to the ARM while NEON is doing the calculations? Does the ARM sit idle, or does it move on with other operations?
As you can see, do_csum is called and we wait on its result (or at least that is what it looks like).
NOTE:
- I'm speaking in terms of the Cortex-A8
- do_csum, as you can see from the link, is coded with intrinsics
- Compiled with the GNU toolchain
- It would be good if the answer also covered multi-threading or any other concept that comes into play when ARM and NEON interoperate.
Questions:
- Does the ARM sit idle while NEON is doing its operations (in this particular case)?
- Or does it shelve the current ip_csum-related code and take up another process/thread until NEON is done? (I have almost no idea what happens here.)
- If it is sitting idle, how can we make the ARM work on something else until NEON is done?
(Image from TI Wiki Cortex A8)
The ARM (or rather, the integer pipeline) does not sit idle while NEON instructions are processing. In the Cortex-A8, NEON sits at the "end" of the processor pipeline. Instructions flow through the pipeline: ARM instructions are executed at the "beginning" of the pipeline, and NEON instructions are executed at the end. Every clock pushes the instructions further down the pipeline.
Here are some hints on how to read the diagram above:
If you are executing a sequence that is 100% NEON instructions (which is pretty rare, since there are usually some ARM registers, control flow, etc. involved), then there is some period where the integer pipeline isn't doing anything useful. Most code will have the two executing concurrently for at least some of the time, while cleverly engineered code can maximize performance with the right instruction mix.
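To make the "instruction mix" point concrete, here is a minimal sketch of a checksum-style loop written with intrinsics. It is not the do_csum from the linked question: the name csum_sketch, its signature, and the choice to return the folded (not inverted) 16-bit sum are all assumptions made for illustration. The loop counter, address arithmetic, compare and branch compile to ARM instructions, while the load and the pairwise add-accumulate compile to NEON instructions, so on a Cortex-A8 the two kinds of work can overlap in the pipeline. How well they overlap depends on how the compiler schedules them. On a build without NEON the sketch falls back to the plain scalar loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define CSUM_HAVE_NEON 1
#else
#define CSUM_HAVE_NEON 0
#endif

/* Hypothetical helper (not the poster's do_csum): adds up `count` 16-bit
 * words and folds the carries back in, giving the ones' complement sum.
 * Assumption: buffers are packet sized, so the 32-bit NEON lanes used
 * below cannot overflow. */
static uint16_t csum_sketch(const uint16_t *words, size_t count)
{
    uint64_t sum = 0;
    size_t i = 0;

    assert(words != NULL || count == 0);

#if CSUM_HAVE_NEON
    uint32x4_t acc = vdupq_n_u32(0);

    /* ARM side: i, the bounds check and the branch.
     * NEON side: the 128-bit load and the widening pairwise accumulate. */
    for (; i + 8 <= count; i += 8) {
        uint16x8_t v = vld1q_u16(words + i); /* load 8 x u16 */
        acc = vpadalq_u16(acc, v);           /* add adjacent pairs into 4 x u32 */
    }

    /* Reduce the four 32-bit lanes to one scalar. */
    uint64x2_t acc64 = vpaddlq_u32(acc);
    sum = vgetq_lane_u64(acc64, 0) + vgetq_lane_u64(acc64, 1);
#endif

    /* Scalar tail (or the whole buffer when NEON is unavailable). */
    for (; i < count; i++) {
        sum += words[i];
    }

    /* Fold the carries back into the low 16 bits. */
    while (sum >> 16) {
        sum = (sum & 0xFFFFu) + (sum >> 16);
    }
    return (uint16_t)sum;
}
```

Note that the call itself is still an ordinary, synchronous function call: the caller gets its result when csum_sketch returns. The overlap described above happens at the instruction level inside the loop, not between the caller and the callee.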
This online tool, Cycle Counter for Cortex A8, is great for analyzing the performance of your assembly code. It shows what is executing in which units and what is stalling.