Suppose we have this C code:
spatial_pred= (cur[mrefs] + cur[prefs])>>1;
and we translate it to NEON intrinsics:
int8x8_t cur_mrefs = vld1_s8(cur+mrefs);
int8x8_t cur_prefs = vld1_s8(cur+prefs);
int8x8_t spatial_pred = vshr_n_s8(vadd_s8(cur_mrefs, cur_prefs), 1);
Do we need to worry about overflow in vadd_s8(cur_mrefs, cur_prefs)? Should we use vadd_s16 instead?
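Yes, vadd_s8 wraps around on overflow, so the carry out of each 8-bit lane is lost. If you don't want to lose that overflow information, first widen each int8x8_t to int16x8_t (for example with vmovl_s8) and then do the sum in 16 bits. If you want the result to saturate instead, use vqadd.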
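The widening and saturating options can be sketched as follows. This is a minimal illustration, not a drop-in replacement: the function names avg_widen_s8 and add_saturate_s8 are hypothetical, each call handles exactly 8 lanes, and the scalar fallback (used when NEON is unavailable) assumes that >> on a negative int is an arithmetic shift, which is true of the common compilers but is implementation-defined in C. Here vaddl_s8 performs the widening and the add in one step.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Exact (a + b) >> 1 for 8 signed bytes: add in 16 bits, shift, narrow.
 * Hypothetical helper name; processes exactly 8 lanes. */
void avg_widen_s8(const int8_t *a, const int8_t *b, int8_t *out)
{
#if defined(__ARM_NEON)
    /* vaddl_s8 widens both inputs to 16 bits before adding: no overflow. */
    int16x8_t sum = vaddl_s8(vld1_s8(a), vld1_s8(b));
    /* Shift right by 1 and narrow back to 8 bits; the result always fits. */
    vst1_s8(out, vshrn_n_s16(sum, 1));
#else
    /* Scalar model: int promotion plays the role of the 16-bit lanes. */
    for (int i = 0; i < 8; i++)
        out[i] = (int8_t)((a[i] + b[i]) >> 1);
#endif
}

/* Saturating a + b for 8 signed bytes: results clamp to [-128, 127]. */
void add_saturate_s8(const int8_t *a, const int8_t *b, int8_t *out)
{
#if defined(__ARM_NEON)
    vst1_s8(out, vqadd_s8(vld1_s8(a), vld1_s8(b)));
#else
    for (int i = 0; i < 8; i++) {
        int s = a[i] + b[i];
        out[i] = (int8_t)(s > 127 ? 127 : (s < -128 ? -128 : s));
    }
#endif
}
```

Note that saturating and then shifting does not reproduce the C expression: 100 + 100 saturates to 127, and 127 >> 1 is 63 rather than 100. Only the widening path keeps the average exact.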
If you just want to convert the C version, use vhadd or vrhadd (the rounding variant). These compute the halved sum in a single instruction without losing the carry, so there is no need for a separate shift as a second step.
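A short sketch of the halving-add approach, under the same assumptions as the question (signed 8-bit lanes, 8 at a time). The function names avg_hadd_s8 and avg_rhadd_s8 are hypothetical, and the scalar fallback assumes an arithmetic right shift for negative values. If the pixel data is actually unsigned, the corresponding _u8 intrinsics (vld1_u8, vhadd_u8, vrhadd_u8) would be the ones to use.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Truncating halving add: (a + b) >> 1 per lane, with no intermediate
 * overflow. Hypothetical helper name; processes exactly 8 lanes. */
void avg_hadd_s8(const int8_t *a, const int8_t *b, int8_t *out)
{
#if defined(__ARM_NEON)
    vst1_s8(out, vhadd_s8(vld1_s8(a), vld1_s8(b)));
#else
    for (int i = 0; i < 8; i++)
        out[i] = (int8_t)((a[i] + b[i]) >> 1);
#endif
}

/* Rounding halving add: (a + b + 1) >> 1 per lane. */
void avg_rhadd_s8(const int8_t *a, const int8_t *b, int8_t *out)
{
#if defined(__ARM_NEON)
    vst1_s8(out, vrhadd_s8(vld1_s8(a), vld1_s8(b)));
#else
    for (int i = 0; i < 8; i++)
        out[i] = (int8_t)((a[i] + b[i] + 1) >> 1);
#endif
}
```

With this, the line from the question would become int8x8_t spatial_pred = vhadd_s8(cur_mrefs, cur_prefs); which matches the C expression lane for lane, since C promotes the operands to int before adding.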