I am loading elements from memory using SIMD load instructions, let's say with AltiVec, assuming aligned addresses:
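float X[SIZE];
vector float V0;
unsigned FLOAT_VEC_SIZE = sizeof(vector float);  /* 16 bytes, i.e. 4 floats */
/* vec_ld takes a byte offset, so the loop bound is expressed in bytes as well */
for (int load_index = 0; load_index < SIZE * (int)sizeof(float); load_index += FLOAT_VEC_SIZE)
{
    V0 = vec_ld(load_index, X);
    /* some computation involving V0 */
}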
Now, if SIZE is not a multiple of the number of floats in a vector (4), V0 may contain invalid elements from past the end of the array in the last loop iteration. One way to avoid that is to run the loop for one fewer iteration; another is to mask off the potentially invalid elements. Is there any other useful trick here? Note that the above is the innermost of a set of nested loops, so any additional non-SIMD instruction comes with a performance penalty!
Ideally you should pad your array to a multiple of vec_step(vector float) (i.e. a multiple of 4 elements) and then mask out any additional unwanted values from SIMD processing, or use scalar code to deal with the last few elements, e.g.
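One possible shape for the scalar-tail variant is sketched below. It is only an illustration under some assumptions: the kernel (a plain sum), the function name `sum_floats` and the macro `VEC_ELEMS` are hypothetical and not from the question, and the input pointer is assumed to be 16-byte aligned. The SIMD loop covers only the full vectors, and the remaining 0 to 3 elements are handled once, after the loop, so nothing extra runs inside the hot loop. A scalar fallback is included so the sketch also builds where AltiVec is not available.

```c
#include <stddef.h>
#ifdef __ALTIVEC__
#include <altivec.h>
#endif

#define VEC_ELEMS 4  /* floats per 16-byte vector (hypothetical helper macro) */

/* Hypothetical kernel: sums the first n floats of x.
   Assumption: x is 16-byte aligned, as vec_ld requires. */
float sum_floats(const float *x, size_t n)
{
    size_t n_vec = n - (n % VEC_ELEMS);  /* elements covered by full vectors */
    float total = 0.0f;
    size_t i;

#ifdef __ALTIVEC__
    /* All-zero bit pattern reinterpreted as four 0.0f lanes. */
    vector float acc = (vector float)vec_splat_u32(0);
    for (i = 0; i < n_vec; i += VEC_ELEMS) {
        /* vec_ld takes a byte offset, not an element index. */
        vector float v = vec_ld((long)(i * sizeof(float)), x);
        acc = vec_add(acc, v);
    }
    /* Spill the accumulator and reduce its four lanes. */
    float lanes[VEC_ELEMS] __attribute__((aligned(16)));
    vec_st(acc, 0, lanes);
    total = lanes[0] + lanes[1] + lanes[2] + lanes[3];
#else
    /* Portable stand-in for the vector loop on non-AltiVec targets. */
    for (i = 0; i < n_vec; i++)
        total += x[i];
#endif

    /* Scalar tail: the last n % VEC_ELEMS elements, outside the hot loop. */
    for (i = n_vec; i < n; i++)
        total += x[i];

    return total;
}
```

The tail loop runs at most three iterations per call, so its cost is fixed and independent of SIZE. If even that is too much, padding the array and letting the padding hold a neutral value for the computation (0.0f for a sum) removes the tail entirely.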