Part 1 - why the code below checks st_inverse in the first place
The kiss_fft code has this branch inside a loop:
do {
if(st->inverse) {
Fout[m].r = scratch[5].r - scratch[4].i;
Fout[m].i = scratch[5].i + scratch[4].r;
Fout[m3].r = scratch[5].r + scratch[4].i;
Fout[m3].i = scratch[5].i - scratch[4].r;
}else{
Fout[m].r = scratch[5].r + scratch[4].i;
Fout[m].i = scratch[5].i - scratch[4].r;
Fout[m3].r = scratch[5].r - scratch[4].i;
Fout[m3].i = scratch[5].i + scratch[4].r;
}
++Fout;
} while (--k); // Fout[] has k*4 elements.
Slightly reordered:
if(st->inverse) {
Fout[m].r = scratch[5].r - scratch[4].i;
Fout[m].i = scratch[5].i + scratch[4].r;
Fout[m3].r = scratch[5].r + scratch[4].i;
Fout[m3].i = scratch[5].i - scratch[4].r;
}else{
Fout[m3].r = scratch[5].r - scratch[4].i;
Fout[m3].i = scratch[5].i + scratch[4].r
Fout[m].r = scratch[5].r + scratch[4].i;
Fout[m].i = scratch[5].i - scratch[4].r;;
}
The two code blocks really differ only in their use of m and m3. But m and m3 are not changed inside the loop. Can I simply eliminate this inner-loop branch by swapping m and m3 ?
if(st->inverse) { swap(&m, &m3); }
do {
Fout[m].r = scratch[5].r - scratch[4].i;
Fout[m].i = scratch[5].i + scratch[4].r;
Fout[m3].r = scratch[5].r + scratch[4].i;
Fout[m3].i = scratch[5].i - scratch[4].r;
++Fout;
} while (--k);
I can indeed use that optimization. It's not necessary however with current-gen compilers that can use AVX. They'll eliminate that branch as well, using
vpcmpeqdandvblendvps.