how would you optimize this vectorized sum of harmonics?

205 Views Asked by At

I'm summing a bounch of harmonics together, with different phase/magnitude each, using vectorization (only SSE2 max as SIMD).

Here's my actual try:

float output = 0.0f;
simd::float_4 freqFundamentalNormalized = freq * (1.0f / sampleRate);
simd::float_4 harmonicIndex{1.0f, 2.0f, 3.0f, 4.0f};
simd::float_4 harmonicIncrement{4.0f, 4.0f, 4.0f, 4.0f};

// harmonics
const int numHarmonicsV4 = numHarmonics / 4;
const int numHarmonicsRemainder = numHarmonics - (numHarmonicsV4 * 4);

// v4
for (int i = 0; i < numHarmonicsV4; i++) {
    // signal
    simd::float_4 sineOutput4 = simd::sin(mPhases4[i] * g2PIf) * mMagnitudes4[i];

    for (int v = 0; v < 4; v++) {
        output += sineOutput4[v];
    }

    // increments
    mPhases4[i] += harmonicIndex * freqFundamentalNormalized;
    mPhases4[i] -= simd::floor(mPhases4[i]);

    harmonicIndex += harmonicIncrement;
}

// remainder
if (numHarmonicsRemainder > 0) {
    // signal
    simd::float_4 sineOutput4 = simd::sin(mPhases4[numHarmonicsV4] * g2PIf) * mMagnitudes4[numHarmonicsV4];

    for (int v = 0; v < numHarmonicsRemainder; v++) {
        output += sineOutput4[v];
    }

    // increments
    mPhases4[numHarmonicsV4] += harmonicIndex * freqFundamentalNormalized;
    mPhases4[numHarmonicsV4] -= simd::floor(mPhases4[numHarmonicsV4]);
}

but:

  1. I think I can optimize it more, maybe with some math tricks, or saving in some increments
  2. I don't like to repeat the "same code" once for V4, once for remainder (if the num of harmonics are not % 4): is there a way to put a sort of "mask" to the last V4 placing (for example) magnitudes at 0? (so it do the same operation in the same block, but won't sum to the final output).
2

There are 2 best solutions below

8
On BEST ANSWER

The second part of the question is the easiest. Any harmonic with magnitude 0 does not affect the sine output, so you just pad mMagnitude to a multiple of 4.

As Damien points out, sin(x) is expensive. But by Euler, exp(x)=cos(x) + i sin(x), and exp(x+dx)==exp(x)*exp(dx). Each step is just a complex multiplication.

6
On

First and foremost, make sure your implementation of simd::sin is fast. See XMVectorSin and especially XMVectorSinEst in DirectXMath library for an example how to make a fast one, or copy-paste from there, or include the library, it’s header-only. The instruction set is switchable with preprocessor macros, for optimal performance it needs SSE 4.1 and FMA3, but will work OK with SSE2-only.

As said in comments, you should only do horizontal add once, after all iterations of the loop are complete. Until then, accumulate in a SIMD vector.

Very minor and might be optimized by the compiler, but still, you should not access mPhases4 like you’re doing. Load the value into vector at the start of the loop body, compute output, increment, compute fractional part, and store the updated value just once per iteration.