Fast computation of squared norm and normalized vector with Eigen


I have a section of my code that needs to run millions of times. It seems that the most costly part of it is this computation:

Vector2d directedInverseSquare(Vector2d &pos1, Vector2d &pos2) {
    Vector2d diff = pos2 - pos1;
    Vector2d rHat = diff.normalized();
    double rSquared = diff.squaredNorm();

    return rHat / rSquared;
}

This computes 1/r^2 in the direction of pos1 -> pos2, where r is the distance between the two points. It is equivalent to R/r^3, where R is the original difference vector (since rHat = R/r, we have rHat/r^2 = R/r^3), but I have found the form written here to be faster.

It seems to me that the obvious way to make this code faster would be to somehow "produce" both the squared distance and the normalized vector at once. Their computations are very similar, yet I currently call them one after the other without sharing any of the work.

I've tried to do it myself, but Eigen always comes out faster. So perhaps an Eigen function that does this would be best, but I haven't found one.
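
To make it concrete, something like the following is what I mean by producing both at once (just a sketch, the helper name is made up):

#include <Eigen/Dense>
#include <cmath>
#include <utility>

// Sketch only: derive both quantities from a single squaredNorm() and one sqrt(),
// instead of calling normalized() and squaredNorm() separately.
std::pair<Eigen::Vector2d, double> normalizedAndSquaredNorm(const Eigen::Vector2d &diff)
{
    double rSquared = diff.squaredNorm();               // squared distance
    Eigen::Vector2d rHat = diff / std::sqrt(rSquared);  // normalized vector, reusing rSquared
    return {rHat, rSquared};
}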

Accepted answer (by Homer512):

Okay, with the discussion dying off, here is my test code and improved versions.

First, my benchmark. Nothing fancy.

#include <Eigen/Dense>

struct AoS
{
    Eigen::Vector2d a, b, c;
};

int main()
{
    const int repetitions = 2000000;
    constexpr int blocks = 340; // ~ 16 kiB
    AoS inbufs[blocks], outbufs[blocks];
    for(AoS& buf: inbufs)
        for(Eigen::Vector2d* vec: {&buf.a, &buf.b, &buf.c})
            *vec = Eigen::Vector2d::Random();
    for(int i = 0; i < repetitions; ++i) {
        for(int j = 0; j < blocks; ++j) {
            const AoS& in = inbufs[j];
            AoS& out = outbufs[j];
            out.a = directedInverseSquare(in.a, in.b);
            out.b = directedInverseSquare(in.b, in.c);
            out.c = directedInverseSquare(in.c, in.a);
        }
        // don't optimize the store away
        asm volatile ("" ::: "memory");
    }
}

Compile flags: -DNDEBUG -march=native -fno-math-errno, compiled with gcc 12.3. On my TigerLake CPU, this takes 6.2 seconds with your implementation.
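
(The benchmark passes const references, so the 6.2 second baseline assumes your function with const-qualified parameters; only the signature differs:)

Eigen::Vector2d directedInverseSquare(
        const Eigen::Vector2d &pos1, const Eigen::Vector2d &pos2)
{
    Eigen::Vector2d diff = pos2 - pos1;
    Eigen::Vector2d rHat = diff.normalized();
    double rSquared = diff.squaredNorm();
    return rHat / rSquared;
}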

Switching to the straightforward diff / r**3 gives us 4.5 seconds:

Eigen::Vector2d directedInverseSquare2(
        const Eigen::Vector2d &pos1, const Eigen::Vector2d &pos2)
{
    Eigen::Vector2d diff = pos2 - pos1;
    double r = diff.norm();
    return diff / (r * r * r);
}

chtz's version is about the same speed but compiles to fewer instructions; it gets r^3 as rSquared * sqrt(rSquared), so it needs neither normalized() nor norm(). Objectively I think this is the best.

Eigen::Vector2d directedInverseSquare3(
        const Eigen::Vector2d &pos1, const Eigen::Vector2d &pos2)
{
    Eigen::Vector2d diff = pos2 - pos1;
    double rSquared = diff.squaredNorm();
    return diff / (rSquared * std::sqrt(rSquared));
}

To improve upon this, we can vectorize by packing the three points into a 2x3 matrix. This lets us handle the b-a and c-b differences in one step; the a-c difference has to be done separately.

using Matrix23d = Eigen::Matrix<double, 2, 3>;

void directedInverseSquare(Matrix23d& out, const Matrix23d& in)
{
    Eigen::Array22d diffs = in.rightCols<2>() - in.leftCols<2>();
    Eigen::Array2d rSquared = diffs.colwise().squaredNorm();
    out.leftCols<2>() = diffs.rowwise() / (rSquared * rSquared.sqrt()).transpose();
    Eigen::Vector2d diff3 = in.col(0) - in.col(2);
    double rSquared3 = diff3.squaredNorm();
    out.col(2) = diff3 / (rSquared3 * std::sqrt(rSquared3));
}
int main()
{
    const int repetitions = 2000000;
    constexpr int blocks = 340; // ~ 16 kiB
    Matrix23d inbufs[blocks], outbufs[blocks];
    for(Matrix23d& buf: inbufs)
        buf = Matrix23d::Random();
    for(int i = 0; i < repetitions; ++i) {
        for(int j = 0; j < blocks; ++j)
            directedInverseSquare(outbufs[j], inbufs[j]);
        asm volatile ("" ::: "memory");
    }
}

Unfortunately, this is only slightly faster in my tests: 4.3 seconds. The assembly looks fine, though. I also tested a transposed version, which is slightly worse.

However, this matrix style, and especially the transposed version, works very well if you have many more vectors: the benefit of vectorization grows while the relative cost of that last scalar element shrinks.

void directedInverseSquare(Eigen::MatrixX2d& out, const Eigen::MatrixX2d& in)
{
    const Eigen::Index n = in.rows();
    const auto& diffs = in.bottomRows(n - 1) - in.topRows(n - 1);
    const auto& rsquared = diffs.rowwise().squaredNorm().array();
    out.topRows(n - 1) = diffs.array().colwise() / (rsquared * rsquared.sqrt());
    Eigen::Vector2d diff3 = in.row(0) - in.row(n - 1);
    double rSquared3 = diff3.squaredNorm();
    // note the eval() to force vectorized evaluation
    out.row(n - 1) = (diff3 / (rSquared3 * std::sqrt(rSquared3))).eval();
}
int main()
{
    const int repetitions = 2000000;
    constexpr int blocks = 340; // ~ 16 kiB
    Eigen::MatrixX2d inbufs = Eigen::MatrixX2d::Random(blocks * 3, 2);
    Eigen::MatrixX2d outbufs(blocks * 3, 2);
    for(int i = 0; i < repetitions; ++i) {
        directedInverseSquare(outbufs, inbufs);
        asm volatile ("" ::: "memory");
    }
}

This takes 2.8 seconds. As usual, be careful with those auto variables in Eigen.
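
To illustrate why (a minimal example of my own, not part of the benchmark): auto deduces the unevaluated expression type, so the expression is re-evaluated every time it is used and reflects later changes to its operands.

#include <Eigen/Dense>
#include <iostream>

int main()
{
    Eigen::Vector2d a(1.0, 2.0), b(3.0, 4.0);
    auto sum = a + b;          // an unevaluated expression referencing a and b, not a Vector2d
    a.setZero();               // changing an operand changes what 'sum' evaluates to
    std::cout << sum.transpose() << "\n";   // prints "3 4", not "4 6"
    Eigen::Vector2d now = (a + b).eval();   // force evaluation when you need the value immediately
}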

As a bonus, here is the best version I could come up with for the 2x3 matrix case using Intel intrinsics with AVX + FMA instructions. It relies on Matrix23d being stored column-major (Eigen's default), so in.data() is laid out as [a0, a1, b0, b1, c0, c1]. This takes 4.1 seconds.

#include <immintrin.h>

void directedInverseSquare(Matrix23d& out, const Matrix23d& in)
{
    const double* inptr = in.data();
    // [b0, b1, c0, c1]
    __m256d rightcols = _mm256_loadu_pd(inptr + 2);
    // [a0, a1, b0, b1]
    __m256d leftcols = _mm256_loadu_pd(inptr);
    // [b0 - a0, b1 - a1, c0 - b0, c1 - b1]
    __m256d diffs = _mm256_sub_pd(rightcols, leftcols);
    // [c0, c1]
    __m128d cvec = _mm256_extractf128_pd(rightcols, 1);
    // [a0, a1]
    __m128d avec = _mm256_castpd256_pd128(leftcols);
    // [a0 - c0, a1 - c1]
    __m128d taildiff = _mm_sub_pd(avec, cvec);
    __m256d taildiff256 = _mm256_castpd128_pd256(taildiff);
    // [b0 - a0, a0 - c0, c0 - b0, x]
    __m256d lowvals = _mm256_unpacklo_pd(diffs, taildiff256);
    // [b1 - a1, a1 - c1, c1 - b1, x]
    __m256d highvals = _mm256_unpackhi_pd(diffs, taildiff256);
    // [norm(b-a)**2, norm(a-c)**2, norm(c-b)**2, x]
    __m256d rsquared = _mm256_mul_pd(lowvals, lowvals);
    rsquared = _mm256_fmadd_pd(highvals, highvals, rsquared);
    // [norm(b-a), norm(a-c), norm(c-b), x]
    __m256d r = _mm256_sqrt_pd(rsquared);
    // [norm(b-a)**3, norm(a-c)**3, norm(c-b)**3, x]
    __m256d rcubed = _mm256_mul_pd(r, rsquared);
    // [norm(b-a)**3, norm(b-a)**3, norm(c-b)**3, norm(c-b)**3]
    __m256d bcast1 = _mm256_permute_pd(rcubed, _MM_SHUFFLE(0, 0, 0, 0));
    // [norm(b-a)**3, norm(a-c)**3]
    __m128d rcubed128 = _mm256_castpd256_pd128(rcubed);
    // [norm(a-c)**3, norm(a-c)**3]
    __m128d bcast2 = _mm_permute_pd(rcubed128, _MM_SHUFFLE2(1, 1));
    // [(b0-a0)/norm(b-a)**3, ..., (c1-b1)/norm(c-b)**3]
    __m256d result1 = _mm256_div_pd(diffs, bcast1);
    // [(a0-c0)/norm(a-c)**3, (a1-c1)/norm(a-c)**3]
    __m128d result2 = _mm_div_pd(taildiff, bcast2);
    double* outptr = out.data();
    _mm256_storeu_pd(outptr, result1);
    _mm_store_pd(outptr + 4, result2);
}

(Note that unpacklo and unpackhi have rather unintuitive results for 256-bit vectors.)
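
For reference, they operate within each 128-bit lane rather than across the whole register:

// Per 128-bit lane, not across the register:
// _mm256_unpacklo_pd(a, b) -> [a0, b0, a2, b2]
// _mm256_unpackhi_pd(a, b) -> [a1, b1, a3, b3]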