What are some rules of thumb for when SIMD would be faster? (SSE2, AVX)

414 Views Asked by At

I have some code that operates on 3 symmetric sets of 3 asymmetric integer values at a time. There is a significant amount of conditional code and lots of constants.

This has become a perf bottleneck and I'm looking for some rules of thumb for when SIMD on 64-bit Intel/AMD CPUs would yield perf wins. The code is pretty long and I've never used SSE2 or AVX before, so it would be nice to have some idea of if perf wins are possible or likely before I invest the time.

If you're willing to list the rules of thumb or point to an existing whitepaper on this, I'd appreciate it.

1

There are 1 best solutions below

1
On

The tag wiki has a couple guides to vectorization, including these slides from a talk which are comprehensible on their own which have some great examples of transforming your data structures to enable vectorization (and classic pitfalls like putting [x,y,z,w] geometry vectors into single SIMD vectors).


The classic use-case for SIMD is when there are a lot of independent operations, i.e. no serial dependencies inside the loop, like z[i] = d * x[i] + y[i]. Or if there are, then only with associative operations that let you reorder. (e.g. summing an array or a similar reduction).

Also important that you can do it without a lot of shuffling; ideally all your data lines up "vertically" in vectors after loading from contiguous memory.

Having many conditions that go different ways for adjacent elements is usually bad for SIMD. That requires a branchless implementation, so you have to do all the work of both sides of every branch, plus merging. Unless you can check that all 4 (or all 16 or whatever) elements in your vector go the same way


There are some things that can be vectorized, even though you might not have expected it, because they're exceptions to the usual rules of thumb. e.g. converting an IPv4 dotted-quad string into a 32-bit IPv4 address, or converting a string of decimal digits to an integer, i.e. implementing atoi(). These are vectorized with clever use of multiple different tricks, including a lookup-table of shuffle-masks for PSHUFB with a vector-compare bitmap as an index for the LUT.

So once you know some tricks, you always rule out a vectorized implementation quickly just based on a few rules of thumb. Even serial dependencies can sometimes be worked around, like for SIMD prefix sums.