Suggestions on further optimising this chi-square function using SSE2 intrinsics


I am trying to convert the chi-square function below from plain C code to SSE2 intrinsics.

I am getting the correct output from both functions, and I have measured how long each takes to run using some random 4KB data that I generated. On average, I am seeing about a 70-90 ms performance improvement.

I am just wondering whether there are any further optimisations I am missing that could improve the performance further. Any clues would be helpful.

Normal C-Code:

#include <stdint.h>

double getchisquared(int size, uint8_t *data) {
    int observed[256] = {0};
    double chiSquare = 0.0;
    double expected = (double)size / 256; // Constant expected value
    // Calculate frequency of each byte value
    for (int i = 0; i < size; i++) {
        observed[data[i]]++;
    }
    // Calculate the chi-square statistic
    for (int i = 0; i < 256; i++) {
        double diff = observed[i] - expected;
        chiSquare += (diff * diff) / expected;
    }
    return chiSquare;
}

SSE2 Intrinsics:

#include <emmintrin.h>  // SSE2 intrinsics
#include <stdint.h>

double getchisquared_sse2(int size, uint8_t *data) {
    int observed[256] = {0};
    const double expected = (double)size / 256;  // Make 'expected' a constant
    double chiSquare = 0.0;
    // Process data in 16-byte (128-bit) chunks
    for (int i = 0; i < size; i += 16) {
        __m128i dataChunk = _mm_loadu_si128((__m128i*)(data + i));
        // Unpack 8-bit values into 16-bit values for counting
        __m128i dataUnpacked = _mm_unpacklo_epi8(dataChunk, _mm_setzero_si128());
        // Extract and count 8 values at a time: low half first, then high half
        for (int j = 0; j <= 1; j++) {
            uint16_t values[8];
            _mm_storeu_si128((__m128i*)values, dataUnpacked);
            for (int k = 0; k < 8; k++) {
                observed[values[k]]++;
            }
            dataUnpacked = _mm_unpackhi_epi8(dataChunk, _mm_setzero_si128());
        }
    }
    // Calculate the chi-square statistic using SSE2 intrinsics
    __m128d sum = _mm_setzero_pd();
    for (int i = 0; i < 256; i += 2) {
        __m128d observedVec = _mm_set_pd(observed[i + 1], observed[i]);
        __m128d diff = _mm_sub_pd(observedVec, _mm_set1_pd(expected));
        __m128d squaredDiff = _mm_mul_pd(diff, diff);
        __m128d result = _mm_div_pd(squaredDiff, _mm_set1_pd(expected));
        sum = _mm_add_pd(sum, result);
    }
    // Sum the two lanes of the vector accumulator
    double sumArray[2];
    _mm_storeu_pd(sumArray, sum);
    for (int i = 0; i < 2; i++) {
        chiSquare += sumArray[i];
    }
    return chiSquare;
}


Answer from Simon Goater:

Your SSE2 version of the function benchmarks slower (~25%) on my Westmere i5 laptop than your scalar function. I made a slight improvement (~20% with 4KB data) to the performance of your scalar function on my machine. Also, the SSE2 function doesn't work for all 'size' values, which I'm sure you already know. Anyway, my function is below.

#include <stdint.h>

double getchisquared(int size, uint8_t *data) {
    int observed[256] = {0};  // histogram of byte values
    double diff, chiSquare = 0.0;
    double expected = (double)size / 256; // Constant expected value
    int i, iterations = (size >> 2) << 2;
    // Calculate frequency of each byte value
    for (i = 0; i < iterations;) {
        observed[data[i++]]++;
        observed[data[i++]]++;
        observed[data[i++]]++;
        observed[data[i++]]++;
    }
    for (i = iterations; i < size; i++) {
        observed[data[i]]++;
    }
    // Calculate the chi-square statistic
    for (i = 0; i < 256; i++) {
        diff = observed[i] - expected;
        chiSquare += (diff * diff);
    }
    return chiSquare / expected;
}

I think SSE2 doesn't offer much hope for optimising the histogram stage, as chtz pointed out. You might have more luck with AVX2, but I haven't investigated.

Answer from Peter Cordes:

For the histogram part:

  • Unroll the histogram with 2 or 4 arrays of counts to avoid store-forwarding latency bottlenecks on repeated runs of one data[i] value within 40 to 80 i values (based on ROB (Reorder Buffer) and scheduler capacity). Perhaps use uint16_t elements if size is less than UINT16_MAX/unroll so no count bin can overflow, for denser use of L1d cache. (16-bit memory increments aren't slower than 32-bit on x86.) SIMD is very good for vertically summing over 2 or 4 arrays afterwards (see the sketch after this list), and 4 x 1024-byte arrays is few enough that aliasing and conflict-misses wouldn't be a problem, so you can just do it in one pass. (It is more memory to zero initially and to sum over, and dirties more cache, so only worth it if it pays for itself with real speedups on your typical data.)

  • SIMD won't help for the actual histogram part unless you have AVX-512 for scatter as well as gather. (And preferably 8 or 16 arrays of counts to avoid the need for conflict-detection, if you have large enough inputs to amortize the extra zeroing and summing). Recent microcode updates to fix the Downfall vulnerability in Intel CPUs have drastically hurt gather performance, and gather/scatter performance isn't great on Zen 4 in the first place. (Scatters are slower than gathers on both AMD and Intel, https://uops.info/ search for vpscatterdd in AVX-512. AMD is slower than Intel at each, at least using the numbers from before the Downfall microcode update, and IDK how well gather and scatter can overlap with each other.)

    You're probably making it slower with _mm_unpacklo_epi8, unless maybe that unrolls to use SSE2 pextrw, but scalar shifts would be better.
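A minimal untested sketch of that vertical summing step (sum_count_arrays_sse2 is a hypothetical helper name, assuming the histogram was unrolled over 4 separate uint16_t count arrays whose combined counts still fit in 16 bits):

#include <emmintrin.h>   // SSE2
#include <stdint.h>

// Collapse 4 per-unroll count arrays into one total array, 8 bins per 128-bit vector.
static void sum_count_arrays_sse2(const uint16_t counts[4][256], uint16_t total[256])
{
    for (int i = 0; i < 256; i += 8) {
        __m128i v = _mm_loadu_si128((const __m128i*)&counts[0][i]);
        v = _mm_add_epi16(v, _mm_loadu_si128((const __m128i*)&counts[1][i]));
        v = _mm_add_epi16(v, _mm_loadu_si128((const __m128i*)&counts[2][i]));
        v = _mm_add_epi16(v, _mm_loadu_si128((const __m128i*)&counts[3][i]));
        _mm_storeu_si128((__m128i*)&total[i], v);
    }
}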

For the loop dealing with the counts:

  • Sink the div out of the loop; x/e + y/e + z/e == (x+y+z)/e. And since your expected is always a power of 2, that shouldn't even change the FP rounding error since the mantissas will all be the same either way! The difference is just how close to overflow vs. underflow you are. (Plenty far, since you're summing a non-huge number of squares of usually small integers.)

    If you did need to divide multiple times by a number known to be a power of 2, compute mult = 1.0f/expected once and multiply by that. This is exact for a power of 2 with binary floating point. (Compilers will do this if expected is truly a compile-time-constant power of 2 after inlining: https://godbolt.org/z/9TGbfGne7)

  • Prefer float over double for accumulating chiSquare for 2x as many elements per vector, and more efficient packed conversion (less shuffling). (Use _mm_cvtepi32_pd or _ps, not _mm_set_pd with two separate scalar doubles!) float only has 24-bit mantissa precision, though, and squaring a 16-bit integer can produce up to 32 significant bits. But in most cases your integer will be smaller, I think, like 12-bit at most for size=4K.

    Or unroll a few iterations with float vectors before summing down to 2 floats in the bottom of a vector (movhlps / addps), convert to double (_mm_cvtps_pd), and add to a __m128d accumulator. Some form of unrolling is good to hide FP latency which would otherwise be the bottleneck for the 2nd loop, once you fix the glaring problems like div inside the loop. See Latency bounds and throughput bounds for processors for operations that must occur in sequence and Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators)

    The second loop's trip count being 256 is nice for unrolling without needing cleanup code.

  • Do the first sub with integer SIMD, if size is a power of 2 >= 256 so expected is a whole number. This takes some pressure off the FP math execution ports which should be the bottleneck, if latency isn't.

    You should even be able to use _mm_madd_epi16 (pmaddwd) to square and sum pairs of 16-bit signed integers, producing a vector of 32-bit integers very efficiently.
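    A tiny standalone illustration of that (with made-up example values; the real use appears in the full function further down): _mm_madd_epi16(diff, diff) turns eight 16-bit diffs into four 32-bit pairwise sums of squares.

#include <emmintrin.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    // Eight 16-bit signed diffs (observed - expected), chosen arbitrarily
    __m128i diff = _mm_setr_epi16(-16, 4080, -16, -16, 3, -5, 0, 100);
    // Each 32-bit lane receives the sum of two squared 16-bit lanes
    __m128i sq_pairs = _mm_madd_epi16(diff, diff);

    int32_t out[4];
    _mm_storeu_si128((__m128i*)out, sq_pairs);
    printf("%d %d %d %d\n", out[0], out[1], out[2], out[3]);
    // prints: 16646656 512 34 10000
    return 0;
}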

You could actually do the whole sum of squares as integer for size up to 65664 (just over 64K elements) without overflowing uint32_t. The worst case (largest sum) is when all the counts are in one bucket, since that makes one diff as large as possible and all the other diffs as negative as possible. For example, with size=4096 we have ((4096-4096/256) ^ 2) = 0xfe0100 in the full bucket, plus 255 other buckets each contributing (0-4096/256)^2 = 0x100, which times 255 is 0xff00. Total 0xff0000 for size=4096. That's actually only 24 bits, so we have plenty of room in ((n-n/256) ^ 2) + 255*(0-n/256)^2 < 2^32.

pmaddwd is a signed multiply, so we need (size - size/256) < 32768 so it fits in an int16_t in the worst case. That does allow size == 32K.
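A quick brute-force sanity check of both cut-offs (throwaway code that just plugs sizes into the worst-case formulas above):

#include <stdint.h>
#include <stdio.h>

// Worst case from the text: all counts land in one bucket, so the sum of squared
// differences is (n - n/256)^2 + 255*(0 - n/256)^2 and the largest diff is n - n/256.
static double worst_sum(double n)  { double e = n / 256.0; return (n - e) * (n - e) + 255.0 * e * e; }
static double worst_diff(double n) { return n - n / 256.0; }

int main(void) {
    uint32_t u32_ok = 0, i16_ok = 0;
    for (uint32_t n = 256; n <= 70000; n++)       // any size at all
        if (worst_sum(n) <= (double)UINT32_MAX) u32_ok = n;
    for (uint32_t n = 256; n <= 70000; n += 256)  // sizes that are multiples of 256
        if (worst_diff(n) <= 32767.0) i16_ok = n;
    printf("sum of squares fits uint32_t up to size %u\n", u32_ok);    // 65664
    printf("diff fits int16_t (for pmaddwd) up to size %u\n", i16_ok); // 32768
    return 0;
}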

SSE2 doesn't have an efficient 32-bit integer multiply, so you might actually be better off converting to float and then double for sizes > 32K. pmuldq, _mm_mul_epi32 (an SSE4.1 instruction; SSE2 only has the unsigned pmuludq, _mm_mul_epu32), is a widening 32x32 => 64-bit multiply that only takes the even elements from the input, the bottom halves of 64-bit chunks. Actually that could be useful for wider counts, combined with pshufd or psrldq (_mm_bsrli_si128(v, 4)), to feed paddq, _mm_add_epi64.

Even if you had SSE4.1 pmulld (_mm_mullo_epi32), it runs as 2 uops on newer Intel CPUs, vs. 1 uop for FP mul which only has 24 bits of mantissa per 32-bit chunk. It's fast on Zen 3 and 4, and on Sandy/Ivy bridge, and earlier Intel that had it. If you have another version for newer CPUs, you might consider SSE4.1 pmulld in the version that only runs on CPUs without AVX2+FMA.

Summing as integer avoids the problem of limited float precision.


For non-tiny inputs, the histogramming will be the major cost. Without AVX-512 for scatter+gather (AVX2 only has gather), there's nothing useful to do with SIMD. You need every element in an integer register by itself, for use in an addressing mode. That loop should just use scalar operations; you're shooting yourself in the foot by manually unpacking to uint16_t with _mm_unpacklo_epi8 and _mm_unpackhi_epi8. (Unless possibly the compiler optimizes away your store/reload and uses 8x pextrw, which isn't the most efficient way, but on some CPUs could help by reducing AGU port pressure.)

Doing fewer wider loads isn't necessarily good, if you still need ALU instructions to chop them up into separate elements for an addressing mode. L1d cache is fast and very good at handling multiple small loads from the same line.

Zero-extending 8-bit to 64-bit happens for free as part of a load into an integer reg like movzx eax, byte [rdi] which costs the same as mov eax, [rdi], and modern x86 CPUs can do 2 or 3 loads per clock from L1d cache (and 1 or 2 stores) (https://agner.org/optimize/ and stuff like https://chipsandcheese.com/2022/11/05/amds-zen-4-part-1-frontend-and-execution-engine/), so there's enough throughput to do an integer load for each byte without stealing back-end cycles from the memory-increment work. And stuff like pextrb or pextrw are 2 uops each.

But AGU (address generation unit) throughput can be a problem, especially with indexed addressing modes for the memory increments on Haswell / Skylake, because the simple store-AGU on port 7 can't handle indexed addressing modes. And Sandy/Ivy Bridge don't have port 7 at all. (Ice Lake fixes this, having 2 dedicated store AGUs separate from the 2 load ports.) To work around this bottleneck, the best bet is scalar code like mov eax, [rdi] to load 4x uint8_t and split it up with movzx edx, al / movzx ecx, ah / shr eax, 16 etc. This shifts some of the unpacking work to ALU instructions, at the cost of front-end throughput. It's inconvenient to try to hand-hold the compiler into making asm like this from C, especially the movzx from AH part might just not happen; you might instead get more shifts that take so much front-end bandwidth that you're worse off than with a bottleneck on load ports.
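In C, that idea looks roughly like the following sketch (histogram_word_loads is a hypothetical helper; whether the compiler actually emits the movzx-from-AH form shown below is up to it):

#include <stdint.h>
#include <string.h>

// One 4-byte load per 4 bytes of input, unpacked with shifts and masks.
// Assumes two count arrays as discussed above and size % 4 == 0.
static void histogram_word_loads(const uint8_t *data, int size, uint16_t observed[2][256])
{
    for (int i = 0; i < size; i += 4) {
        uint32_t chunk;
        memcpy(&chunk, data + i, 4);          // compiles to a single 32-bit load
        observed[0][ chunk        & 0xFF]++;  // byte 0
        observed[1][(chunk >>  8) & 0xFF]++;  // byte 1
        observed[0][(chunk >> 16) & 0xFF]++;  // byte 2
        observed[1][ chunk >> 24        ]++;  // byte 3
    }
}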

I compared using https://uica.uops.info/. The compiler-generated inner loop for the histogram unrolled by 4 (copied from Simon Goater's answer) looks like this:

.L3:
   movzx   ecx, BYTE PTR [rax]
   add     rax, 4
   add     WORD PTR [rsp-120+rcx*2], 1
   movzx   ecx, BYTE PTR [rax-3]
   add     WORD PTR [rsp+392+rcx*2], 1
   movzx   ecx, BYTE PTR [rax-2]
   add     WORD PTR [rsp-120+rcx*2], 1
   movzx   ecx, BYTE PTR [rax-1]
   add     WORD PTR [rsp+392+rcx*2], 1
   cmp     rdi, rax
   jne     .L3

vs. hand-editing it to asm that does one 4-byte load and unpacks with shift and movzx:

.L3:
   mov     ecx, [rax]       ##### 4-byte load
   add     rax, 4
   movzx   edx, cl      # mov-elimination avoids a back-end uop
   add     WORD PTR [rsp-120+rdx*2], 1
   movzx   edx, ch      # has extra latency for reading CH on Skylake and later, but throughput is ok
   add     WORD PTR [rsp+392+rdx*2], 1
   shr     ecx, 16
   movzx   edx, cl
   add     WORD PTR [rsp-120+rdx*2], 1
   movzx   edx, ch
   #shr     ecx, 8    # If reading high-8 registers has any weird effects, minimize it.
   add     WORD PTR [rsp+392+rdx*2], 1
   cmp     rdi, rax
   jne     .L3

Assuming no cache misses or stalls from store-forwarding latency, the analysis from https://uica.uops.info/ looks like this:

                  multiple movzx loads                         ALU unpacking with movzx reg,reg and shifts
Ivy Bridge        7.06 cycles / iter                           7.03 cycles / iter
Skylake           5.97 c / iter (6 uops each for ports 2,3)    5.25 c / iter (nearly a front-end bottleneck)
Rocket Lake       4.00 c / iter (4 uops each for ports 2,3)    4.50 c / iter

If you're tuning for very old CPUs like Nehalem and earlier (1/clock loads), there'd be even more to gain from this. Optimizing SIMD histogram calculation reports a 1.5x speedup from using SSE4.1 pextrb vs. a C loop. They didn't say what CPU they were using in 2015, but that's plausible with Ross's fully unrolled version. pextrb is 2 uops so scalar is still better, though. (For Intel CPUs, ideal asm should use add dword [mem], 1, not inc for memory operands, so it can micro-fuse into fewer uops for the front-end. Compilers should already know about this, or at least GCC does, clang doesn't; I mention it only because of the code in the linked Q&A.)


Manually vectorized version (untested)

I took advantage of GNU C a couple times to write vec + vec instead of _mm_add_ps when it was getting cluttered. If you want portability to MSVC, change that.

#include <immintrin.h>
#include <stdalign.h>
#include <stdint.h>

double getchisquared_manual_sse2(int size, uint8_t *data)
{
    // assert(size > 0);

    // TODO: if size >= 65536, use uint32_t counts instead, unless we know they're well enough distributed
    // also, unrolling over 2 arrays means the max count in any one of them is half the size, but then summing needs to widen first.
    _Alignas(16) uint16_t observed[2][256] = {{0}};

    // Calculate frequency of each byte value
    //const int iterations = size & -4;   // size is apparently always a power of 2?
    const int iterations = size;
    for (int i = 0; i < iterations;) {
        observed[0][data[i++]]++;    // unroll over two arrays to better hide store-forwarding latency
        observed[1][data[i++]]++;    // when there's a region with a lot of the same data[i]
        observed[0][data[i++]]++;
        observed[1][data[i++]]++;
    }
    // check (size & (size-1)) == 0  if power of 2 size isn't guaranteed,
    // Which would guarantee size%256 == 0 for size>=256
#if 0
    // not needed for power-of-2 or multiple-of-256 sizes
    for (int i = iterations; i < size; i++) {
        observed[0][data[i]]++;
    }
#endif

    // Calculate the chi-square statistic        
    if (size >= 256 && size <= 32768) {    // 16-bit integer sum special case
        // size>=256 (or non-zero and size%256==0) gives a whole integer expected value
        //   one range-check is slightly more efficient than size%256==0 && size <= 32768
        //   also this lets the compiler know it's signed-positive, allowing /256 to just be a shift.
        // sum of squares fit in a uint32_t result even in the worst case of all counts in one bucket:
        //  ((n-n/256) ^ 2) + 255*(0-n/256)^2 fits in u32 even up to n=64K + 225
        //  and max diff = size - size/256 fits in *signed* int16_t size <= 32768
        // size = 32768+256 would give a max diff of 0x807f, so 32768 is the top

        int expected = size / 256;       // a whole number that can be an integer
        __m128i vchisquare = _mm_setzero_si128();
        //double expected = (double)size / 256.0f; // Constant expected value
        for (int i = 0; i < 256; i += 8) {
            __m128i obs0 = _mm_load_si128((const __m128i*)&observed[0][i]);
            __m128i obs1 = _mm_load_si128((const __m128i*)&observed[1][i]);
            __m128i diff = _mm_add_epi16(obs0, obs1);
            diff = _mm_sub_epi16(diff, _mm_set1_epi16(expected));  // with AVX, prefer -expected + obs0 + obs1 to encourage a memory-source vpaddw instead of a separate vmovdqa.  Although it will un-laminate on Intel unless the compiler uses an indexed addressing mode
            vchisquare = _mm_add_epi32(vchisquare, _mm_madd_epi16(diff,diff));
        }
        __m128i high = _mm_shuffle_epi32(vchisquare, _MM_SHUFFLE(1,0, 3,2));  // high half within vector
        vchisquare = _mm_add_epi32(vchisquare, high);
                high = _mm_shuffle_epi32(vchisquare, _MM_SHUFFLE(2,3, 0,1));  // high half within pairs
        vchisquare = _mm_add_epi32(vchisquare, high);
        double chiSquare = _mm_cvtsd_f64( _mm_cvtepi32_pd(vchisquare) );  // low scalar of packed i32 to double conversion
        // signed conversion is safe with size <= 32768
        return chiSquare / (double)expected;        
    } else {
        // 16-bit integer across count arrays, convert to i32 then f32
        // sum 8 vectors of f32 (from 4 vecs of u16) down to one vector; convert to f64 and accumulate
        __m128d vchisquare = _mm_setzero_pd();
        const __m128 vexpected = _mm_set1_ps(size / 256.0f);
        for (int i = 0; i < 256; ) {
            //const int unroll = 4;  // C doesn't treat this as fully constant, unlike C++
            #define unroll 1
            __m128 diffs[unroll+1] = {};
            for (int j = 0 ; j<unroll; j++){
                __m128i obs0 = _mm_load_si128((const __m128i*)&observed[0][i + 8*j]);
                __m128i obs1 = _mm_load_si128((const __m128i*)&observed[1][i + 8*j]);
                __m128i obs = _mm_add_epi16(obs0, obs1);
                // if expected is still an integer, but we can't use pmaddwd because size is too large, like 64K
                // then diff = _mm_sub_epi16(obs, _mm_set1_epi16(expected));  before unpacking.
                //  Or epi32 for even larger size that requires 32-bit count buckets.
                __m128 obs_lo = _mm_cvtepi32_ps(_mm_unpacklo_epi16(obs, _mm_setzero_si128()));
                __m128 obs_hi = _mm_cvtepi32_ps(_mm_unpackhi_epi16(obs, _mm_setzero_si128()));
                __m128 diff_lo = obs_lo - vexpected;
                __m128 diff_hi = obs_hi - vexpected;
                diffs[j] = _mm_add_ps(_mm_mul_ps(diff_lo,diff_lo), _mm_mul_ps(diff_hi,diff_hi));
            }
            i += 8*unroll;
            //diffs[0] += diffs[2];
            //diffs[1] += diffs[3];
            //diffs[0] += diffs[1];
            //diffs[1] = _mm_movehl_ps(diffs[1], diffs[0]);  // extract high 2 elements, avoiding an extra movaps
            diffs[1] = _mm_movehl_ps(diffs[0], diffs[0]);
            diffs[0] += diffs[1];
            vchisquare = _mm_add_pd(vchisquare, _mm_cvtps_pd(diffs[0]) );  // double accumulator
        }
        double chiSquare = _mm_cvtsd_f64(vchisquare) + _mm_cvtsd_f64(_mm_unpackhi_pd(vchisquare,vchisquare)); // ideally movhlps into another vector
        return chiSquare * (256.0/size);  // compilers optimize x / (size/256.0) to this anyway.
    }
}

See it with GCC and clang on Godbolt, along with a version that tries to auto-vectorize to the same code.
The integer-only summing part of the function compiles as desired with GCC:

.L16:
  movdqa  xmm0, XMMWORD PTR [rdi+512]   # second array of counts
  paddw   xmm0, XMMWORD PTR [rdi]       # first array of counts
  add     rdi, 16
  psubw   xmm0, xmm2       # subtract expected
  pmaddwd xmm0, xmm0       # dot-product with itself
  paddd   xmm0, xmm1       # accumulate 32-bit pairs
  movdqa  xmm1, xmm0       # stupid compiler, this is useless
  cmp     rax, rdi
  jne     .L16
# then hsum xmm1 and convert that scalar to double

uiCA predicts Ivy Bridge, Skylake, and Rocket Lake (same as Ice Lake) will all run this at about 1 iteration (8 elements) per 2 cycles, surprisingly a bit slower on RKL. So we should expect only 64 cycles to handle all 256 buckets. On a 4GHz CPU, that's 16 nanoseconds.


The FP version for larger size still only has to handle 256 count buckets, so we shouldn't go too nuts unrolling it. Starting with a vector of 8 uint16_t and unpacking to 2 vectors of float already provides plenty of instruction-level parallelism to avoid a bottleneck on the 4-cycle latency of addpd. (I had been going to be way more aggressive with an unroll, which amortizes the loop overhead a bit more, but doesn't make a big difference to cycles per element. And we only need this version for large sizes where histogramming dominates this second loop. Unless we often have very small sizes like 128.)

                  unroll by one (high/low pair, as above)    by 2 (4 float vecs)    by 4 (8 float vecs)
Ivy Bridge        7.16 cycles / iter (predicted)             12.8 c / iter          25 c / iter
Skylake           5.14 c / iter                              9.0 c / iter           17.0 c / iter
Rocket Lake       5.19 c / iter                              9.04 c / iter          17.0 c / iter

We're still doing better than 1 element per cycle even with the lowest unroll factor, and more unrolling only improves the cycles / element a bit.

I haven't really looked at a version that goes straight from int to double, avoiding float. That would save the cvtps2pd instructions, but require at least a mulpd and addpd each instead, more if we unroll more.


I attempted a version in pure C that can auto-vectorize. I was able to get GCC to vectorize the integer version with pmaddwd, but clang has some trouble with it. (With AVX2, it's doing a YMM add after vpmaddwd xmm. And without AVX it's zeroing half the elements, wasting half the throughput). The float loop, for large sizes, wasn't compiling well at all; see my attempt in the Godbolt link.

The integer part looks like this:

    if (size >= 256 && size <= 32768) {
        int16_t expected = size / 256;
        int chiSquare = 0;      // with size <= 32768, the worst case won't overflow a signed 32-bit int, so we can let compilers do packed 32-bit int to float conversion instead of copying to scalar integer and back for scalar int64 to float (without AVX-512 for unsigned conversion)
        //double expected = (double)size / 256.0f; // Constant expected value
        for (int i = 0; i < 256; ) {
            // 16-bit types allow the compiler to use psubw and pmaddwd when summing to int32_t.
            // GCC does quite well; clang fails to use pmaddwd unless SSE4.1 is enabled, making bad code that shuffles a ton.
            // I think clang's internals have something wrong for pmaddwd, since with AVX2 enabled it's using a YMM sum after a pmaddwd XMM, like it thinks PMADDWD widens to a wider vector?
            int16_t diff = (int16_t)(observed[0][i] + observed[1][i]) - expected;
            i += 1;
            chiSquare += (diff * (int)diff);  // pmaddwd is a 32-bit sum of two 16-bit products
        }
        return chiSquare / (double)expected;
    } // else: the float path for larger sizes (not shown here)

Note the use of casts and assignments to narrow variables to tell the compiler that I want results truncated to 16-bit except for the multiply. (C doesn't have a widening multiply; you express it by casting narrow inputs. That already happens implicitly for types narrower than int, but being explicit is probably good here.)

GCC13 -O2 auto-vectorizes to a pmaddwd loop with the same asm as the intrinsics version, but without the wasted movdqa. (Yes, even with -O2, where the "cost model" threshold is "very cheap"; earlier GCC didn't enable auto-vectorization at all at -O2.)

With -march=x86-64-v3 (AVX2+FMA), we get 256-bit vpmaddwd.


"I am seeing about 70-90 ms performance improvement"

On what CPU model, with what compiler? And out of what total time, i.e. is it 2x faster? Is it 1.2x faster?

The whole calculation should take only a couple to a few microseconds for a size=4K input. (Or are you doing it in a repeat loop over the same array to get a longer interval to time?)
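For reference, a bare-bones way to time it with a repeat loop (a sketch; the repeat count, CLOCK_MONOTONIC, and the getchisquared prototype are placeholders for whichever version is being tested). Run it once per version and compare the per-call times.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

double getchisquared(int size, uint8_t *data);   // whichever version is under test

int main(void) {
    enum { SIZE = 4096, REPS = 100000 };
    uint8_t *data = malloc(SIZE);
    for (int i = 0; i < SIZE; i++) data[i] = rand() & 0xFF;

    volatile double sink = 0;                    // keep the calls from being optimized out
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);         // POSIX clock_gettime
    for (int r = 0; r < REPS; r++)
        sink += getchisquared(SIZE, data);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("%.1f ns per call (mean chiSquare=%f)\n", ns / REPS, sink / REPS);
    free(data);
    return 0;
}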