I'm a newbie for coding using AVX512 instructions. My machine is Intel KNL 7250. I am trying to use AVX512 instructions to find the INDEX of the element having maximum absolute value, which double precision and size of array % 8 = 0. But it prints an output index = 0 every time. I don't know where is a problem, please help me. Also, how to use printf for __m512i type?
Thanks.
Code:
void main()
{
int i;
int N=160;
double vec[N];
for(i=0;i<N;i++)
{
vec[i]=(double)(-i) ;
if(i==10)
{
vec[i] = -1127;
}
}
int max = avxmax_(N, vec);
printf("maxindex=%d\n", max);
}
int avxmax_(int N, double *X )
{
// return the index of element having maximum absolute value.
int maxindex, ix, i, M;
register __m512i increment, indices, maxindices, maxvalues, absmax, max_values_tmp, abs_max_tmp, tmp;
register __mmask8 gt;
double values_X[8];
double indices_X[8];
double maxvalue;
maxindex = 1;
if( N == 1) return(maxindex);
M = N % 8;
if( M == 0)
{
increment = _mm512_set1_epi64(8); // [8,8,8,8,8,8,8,8]
indices = _mm512_setr_epi64(0, 1, 2, 3, 4, 5, 6, 7);
maxindices = indices;
maxvalues = _mm512_loadu_si512(&X[0]);
absmax = _mm512_abs_epi64(maxvalues);
for( i = 8; i < N; i += 8)
{
// advance scalar indices: indices[0] + 8, indices[1] + 8,...,indices[7] + 8
indices = _mm512_add_epi64(indices, increment);
// compare
max_values_tmp = _mm512_loadu_si512(&X[i]);
abs_max_tmp = _mm512_abs_epi64(max_values_tmp);
gt = _mm512_cmpgt_epi64_mask(abs_max_tmp, absmax);
// update
maxindices = _mm512_mask_blend_epi64(gt, maxindices, indices);
absmax = _mm512_max_epi64(absmax, abs_max_tmp);
}
// scalar part
_mm512_storeu_si512((__m512i*)values_X, absmax);
_mm512_storeu_si512((__m512i*)indices_X, maxindices);
maxindex = indices_X[0];
maxvalue = values_X[0];
for(i = 1; i < 8; i++)
{
if(values_X[i] > maxvalue)
{
maxvalue = values_X[i];
maxindex = indices_X[i];
}
}
return(maxindex);
}
}
Your function returns
0
because you're treating the int64 index as the bit-pattern for adouble
, and converting that (tiny) number to an integer.double indices_X[8];
is the bug; should beuint64_t
. There are other bugs, see below.This bug is easier to spot if you declare variables as you use them, C99 style, not obsolete C89 style.
You
_mm512_storeu_si512
the vector ofint64_t
indices intodouble indices_X[8]
, type-punning it to double, then in plain C doint maxindex = indices_X[0];
. This is implicit type-conversion, converting that subnormaldouble
to an integer.(I noticed a mysterious
vcvttsd2si
FP->int conversion in the asm https://godbolt.org/z/zsfc36 while converting the code to C99 style variable declarations next to initializers. That was a clue: there should be no FP->int conversion in this function. I noticed that around the same time I was moving thedouble indices_X[8];
declaration down into the block that uses it, and noticing it had typedouble
.)It is actually possible to use integer operations on FP bit-patterns
But only if you use the right ones! IEEE754 exponent biases mean that the encoding / bit-pattern can be compared as a sign/magnitude integer. So you can do abs / min / max and compare on it, but not of course integer add / sub (unless you're implementing
nextafter
)._mm512_abs_epi64
is 2's complement abs, not sign-magnitude. Instead, you must just mask off the sign bit. Then you're all set to treat the result as an unsigned integer or signed-2's-complement. (Either works because the high bit is clear.)Using integer max has the interesting property that NaNs will compare higher than any other value, Inf below that, then finite values. So we get a NaN-propagating max-finder basically for free.
On KNL (Knight's Landing), FP
vmaxpd
andvcmppd
have the same performance as their integer equivalents: 2 cycle latency, 0.5c throughput. (https://agner.org/optimize/). So your way has zero advantage on KNL, but it's a neat trick for mainstream Intel, like Skylake-X and IceLake.Bugfixed optimized version:
Use
size_t
for return type and loop counters / indices to handle potentially huge arrays, instead of a random mix ofint
and 64-bit vector elements. (uint64_t
for the temp array that collects the horizontal-max: it's always 64-bit even in a build with 32-bit pointers / size_t.)bugfix: return
0
on N==1, not 1: the index of the only element is 0.bugfix: return
-1
onN%8 != 0
, instead of falling off the end of the non-void function. (Undefined behaviour if the caller uses the result in C, or as soon as execution falls off the end in C++).bugfix: abs of an FP value = clear the sign bit, not 2's complement abs on the bit-pattern
sort of bugfix: use unsigned integer compare and max, so it would work for 2's complement integers with
_mm512_abs_epi64
(which produces an unsigned result; remember that-LONG_MIN
overflows toLONG_MIN
if you keep treating it as signed).style improvement:
if (N%8 != 0) return -1;
instead of putting most of the body in an if block.style improvement: declare vars when they're first used, and removed some unused ones that were pure noise. This is idiomatic for C since C99, which was standardized over 20 years ago.
style improvement: use simpler names for tmp vector vars that just hold a load result. Sometimes you just need a tmp var because intrinsic names are so long that you don't want to type
_mm...load...
as an arg for another intrinsics. A name likev
scoped to the inner loop is a clear sign it's just a placeholder, not used later. (This style works best when you're declaring it as you init it, so it's easy to see it can't be used in an outer scope.)optimization: reduce 8 -> 4 elements after the loop with SIMD: extract the high half, combine with existing low half. (Same as you would for a simpler horizontal reduction like sum or max). Inconvenient when we need instructions that only AVX512 has, but KNL doesn't have AVX512VL, so we have to use the 512-bit version and ignore the high garbage. But KNL does have AVX1 / AVX2 so we can still store 256-bit vectors and do some things.
Using a merge-masking
_mm512_mask_extracti64x4_epi64
extract to blend the high half directly the low half of the same vector is a cool trick which compilers don't find if you use a 512-bit mask-blend. :Psort of bugfix: in C,
main
has a return type ofint
in hosted implementations (running under an OS).On Godbolt: the main loop from GCC10.2 -O3 -march=knl is 8 instructions. So even if (best case) KNL could decode and run it at 2/clock, it's still taking 4 cycles per vector. You can run the program on Godbolt; it runs on Skylake-X servers so it can run AVX512 code. You can see it prints
10
.Another option for the loop would be a masked
vpbroadcastq zmm3{k1}, rax
, adding the[0..7]
per-element offsets after the loop. That would actually save thevpaddq
in the loop, and we have the righti
in a register if GCC is going to use an indexed addressing-mode anyway. (That's not good on Skylake-X; defeats micro-fusion of the memory-sourcevpandd
.)Agner Fog doesn't list performance for GP->vector broadcasts, but hopefully it's only single-uop on KNL at least. (And https://uops.info/ doesn't have KNL or KNM results).
Branchy strategy: when a new max is very rare
If you expect finding a new max to be very rare (e.g. array is large and uniformly distributed, or at least not trending upwards), it could be even faster to broadcast the current max and branch on finding any greater vector element.
Finding a new max means branching out of the loop (which probably mispredicts, so that's slow) and broadcasting that element (probably with a
tzcnt
to find the element index, then a broadcast-load, and update the index).Especially with KNL's 4-way SMT to hide branch miss costs, this could be an overall throughput win for large arrays; fewer instructions per element on average.
But probably significantly worse for inputs that do trend upwards, so a new max is found O(n) times on average, not sqrt(n) or log(n) or whatever frequency a uniform distribution would give us.
PS: to print vectors, store to an array and reload the elements. print a __m128i variable
Or use a debugger to show you their elements.