AVX512 perform AND of 512bits of 8-bit chars

85 Views Asked by user997112 At 22 March 2024 at 00:21

I'd like to AND two vectors of 512 bits containing 8 bit elements.

Looking at the Intel Intrinsics Guide I can see some 512-bit AND operations:

__m512i _mm512_and_epi32 (__m512i a, __m512i b)
__m512i _mm512_and_epi64 (__m512i a, __m512i b)

but nothing for epi8 (or epi16).

Is it safe to use the epi64? My only hesitation is why they have provided both epi32 and epi64, presumably both could use epi32. Performance reasons?

Original Q&A

There are 1 best solutions below

Peter Cordes On 22 March 2024 at 00:35 BEST ANSWER

Both are just simple bitwise AND; you can use either on any data.
Or better, use _mm512_and_si512 which has the desired semantic meaning.

In asm, vpandd and vpandq can be used with masking at 32-bit or 64-bit granularity, respectively. Masking is the only reason for having separate opcodes, unlike with AVX2 and earlier where there was just vpand (_mm256_and_si256 and _mm_and_si128).

Without a mask, there's no significance to the element width. The only reason for _mm512_and_epi32 and epi64 to exist at all is for consistency with _mm512_mask[z]_and_epi[32|64].

_mm512_and_si512 exists, and will compile to either vpandq or vpandd.
The intrinsics guide says it's nominally an intrinsic for vpandd.
IIRC, most compilers favour wider elements and will pick vpandq like how they use vmovdqa64 for _mm512_load_si512. If they don't fold it into a vpternlogq with some other bitwise booleans on the same data.

AVX512BW added EVEX versions of instructions like vpaddb where element width matters even without masking. But didn't add byte or word mask widths for bitwise booleans, only vmovdqu8 / vmovdqu16 (and vpblendmb/vpblendmw) for separate load, store, or reg-reg blend (merge-masking) or zero-masking.

For 128 and 256-bit vector widths, hopefully most compilers will use AVX2 vpand for _mm256_and_epi32 if the data is in YMM0-15 (instead of YMM16-31).

Fun fact: vandps/pd wasn't part of AVX512F (foundation), only integer vpandd/q were in that. The FP versions were added as part of AVX512DQ.
(Xeon Phi is the only real hardware that has AVX512F without AVX512BW and DQ, and fewer redundant opcodes saves transistors in the decoders I guess, and I guess it didn't care about separate SIMD-int vs. FP domains for bypass forwarding. AVX-512 was an adaptation of the vector ISA developed for Larrabee and sold commercially in first-gen Xeon Phi, Knight's Corner).

AVX512 perform AND of 512bits of 8-bit chars

There are 1 best solutions below

Related Questions in C++

Related Questions in X86

Related Questions in BITWISE-OPERATORS

Related Questions in INTRINSICS

Related Questions in AVX512

Trending Questions

Popular # Hahtags

Popular Questions