I'd like to AND two 512-bit vectors containing 8-bit elements.
Looking at the Intel Intrinsics Guide I can see some 512-bit AND operations:
__m512i _mm512_and_epi32 (__m512i a, __m512i b)
__m512i _mm512_and_epi64 (__m512i a, __m512i b)
but nothing for epi8 (or epi16).
Is it safe to use the epi64 version? My only hesitation is why they have provided both epi32 and epi64; presumably both cases could use epi32. Performance reasons?
Both are just simple bitwise AND; you can use either on any data.
Or better, use
_mm512_and_si512, which has the desired semantic meaning.

In asm,
vpandd and vpandq can be used with masking at 32-bit or 64-bit granularity, respectively. Masking is the only reason for having separate opcodes, unlike with AVX2 and earlier where there was just vpand (_mm256_and_si256 and _mm_and_si128).

Without a mask, there's no significance to the element width. The only reason for
_mm512_and_epi32 and epi64 to exist at all is for consistency with _mm512_mask[z]_and_epi[32|64].

_mm512_and_si512 exists, and will compile to either vpandq or vpandd. The intrinsics guide says it's nominally an intrinsic for
vpandd. IIRC, most compilers favour wider elements and will pick
vpandq, like how they use vmovdqa64 for _mm512_load_si512, if they don't fold it into a vpternlogq with some other bitwise booleans on the same data.

AVX512BW added EVEX versions of instructions like
vpaddb where element width matters even without masking. But it didn't add byte or word mask widths for bitwise booleans, only vmovdqu8 / vmovdqu16 (and vpblendmb / vpblendmw) for a separate load, store, or reg-reg blend (merge-masking) or zero-masking.

For 128 and 256-bit vector widths, hopefully most compilers will use AVX2
vpand for _mm256_and_epi32 if the data is in YMM0-15 (instead of YMM16-31).

Fun fact:
vandps / vandpd weren't part of AVX512F (foundation); only integer vpandd / vpandq were in that. The FP versions were added as part of AVX512DQ.

(Xeon Phi is the only real hardware that has AVX512F without AVX512BW and DQ. Fewer redundant opcodes saves transistors in the decoders, I guess, and I guess it didn't care about separate SIMD-int vs. FP domains for bypass forwarding. AVX-512 was an adaptation of the vector ISA developed for Larrabee and sold commercially in first-gen Xeon Phi, Knights Corner.)
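To make the "element width only matters with a mask" point concrete, here is a minimal sketch in C. It assumes GCC or Clang (it relies on `__attribute__((target("avx512f")))` and `__builtin_cpu_supports`), and the helper names `and_bytes_scalar`, `and_bytes_avx512`, `masked_and_epi32`, and `and_variants_agree` are made up for this illustration. It ANDs 64 bytes with each of the three unmasked intrinsics and compares against a scalar loop, then does one merge-masked AND to show the 32-bit granularity. On a CPU without AVX512F (or a non-x86 target) only the scalar path runs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
#define HAVE_X86 1
#else
#define HAVE_X86 0
#endif

/* Portable reference: AND two byte arrays element by element. */
static void and_bytes_scalar(const uint8_t *a, const uint8_t *b,
                             uint8_t *out, size_t n) {
    for (size_t i = 0; i < n; i++) out[i] = a[i] & b[i];
}

#if HAVE_X86
/* Unmasked 512-bit AND on byte data.
 * which: 0 = _mm512_and_si512, 1 = _mm512_and_epi32, 2 = _mm512_and_epi64.
 * n must be a multiple of 64 bytes (one ZMM register). */
__attribute__((target("avx512f")))
static void and_bytes_avx512(const uint8_t *a, const uint8_t *b,
                             uint8_t *out, size_t n, int which) {
    assert(n % 64 == 0);
    for (size_t i = 0; i < n; i += 64) {
        __m512i va = _mm512_loadu_si512(a + i);
        __m512i vb = _mm512_loadu_si512(b + i);
        __m512i r;
        switch (which) {
        case 1:  r = _mm512_and_epi32(va, vb); break;
        case 2:  r = _mm512_and_epi64(va, vb); break;
        default: r = _mm512_and_si512(va, vb); break;
        }
        _mm512_storeu_si512(out + i, r);
    }
}

/* Merge-masked AND: each mask bit controls one 32-bit element (4 bytes).
 * Elements whose bit is clear keep the value from a. */
__attribute__((target("avx512f")))
static void masked_and_epi32(const uint8_t *a, const uint8_t *b,
                             uint8_t *out, uint16_t mask) {
    __m512i va = _mm512_loadu_si512(a);
    __m512i vb = _mm512_loadu_si512(b);
    __m512i r = _mm512_mask_and_epi32(va, (__mmask16)mask, va, vb);
    _mm512_storeu_si512(out, r);
}
#endif

/* Returns 1 if every available variant matches the scalar reference. */
int and_variants_agree(void) {
    uint8_t a[64], b[64], ref[64], out[64];
    for (int i = 0; i < 64; i++) {
        a[i] = (uint8_t)(i * 37 + 11);
        b[i] = (uint8_t)(0xF0 ^ i);
    }
    and_bytes_scalar(a, b, ref, 64);

#if HAVE_X86
    if (__builtin_cpu_supports("avx512f")) {
        /* Without a mask, all three intrinsics give the same bytes. */
        for (int which = 0; which < 3; which++) {
            memset(out, 0, sizeof out);
            and_bytes_avx512(a, b, out, 64, which);
            if (memcmp(out, ref, 64) != 0) return 0;
        }
        /* With mask 0x0001, only dword 0 (bytes 0..3) is ANDed. */
        masked_and_epi32(a, b, out, 0x0001);
        for (int i = 0; i < 64; i++) {
            uint8_t expected = (i < 4) ? ref[i] : a[i];
            if (out[i] != expected) return 0;
        }
    }
#endif
    return 1;
}
```

Because the AVX-512 functions carry their own target attribute, this should build with a plain `gcc -O2 file.c` and no `-mavx512f` flag; the runtime check decides whether they are ever called. Looking at the compiler's asm output for `and_bytes_avx512` is a quick way to see which of vpandd / vpandq your compiler actually picks.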