Despite the infamous penalty for mixing SSE and AVX encodings (see "Why is this SSE code 6 times slower without VZEROUPPER on Skylake?"), there may still be a need to mix 128-bit and 256-bit operations.
The penalty can be avoided either by always using AVX (VEX) encoding, even for 128-bit operations, or by adding vzeroupper before any SSE-encoded code.
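For intrinsics code, the second option corresponds to the _mm256_zeroupper intrinsic. A minimal sketch (a hypothetical function, assuming legacy-SSE code may run after it returns; with optimizations, MSVC can also insert vzeroupper automatically, as described below):

```cpp
#include <immintrin.h>

// Horizontal sum of 8 floats using AVX, followed by an explicit VZEROUPPER so
// that legacy-SSE code executed afterwards does not hit the transition penalty.
float sum8(const float* p)
{
    __m256 v  = _mm256_loadu_ps(p);
    __m128 lo = _mm256_castps256_ps128(v);
    __m128 hi = _mm256_extractf128_ps(v, 1);
    __m128 s  = _mm_add_ps(lo, hi);
    s = _mm_add_ps(s, _mm_movehl_ps(s, s));
    s = _mm_add_ss(s, _mm_shuffle_ps(s, s, 1));
    float result = _mm_cvtss_f32(s);
    _mm256_zeroupper();   // emit vzeroupper before any SSE-encoded code runs
    return result;
}
```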
For compiler-generated code, if AVX is enabled, the compiler assumes that AVX is available and uses AVX encoding. For every function that can be called externally, it inserts vzeroupper at the end.
MSVC, however, allows generating AVX code without AVX enabled, via direct use of intrinsics (unlike some other compilers, which require an AVX-enabling option to use AVX intrinsics).
How would it avoid mixing SSE and AVX if both kinds of intrinsics are used in a single function?
The compiler would use AVX encoding after the first AVX intrinsic. For example, consider a function where _mm_cvtsi32_si128 is used both before and after a 256-bit AVX intrinsic.
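A minimal sketch of such a function (the 256-bit intrinsic here, _mm256_add_epi32, is an arbitrary choice; the exact function in the Godbolt demo below may differ):

```cpp
#include <immintrin.h>

__m128i mix128(int a, int b, const __m256i* p)
{
    __m128i lo  = _mm_cvtsi32_si128(a);      // before any AVX intrinsic
    __m256i sum = _mm256_add_epi32(*p, *p);  // first AVX intrinsic in the function
    __m128i hi  = _mm_cvtsi32_si128(b);      // after the first AVX intrinsic
    return _mm_add_epi32(_mm_add_epi32(lo, hi),
                         _mm256_castsi256_si128(sum));
}
```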
The first _mm_cvtsi32_si128 would be encoded as movd and the second as vmovd, and vzeroupper would be inserted at the end.

The compiler will use AVX encoding from the beginning if a parameter is passed via an AVX register (this happens with the __vectorcall calling convention). Similarly, if a __m256i is returned, vzeroupper will not be inserted at the end.
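For illustration, two hypothetical functions of that shape (not taken from the Godbolt demo below):

```cpp
#include <immintrin.h>

// A __m256i parameter passed with __vectorcall arrives in a YMM register,
// so AVX state is live from the very first instruction of the function.
__m128i __vectorcall from_ymm_param(__m256i y, int a)
{
    __m128i x = _mm_cvtsi32_si128(a);   // expected to be vmovd rather than movd
    return _mm_add_epi32(x, _mm256_castsi256_si128(y));
}

// Returning a __m256i keeps the upper YMM halves live at the return,
// so no vzeroupper is expected before ret.
__m256i widen(__m128i x)
{
    // duplicate x into both 128-bit halves of a 256-bit result
    return _mm256_insertf128_si256(_mm256_castsi128_si256(x), x, 1);
}
```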
This does not apply to unoptimized compilation. With /Od or without any /O... option, the compiler just uses the minimum encoding level for each instruction, and it also does not insert vzeroupper at the end.

Godbolt's compiler explorer demo.
Unfortunately, this does not always work
In this issue it was discussed that in some situations MSVC still emits non-VEX-encoded SSE in AVX code, making the compiler mix paddd and vmovups.

Godbolt's compiler explorer demo.
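The shape of code involved is an SSE integer intrinsic used next to 256-bit AVX loads and stores. The sketch below is only hypothetical (it is not the repro from the issue, and whether a particular MSVC version mixes encodings for this exact function may differ), but with the bug the addition can come out as legacy paddd while the surrounding moves are VEX-encoded vmovups:

```cpp
#include <immintrin.h>

void add_and_copy(float* dst, const float* src, __m128i* acc, const __m128i* delta)
{
    *acc = _mm_add_epi32(*acc, *delta);  // should be vpaddd, but may be emitted as legacy paddd
    __m256 v = _mm256_loadu_ps(src);     // VEX-encoded vmovups
    _mm256_storeu_ps(dst, v);            // VEX-encoded vmovups
}
```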
I've created Developer Community issue 10618264.