Let's say I have a pointer to a bunch of uint8_t's in RDI and I want to load 4 uint8_ts into XMM0 and use SIMD instructions to multiply it with XMM1 where I have 4 float values stored.
How can I load the initial 4 uint8_ts into XMM0 so that they are always "aligned", meaning that each 32-bit "compartment" has the uint8_t in its lower 8 bits and its upper 24 bits are 0? Is there an instruction for that?
I hope my issue is understandable and I am sorry for my very naive explanation of my issue.
movdqu xmm0, [rdi]
would load 16 contiguous bytes packed next to each other, which is not what I need.
For simplicity I ignore the floating-point multiplication. I assume using `mulps` isn't really that hard. The real challenge is the conversion, if you can't use fixed-point 16-bit integers instead.

The Intel intrinsics actually come with an intrinsic that expands into a significant sequence of operations just for that: `_mm_cvtpu8_ps`. But that's for MMX+SSE1, isn't a single instruction, and compiles very inefficiently with modern compilers (see footnote 1). In the early days of Intel's intrinsics, they provided more "helper function" intrinsics beyond the `_mm_set` ones, with the same naming scheme as the wrappers for single instructions.

For SSE2 there is no straightforward operation. A manual unpack sequence against a zeroed register works well: `punpcklbw` followed by `punpcklwd`.
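As a minimal sketch of that SSE2 unpack sequence, here it is written with C intrinsics, with the instruction each one normally maps to noted in the comments. The function names (`load_u8x4_zext_sse2`, `zext_u8x4_sse2`, `scale_u8x4_sse2`) are made up for illustration, and the register choices in the comments are only an assumption about what a compiler might emit.

```c
#include <emmintrin.h>  /* SSE2 intrinsics (also pulls in SSE1) */
#include <stdint.h>
#include <string.h>

/* Load 4 bytes from src and zero-extend each one into its own 32-bit lane. */
static inline __m128i load_u8x4_zext_sse2(const uint8_t *src)
{
    int32_t tmp;
    memcpy(&tmp, src, sizeof tmp);              /* safe unaligned 4-byte load */
    __m128i v    = _mm_cvtsi32_si128(tmp);      /* movd      xmm0, [rdi]  */
    __m128i zero = _mm_setzero_si128();         /* pxor      xmm2, xmm2   */
    v = _mm_unpacklo_epi8(v, zero);             /* punpcklbw xmm0, xmm2 : bytes -> words  */
    v = _mm_unpacklo_epi16(v, zero);            /* punpcklwd xmm0, xmm2 : words -> dwords */
    return v;
}

/* Hypothetical helper: write the four zero-extended lanes to out. */
void zext_u8x4_sse2(const uint8_t *src, uint32_t out[4])
{
    _mm_storeu_si128((__m128i *)out, load_u8x4_zext_sse2(src));
}

/* Hypothetical helper: convert the four bytes to float and multiply
   them lane by lane with four float factors. */
void scale_u8x4_sse2(const uint8_t *src, const float scale[4], float out[4])
{
    __m128 f = _mm_cvtepi32_ps(load_u8x4_zext_sse2(src)); /* cvtdq2ps xmm0, xmm0 */
    f = _mm_mul_ps(f, _mm_loadu_ps(scale));               /* mulps    xmm0, xmm1 */
    _mm_storeu_ps(out, f);
}
```

Because the upper 24 bits of every lane are zero after the unpacks, the signed `cvtdq2ps` conversion gives the same result as an unsigned one would for values 0 to 255.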
SSSE3 can use a `pshufb` to implement the zero-extension.

And finally, SSE4.1 gave us the proper instruction with `pmovzxbd`.

Footnote 1: using MMX+SSE1 `_mm_cvtpu8_ps`
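Going back to the SSSE3 and SSE4.1 options mentioned before the footnote, the following is a rough sketch of both using C intrinsics. It assumes GCC or Clang, where the `target` function attribute enables the extension per function without needing `-mssse3` or `-msse4.1`; the function names are hypothetical, and the instructions in the comments are what these intrinsics usually compile to rather than a guarantee.

```c
#include <immintrin.h>  /* umbrella header: SSE2, SSSE3, SSE4.1, ... */
#include <stdint.h>
#include <string.h>

/* SSSE3: one pshufb with a control mask does the whole zero-extension.
   A control byte with its high bit set (-1 = 0xFF) writes a zero byte. */
__attribute__((target("ssse3")))
void zext_u8x4_ssse3(const uint8_t *src, uint32_t out[4])
{
    int32_t tmp;
    memcpy(&tmp, src, sizeof tmp);
    __m128i v = _mm_cvtsi32_si128(tmp);           /* movd   xmm0, [rdi]    */
    const __m128i ctrl = _mm_setr_epi8(0, -1, -1, -1,
                                       1, -1, -1, -1,
                                       2, -1, -1, -1,
                                       3, -1, -1, -1);
    v = _mm_shuffle_epi8(v, ctrl);                /* pshufb xmm0, [ctrl]   */
    _mm_storeu_si128((__m128i *)out, v);
}

/* SSE4.1: pmovzxbd zero-extends the low 4 bytes to 4 dwords directly. */
__attribute__((target("sse4.1")))
void zext_u8x4_sse41(const uint8_t *src, uint32_t out[4])
{
    int32_t tmp;
    memcpy(&tmp, src, sizeof tmp);
    /* Ideally the load folds into: pmovzxbd xmm0, dword [rdi] */
    __m128i v = _mm_cvtepu8_epi32(_mm_cvtsi32_si128(tmp));
    _mm_storeu_si128((__m128i *)out, v);
}
```

Both functions should produce the same four 32-bit lanes as the SSE2 unpack approach; the difference is only in instruction count and in which CPUs can run them.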