I'm currently working on some old assembly code, and MASM errors out on this line:
punpcklwd MM3, MMWORD PTR [8+EBP+ECX*2]
It gives me: error A2070: invalid instruction operands
But this should be valid, right? The disassembly of a compiled copy is basically identical to this.
Also, according to this Intel PDF, this is how it's supposed to be written: https://www.intel.com/content/dam/develop/external/us/en/documents/mmx-app-mpeg1-audio-kernels-140701.pdf
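For the MMX form, the memory source operand is a 32-bit DWORD, not an MMWORD or QWORD.
See Intel's asm manual entry for PUNPCKLBW/PUNPCKLWD/PUNPCKLDQ/PUNPCKLQDQ, which lists the MMX form as PUNPCKLWD mm, mm/m32: only the low two words of the source get interleaved into the destination.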
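Unfortunately, the same is not true for the XMM version (PUNPCKLWD xmm1, xmm2/m128): it does count as a 128-bit load, faulting if it extends into an unmapped page or is misaligned.
The Description section backs this up: for a 128-bit memory source, an implementation may fetch only the 64 bits it needs, but 16-byte alignment and normal segment checking are still enforced.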
The 128-bit behaviour is one of many dumb design decisions in SSE1/SSE2. I wonder if Pentium 4 had limitations on store-forwarding or something that would somehow have made it less efficient, in that first-gen implementation, for it to behave like a movq (64-bit) load. There is movhps xmm3, qword ptr [ecx] to load into the upper half as a replacement for punpcklqdq, but for narrower interleaves you just need a separate movq load.
The MMX behaviour of only taking an operand of the width it actually uses is the sensible one. I don't know why the Intel doc you linked uses MMWORD with it; maybe some assemblers accepted that at the time. It does make sense that current MASM rejects it, but that could have gone either way.
Do note that punpckHwd and so on want a register-width memory operand, I guess so they more closely match the register-source version. For example, punpckhwd mm3, mm0 could be replaced with movq [esi], mm0 / punpckhwd mm3, [esi] and run the same, rather than needing [esi+4].
That also let them build hardware that just feeds a 64-bit load to the shuffle unit, without needing a broadcast or shifted load to get the data into the right place for input to the ALU. Modern Intel load ports can do broadcast loads (e.g. movddup or vbroadcastss with a memory source run as a single uop on a load port, no ALU involved), but that's something much more recent than P5 Pentium.
Omit the DWORD / MMWORD PTR entirely
And BTW, punpcklwd MM3, [8+EBP+ECX*2] should assemble just fine with most Intel-syntax assemblers, including MASM as well as NASM and GAS with .intel_syntax noprefix. The register destination (along with the mnemonic) implies the size of the memory operand.
GNU Binutils objdump -drwC -Mintel agrees with Intel's manual that it's a 32-bit (DWORD PTR) memory operand, so if you do want an explicit size, I assume MASM would want the same syntax.