Invalid instruction operand when using punpcklwd with MMWORD PTR 64-bit memory operand


Currently working on some old assembly code, and MASM errors out with this line.

punpcklwd MM3, MMWORD PTR [8+EBP+ECX*2]

Gives me: error A2070: invalid instruction operands

But, this should be valid, right? The disassembled code from a compiled copy is basically identical to this.

Also, according to this PDF, this is how it's supposed to be written... https://www.intel.com/content/dam/develop/external/us/en/documents/mmx-app-mpeg1-audio-kernels-140701.pdf


There is 1 answer below.


The memory source operand is 32-bit DWORD, not MMWORD or QWORD.
See Intel's asm manual entry:

PUNPCKLWD mm, mm/m32                MMX
PUNPCKLWD xmm1, xmm2/m128           SSE2

Unfortunately, the XMM version doesn't follow the same pattern of only taking a memory operand as wide as the data it uses: its memory source counts as a full 128-bit load, faulting if it extends into an unmapped page or is misaligned.

The Description section backs this up:

When the source data comes from a 128-bit memory operand, an implementation may fetch only the appropriate 64 bits; however, alignment to a 16-byte boundary and normal segment checking will still be enforced.

Legacy SSE versions 64-bit operand: The source operand can be an MMX technology register or a 32-bit memory location. The destination operand is an MMX technology register.

The 128-bit behaviour is one of many dumb design decisions in SSE1/SSE2. I wonder if Pentium 4 had limitations on store-forwarding or something that would have somehow made it less efficient in that first-gen implementation for the memory operand to behave like a movq load. For punpcklqdq there is movhps xmm3, qword ptr [ecx] to load into the upper half instead, but for narrower interleaves you just need a separate movq load.
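For the XMM case, the separate-movq approach mentioned above might look like the following. This is only a sketch: the pointer in ecx and the scratch register xmm0 are assumptions for illustration, and it has not been checked against any particular MASM version.

```asm
; Assumption: ecx points at 8 bytes of data that may not be 16-byte aligned.
movq      xmm0, QWORD PTR [ecx]    ; 64-bit load, no 16-byte alignment requirement
punpcklwd xmm3, xmm0               ; interleave the low 4 words of xmm3 and xmm0
```

The register-source punpcklwd only reads the low 64 bits of xmm0, so the narrower load supplies everything the interleave needs.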

The MMX behaviour of only taking an operand of the width it uses is the sensible one. I don't know why the Intel doc you linked uses MMWORD with it; maybe some assemblers accepted that at the time. It does make sense that current MASM rejects it, but that could have gone either way.


Do note that punpckHwd and the other high-half unpacks want a register-width memory operand (m64 for the MMX versions), I guess so they more closely match the register-source version. E.g. punpckhwd mm3, mm0 could be replaced with movq [esi], mm0 / punpckhwd mm3, [esi] and run the same, rather than needing [esi+4].

That also let them build HW that just feeds a 64-bit load to the shuffle unit, without needing a broadcast or shifted load to get the data at the right place for input to the ALU. Modern Intel load ports can do broadcast loads (e.g. movddup or vbroadcastss with a memory source run as a single uop for a load port, no ALU involved), but that's something much more recent than P5 Pentium.

Intel's manual entry for the high-half unpacks says: "When the source data comes from a 64-bit memory operand, the full 64-bit operand is accessed from memory, but the instruction uses only the high-order 32 bits. When the source data comes from a 128-bit memory operand, an implementation may fetch only the appropriate 64 bits; however, alignment to a 16-byte boundary and normal segment checking will still be enforced."
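The difference between the low-half and high-half MMX forms can be sketched with a spill/reload pair for each. The scratch address in esi is an assumption for illustration, and the two pairs are alternatives to compare, not a sequence to run back to back.

```asm
; High half: the memory form takes a full 64-bit operand and uses bytes 4..7,
; so a plain 8-byte spill/reload matches the register-source version.
movq      [esi], mm0
punpckhwd mm3, [esi]                ; same result as punpckhwd mm3, mm0

; Low half: the memory form takes only a 32-bit operand,
; so a 4-byte spill of the low dword is enough.
movd      DWORD PTR [esi], mm0
punpcklwd mm3, DWORD PTR [esi]      ; same result as punpcklwd mm3, mm0
```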


Omit the DWORD / MMWORD PTR entirely

And BTW, punpcklwd MM3, [8+EBP+ECX*2] should assemble just fine with most Intel-syntax assemblers, including MASM as well as NASM and GAS with .intel_syntax noprefix. The register destination (along with the mnemonic) implies the size of the memory operand.

GNU Binutils objdump -drwC -Mintel agrees with Intel's manual that it's a 32-bit memory operand. I assume MASM would want the same syntax.

 8049000:       0f 61 5c 4d 08          punpcklwd mm3,DWORD PTR [ebp+ecx*2+0x8]
 8049005:       66 0f 61 5c 4d 08       punpcklwd xmm3,XMMWORD PTR [ebp+ecx*2+0x8]
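
Applied to the line from the question, the two options would presumably look like this. It is a sketch based on the manual entry and the objdump output above, not something verified against a specific MASM version.

```asm
; Option 1: size keyword matching the m32 form in Intel's manual.
punpcklwd MM3, DWORD PTR [8+EBP+ECX*2]

; Option 2: no size keyword; the MM3 destination implies the operand size.
punpcklwd MM3, [8+EBP+ECX*2]
```

Both spellings describe the same instruction, so either should produce the 0f 61 5c 4d 08 encoding shown in the disassembly.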