A wonderful Sunday everyone.
I am currently learning a lot of assembly in the 32-bit environment (currently Windows). I am using FASM for this.
I have the following code which I successfully made but I'm quite unhappy with the way I load XMM0 into ST0:
```asm
GetDistance: ;(__cdecl*)(float x1, float y1, float x2, float y2)
    push ebp
    mov ebp, esp
    sub esp, 0x4
    movss xmm0, DWORD [ebp + 0x0014] ; Load y2
    subss xmm0, DWORD [ebp + 0x000C] ; Subtract y1
    movss xmm1, DWORD [ebp + 0x0010] ; Load x2
    subss xmm1, DWORD [ebp + 0x0008] ; Subtract x1
    mulss xmm0, xmm0 ; Square of the y difference
    mulss xmm1, xmm1 ; Square of the x difference
    addss xmm0, xmm1 ; Sum of squared differences
    sqrtss xmm0, xmm0 ; Square root
    movss dword [ebp - 0x0004], xmm0 ; Store the result to the stack
    fld dword [ebp - 0x0004] ; Reload it into ST0 (the return register)
    add esp, 0x4
    pop ebp
    ret 0
```
It does work, but I have been googling for two hours straight now (and even asked ChatGPT) how to get my XMM0 value into ST0. I guess I'm failing to search for the right problem, and ChatGPT's answers always produced compile errors or made my function return 'NAN'. ChatGPT also kept converting my simple function into an executable main block that uses a .data section and therefore global variables, which I think leads me in a completely wrong direction.
I don't like that I had to use sub from and add to ESP to get XMM0 into ST0.
I would also appreciate any tips to improve my code, or good resources to learn from. I only want to focus on 32-bit for now. :)
Store/reload is necessary to transfer from XMM to `st0`. Even though MMX registers alias the x87 registers, there's no way to use `MOVDQ2Q mm0, xmm0` to get an 80-bit FP bit-pattern into `st0`, even apart from the problem of switching back from MMX to x87 state without clearing the registers.

Related: Intel x86_64 assembly, How to move between x87 and SSE2? (calculating arctangent of double)
You don't need to waste instructions setting up EBP as a frame pointer, though, especially in simple functions like this where it's easy enough to keep track of offsets relative to ESP.
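As a rough sketch of what that could look like for the function in the question (same cdecl signature assumed, FASM syntax), all offsets are relative to ESP on entry. Instead of reserving a new slot, this version stores the result over the first stack arg, on the assumption that the callee is free to overwrite its own stack args:

```asm
; float __cdecl GetDistance(float x1, float y1, float x2, float y2)
; On entry: [esp+4] = x1, [esp+8] = y1, [esp+12] = x2, [esp+16] = y2
GetDistance:
        movss   xmm0, dword [esp + 12]  ; x2
        subss   xmm0, dword [esp + 4]   ; x2 - x1
        movss   xmm1, dword [esp + 16]  ; y2
        subss   xmm1, dword [esp + 8]   ; y2 - y1
        mulss   xmm0, xmm0              ; dx * dx
        mulss   xmm1, xmm1              ; dy * dy
        addss   xmm0, xmm1              ; dx*dx + dy*dy
        sqrtss  xmm0, xmm0              ; distance
        movss   dword [esp + 4], xmm0   ; store over the x1 arg slot (scratch)
        fld     dword [esp + 4]         ; reload into st0, the cdecl float return register
        ret                             ; cdecl: the caller pops the args
```

The store/reload is still there, but the `push ebp` / `mov ebp, esp` / `sub esp` / `add esp` / `pop ebp` overhead is gone.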
In a function with stack args, the callee (your function) "owns" them, so you can use `[esp+4]` as scratch space instead of reserving new space. This is why, when calling the same function twice with the same args, the caller has to store the args again. For example, a function that squares its `float` arg with SSE can store the result back to `[esp+4]` and `fld` it from there, although in that case it would have been more efficient to use `fld dword [esp+4]` / `fmul st0` / `ret` because we're using a calling convention that returns in `st0`.

If you insist on using 32-bit code, then the default calling conventions are old and bad, passing args on the stack and returning `float`/`double` in `st0` instead of `xmm0`.

For Windows there are less bad 32-bit calling conventions, though. 32-bit `vectorcall` passes the first 6 FP (or SIMD vector) args in XMM registers, and returns in `xmm0`. It also passes the first 2 integer args in registers, like `fastcall`. (64-bit vectorcall only passes 4 args in XMM regs, differing from the standard Windows x64 convention only in handling types like `__m128i` and `__m256`.) See https://learn.microsoft.com/en-us/cpp/cpp/vectorcall?view=msvc-170 for more.

A `vectorcall` function compiles with x86 MSVC 19.10 (Godbolt). It's a callee-pops convention like `fastcall`: a function with one stack arg ends with `ret 4`. If you don't have any stack args, though, just a normal `ret` is still correct.

If your callers are also hand-written asm, then you don't have to follow a standard calling convention; you can pass/return args in convenient registers and document it with comments on a per-function basis.
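To illustrate, here is a hedged sketch of the same distance function with register args, assuming the 32-bit `vectorcall` assignment described above (the four floats arrive in `xmm0`..`xmm3`, the result goes back in `xmm0`). The label `GetDistance_regs` is a made-up name for illustration; a real `vectorcall` symbol called from C would also need the decorated name the compiler expects.

```asm
; Assumed register convention (matches 32-bit vectorcall for four float args):
;   xmm0 = x1, xmm1 = y1, xmm2 = x2, xmm3 = y2
;   returns the distance in xmm0; clobbers xmm1
GetDistance_regs:
        subss   xmm0, xmm2              ; x1 - x2 (sign doesn't matter, it gets squared)
        subss   xmm1, xmm3              ; y1 - y2
        mulss   xmm0, xmm0              ; dx * dx
        mulss   xmm1, xmm1              ; dy * dy
        addss   xmm0, xmm1              ; dx*dx + dy*dy
        sqrtss  xmm0, xmm0              ; distance, already in the return register
        ret                             ; no stack args, so a plain ret
```

With the result staying in `xmm0`, there is no x87 involvement at all, so the store/reload disappears.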
Unsurprising; ChatGPT is very bad at assembly language, and buggy code is normal.
It doesn't "understand" what it's doing in any language, but x86 asm was probably rarer in its training data and/or harder for large language models because the same register names and mnemonics get used in all programs. The fact that there are so many different flavours of assembly language (including multiple for x86) probably doesn't help either.