Ensuring x64 compliance of custom ASM-function in clang-cl

Question

56 Views Asked by Juliean At 19 March 2024 at 13:43

For my custom-compiled native x64 JIT code, I have certain intrinsic functions. A lot of them are only called from my own generated code, so I emit them with my own compiler. Some of them, however, are called directly from C++ code, so I want them compiled into a static lib, so that they can at least be linked statically, if not inlined.

I need to use inline-assembly for those functions, as they perform actions that cannot be expressed in regular C++, like setting a non-volatile register from a function-input. However, the function itself must behave like a regular x64-function - it needs a prolog/epilogue, and it must have the necessary unwind-information to support stack-traces and exception-handling. Thus I cannot use MSVC (which is my native compiler), so I decided to make a static lib with clang-cl in Visual Studio instead. The best I got so far is the following:

void interruptEntry(void* pState, const char* pAddress)
{
    __asm
    {
        // load state into RBX
        mov rbx,rcx
        // load callstack-top into RDI
        mov rax,[rbx]
        mov rdi,[rax]

        // call address
        call rdx
    };
}

This generates the proper prolog, epilogue and all required unwind-information. However, it critically lacks the 32 bytes of shadow-space required by the x64 calling convention (which the code at pAddress expects to be called with):

Acclimate Engine.dll!interruptEntry(void *, const char *):
 push        rdi  
 push        rbx  
 mov         rbx,rcx  
 mov         rax,qword ptr [rbx]  
 mov         rdi,qword ptr [rax]  
 call        rdx  
 pop         rbx  
 pop         rdi  
 ret

Keep in mind, while this code is generated via clang-cl, the DLL is linked with MSVC. The static-lib is compiled with O2 (set from the VisualStudio-project page).

Things I've tried:

Modifying RSP manually, with sub RSP,32. This results in a frame-pointer register being established, as the compiler counts this as a dynamic stack allocation. That adds too much overhead to make it worth using a statically compiled function in the first place.
Similarly, I could reference "pState" directly in asm (mov rbx,pState). This causes the shadow-space to be added, but pState is then copied onto the stack and loaded into rbx from that stack location instead of from the register. This once again defeats the purpose of what I am doing here.
Calling "pAddress" as a function-pointer directly, after the asm-block. This makes no difference to the generated code.
Using normal asm(), or extended asm, in combination with "__attribute__((naked))". That does not generate the prolog/epilogue, which I can write myself, but then the unwind-information is missing. clang-cl does not seem to understand any of the unwind-data directives, like .allocstack or .pushreg, resulting in "error : unknown directive", regardless of which type of asm-block they are used in.

Is there any reason why the shadow-space is missing, and any way to get it there without adding unnecessary overhead like a frame-pointer (while still having unwind-information)? I'm also open to other suggestions. For example, if there is some intrinsic that lets me set those registers (while still compiling down to the one move), I would not need to use assembly (manipulating specific registers with global effect is the main reason I cannot write plain C++).

There are 2 best solutions below

Juliean On 19 March 2024 at 16:30

So, I had a look at MASM64, as suggested by Margaret Bloom, and I have to say it is generally cool to still have the option to write assembly at all in MSVC. However, for my own case, I evaluated all the functions that are on my list to potentially rewrite in assembly. Most of them are more complex functions (including calls to other functions, asserts, loops, jumps, ...), but only need a few specific instructions to modify registers, or a jmp to an address instead of a call. So, while I can use MASM for the simpler cases, I do want to have the option to code certain things with inline assembly, added to a normal C++ function. Luckily, I did find a way:

void interruptEntry(ExecutionStateJIT& state, Func pAddress)
{
    asm("mov rbx,%0;"
        :
        : "r" (&state)
        : "%rbx");

    auto* pTemp = *((void**)&state);
    asm("mov rdi,[%0];"
        :
        : "r" (pTemp)
        : "%rdi");

    pAddress();

    asm volatile("");
}

This compiles to my expected result:

 push        rdi  
 push        rbx  
 sub         rsp,28h  
 mov         rbx,rcx  
 mov         rax,qword ptr [rcx]  
 mov         rdi,qword ptr [rax]  
 call        rdx  
 nop  
 add         rsp,28h  
 pop         rbx  
 pop         rdi  
 ret
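
As a side note on why the listing subtracts 28h (40) rather than 32: below is a minimal sketch of the arithmetic, assuming the usual Windows x64 rules (32 bytes of shadow space, RSP 16-byte aligned at every call, and an 8-byte return address already on the stack at function entry). The names are made up for illustration and are not part of any toolchain API.

```cpp
#include <cstddef>

// Assumed Windows x64 ABI constants (hypothetical names).
constexpr std::size_t kShadowSpace   = 32; // home space the callee may write to
constexpr std::size_t kStackAlign    = 16; // RSP alignment required at a call
constexpr std::size_t kReturnAddress = 8;  // pushed by the call into this function

// Bytes the prolog has to subtract from RSP after pushing `pushed_regs`
// non-volatile registers, so that a nested call sees 32 bytes of shadow
// space and a 16-byte-aligned RSP. Requires C++14 constexpr.
constexpr std::size_t frame_allocation(std::size_t pushed_regs)
{
    std::size_t used  = kReturnAddress + 8 * pushed_regs; // already on the stack
    std::size_t total = used + kShadowSpace;
    std::size_t pad   = (kStackAlign - total % kStackAlign) % kStackAlign;
    return kShadowSpace + pad;
}

// push rdi + push rbx: 8 + 16 + 32 = 56, so 8 bytes of padding -> sub rsp, 28h
static_assert(frame_allocation(2) == 40, "matches the listing above");
```

With a single pushed register the padding would be zero and the allocation exactly 32 bytes.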

It does seem like using extended inline assembly is the way. Using the __asm-block somehow confused the compiler about what type of calls are being made, which is a shame, because I much preferred the syntax of __asm. Extended asm does allow me to tell the compiler which registers are modified. That is neat, because for certain functions that I want to port, I explicitly do not want the modified registers to be restored. I can also just intermix regular C++, which is one of my requirements for certain operations.

I also had to make sure to include an empty volatile asm-block after the call, to prevent it from becoming a tail-call (which would pop the registers before the call). There is this one nop, which I'm unsure about (it is not caused by the empty block directly; if I write a volatile nop there, I get two nops). But I kind of assume clang knows what it's doing here. Otherwise, the minor cost of one nop is acceptable, as far as I'm concerned.

**Peter Cordes** · Accepted Answer · 2024-03-21T02:45:04.897000

Making calls from inline asm is generally not well supported. Avoid whenever possible.
The compiler only scans the inline asm block to see what registers are potentially clobbered; it doesn't assume that call instructions in asm are to functions that follow the standard calling convention for this target (otherwise why would you be using inline asm in the first place?). So it's a huge pain to do it safely. The same goes for x86-64 System V (see "Calling printf in extended inline ASM"): using GNU C inline asm you also have to declare all the register clobbers yourself, as well as take care of the red-zone, since there's no way to declare a clobber on that.

Your idea of using inline asm to leave values in regs and block tail-call optimization is a good one. But the implementation in your self-answer, with two separate asm() statements, doesn't do anything to stop the compiler from stepping on RBX with the instructions it emits for code outside the asm statements. A different compiler or version could easily break your code by picking RBX as a temporary instead of RAX when compiling the code between the asm statements. (And since you didn't use __attribute__((noinline)), code from parent functions could be scheduled here.)

You can write it in a way that discourages the compiler from stepping on your registers. Make those values needed in those registers after the call (as inputs to an empty asm statement), so the asm you want is the only efficient choice. That makes it a lot less likely that this will break in practice.

class ExecutionStateJIT;
using Func = void ();

void interruptEntry_safer(ExecutionStateJIT& state, Func pAddress)
{
    register void *state_addr asm("rbx") = &state;
    // strict-aliasing violation, see alternate version that's safe without -fno-strict-aliasing
    register auto* pTemp asm("rdi") = * *((void***)&state);  // two derefs
    // register ... asm("rbx") is actually redundant since I also used specific-register constraints in the asm statement

    // request those vars in RDI and RBX respectively
    asm volatile("" :: "D"(pTemp), "b"(state_addr));  // make sure they're actually loaded before
    pAddress();  // this just uses a function-pointer arg that was already in a register, doesn't need to touch any others

    // prevent a tailcall which would restore RDI and RBX before calling
    asm volatile("" :: "D"(pTemp), "b"(state_addr));  // and still wanted in these call-preserved registers after
}

Using register T foo asm("regname") local register variables lets you ask for values in any of R8-R15 which don't have specific-register constraint letters, forcing an "r"(var) constraint to pick a specific register. (And for many other ISAs, there aren't letters for any single registers.) It's not actually needed here because the "D" and "b" constraints require RDI and RBX respectively.
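
As a small illustration of the local-register-variable technique (a sketch, not code from the question): `copy_through_r10` is a hypothetical helper, and the template assumes the default AT&T operand order, i.e. no -masm=intel. The asm reads R10 by name, which is only correct because the `"r"(pinned)` input is forced into R10 by the register declaration.

```cpp
#include <cstdint>

// Hypothetical helper: routes `value` through R10, a register that has no
// dedicated constraint letter, using a GNU C local register variable.
std::uint64_t copy_through_r10(std::uint64_t value)
{
    // Only guaranteed to live in R10 while it is an operand of an extended asm.
    register std::uint64_t pinned asm("r10") = value;

    std::uint64_t out;
    // AT&T order: mov source, destination.
    asm("mov %%r10, %0"
        : "=r"(out)      // any general-purpose register for the result
        : "r"(pinned));  // must be R10 because of the declaration above
    return out;
}
```

Calling `copy_through_r10(x)` simply returns `x`; the point is only that the compiler has to materialize the value in R10 before the asm statement.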

Godbolt shows it works in GCC and clang (-mabi=ms for GCC, and -target x86_64-w64-windows-gnu for Clang). As a bonus, this doesn't require very recent Clang for -masm=intel to apply to asm statements, since the actual templates are empty. The action is in the constraints, requiring the compiler to have both values in the registers we want, but without any instructions in the asm templates themselves.

Your code also violates the strict-aliasing rule by pointing a void ** at an object of a different type. Only [unsigned] char* and pointers to objects declared with __attribute__((may_alias)) can be pointed at arbitrary things in GNU C. But for compat with MSVC, clang-cl probably enables -fno-strict-aliasing.

class ExecutionStateJIT;
using Func = void ();
void interruptEntry_safer_strict_aliasing(ExecutionStateJIT& state, Func pAddress)
{
    using voidp = void*;
    using aliasing_voidp = __attribute__((may_alias)) voidp;
    // aliasing_voidp is a pointer-to-void (e.g. 8 bytes on x86-64).
    // aliasing_voidp*  can be pointed at any object safely, to let us load a void*
    void *state_addr = &state;
    auto* pTemp = *(aliasing_voidp*)state_addr;  // like memcpy but alignment guaranteed because no __attribute__((aligned(1)))
    pTemp = *(aliasing_voidp*)pTemp;

    asm volatile("" :: "D"(pTemp), "b"(state_addr));  // make sure they're actually loaded before
    pAddress();  // this just uses a function-pointer arg that was already in a register, doesn't need to touch any others
    // prevent a tailcall which would restore RDI and RBX before calling
    asm volatile("" :: "D"(pTemp), "b"(state_addr));  // and still wanted in these call-preserved registers after
}

Both of these compile to the same asm as yours with current versions of GCC and clang, and the source is shorter and easier to read (if you know GNU C inline asm). The point is that they will more reliably do so with future versions and even if inlined into other surrounding code.

interruptEntry_safer_strict_aliasing(ExecutionStateJIT&, void (*)()):
        push    rdi
        push    rbx
        mov     rbx, rcx               # state_addr
        sub     rsp, 40
        mov     rax, QWORD PTR [rcx]   # first pTemp
        mov     rdi, QWORD PTR [rax]   # second value of pTemp
        call    rdx
# clang puts a NOP here for some reason
        add     rsp, 40
        pop     rbx
        pop     rdi
        ret

I didn't use __attribute__((noinline)) on my versions, since even in a use-case where they do inline into a caller (e.g. -flto link-time optimization), the asm statements hopefully convince the compiler not to do something else with RBX or RDI in the window between the asm statement and the call, if it is moving code around to try to schedule it more efficiently.
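
On the strict-aliasing point above: if you'd rather avoid GNU attributes for the loads, the portable alternative is memcpy. This is a minimal sketch with a hypothetical helper name (`load_pointer`), assuming the object really does start with a pointer-sized field as in the question. Optimizing compilers typically turn a fixed 8-byte memcpy like this into a single mov, but check the generated asm for your build.

```cpp
#include <cstring>

// Hypothetical helper: reads the pointer stored in the first
// sizeof(void*) bytes of *object without violating strict aliasing.
inline void* load_pointer(const void* object)
{
    void* result;
    std::memcpy(&result, object, sizeof result); // well-defined for any object type
    return result;
}
```

The two dereferences from the versions above would then be written as `load_pointer(load_pointer(&state))`.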
