For my custom-compiled native x64 JIT code, I have certain intrinsic functions. Many of them are only called from my JIT code, so I generate them with my own compiler. Some of them, however, are called directly from C++ code, so I want them compiled into a static lib, so that they can at least be linked statically, if not inlined.
I need to use inline assembly for those functions, as they perform actions that cannot be expressed in regular C++, like setting a non-volatile register from a function input. However, the function itself must behave like a regular x64 function: it needs a prolog/epilog, and it must have the necessary unwind information to support stack traces and exception handling. MSVC (my native compiler) does not support inline assembly for x64, so I decided to make a static lib with clang-cl in Visual Studio instead. The best I got so far is the following:
void interruptEntry(void* pState, const char* pAddress)
{
    __asm
    {
        // load state into RBX
        mov rbx,rcx
        // load callstack-top into RDI
        mov rax,[rbx]
        mov rdi,[rax]
        // call address
        call rdx
    };
}
This generates the proper prolog, epilog and all required unwind information. However, it critically lacks the 32 bytes of shadow space required by the Windows x64 calling convention (which is the convention pAddress must be called with):
Acclimate Engine.dll!interruptEntry(void *, const char *):
push rdi
push rbx
mov rbx,rcx
mov rax,qword ptr [rbx]
mov rdi,qword ptr [rax]
call rdx
pop rbx
pop rdi
ret
Keep in mind, while this code is generated via clang-cl, the DLL is linked with MSVC. The static-lib is compiled with O2 (set from the VisualStudio-project page).
Things I've tried:
- Modifying RSP manually, with sub RSP,32. This results in a frame-pointer register being established, as the compiler will count this as a dynamic stack allocation. This adds too much overhead to make it worth using a statically compiled function in the first place
- Similarly, I could reference "pState" directly in asm (mov rbx,pState). This causes the shadow space to be added, but pState is then copied onto the stack and loaded into rbx from that stack location instead of from the register. This once again defeats the purpose of what I am doing here.
- Calling "pAddress" as a function-pointer directly, after the asm-block. This will still not result in any difference in code-gen
- Using normal asm(), or extended asm, in combination with "__attribute__((naked))". That will not generate the prolog/epilog, which I can write myself, but then the unwind information is missing. clang-cl does not seem to understand any of the unwind-data directives, like .allocstack or .pushreg, resulting in "error : unknown directive", regardless of which type of asm block they are used in.
Is there any reason why the shadow space is missing, and any way to get it there without adding unnecessary overhead like a frame pointer (while still having unwind information)? I'm also open to other suggestions. For example, if there is some intrinsic that lets me set those registers (while still compiling down to the one move), I would not need to use assembly (manipulating specific registers with global effect is the main reason I cannot write plain C++).
Making calls from inline asm is generally not well supported. Avoid whenever possible.
The compiler only scans the inline asm block to see what registers are potentially clobbered; it doesn't assume that call instructions in asm are to functions that follow the standard calling convention for this target (otherwise why would you be using inline asm in the first place?) So it's a huge pain to do it safely, same for x86-64 System V (Calling printf in extended inline ASM - using GNU C inline asm you also have to declare all the register clobbers yourself, as well as take care of the red-zone since there's no way to declare a clobber on that.)

Your idea of using inline asm to leave values in regs and block tail-call optimization is a good one. But the implementation in your self-answer with two separate asm() statements doesn't do anything to stop the compiler from stepping on RBX with the instructions it emits for code outside the asm statements. A different compiler or version could easily break your code by picking RBX as a temporary instead of RAX when compiling the code between the asm statements. (And since you didn't use __attribute__((noinline)), code from parent functions could be scheduled here.)

You can write it in a way that discourages the compiler from stepping on your registers. Make those values needed in those registers after the call (as inputs to an empty asm statement), so the asm you want is the only efficient choice. That makes it a lot less likely that this will break in practice.

Using register T foo asm("regname") local register variables lets you ask for values in any of R8-R15, which don't have specific-register constraint letters, forcing an "r"(var) constraint to pick a specific register. (And for many other ISAs, there aren't letters for any single registers.) It's not actually needed here because the "D" and "b" constraints require RDI and RBX respectively.

Godbolt shows it works in GCC and clang (-mabi=ms for GCC, and -target x86_64-w64-windows-gnu for Clang). As a bonus, this doesn't require very-recent Clang for -masm=intel to apply to asm statements, since the actual templates are empty. The action is in the constraints, requiring the compiler to have both values in the registers we want, but without any

Your code also violates the strict-aliasing rule by pointing a void ** at an object of a different type. Only [unsigned] char* and pointers to objects declared with __attribute__((may_alias)) can be pointed at arbitrary things in GNU C. But for compat with MSVC, clang-cl probably enables -fno-strict-aliasing.

Both of these compile to the same asm as yours with current versions of GCC and clang, and the source is shorter and easier to read (if you know GNU C inline asm). The point is that they will more reliably do so with future versions and even if inlined into other surrounding code.
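As a rough illustration of the empty-asm-plus-constraints idea described above, here is a minimal sketch. It is not the exact code from the Godbolt link; the names PIN_RBX_RDI, loadPointer, fakeJitTarget and g_targetCalls are hypothetical, and it assumes a GNU-compatible compiler (GCC, clang or clang-cl) targeting x86-64. The two pointer loads go through memcpy to sidestep the aliasing problem, and the call itself is an ordinary C++ call, so the compiler stays responsible for the stack frame and unwind info. The register choice only pays off with the Windows x64 convention, where RBX and RDI are both call-preserved.

```cpp
#include <cstdint>
#include <cstring>

// Only GNU-style extended asm on x86-64 understands the "b"/"D" constraints.
// On anything else this sketch degrades to a plain call (assumption for portability).
#if (defined(__GNUC__) || defined(__clang__)) && defined(__x86_64__)
#define PIN_RBX_RDI(state, top) asm volatile("" : : "b"(state), "D"(top))
#else
#define PIN_RBX_RDI(state, top) ((void)(state), (void)(top))
#endif

// Hypothetical stand-in for JIT-compiled code; it only counts invocations.
static int g_targetCalls = 0;
extern "C" void fakeJitTarget() { ++g_targetCalls; }

// Aliasing-safe pointer load: copies the bytes instead of dereferencing a void**.
static void* loadPointer(const void* from)
{
    void* value;
    std::memcpy(&value, from, sizeof value);
    return value;
}

void interruptEntry(void* pState, const char* pAddress)
{
    // Same two loads as "mov rax,[rbx]" / "mov rdi,[rax]" in the question.
    void* top = loadPointer(loadPointer(pState));
    auto target = reinterpret_cast<void (*)()>(
        reinterpret_cast<std::uintptr_t>(pAddress));

    // Empty template: the constraints alone require pState in RBX and top in RDI here.
    PIN_RBX_RDI(pState, top);
    target();
    // Requiring the same values in the same registers after the call makes
    // "keep them there across the call" the cheapest option, and it also
    // prevents the call from being turned into a tail call.
    PIN_RBX_RDI(pState, top);
}
```

Calling it with a state object whose first slot points at the callstack top, e.g. interruptEntry(&callstack, address), should invoke the target exactly once. Whether the values really stay in RBX and RDI across the call is still up to the compiler, so it is worth checking the generated asm for the toolchain in use.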
I didn't use __attribute__((noinline)) on my versions, since even in a use-case where they do inline into a caller (e.g. -flto link-time optimization), the asm statements hopefully convince the compiler not to do something else with RBX or RDI in that window between the asm statement and the call, if it is moving code around to try to schedule it more efficiently.
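For registers without their own constraint letter, the local register variable form mentioned earlier could look roughly like the sketch below. It assumes GCC or clang on x86-64; callWithContextInR12, countingTarget and g_pinnedCalls are hypothetical names, and R12 is picked only as an example of a call-preserved register in the R8-R15 range.

```cpp
// Hypothetical stand-in for the code being called; it only counts invocations.
static int g_pinnedCalls = 0;
extern "C" void countingTarget() { ++g_pinnedCalls; }

// Hypothetical helper: asks for `context` to be in R12 around an indirect call.
void callWithContextInR12(void* context, void (*target)())
{
#if (defined(__GNUC__) || defined(__clang__)) && defined(__x86_64__)
    // A local register variable is only guaranteed to live in the named
    // register when it is used as an operand of an extended asm statement,
    // so the "r" constraints below are what actually pin it to R12.
    register void* pinned asm("r12") = context;
    asm volatile("" : : "r"(pinned));
    target();
    asm volatile("" : : "r"(pinned));
#else
    // Fallback for other targets (assumption): a plain call, nothing pinned.
    (void)context;
    target();
#endif
}
```

A call such as callWithContextInR12(&someObject, &countingTarget) runs the target once. As with the "b" and "D" constraints, the second empty asm statement is what encourages the compiler to leave the value in place across the call.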