(This question isn't intended to be specific to VirtualBox or x86 per se, but since they're the best examples I'm aware of, I'll be referencing them and asking how VBox handles some scenarios. If you're aware of other solutions that aren't used by VBox, do consider mentioning them as well.)
I've read how VirtualBox does software virtualization, and I don't understand the following.
Before executing ring 0 code, CSAM [Code Scanning and Analysis Manager] scans it recursively to discover problematic instructions. PATM [Patch Manager] then performs in-situ patching, i.e. it replaces the instruction with a jump to hypervisor memory where an integrated code generator has placed a more suitable implementation. In reality, this is a very complex task as there are lots of odd situations to be discovered and handled correctly. So, with its current complexity, one could argue that PATM is an advanced in-situ recompiler.
Consider the following sample instruction sequence in ring-0 code:
call foo
foo:
mov EAX, 1234
mov EDX, [ESP]
cmp EDX, EAX
jne bar
call do_something_special_if_return_address_was_1234
bar:
...
The callee here is testing to see if the return address to the caller is 1234, and if it is, it does something special. Clearly patching will change the return address, so we need to be able to handle it.
VirtualBox's documentation says it discovers "problematic" instructions and patches them in-place, but I don't really understand how this can work, for 2 reasons:
It appears that any instruction that exposes the instruction pointer is "problematic", of which
callis probably the most common (and extremely so). Does this mean VirtualBox must analyze, and potentially patch, everycallinstruction it sees in ring 0? Doesn't this make performance drop off a cliff? How do they handle this with high performance? (The cases they mention in their documentation are quite obscure, so I'm confused why they didn't mention such a common instruction if it occurs. And if it isn't a problem, I don't understand why.)If the instruction stream happens to be modified (e.g. dynamically loading/unloading kernel modules), VirtualBox must dynamically detect this and garbage-collect unreachable recompiled instructions. Otherwise, it will have a memory leak. But this means every
movinstruction (andpushinstruction, and everything else that writes to memory) will now have to be analyzed and potentially patched, possibly repeatedly, because it may be modifying code that was patched. This would seem to essentially degenerate all guest ring-0 code into near-full software emulation (since the target of a move is not known during recompilation), which would make the virtualization cost skyrocket, and yet this is not the impression I get from reading the documentation. Is this not a problem? How is this handled efficiently?
Please note that I am not asking about hardware-assisted virtualization like Intel VT or AMD-V and I am not interested in reading about those. I'm well aware that they avoid these problems altogether, but my question is about pure software virtualization.
At least for
QEMU, it seems like the answer is that even in the translated code, there is a separate emulated "stack" that is set up with the same values as the code would have when running natively, and this "stack" is the one read by the emulated code, which sees the same values as if it were running natively.This means that emulated code can't be translated to use
call,ret, or any other stack-using instructions directly, since those would not use the emulated stack. Therefore, those calls are replaced by jumps to various bits of thunk code that does the right thing in terms of calling the equivalent translated code.Details for QEMU
The OP's (reasonable) assumption seems to be that
callandretinstructions would appear in the translated binary, and the stack would reflect the addresses of the dynamically translated code. What actually happens (in QEMU) is that thecallandretinstructions are removed and replaced with control flow that doesn't use the stack, and values on the stack are set to have the same values as they would in native code.That is, the OP's mental model is that the result of the code translation is somewhat like the native code, with some patches and modifications. At least in the case of QEMU, that's not really the case—every basic block is heavily translated via the Tiny Code Generator (TCG), first to an intermediate representation and then to the target architecture (even if the source and destination archs are the same, as in my case). This deck has a good overview of a lot of the technical details, including an overview of TCG as shown below.
The resulting code is usually nothing like the input code, and typically increases in size by about 3 times. Registers are often used fairly sparingly and you often see back-to-back redundant sequences. Particularly relevant to this question, is that essentially all control flow instructions are quite differently, so
retandcallinstructions in the native code pretty much never translate into plaincallorretin the translated code.An example: First, some C code with a
return_address()call that simply returns the return address, and amain()that prints this function:The
noinlinehere is important since otherwisegccjust may just inline the function and hardcode the address directly into the assembly without making acallor accessing the stack at all!With
gcc -g -O1 -march=nativethis compiles to:Note that
return_address()returns[rsp]just like the OP's example. Themain()function sticks it inrdx, whereprintfwill read it from.We expect the caller's return address as seen by
return_addressto be the instruction following the call,0x400559:... and indeed that's what we see when we run it natively:
Lets try it in QEMU:
It works! Note that QEMU is by default translating all code and putting it far away from the usual native location (as we will see shortly), so we don't need any special instructions to trigger translation.
How does that work under the covers? We can use the
-d in_asm,out_asmoption with QEMU to see what it is making of this code.First, the code leading up to the call (the
INportion is the native code, and theOUTis what QEMU translates it to—sorry for AT&T syntax, I can't figure out how to change that in QEMU):A key part is here:
You can see that it is actually manually putting the return address from the native code onto the "stack" (which seems generally to be accessed with
rbp). Following that, note that there is nocallinstruction toreturn_address. Rather, we have:In most of the code,
r14seems to be a pointer to some internal QEMU data structure (i.e., not used to hold values from the emulated program). The above shoves0x400546(which is the address of thereturn_addressfunction in native code) into a field of the structure pointed to byr14, sticks0x7f177ad8a690inrax, and jumps to0x557c9cef8196. This last address appears all over the place in the generated code (but its definition does not) and seems to be some kind of internal dispatch or thunk method. Presumably, it uses either the native address or, more likely, the mystery value shoved inraxto dispatch to the translatedreturn_addressmethod, which looks like this:The first bit of code seems to set up the user "stack" in
ebp(getting it fromr14 + 0x20, which is probably the emulated machine state structure) and ultimately reads from the "stack" (the linemov 0x0(%rbp),%rbx) and stores that away into the area pointed to byr14(mov %rbx,0x80(%r14)).Finally, it gets to
jmpq 0x55c131ebe196, which transfers to the QEMU epilogue routine:Note that I use the word "stack" in quotes above. This is because this "stack" is the emulation of the stack as seen by the emulated program, not the true stack pointed to by
rsp. The true stack pointed to byrspis controlled by QEMU to implement the emulated control flow, and the emulated code doesn't access it directly.Some Things Can Change
We see above that "stack" contents as seen by the emulated process are the same under QEMU, but details of the stack do change. For example, the address of the stack looks different under emulation than natively (i.e., the value of
rsprather than the stuff pointed to by[rsp]).This function:
normally returns addresses like
0x7fffad33c100but returns addresses like0x40007ffd00under QEMU. This should not be a problem, though, because no valid program should depend on the exact absolute value of the stack address. Not only is it generally not defined and unpredictable, but on recent operating systems it really is designed to be totally unpredictable due to stack ASLR (Linux and Windows both implement this). The program above returns a different address each time I run it natively (but the same address under QEMU).Self Modifying Code
You also mentioned a concern about when the instruction stream is modified, and gave an example of loading a kernel module. First, at least for QEMU, code is only translated "on demand". Functions that could be called, but aren't in some particular run, are never translated (you can try it with a function that is called conditionally depending on
argc). So in general, loading new code into the kernel, or into a process in user-mode emulation, is handled by the same mechanism: the code will simply be translated the first time it is called.If the code is actually self modifying—i.e., the process writes to its own code—then something has to be done, since without help QEMU would keep using the old translation. So, to detect self-modifying code without penalizing every write to memory, the native code lives in pages with R+X permissions only. The consequence is that a write raises a GP fault which QEMU handles by noting that the code has modified itself, invalidating the translation, and so on. Plenty of details can be found in this thread and elsewhere.
It's a reasonable mechanism, and I expect other code-translation VMs do something similar.
Note that in the case of self-modifying code, the "garbage collection" problem is simple: the emulator is informed about the SMC event as described above, and since it has to re-translate at this point, it throws away the old translation.