My understanding is that VMMs such as VMware's ESXi Server maintain shadow page tables to map virtual page addresses of guest operating systems directly to machine (hardware) addresses. I've been told that shadow page tables are then used directly by the processor's paging hardware to allow memory accesses in the VM to execute without translation overhead.
I would like to understand a bit more about how the shadow page table mechanism works in a VMM. Is my high-level understanding above correct? If so:
What kind of data structures are used in the implementation of shadow page tables?
What is the flow of control from the guest operating system to the hardware?
Short of straight up reading the source code of an open source VMM, what resources can I look into to learn more about hardware virtualization?
Here is what I can tell; please correct me if I am wrong. The shadow page table is created and maintained by the hypervisor (VMM). It maps guest virtual addresses directly to machine physical addresses. Without a shadow page table, getting to a machine physical address takes two steps: first walk the guest OS page table to turn the guest virtual address into a guest physical address, then convert that guest physical address into a machine physical address. Here is how one guest virtual address gets translated into a machine physical address when a shadow page table is in use:
First, the physical processor sees the virtual address, and its goal is to get the machine physical address. The first thing it does is look in the TLB (translation lookaside buffer). If the entry is in the TLB, we have the machine address right away. This is the simplest case, called a TLB hit. There is no performance issue at all; the access runs at native speed.
If there is no entry in the TLB, the processor does a page table walk in the shadow page table. Assuming the corresponding mapping (guest virtual address to machine physical address) is there, the processor inserts the value into the TLB, restarts the access, and we are good to go. This is the other good case. A lookup in the shadow page table may take around 10 cycles, so performance-wise we don't have to worry much.
The processor walks the shadow page table and cannot find the entry. In this case the hardware raises a page fault, and because the guest runs deprivileged, the fault goes to the VMM (virtual machine monitor) rather than to the guest. The VMM then walks the guest page table to resolve the issue. This case is a little complicated. When the VMM walks the guest page table there are two possibilities.
The lookup finds the entry: walking the guest page table only gives us the guest physical address, but our target is the machine physical address. To get there, the monitor takes the guest physical address and looks it up in its PMap (or PhysMap) structure. If it finds the entry, it inserts the mapping (guest virtual address to machine physical address) into the shadow page table. Now that the entry is in the shadow page table, the processor can restart the instruction and get the mapping from there. This is called a hidden page fault: the monitor resolves it on its own using the PMap, and the guest never sees it.
The lookup does not find the entry: the monitor (VMM) injects a virtual page fault into the guest. The guest now sees a page fault, and the guest OS comes in and resolves it. This can take thousands to hundreds of thousands of cycles, or more if the guest had swapped the page out to disk. Once the guest OS has resolved the fault and the access is retried, the monitor's walk of the guest page table succeeds, and we are back in the previous case.
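The cases above can be sketched as a toy model in Python. This is only an illustration under simplifying assumptions: page tables are flat dictionaries keyed by page number rather than multi-level structures, and the names `ToyVMM`, `translate`, `handle_shadow_miss` and `GuestPageFault` are hypothetical, not taken from any real VMM.

```python
class GuestPageFault(Exception):
    """Stands in for the page fault the VMM injects into the guest."""


class ToyVMM:
    def __init__(self, guest_page_table, pmap):
        self.guest_page_table = guest_page_table  # guest VPN -> guest PFN (owned by the guest OS)
        self.pmap = pmap        # guest PFN -> machine PFN (owned by the VMM)
        self.shadow = {}        # guest VPN -> machine PFN (owned by the VMM)
        self.tlb = {}           # hardware cache of shadow entries

    def translate(self, vpn):
        """Return (machine PFN, which path resolved the access)."""
        # TLB hit: native speed, nothing else is consulted.
        if vpn in self.tlb:
            return self.tlb[vpn], "tlb hit"
        # TLB miss: the hardware walks the shadow page table and fills the TLB.
        if vpn in self.shadow:
            self.tlb[vpn] = self.shadow[vpn]
            return self.tlb[vpn], "shadow hit"
        # Shadow miss: the fault goes to the VMM, then the access is retried.
        self.handle_shadow_miss(vpn)
        self.tlb[vpn] = self.shadow[vpn]
        return self.tlb[vpn], "hidden fault"

    def handle_shadow_miss(self, vpn):
        # The VMM walks the guest page table in software.
        if vpn not in self.guest_page_table:
            # The guest has no mapping either: hand the fault to the guest OS.
            raise GuestPageFault(vpn)
        gpfn = self.guest_page_table[vpn]
        # Hidden page fault: compose guest VA -> guest PA -> machine PA
        # and cache the result in the shadow page table.
        self.shadow[vpn] = self.pmap[gpfn]


vmm = ToyVMM(guest_page_table={0x10: 0x2}, pmap={0x2: 0x7A, 0x3: 0x7B})
first = vmm.translate(0x10)   # resolved through a hidden page fault
second = vmm.translate(0x10)  # now served from the TLB
```

In this sketch, an access to a page the guest has not mapped raises `GuestPageFault`; after the guest adds the mapping to `guest_page_table`, retrying the access goes through the hidden page fault path.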
The whole flow is a little complicated, but I hope the process is clear. Note: the shadow page table is implemented in software, in products from vendors such as VMware and Microsoft. It is used when the hardware offers no paging support for virtualization, as in binary translation (BT) mode. With nested page tables the hardware does both translations itself, and we don't need a shadow page table at all.
There are some issues with shadow page tables.
We rely on the guest to invalidate the TLB. We want to keep the guest page table and the shadow page table consistent. Think about what happens when the guest updates its page table, or when the guest switches processes and has to switch page tables. In both cases the guest has to tell the hardware that it changed an entry by invalidating it, and the VMM uses that invalidation to bring the shadow page table up to date.
Aggressive shadow page table caching is necessary. Think about what happens when the guest does context switches and has a lot of processes: on each switch the hardware has to be pointed at a different shadow page table, and every switch flushes the TLB. Ideally there would be a shadow page table for every running guest process, but the VMM keeps fewer shadow page tables than the guest has page tables, so it has to cache them and rebuild the ones it dropped.
Write protection of the guest page tables (also called tracing): the VMM write-protects the pages that hold the guest page tables so that it is informed whenever the guest changes them, for example when the operating system locks a page for some reason.
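To make the caching and consistency issues above a bit more concrete, here is a small Python sketch. It assumes one flat shadow table per guest page table root, a fixed cache capacity with least-recently-used eviction, and a TLB flush on every switch. The class and method names (`ShadowCache`, `switch_to`, `invalidate`) are made up for illustration, and real VMMs use more elaborate policies.

```python
from collections import OrderedDict


class ShadowCache:
    """Keeps a bounded number of shadow page tables, one per guest page table root."""

    def __init__(self, capacity):
        self.capacity = capacity
        # guest page table root (for example the guest's CR3 value)
        #   -> shadow table {guest VPN: machine PFN}
        self.tables = OrderedDict()
        self.tlb_flushes = 0

    def switch_to(self, guest_root):
        """Guest context switch: the guest loads a new page table root."""
        self.tlb_flushes += 1  # in this simple model every switch flushes the TLB
        if guest_root in self.tables:
            # Cached: reuse the shadow table instead of rebuilding it fault by fault.
            self.tables.move_to_end(guest_root)
        else:
            if len(self.tables) >= self.capacity:
                # Fewer shadow tables than guest processes: evict the least recently used.
                self.tables.popitem(last=False)
            self.tables[guest_root] = {}
        return self.tables[guest_root]

    def invalidate(self, guest_root, vpn):
        """Called when the guest invalidates a TLB entry, or when a write to a
        write-protected (traced) guest page table page is intercepted."""
        table = self.tables.get(guest_root)
        if table is not None:
            # Drop the stale entry; the next access refills it via a hidden page fault.
            table.pop(vpn, None)


cache = ShadowCache(capacity=2)
shadow_a = cache.switch_to("A")
shadow_a[0x10] = 0x7A          # filled in by an earlier hidden page fault
cache.switch_to("B")
reused = cache.switch_to("A")  # still cached, mapping survives the switch
```

The point of the cache is that switching back to process "A" finds its shadow entries intact, so the guest does not pay a hidden page fault for every page again after each context switch.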