Is there a limit on the number of hugepage entries that can be stored in the TLB

I'm trying to analyze the network performance boost that VMs get when they use hugepages. For this I configured the hypervisor to have several 1G hugepages (36) by changing the grub command line and rebooting, and when launching the VMs I made sure the hugepages were being passed on to them. On launching 8 VMs (each with two 1G hugepages) and running network throughput tests between them, I found that the throughput was drastically lower than when running without hugepages. That got me wondering whether it had something to do with the number of hugepages I was using.

Is there a limit on the number of 1G hugepages that can be referenced using the TLB, and if so, is it lower than the limit for regular-sized pages? How do I find this information? In this scenario I was using an Ivy Bridge system, and with the cpuid command I saw something like:

cache and TLB information (2):
  0x63: data TLB: 1G pages, 4-way, 4 entries
  0x03: data TLB: 4K pages, 4-way, 64 entries
  0x76: instruction TLB: 2M/4M pages, fully, 8 entries
  0xff: cache data is in CPUID 4
  0xb5: instruction TLB: 4K, 8-way, 64 entries
  0xf0: 64 byte prefetching
  0xc1: L2 TLB: 4K/2M pages, 8-way, 1024 entries

Does it mean I can have only 4 1G hugepage mappings in the TLB at any time?
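
For reference, the same descriptor bytes can be read straight from CPUID leaf 2; here is a minimal C sketch (assuming GCC/Clang on x86) that just dumps the raw descriptors the cpuid tool decodes, 0x63 being the "data TLB: 1G pages, 4-way, 4 entries" entry above:

    /* Dump the raw CPUID leaf-2 descriptor bytes (e.g. 0x63, 0x03, ...).
       Assumes GCC/Clang on an x86 CPU that still reports TLBs via leaf 2. */
    #include <cpuid.h>
    #include <stdio.h>

    int main(void) {
        unsigned int regs[4];
        if (!__get_cpuid(2, &regs[0], &regs[1], &regs[2], &regs[3])) {
            fprintf(stderr, "CPUID leaf 2 not supported\n");
            return 1;
        }
        for (int r = 0; r < 4; r++) {
            if (regs[r] & 0x80000000u)   /* bit 31 set: register holds no descriptors */
                continue;
            for (int b = 0; b < 4; b++) {
                if (r == 0 && b == 0)    /* low byte of EAX is a repeat count, not a descriptor */
                    continue;
                unsigned int byte = (regs[r] >> (8 * b)) & 0xffu;
                if (byte != 0)
                    printf("descriptor 0x%02x\n", byte);
            }
        }
        return 0;
    }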

There is 1 answer below.

Yes, of course. An unbounded number of TLB entries would require an unbounded amount of physical space on the CPU die.

Every TLB in every architecture has an upper limit on the number of entries it can hold.

For the x86 case this number is lower than you probably expected: it is 4.
It was 4 on your Ivy Bridge and it is still 4 on my Kaby Lake, four generations later.

It's worth noting that 4 entries cover 4 GiB of RAM (4 x 1 GiB), which seems enough to handle networking if used properly.
Finally, TLBs are per-core resources: each core has its own set of TLBs.
If you disable SMT (e.g. Intel Hyper-Threading) or assign both threads on a core to the same VM, the VMs won't be competing for the TLB entries.
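
At the Linux level that kind of core assignment boils down to CPU affinity (hypervisors such as libvirt expose it as vCPU pinning); a minimal sketch, with a placeholder core number, of pinning the calling thread to a single core:

    /* Minimal sketch (Linux-specific, the core number is a placeholder):
       pin the calling thread to one core so its TLB entries stay on that core. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(2, &set);                                     /* core 2: pick a real core for your setup */
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {   /* 0 = calling thread */
            perror("sched_setaffinity");
            return 1;
        }
        printf("pinned to core 2\n");
        return 0;
    }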

However, each VM can have at most 4xC 1 GiB hugepage entries cached, where C is the number of cores dedicated to that VM.
The ability of the VM to fully exploit these entries depends on how the host OS, the hypervisor and the guest OS work together, and on the memory layout of the guest application of interest (pages shared across cores have duplicate TLB entries in each core).
It's hard (almost impossible?) to use 1 GiB pages transparently, and I'm not sure how the hypervisor and the VM are going to use those pages - I'd say you need specific support for that, but I'm not sure.
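
On Linux, "specific support" usually means explicitly asking for the page size; a minimal sketch (assuming 1 GiB hugepages have been reserved, as in the question) of backing a buffer with a 1 GiB page via mmap:

    /* Minimal sketch (Linux-specific): explicitly request a 1 GiB hugepage.
       Assumes 1 GiB hugepages are reserved and supported by the CPU and kernel. */
    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <stdio.h>

    #ifndef MAP_HUGE_SHIFT
    #define MAP_HUGE_SHIFT 26
    #endif
    #ifndef MAP_HUGE_1GB
    #define MAP_HUGE_1GB (30 << MAP_HUGE_SHIFT)  /* log2(1 GiB) encoded in the mmap flags */
    #endif

    int main(void) {
        size_t len = 1UL << 30;                  /* one 1 GiB page */
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_1GB,
                       -1, 0);
        if (p == MAP_FAILED) {
            perror("mmap(MAP_HUGETLB | MAP_HUGE_1GB)");
            return 1;
        }
        printf("1 GiB hugepage mapped at %p\n", p);
        munmap(p, len);
        return 0;
    }

In the VM case this request is typically made by the hypervisor on the guest's behalf (e.g. QEMU backing guest RAM with hugetlbfs), not by the guest application itself.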

As Peter Cordes noted, 1 GiB pages use a single-level TLB (and on Skylake there is apparently also a second-level TLB with 16 entries for 1 GiB pages). A miss in the 1 GiB TLB results in a page walk, so it's very important that all the software involved uses hugepage-aware code.
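
One way to check whether the hugepages actually reduce page walks is to count dTLB misses around the workload; a minimal sketch using Linux perf_event_open (this is the generic dTLB-load-miss cache event, the same counter perf stat -e dTLB-load-misses reports, not one specific to 1 GiB entries):

    /* Minimal sketch (Linux perf_event_open): count dTLB load misses
       around a workload, e.g. the network test with and without hugepages. */
    #define _GNU_SOURCE
    #include <linux/perf_event.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <string.h>
    #include <stdio.h>
    #include <unistd.h>

    static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                                int cpu, int group_fd, unsigned long flags) {
        return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
    }

    int main(void) {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.type = PERF_TYPE_HW_CACHE;
        attr.size = sizeof(attr);
        attr.config = PERF_COUNT_HW_CACHE_DTLB |
                      (PERF_COUNT_HW_CACHE_OP_READ << 8) |
                      (PERF_COUNT_HW_CACHE_RESULT_MISS << 16);
        attr.disabled = 1;
        attr.exclude_kernel = 1;
        attr.exclude_hv = 1;

        int fd = perf_event_open(&attr, 0, -1, -1, 0);   /* this process, any CPU */
        if (fd < 0) { perror("perf_event_open"); return 1; }

        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

        /* ... run the memory/network workload being measured here ... */

        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
        long long misses = 0;
        read(fd, &misses, sizeof(misses));
        printf("dTLB load misses: %lld\n", misses);
        close(fd);
        return 0;
    }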