I'm doing some research in C++ green threads, mostly boost::coroutine2 and similar POSIX functions like makecontext()/swapcontext(), and planning to implement a C++ green thread library on top of boost::coroutine2. Both require the user code to allocate a stack for every new function/coroutine.
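To make the "user code allocates the stack" requirement concrete, here is a minimal sketch using the POSIX ucontext calls. The function name run_on_caller_allocated_stack, the coroutine body, and the globals are made up for illustration. The stack is a plain std::vector purely to keep the sketch short; it is not a recommendation for how to allocate green thread stacks.

```cpp
#include <ucontext.h>
#include <cassert>
#include <cstddef>
#include <vector>

namespace {
ucontext_t g_main_ctx;  // saved context of the caller
ucontext_t g_co_ctx;    // context of the coroutine
int g_slices = 0;       // how many slices of the coroutine body have run

void coroutine_body() {
    ++g_slices;                           // first slice
    swapcontext(&g_co_ctx, &g_main_ctx);  // yield back to the caller
    ++g_slices;                           // second slice
    // Returning from here resumes uc_link, i.e. g_main_ctx.
}
}  // namespace

// Runs coroutine_body on a stack supplied by the caller and returns the
// number of slices that ran (expected: 2), or -1 on error.
int run_on_caller_allocated_stack(std::size_t stack_bytes) {
    assert(stack_bytes >= 16 * 1024);      // keep the sketch away from tiny stacks
    std::vector<char> stack(stack_bytes);  // the caller owns the stack memory
    g_slices = 0;

    if (getcontext(&g_co_ctx) != 0) return -1;
    g_co_ctx.uc_stack.ss_sp = stack.data();    // where the coroutine's stack lives
    g_co_ctx.uc_stack.ss_size = stack.size();  // and how big it is
    g_co_ctx.uc_link = &g_main_ctx;            // context to resume when the body returns
    makecontext(&g_co_ctx, coroutine_body, 0);

    if (swapcontext(&g_main_ctx, &g_co_ctx) != 0) return -1;  // run until the yield
    if (swapcontext(&g_main_ctx, &g_co_ctx) != 0) return -1;  // resume until return
    return g_slices;
}
```

Note that nothing here checks whether the coroutine overruns the buffer; choosing the size and detecting overflow is left entirely to the caller, which is the problem the rest of this post is about.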
My target platform is x64/Linux. I want my green thread library to be suitable for general use, so the stacks should expand as required (a reasonable upper limit is fine, e.g. 10MB). It would be great if the stacks could also shrink when too much memory is unused (not required). I haven't figured out an appropriate algorithm to allocate stacks.
After some googling, I figured out a few options myself:
- use split stack implemented by the compiler (gcc -fsplit-stack), but split stack has performance overhead. Go has already moved away from split stack due to performance reasons.
- allocate a large chunk of memory with mmap() and hope the kernel is smart enough to leave the physical memory unallocated and allocate it only when the stack is accessed. In this case, we are at the mercy of the kernel.
- reserve a large memory space with mmap(PROT_NONE) and set up a SIGSEGV signal handler. In the signal handler, when the SIGSEGV is caused by stack access (the accessed memory is inside the large memory space reserved), allocate the needed memory with mmap(PROT_READ | PROT_WRITE). Here is the problem with this approach: mmap() isn't async-signal-safe, so it cannot be called inside a signal handler. It can still be implemented, but it is very tricky: create another thread during program startup for memory allocation, and use pipe() + read()/write() to send memory allocation information from the signal handler to that thread.
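For reference, the reserve-then-commit idea in option 3 can be sketched as below. This is only an illustration under loud assumptions: it commits pages with mprotect() directly inside the handler instead of the helper-thread design described above, and mprotect() is not on the POSIX async-signal-safe list either (on Linux it is a thin system call wrapper, but that is an assumption, not a guarantee). The names demo_commit_on_fault and on_segv and the globals are hypothetical. A real green thread version would also need the handler to run on an alternate stack (sigaltstack() plus SA_ONSTACK), because the faulting stack has no usable space left.

```cpp
#include <signal.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cassert>
#include <cstddef>
#include <cstdint>

namespace {
// Hypothetical globals describing the single reserved region of this sketch.
char* g_base = nullptr;
std::size_t g_size = 0;
std::size_t g_page = 0;
volatile sig_atomic_t g_faults = 0;

void on_segv(int, siginfo_t* info, void*) {
    auto addr = reinterpret_cast<std::uintptr_t>(info->si_addr);
    auto base = reinterpret_cast<std::uintptr_t>(g_base);
    if (addr < base || addr >= base + g_size) {
        // Not our region: restore the default action so the retried access
        // faults again and terminates the process as usual.
        signal(SIGSEGV, SIG_DFL);
        return;
    }
    // ASSUMPTION: calling mprotect() here is tolerated on Linux even though
    // POSIX does not list it as async-signal-safe.
    void* page = reinterpret_cast<void*>(addr & ~(static_cast<std::uintptr_t>(g_page) - 1));
    if (mprotect(page, g_page, PROT_READ | PROT_WRITE) != 0) _exit(1);
    g_faults = g_faults + 1;
    // Returning retries the faulting instruction, which now succeeds.
}
}  // namespace

// Reserves `pages` pages as PROT_NONE, writes to the first `touch` pages, and
// returns how many faults the handler resolved, or -1 on setup failure.
int demo_commit_on_fault(std::size_t pages, std::size_t touch) {
    assert(touch <= pages);
    g_page = static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
    g_size = pages * g_page;
    void* p = mmap(nullptr, g_size, PROT_NONE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (p == MAP_FAILED) return -1;
    g_base = static_cast<char*>(p);
    g_faults = 0;

    struct sigaction sa {};
    struct sigaction old {};
    sa.sa_sigaction = on_segv;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, &old) != 0) {
        munmap(p, g_size);
        return -1;
    }

    for (std::size_t i = 0; i < touch; ++i) {
        volatile char* c = g_base + i * g_page;
        *c = 1;  // first write to the page faults once and is retried
        *c = 2;  // second write hits an already committed page, no fault
    }

    int faults = g_faults;
    sigaction(SIGSEGV, &old, nullptr);  // put the previous handler back
    munmap(p, g_size);
    g_base = nullptr;
    return faults;
}
```

Calling demo_commit_on_fault(16, 4) reserves sixteen pages and should resolve exactly one fault for each of the four pages it writes to.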
A few more questions about option 3:
- I'm not sure about the performance overhead of this approach. How well or badly does the kernel/CPU perform when the memory space is extremely fragmented due to thousands of mmap() calls?
- Is this approach correct if the unallocated memory is accessed in kernel space, e.g. when read() is called?
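The second question can be probed with a small experiment. The sketch below (the function name is hypothetical) hands read() a buffer that lies in a PROT_NONE page and reports what happens. On Linux the expected result is that the call fails with EFAULT and no SIGSEGV is delivered, which would mean a SIGSEGV-based scheme never gets the chance to commit the page for accesses made from kernel space.

```cpp
#include <sys/mman.h>
#include <unistd.h>
#include <cassert>
#include <cerrno>
#include <cstddef>

// Returns the errno produced by read() into an inaccessible page,
// 0 if the read unexpectedly succeeds, or -1 if the setup fails.
int errno_of_read_into_unmapped_page() {
    const std::size_t page = static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
    void* buf = mmap(nullptr, page, PROT_NONE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) return -1;

    int fds[2];
    if (pipe(fds) != 0) {
        munmap(buf, page);
        return -1;
    }

    // Put a few bytes into the pipe so that read() has data to copy.
    const char msg[] = "hello";
    int result = -1;
    if (write(fds[1], msg, sizeof msg) == static_cast<ssize_t>(sizeof msg)) {
        errno = 0;
        // The kernel tries to copy into the PROT_NONE page on our behalf.
        ssize_t n = read(fds[0], buf, sizeof msg);
        result = (n < 0) ? errno : 0;
    }

    close(fds[0]);
    close(fds[1]);
    munmap(buf, page);
    return result;
}
```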
Are there any other (better) options for stack allocation for green threads? How are green thread stacks allocated in other implementations, e.g. Go/Java?
Others have mentioned MAP_GROWSDOWN. MAP_GROWSDOWN can conflict with other mapped memory regions (see this correspondence between a Red Hat employee with lots of Linux kernel familiarity and some prominent Linux kernel maintainers). It is also hard to know how far your mapping will be allowed to grow. For example, if mmap() chooses to place the first page of your stack just three 4 KB pages above the next mapping, your stack can only grow to three memory pages. Additionally, if you need to munmap() the stack, you will have to somehow determine how large the stack has grown in order to unmap it.
You can instead rely on the fact that any OS worth its salt (including all major OSs) will not actually map physical pages when you call mmap(), unless you tell mmap() to pre-fault the pages (e.g. by using the MAP_POPULATE or MAP_LOCKED flag). The OS won't map physical memory until a mapped page is touched, meaning a load or store is made to an address in that page. At that point, the CPU will trigger a page fault and call into the OS. The OS will see that you mapped the page with mmap() and then create the mapping to physical memory. Thus, you can mmap() an 8MB stack for a green thread, and if the green thread only ever uses 500 bytes of the stack, only one page of memory will be used.
One more thing: you probably want a guard page at the end of your stack (the lowest address, since stacks grow downward on x64) to prevent a program from overgrowing the stack into another mapped region of memory (instead, it should segfault because it overflowed the stack). The guard page won't have any physical memory associated with it, so it won't actually take up any physical memory. You can achieve this using a combination of mmap() and mprotect(): map the whole region with mmap(), then mark the guard page PROT_NONE with mprotect().
Depending on the situation, you may want to use mlock2() with MLOCK_ONFAULT to tell the OS not to swap the stack's pages and instead keep them in physical memory once they have been faulted in, but be careful with this, as you may start getting allocation or locking failures if the cumulative size of all the thread stacks exceeds the size of physical memory.
As a bonus, the same thing can be done with the Windows API using VirtualAlloc() and VirtualProtect().
To briefly answer your question about performance overhead, I wouldn't worry about address space fragmentation on a 64-bit CPU (unless you are mapping hundreds of terabytes of memory). Thousands of mmap() calls is nothing. The virtual-to-physical memory mapping can be arbitrary; your OS will take care of physical memory fragmentation (it can even move pages of physical memory around without you knowing it).
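The mmap() plus guard page layout described above could look roughly like the following sketch for Linux. GreenStack, alloc_stack, free_stack, and resident_pages are hypothetical names, not part of any library. The resident_pages helper uses mincore() only to let you observe that untouched pages have no physical memory behind them; the exact count after a touch can vary (for example with transparent huge pages), so treat the numbers as indicative.

```cpp
#include <sys/mman.h>
#include <unistd.h>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical descriptor for one green thread stack.
struct GreenStack {
    void* base = nullptr;  // lowest address of the mapping (the guard page)
    std::size_t size = 0;  // total bytes, guard page included
    std::size_t page = 0;  // system page size
    // Stacks grow downward on x86-64, so the initial stack pointer is the top.
    char* top() const { return static_cast<char*>(base) + size; }
};

// Maps `usable_bytes` (rounded up to whole pages) plus one guard page.
bool alloc_stack(GreenStack& s, std::size_t usable_bytes) {
    s.page = static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
    const std::size_t usable = (usable_bytes + s.page - 1) / s.page * s.page;
    s.size = usable + s.page;
    void* p = mmap(nullptr, s.size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
    if (p == MAP_FAILED) return false;
    // Make the lowest page inaccessible so an overflow faults instead of
    // silently running into whatever is mapped below.
    if (mprotect(p, s.page, PROT_NONE) != 0) {
        munmap(p, s.size);
        return false;
    }
    s.base = p;
    return true;
}

void free_stack(GreenStack& s) {
    if (s.base != nullptr) munmap(s.base, s.size);
    s = GreenStack{};
}

// Counts how many pages of the mapping are currently resident in RAM,
// or returns -1 if mincore() fails.
long resident_pages(const GreenStack& s) {
    std::vector<unsigned char> vec(s.size / s.page);
    if (mincore(s.base, s.size, vec.data()) != 0) return -1;
    long count = 0;
    for (unsigned char v : vec) count += v & 1;  // low bit set means resident
    return count;
}
```

A typical use would be to call alloc_stack(s, 8u << 20), pass s.top() (or the range above the guard page) to your context-switching code, and call free_stack(s) when the green thread exits. Checking resident_pages(s) before and after the first write near s.top() shows the lazy allocation in action.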