System stalls when allocating a large block of shared memory while performing a large number of I/O operations


I have a workload in which background threads perform numerous random 4k reads on a file on an NVMe SSD using AIO, at roughly 40k IOPS. To bypass the filesystem page cache, I opened the file with O_DIRECT.
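For context, here is a minimal sketch of how that read path is set up, assuming libaio (link with -laio); the file path, queue depth, and single submit/reap round are placeholders for the real worker loop:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <libaio.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define BLOCK 4096   /* matches the 4k read size and PAGE_SIZE */
    #define DEPTH 32     /* hypothetical queue depth */

    int main(void)
    {
        /* O_DIRECT bypasses the page cache but requires the buffer,
         * offset, and length to be aligned to the logical block size. */
        int fd = open("/data/testfile", O_RDONLY | O_DIRECT);
        if (fd < 0) { perror("open"); return 1; }

        off_t file_size = lseek(fd, 0, SEEK_END);

        io_context_t ctx = 0;
        if (io_setup(DEPTH, &ctx) < 0) { fprintf(stderr, "io_setup failed\n"); return 1; }

        struct iocb iocbs[DEPTH], *iocbps[DEPTH];
        for (int i = 0; i < DEPTH; i++) {
            void *buf;
            if (posix_memalign(&buf, BLOCK, BLOCK)) return 1;
            /* rand() is only illustrative; offsets stay 4k-aligned. */
            off_t off = (off_t)(rand() % (file_size / BLOCK)) * BLOCK;
            io_prep_pread(&iocbs[i], fd, buf, BLOCK, off);
            iocbps[i] = &iocbs[i];
        }

        int ret = io_submit(ctx, DEPTH, iocbps);
        if (ret != DEPTH) { fprintf(stderr, "io_submit: %d\n", ret); return 1; }

        struct io_event events[DEPTH];
        int done = io_getevents(ctx, DEPTH, DEPTH, events, NULL);
        printf("completed %d reads\n", done);

        io_destroy(ctx);
        close(fd);
        return 0;
    }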

However, when I allocate a large chunk of shared memory (e.g. 100GB on a machine with 1000GB of RAM, of which 400GB is free) using shm_open and ftruncate, the entire system stalls for a few seconds. I'm not entirely sure why this happens, but I suspect the memory subsystem is doing some cleanup work.

My typical memory usage:

  1. Allocate a large chunk of memory (e.g. 100GB) and many small chunks of memory (about 20MB each) in shared memory;
  2. Use them for a while;
  3. Deallocate them and go back to step 1 with a different memory size (a sketch of one iteration follows this list).
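A minimal sketch of one iteration of this cycle, assuming POSIX shared memory with a hypothetical object name /bigblock (link with -lrt on older glibc); per the behavior described above, the stall is observed around the allocation of the large region:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        /* Hypothetical name and size; the real workload also allocates
         * many ~20MB chunks and varies the large size per iteration. */
        const char *name = "/bigblock";
        size_t size = 100ULL << 30;       /* step 1: 100GB */

        int fd = shm_open(name, O_CREAT | O_RDWR, 0600);
        if (fd < 0) { perror("shm_open"); return 1; }

        /* The stall is reported around this allocation. ftruncate sets
         * the size of the tmpfs-backed object; physical pages are then
         * allocated as the mapping is touched. */
        if (ftruncate(fd, (off_t)size) < 0) { perror("ftruncate"); return 1; }

        char *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        p[0] = 1;                         /* step 2: use the region */

        munmap(p, size);                  /* step 3: tear down ...   */
        close(fd);
        shm_unlink(name);                 /* ... and free the tmpfs pages */
        return 0;
    }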

These three steps run within a single process, and several such processes run on the system concurrently, with no coordination in their allocation timing. The system almost always stalls whenever a large block is allocated.

I noticed that if I remove the I/O workload described above, the stalls are greatly alleviated. This is strange, because I have bypassed the filesystem page cache.

  • Linux kernel version: 5.4.56
  • CPU: Intel 8336c
  • getconf PAGE_SIZE: 4096
  • No swap configured
