Why does jemalloc take more time to allocate 4096 bytes than other small size classes?


While benchmarking small-class allocation in jemalloc 5.2.0, I found that allocating 4096 bytes takes significantly longer than allocating other small size classes. Does jemalloc handle 4096-byte allocations specially, or is there some other reason?

Test results:

Run on (32 X 3400 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x16)
  L1 Instruction 32 KiB (x16)
  L2 Unified 256 KiB (x16)
  L3 Unified 20480 KiB (x2)
Load Average: 15.72, 14.21, 14.26
-----------------------------------------------------------------------------------------
Benchmark                                               Time             CPU   Iterations
-----------------------------------------------------------------------------------------
BM_SomeFunction/1792/iterations:500/threads:24      0.095 ms         2.12 ms        12000
BM_SomeFunction/1856/iterations:500/threads:24      0.175 ms         4.10 ms        12000
BM_SomeFunction/1920/iterations:500/threads:24      0.178 ms         4.13 ms        12000
BM_SomeFunction/1984/iterations:500/threads:24      0.177 ms         4.14 ms        12000
BM_SomeFunction/2048/iterations:500/threads:24      0.181 ms         4.18 ms        12000
BM_SomeFunction/2048/iterations:500/threads:24      0.177 ms         4.16 ms        12000
BM_SomeFunction/2176/iterations:500/threads:24      0.116 ms         2.67 ms        12000
BM_SomeFunction/2304/iterations:500/threads:24      0.113 ms         2.64 ms        12000
BM_SomeFunction/2432/iterations:500/threads:24      0.118 ms         2.75 ms        12000
BM_SomeFunction/2560/iterations:500/threads:24      0.113 ms         2.65 ms        12000
BM_SomeFunction/2560/iterations:500/threads:24      0.114 ms         2.68 ms        12000
BM_SomeFunction/2688/iterations:500/threads:24      0.133 ms         3.13 ms        12000
BM_SomeFunction/2816/iterations:500/threads:24      0.132 ms         3.08 ms        12000
BM_SomeFunction/2944/iterations:500/threads:24      0.131 ms         3.09 ms        12000
BM_SomeFunction/3072/iterations:500/threads:24      0.132 ms         3.10 ms        12000
BM_SomeFunction/3072/iterations:500/threads:24      0.132 ms         3.11 ms        12000
BM_SomeFunction/3200/iterations:500/threads:24      0.117 ms         2.72 ms        12000
BM_SomeFunction/3328/iterations:500/threads:24      0.113 ms         2.66 ms        12000
BM_SomeFunction/3456/iterations:500/threads:24      0.111 ms         2.61 ms        12000
BM_SomeFunction/3584/iterations:500/threads:24      0.112 ms         2.63 ms        12000
BM_SomeFunction/3584/iterations:500/threads:24      0.112 ms         2.63 ms        12000
BM_SomeFunction/3712/iterations:500/threads:24      0.271 ms         6.35 ms        12000
BM_SomeFunction/3840/iterations:500/threads:24      0.270 ms         6.35 ms        12000
BM_SomeFunction/3968/iterations:500/threads:24      0.274 ms         6.42 ms        12000
BM_SomeFunction/4096/iterations:500/threads:24      0.276 ms         6.49 ms        12000
BM_SomeFunction/4096/iterations:500/threads:24      0.273 ms         6.41 ms        12000
BM_SomeFunction/4352/iterations:500/threads:24      0.151 ms         3.53 ms        12000
BM_SomeFunction/4608/iterations:500/threads:24      0.146 ms         3.45 ms        12000
BM_SomeFunction/4864/iterations:500/threads:24      0.142 ms         3.36 ms        12000
BM_SomeFunction/5120/iterations:500/threads:24      0.144 ms         3.40 ms        12000
BM_SomeFunction/5120/iterations:500/threads:24      0.146 ms         3.40 ms        12000
BM_SomeFunction/5376/iterations:500/threads:24      0.196 ms         4.57 ms        12000
BM_SomeFunction/5632/iterations:500/threads:24      0.187 ms         4.39 ms        12000
BM_SomeFunction/5888/iterations:500/threads:24      0.191 ms         4.47 ms        12000
BM_SomeFunction/6144/iterations:500/threads:24      0.188 ms         4.39 ms        12000

test report:

BM_SomeFunction/1792/iterations:500/threads:24      0.095 ms         2.12 ms        12000

means that one iteration (allocating and then freeing 10000 blocks of 1792 bytes) consumed 2.12 ms of CPU time, i.e. roughly 212 ns per allocate/free pair.

Test code

#include <vector>

#include "benchmark/benchmark.h"
#include "jemalloc/jemalloc.h"

static size_t kBatchSize = 10000;

static void alloc_mem_n(size_t size) {
    std::vector<char*> kVec(kBatchSize, nullptr);
    for (size_t i = 0; i < kBatchSize; ++i) {
        auto p = new char[size];
        p[0] = static_cast<char>(i);  // touch the block so the store is observable
        benchmark::ClobberMemory();
        kVec[i] = p;
    }
    for (auto& p : kVec) {
        delete[] p;  // array new must be paired with array delete
        p = nullptr;
    }
}

static void BM_SomeFunction(benchmark::State& state) {
    for (auto _ : state) {
        alloc_mem_n(state.range(0));
    }
}


BENCHMARK(BM_SomeFunction)
    ->Unit(benchmark::kMillisecond)
    ->Iterations(500)
    ->Threads(24)
    // Note: adjacent DenseRanges share their endpoints, which is why sizes
    // such as 2048, 2560 and 4096 appear twice in the results above.
    ->DenseRange(1792, 2048, 64)
    ->DenseRange(2048, 2560, 128)
    ->DenseRange(2560, 3072, 128)
    ->DenseRange(3072, 3584, 128)
    ->DenseRange(3584, 4096, 128)
    ->DenseRange(4096, 5120, 256)
    ->DenseRange(5120, 6144, 256);

BENCHMARK_MAIN();

There are 2 best solutions below

Answer 1:

jemalloc serves small size classes from slabs, and the slab size is the least common multiple of the size class and the page size.

In this case, allocated_size == page_size == slab_size == 4096, which means a 4096-byte slab can satisfy only a single 4096-byte allocation.

Answer 2:

In jemalloc, size < 16K considers small allocate, and will share extents with other memory allocation. For example, allocation of 5120 will find an extent of 16K and the rest of extent will be used for other allocations, resulting in memory fragmentation. However, for multiple of 4K, Jemalloc will apply an extent of exact same size, and there will be no memory fragmentation. You can see in Jemalloc stats, the util of arena bins of 4K, 8K and 12K is always 1, while other bins is usually less than 1.