While benchmarking small size class allocation in jemalloc-5.2.0, I found that allocating 4096 bytes takes significantly longer than allocating other small sizes. Does jemalloc handle 4096-byte allocations specially, or is there another reason?
Test results:
- Google Benchmark, multithreaded test (24 threads).
Run on (32 X 3400 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x16)
L1 Instruction 32 KiB (x16)
L2 Unified 256 KiB (x16)
L3 Unified 20480 KiB (x2)
Load Average: 15.72, 14.21, 14.26
-----------------------------------------------------------------------------------------
Benchmark Time CPU Iterations
-----------------------------------------------------------------------------------------
BM_SomeFunction/1792/iterations:500/threads:24 0.095 ms 2.12 ms 12000
BM_SomeFunction/1856/iterations:500/threads:24 0.175 ms 4.10 ms 12000
BM_SomeFunction/1920/iterations:500/threads:24 0.178 ms 4.13 ms 12000
BM_SomeFunction/1984/iterations:500/threads:24 0.177 ms 4.14 ms 12000
BM_SomeFunction/2048/iterations:500/threads:24 0.181 ms 4.18 ms 12000
BM_SomeFunction/2048/iterations:500/threads:24 0.177 ms 4.16 ms 12000
BM_SomeFunction/2176/iterations:500/threads:24 0.116 ms 2.67 ms 12000
BM_SomeFunction/2304/iterations:500/threads:24 0.113 ms 2.64 ms 12000
BM_SomeFunction/2432/iterations:500/threads:24 0.118 ms 2.75 ms 12000
BM_SomeFunction/2560/iterations:500/threads:24 0.113 ms 2.65 ms 12000
BM_SomeFunction/2560/iterations:500/threads:24 0.114 ms 2.68 ms 12000
BM_SomeFunction/2688/iterations:500/threads:24 0.133 ms 3.13 ms 12000
BM_SomeFunction/2816/iterations:500/threads:24 0.132 ms 3.08 ms 12000
BM_SomeFunction/2944/iterations:500/threads:24 0.131 ms 3.09 ms 12000
BM_SomeFunction/3072/iterations:500/threads:24 0.132 ms 3.10 ms 12000
BM_SomeFunction/3072/iterations:500/threads:24 0.132 ms 3.11 ms 12000
BM_SomeFunction/3200/iterations:500/threads:24 0.117 ms 2.72 ms 12000
BM_SomeFunction/3328/iterations:500/threads:24 0.113 ms 2.66 ms 12000
BM_SomeFunction/3456/iterations:500/threads:24 0.111 ms 2.61 ms 12000
BM_SomeFunction/3584/iterations:500/threads:24 0.112 ms 2.63 ms 12000
BM_SomeFunction/3584/iterations:500/threads:24 0.112 ms 2.63 ms 12000
BM_SomeFunction/3712/iterations:500/threads:24 0.271 ms 6.35 ms 12000
BM_SomeFunction/3840/iterations:500/threads:24 0.270 ms 6.35 ms 12000
BM_SomeFunction/3968/iterations:500/threads:24 0.274 ms 6.42 ms 12000
BM_SomeFunction/4096/iterations:500/threads:24 0.276 ms 6.49 ms 12000
BM_SomeFunction/4096/iterations:500/threads:24 0.273 ms 6.41 ms 12000
BM_SomeFunction/4352/iterations:500/threads:24 0.151 ms 3.53 ms 12000
BM_SomeFunction/4608/iterations:500/threads:24 0.146 ms 3.45 ms 12000
BM_SomeFunction/4864/iterations:500/threads:24 0.142 ms 3.36 ms 12000
BM_SomeFunction/5120/iterations:500/threads:24 0.144 ms 3.40 ms 12000
BM_SomeFunction/5120/iterations:500/threads:24 0.146 ms 3.40 ms 12000
BM_SomeFunction/5376/iterations:500/threads:24 0.196 ms 4.57 ms 12000
BM_SomeFunction/5632/iterations:500/threads:24 0.187 ms 4.39 ms 12000
BM_SomeFunction/5888/iterations:500/threads:24 0.191 ms 4.47 ms 12000
BM_SomeFunction/6144/iterations:500/threads:24 0.188 ms 4.39 ms 12000
How to read the report:
BM_SomeFunction/1792/iterations:500/threads:24 0.095 ms 2.12 ms 12000
means that one benchmark iteration, which allocates and then frees 10000 blocks of 1792 bytes each, is reported as 2.12 ms of CPU time.
Test code:
#include <cstddef>
#include <vector>
#include "benchmark/benchmark.h"
#include "jemalloc/jemalloc.h"

// Number of blocks allocated, then freed, per benchmark iteration.
static const size_t kBatchSize = 10000;

static void alloc_mem_n(size_t size) {
    std::vector<char*> kVec(kBatchSize, nullptr);
    for (size_t i = 0; i < kBatchSize; ++i) {
        auto p = new char[size];
        p[0] = static_cast<char>(i);  // touch the block so it is actually used
        benchmark::ClobberMemory();
        kVec[i] = p;
    }
    for (auto &p : kVec) {
        delete[] p;  // memory from new[] must be released with delete[]
        p = nullptr;
    }
}

static void BM_SomeFunction(benchmark::State& state) {
    for (auto _ : state) {
        alloc_mem_n(state.range(0));
    }
}

// Adjacent ranges share an endpoint, so 2048, 2560, 3072, 3584, 4096
// and 5120 each run twice.
BENCHMARK(BM_SomeFunction)
    ->Unit(benchmark::kMillisecond)
    ->Iterations(500)
    ->Threads(24)
    ->DenseRange(1792, 2048, 64)
    ->DenseRange(2048, 2560, 128)
    ->DenseRange(2560, 3072, 128)
    ->DenseRange(3072, 3584, 128)
    ->DenseRange(3584, 4096, 128)
    ->DenseRange(4096, 5120, 256)
    ->DenseRange(5120, 6144, 256);

BENCHMARK_MAIN();
jemalloc serves small sizes from slabs, and the size of a slab is the least common multiple of the allocation size and the page size.
In this case, allocated_size == page_size == slab_size == 4096, which means that a 4096-byte slab can satisfy only one 4096-byte allocation. Every such allocation therefore needs a slab of its own, so the cost of obtaining a slab is paid on each allocation instead of being shared among several.
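The slab-size rule above can be checked with a little arithmetic. The following is only a minimal sketch of that rule, assuming a 4096-byte page as in the answer; `gcd_of`, `slab_bytes` and `regions_per_slab` are hypothetical helper names made up for illustration and are not part of jemalloc's API.

```cpp
#include <cstddef>

// Hypothetical helpers for illustration only; these are not jemalloc functions.

// Greatest common divisor, computed with Euclid's algorithm.
static std::size_t gcd_of(std::size_t a, std::size_t b) {
    while (b != 0) {
        std::size_t t = a % b;
        a = b;
        b = t;
    }
    return a;
}

// Slab size under the rule described above: the least common multiple
// of the allocation size and the page size.
static std::size_t slab_bytes(std::size_t alloc_size, std::size_t page_size) {
    return alloc_size / gcd_of(alloc_size, page_size) * page_size;
}

// Number of allocations of alloc_size that fit in one such slab.
static std::size_t regions_per_slab(std::size_t alloc_size, std::size_t page_size) {
    return slab_bytes(alloc_size, page_size) / alloc_size;
}
```

With a 4096-byte page, regions_per_slab gives 16 for 1792 bytes, 2 for 2048, 8 for 3584 and 1 for 4096. Under this rule 4096 is the only size in the tested range that gets a single allocation per slab, which seems consistent with it being the slowest group in the results above.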