I have a server with 80 logical cores (model: DL580 G7). I'm running a single thread per core.
Each thread performs Intel MKL FFTs and convolutions, and does many allocations and deallocations from the heap with malloc.
I previously had a server with 16 logical cores and there was no problem: each thread ran on its own core at 100% CPU usage.
When I moved my application from the 16-core server to the 80-core server, which has a NUMA architecture, the first thread I create runs at 100% CPU usage (0% kernel time). With each additional thread, the performance of the other threads decreases, until CPU usage drops to 40% (39% kernel time).
Because kernel time increases, I suspect the cause is heap serialization and the heap lock: as more threads request memory allocations, each request waits longer. Alternatively, remote memory access to other NUMA nodes' memory may be degrading performance.
I tried calling HeapCreate() in each thread to avoid waiting on the heap lock, but HeapAlloc can only allocate up to 512 KB, which is insufficient for me. I also tried VirtualAlloc, but it decreased thread performance.
What should I do to fix this problem?