Python 3.5.x multiprocessing throwing "OSError: [Errno 12] Cannot allocate memory" on a 74% free RAM system

I wrote a Python 3.5 application that spawns several processes using the multiprocessing library. It runs on Ubuntu 16.04.3 LTS on a dedicated server with 2 x Intel Xeon E5-2690 v1 CPUs (8 cores each) and 96 GB of RAM.

This system also runs a PostgreSQL instance that is configured to use a maximum of about 32 GB of RAM (effective_cache_size is set to 32GB), but I mention this only as context for my question (I tried several combinations of effective_cache_size, work_mem, shared_buffers, etc.).

Each process opens a connection to the database and reuses it many times.
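
The per-process connection handling looks roughly like the following (a simplified sketch, not the real code; the psycopg2 driver and the connection parameters are only illustrative):

import time
import psycopg2  # illustrative driver; placeholder credentials below

def worker(arg1, arg2):
    # Each process opens its own connection once and reuses it many times
    conn = psycopg2.connect(dbname="mydb", user="myuser",
                            host="127.0.0.1", port=5433)
    cur = conn.cursor()
    while True:
        cur.execute("SELECT 1")   # placeholder for the real queries
        cur.fetchone()
        time.sleep(1)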

Here's a simplified portion of code that shows how I spawn a new process:

from multiprocessing import Process
import time

def start_algorithm(arg1, arg2):
    # Simplified: each worker just idles in this example
    while True:
        time.sleep(1)

process = Process(target=start_algorithm, args=(arg1, arg2))
process.start()

After spawning more than 200 processes (the exact number isn't always the same), the application throws an exception when trying to spawn a new process:

OSError: [Errno 12] Cannot allocate memory
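
For reference, the failure point can be observed by wrapping the spawn in a try/except (a sketch; spawn_workers and its names are illustrative, not part of my application):

import errno
from multiprocessing import Process

def spawn_workers(target, n):
    # Start workers one by one and report how many started before ENOMEM
    workers = []
    for i in range(n):
        p = Process(target=target)
        try:
            p.start()
        except OSError as e:
            if e.errno == errno.ENOMEM:
                print("Cannot allocate memory after %d processes" % i)
                break
            raise
        workers.append(p)
    return workers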

ulimit -a output is:

$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 384500
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 524288
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 384500
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

In /etc/sysctl.conf I set the following parameters, which work well for mdadm and PostgreSQL:

# Allows for 84GB shared_buffers in PostgreSQL
kernel.shmmax = 90914313216
kernel.shmall = 22020096

# Various PostgreSQL optimizations
vm.overcommit_memory = 2
vm.overcommit_ratio = 90
vm.swappiness = 4
vm.zone_reclaim_mode = 0
vm.dirty_ratio = 15
vm.dirty_background_ratio = 3

# mdadm optimizations
vm.min_free_kbytes=262144
kernel.sched_migration_cost_ns = 5000000
kernel.sched_autogroup_enabled = 0
dev.raid.speed_limit_max=1000000
dev.raid.speed_limit_min=1000000

I also tried setting and unsetting vm.nr_hugepages, but it didn't get rid of the problem.
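
For context, with vm.overcommit_memory = 2 and vm.overcommit_ratio = 90 the kernel enforces a commit limit (roughly swap plus 90% of RAM), which is visible in /proc/meminfo. A small helper sketch to read it (the function name is illustrative):

def read_commit_info(path="/proc/meminfo"):
    # Report the kernel's commit limit and the currently committed address space
    wanted = ("CommitLimit", "Committed_AS")
    info = {}
    with open(path) as f:
        for line in f:
            key, _, rest = line.partition(":")
            if key in wanted:
                info[key] = rest.strip()
    return info

if __name__ == "__main__":
    print(read_commit_info())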

Before starting my Python application, RAM usage is about 500 MB out of 96 GB, so the RAM is essentially empty. After spawning those 200+ processes, RAM usage starts to grow, reaches a maximum of about 20 GB (the remaining 74 GB are still free), and then the Cannot allocate memory exception is thrown.

The question is: why?

I tried to measure the overall footprint of the processes and found memory_profiler, a Python library/tool. I was able to get this graph:

Memory footprint of 287 processes.
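
As a rough sketch of how memory_profiler samples a process (the PID, interval, and timeout below are illustrative values, not the exact invocation I used):

from memory_profiler import memory_usage

pid = 12345                     # illustrative PID of one worker process
# Sample that process's memory every 0.5 s for 10 s; values are in MiB
samples = memory_usage(proc=pid, interval=0.5, timeout=10)
print("peak: %.1f MiB over %d samples" % (max(samples), len(samples)))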

If I'm not wrong, that is about 47,500 MiB of memory, so roughly 50 GB of "occupied" RAM. Each process's footprint should be about 170 MB. The problem is that I can't see that amount of occupied RAM anywhere. Here are some outputs:

$ free -h
              total        used        free      shared  buff/cache   available
Mem:            94G         18G         74G        570M        1,6G         73G
Swap:           15G          0B         15G

$ vmstat -S M
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0      0  75879     81   1518    0    0     3    40  229  157 12  1 86  0  0

$ top
Tasks: 1457 total,   2 running, 1455 sleeping,   0 stopped,   0 zombie
%Cpu(s):  3,1 us,  0,9 sy,  0,0 ni, 95,8 id,  0,0 wa,  0,0 hi,  0,2 si,  0,0 st
KiB Mem : 98847552 total, 77698752 free, 19509612 used,  1639188 buff/cache
KiB Swap: 15825916 total, 15825916 free,        0 used. 77580104 avail Mem 

I was able to start 287 processes by lowering the amount of memory assigned to PostgreSQL, but this always leaves a lot of RAM free (74 GB). Here's my PostgreSQL 9.6 configuration file (postgresql.conf):

max_connections=2000
listen_addresses = '127.0.0.1,192.168.2.90'
shared_buffers = 1GB
work_mem = 42MB
port=5433
maintenance_work_mem = 256MB
checkpoint_completion_target = 0.9
effective_cache_size = 32GB
default_statistics_target = 1000
random_page_cost=1.2
seq_page_cost=1.0
max_files_per_process = 500 # default 1000
huge_pages = off

EDIT

I found this answer on SO, which shows a way to instantly measure the overall memory usage.

Python (288 processes spawned):

$ ps aux | grep python3 | awk '{sum=sum+$6}; END {print sum/1024 " MB"}'
53488.1 MB

PostgreSQL:

$ ps aux | grep postgres | awk '{sum=sum+$6}; END {print sum/1024 " MB"}'
20653.4 MB
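
The same sums can be computed without the shell pipeline, e.g. with psutil (a sketch, assuming psutil is installed; it matches process names the same way the grep does):

import psutil  # assumed installed; not part of the original setup

def total_rss_mb(name_substring):
    # Sum the resident set size (RSS) of all processes whose name matches
    total = 0
    for p in psutil.process_iter(attrs=["name", "memory_info"]):
        name = p.info["name"] or ""
        mem = p.info["memory_info"]
        if name_substring in name and mem is not None:
            total += mem.rss
    return total / (1024 * 1024)

print(total_rss_mb("python3"), "MB")
print(total_rss_mb("postgres"), "MB")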

I still don't understand why the usual tools (vmstat, free, top, glances) show a different amount of used RAM.
