I use a Django backend which has 2 main workloads:

  1. API server for the Angular UI frontend and the Django admin screens
  2. Scheduler, which kicks off about 8 different scheduled services
    • Metric collection: makes API calls against a fleet of servers for regular status updates, then stores the responses in the PostgreSQL database.
    • Runner: runs Ansible jobs against our fleet of servers

Everything is tied to the same PostgreSQL database and often cross-referenced. So don't think of these as logically separate applications; they're not.

The problem all stems from the Gunicorn web server and timeouts. Gunicorn has 6 workers. With the default settings, Gunicorn kills any worker that hasn't checked in with the master process within ~30 seconds (for example, because it's busy with a long-running task). That works great if your workload is just HTTP calls and not background processes. But since I clearly have both, I have issues with Gunicorn killing (or not killing) my processes.

I had the Gunicorn timeout really high, 21600 secs (6 hours), for years. With this setting my Ansible jobs and my scheduler jobs would finish. But the database would have a large number of inactive (2000-4000) and active (1-20) sessions. The database would be sluggish and connections would be slow to die off.
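As an aside (this is not from the original setup, just a possible contributing factor): the pile-up of idle Postgres sessions from long-lived Django processes is often governed by Django's CONN_MAX_AGE database option, which controls whether connections persist between requests. A minimal settings.py sketch, assuming an otherwise standard PostgreSQL config (the database name is a placeholder):

```python
# settings.py fragment (sketch). CONN_MAX_AGE is a real Django setting:
# 0 closes the DB connection at the end of every request, a positive number
# recycles connections after that many seconds, and None keeps them open
# forever (which can leave many idle sessions visible in pg_stat_activity).
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql',
        'NAME': 'rocketdbaas',   # placeholder name
        'CONN_MAX_AGE': 0,       # close after each request instead of lingering
    }
}
```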

Recently I lowered it to 60 seconds, and all my Scheduler and Ansible jobs would fail. I then went to 800 seconds. Most Scheduler jobs finish, but Ansible jobs will just die when that 800-second mark rolls around if my HTTP calls are slow to come in.

Now I clearly see I have two workload concerns: 1) API calls and 2) backend processing.

Current Gunicorn service

[Unit]
Description = RocketDBaaS_runner
Requires=runner.socket cntlm.service
After=network.target cntlm.service

[Service]
PIDFile=/run/runner.pid
User=root
Group=dbaas
RuntimeDirectory=runner
Environment=http_proxy=http://localhost:3128
Environment=https_proxy=http://localhost:3128
LimitNOFILE=131072
LimitNPROC=64000
LimitMEMLOCK=infinity
ExecStart=/opt/dbaas/RocketDBaaS_api/venv/bin/gunicorn \
         --pid /run/runner.pid \
         --access-logfile None \
         --workers 6 \
         --bind unix:/run/runner.socket \
         --pythonpath /opt/dbaas/RocketDBaaS_api \
         --timeout 300 \
         --graceful-timeout 30 \
         RocketDBaaS_api.wsgi
ExecReload=/bin/kill -s HUP $MAINPID
ExecStop=/bin/kill -s TERM $MAINPID
PrivateTmp=true
KillMode=mixed
KillSignal=SIGKILL

So what to do???

My top idea is to create three services calling the same code base, but pass in a variable called service_function='Scheduler|API|Runner':

  1. API: Basically use what I have above, but pass in service_function='API' and lower my timeout to 60 seconds
  2. Scheduler: I'm torn
    A) Create a service that calls Django directly and pass service_function='Scheduler'. I've heard it's not wise to run Django this way because it's insecure, but I would not be taking in HTTP requests. FYI: the scheduler starts and handles its own threads.
    B) Just set up a new Gunicorn server with no workers or threads and timeout=0
  3. Runner: Create a service like #2 and pass service_function='Runner'

I could combine #2 and #3, but I'm thinking it might be way easier to search journalctl with 3 distinct services. My log files are already separated, though, so it's not a show stopper.

In the wsgi.py file, I plan to read that input parameter and use it as a global variable. This variable would then tell Django whether it should act as an API, Scheduler, or Runner server.
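A minimal sketch of that global, assuming the value arrives as an environment variable (the module name, variable names, and default are all hypothetical):

```python
import os

# Hypothetical module, e.g. dbaas/service_mode.py: read the role once at
# import time so any other module can import it and branch with a plain if.
SERVICE_FUNCTION = os.getenv('service_function', 'API')  # assumed default

def is_api() -> bool:
    return SERVICE_FUNCTION == 'API'

def is_scheduler() -> bool:
    return SERVICE_FUNCTION == 'Scheduler'

def is_runner() -> bool:
    return SERVICE_FUNCTION == 'Runner'
```

Any other module could then do `from dbaas.service_mode import is_scheduler` and gate its logic on one call.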

Currently my scheduler uses a first-come-first-served locking mechanism that holds until the owning process dies off. That's kind of a downfall of the current design, since every process has to flow through all of the scheduler-startup logic just to discover that the lock file exists and the process named in it is still active. This solution would replace that with a simple if statement.
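For context, the lock-file check described above looks roughly like this (the path and function names are made up for illustration): each would-be scheduler writes its PID to a file, and later starters bail out if that PID is still alive.

```python
import os

LOCK_FILE = '/tmp/scheduler.lock'  # hypothetical path

def pid_is_alive(pid: int) -> bool:
    """True if a process with this PID exists (signal 0 probes without killing)."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # process exists but is owned by another user
    return True

def try_acquire_scheduler_lock() -> bool:
    """First come, first served: return True if we become the scheduler."""
    if os.path.exists(LOCK_FILE):
        with open(LOCK_FILE) as f:
            old_pid = int(f.read().strip() or 0)
        if old_pid and pid_is_alive(old_pid):
            return False  # another live process already holds the lock
    with open(LOCK_FILE, 'w') as f:
        f.write(str(os.getpid()))
    return True
```

Splitting the roles into separate services would reduce all of this to a single comparison at startup.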

Please, I'm a team of 1. I need other viewpoints before going down this road. I've googled a bunch but never really found a question like this.

Stack:

UI: Angular
API: Django
Automation: Ansible
Database: PostgreSQL ~1TB
Interacting with: ~600 VMs

Thanks in advance.

1 Answer
I decided to go with option 2B, "new Gunicorn server with no workers or threads and timeout=0".
This seemed the most straightforward. And maybe someday I'll want to use a test URL to kick something off, in which case having Gunicorn in front would be more secure than calling Django directly.

So the API gunicorn service file now looks like:

[Unit]
Description = RocketDBaaS_api
Requires=api.socket cntlm.service
After=network.target cntlm.service

[Service]
PIDFile=/run/api.pid
User=root
Group=dbaas
RuntimeDirectory=api
Environment=http_proxy=http://localhost:3128
Environment=https_proxy=http://localhost:3128
LimitNOFILE=131072
LimitNPROC=64000
LimitMEMLOCK=infinity
ExecStart=/opt/dbaas/RocketDBaaS_api/venv/bin/gunicorn \
         --env WorkloadType=RocketApi \
         --name api \
         --pid /run/api.pid \
         --access-logfile None \
         --workers 6 \
         --bind unix:/run/api.socket \
         --pythonpath /opt/dbaas/RocketDBaaS_api \
         --timeout 300 \
         --graceful-timeout 30 \
         RocketDBaaS_api.wsgi
ExecReload=/bin/kill -s HUP $MAINPID
ExecStop=/bin/kill -s TERM $MAINPID
PrivateTmp=true
KillMode=mixed
KillSignal=SIGKILL

[Install]
WantedBy=multi-user.target

And the Scheduler Gunicorn service file looks like the following. Notice the "WorkloadType=RocketScheduler", "--timeout 0", and the absence of "--workers" (Gunicorn then defaults to a single worker).

[Unit]
Description = RocketDBaaS_Scheduler
Requires=rocket.socket cntlm.service
After=network.target cntlm.service

[Service]
PIDFile=/run/rocket.pid
User=root
Group=dbaas
RuntimeDirectory=rocket
Environment=http_proxy=http://localhost:3128
Environment=https_proxy=http://localhost:3128
LimitNOFILE=131072
LimitNPROC=64000
LimitMEMLOCK=infinity
ExecStart=/opt/dbaas/RocketDBaaS_api/venv/bin/gunicorn \
         --env WorkloadType=RocketScheduler \
         --name rocket \
         --pid /run/rocket.pid \
         --access-logfile None \
         --bind unix:/run/rocket.socket \
         --pythonpath /opt/dbaas/RocketDBaaS_api \
         --timeout 0 \
         --graceful-timeout 30 \
         RocketDBaaS_api.wsgi
ExecReload=/bin/kill -s HUP $MAINPID
ExecStop=/bin/kill -s TERM $MAINPID
PrivateTmp=true
KillMode=mixed
KillSignal=SIGKILL

[Install]
WantedBy=multi-user.target

As you can see, in both files I included an environment variable called "WorkloadType". I found this the most straightforward way of passing a variable into Python.

In my Python wsgi.py file I have code that looks like the following.
The "workload_type = os.getenv('WorkloadType')" line is how I read the environment variable. Then I start the scheduler depending upon the value.
I can also read this variable in other Python files with a simple import and branch on whatever needs to happen. Maybe disable all routes???

import os
import logging

from django.core.wsgi import get_wsgi_application

log = logging.getLogger(__name__)


def start_app(_workload_type: str):
    """Branch on the WorkloadType env var set in the systemd unit."""
    if _workload_type == 'RocketScheduler':
        from dbaas.services.scheduler.start_the_scheduler import main_scheduler
        log.info('[RocketScheduler]: Starting main_scheduler()')
        main_scheduler()
    elif _workload_type == 'RocketApi':
        log.info('[RocketApi]: Starting')
    else:
        log.error(f'Started RocketDBaaS without a valid workload_type ({_workload_type})')


os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'RocketDBaaS_api.settings')

# Build the Django WSGI app first so settings are loaded before branching.
application = get_wsgi_application()

from dbaas.rocket_config import *
set_globals()

pid = os.getpid()

workload_type = os.getenv('WorkloadType')
start_app(workload_type)
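On the "maybe disable all routes" idea: one way it could look, sketched here as a plain WSGI wrapper rather than anything from the existing code, is to refuse HTTP traffic whenever the process isn't running as the API.

```python
import os

# Sketch (an assumption, not part of the original setup): a PEP 3333 WSGI
# wrapper that answers 503 to every request unless WorkloadType is RocketApi,
# so a Scheduler/Runner process never silently serves the whole API.

def reject_http_unless_api(app):
    """Wrap a WSGI app; pass requests through only in the RocketApi role."""
    def guarded(environ, start_response):
        if os.getenv('WorkloadType') == 'RocketApi':
            return app(environ, start_response)
        start_response('503 Service Unavailable',
                       [('Content-Type', 'text/plain')])
        return [b'This process is not serving HTTP requests.']
    return guarded
```

In wsgi.py the Django app would then be wrapped as `application = reject_http_unless_api(get_wsgi_application())`.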

I have run this for a few days now and have not had any timeout exceptions. :)