Capacity Planning for Task/Message Queue

211 Views Asked by Kracekumar At 18 August 2025 at 02:10

There is a Celery task in a Django setup that uses Redis as a backend. I want to improve throughput and responsiveness, and predict task completion time for capacity planning. The task takes data from an internal system and syncs it to an external system. Each task syncs to only one of the 5 external systems.

The current stats of the system

Every day, on average, 3500 tasks arrive in the queue.
Latency breakdown: p50 - 0.91s, p75 - 5.09s, p90 - 17.88s, p95 - 22.56s, p99 - 51.32s, max - 11.93 min
3 workers run in parallel for this queue
Hourly arrival rates differ. 5 hours in a day account for 52% of the items in the queue
- Hour 13 - 22.07%
- Hour 4 - 10.98%
- Hour 14 - 9.80%
- Hour 11 - 6.14%
- Hour 12 - 5.94%
- Hour 20 - 4.85%
- Hour 15 - 4.78%
- Rest (17 hours) - 35.44%

Responsiveness

To make the system (queue workers) responsive, a dedicated queue and set of workers for each external system, together with retry queues, should be sufficient.

Throughput

# Keys are the share of tasks in each percentile bucket, values are the
# latency in seconds at p50, p75, p90, p95, p99, max
>>> probs = {0.50: 0.91, 0.25: 5.09, 0.15: 17.88, 0.05: 22.56, 0.04: 51.32, 0.01: 11.93 * 60}
>>> val = 0
>>> for k, v in probs.items():
...     val += k * v * 3500
...
>>> mean_service_time = val / 3500
>>> mean_service_time

The average (mean) time to complete a task is 14.748s.
The mean is much higher than the median (p50 - 0.91s).
Are there better metrics for calculating the throughput? What are the right metrics? Is the hourly throughput rate a better metric, since the load distribution is uneven?
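One way to look at the hourly question is to convert each hour's share of the daily volume into an arrival rate and compare it with what 3 workers can absorb. The sketch below does that with the numbers from this post. The function name `hourly_utilization` is made up for illustration, and it assumes 14.748s is a pure service time (if the measured latency includes time spent waiting in the queue, the real utilization is lower).

```python
# Figures taken from the question; names are illustrative only.
TASKS_PER_DAY = 3500
MEAN_SERVICE_S = 14.748   # estimated mean task duration, in seconds (assumed pure service time)
WORKERS = 3

# Share of the daily tasks arriving in each listed hour, in percent.
HOURLY_SHARE_PCT = {13: 22.07, 4: 10.98, 14: 9.80, 11: 6.14,
                    12: 5.94, 20: 4.85, 15: 4.78}


def hourly_utilization(share_pct, tasks_per_day=TASKS_PER_DAY,
                       mean_service_s=MEAN_SERVICE_S, workers=WORKERS):
    """Return (arrival rate in tasks/s, per-worker utilization) for one hour."""
    arrivals_per_s = tasks_per_day * share_pct / 100 / 3600
    # Offered load: average number of workers kept busy by this arrival rate.
    offered_load = arrivals_per_s * mean_service_s
    return arrivals_per_s, offered_load / workers


# Print the listed hours, busiest first.
for hour, share in sorted(HOURLY_SHARE_PCT.items(), key=lambda kv: -kv[1]):
    rate, rho = hourly_utilization(share)
    print(f"hour {hour:2d}: {rate:.4f} tasks/s, utilization {rho:.2f}")
```

With these inputs the 13th hour comes out at a utilization slightly above 1, which would mean the queue grows during that hour even though the daily average looks comfortable.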

Queue wait-time

After reading a little about queuing theory, I found formulas to calculate the queue wait time, queue length, and average time spent in the queue.
An online calculator gives me the following results for my case.
Arrival Rate = 0.0405 (3500/86400, where 86400 is the total seconds in a day)
Service Rate = 0.06 (1/14.748)
Number of servers = 3 (3 parallel workers)
Time Period = seconds

Results from the calculator

Queue Length = 39 units
Average Time Spent in the System = 962.9 s
Utilization Factor = 0.675
Probability of 0 units in the system = 0.5079

Is this calculation correct and valid? Is there a better way to calculate metrics?
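As a cross-check on the calculator, the standard M/M/c (Erlang C) formulas can be evaluated directly. This is a minimal sketch, assuming Poisson arrivals and exponentially distributed service times; neither assumption is confirmed by the data above (the long tail and uneven hourly arrivals suggest they hold only roughly), so the output is a rough guide. The function name `mmc_metrics` is illustrative, and the call uses the unrounded service rate 1/14.748 (about 0.0678) rather than 0.06.

```python
import math


def mmc_metrics(arrival_rate, service_rate, servers):
    """Steady-state M/M/c metrics (Erlang C). Both rates share one time unit."""
    a = arrival_rate / service_rate          # offered load in Erlangs
    rho = a / servers                        # per-server utilization
    if rho >= 1:
        raise ValueError("unstable queue: utilization must be below 1")
    # Probability that the system is empty.
    tail = a**servers / (math.factorial(servers) * (1 - rho))
    p0 = 1.0 / (sum(a**n / math.factorial(n) for n in range(servers)) + tail)
    # Erlang C: probability that an arriving task has to wait.
    p_wait = tail * p0
    lq = p_wait * rho / (1 - rho)            # mean number of tasks waiting
    wq = lq / arrival_rate                   # mean wait in queue (Little's law)
    w = wq + 1 / service_rate                # mean time in system
    return {"utilization": rho, "p0": p0, "p_wait": p_wait,
            "queue_length": lq, "wait_in_queue": wq, "time_in_system": w}


# Daily-average inputs from the question, in tasks per second.
metrics = mmc_metrics(arrival_rate=3500 / 86400,
                      service_rate=1 / 14.748,
                      servers=3)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Called with `arrival_rate=0.0405` and `service_rate=0.06`, the function returns a `p0` of about 0.5079, the same as the calculator's probability of 0 units. Comparing the remaining outputs against the calculator is a quick way to see which model and time units it assumes.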

The intention of all these calculations is to work out how to make the queue workers faster when the production rate grows by 3X, and how to reduce the queue waiting time and queue length, assuming other resources such as the database aren't bottlenecks.

Is there a better way to do capacity planning?

Note: One could follow a trial-and-error approach, such as increasing the capacity by 5X for 3X growth, but I would like to see if the problem can be approached using queueing theory and maths.
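To make the 3X question concrete, one possible approach is to search for the smallest worker count that keeps the mean queue wait under a target during the busiest hour. The sketch below assumes an M/M/c model again, sizes for the 13th hour (22.07% of daily volume), and uses a 5s wait target that is purely hypothetical. The names `mean_queue_wait` and `min_workers` are illustrative.

```python
import math


def mean_queue_wait(arrival_rate, mean_service_s, workers):
    """Mean M/M/c queue wait in seconds, or infinity if the queue is unstable."""
    a = arrival_rate * mean_service_s        # offered load in Erlangs
    rho = a / workers                        # per-worker utilization
    if rho >= 1:
        return math.inf
    tail = a**workers / (math.factorial(workers) * (1 - rho))
    # Erlang C: probability that an arriving task has to wait.
    p_wait = tail / (sum(a**n / math.factorial(n) for n in range(workers)) + tail)
    return p_wait * mean_service_s / (workers * (1 - rho))


def min_workers(arrival_rate, mean_service_s, max_wait_s, limit=100):
    """Smallest worker count whose mean queue wait is at most max_wait_s."""
    for workers in range(1, limit + 1):
        if mean_queue_wait(arrival_rate, mean_service_s, workers) <= max_wait_s:
            return workers
    raise ValueError("no worker count within the limit meets the target")


# Inputs from the question; the 5 s target is a hypothetical choice.
peak_rate = 3500 * 0.2207 / 3600             # busiest hour today, tasks/s
for growth in (1, 3):
    n = min_workers(peak_rate * growth, 14.748, max_wait_s=5.0)
    print(f"{growth}x load: {n} workers")
```

Sizing for the peak hour rather than the daily average is a deliberate choice here: the daily average hides the hours in which the workers are saturated.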
