Capacity Planning for Task/Message Queue


There is a Celery task in a Django setup which uses Redis as a backend. I want to improve throughput and responsiveness, and to predict task completion time for capacity planning. The task takes data from the internal system and syncs it to an external system. Each task syncs to only one of the 5 external systems.

The current stats of the system

  • Every day, on average, 3500 tasks arrive in the queue.
  • Latency breakdown: p50 - 0.91s, p75 - 5.09s, p90 - 17.88s, p95 - 22.56s, p99 - 51.32s, max - 11.93 min
  • 3 workers run in parallel for this queue
  • Hourly arrival rates differ. 5 hours of the day account for 52% of the items in the queue
    • 13th hour - 22.07%
    • 4th hour - 10.98%
    • 14th hour - 9.80%
    • 11th hour - 6.14%
    • 12th hour - 5.94%
    • 20th hour - 4.85%
    • 15th hour - 4.78%
    • Rest (17 hours) - 35.44%
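
The hourly shares above can be turned into per-hour arrival rates, which is the form queueing formulas expect. The following is a minimal sketch; the names `arrival_rates`, `HOURLY_SHARE` and the others are hypothetical, and it assumes the shares apply to the daily average of 3500 tasks and that the "rest" share is spread evenly over the remaining 17 hours.

```python
# Daily volume and hourly shares are taken from the stats above.
DAILY_TASKS = 3500
HOURLY_SHARE = {13: 0.2207, 4: 0.1098, 14: 0.0980, 11: 0.0614,
                12: 0.0594, 20: 0.0485, 15: 0.0478}
REST_SHARE, REST_HOURS = 0.3544, 17  # assumption: spread evenly over 17 hours


def arrival_rates(daily_tasks, hourly_share, rest_share, rest_hours):
    """Return {hour label: tasks per hour}; 'rest' is the per-hour average
    over the remaining hours."""
    rates = {hour: daily_tasks * share for hour, share in hourly_share.items()}
    rates["rest"] = daily_tasks * rest_share / rest_hours
    return rates


rates = arrival_rates(DAILY_TASKS, HOURLY_SHARE, REST_SHARE, REST_HOURS)
# Print the busiest hours first, in tasks per hour and tasks per second.
for hour, per_hour in sorted(rates.items(), key=lambda kv: -kv[1]):
    print(f"hour {hour}: {per_hour:.1f} tasks/h, {per_hour / 3600:.4f} tasks/s")
```

The 13th hour sees roughly 772 tasks, against a flat daily average of about 146 tasks per hour, so sizing on the daily average would understate the peak by a factor of about five.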

Responsiveness

  • To make the system (queue workers) responsive, a dedicated queue and set of workers per external system, together with retry queues, should be sufficient.

Throughput

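# Bands: p50, p75, p90, p95, p99, max
# Key = probability mass of each band, value = latency in seconds.
# Each band is assigned its upper-bound latency, so the result is an upper estimate.
>>> probs = {0.50: 0.91, 0.25: 5.09, 0.15: 17.88, 0.05: 22.56, 0.04: 51.32, 0.01: 11.93 * 60}
>>> val = 0
>>> for k, v in probs.items():
...     val += k * v * 3500
...
>>> mean_service_time = val / 3500
>>> mean_service_time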

  • The mean time to complete a task is 14.748s.
  • The mean is much higher than the median (p50 - 0.91s).
  • Are there better metrics for calculating throughput? What are the right metrics? Is the hourly throughput rate a better metric, since the load distribution is uneven?
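
One way to compare a daily view with an hourly view is worker utilization, rho = lambda * E[S] / c, where lambda is the arrival rate, E[S] the mean service time and c the number of workers. The sketch below uses the figures from this question; the function name `utilization` is hypothetical. It assumes the 14.748s mean is a pure service time (not including queue wait) and that arrivals are spread evenly within each hour.

```python
MEAN_SERVICE_S = 14.748  # mean from the percentile-based estimate in the question
WORKERS = 3
DAILY_TASKS = 3500


def utilization(tasks_per_hour, mean_service_s, workers):
    """rho = lambda * E[S] / c, with lambda converted to tasks per second."""
    # Offered load in erlangs: average number of workers kept busy.
    offered_load = tasks_per_hour / 3600 * mean_service_s
    return offered_load / workers


daily_avg = utilization(DAILY_TASKS / 24, MEAN_SERVICE_S, WORKERS)
peak_hour = utilization(DAILY_TASKS * 0.2207, MEAN_SERVICE_S, WORKERS)  # 13th hour
print(f"daily average: {daily_avg:.2f}, peak hour: {peak_hour:.2f}")
```

Under these assumptions the daily-average utilization is about 0.2, while the 13th hour comes out slightly above 1, meaning the 3 workers cannot keep up during that hour and a backlog builds. That gap suggests an hourly (peak) rate is the more useful basis for planning than the daily average.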

Queue wait-time

Results from the calculator

  • Queue Length = 39 units
  • Average Time Spent in the System = 962.9 s
  • Utilization Factor = 0.675
  • Probability of 0 units in the system = 0.5079

Is this calculation correct and valid? Is there a better way to calculate metrics?
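
For cross-checking calculator output, the standard M/M/c (Erlang C) formulas can be computed directly. This is only a sketch: the function name `mmc_metrics` is hypothetical, and the model assumes Poisson arrivals and exponentially distributed service times. The heavy-tailed latency reported here (p50 under 1s, max near 12 min) probably violates the exponential assumption, so the results should be read as rough guidance.

```python
from math import factorial


def mmc_metrics(arrival_rate, service_rate, servers):
    """Steady-state M/M/c metrics. Rates are per second, per server for service."""
    a = arrival_rate / service_rate  # offered load in erlangs
    rho = a / servers                # utilization per server
    if rho >= 1:
        raise ValueError("unstable queue: utilization >= 1")
    tail = a**servers / (factorial(servers) * (1 - rho))
    p0 = 1 / (sum(a**k / factorial(k) for k in range(servers)) + tail)
    p_wait = tail * p0               # Erlang C: probability an arrival must wait
    lq = p_wait * rho / (1 - rho)    # mean number waiting in the queue
    wq = lq / arrival_rate           # mean wait in queue (Little's law)
    return {"rho": rho, "p0": p0, "p_wait": p_wait,
            "Lq": lq, "Wq": wq, "W": wq + 1 / service_rate}


# Daily-average inputs from the question: 3500 tasks/day, 14.748s mean, 3 workers.
metrics = mmc_metrics(3500 / 86400, 1 / 14.748, 3)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Running the same function with a single hour's arrival rate instead of the daily average shows how sensitive the wait time is to the uneven load; for hours where utilization reaches 1 the function raises, since no steady state exists.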

The intention behind these calculations is to work out how to make the queue workers faster when the production rate grows by 3X, and how to reduce queue waiting time and queue length, assuming other resources such as the database aren't bottlenecks.

Is there a better way to do capacity planning?

Note: One could follow a trial-and-error approach, increasing the capacity by 5X for 3X growth, but I would like to see whether the problem can be approached using queueing theory and maths.
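
As a first-cut alternative to guessing a multiplier, the worker count can be derived from the peak-hour offered load and a target utilization. The sketch below is a simple utilization-target rule, not a full queueing model; `workers_needed` is a hypothetical name, the 0.7 target is an assumed value rather than something measured, and the 14.748s mean is assumed to stay the same as volume grows.

```python
from math import ceil

MEAN_SERVICE_S = 14.748               # mean service time estimated in the question
PEAK_TASKS_PER_HOUR = 3500 * 0.2207   # 13th hour at today's volume


def workers_needed(tasks_per_hour, mean_service_s, target_utilization=0.7):
    """Smallest worker count that keeps utilization at or below the target."""
    offered_load = tasks_per_hour / 3600 * mean_service_s  # erlangs
    return ceil(offered_load / target_utilization)


for growth in (1, 2, 3):
    n = workers_needed(growth * PEAK_TASKS_PER_HOUR, MEAN_SERVICE_S)
    print(f"{growth}X peak load -> {n} workers at 0.7 target utilization")
```

With these inputs the rule gives 5 workers for today's peak hour and 14 for a 3X peak. A lower target utilization buys shorter waits at the cost of more idle workers, so the target is the knob to tune against the wait time the system can tolerate.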
