Designing SLO based on Prometheus Counters


I want to design an SLI/SLO based on the two counters described below:

requestedCounter = Prometheus counter that gets incremented every time a request is sent to a downstream service

confirmedCounter = Prometheus counter that gets incremented every time a confirmation is received notifying that a downstream service has processed a request

Would it make sense to use something like 1 - [ sum(rate(confirmedCounter)) / sum(rate(requestedCounter)) ] to model bad events / total events? Or would something like count_over_time make more sense than rate?

Any other suggestions would be appreciated too as I'm new to Prometheus SLI/SLOs.


There are 2 best solutions below


Prometheus counters count events. The count_over_time() function counts the number of raw samples stored in the database for each matching time series, so it isn't applicable to counter metrics. Use increase() to calculate the number of events over the lookbehind window specified in square brackets. For example, increase(http_requests_total[1h]) calculates the number of HTTP requests over the last hour.
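To make the difference concrete, here is a minimal sketch that applies both functions to the asker's requestedCounter. The 15s scrape interval mentioned in the comment is an assumption for illustration; run each expression separately in the expression browser.

```promql
# Counts stored samples in the last hour. With an assumed 15s scrape
# interval this is roughly 3600 / 15 = 240 per series, regardless of
# how many requests were actually sent.
count_over_time(requestedCounter[1h])

# Growth of the counter over the last hour, i.e. approximately the
# number of requests sent in that window.
increase(requestedCounter[1h])
```

Note that increase() in Prometheus extrapolates to the edges of the window, so the result can be a non-integer value even though the counter only moves in whole steps.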

So, for your case the following query should return the share of failed requests over the last hour:

1 - increase(confirmedCounter[1h]) / increase(requestedCounter[1h])
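The query above divides series by series. If the counters are exported by several instances, or you want a single SLI value, an aggregated variant along the lines of the one in the question may be more convenient. This is a sketch only: the 1h window is carried over from above, and an SLO would typically use whatever window your objective is defined over.

```promql
# Share of requests without a confirmation over the last hour,
# aggregated across all series of each counter.
1 - sum(increase(confirmedCounter[1h])) / sum(increase(requestedCounter[1h]))

# Equivalent ratio using rate(): for the same window, rate() is
# increase() divided by the window length in seconds, so the factor
# cancels out in the division.
1 - sum(rate(confirmedCounter[1h])) / sum(rate(requestedCounter[1h]))
```

Because a confirmation arrives some time after its request, very short windows can over- or under-count unconfirmed requests around the window edges; longer windows make this effect smaller.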

count_over_time would not work for your use case, as it counts the number of samples for each series over the specified time period.

As an example, check out this query here.

It seems like you are interested in the ratio of the rate of increase of both counter metrics, and hence using rate makes more sense.

One thing to be careful about when constructing your PromQL query is understanding how operators work (see docs here).

For division, your numerator or denominator could evaluate to a scalar or a vector depending on the query. I'd recommend evaluating the numerator and the denominator independently in the Prometheus expression browser first, so that you can be sure your final query (after the division or multiplication) is correct.
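The step-by-step check described above might look like the following, run one expression at a time. The 5m window and the instance label in the last query are assumptions for illustration, not something taken from the question.

```promql
# Step 1: evaluate the numerator on its own and look at the labels
# of the result.
sum(rate(confirmedCounter[5m]))

# Step 2: evaluate the denominator on its own.
sum(rate(requestedCounter[5m]))

# Step 3: combine them once both return what you expect.
1 - sum(rate(confirmedCounter[5m])) / sum(rate(requestedCounter[5m]))

# Variant: keep a label (here the hypothetical "instance") on both
# sides so the division is matched per instance.
1 - sum by (instance) (rate(confirmedCounter[5m]))
  / sum by (instance) (rate(requestedCounter[5m]))
```

When both sides are vectors, the division only produces results for elements whose label sets match on both sides, so mismatched labels lead to an empty result. Aggregating both sides the same way, or using the on() / ignoring() matching modifiers, keeps the two sides aligned.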