I want to design an SLI/SLO based on the two counters described below:
requestedCounter = a Prometheus counter that is incremented every time a request is sent to the downstream service
confirmedCounter = a Prometheus counter that is incremented every time a confirmation is received indicating that the downstream service has processed a request
Would it make sense to use something like 1 - [ sum(rate(confirmedCounter)) / sum(rate(requestedCounter)) ] to model bad events / total events? Or would something like count_over_time make more sense than rate?
Any other suggestions would be appreciated too as I'm new to Prometheus SLI/SLOs.
Prometheus counters count the number of events. The count_over_time() function counts the number of raw samples stored in the database for each matching time series, so it isn't applicable to Prometheus counter metrics. Use increase() to calculate the number of events over the lookbehind window specified in square brackets. For example,
increase(http_requests_total[1h])
calculates the number of HTTP requests over the last hour. So, for your case, the following query should return the share of failed requests over the last hour: