Request-based SLO for a Cloud Run service


I'm trying to deploy a simple SLO for my Cloud Run service (with Terraform). This is what I have at the moment:

resource "google_monitoring_slo" "request_based_slo" {
  for_each = toset(var.services_w_slos)

  service      = each.value
  slo_id       = format("requests-slo-%s", each.value)
  display_name = format("Request-based SLO for %s", each.value)

  goal                = 0.999
  rolling_period_days = 28

  request_based_sli {
    good_total_ratio {
      good_service_filter = join(" AND ", [
        "metric.label.\"response_code_class\"=\"2xx\"",
        "metric.type=\"run.googleapis.com/request_count\"",
        "resource.type=\"cloud_run_revision\"",
        format("resource.label.\"project_id\"=\"%s\"", var.project_id),
      ])
      total_service_filter = join(" AND ", [
        "metric.type=\"run.googleapis.com/request_count\"",
        "resource.type=\"cloud_run_revision\"",
        format("resource.label.\"project_id\"=\"%s\"", var.project_id),
      ])
    }
  }
}

This SLO is really simple: over a rolling period of 28 days, take the ratio of good requests (2xx status codes) to the total number of requests. Nothing more, nothing less. It does deploy, but it doesn't behave the way I expected:

screenshot

I sent a few good requests. The SLI was at 100%. Then I sent a bad request (on purpose). The SLI dropped to 80%. I then sent a couple of good requests. The SLI jumped back to 100%.

It looks like the SLI measures the "instantaneous" reliability, without taking the previous requests into account. I suspect the metrics I used aren't really time series.
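To make my expectation concrete, here is a minimal Python sketch of the ratio I thought a request-based SLI would report. It is not how Cloud Monitoring computes anything, and the request counts are hypothetical: I'm assuming 4 good requests before the bad one (which matches the 80% I saw) and 2 good requests afterwards. The function name `cumulative_sli` is mine.

```python
from fractions import Fraction


def cumulative_sli(outcomes):
    """Return the good/total ratio after each request, counting all requests so far."""
    ratios = []
    good = 0
    for total, ok in enumerate(outcomes, start=1):
        good += ok  # True counts as 1, False as 0
        ratios.append(Fraction(good, total))
    return ratios


# Hypothetical sequence (True = 2xx response): 4 good, 1 bad, 2 good.
outcomes = [True] * 4 + [False] + [True] * 2
ratios = cumulative_sli(outcomes)

print(f"after the bad request: {float(ratios[4]):.1%}")
print(f"after two more good requests: {float(ratios[-1]):.1%}")
```

Under these assumed counts, a ratio taken over the whole window would stay at 6/7 (about 85.7%) after the two extra good requests. What I actually see is the SLI going straight back to 100%.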

Could you give me a hand please?
