I'm trying to deploy a simple SLO for my Cloud Run service with Terraform. This is what I have at the moment:
    resource "google_monitoring_slo" "request_based_slo" {
      for_each            = toset(var.services_w_slos)
      service             = each.value
      slo_id              = format("requests-slo-%s", each.value)
      display_name        = format("Request-based SLO for %s", each.value)
      goal                = 0.999
      rolling_period_days = 28
      request_based_sli {
        good_total_ratio {
          good_service_filter = join(" AND ", [
            "metric.label.\"response_code_class\"=\"2xx\"",
            "metric.type=\"run.googleapis.com/request_count\"",
            "resource.type=\"cloud_run_revision\"",
            format("resource.label.\"project_id\"=\"%s\"", var.project_id),
          ])
          total_service_filter = join(" AND ", [
            "metric.type=\"run.googleapis.com/request_count\"",
            "resource.type=\"cloud_run_revision\"",
            format("resource.label.\"project_id\"=\"%s\"", var.project_id),
          ])
        }
      }
    }
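For completeness, the two input variables referenced in the resource would be declared roughly as follows. This is only a minimal sketch: the types and descriptions are assumptions inferred from how the values are used above (`toset(...)` and `format(...)`), not copied from the actual configuration.

```hcl
# Assumed declarations for the variables used by the SLO resource.
# Types and descriptions are guesses based on usage, not the real config.

variable "project_id" {
  description = "ID of the Google Cloud project that hosts the Cloud Run services"
  type        = string
}

variable "services_w_slos" {
  description = "IDs of the monitoring services that should each get a request-based SLO"
  type        = list(string) # passed through toset() before for_each
}
```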
This SLO is really simple: over a rolling period of 28 days, take the ratio of good requests (2xx status codes) to the total number of requests. Nothing more, nothing less. It does deploy, but it doesn't behave quite as I expected:
I sent a few good requests. The SLI was at 100%. Then I sent a bad request (on purpose). The SLI dropped to 80%. I then sent a couple of good requests. The SLI jumped back to 100%.
It looks like the SLI measures the "instantaneous" reliability, without taking the previous requests into account. I suspect the metrics I used aren't really time series.
Could you give me a hand, please?