Alert on Absent Data for Combined Metric in GCP Monitoring


I have created an alert policy in GCP Monitoring which notifies me when a certain kind of log message stops appearing (a dead man's switch). I have created a logs-based metric with a label, "client", which I use to group the metric and get a timeseries per client. I have been using "absence of data" as the trigger for the alert. This has all been working well, until...

After a recent change, the logs now also come from different resources, so there is a need to combine the metric across those resources. I can achieve this using MQL:

{ fetch gce_instance::logging.googleapis.com/user/ping
  | group_by [metric.client], sum(val())
  | every 30m
; fetch global::logging.googleapis.com/user/ping
  | group_by [metric.client], sum(val())
  | every 30m }
| union

Note that I need to align the two series to the same bucket size (30m) to be able to combine them, which makes sense. By downloading a CSV of the query results, I noticed that the value for a timeseries is "undefined" in those buckets where the metric data was absent.

To create an alert using this query, I tried something like this:

{ fetch gce_instance::logging.googleapis.com/user/ping
  | group_by [metric.client], sum(val())
  | every 30m
; fetch global::logging.googleapis.com/user/ping
  | group_by [metric.client], sum(val())
  | every 30m }
| union
| absent_for 1h

If I look at the CSV output for this query, it doesn't reflect the absence of metric data for a timeseries, presumably because a value of "undefined" doesn't qualify as absent data.

Is there a way to detect the absence of data for a "unioned" (and therefore aligned) metric across multiple resources?


Update 1

I have tried this, which seems to get me some of the way there. I'd really appreciate comments on this approach.

{
  fetch gce_instance::logging.googleapis.com/user/ping
  | group_by [metric.client], sum(val())
  ;
  fetch global::logging.googleapis.com/user/ping
  | group_by [metric.client], sum(val())
}
| union
| absent_for 1h

There is 1 solution below


I have settled on the following solution:

{
  fetch gce_instance::logging.googleapis.com/user/ping
  | group_by [metric.client]
  ;
  fetch global::logging.googleapis.com/user/ping
  | group_by [metric.client]
}
| union
| absent_for 1h
| every 30m

Note:

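  • group_by [metric.client] reduces the tables from the different resources to the same set of label columns (just metric.client), which allows the union to work
  • absent_for aligns its input timeseries itself, using either the default period or the one specified by a following every, so the inputs do not need their own every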
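
To make the intent of the union followed by absent_for concrete, here is a toy model in Python. It is only a sketch of the dead man's switch logic and not how Cloud Monitoring evaluates MQL: the function name absent_clients, the client names, and the timestamps are all made up for illustration. A client counts as absent only when no resource has reported a point for it within the window.

```python
from datetime import datetime, timedelta

def absent_clients(points_by_resource, now, window=timedelta(hours=1)):
    """Return clients whose newest point, across all resources, is older than `window`.

    points_by_resource maps a resource type name to a list of
    (client, timestamp) pairs. This is hypothetical sample data, not an API.
    """
    last_seen = {}
    # "Union": merge the per-resource points into one view keyed by client.
    for points in points_by_resource.values():
        for client, ts in points:
            if client not in last_seen or ts > last_seen[client]:
                last_seen[client] = ts
    # "absent_for": flag clients with nothing inside the lookback window.
    return sorted(c for c, ts in last_seen.items() if now - ts > window)

now = datetime(2024, 1, 1, 12, 0)
points = {
    "gce_instance": [
        ("acme", now - timedelta(minutes=10)),
        ("globex", now - timedelta(hours=3)),
    ],
    "global": [
        ("globex", now - timedelta(minutes=90)),
        ("initech", now - timedelta(minutes=20)),
    ],
}

print(absent_clients(points, now))  # prints ['globex']
```

In this sample, globex appears under both resources but its newest point is 90 minutes old, so it is the only client flagged with a one-hour window.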

I found it really hard to debug these MQL queries, in particular to confirm that absent_for was going to trigger an alert. I realised that I could use value [active] to show a plot of the active column (which absent_for produces) and that gave me confidence that my alert was actually going to work.

{
  fetch gce_instance::logging.googleapis.com/user/ping
  | group_by [metric.client]
  ;
  fetch global::logging.googleapis.com/user/ping
  | group_by [metric.client]
}
| union
| absent_for 1h
| value [active]