I have created an alert policy in GCP Monitoring that notifies me when a certain kind of log message stops appearing (a dead man's switch). I have created a logs-based metric with a label, "client", which I use to group the metric and get a timeseries per client. I have been using "absence of data" as the trigger for the alert. This has all been working well, until...
After a recent change, the logs now also come from different resources, so there is a need to combine the metric across those resources. I can achieve this using MQL:
{ fetch gce_instance::logging.googleapis.com/user/ping
| group_by [metric.client], sum(val())
| every 30m
; fetch global::logging.googleapis.com/user/ping
| group_by [metric.client], sum(val())
| every 30m }
| union
Notice that I need to align the two series with the same bucket size (30m) to be able to join them, which makes sense. I notice that the value for a timeseries is "undefined" in those buckets where the metric data was absent (by downloading a CSV of the query).
To create an alert using this query, I tried something like this:
{ fetch gce_instance::logging.googleapis.com/user/ping
| group_by [metric.client], sum(val())
| every 30m
; fetch global::logging.googleapis.com/user/ping
| group_by [metric.client], sum(val())
| every 30m }
| union
| absent_for 1h
If I look at the CSV output for this query it doesn't reflect the absence of metric data for a timeseries, and this is presumably because a value of "undefined" doesn't qualify as absent data.
Is there a way to detect the absence of data for a metric that has been "unioned" (and therefore aligned) across multiple resources?
Update 1
I have tried this, which seems to get me some of the way there. I'd really appreciate comments on this approach.
{
fetch gce_instance::logging.googleapis.com/user/ping
| group_by [metric.client], sum(val())
;
fetch global::logging.googleapis.com/user/ping
| group_by [metric.client], sum(val())
}
| union
| absent_for 1h
I have settled on a solution as follows,
Note:
group_by [metric.client] conforms the tables from the different resources, which allows the union operation to work.
absent_for does align its input timeseries, using the default period or one specified by a following every operation.
I found it really hard to debug these MQL queries, in particular to confirm that absent_for was going to trigger an alert. I realised that I could add value [active] to the query to show a plot of the active column (which absent_for produces), and that gave me confidence that my alert was actually going to work.
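As a rough sketch of that debugging step, assuming the Update 1 query is the starting point, appending value [active] might look like the following. The exact shape is an assumption for illustration, not necessarily the query that was finally used, and it is meant for charting only (the alert condition itself would end at absent_for).

```
{
  fetch gce_instance::logging.googleapis.com/user/ping
  | group_by [metric.client], sum(val())
;
  fetch global::logging.googleapis.com/user/ping
  | group_by [metric.client], sum(val())
}
| union
| absent_for 1h
# Debugging only (assumed placement): keep just the boolean active column
# that absent_for produces, so it can be plotted per client.
| value [active]
```

If the plotted active column turns true for a client roughly an hour after its pings stop, that suggests the alert condition would fire for that client.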