How to stop an inhibited alert from still firing due to a race condition


I am currently running into an issue with two alerts, where one will always fire whenever the other is also firing.

First alert (ServerOffline): detects, based on a ping exporter, whether a server has gone offline. This alert is set up to query every 1s and will generally fire within 2-3 minutes of the actual loss of ping:

    sum(ping_loss_ratio >= .7) by (instance, instance_region) >= 2

Second alert (NoVectorData): detects whether metrics have stopped flowing from the servers. This alert tracks a 12h range of data; if data stops flowing for 15 min, the alerting system fires once that condition has held for 10 minutes. In practice it takes roughly 30 min to fire because of the group_wait configured for this alert:

    count by (instance, instance_region)(lag(vector_component_sent_events_total{component_id="vector_sink"}[12h]) > 15m) >= 1
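For reference, this is roughly how the two expressions are wired into my Prometheus alerting rules (the group name and `for` durations here are illustrative, not my exact file):

    groups:
      - name: availability
        rules:
          - alert: ServerOffline
            expr: sum(ping_loss_ratio >= .7) by (instance, instance_region) >= 2
            for: 2m
          - alert: NoVectorData
            expr: count by (instance, instance_region)(lag(vector_component_sent_events_total{component_id="vector_sink"}[12h]) > 15m) >= 1
            for: 10m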

I have a simple inhibition rule set up that works just fine for preventing the NoVectorData alert from being sent while a server is marked offline.

    inhibit_rules:
      - source_matchers: [alertname="ServerOffline"]
        target_matchers: [alertname="NoVectorData"]
        equal: ['instance']

The problem is that if the server was offline for more than 30 min and then comes back up, the ServerOffline alert resolves faster than the NoVectorData alert. NoVectorData then fires, because the inhibition is only active while the source alert is firing. I am struggling to find a proper way to add some sort of buffer to prevent this. Here are my current route settings for these alerts.

    - matchers:
        - alertname="ServerOffline"
      group_by: ['alertname', 'instance_region']
      receiver: 'slack.hook.resolve'
      group_wait: 30s
      repeat_interval: 24h
    - matchers:
        - alertname="NoVectorData"
      group_by: ['alertname', 'instance_region']
      group_wait: 5m
      repeat_interval: 1h
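One direction I have considered is `keep_firing_for` on the source alert (available in Prometheus 2.42+), which keeps ServerOffline in the firing state for a grace period after its expression stops matching, so the inhibition would outlive the recovery. I am not sure this is the right approach, and the 1h value here is just a guess at a safe buffer:

    - alert: ServerOffline
      expr: sum(ping_loss_ratio >= .7) by (instance, instance_region) >= 2
      keep_firing_for: 1h

The downside would be that ServerOffline's own resolved notification is delayed by the same grace period.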

I can provide any further details on the setup if needed.
