If you notice a delay between an event and the first notification, read this blog post: https://pracucci.com/prometheus-understanding-the-delays-on-alerting.html.
{% highlight yaml %}
global:
  scrape_interval: 20s
  evaluation_interval: 20s
  ...
rule_files:
  - 'alerts/*.yml'
scrape_configs:
  ...
{% endhighlight %}
{% highlight yaml %}
groups:
- name: ExampleRedisGroup
  rules:
  - alert: ExampleRedisDown
    expr: redis_up{} == 0
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "Redis instance down"
      description: "Whatever"
{% endhighlight %}
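To check that a rule like the one above fires when expected, a `promtool` unit test can help. The following is a minimal sketch, not a definitive setup: the rule file path `alerts/example-redis.yml`, the test file name, and the `instance` label value are assumptions made for illustration.

{% highlight yaml %}
# redis-test.yml (hypothetical file name), run with: promtool test rules redis-test.yml
rule_files:
  - alerts/example-redis.yml   # assumed location of the rule shown above
evaluation_interval: 20s
tests:
  - interval: 20s
    input_series:
      # Redis reports "down" (0) for the whole test window.
      - series: 'redis_up{instance="redis:6379"}'   # label value is illustrative
        values: '0+0x20'
    alert_rule_test:
      # After 5m the alert has been true for longer than "for: 2m", so it should be firing.
      - eval_time: 5m
        alertname: ExampleRedisDown
        exp_alerts:
          - exp_labels:
              severity: critical
              instance: redis:6379
            exp_annotations:
              summary: "Redis instance down"
              description: "Whatever"
{% endhighlight %}

Setting `eval_time` to a value below `2m` with an empty `exp_alerts` list is a simple way to confirm that the alert is still pending at that point.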
{% highlight yaml %} {% raw %}
route:
  group_wait: 10s
  group_interval: 30s
  repeat_interval: 30m
  receiver: "slack"
  routes:
    - receiver: "slack"
      group_wait: 10s
      match_re:
        severity: critical|warning
      continue: true
    - receiver: "pager"
      group_wait: 10s
      match_re:
        severity: critical
      continue: true
receivers:
  - name: "slack"
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXXXXXXXX/XXXXXXXXX/xxxxxxxxxxxxxxxxxxxxxxxxxxx'
        send_resolved: true
        channel: 'monitoring'
        text: "{{ range .Alerts }}<!channel> {{ .Annotations.summary }}\n{{ .Annotations.description }}\n{{ end }}"
  - name: "pager"
    webhook_configs:
      - url: http://a.b.c.d:8080/send/sms
        send_resolved: true
{% endraw %} {% endhighlight %}
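Newer Alertmanager releases document `match` and `match_re` as deprecated in favor of `matchers`. Assuming your Alertmanager version supports it, the routing part of the configuration above could be sketched as follows; check the configuration reference for your version before adopting it.

{% highlight yaml %}
route:
  group_wait: 10s
  group_interval: 30s
  repeat_interval: 30m
  receiver: "slack"
  routes:
    # Critical and warning alerts go to Slack, then continue to the next route.
    - receiver: "slack"
      group_wait: 10s
      matchers:
        - severity=~"critical|warning"
      continue: true
    # Critical alerts additionally go to the pager.
    - receiver: "pager"
      group_wait: 10s
      matchers:
        - severity="critical"
      continue: true
{% endhighlight %}

The `receivers` section is unchanged by this variant.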
For expensive or frequently run PromQL queries, Prometheus lets you precompute the results with recording rules.
{% highlight yaml %} {% raw %}
groups:
- name: ExampleRecordedGroup
  rules:
  - record: job:rabbitmq_queue_messages_delivered_total:rate:5m
    expr: rate(rabbitmq_queue_messages_delivered_total[5m])
- name: ExampleAlertingGroup
  rules:
  - alert: ExampleRabbitmqLowMessageDelivery
    expr: sum(job:rabbitmq_queue_messages_delivered_total:rate:5m) < 10
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "Low delivery rate in Rabbitmq queues"
{% endraw %} {% endhighlight %}
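Because the alert reads a recorded series, it is worth checking both groups together. The sketch below is one way to do that with `promtool test rules`; the rule file path `alerts/example-rabbitmq.yml` and the `queue` label are assumptions for illustration, and the input is a counter that never increases, so the recorded rate stays at 0.

{% highlight yaml %}
rule_files:
  - alerts/example-rabbitmq.yml   # assumed location of the two groups above
evaluation_interval: 20s
# Evaluate the recording rule before the alert that depends on it.
group_eval_order:
  - ExampleRecordedGroup
  - ExampleAlertingGroup
tests:
  - interval: 20s
    input_series:
      # A flat counter: no messages delivered during the test window.
      - series: 'rabbitmq_queue_messages_delivered_total{queue="example"}'   # label is illustrative
        values: '0+0x30'
    alert_rule_test:
      # By 5m the summed rate has been below 10 for longer than "for: 2m".
      - eval_time: 5m
        alertname: ExampleRabbitmqLowMessageDelivery
        exp_alerts:
          - exp_labels:
              severity: critical
            exp_annotations:
              summary: "Low delivery rate in Rabbitmq queues"
{% endhighlight %}

The `sum()` in the alert expression drops the series labels, which is why only `severity` is expected on the alert.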
If a notification takes too long to be triggered, check the following delays:

- scrape_interval = 20s (prometheus.yml)
- evaluation_interval = 20s (prometheus.yml)
- increase(mysql_global_status_slow_queries[1m]) > 0 (alerts/example-mysql.yml)
- for: 5m (alerts/example-mysql.yml)
- group_wait = 10s (alertmanager.yml)

Also read https://pracucci.com/prometheus-understanding-the-delays-on-alerting.html.