I'm Paul Traylor from LINE Fukuoka's Development Department, where I focus on tooling for monitoring the many servers that support LINE family apps. One of my main tasks is maintaining Promgen, a tool for managing Prometheus targets and routing alerts to the correct team.
How Prometheus and AlertManager process alerts
Prometheus uses labels for many things, from writing PromQL to routing alerts, so understanding how labels are used is important to understanding how alerts are routed.
Let's begin by looking at an alert rule for Nginx like the following:
alert: NginxDown
expr: rate(nginx_exporter_scrape_failures_total{}[30s]) > 0
for: 5m
labels:
  severity: major
If we run the query from the expr line above in Prometheus, we would receive a result similar to the following:
{project="Project A, service="Service 1", job="nginx", instance="one.example.com:9113"} 1
{project="Project B, service="Service 2", job="nginx", instance="two.example.com:9113"} 1
When Prometheus fires an alert, it will start sending alert details like the following to the AlertManager.
[
  {
    "labels": {
      "alertname": "NginxDown",
      "project": "Project A",
      "service": "Service 1",
      "job": "nginx",
      "instance": "one.example.com:9113",
      "severity": "major"
    },
    "annotations": {
      "<name>": "<value>"
    },
    "startsAt": "2016-04-21T20:14:37.698Z",
    "endsAt": "2016-04-21T20:15:37.698Z",
    "generatorURL": "<generator_url>"
  },
  {
    "labels": {
      "alertname": "NginxDown",
      "project": "Project A",
      "service": "Service 1",
      "job": "nginx",
      "instance": "two.example.com:9113",
      "severity": "major"
    },
    "annotations": {
      "<name>": "<value>"
    },
    "startsAt": "2016-04-21T20:14:37.698Z",
    "endsAt": "2016-04-21T20:15:37.698Z",
    "generatorURL": "<generator_url>"
  }
]
The AlertManager collects the active alerts and, after deduplicating and batching recent alerts together, uses its webhook notifier to send a message to Promgen to be routed. Since Promgen is primarily concerned with projects and services, our AlertManager configuration groups alerts by those labels, as you can see in the "groupLabels" and "commonLabels" properties in the following alert content.
{
  "receiver": "promgen",
  "status": "firing",
  "alerts": [
    {
      "labels": {
        "project": "Project A",
        "service": "Service 1",
        "job": "nginx",
        "instance": "one.example.com:9113",
        "alertname": "NginxDown",
        "severity": "major"
      },
      "annotations": {
        "<name>": "<value>"
      },
      "startsAt": "2016-04-21T20:14:37.698Z",
      "endsAt": "2016-04-21T20:15:37.698Z",
      "generatorURL": "<generator_url>"
    },
    {
      "labels": {
        "project": "Project A",
        "service": "Service 1",
        "job": "nginx",
        "alertname": "NginxDown",
        "instance": "two.example.com:9113",
        "severity": "major"
      },
      "annotations": {
        "<name>": "<value>"
      },
      "startsAt": "2016-04-21T20:14:37.698Z",
      "endsAt": "2016-04-21T20:15:37.698Z",
      "generatorURL": "<generator_url>"
    }
  ],
  "groupLabels": {
    "project": "Project A",
    "service": "Service 1",
    "job": "nginx",
    "alertname": "NginxDown"
  },
  "commonLabels": {
    "project": "Project A",
    "service": "Service 1",
    "job": "nginx",
    "alertname": "NginxDown"
  },
  "commonAnnotations": {},
  "externalURL": "alertmanager.example.com",
  "version": "3",
  "groupKey": 12345
}
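For reference, the AlertManager side of this setup might look roughly like the following. This is a minimal sketch rather than our production configuration; the webhook URL and the timing values are assumptions for illustration.
route:
  # Group alerts that share these labels into a single webhook notification
  group_by: ['alertname', 'project', 'service', 'job']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 3h
  receiver: promgen

receivers:
  - name: promgen
    webhook_configs:
      # Hypothetical Promgen endpoint; use the URL of your own Promgen instance
      - url: 'http://promgen.example.com/alert'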
Promgen then looks at the "commonLabels" property to see which project and service are associated with the alert. Promgen will also add its own annotations, using "commonLabels" in a query to link projects and services to the objects that exist in Promgen itself.
What happens when Promgen cannot route a message?
When Promgen cannot route a message, the first thing to check is the labels. Often when we write an aggregation query, our labels can change in unexpected ways. Let's have a look at a different example of an alert rule:
alert: ExportersDown
expr: count(up==0) > 5
for: 5m
labels:
  severity: major
annotations:
  summary: '{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes.'
If we run the query from the expr line above, Prometheus will give us a single value as shown below.
{} 123
For those who are more familiar with SQL, we could write our query like this:
SELECT COUNT(*)
FROM table0
WHERE up = 0
When Prometheus runs our query and creates the alert, the alert details sent to the AlertManager look like the following:
[
  {
    "labels": {
      "alertname": "ExportersDown",
      "severity": "major"
    },
    "annotations": {
      "summary": " of job has been down for more than 5 minutes."
    },
    "startsAt": "2016-04-21T20:14:37.698Z",
    "endsAt": "2016-04-21T20:15:37.698Z",
    "generatorURL": "<generator_url>"
  }
]
We can see that the alert we created contains neither a service nor a project label, so the alert sent to Promgen will be missing the information required to route it.
By telling Prometheus how we want to group the results of our query, we can keep the important labels.
expr: count(up==0) by (service, project) > 5
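If we run this version of the query, the result keeps the labels we grouped by, so it might look something like the following (the values here are made up for illustration):
{project="Project A", service="Service 1"} 12
{project="Project B", service="Service 2"} 9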
In a similar way, when we write an SQL query, we want to keep our service and project fields, so we add GROUP BY to our query.
SELECT SERVICE, PROJECT, COUNT(*)
FROM table0
WHERE up = 0
GROUP BY SERVICE, PROJECT
In some cases we cannot change our query, or perhaps we want to route alerts to a different team, so we need to explicitly set our routing targets in the alert rule itself.
alert: ExportersDown
expr: count(up==0) > 5
for: 5m
labels:
  severity: major
  service: Operations # Explicitly set the service we want to route our messages to
annotations:
  summary: '{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes.'
Prometheus will take the labels from our PromQL result, and then update the labels with the ones we have explicitly defined in our rule.
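For example, the alert from the rule above would reach the AlertManager with labels roughly like the following; the service label comes from the rule rather than from the query result (a sketch for illustration, not captured output):
"labels": {
  "alertname": "ExportersDown",
  "service": "Operations",
  "severity": "major"
}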
Why is my alert going to the wrong team?
For this example, we want to write a rule that notifies us when there may be a mistake in instrumentation, alerting on potential cardinality problems (too many unique values for a label).
alert: SuddenChangeInMetricsCount
expr: abs(scrape_samples_scraped - scrape_samples_scraped offset 1d) > 1000
for: 1h
labels:
  severity: warning
While the exact thresholds are a bit naive, we have a different problem. Our intention was to notify the operations team when there is a problem, but if we write our rule in this way, it will notify the developers directly. If we want to notify only the operations team, and save the developers some inbox spam, then we need to overwrite both the service and project labels.
alert: SuddenChangeInMetricsCount
expr: abs(scrape_samples_scraped - scrape_samples_scraped offset 1d) > 1000
for: 1h
labels:
  project: observation
  service: operations
  severity: warning
By overwriting both the project and service labels, we can ensure that the developers will not be notified until the operations team has been able to confirm the problem.
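To illustrate, with these overrides each alert fired by this rule reaches the AlertManager with labels roughly like the following (a sketch; the job and instance values are placeholders carried over from the scraped target), so Promgen routes it to the operations team rather than to the project that owns the metric:
"labels": {
  "alertname": "SuddenChangeInMetricsCount",
  "project": "observation",
  "service": "operations",
  "severity": "warning",
  "job": "nginx",
  "instance": "one.example.com:9113"
}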