How Promgen Routes Notifications

I’m Paul Traylor with LINE Fukuoka’s Development Department, where I focus on tooling to monitor the many servers used to support LINE Family Apps. One of my main tasks is to maintain Promgen, a tool for managing Prometheus targets and routing alerts to the correct team.

How Prometheus and Alertmanager process alerts

Prometheus uses labels for many things, from writing PromQL to routing alerts, so understanding how labels are used is important to understanding how alerts are routed.

An example alert rule for Nginx may look like the following.

alert: NginxDown
expr: rate(nginx_exporter_scrape_failures_total{}[30s]) > 0
for: 5m
labels:
  severity: major

If we run this query in Prometheus, we would receive results similar to the following.

{project="Project A, service="Service 1", job="nginx", instance="one.example.com:9113"} 1
{project="Project B, service="Service 2", job="nginx", instance="two.example.com:9113"} 1

When Prometheus fires an alert, it will start sending messages to Alertmanager.

[
  {
    "labels": {
      "alertname": "NginxDown",
      "project": "Project A",
      "service": "Service 1",
      "job": "nginx",
      "instance": "one.example.com:9113",
      "severity": "major"
    },
    "annotations": {
      "<name>": "<value>"
    },
    "startsAt": "2016-04-21T20:14:37.698Z",
    "endsAt": "2016-04-21T20:15:37.698Z",
    "generatorURL": "<generator_url>"
  },
  {
    "labels": {
      "alertname": "NginxDown",
      "project": "Project A",
      "service": "Service 1",
      "job": "nginx",
      "instance": "two.example.com:9113",
      "severity": "major"
    },
    "annotations": {
      "<name>": "<value>"
    },
    "startsAt": "2016-04-21T20:14:37.698Z",
    "endsAt": "2016-04-21T20:15:37.698Z",
    "generatorURL": "<generator_url>"
  }
]

Alertmanager will collect the active alerts and, after deduplicating alerts and batching recent ones together, it will use its webhook notifier to send a message to Promgen to be routed. Since Promgen is primarily concerned with projects and services, our Alertmanager configuration groups alerts by these labels, as you can see under groupLabels and commonLabels below.

{
  "receiver": "promgen",
  "status": "firing",
  "alerts": [
    {
      "labels": {
        "project": "Project A",
        "service": "Service 1",
        "job": "nginx",
        "instance": "one.example.com:9113",
        "alertname": "NginxDown",
        "severity": "major"
      },
      "annotations": {
        "<name>": "<value>"
      },
      "startsAt": "2016-04-21T20:14:37.698Z",
      "endsAt": "2016-04-21T20:15:37.698Z",
      "generatorURL": "<generator_url>"
    },
    {
      "labels": {
        "project": "Project A",
        "service": "Service 1",
        "job": "nginx",
        "alertname": "NginxDown",
        "instance": "two.example.com:9113",
        "severity": "major"
      },
      "annotations": {
        "<name>": "<value>"
      },
      "startsAt": "2016-04-21T20:14:37.698Z",
      "endsAt": "2016-04-21T20:15:37.698Z",
      "generatorURL": "<generator_url>"
    }
  ],
  "groupLabels": {
    "project": "Project A",
    "service": "Service 1",
    "job": "nginx",
    "alertname": "NginxDown",
  },
  "commonLabels": {
    "project": "Project A",
    "service": "Service 1",
    "job": "nginx",
    "alertname": "NginxDown",
  },
  "commonAnnotations": {},
  "externalURL": "alertmanager.example.com",
  "version": "3",
  "groupKey": 12345
}
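
For reference, an Alertmanager configuration that produces this kind of grouping and forwards the grouped alerts to Promgen might look roughly like the following sketch. The group_by labels mirror the groupLabels shown above, while the timing values and the webhook URL are only examples and should match your own Promgen deployment.

route:
  receiver: promgen
  # Group alerts by the labels Promgen uses for routing
  group_by: ['service', 'project', 'job', 'alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 3h
receivers:
  - name: promgen
    webhook_configs:
      # Example URL for Promgen's alert endpoint
      - url: 'http://promgen.example.com/alert'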

Promgen then looks at commonLabels to see which Projects and Services may be associated with this alert. Promgen will also add its own annotations, using commonLabels as a query to link back to the objects that exist in Promgen itself.

What happens when Promgen cannot route a message?

When Promgen cannot route a message, the first thing to check is the labels. Often when we write an aggregation query, our labels can change in unexpected ways.

alert: ExportersDown
expr: count(up==0) > 5
for: 5m
labels:
  severity: major
annotations:
  summary: '{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes.'

If we run this query, then Prometheus will give us a single value

{} 123

For those who are more familiar with SQL, we could write our example like this

SELECT COUNT(*)
  FROM table0
 WHERE up = 0

When Prometheus runs our query and creates the alert, we get something that looks like this.

[
  {
    "labels": {
      "alertname": "ExportersDown",
      "severity": "major"
    },
    "annotations": {
      "summary": " of job  has been down for more than 5 minute."
    },
    "startsAt": "2016-04-21T20:14:37.698Z",
    "endsAt": "2016-04-21T20:15:37.698Z",
    "generatorURL": "<generator_url>"
  }
]

We can see that this alert contains neither a service nor a project label, so once the alert is sent to Promgen, it will be missing the information required to route it.

By telling Prometheus how we want to group our labels, we can fix our query to keep important labels.

count(up==0) by (service, project) > 5

In a similar way, when we write a SQL query, we want to keep our service and project fields, so we add a GROUP BY clause to our query.

SELECT SERVICE, PROJECT, COUNT(*)
  FROM table0
 WHERE up = 0
 GROUP BY SERVICE, PROJECT
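
Putting the grouped query back into the alert rule, a corrected version of our earlier example might look like the following sketch. The summary text here is only illustrative; it now refers to the service and project labels, since instance and job no longer survive the aggregation.

alert: ExportersDown
expr: count(up==0) by (service, project) > 5
for: 5m
labels:
  severity: major
annotations:
  summary: 'More than 5 exporters in {{ $labels.project }} ({{ $labels.service }}) have been down for more than 5 minutes.'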

In some cases we cannot change our query, or perhaps we want to route alerts to a different team, so we need to explicitly set our routing targets in the alert rule itself.

alert: ExportersDown
expr: count(up==0) > 5
for: 5m
labels:
  severity: major
  service: Operations # Explicitly set the service we want to route our messages to
annotations:
  summary: '{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes.'

Prometheus will take the labels from our PromQL result, and then update the labels with the ones we have explicitly defined in our rule.
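
In our example the aggregated result has no labels of its own, so after Prometheus merges in the alert name and the rule's labels, the alert that reaches Promgen would carry roughly the following labels, giving Promgen a service label it can use for routing.

labels:
  alertname: ExportersDown
  service: Operations
  severity: major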

Why is my alert going to the wrong team?

For this example, we want to write a rule that notifies us when there may be a mistake in instrumentation, alerting on potential cardinality problems (too many unique values for a label).

alert: SuddenChangeInMetricsCount
expr: abs(scrape_samples_scraped - scrape_samples_scraped offset 1d) > 1000
for: 1h
labels:
  severity: warning

While the exact thresholds are a bit naive, we have a different problem. Our intention was to notify the operations team when there is a problem, but written this way, the alert keeps the project and service labels from each scrape target, so Promgen will notify the developers of those projects directly. If we want to notify only the operations team (and save the developers some inbox spam), then we need to overwrite both the service and project labels.

alert: SuddenChangeInMetricsCount
expr: abs(scrape_samples_scraped - scrape_samples_scraped offset 1d) > 1000
for: 1h
labels:
  project: observation
  service: operations
  severity: warning

By overwriting both our project and service labels, we can ensure that the developers will not be notified until the operations team has been able to confirm the problem.
