This post is also available in the following languages. Korean

Who murdered my lovely Prometheus container in Kubernetes cluster?

littledriver2020-06-01

The most courageous infrastructure engineer on the Verda Team. Active contributor in open source projects.

This blog post is about an interesting experience I had while investigating and resolving a problem that happened in the Prometheus container which was still in the CrashLoopBackoff state. The phenomenon itself and the solution are obvious and simple; so simple that you may feel that it is not worth it to spend time on this issue. However, I think there must be a lot of people like me who want to understand the details of each step in the investigation process.

After reading this post, you will:

Learn new knowledge on Prometheus, Kubernetes, container-runtime, and OS kernels.
Learn how we troubleshoot as infrastructure engineers on the Verda team.
Learn methodology that can help you in your work. Technology will be outdated, but methodology will not.

Crime scene - What’s the current status of Prometheus?

When I came into the office on Monday, I wanted to open the Prometheus dashboard to check if there are any alerts that were triggered in our Kubernetes cluster. Unfortunately, I couldn’t access the Prometheus dashboard as the server seemed to be down. Upon realizing that, I checked three separate places in Prometheus, starting with the container state:

> kubectl -n Prometheus get pods -o wide | grep Prometheus
cluster-monitoring-Prometheus-d56bb6d46-tzmcj     1/2     CrashLoopBackOff   950      3d20h   <IP>  worker3


> kubectl -n Prometheus describe pods cluster-monitoring-Prometheus-d56bb6d46-tzmcj
...
Containers:
  Prometheus-configmap-reload:
    Image:         jimmidyson/configmap-reload:v0.1
    Args:
      --volume-dir=/etc/config
      --webhook-url=http://localhost:9090/-/reload
    State:          Running
      Started:      Wed, 01 Apr 2020 08:15:52 +0000
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     200m
      memory:  100Mi
    Requests:
      cpu:        100m
      memory:     50Mi
  Prometheus:
    Image:         prom/Prometheus:v2.14.0
    Args:
      --config.file=/etc/config/Prometheus.yml
      --storage.tsdb.path=/data
      --web.console.libraries=/etc/Prometheus/console_libraries
      --web.console.templates=/etc/Prometheus/consoles
      --web.enable-lifecycle
      --storage.tsdb.retention.size=25GB
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    137
    Ready:          False
    Restart Count:  20

It seemed the Prometheus server container was in the CrashLoopBackOff state. The CrashLoopBackOff state means that you have a pod starting, crashing, starting again, and then crashing in a continuous loop. The exit code was 137.

Once I was done checking the container state, I moved on to check if there was enough free space on the server. Based on my previous troubleshooting experience, crashes can be caused by not having enough free disk space due to the size of historical metrics that Prometheus scrapes from targets exceeding the available disk space. So I checked storage resource usage according to the PVC name of the Prometheus pod.

>  kubectl -n vks-system get pvc -o wide | grep Prometheus
cluster-monitoring-Prometheus-storage     Bound    pvc-384dacb8-3e75-11ea-a71c-fa163f8730b0   30Gi       RWO            cinder-ssd     71d


# log into the VM host where the Prometheus container was deployed
> df  -H | grep pvc
/dev/vdd         32G   11G   21G  35% /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-384dacb8-3e75-11ea-a71c-fa163f8730b0/globalmount

It seemed like the there was enough disk space for the Prometheus server container.

Lastly, I checked the log as it is the most direct approach to finding out what’s wrong with your service.

> kubectl -n vks-system logs cluster-monitoring-Prometheus-d56bb6d46-tzmcj -f -c Prometheus
level=info ts=2020-04-05T04:53:16.664Z caller=web.go:496 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2020-04-05T04:53:16.664Z caller=main.go:657 msg="Starting TSDB ..."
level=info ts=2020-04-05T04:53:16.793Z caller=head.go:535 component=tsdb msg="replaying WAL, this may take awhile"
level=info ts=2020-04-05T04:53:22.395Z caller=head.go:583 component=tsdb msg="WAL segment loaded" segment=0 maxSegment=12

No obvious error messages existed. The Prometheus server worked fine before it crashed. It was replaying the data from the WAL file to its memory space.

Case investigation

Before starting a deeper investigation, we need to first confirm some basics for this cluster and Prometheus.

Cluster info: This cluster version is 1.13.5. It includes 235 nodes, 7 control plane nodes, 5 etcd nodes, and worker nodes. All nodes are VM nodes.
Host info: The host where the Prometheus server was deployed on has 4 vCPUs, 8 GB of memory, and a 100 GB SSD. All hosts close the swap function.
Prometheus info: The version of the Prometheus image is 2.14.0. It has set the storage.tsdb.retention.size parameter to 25 GB.

Clue 1: CrashLoopBackOff

As the Prometheus container was discovered to be in the CrashLoopBackOff state, let's first figure out what CrashLoopBackOff means.

As I have mentioned before, the CrashLoopBackOff state illustrates that the pod is in a loop of crashing and restarting again. It is related to the restartPolicy field of the Prometheus pod. Generally, there are 3 candidate values of restartPolicy:

Always
OnFailure
Never

If the restartPolicy was assigned the "Always" value, the kubelet will restart the containers whenever the container exited with zero code or non-zero code. The restartPolicy in pod.Spec will adapt to all containers of the pod.

Why the Prometheus pod crashed

Using command kubectl -n prometheus describe pod cluster-monitoring-Prometheus-d56bb6d46-tzmcj, we can see the Prometheus server container exited with error code 137; an unexpected exit situation.

Exit code 137 means the container (process) received the SIGKILL signal and was terminated by the OS kernel. There are three possibilities as to how the SIGKILL signal was produced.

Manually: such as performing the kill command on the host for the container process. (The container is a kind of OS process. Any command for the OS process can be adapted to the container.)
By container runtime: such as performing the docker kill command for the specific container
By OS kernel: If the OS kernel thinks the specific process may affect another process, it will end it to protect others.

The first possibility was immediately ruled out as it was obviously impossible that the Prometheus container was killed manually, since no one but I can log in to that host with the root permissions that can kill it.

Since we didn't perform the kill command for the Prometheus container ourselves, it may have been killed by the container runtime. In the Kubernetes cluster, the kubelet is always responsible for managing containers on the host by calling the interface of container-runtime.

Now, let's detect the kubelet log on the host where the Prometheus server container was deployed on.

...
[prober.go:111]Readiness probe for "cluster-monitoring-Prometheus-d56bb6d46-tzmcj:Prometheus" failed (failure): HTTP probe failed with statuscode: 503

[oom_watcher.go:68] Got sys oom event from cadvisor

[prober.go:111] Liveness probe for "cluster-monitoring-Prometheus-d56bb6d46-tzmcj:Prometheus" failed (failure): Get http://IP:9090/-/healthy: read: connection reset by peer

[kubelet.go:1953] SyncLoop (PLEG): "cluster-monitoring-Prometheus-d56bb6d46-tzmcj", event: u0026pleg.PodLifecycleEvent{ID:"b051e8a5-4dff-11ea-a71c-fa163f8730b0", Type:"ContainerDied", Data:"4da915ce720e18b45d9828116e61988c997a72b2a168b66dbc381b2f0474c86d"}
...

The Prometheus pod didn't pass the Readiness Probe after it was created so the endpoint of Prometheus was removed. The kubelet then got the "An OOM event was triggered" message from cadvisor. Finally, when the Prometheus server container was considered dead, the kubelet restarts it because the restartPolicy was set to "Always."

According to the log, I can assume the Prometheus container was not actively killed by the kubelet. It seems like Prometheus triggered the OOM event. In that case, I suspect it was killed directly by the OS kernel. Before we start to validate our assumptions, we first need to understand what cAdvisor is and how it watches OOM events.

cAdvisor is an open-source agent integrated into the kubelet binary that monitors resource usage and analyzes the performance of containers. It collects statistics about the CPU, memory, file, and network usage for all containers running on a given node (it does not operate at the pod level). In addition to core metrics, it also monitors system events as well.

I read some parts of the kubelet implementation to find that cAdvisor was created as one field called CAdvisorInterface of kubelet.Dependencies. kubelet.Dependencies can be considered as a set of assistant tools for the kubelet.

// Detailed Implementation: https://sourcegraph.com/github.com/kubernetes/kubernetes@v1.13.5/-/blob/cmd/kubelet/app/server.go#L648
if kubeDeps.CAdvisorInterface == nil {
    imageFsInfoProvider := cadvisor.NewImageFsInfoProvider(s.ContainerRuntime, s.RemoteRuntimeEndpoint)
    kubeDeps.CAdvisorInterface, err = cadvisor.New(imageFsInfoProvider, s.RootDirectory, cgroupRoots, cadvisor.UsingLegacyCadvisorStats(s.ContainerRuntime, s.RemoteRuntimeEndpoint))
    if err != nil {
      return err
    }
}

When the kubelet started, it performed the Start() method of cAdvisor. In the implementation of cAdvisor.Start(), it performed the same Start() method of the object called manager whose type is Manager. The manager object was constructed when cAdvisor was initialized.

func (kl *kubelet) initializeRuntimeDependentModules() {
  if err := kl.cadvisor.Start(); err != nil {
    // Fail kubelet and rely on the babysitter to retry starting kubelet.
    // TODO(random-liu): Add backoff logic in the babysitter
    klog.Fatalf("Failed to start cAdvisor %v", err)
  }
  ...
}

func (cc *cadvisorClient) Start() error {
  return cc.Manager.Start()
}

// New creates a cAdvisor and exports its API on the specified port if port > 0.
func New(imageFsInfoProvider ImageFsInfoProvider, rootPath string, usingLegacyStats bool) (Interface, error) {
...
  // Create and start the cAdvisor container manager.
  m, err := manager.New(memory.New(statsCacheDuration, nil), sysFs, maxHousekeepingInterval, allowDynamicHousekeeping, includedMetrics, http.DefaultClient, rawContainerCgroupPathPrefixWhiteList)
...
}

type manager struct {
  ...
  cadvisorContainer        string
  eventHandler             events.EventManager
  containerWatchers        []watcher.ContainerWatcher
  eventsChannel            chan watcher.ContainerEvent
  ...
}

// Start the container manager.
func (self *manager) Start() error {
  ...
  // Watch for OOMs.
  err = self.watchForNewOoms()
  if err != nil {
    klog.Warningf("Could not configure a source for OOM detection, disabling OOM events: %v", err)
  }
  ...
}

func (self *manager) watchForNewOoms() error {
  klog.V(2).Infof("Started watching for new ooms in manager")
  outStream := make(chan *oomparser.OomInstance, 10)
  oomLog, err := oomparser.New()
  if err != nil {
    return err
  }
  go oomLog.StreamOoms(outStream)

  go func() {
    for oomInstance := range outStream {
      // Surface OOM and OOM kill events.
      newEvent := &info.Event{
        ContainerName: oomInstance.ContainerName,
        Timestamp:     oomInstance.TimeOfDeath,
        EventType:     info.EventOom,
      }
      err := self.eventHandler.AddEvent(newEvent)
      if err != nil {
        klog.Errorf("failed to add OOM event for %q: %v", oomInstance.ContainerName, err)
      }
      klog.V(3).Infof("Created an OOM event in container %q at %v", oomInstance.ContainerName, oomInstance.TimeOfDeath)

      newEvent = &info.Event{
        ContainerName: oomInstance.VictimContainerName,
        Timestamp:     oomInstance.TimeOfDeath,
        EventType:     info.EventOomKill,
        EventData: info.EventData{
          OomKill: &info.OomKillEventData{
            Pid:         oomInstance.Pid,
            ProcessName: oomInstance.ProcessName,
          },
        },
      }
      err = self.eventHandler.AddEvent(newEvent)
      if err != nil {
        klog.Errorf("failed to add OOM kill event for %q: %v", oomInstance.ContainerName, err)
      }
    }
  }()
  return nil
}

You can see that when the method watchForNewOoms was called in manager.Start(), because it created an object called oomparser through the line: oomLog, err := oomparser.New(). Actually, the oomparser wraps an important object called kmsgparser.

Now, the question is "what is kmsg?" It is related to the logging mechanism of Linux OS.

Generally, the Linux kernel outputs logs to the memory buffer. User applications can access it in two ways:

Accessing /proc/kmsg: Due to the working mechanisms of /proc/kmsg is more like pseudo-FIFO, the logs in /proc/kmsg cannot be read by multiple clients because it is read-once.
Accessing /dev/kmsg: This is a read-write interface that is shareable among multiple client processes. If multiple processes share it, they all read the same complete data stream, unaffected by one another. If they open it for write access, they can also inject messages into the kernel's log stream, as if they were generated by the kernel.

func NewParser() (Parser, error) {
    f, err := os.Open("/dev/kmsg")
    if err != nil {
        return nil, err
    }
...
}

func (p *parser) Parse() <-chan Message {
 
    output := make(chan Message, 1)
 
    go func() {
        defer close(output)
        msg := make([]byte, 8192)
        for {
            // Each read call gives us one full message.
            // https://www.kernel.org/doc/Documentation/ABI/testing/dev-kmsg
            n, err := p.kmsgReader.Read(msg)
            if err != nil {
                if err == syscall.EPIPE {
                    p.log.Warningf("short read from kmsg; skipping")
                    continue
                }
 
                if err == io.EOF {
                    p.log.Infof("kmsg reader closed, shutting down")
                    return
                }
 
                p.log.Errorf("error reading /dev/kmsg: %v", err)
                return
            }
 
            msgStr := string(msg[:n])
 
            message, err := p.parseMessage(msgStr)
            if err != nil {
                p.log.Warningf("unable to parse kmsg message %q: %v", msgStr, err)
                continue
            }
 
            output <- message
        }
    }()
 
    return output
}

After the kmsgparser object is created, the cAdvisor.manager performs the oomLog.StreamOoms(outStream) method. The outStream is a channel object. Once the kmsgparser gets the system log, it sends them to the outStream.

// StreamOoms writes to a provided a stream of OomInstance objects representing
// OOM events that are found in the logs.
// It will block and should be called from a goroutine.
func (self *OomParser) StreamOoms(outStream chan<- *OomInstance) {
  kmsgEntries := self.parser.Parse()
  defer self.parser.Close()

  for msg := range kmsgEntries {
    in_oom_kernel_log := checkIfStartOfOomMessages(msg.Message)
    if in_oom_kernel_log {
      oomCurrentInstance := &OomInstance{
        ContainerName: "/",
        TimeOfDeath:   msg.Timestamp,
      }
      for msg := range kmsgEntries {
        err := getContainerName(msg.Message, oomCurrentInstance)
        if err != nil {
          klog.Errorf("%v", err)
        }
        finished, err := getProcessNamePid(msg.Message, oomCurrentInstance)
        if err != nil {
          klog.Errorf("%v", err)
        }
        if finished {
          oomCurrentInstance.TimeOfDeath = msg.Timestamp
          break
        }
      }
      outStream <- oomCurrentInstance
    }
  }
  // Should not happen
  klog.Errorf("exiting analyzeLines. OOM events will not be reported.")
}

While reading the OOM logs from the outStream channel, watchForNewOoms generates two events

OOMEvent
OOMKillEvent

Both of them were added by eventHandler through calling the AddEvent method. The eventHandler was created by the function called NewEventManager.

// returns a pointer to an initialized Events object.
func NewEventManager(storagePolicy StoragePolicy) *events {
  return &events{
    eventStore:    make(map[info.EventType]*utils.TimedStore, 0),
    watchers:      make(map[int]*watch),
    storagePolicy: storagePolicy,
  }
}


// method of Events object that adds the argument Event object to the
// eventStore. It also feeds the event to a set of watch channels
// held by the manager if it satisfies the request keys of the channels
func (self *events) AddEvent(e *info.Event) error {
    self.updateEventStore(e)
    self.watcherLock.RLock()
    defer self.watcherLock.RUnlock()
    watchesToSend := self.findValidWatchers(e)
    for _, watchObject := range watchesToSend {
        watchObject.eventChannel.GetChannel() <- e
    }
    klog.V(4).Infof("Added event %v", e)
    return nil
}
 
 
func (self *events) findValidWatchers(e *info.Event) []*watch {
    watchesToSend := make([]*watch, 0)
    for _, watcher := range self.watchers {
        watchRequest := watcher.request
        if checkIfEventSatisfiesRequest(watchRequest, e) {
            watchesToSend = append(watchesToSend, watcher)
        }
    }
    return watchesToSend
}

The AddEvent function performs two actions:

Store the event to the eventStore map
Find the available watcher which watches this kind of event and sends the event to its event channel

In the findValidWatchers function, we can see it mainly checks if the type of event is the same with what the watcher is watching. If valid watchers are found, the event is sent into the watcher's channel.

According to the definition of watcher below, we can see that every watcher needs to initialize itself by calling the WatchEvent function. The watcher is registered into the watcher map called watchers of eventHandler.

// initialized by a call to WatchEvents(), a watch struct will then be added
// to the events slice of *watch objects. When AddEvent() finds an event that
// satisfies the request parameter of a watch object in events.watchers,
// it will send that event out over the watch object's channel. The caller that
// called WatchEvents will receive the event over the channel provided to
// WatchEvents
type watch struct {
    // request parameters passed in by the caller of WatchEvents()
    request *Request
    // a channel used to send event back to the caller.
    eventChannel *EventChannel
}

We have covered half of the process of how OOM events are watched; from the system log to the event buffer which store all events. Now let's move to the next part: How cAdvisor registered a watcher object to the watcher map. The kubelet log provides a hint yet again.

[oom_watcher.go:68] Got sys oom event from cadvisor

It seems like cAdvisor created a watcher instance called oom_watcher which was implemented in the oom_watcher.go file.

// NewOOMWatcher creates and initializes a OOMWatcher based on parameters.
func NewOOMWatcher(cadvisor cadvisor.Interface, recorder record.EventRecorder) OOMWatcher {
  return &realOOMWatcher{
    cadvisor: cadvisor,
    recorder: recorder,
  }
}

const systemOOMEvent = "SystemOOM"

// Watches cadvisor for system oom's and records an event for every system oom encountered.
func (ow *realOOMWatcher) Start(ref *v1.ObjectReference) error {
  request := events.Request{
    EventType: map[cadvisorapi.EventType]bool{
      cadvisorapi.EventOom: true,
    },
    ContainerName:        "/",
    IncludeSubcontainers: false,
  }
  eventChannel, err := ow.cadvisor.WatchEvents(&request)
  if err != nil {
    return err
  }

  go func() {
    defer runtime.HandleCrash()

    for event := range eventChannel.GetChannel() {
      klog.V(2).Infof("Got sys oom event from cadvisor: %v", event)
      ow.recorder.PastEventf(ref, metav1.Time{Time: event.Timestamp}, v1.EventTypeWarning, systemOOMEvent, "System OOM encountered")
    }
    klog.Errorf("Unexpectedly stopped receiving OOM notifications from cAdvisor")
  }()
  return nil
}

We can see the oomWatcher registering itself to the eventHandler and reading the OOM event from the event channel. Actually, the oomWatcher was initialized when the kubelet started and oomWatcher.Start() method was also called in that time.

The whole process of producing and watching OOM events can be seen in the graph below.

Now that we know how OOM events are produced and watched in the user space, there are three key points we need to keep in mind before continuing our investigation.

The OOM event was produced by the OS. It means the OS kernel detected that the memory resource used by the Prometheus process (container) exceeds limitations
cadvisor watches OOM events from the /dev/kmsg drive and put it into the channel of oom_watcher
oomWatcher watches OOM events from the event channel and records OOM events

By now, you must know the Prometheus container wasn't killed by the kubelet. It was killed by the OS kernel directly due to an OOM event. We can confirm this situation through the system log.

Why Prometheus consumes a lot of memory resources

Although we now know the OS kernel killed Prometheus due to an OOM event, we don't know how much memory Prometheus actually used. How much is too much? I took a look at the memory resource usage of the Prometheus container to find some answers.

I enabled the Grafana dashboard of Prometheus-self and detected the number of time series increases dramatically up to 2,162,256 series only after five minutes of starting Prometheus. Of course, Prometheus was killed almost immediately. I also found a calculator which can help us calculate the estimated memory quota for Prometheus based on the number of series.

Although 5.7 GB is the estimated value, the 8 GB on our host does not provide enough overhead. And 5.7 GB is what is required to run for only 5 minutes.

To validate my assumption, I removed the majority of scrap configs from Prometheus's configMap resource and then Prometheus didn't crash anymore.

Until now, it seems like using too much memory for Prometheus triggers an OOM event and the OS kernel kills Prometheus. However, I still have several doubts about the OOM event itself

How is an OOM event triggered in the kernel space? Is memory usage the only factor affecting it?
Which function of the OS kernel is in charge of handling the OOM event?
How does the whole process from the OS kernel detecting abnormal memory usage of Prometheus up to triggering the OOM event go?

With these questions in mind, I continued to investigate more deeply, so that we can figure out what happened after Prometheus consumed a lot of memory resources on the host.

Clue 2: OOM Event

The best method of finding the root cause of a problem is reproducing it and watching what actually happened. So I wanted to perform the docker restart <Prometheus_container_name> command to restart the Prometheus container.

Before that, let's keep the current memory resource usage in mind.

After the Prometheus container restarted, I found it almost consumed all memory resources on the host. It triggered an OOM event, causing the OS kernel to kill Prometheus.

An OOM event occurs when a process uses a large amount of memory, and the system's memory becomes insufficient, causing the OS forcefully terminate it. This specific case is judged as an OOM event. However, OS kernels often over-allocate memory to processes. For example, if the capacity of the physical memory only is 5 GB, the OS kernel might allocate 5.5 GB of memory to processes. Why? Because the requested memory isn't always equal to the memory actually used. For example, process A may request 1 GB of memory, but it only uses 500 MB memory in its lifetime. This initial request is usually a large request that the process will not immediately or indeed ever use at all. To take advantage of memory resources, OS kernels often leave this kind of overhead.

In a normal situation, everything works fine. But if some processes consume too much memory and there is not enough memory to support it. The OS has to handle it immediately, otherwise, other processes are affected.

The solution employed by OS kernels is to invoke the specific function called "OOM Killer" or "Out Of Memory Killer." The OOM killer usually reviews all running processes in the OS. Its two main tasks are

Reviewing all running processes and calculating the OOM score for every process according to their memory usage
Picking up the processes whose score is the highest and killing it if the OS requires more memory

How OOM Killer works

You can detect the info related to the oom_score under the /proc/<pid> directory for every process

oom_adj (oom_adjust)
oom_score_adj
oom_score

As I assumed, oom_score is enough for oom_killer. Then why does the system need oom_adj and oom_score_adj? Let's use man proc to check.

/proc/[pid]/oom_adj (since Linux 2.6.11): It can be used to adjust the score used to select which process should be killed in an OOM situation. Valid values are in the range of -16 to +15, plus the special value -17, which disables OOM-killing altogether for this process. A positive score increases the likelihood of this process being killed by the OOM-killer; a negative score decreases the likelihood. Since Linux 2.6.36, use of this file is deprecated in favor of /proc/[pid]/oom_score_adj.
/proc/[pid]/oom_score (since Linux 2.6.11): It displays the current score that the kernel gives to this process for the purpose of selecting a process for the OOM-killer. A higher score means that the process is more likely to be selected by the OOM-killer.
/proc/[pid]/oom_score_adj (since Linux 2.6.36): It can be used to adjust the badness heuristic used to select which process gets killed in OOM conditions. The value of oom_score_adj is added to the badness score before it is used to determine which task to kill. Acceptable values range from -1000 (oom_score_adj_MIN) to +1000 (oom_score_adj_MAX). This allows userspace to control the preference for OOM-killing. For backward compatibility with previous kernels, /proc/[pid]/oom_adj can still be used to tune the badness score. Its value is scaled linearly with oom_score_adj. Writing to /proc/[pid]/oom_score_adj or /proc/[pid]/oom_adj will change the other with its scaled value.

Based on the explanation above, we now know the oom_score is a unique piece of evidence that OOM killer relies on. We can use oom_adj or oom_score_adj to adjust it.

Because the kernel version we are using is kernel-3.10.0-957.el7, I found the kernel source code in the matched version from Linux's repository.

static int proc_oom_score(struct task_struct *task, char *buffer)
{
  unsigned long totalpages = totalram_pages + total_swap_pages;
  unsigned long points = 0;

  read_lock(&tasklist_lock);
  if (pid_alive(task))
    points = oom_badness(task, NULL, NULL, totalpages) *
            1000 / totalpages;
  read_unlock(&tasklist_lock);
  return sprintf(buffer, "%lu
", points);
}

struct limit_names {
  char *name;
  char *unit;
};

unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
        const nodemask_t *nodemask, unsigned long totalpages)
{
  long points;
  long adj;

  if (oom_unkillable_task(p, memcg, nodemask))
    return 0;

  p = find_lock_task_mm(p);
  if (!p)
    return 0;

  adj = (long)p->signal->oom_score_adj;
  if (adj == oom_score_adj_MIN) {
    task_unlock(p);
    return 0;
  }

  /*
   * The baseline for the badness score is the proportion of RAM that each
   * task's rss, pagetable and swap space use.
   */
  points = get_mm_rss(p->mm) + p->mm->nr_ptes +
     get_mm_counter(p->mm, MM_SWAPENTS);
  task_unlock(p);

  /*
   * Root processes get 3% bonus, just like the __vm_enough_memory()
   * implementation used by LSMs.
   */
  if (has_capability_noaudit(p, CAP_SYS_ADMIN))
    adj -= 30;

  /* Normalize to oom_score_adj units */
  adj *= totalpages / 1000;
  points += adj;

  /*
   * Never return 0 for an eligible task regardless of the root bonus and
   * oom_score_adj (oom_score_adj can't be oom_score_adj_MIN here).
   */
  return points > 0 ? points : 1;
}

The calculating logic can be seen in the graph below:

There are several key points to learn from the calculating process:

OS kernels only consider memory usage and oom_score_adj. oom_adj wasn't used in this version of the kernel, but it still has the value. I checked the oom_score_adj of Prometheus and its value was 999.

The next question is: Who can modify oom_score_adj?

Who can modify oom_score_adj

We discovered that the value of oom_score depends on the memory usage amount and oom_score_adj. The actual memory usage amount is related to the specific process. We already know Prometheus used too much memory, but I still wanted to confirm if there is another service that can cause an OOM event on Prometheus by adjusting oom_score_adj except for the OS kernel.

There are two opportunities to configure oom_score_adj for Prometheus:

Preparation Phase: The kubelet prepares necessary configurations and parameters for the Prometheus container process. It can set an initial value for oom_score_adj, or the kernel gives each process a default high value.
Running Phase: After the Prometheus container process starts to run, some services or users can set the value to /proc/<proemtheus_pid>/oom_score_adj.

Based on the explanation above, if I want to detect if there is any service that can modify oom_score_adj while Prometheus runs, I need to watch the changing status of the file /proc/<proemtheus_pid>/oom_score_adj. In that case, I want to utilize SystemTap: SystemTap is a kind of tool that can help us detect the behavior of the OS kernel. We can write a SystemTap script to watch the /proc/<pid>/oom_score_adj file.

I made a little switch to solve this problem. Because the content change of any file must be done by a kind of modification system call. So I implemented a SystemTap script to watch the "Write" system call on the file /proc/<proemtheus_pid>/oom_score_adj.

probe syscall.write {
        if (isinstr(filename, "oom_score_adj")) {
                printf("%s(%d) %s modify  %s
", execname(), pid(), pexecname(),filename)
        }
}

probe syscall.open {
        if (isinstr(filename, "oom_score_adj")) {
                printf("%s(%d) %s open  %s
", execname(), pid(), pexecname(),filename)
        }
}

After I ran this script and the Prometheus container at the same time, there wasn't any process or service changing oom_score_adj until Prometheus was killed due to OOM.

I made a little experiment to test if the OS kernel gives a high oom_score_adj value to the process by default. I implemented a Golang demo that calls time.Sleep (3600 * time.Seconds). After it runs, I checked that its oom_score_adj value was 0. It means if I don't specify an explicit value to the oom_score_adj of a process, the OS kernel doesn't modify it.

For the kubelet, I thought the configuration should happen in the process of creating the container process. Let's recall the process first.

After we created the workload resource object by sending the request to kube-apiserver, the workload creates pods instead. kube-scheduler schedules those pods to available nodes. When the kubelet watched the pod whose nodeName field value is the same with the node where the kubelet is located, it creates containers based on the specs of the pod.

Creating or starting containers on the host is related to container-runtime. In our environment, we use Docker as the container engine. But the container-runtime Docker used is containerd. Containerd relies on the runc project.

We can ignore CRI and runc temporarily as CRI is a kind of abstract layer that just decouples Kubernetes and container-runtime, while runc just provides isolation for containers.

There are only two components which have the chance to configure oom_score_adj:

kubelet
containerd

As I mentioned before, containerd will create the container and run it through calling the interface of runc. So I suspected there is a default value of oom_score_adj for every container if we don't specify the explicit value. Let's keep this assumption in mind and take a closer look at the source code of the NewContainer function.

// NewContainer will create a new container in container with the provided id
// the id must be unique within the namespace
func (c *Client) NewContainer(ctx context.Context, id string, opts ...NewContainerOpts) (Container, error) {
    ctx, done, err := c.WithLease(ctx)
    if err != nil {
        return nil, err
    }
    defer done(ctx)
 
    container := containers.Container{
        ID: id,
        Runtime: containers.RuntimeInfo{
            Name: c.runtime,
        },
    }
    for _, o := range opts {
        if err := o(ctx, c, &container); err != nil {
            return nil, err
        }
    }
    r, err := c.ContainerService().Create(ctx, container)
    if err != nil {
        return nil, err
    }
    return containerFromRecord(c, r), nil
}

This function prepares all configurations for the container and stores it. The next step is to run this container which means the oom_score_adj parameter should also be set in here. The opts variable gives me some hints, especially about its many functions. The NewContainer function runs every method inside of opts, which is then passed as a parameter.

So, Who calls this function to create a container? If you notice the graph about the process of starting the container I mentioned, you must know the answer is CRI. However, CRI is just a kind of standard or definition and any container-runtime has a special service that implements all interfaces of CRI for adapting with the kubelet. In containerd, this component is containerd-shim.

containerd-shim is a kind of gRPC server that implements all interfaces of CRI. When the kubelet wants to operate the container, it calls the CRI interface as a gRPC request, the containerd-shim then handles it. So, let's jump to the caller of NewContainer.

In the CreatContainer function of containerd, I found containerd uses the WithSpec() function to create anonymous functions which construct the containerSpec object that includes necessary configurations of the container process. This anonymous function was assigned to opts.

// CreateContainer creates a new container in the given PodSandbox.
func (c *criService) CreateContainer(ctx context.Context, r *runtime.CreateContainerRequest) (_ *runtime.CreateContainerResponse, retErr error) {
  config := r.GetConfig()
  log.G(ctx).Debugf("Container config %+v", config)
  ...
  spec, err := c.containerSpec(id, sandboxID, sandboxPid, sandbox.NetNSPath, config, sandboxConfig,
    &image.ImageSpec.Config, append(mounts, volumeMounts...), ociRuntime)
  if err != nil {
    return nil, errors.Wrapf(err, "failed to generate container %q spec", id)
  }
  ...
  opts = append(opts,
    containerd.WithSpec(spec, specOpts...),
    containerd.WithRuntime(sandboxInfo.Runtime.Name, runtimeOptions),
    containerd.WithContainerLabels(containerLabels),
    containerd.WithContainerExtension(containerMetadataExtension, &meta))
  var cntr containerd.Container
  if cntr, err = c.client.NewContainer(ctx, id, opts...); err != nil {
    return nil, errors.Wrap(err, "failed to create containerd container")
  }
  ...
}

// WithSpec sets the provided spec on the container
func WithSpec(s *oci.Spec, opts ...oci.SpecOpts) NewContainerOpts {
  return func(ctx context.Context, client *Client, c *containers.Container) error {
    if err := oci.ApplyOpts(ctx, client, c, s, opts...); err != nil {
      return err
    }

    var err error
    c.Spec, err = typeurl.MarshalAny(s)
    return err
  }
}

So, what are the contents of the spec object? Let's detect the logic of the containerSpec() function.

func (c *criService) containerSpec(id string, sandboxID string, sandboxPid uint32, netNSPath string,
  config *runtime.ContainerConfig, sandboxConfig *runtime.PodSandboxConfig, imageConfig *imagespec.ImageConfig,
  extraMounts []*runtime.Mount, ociRuntime config.Runtime) (*runtimespec.Spec, error) {
...
  specOpts = append(specOpts,
    customopts.WithOOMScoreAdj(config, c.config.RestrictOOMScoreAdj),
    customopts.WithPodNamespaces(securityContext, sandboxPid),
    customopts.WithSupplementalGroups(supplementalGroups),
    customopts.WithAnnotation(annotations.ContainerType, annotations.ContainerTypeContainer),
  )

Here is a function called WithOOMScoreAdj. I think it implements the logic of assigning value to oom_source_adj for containers.

// WithOOMScoreAdj sets the oom score
func WithOOMScoreAdj(config *runtime.ContainerConfig, restrict bool) oci.SpecOpts {
    return func(ctx context.Context, client oci.Client, c *containers.Container, s *runtimespec.Spec) error {
        if s.Process == nil {
            s.Process = &runtimespec.Process{}
        }
 
        resources := config.GetLinux().GetResources()
        if resources == nil {
            return nil
        }
        adj := int(resources.GetOomScoreAdj())
        if restrict {
            var err error
            adj, err = restrictOOMScoreAdj(adj)
            if err != nil {
                return err
            }
        }
        s.Process.OOMScoreAdj = &adj
        return nil
    }
}

Now, we found the place where the oom_source_adj of the container process was assigned, but we don't know where the value of oom_source_adj comes from. From the WithOOMScoreAdj function , the oom_source_adj comes from a parameter called config. Let's go back to the CreateContainer function to check who passed the value to it.

// CreateContainer creates a new container in the given PodSandbox.
func (c *criService) CreateContainer(ctx context.Context, r *runtime.CreateContainerRequest) (_ *runtime.CreateContainerResponse, retErr error) {
  config := r.GetConfig()
  log.G(ctx).Debugf("Container config %+v", config)
  ...
  spec, err := c.containerSpec(id, sandboxID, sandboxPid, sandbox.NetNSPath, config, sandboxConfig,
    &image.ImageSpec.Config, append(mounts, volumeMounts...), ociRuntime)
  if err != nil {
    return nil, errors.Wrapf(err, "failed to generate container %q spec", id)
  }
 ...

The value of oom_source_adj should come from the CreateContainerRequest.ContainerConfig. So, let's recall who can post the ContainerCreateRequest to the containerd-shim. The answer is the kubelet.

Actually, inside the ContainerConfig field of CreateContainerRequest, we can see that not only the oom_score_adj can be set, but also parameters related to the CGroup can be assigned.

type CreateContainerRequest struct { 
  ...
  // Config of the container.
  Config *ContainerConfig protobuf:"bytes,2,opt,name=config,proto3" json:"config,omitempty"
  ...
}

type ContainerConfig struct {
  ...
  Linux *LinuxContainerConfig protobuf:"bytes,15,opt,name=linux,proto3" json:"linux,omitempty"
  ...
}

type LinuxContainerConfig struct {
  // Resources specification for the container.
  Resources *LinuxContainerResources protobuf:"bytes,1,opt,name=resources,proto3" json:"resources,omitempty"
  ...
}

type LinuxContainerResources struct {
  // CPU CFS (Completely Fair Scheduler) period. Default: 0 (not specified).
  CpuPeriod int64 protobuf:"varint,1,opt,name=cpu_period,json=cpuPeriod,proto3" json:"cpu_period,omitempty"
  // CPU CFS (Completely Fair Scheduler) quota. Default: 0 (not specified).
  CpuQuota int64 protobuf:"varint,2,opt,name=cpu_quota,json=cpuQuota,proto3" json:"cpu_quota,omitempty"
  // CPU shares (relative weight vs. other containers). Default: 0 (not specified).
  CpuShares int64 protobuf:"varint,3,opt,name=cpu_shares,json=cpuShares,proto3" json:"cpu_shares,omitempty"
  // Memory limit in bytes. Default: 0 (not specified).
  MemoryLimitInBytes int64 protobuf:"varint,4,opt,name=memory_limit_in_bytes,json=memoryLimitInBytes,proto3" json:"memory_limit_in_bytes,omitempty"
  // OOMScoreAdj adjusts the oom-killer score. Default: 0 (not specified).
  OomScoreAdj int64 protobuf:"varint,5,opt,name=oom_score_adj,json=oomScoreAdj,proto3" json:"oom_score_adj,omitempty"
  // CpusetCpus constrains the allowed set of logical CPUs. Default: "" (not specified).
  CpusetCpus string protobuf:"bytes,6,opt,name=cpuset_cpus,json=cpusetCpus,proto3" json:"cpuset_cpus,omitempty"
  // CpusetMems constrains the allowed set of memory nodes. Default: "" (not specified).
  CpusetMems string protobuf:"bytes,7,opt,name=cpuset_mems,json=cpusetMems,proto3" json:"cpuset_mems,omitempty"
  // List of HugepageLimits to limit the HugeTLB usage of container per page size. Default: nil (not specified).
  HugepageLimits       []*HugepageLimit protobuf:"bytes,8,rep,name=hugepage_limits,json=hugepageLimits,proto3" json:"hugepage_limits,omitempty"
  XXX_NoUnkeyedLiteral struct{}         json:"-"
  XXX_sizecache        int32            json:"-"
}

Why the oom_score_adj of Prometheus was set to 999

After I confirmed the oom_score_adj value of the container was assigned by the kubelet, I still wanted to know why the kubelet gives a high value of oom_score_adj to the Prometheus container.

Before we dive into the code base of the kubelet, let's memorize the resource management of pods in Kubernetes.

As we know, the resources from the OS perspective can be categorized into two types:

Computing Resources: CPU, MEM
Storage Resources: Disk

But from the perspective of Kubernetes resource are categorized into two new types:

Compressible: Even if the resources used by a pod exceeds the limit, it can use more if there are resources available. Kubernetes currently only supports the CPU as a compressible resource.
Incompressible: If the resources used by pod (not container) exceeds the limit, the container inside of this pod which used the most resources is terminated. Even if the resources used by the pod exceeds the request amount but less than the limit, it may be terminated if some other "important" pods need the resources. Kubernetes currently only supports MEM as an incompressible resource.

Kubernetes categorizes quality of service (QoS) for pods into three types based on the rules above. When Kubernetes creates a pod it assigns one of these QoS classes to the pod:

Guaranteed: Pods are considered top-priority and are guaranteed to not be killed until they exceed their limits. If limits and requests (not equal to 0) are set for all resources across all containers and they are equal, then the pod is classified as Guaranteed.
Burstable: Under system memory pressure, these containers are more likely to be killed once they exceed their requests and no Best-Effort pods exist. At least one container in the pod has a memory or CPU request.
Best-Effort: Pods will be treated as the lowest priority. Containers in those pods are the first to get killed if the system runs out of memory. Those containers can use any amount of free memory in the node though. If requests and limits are not set for all of the resources, across all containers, then the pod is classified as Best-Effort.

You can make QoS as the priority to understand. Guaranteed is the best option where once the node encounters an insufficient situation, those pods are guaranteed to run through killing other pods. Best-Effort seems good as well because it can use any amount of free memory on the host. However, once a sufficient situation happens, it will be killed first.

Now we know that the QoS of a pod may affect the terminate policy especially in the memory, as the CPU is a kind of compressible resource and the OS won't kill the container process when it used too much CPU resources. However, QoS and oom_score_adj are only two parameters the user can specify.

So, is there a closing relationship between QoS and oom_score_adj? Let's delve into the kubelet source code to find the final answer.

Here is a function called GetContainerOOMScoreAdjust in the kubelet code base. It calculates oom_score_adj value to the container based on the QoS of the pod. We can see that the Burstable calculation method is more complicated, and the other two are assigned a default value.

// GetContainerOOMScoreAdjust returns the amount by which the OOM score of all processes in the
// container should be adjusted.
// The OOM score of a process is the percentage of memory it consumes
// multiplied by 10 (barring exceptional cases) + a configurable quantity which is between -1000
// and 1000. Containers with higher OOM scores are killed if the system runs out of memory.
// See https://lwn.net/Articles/391222/ for more information.
func GetContainerOOMScoreAdjust(pod *v1.Pod, container *v1.Container, memoryCapacity int64) int {
    if types.IsCriticalPod(pod) {
        // Critical pods should be the last to get killed.
        return guaranteedOOMScoreAdj
    }
 
    switch v1qos.GetPodQOS(pod) {
    case v1.PodQOSGuaranteed:
        // Guaranteed containers should be the last to get killed.
        return guaranteedOOMScoreAdj
    case v1.PodQOSBestEffort:
        return besteffortOOMScoreAdj
    }
 
    // Burstable containers are a middle tier, between Guaranteed and Best-Effort. Ideally,
    // we want to protect Burstable containers that consume less memory than requested.
    // The formula below is a heuristic. A container requesting for 10% of a system's
    // memory will have an OOM score adjust of 900. If a process in container Y
    // uses over 10% of memory, its OOM score will be 1000. The idea is that containers
    // which use more than their request will have an OOM score of 1000 and will be prime
    // targets for OOM kills.
    // Note that this is a heuristic, it won't work if a container has many small processes.
    memoryRequest := container.Resources.Requests.Memory().Value()
    oomScoreAdjust := 1000 - (1000*memoryRequest)/memoryCapacity
    // A guaranteed pod using 100% of memory can have an OOM score of 10. Ensure
    // that burstable pods have a higher OOM score adjustment.
    if int(oomScoreAdjust) < (1000 + guaranteedOOMScoreAdj) {
        return (1000 + guaranteedOOMScoreAdj)
    }
    // Give burstable pods a higher chance of survival over besteffort pods.
    if int(oomScoreAdjust) == besteffortOOMScoreAdj {
        return int(oomScoreAdjust - 1)
    }
    return int(oomScoreAdjust)
}

I also found the default value at the top of this source file.

const (
    // kubeletOOMScoreAdj is the OOM score adjustment for kubelet
    kubeletOOMScoreAdj int = -999
    // KubeProxyOOMScoreAdj is the OOM score adjustment for kube-proxy
    KubeProxyOOMScoreAdj  int = -999
    guaranteedOOMScoreAdj int = -998
    besteffortOOMScoreAdj int = 1000
)

According to the current implementation, the Guaranteed and Best Effort Pods have separate default values of oom_score_adj: -998 and 1000 respectively. However, the oom_score_adj of the Prometheus container was set to 999, indicating the Prometheus pod may actually belong to Burstable.

I used the deployment workload to deploy our Prometheus on user's cluster. Now, let me take a look at Prometheus's orchestration file.

spec:
  containers:
    imagePullPolicy: IfNotPresent
    name: Prometheus-configmap-reload
    resources:
      limits:
        memory: "100Mi"
        cpu: "200m"
      requests:
        memory: "50Mi"
        cpu: "100m"
        ...
  - args:
  ...
    - --storage.tsdb.retention.size=25GB
    image: prom/Prometheus:v2.14.0
    imagePullPolicy: IfNotPresent
    livenessProbe:
      failureThreshold: 5
      httpGet:
        path: /-/healthy
        port: 9090
        scheme: HTTP
    name: Prometheus
    ...

It seems like I only set a request and limit for the config-reload container. In that case, the QoS of the Prometheus pod will be considered as Burstable, which means oom_score_adj needs to be calculated by the kublet. The final result is 999, which can be confirmed from the container specs.

...
"IpcMode": "container:eceeb982740fcabc2630dc6647568640202318d95eb81206ad696210e2732ce3",
"Cgroup": "",
"Links": null,
"OomScoreAdj": 999,
...

Conclusion

As I mentioned before, the end of this troubleshooting process is obvious and simple. There are two killers that exist in this case:

Wrong resource configuration: Wrong resource requests and limit configurations for Prometheus pods results in a high oom_score_adj value being assigned to the Prometheus server container.
Wrong estimation of memory resources: The memory resources required by Prometheus are not estimated in advance, especially in large-scale clusters.

Below are some possible solutions:

Set correct resource request and limit configurations for Prometheus pods. Although it cannot resolve the current problem completely, it can save time spent troubleshooting if similar problems occur.
Adjust the scrape items or scrape interval to reduce unnecessary metrics.
Estimate the required memory resource of Prometheus and prepare some new hosts and give the specific node label to them, such as deploy-topology: Prometheus. Then add nodeSelector to the Prometheus orchestration file and only allow Prometheus pods that are scheduled to those hosts.

If you read until the end of this post, I think you must be an engineer with a great passion for technology. If you want a deeper understanding of Kubernetes and are interested in solving problems and making a huge system more stable, we are always looking for new engineers to add to the team. I hope we may someday work together!

Blog