Nick Travers

Reading process memory

Here's another fun tale from the JVM depths of despair.

A couple of containers in a Pod were consistently being killed due to exceeding cgroup limits set by Kubernetes. The reason for the exit was clear from the container statuses on the Pod (set by Docker on the nodes):


- name: foo
  ...
  lastState:
    terminated:
      exitCode: 137
      reason: OOMKilled
  restartCount: 8
  ...

As a little aside, I noticed the OOM events in the API server logs. Digging into the Kubelet code to find where these events were generated revealed an interesting little notification system for container memory threshold events that leverages epoll. Neat!

When the Kubelet spins up a new container, it registers a watch on the container's cgroup. On Linux, this is done by creating an epoll instance that watches an eventfd, which the kernel signals when the cgroup's memory.usage_in_bytes crosses a threshold registered via the cgroup's event control file:


// https://github.com/kubernetes/kubernetes/blob/v1.13.7/pkg/kubelet/eviction/threshold_notifier_linux.go#L48
func NewCgroupNotifier(path, attribute string, threshold int64) (CgroupNotifier, error) {
    ...
    // the cgroup file we're going to watch
    watchfd, err = unix.Open(fmt.Sprintf("%s/%s", path, attribute), unix.O_RDONLY, 0)
    ...
    // a handle on the cgroup event control file - we're going to write to this soon
    controlfd, err = unix.Open(fmt.Sprintf("%s/cgroup.event_control", path), unix.O_WRONLY, 0)
    ...
    // the eventfd that will be signalled by the cgroup notification system
    eventfd, err = unix.Eventfd(0, unix.EFD_CLOEXEC)
    ...
    // an fd to use for epoll
    epfd, err = unix.EpollCreate1(0)
    ...
    // set up notifications by posting to the cgroup control file
    config := fmt.Sprintf("%d %d %d", eventfd, watchfd, threshold)
    _, err = unix.Write(controlfd, []byte(config))
    ...
}


// https://github.com/kubernetes/kubernetes/blob/v1.13.7/pkg/kubelet/eviction/threshold_notifier_linux.go#L101
func (n *linuxCgroupNotifier) Start(eventCh chan<- struct{}) {
  // register the eventfd we created earlier with the epoll instance
  err := unix.EpollCtl(n.epfd, unix.EPOLL_CTL_ADD, n.eventfd, &unix.EpollEvent{
    Fd:     int32(n.eventfd),
    Events: unix.EPOLLIN,
  })
  ...
  for {
    // keep waiting for events on the epoll fd or for a shutdown signal
    select {
    case <-n.stop:
      return
    default:
    }
    event, err := wait(n.epfd, n.eventfd, notifierRefreshInterval)
    if err != nil {
      ...
      return
    } else if !event {
      ...
      // we timed out waiting for an event
      continue
    }
    // we got an event; consume it and notify the channel
    buf := make([]byte, eventSize)
    _, err = unix.Read(n.eventfd, buf)
    ...
    eventCh <- struct{}{}
  }
}

More on the API for these notifications can be found in the Linux memory cgroup documentation:

9. Memory thresholds

Memory cgroup implements memory thresholds using the cgroups notification
API (see cgroups.txt). It allows to register multiple memory and memsw
thresholds and gets notifications when it crosses.

To register a threshold, an application must:
- create an eventfd using eventfd(2);
- open memory.usage_in_bytes or memory.memsw.usage_in_bytes;
- write string like "<event_fd> <fd of memory.usage_in_bytes> <threshold>" to
  cgroup.event_control.

Application will be notified through eventfd when memory usage crosses
threshold in any direction.
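
To make those three steps concrete, here's a minimal standalone sketch of the same dance (my own code, not the kubelet's - the cgroup path and 1Gi threshold are made-up examples, and it uses a plain blocking read on the eventfd instead of the kubelet's epoll loop):

package main

import (
	"fmt"
	"log"

	"golang.org/x/sys/unix"
)

func main() {
	// hypothetical cgroup v1 path and threshold - adjust for your system
	const cgroupPath = "/sys/fs/cgroup/memory/mygroup"
	const threshold = 1 << 30 // 1Gi

	// 1. create an eventfd
	eventfd, err := unix.Eventfd(0, unix.EFD_CLOEXEC)
	if err != nil {
		log.Fatal(err)
	}

	// 2. open memory.usage_in_bytes
	watchfd, err := unix.Open(cgroupPath+"/memory.usage_in_bytes", unix.O_RDONLY, 0)
	if err != nil {
		log.Fatal(err)
	}

	// 3. write "<event_fd> <fd of memory.usage_in_bytes> <threshold>" to cgroup.event_control
	controlfd, err := unix.Open(cgroupPath+"/cgroup.event_control", unix.O_WRONLY, 0)
	if err != nil {
		log.Fatal(err)
	}
	config := fmt.Sprintf("%d %d %d", eventfd, watchfd, threshold)
	if _, err := unix.Write(controlfd, []byte(config)); err != nil {
		log.Fatal(err)
	}

	// a read on the eventfd now blocks until usage crosses the threshold
	buf := make([]byte, 8) // eventfd counters are 8 bytes wide
	if _, err := unix.Read(eventfd, buf); err != nil {
		log.Fatal(err)
	}
	log.Println("memory threshold crossed")
}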

Back to the problem at hand ... why were these containers crashing?

The container runs a Java process with a memory limit of 2Gi. The heap is allowed to take up to 75% of the container's limit (set via the -XX:MaxRAMPercentage=75.0 flag).

Metrics showed that the JVM never came anywhere close to filling up its heap, though it had committed this memory inside the container. The known off-heap regions in the JVM were small (code cache, Metaspace, etc.).

However, committed memory was coming from somewhere else in the container. This smelled like an off-heap leak, and these things can be gross to track down.

Thankfully, at a previous job someone showed me how you can use pmap and the proc filesystem to take a peek inside a running process's memory at runtime!

pmap showed the following memory regions (sorted by committed usage):

$ pmap -x 1 | tail -n+3 | head -n-2 | sort -k 3 -n  | tail
Address           Kbytes     RSS   Dirty Mode  Mapping
...
00007f394d702000   41976    8280    8280 rw---   [ anon ]
00007f3981310000   12640   10944       0 r-x-- libjvm.so
00007f38566e5000   67600   17648   17648 rw---   [ anon ]
00007f3928000000   24336   24244   24244 rw---   [ anon ]
00007f3940000000   39460   26728   26728 rw---   [ anon ]
00007f397c000000   46656   33696   33696 rw---   [ anon ]
00007f3934000000   52600   41604   41604 rw---   [ anon ]
00007f396432c000   55116   53104   53104 rw---   [ anon ]
00007f396cde4000   67712   67112   67112 rwx--   [ anon ]
00000000a0000000 1585072 1581384 1581384 rw---   [ anon ]

The largest region (1585072 Kbytes, ~1.5Gi) is the JVM heap, which, as mentioned earlier, was more or less fully committed (~0.75 * 2Gi).

The rwx region, it turns out, is the JVM Code Cache, where Just-In-Time (JIT) compiled code is stored. Hence the executable bit on the region.
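
As an aside, you can also spot executable regions like this straight from /proc/<pid>/maps by looking for anonymous mappings with the x bit set. Here's a minimal sketch of that - my own throwaway tool, nothing from the JVM or pmap:

package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
	"strings"
)

func main() {
	if len(os.Args) != 2 {
		log.Fatalf("usage: %s <pid>", os.Args[0])
	}
	f, err := os.Open("/proc/" + os.Args[1] + "/maps")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	s := bufio.NewScanner(f)
	for s.Scan() {
		// each line is: address perms offset dev inode [pathname]
		fields := strings.Fields(s.Text())
		anon := len(fields) < 6 // no pathname => anonymous mapping
		if anon && strings.HasPrefix(fields[1], "rwx") {
			fmt.Println(fields[0], fields[1])
		}
	}
	if err := s.Err(); err != nil {
		log.Fatal(err)
	}
}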

A number of the remaining regions were slowly growing in size, and I wanted to take a look at what was inside!

A useful trick I learned from the same coworker at $job-1 is that you can attach to the process and read out its memory contents directly. Taking inspiration from this Stack Exchange post, I rigged up a little binary that takes an address from pmap, reads the contents at that address, and dumps them to stdout:

package main

import (
	"bufio"
	"flag"
	"fmt"
	"io"
	"log"
	"os"
	"runtime"
	"strconv"
	"syscall"
)

func main() {
	var (
		pid     int
		addrStr string
		length  int
	)
	flag.IntVar(&pid, "pid", -1, "PID of the process to attach to")
	flag.StringVar(&addrStr, "address", "", "Memory address to read")
	flag.IntVar(&length, "n", -1, "Number of bytes to read")
	flag.Parse()

	// validate args
	if pid <= 0 {
		fmt.Println("PID must be > 0")
		flag.Usage()
		os.Exit(1)
	}
	if addrStr == "" {
		fmt.Println("address must be specified")
		flag.Usage()
		os.Exit(1)
	}
	if length <= 0 {
		fmt.Println("number of bytes must be > 0")
		flag.Usage()
		os.Exit(1)
	}

	// confirm the PID exists - on Unix, just send SIG 0
	proc, err := os.FindProcess(pid)
	if err != nil {
		log.Fatal(err)
	}
	if err := proc.Signal(syscall.Signal(0)); err != nil {
		log.Fatalf("process does not exist: %s", err)
	}

	// address must parse
	addr, err := strconv.ParseInt(addrStr, 0, 64)
	if err != nil {
		log.Fatal(err)
	}

	// ptrace calls must all come from the same OS thread
	runtime.LockOSThread()
	defer runtime.UnlockOSThread()

	// ptrace the process - always detach when we're done
	err = syscall.PtraceAttach(proc.Pid)
	if err != nil {
		log.Fatalf("ptrace: %s", err)
	}
	defer syscall.PtraceDetach(proc.Pid)

	// wait for the tracee to stop before touching its memory
	var ws syscall.WaitStatus
	if _, err := syscall.Wait4(proc.Pid, &ws, 0, nil); err != nil {
		log.Fatalf("wait4: %s", err)
	}

	// open memory
	f, err := os.Open(fmt.Sprintf("/proc/%d/mem", proc.Pid))
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// seek to the target address in the process's memory
	_, err = f.Seek(addr, io.SeekStart)
	if err != nil {
		log.Fatal(err)
	}

	// read the memory contents
	r := bufio.NewReader(f)
	b := make([]byte, length)
	_, err = io.ReadFull(r, b)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(b))
}

I ran this over a few memory regions, passing the contents through strings, and what fell out was most definitely related to the application-layer logic. Some of the regions seemed related to Redis, while others seemed to contain HTTP/2-related strings. Given the application was a gRPC server, this had me wondering if gRPC was storing data outside of the JVM heap.

To try and prove that these regions were related to gRPC, I attached a debugger locally while watching the output of strace in the container (remember to add the SYS_PTRACE capability!), filtering for mmap syscalls (allocating large blocks of memory typically manifests as this raw syscall - malloc itself is a library function, not a syscall!):

$ strace -f -p 1 -e trace=mmap 2>&1 | grep mmap

I noticed certain larger mmap allocations whenever gRPC would call into Netty to allocate a direct buffer:

// https://github.com/netty/netty/blob/netty-4.1.33.Final/common/src/main/java/io/netty/util/internal/PlatformDependent0.java#L448
static ByteBuffer allocateDirectNoCleaner(int capacity) {
    // Calling malloc with capacity of 0 may return a null ptr or a memory address that can be used.
    // Just use 1 to make it safe to use in all cases:
    // See: http://pubs.opengroup.org/onlinepubs/009695399/functions/malloc.html
    return newDirectBuffer(UNSAFE.allocateMemory(Math.max(1, capacity)), capacity);
}

$ strace -f -p 1 -e trace=mmap 2>&1 | grep mmap
...
[pid    38] mmap(NULL, 16781312, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fbee2fff000
[pid    41] mmap(NULL, 16781312, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fbee05e4000
[pid    76] mmap(NULL, 16781312, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fbebeafe000
[pid    59] mmap(NULL, 16781312, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fbebd6fb000
[pid    79] mmap(NULL, 16781312, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fbebc5f9000
[pid    64] mmap(NULL, 16781312, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fbeb6fff000
[pid    80] mmap(NULL, 16781312, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fbeb5ffe000
[pid    81] mmap(NULL, 16781312, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fbeb4ffd000
...

So Netty is allocating 8 memory regions outside the JVM heap, each ~16MB in size. Why eight, though? It turns out Netty uses the concept of an "arena", the same as in jemalloc and tcmalloc (read more about that here). Each of the 16MB blocks represents an arena, and by default (i.e. without explicit configuration), our Netty runtime ran with 4 arenas per allocator.

And why 8, not 4? It turns out we had both Netty and a shaded copy of Netty on the classpath (gRPC can be run with its own bundled version of Netty), and each brought along its own allocator, each with 4 arenas for native (or direct) memory-backed buffers.

Using the program I cooked up, I was able to observe the mmap call, and use the return value (the allocated address) as the input to dump the contents:

root@b4b9f230310f:/app# ./mem-reader --pid 1 --n 16781312 --address 0x7fbebeafe000 | strings
PRI * HTTP/2.0
:authority
localhost:50051@
trailers@
content-type
application/grpc@
user-agent3grpc-node/1.24.2 grpc-c/8.0.0 (osx; chttp2; ganges)@
grpc-accept-encoding
identity,deflate,gzip@
...

Aha! The contents of these mysterious regions line up with the contents of the regions I first observed when poking around in memory.

At this point it started to make sense that we could be running out of room in the container cgroup due to pressure from a combination of sources - the JVM heap (~1.5Gi), off-heap JVM areas (~100Mi), and up to 8 arena regions of up to 64Mi each (~512Mi) - which together add up to more than the 2Gi limit.

Some lessons learned debugging this issue:

- Debugging the JVM in production never ceases to provide valuable, low-level systems learnings!