Reading process memory
Here's another fun tale from the JVM depths of despair.
A couple of containers in a Pod were consistently being killed due to
exceeding cgroup limits set by Kubernetes. The reason for the exit was clear
from the status
for the containers in the Pods (set by Docker
on the nodes):
- name: foo
...
lastState:
terminated:
exitCode: 137
reason: OOMKilled
restartCount: 8
...
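The status above is all you really need, but the same fields are available programmatically from the API server if you want to spot OOM kills across a whole namespace rather than eyeballing one Pod. A rough client-go sketch (namespace and kubeconfig handling are illustrative, and a recent client-go is assumed):
package main

import (
	"context"
	"fmt"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// build a client from the local kubeconfig
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		log.Fatal(err)
	}
	// walk the pods in a namespace and report containers whose last exit was an OOM kill
	pods, err := clientset.CoreV1().Pods("default").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		log.Fatal(err)
	}
	for _, pod := range pods.Items {
		for _, cs := range pod.Status.ContainerStatuses {
			if t := cs.LastTerminationState.Terminated; t != nil && t.Reason == "OOMKilled" {
				fmt.Printf("%s/%s exit=%d restarts=%d\n", pod.Name, cs.Name, t.ExitCode, cs.RestartCount)
			}
		}
	}
}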
As a little aside, I saw logs for the OOM events in the API server logs.
Doing some digging in the Kubelet code for where these logs were generated
revealed an interesting little notification system for container memory
threshold events that leverages epoll. Neat!
When the Kubelet spins up a new container, it registers a watch on the container's cgroup. On Linux, this is done by creating an epoll instance that watches an eventfd; the kernel posts to that eventfd when the value in the cgroup's memory.usage_in_bytes file crosses the threshold registered via the cgroup's cgroup.event_control file:
// https://github.com/kubernetes/kubernetes/blob/v1.13.7/pkg/kubelet/eviction/threshold_notifier_linux.go#L48
func NewCgroupNotifier(path, attribute string, threshold int64) (CgroupNotifier, error) {
...
// the cgroup file we're going to watch
watchfd, err = unix.Open(fmt.Sprintf("%s/%s", path, attribute), unix.O_RDONLY, 0)
...
// a handle on the cgroup event control file - we're going to write to this soon
controlfd, err = unix.Open(fmt.Sprintf("%s/cgroup.event_control", path), unix.O_WRONLY, 0)
...
// the eventfd that the cgroup notification system will post to
eventfd, err = unix.Eventfd(0, unix.EFD_CLOEXEC)
...
// a fd to use for epoll
epfd, err = unix.EpollCreate1(0)
...
// set up notifications by posting to the cgroup control file
config := fmt.Sprintf("%d %d %d", eventfd, watchfd, threshold)
_, err = unix.Write(controlfd, []byte(config))
...
}
// https://github.com/kubernetes/kubernetes/blob/v1.13.7/pkg/kubelet/eviction/threshold_notifier_linux.go#L101
func (n *linuxCgroupNotifier) Start(eventCh chan<- struct{}) {
// register the eventfd with the epoll instance we created earlier
err := unix.EpollCtl(n.epfd, unix.EPOLL_CTL_ADD, n.eventfd, &unix.EpollEvent{
Fd: int32(n.eventfd),
Events: unix.EPOLLIN,
})
...
for {
// keep waiting for events on the epoll fd or for a shutdown signal
select {
case <-n.stop:
return
default:
}
event, err := wait(n.epfd, n.eventfd, notifierRefreshInterval)
if err != nil {
...
return
} else if !event {
...
// we timed out waiting for an event
continue
}
// we got an event; consume it and notify the channel
buf := make([]byte, eventSize)
_, err = unix.Read(n.eventfd, buf)
...
eventCh <- struct{}{}
}
}
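The wait helper referenced above isn't part of the excerpt. Roughly speaking - this is a sketch of the idea, not the Kubelet's exact code - it does a single EpollWait with a timeout and reports whether our eventfd became readable:
// sketch: block on the epoll fd for up to `timeout`, and report whether the
// eventfd we registered became readable (i.e. the threshold fired).
// (uses time and golang.org/x/sys/unix, as in the excerpt above)
func wait(epfd, eventfd int, timeout time.Duration) (bool, error) {
	events := make([]unix.EpollEvent, 2)
	n, err := unix.EpollWait(epfd, events, int(timeout/time.Millisecond))
	if err != nil {
		if err == unix.EINTR {
			// interrupted by a signal - treat it like a timeout and let the caller retry
			return false, nil
		}
		return false, err
	}
	if n == 0 {
		// timed out without any events
		return false, nil
	}
	for _, ev := range events[:n] {
		if int(ev.Fd) == eventfd && ev.Events&unix.EPOLLIN != 0 {
			return true, nil
		}
	}
	return false, nil
}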
More on the rough API for the notifications can be found in the Linux cgroup documentation here:
9. Memory thresholds
Memory cgroup implements memory thresholds using the cgroups notification
API (see cgroups.txt). It allows to register multiple memory and memsw
thresholds and gets notifications when it crosses.
To register a threshold, an application must:
- create an eventfd using eventfd(2);
- open memory.usage_in_bytes or memory.memsw.usage_in_bytes;
- write string like "<event_fd> <fd of memory.usage_in_bytes> <threshold>" to
cgroup.event_control.
Application will be notified through eventfd when memory usage crosses
threshold in any direction.
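Those three steps are easy to replicate outside the Kubelet. Here's a minimal sketch in Go, assuming a cgroup v1 hierarchy and a hypothetical cgroup path, which registers a threshold and blocks until the kernel fires the eventfd:
package main

import (
	"fmt"
	"log"

	"golang.org/x/sys/unix"
)

func main() {
	const cgroup = "/sys/fs/cgroup/memory/mycgroup" // hypothetical cgroup path
	const threshold = 1 << 30                       // notify when usage crosses 1GiB

	// step 1: create an eventfd
	efd, err := unix.Eventfd(0, 0)
	if err != nil {
		log.Fatal(err)
	}
	// step 2: open memory.usage_in_bytes
	usagefd, err := unix.Open(cgroup+"/memory.usage_in_bytes", unix.O_RDONLY, 0)
	if err != nil {
		log.Fatal(err)
	}
	// step 3: write "<event_fd> <fd of memory.usage_in_bytes> <threshold>" to cgroup.event_control
	ctlfd, err := unix.Open(cgroup+"/cgroup.event_control", unix.O_WRONLY, 0)
	if err != nil {
		log.Fatal(err)
	}
	if _, err := unix.Write(ctlfd, []byte(fmt.Sprintf("%d %d %d", efd, usagefd, threshold))); err != nil {
		log.Fatal(err)
	}
	// block until the kernel posts to the eventfd (the value read is an 8-byte counter)
	buf := make([]byte, 8)
	if _, err := unix.Read(efd, buf); err != nil {
		log.Fatal(err)
	}
	fmt.Println("memory usage crossed the threshold")
}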
Back to the problem at hand ... why were these containers crashing?
The container runs a Java process with a memory limit of 2Gi. The heap is allowed to take up to 75% of the container's limit (set via the -XX:MaxRAMPercentage=75.0 flag) - roughly 1.5Gi, leaving about 0.5Gi of the limit for everything else in the container.
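For context, the relevant bits of the setup looked roughly like this (an illustrative fragment, not our actual manifest; JAVA_TOOL_OPTIONS is just one convenient way to pass the flag):
containers:
- name: foo
  resources:
    limits:
      memory: 2Gi
  env:
  - name: JAVA_TOOL_OPTIONS
    value: "-XX:MaxRAMPercentage=75.0"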
Metrics showed that the JVM was never anywhere close to filling up its heap, though it had committed this memory inside the container. Off-heap regions in the JVM were small (code cache, Metaspace, etc.).
However, there was committed memory coming from somewhere in the container. This smelled like an off-heap leak, and these things can be gross to track down.
Thankfully, at a previous job someone showed me how you can use pmap and the proc filesystem to take a peek inside a running process's memory! pmap showed the following memory regions (sorted by resident usage, the RSS column):
$ pmap -x 1 | tail -n+3 | head -n-2 | sort -k 3 -n | tail
Address Kbytes RSS Dirty Mode Mapping
...
00007f394d702000 41976 8280 8280 rw--- [ anon ]
00007f3981310000 12640 10944 0 r-x-- libjvm.so
00007f38566e5000 67600 17648 17648 rw--- [ anon ]
00007f3928000000 24336 24244 24244 rw--- [ anon ]
00007f3940000000 39460 26728 26728 rw--- [ anon ]
00007f397c000000 46656 33696 33696 rw--- [ anon ]
00007f3934000000 52600 41604 41604 rw--- [ anon ]
00007f396432c000 55116 53104 53104 rw--- [ anon ]
00007f396cde4000 67712 67112 67112 rwx-- [ anon ]
00000000a0000000 1585072 1581384 1581384 rw--- [ anon ]
The largest region (1585072 KB, i.e. ~1.5Gi) is the JVM heap which, as mentioned earlier, was more or less fully committed (~0.75 * 2Gi).
The rwx region, it turns out, is the JVM Code Cache, where Just-In-Time (JIT) compiled code is stored - hence the executable bit on the region.
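As an aside, pmap isn't always installed in minimal container images, but the same data is available straight from /proc. Here's a rough sketch (not the tool I actually used) that walks /proc/<pid>/smaps and prints the resident size of each mapping:
package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
	"strings"
)

func main() {
	if len(os.Args) < 2 {
		log.Fatal("usage: smaps-rss <pid>")
	}
	f, err := os.Open("/proc/" + os.Args[1] + "/smaps")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	var region string
	s := bufio.NewScanner(f)
	for s.Scan() {
		fields := strings.Fields(s.Text())
		if len(fields) == 0 {
			continue
		}
		if strings.Contains(fields[0], "-") {
			// a new mapping header, e.g. "00007f394d702000-00007f3950000000 rw-p 00000000 00:00 0"
			region = s.Text()
		} else if fields[0] == "Rss:" && len(fields) >= 2 {
			// the resident size line for the current mapping, e.g. "Rss:  8280 kB"
			fmt.Printf("%10s kB  %s\n", fields[1], region)
		}
	}
	if err := s.Err(); err != nil {
		log.Fatal(err)
	}
}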
A number of the remaining regions were slowly growing in size, and I wanted to take a look at what was inside!
A useful trick I learned from the same coworker at $job-1 is that you can attach to the process and read out the memory contents directly. Taking inspiration from this Stackexchange post, I rigged up a little binary that takes a PID and an address from pmap, reads the contents, and dumps them to stdout:
package main
import (
	"bufio"
	"flag"
	"fmt"
	"io"
	"log"
	"os"
	"runtime"
	"strconv"
	"syscall"
)
func main() {
	var (
		pid     int
		addrStr string
		length  int
	)
	flag.IntVar(&pid, "pid", -1, "PID of the process to attach to")
	flag.StringVar(&addrStr, "address", "", "Memory address to read")
	flag.IntVar(&length, "n", -1, "Number of bytes to read")
	flag.Parse()
	// validate args
	if pid <= 0 {
		fmt.Println("PID must be > 0")
		flag.Usage()
		os.Exit(1)
	}
	if addrStr == "" {
		fmt.Println("address must be specified")
		flag.Usage()
		os.Exit(1)
	}
	if length <= 0 {
		fmt.Println("number of bytes must be > 0")
		flag.Usage()
		os.Exit(1)
	}
	// confirm the PID exists - on Unix, just send signal 0
	proc, err := os.FindProcess(pid)
	if err != nil {
		log.Fatal(err)
	}
	if err := proc.Signal(syscall.Signal(0)); err != nil {
		log.Fatalf("process does not exist: %s", err)
	}
	// address must parse (accepts the hex form pmap prints)
	addr, err := strconv.ParseInt(addrStr, 0, 64)
	if err != nil {
		log.Fatal(err)
	}
	// ptrace calls must all come from the same OS thread
	runtime.LockOSThread()
	defer runtime.UnlockOSThread()
	// ptrace the process - always detach when we're done
	if err := syscall.PtraceAttach(proc.Pid); err != nil {
		log.Fatalf("ptrace: %s", err)
	}
	defer syscall.PtraceDetach(proc.Pid)
	// wait for the target to stop before touching its memory
	var ws syscall.WaitStatus
	if _, err := syscall.Wait4(proc.Pid, &ws, 0, nil); err != nil {
		log.Fatalf("wait4: %s", err)
	}
	// open the target's memory via procfs
	f, err := os.Open(fmt.Sprintf("/proc/%d/mem", proc.Pid))
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	// seek to our location in memory
	if _, err := f.Seek(addr, 0); err != nil {
		log.Fatal(err)
	}
	// read the memory contents and dump them to stdout
	r := bufio.NewReader(f)
	b := make([]byte, length)
	if _, err := io.ReadFull(r, b); err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(b))
}
I ran this over a few memory regions, passing the contents through strings, and what fell out was most definitely related to the application-layer logic. Some of the regions seemed related to Redis, while others contained HTTP/2-related strings. Given the application was a gRPC server, this had me wondering if gRPC was storing data outside of the JVM heap.
To try and prove that these regions were related to gRPC, I attached a debugger locally while also watching the output of strace in the container (remember to add the SYS_PTRACE capability!), filtering for mmap syscalls (allocating a large chunk of memory typically manifests as this raw syscall - malloc itself is not a syscall, and for big requests glibc's malloc ends up calling mmap under the hood):
$ strace -f -p 1 -e trace=mmap 2>&1 | grep mmap
I noticed certain larger mmap allocations whenever gRPC would call into Netty to allocate a direct buffer:
// https://github.com/netty/netty/blob/netty-4.1.33.Final/common/src/main/java/io/netty/util/internal/PlatformDependent0.java#L448
static ByteBuffer allocateDirectNoCleaner(int capacity) {
// Calling malloc with capacity of 0 may return a null ptr or a memory address that can be used.
// Just use 1 to make it safe to use in all cases:
// See: http://pubs.opengroup.org/onlinepubs/009695399/functions/malloc.html
return newDirectBuffer(UNSAFE.allocateMemory(Math.max(1, capacity)), capacity);
}
$ strace -f -p 1 -e trace=mmap 2>&1 | grep mmap
...
[pid 38] mmap(NULL, 16781312, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fbee2fff000
[pid 41] mmap(NULL, 16781312, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fbee05e4000
[pid 76] mmap(NULL, 16781312, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fbebeafe000
[pid 59] mmap(NULL, 16781312, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fbebd6fb000
[pid 79] mmap(NULL, 16781312, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fbebc5f9000
[pid 64] mmap(NULL, 16781312, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fbeb6fff000
[pid 80] mmap(NULL, 16781312, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fbeb5ffe000
[pid 81] mmap(NULL, 16781312, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fbeb4ffd000
...
So Netty is allocating 8 off-JVM-heap regions, each ~16MB in size - which happens to line up with Netty's default arena chunk size (8KiB pages * 2^11 = 16MiB). Why eight though? It turns out Netty uses the concept of an "arena", the same as in jemalloc and tcmalloc (read more about that here). Each of the 16MB blocks represents an arena. By default (i.e. without configuration), our Netty runtime was configured to run with 4 arenas per allocator.
And why 8, not 4? It turns out we had both Netty and a shaded copy of Netty on the classpath (gRPC can be run with its own bundled version of Netty), and each brought its own allocator, each with 4 arenas for native (direct) memory-backed buffers.
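As an aside (not something we ended up needing here), the arena count is tunable: Netty 4.1's pooled allocator reads it from a system property, e.g.:
-Dio.netty.allocator.numDirectArenas=2
Note that a shaded copy of Netty keeps its own copy of these settings, so with two allocators on the classpath you may need to configure both.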
Using the program I cooked up, I was able to take the return value of an mmap call (the allocated address) and use it as the input to dump the region's contents:
root@b4b9f230310f:/app# ./mem-reader --pid 1 --n 16781312 --address 0x7fbebeafe000 | strings
PRI * HTTP/2.0
:authority
localhost:50051@
trailers@
content-type
application/grpc@
user-agent3grpc-node/1.24.2 grpc-c/8.0.0 (osx; chttp2; ganges)@
grpc-accept-encoding
identity,deflate,gzip@
...
Aha! The contents of these mysterious regions line up with the contents of the regions I first observed when poking around in memory.
At this point it started to make sense that we could be running out of room in the container cgroup due to pressure from a combination of sources: the JVM heap (~1.5Gi), off-heap JVM areas (~100Mi), and up to 8 regions of up to 64Mi each (~512Mi) - roughly 2.1Gi in total, over the 2Gi limit.
Some lessons learned debugging this issue:
Cgroups have a neat notification mechanism for events, and K8s takes advantage of this.
The JVM maintains an executable region in memory for JIT-compiled code.
Go has pretty good support for interacting with processes directly - I guess this is why some people like to call it a "systems language".
Netty, by default, allocates off-heap memory buffers.
If you're using a shaded version of Netty (e.g. via gRPC), and you also have a library bringing in vanilla Netty, chances are you're reserving twice as much off-heap memory (and potentially using it).
When running the JVM in a containerized environment, ensure you provide enough "breathing room" for the heap as well as off-heap regions (JVM and non-JVM).
Debugging the JVM in production never ceases to provide valuable, low-level systems learnings!