Backtraces with strace
I discovered strace somewhere between my first part time web development part time job in 2005 and my first full time “software engineering” job in 2008, and it seemed like a superpower giving me x-ray vision into running infrastructure. When a process was stuck, or existing after a cryptic error message, instead of grepping around I could get a pretty good timeline of what the process was up to.
It has some fatal flaws, the ptrace
based technology can cause performance issues so it’s mostly not suited to running in production (unless you were
one of my personal heros: the old mysql@facebook team which liked to live dangerously and often used ptrace
based debuggers to obtain life profiling data).
If you’re intro history of performance engineering before perf
and ebpf
and friends, I highly recommend reading up on poor mans profiler.
After 15+ years I thought I knew most of the useful features, but last month I found a new incredibly useful flag: --stack-traces
. It appears
it was added in 2014 and I’m just finding out about it now! When you specify the --stack-traces
flag strace will print the stack trace that
resulted in the system call. I was debugging some complex signal handling interactions between Go and Cgo and while I knew someone was messing
with the signal I didn’t know who. After a couple hours of grepping the entire codebase I just wasn’t sure how the code ended up doing what it did.
But I did something that works surprisingly often on the internet: I imagined the feature I wanted existed and googled for it (or looked at the man page),
so after a quick look for “strace stack traces” I found the feature existed! When enabled it will cause strace
to print stacktraces so you can see
how the code got where it got. Let’s look at an example with a small go
program to just get a feel for how a single DNS resolution is handled and how --stack-traces
can help us understrand the architecture of an application.
stracing the pure go DNS resolver⌗
Here is a simple go program that looks up the IP addresses associated with the shane.ai
domain.
import (
"net"
"fmt"
)
func main() {
names, err := net.LookupHost("shane.ai")
if err != nil {
panic(err)
}
for _, name := range names {
fmt.Printf("%s\n", name)
}
}
Now let’s inspect what this program does using strace. Today I want to know when it connects
to a service and what code paths lead there. I’m going to
use the Go DNS resolver as an example because Go ships with 2 DNS resolvers that we can compare via stack traces:
- A pure Go resolver (generally used when static linking and cross compiling) where go handles all I/O and incidentally avoids a common issue with evented runtimes and libc dns
- A cgo based resolver which uses libc
First lets look at the highlights from the pure Go version. We see a couple of SOCK_DGRAM
sockets
opened in non-blocking mode (which makes sense since Go’s big thing is kind of non-blocking network I/O).
We connect to 127.0.0.53
which is the system configured resolver on my machine. Note that my main
function isn’t in the stack trace, so Go uses a separate goroutine for the lookup. My current understanding
is this is part of ipv6 support and the happy eyeballs algorithm which involves concurrent lookups for ipv4 & ipv6.
export GOMACPROCS=1
export GODEBUG=netdns=go
go build -trimpath -o dnscheck .
strace -fze socket,connect --stack-trace ./dnscheck 2>&1 | sed -e 's|/home/shane/src/||g' | head -50
strace: Process 88648 attached
strace: Process 88649 attached
strace: Process 88650 attached
strace: Process 88651 attached
strace: Process 88652 attached
[pid 88647] socket(AF_INET, SOCK_DGRAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP) = 3
> dnscheck/dnscheck(runtime/internal/syscall.Syscall6+0xe) [0x3aee]
> dnscheck/dnscheck(syscall.RawSyscall+0x15) [0x7b5f5]
> dnscheck/dnscheck(syscall.socket+0x25) [0x7afa5]
> dnscheck/dnscheck(syscall.Socket+0x3d) [0x79f7d]
> dnscheck/dnscheck(net.sysSocket+0x3a) [0xad65a]
> dnscheck/dnscheck(net.socket+0x65) [0xadc25]
> dnscheck/dnscheck(net.internetSocket+0xf1) [0xa77f1]
> dnscheck/dnscheck(net.(*sysDialer).dialUDP+0xa5) [0xb1145]
> dnscheck/dnscheck(net.(*sysDialer).dialSingle+0x2b2) [0x9a2b2]
> dnscheck/dnscheck(net.(*sysDialer).dialSerial+0x248) [0x99b88]
> dnscheck/dnscheck(net.(*sysDialer).dialParallel+0x3e5) [0x99245]
> dnscheck/dnscheck(net.(*Dialer).DialContext+0x6de) [0x98c9e]
> dnscheck/dnscheck(net.(*Resolver).dial+0xa5) [0xa96a5]
> dnscheck/dnscheck(net.(*Resolver).exchange+0x36a) [0x9bbea]
> dnscheck/dnscheck(net.(*Resolver).tryOneName+0x466) [0x9cd06]
> dnscheck/dnscheck(net.(*Resolver).goLookupIPCNAMEOrder.func3.1+0x85) [0xa0165]
> dnscheck/dnscheck(net.(*Resolver).goLookupIPCNAMEOrder.func3.2+0x27) [0xa00a7]
> dnscheck/dnscheck(runtime.goexit.abi0+0x1) [0x65201]
[pid 88647] connect(3, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("127.0.0.53")}, 16) = 0
> dnscheck/dnscheck(runtime/internal/syscall.Syscall6+0xe) [0x3aee]
> dnscheck/dnscheck(syscall.Syscall+0x46) [0x7b646]
> dnscheck/dnscheck(syscall.connect+0x29) [0x7ab49]
> dnscheck/dnscheck(syscall.Connect+0x46) [0x79bc6]
> dnscheck/dnscheck(net.(*netFD).connect+0x78) [0xa1958]
> dnscheck/dnscheck(net.(*netFD).dial+0x3b2) [0xae2f2]
> dnscheck/dnscheck(net.socket+0x294) [0xade54]
> dnscheck/dnscheck(net.internetSocket+0xf1) [0xa77f1]
> dnscheck/dnscheck(net.(*sysDialer).dialUDP+0xa5) [0xb1145]
> dnscheck/dnscheck(net.(*sysDialer).dialSingle+0x2b2) [0x9a2b2]
> dnscheck/dnscheck(net.(*sysDialer).dialSerial+0x248) [0x99b88]
> dnscheck/dnscheck(net.(*sysDialer).dialParallel+0x3e5) [0x99245]
> dnscheck/dnscheck(net.(*Dialer).DialContext+0x6de) [0x98c9e]
> dnscheck/dnscheck(net.(*Resolver).dial+0xa5) [0xa96a5]
> dnscheck/dnscheck(net.(*Resolver).exchange+0x36a) [0x9bbea]
> dnscheck/dnscheck(net.(*Resolver).tryOneName+0x466) [0x9cd06]
> dnscheck/dnscheck(net.(*Resolver).goLookupIPCNAMEOrder.func3.1+0x85) [0xa0165]
> dnscheck/dnscheck(net.(*Resolver).goLookupIPCNAMEOrder.func3.2+0x27) [0xa00a7]
> dnscheck/dnscheck(runtime.goexit.abi0+0x1) [0x65201]
[pid 88650] socket(AF_INET, SOCK_DGRAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP) = 7
> dnscheck/dnscheck(runtime/internal/syscall.Syscall6+0xe) [0x3aee]
> dnscheck/dnscheck(syscall.RawSyscall+0x15) [0x7b5f5]
> dnscheck/dnscheck(syscall.socket+0x25) [0x7afa5]
> dnscheck/dnscheck(syscall.Socket+0x3d) [0x79f7d]
> dnscheck/dnscheck(net.sysSocket+0x3a) [0xad65a]
stracing the cgo libc DNS resolver⌗
Let’s see the difference when we use cgo
on this ubuntu machine using systemd. libc.so
’s getaddrinfo
is in the stack
trace now. Also notice that it’s using a AF_UNIX
unix domain socket for some communication. It looks like on my machine it’s
an unused part of glibc where system dns can be accessed via a unix domain socket: /var/run/nscd/socket
. There’s an explanation of glibc’s behaviour here.
So after glibc wastes some time looking for a service that’s not running, it also opens up some SOCK_DGRAM
sockets and as expected
they are in blocking mode. Which means that like node, when Go uses the system resolver it has to use a threadpool instead
of just using evented I/O. This has generally annoyed people because Go’s home rolled DNS resolver isn’t 100% compatible with glibc and sometimes that breaks things.
Skipping glibc tends to annoy system administrators.
However accessing the “system” resolver via blocking cgo
calls by default causes unexpected resource issues. People are often not really prepared
for their Go programs to be running tons of threads, usually Go keeps threads near GOMAXPROCS
which is by default equal to the number of cores on your
machine. Having an outage due to thread pools where you expected to see a handful but instead saw 10k tends to annoy developers and users.
export GOMACPROCS=1
export GODEBUG=netdns=cgo
go build -trimpath -o dnscheck .
strace -fze socket,connect --stack-trace ./dnscheck 2>&1 | sed -e 's|/home/shane/src/||g' | head -50
strace: Process 90067 attached
strace: Process 90068 attached
strace: Process 90069 attached
strace: Process 90070 attached
strace: Process 90071 attached
[pid 90066] socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 3
> /usr/lib/x86_64-linux-gnu/libc.so.6(__socket+0xb) [0x12031b]
> /usr/lib/x86_64-linux-gnu/libc.so.6(svc_run+0x38b1) [0x169301]
> /usr/lib/x86_64-linux-gnu/libc.so.6(svc_run+0x3f01) [0x169951]
> /usr/lib/x86_64-linux-gnu/libc.so.6(svc_run+0x4397) [0x169de7]
> /usr/lib/x86_64-linux-gnu/libc.so.6(svc_run+0x1dea) [0x16783a]
> /usr/lib/x86_64-linux-gnu/libc.so.6(getaddrinfo+0x1b92) [0x103f42]
> dnscheck/dnscheck(runtime.text+0x3b) [0x103b]
[pid 90066] socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 3
> /usr/lib/x86_64-linux-gnu/libc.so.6(__socket+0xb) [0x12031b]
> /usr/lib/x86_64-linux-gnu/libc.so.6(svc_run+0x38b1) [0x169301]
> /usr/lib/x86_64-linux-gnu/libc.so.6(svc_run+0x45d7) [0x16a027]
> /usr/lib/x86_64-linux-gnu/libc.so.6(svc_run+0x1f07) [0x167957]
> /usr/lib/x86_64-linux-gnu/libc.so.6(getaddrinfo+0x1b92) [0x103f42]
> dnscheck/dnscheck(runtime.text+0x3b) [0x103b]
[pid 90066] socket(AF_INET, SOCK_DGRAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP) = 3
> /usr/lib/x86_64-linux-gnu/libc.so.6(__socket+0xb) [0x12031b]
> /usr/lib/x86_64-linux-gnu/libc.so.6(__res_randomid+0x199) [0x1473a9]
> /usr/lib/x86_64-linux-gnu/libc.so.6(__res_context_send+0x2bf) [0x147e6f]
> /usr/lib/x86_64-linux-gnu/libc.so.6(__res_context_query+0x1c1) [0x145de1]
> /usr/lib/x86_64-linux-gnu/libc.so.6(__res_context_search+0x472) [0x146ac2]
> /usr/lib/x86_64-linux-gnu/libc.so.6(_nss_dns_gethostbyname4_r+0x251) [0x13e781]
> /usr/lib/x86_64-linux-gnu/libc.so.6(getaddrinfo+0x1ec0) [0x104270]
> dnscheck/dnscheck(runtime.text+0x3b) [0x103b]
[pid 90066] connect(3, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("127.0.0.53")}, 16) = 0
> /usr/lib/x86_64-linux-gnu/libc.so.6(__connect+0x4b) [0x11fd2b]
> /usr/lib/x86_64-linux-gnu/libc.so.6(__res_randomid+0xfd) [0x14730d]
> /usr/lib/x86_64-linux-gnu/libc.so.6(__res_context_send+0x2bf) [0x147e6f]
> /usr/lib/x86_64-linux-gnu/libc.so.6(__res_context_query+0x1c1) [0x145de1]
> /usr/lib/x86_64-linux-gnu/libc.so.6(__res_context_search+0x472) [0x146ac2]
> /usr/lib/x86_64-linux-gnu/libc.so.6(_nss_dns_gethostbyname4_r+0x251) [0x13e781]
> /usr/lib/x86_64-linux-gnu/libc.so.6(getaddrinfo+0x1ec0) [0x104270]
> dnscheck/dnscheck(runtime.text+0x3b) [0x103b]
[pid 90066] socket(AF_NETLINK, SOCK_RAW|SOCK_CLOEXEC, NETLINK_ROUTE) = 3
> /usr/lib/x86_64-linux-gnu/libc.so.6(__socket+0xb) [0x12031b]
> /usr/lib/x86_64-linux-gnu/libc.so.6(__idna_from_dns_encoding+0x25b) [0x13be6b]
> /usr/lib/x86_64-linux-gnu/libc.so.6(getaddrinfo+0x1273) [0x103623]
> dnscheck/dnscheck(runtime.text+0x3b) [0x103b]
[pid 90066] socket(AF_INET, SOCK_DGRAM|SOCK_CLOEXEC, IPPROTO_IP) = 3
> /usr/lib/x86_64-linux-gnu/libc.so.6(__socket+0xb) [0x12031b]
> /usr/lib/x86_64-linux-gnu/libc.so.6(getaddrinfo+0xd53) [0x103103]
> dnscheck/dnscheck(runtime.text+0x3b) [0x103b]
[pid 90066] connect(3, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("185.199.111.153")}, 16) = 0
> /usr/lib/x86_64-linux-gnu/libc.so.6(__connect+0x4b) [0x11fd2b]
> /usr/lib/x86_64-linux-gnu/libc.so.6(getaddrinfo+0xa93) [0x102e43]
Wrap up⌗
That’s all folks. I hope that you don’t need it, but if you ever do find yourself asking: “what code is touching that thing!?”
strace --stack-traces
I hope you’ll land on this post and find that strace
is at your service.
Maybe in the future I’ll look at converting strace --stack-traces
output to something compatible with mozilla’s trace viewer. That sounds fun.