Races in birdc causing daemon to crash on low spec routers
Hello all,

I'm hitting a crash in BIRD 3.3.0 on several routers. The machines are running:

- NixOS x86_64-linux 7.0.10
- 2-5 BGP IPv4/IPv6 peers per affected host
- prometheus-bird-exporter 1.4.5, scraping every 30s

prometheus-bird-exporter runs `show protocols all` over /run/bird/bird.ctl every ~30s, and on hosts with active BGP churn that crashes bird inside bgp_show_proto_info(). systemd restarts it, the next scrape kills it again, and NRestarts climbs by ~120/hr. toompea and timah took 14 and 13 cores in their last ~20 minutes of uptime; highline took 4 over a slower span. The other seven routers (butte, cradle, baldy, kongo, lantau, roraima, rysy) run the same build and exporter with no cores, most likely because these machines only receive default routes. The crash needs concurrent BGP state changes.

Coredump listing across all routers:

```console
$ colmena exec --verbose --on @router -- 'sudo coredumpctl list || true'
timah |
highline |
kongo |
toompea |
butte |
cradle |
baldy |
rysy |
lantau |
roraima |
rysy | No coredumps found.
rysy | Succeeded
toompea | TIME PID UID GID SIG COREFILE EXE SIZE
toompea | Thu 2026-05-28 15:41:25 UTC 1185401 993 991 SIGSEGV present /nix/store/230z9nyndgbn265mqlbyvf20z5wdciwy-bird-3.3.0/bin/bird 98.9M
toompea | Thu 2026-05-28 15:41:53 UTC 1265107 993 991 SIGSEGV present /nix/store/230z9nyndgbn265mqlbyvf20z5wdciwy-bird-3.3.0/bin/bird 100.3M
toompea | Thu 2026-05-28 15:42:22 UTC 1265317 993 991 SIGSEGV present /nix/store/230z9nyndgbn265mqlbyvf20z5wdciwy-bird-3.3.0/bin/bird 100.9M
toompea | Thu 2026-05-28 15:42:54 UTC 1265529 993 991 SIGSEGV present /nix/store/230z9nyndgbn265mqlbyvf20z5wdciwy-bird-3.3.0/bin/bird 106.1M
toompea | Thu 2026-05-28 15:43:23 UTC 1265740 993 991 SIGSEGV present /nix/store/230z9nyndgbn265mqlbyvf20z5wdciwy-bird-3.3.0/bin/bird 101.8M
toompea | Thu 2026-05-28 15:43:54 UTC 1265952 993 991 SIGSEGV present /nix/store/230z9nyndgbn265mqlbyvf20z5wdciwy-bird-3.3.0/bin/bird 105.9M
toompea | Thu 2026-05-28 15:44:22 UTC 1266165 993 991 SIGSEGV present /nix/store/230z9nyndgbn265mqlbyvf20z5wdciwy-bird-3.3.0/bin/bird 90.2M
toompea | Thu 2026-05-28 15:44:53 UTC 1266380 993 991 SIGSEGV present /nix/store/230z9nyndgbn265mqlbyvf20z5wdciwy-bird-3.3.0/bin/bird 100.6M
toompea | Thu 2026-05-28 19:45:24 UTC 1266590 993 991 SIGSEGV present /nix/store/230z9nyndgbn265mqlbyvf20z5wdciwy-bird-3.3.0/bin/bird 102.9M
toompea | Thu 2026-05-28 19:45:53 UTC 1355754 993 991 SIGSEGV present /nix/store/230z9nyndgbn265mqlbyvf20z5wdciwy-bird-3.3.0/bin/bird 88.6M
toompea | Thu 2026-05-28 19:46:22 UTC 1355959 993 991 SIGSEGV present /nix/store/230z9nyndgbn265mqlbyvf20z5wdciwy-bird-3.3.0/bin/bird 87.4M
toompea | Thu 2026-05-28 19:46:53 UTC 1356167 993 991 SIGSEGV present /nix/store/230z9nyndgbn265mqlbyvf20z5wdciwy-bird-3.3.0/bin/bird 100.8M
toompea | Thu 2026-05-28 19:47:22 UTC 1356371 993 991 SIGSEGV present /nix/store/230z9nyndgbn265mqlbyvf20z5wdciwy-bird-3.3.0/bin/bird 88.8M
toompea | Thu 2026-05-28 19:47:54 UTC 1356389 993 991 SIGSEGV present /nix/store/230z9nyndgbn265mqlbyvf20z5wdciwy-bird-3.3.0/bin/bird 104.8M
toompea | Succeeded
butte | No coredumps found.
butte | Succeeded
timah | TIME PID UID GID SIG COREFILE EXE SIZE
timah | Thu 2026-05-28 15:36:51 UTC 1268745 999 999 SIGSEGV present /nix/store/230z9nyndgbn265mqlbyvf20z5wdciwy-bird-3.3.0/bin/bird 96.9M
timah | Thu 2026-05-28 15:37:17 UTC 1268960 999 999 SIGSEGV present /nix/store/230z9nyndgbn265mqlbyvf20z5wdciwy-bird-3.3.0/bin/bird 98.6M
timah | Thu 2026-05-28 15:37:52 UTC 1268974 999 999 SIGSEGV present /nix/store/230z9nyndgbn265mqlbyvf20z5wdciwy-bird-3.3.0/bin/bird 95.9M
timah | Thu 2026-05-28 15:38:18 UTC 1269382 999 999 SIGSEGV present /nix/store/230z9nyndgbn265mqlbyvf20z5wdciwy-bird-3.3.0/bin/bird 89.2M
timah | Thu 2026-05-28 15:38:53 UTC 1269397 999 999 SIGSEGV present /nix/store/230z9nyndgbn265mqlbyvf20z5wdciwy-bird-3.3.0/bin/bird 95.4M
timah | Thu 2026-05-28 19:39:27 UTC 1269828 999 999 SIGSEGV present /nix/store/230z9nyndgbn265mqlbyvf20z5wdciwy-bird-3.3.0/bin/bird 131.6M
timah | Thu 2026-05-28 19:39:48 UTC 1359330 999 999 SIGSEGV present /nix/store/230z9nyndgbn265mqlbyvf20z5wdciwy-bird-3.3.0/bin/bird 76.2M
timah | Thu 2026-05-28 19:40:20 UTC 1359346 999 999 SIGSEGV present /nix/store/230z9nyndgbn265mqlbyvf20z5wdciwy-bird-3.3.0/bin/bird 99.8M
timah | Thu 2026-05-28 19:40:49 UTC 1359574 999 999 SIGSEGV present /nix/store/230z9nyndgbn265mqlbyvf20z5wdciwy-bird-3.3.0/bin/bird 94.1M
timah | Thu 2026-05-28 19:41:20 UTC 1359832 999 999 SIGSEGV present /nix/store/230z9nyndgbn265mqlbyvf20z5wdciwy-bird-3.3.0/bin/bird 98.9M
timah | Thu 2026-05-28 19:41:51 UTC 1360097 999 999 SIGSEGV present /nix/store/230z9nyndgbn265mqlbyvf20z5wdciwy-bird-3.3.0/bin/bird 93.9M
timah | Thu 2026-05-28 19:42:22 UTC 1360320 999 999 SIGSEGV present /nix/store/230z9nyndgbn265mqlbyvf20z5wdciwy-bird-3.3.0/bin/bird 88.9M
timah | Thu 2026-05-28 19:42:48 UTC 1360526 999 999 SIGSEGV present /nix/store/230z9nyndgbn265mqlbyvf20z5wdciwy-bird-3.3.0/bin/bird 78.8M
timah | Succeeded
highline | TIME PID UID GID SIG COREFILE EXE SIZE
highline | Thu 2026-05-28 11:34:31 UTC 1203740 999 999 SIGSEGV present /nix/store/afhrc47kar31rwmvbj0qrymm6xcpglvk-bird-3.3.0/bin/bird 99.4M
highline | Thu 2026-05-28 11:37:02 UTC 1204447 999 999 SIGSEGV present /nix/store/afhrc47kar31rwmvbj0qrymm6xcpglvk-bird-3.3.0/bin/bird 81.9M
highline | Thu 2026-05-28 11:40:00 UTC 1205349 999 999 SIGSEGV present /nix/store/afhrc47kar31rwmvbj0qrymm6xcpglvk-bird-3.3.0/bin/bird 108.9M
highline | Thu 2026-05-28 12:15:56 UTC 1220240 999 999 SIGSEGV present /nix/store/230z9nyndgbn265mqlbyvf20z5wdciwy-bird-3.3.0/bin/bird 80.3M
highline | Succeeded
baldy | No coredumps found.
baldy | Succeeded
kongo | No coredumps found.
kongo | Succeeded
roraima | No coredumps found.
roraima | Succeeded
lantau | No coredumps found.
lantau | Succeeded
cradle | No coredumps found.
cradle | Succeeded
| All done!
```

```console
$ sudo coredumpctl info
PID: 1356389 (bird)
UID: 993 (bird)
GID: 991 (bird)
Signal: 11 (SEGV)
Timestamp: Thu 2026-05-28 19:47:42 UTC (1h 37min ago)
Command Line: /nix/store/230z9nyndgbn265mqlbyvf20z5wdciwy-bird-3.3.0/bin/bird -c /etc/bird/bird.conf
Executable: /nix/store/230z9nyndgbn265mqlbyvf20z5wdciwy-bird-3.3.0/bin/bird
Control Group: /system.slice/bird.service
Unit: bird.service
Slice: system.slice
Boot ID: fd4c001e44dc44e1a05cbc5d60440ecb
Machine ID: 77293224076f4ff7845c2358cb35a4c0
Hostname: toompea
Storage: /var/lib/systemd/coredump/core.bird.993.fd4c001e44dc44e1a05cbc5d60440ecb.1356389.1779997662000000.zst (present)
Size on Disk: 104.8M
Message: Process 1356389 (bird) of user 993 dumped core.

Module /nix/store/230z9nyndgbn265mqlbyvf20z5wdciwy-bird-3.3.0/bin/bird without build-id.
Module libz.so.1 without build-id.
Module libssh.so.4 without build-id.
Stack trace of thread 1356389:
#0 0x000059d538758d08 bgp_show_proto_info.lto_priv.0 (/nix/store/230z9nyndgbn265mqlbyvf20z5wdciwy-bird-3.3.0/bin/bird + 0xdcd08)
#1 0x000059d538706fb2 proto_cmd_show (/nix/store/230z9nyndgbn265mqlbyvf20z5wdciwy-bird-3.3.0/bin/bird + 0x8afb2)
#2 0x000059d5387077c1 proto_apply_cmd.isra.0 (/nix/store/230z9nyndgbn265mqlbyvf20z5wdciwy-bird-3.3.0/bin/bird + 0x8b7c1)
#3 0x000059d5386a2de8 cf_parse.isra.0 (/nix/store/230z9nyndgbn265mqlbyvf20z5wdciwy-bird-3.3.0/bin/bird + 0x26de8)
#4 0x000059d5386ac462 cli_parse (/nix/store/230z9nyndgbn265mqlbyvf20z5wdciwy-bird-3.3.0/bin/bird + 0x30462)
#5 0x000059d5386f96e5 cli_command (/nix/store/230z9nyndgbn265mqlbyvf20z5wdciwy-bird-3.3.0/bin/bird + 0x7d6e5)
#6 0x000059d5386f9980 cli_event.lto_priv.0 (/nix/store/230z9nyndgbn265mqlbyvf20z5wdciwy-bird-3.3.0/bin/bird + 0x7d980)
#7 0x000059d5386dff35 ev_run_list_limited (/nix/store/230z9nyndgbn265mqlbyvf20z5wdciwy-bird-3.3.0/bin/bird + 0x63f35)
#8 0x000059d538693c8c main (/nix/store/230z9nyndgbn265mqlbyvf20z5wdciwy-bird-3.3.0/bin/bird + 0x17c8c)
#9 0x00007ae37602b285 __libc_start_call_main (libc.so.6 + 0x2b285)
#10 0x00007ae37602b338 __libc_start_main@@GLIBC_2.34 (libc.so.6 + 0x2b338)
#11 0x000059d538694885 _start (/nix/store/230z9nyndgbn265mqlbyvf20z5wdciwy-bird-3.3.0/bin/bird + 0x18885)

Stack trace of thread 1356391:
#0 0x00007ae3760a6922 __syscall_cancel_arch (libc.so.6 + 0xa6922)
#1 0x00007ae37609a00c __internal_syscall_cancel (libc.so.6 + 0x9a00c)
#2 0x00007ae37609a084 __syscall_cancel (libc.so.6 + 0x9a084)
#3 0x00007ae37611778e __poll (libc.so.6 + 0x11778e)
#4 0x000059d5387b3bfe bird_thread_main.lto_priv.0 (/nix/store/230z9nyndgbn265mqlbyvf20z5wdciwy-bird-3.3.0/bin/bird + 0x137bfe)
#5 0x00007ae37609dd53 start_thread (libc.so.6 + 0x9dd53)
#6 0x00007ae37612563c __clone3 (libc.so.6 + 0x12563c)
ELF object binary architecture: AMD x86-64
```

I temporarily worked around this by removing the prefix count shown when invoking `birdc show proto all`.
But I'm not sure whether this kind of patch is acceptable for upstreaming (please see the reasoning in the patch's commit message):

https://raw.githubusercontent.com/stepbrobd/inc/refs/heads/master/pkgs/bird3...

Best,
Yifei
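For anyone who wants to reproduce the trigger without the exporter, a single scrape can be approximated with a small C client. This is only a sketch under assumptions: it assumes the control socket accepts newline-terminated text commands the way `birdc` sends them, the helper names `fill_ctl_addr` and `scrape_once` are hypothetical, and it reads just the first chunk of whatever the daemon sends back (which may be only the greeting).

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Hypothetical helper: fill a UNIX-domain socket address.
 * Returns 0 on success, -1 if the path does not fit into sun_path. */
static int
fill_ctl_addr(struct sockaddr_un *sa, const char *path)
{
  assert(sa != NULL && path != NULL);

  if (strlen(path) >= sizeof(sa->sun_path))
    return -1;

  memset(sa, 0, sizeof(*sa));
  sa->sun_family = AF_UNIX;
  strcpy(sa->sun_path, path);
  return 0;
}

/* Hypothetical helper: connect to the control socket, send one command
 * line and read the first chunk of the reply into buf.
 * Returns the number of bytes read, or -1 on any error. */
static ssize_t
scrape_once(const char *path, const char *cmd, char *buf, size_t len)
{
  struct sockaddr_un sa;
  if (fill_ctl_addr(&sa, path) < 0)
    return -1;

  int fd = socket(AF_UNIX, SOCK_STREAM, 0);
  if (fd < 0)
    return -1;

  ssize_t n = -1;
  size_t cmd_len = strlen(cmd);

  if (connect(fd, (struct sockaddr *) &sa, sizeof(sa)) == 0 &&
      write(fd, cmd, cmd_len) == (ssize_t) cmd_len)
    n = read(fd, buf, len);

  close(fd);
  return n;
}
```

Calling `scrape_once("/run/bird/bird.ctl", "show protocols all\n", buf, sizeof buf)` in a loop with a 30 second sleep would mimic the exporter's cadence described above; whether that alone reproduces the crash still depends on concurrent BGP churn.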
This is a follow-up to my earlier report.

Patch 1/2 is the workaround I already linked. In short: SIGSEGV in bgp_show_proto_info() when prometheus-bird-exporter runs `show protocols all` every 30s under BGP churn (trace in the previous mail). The patch drops the per-bucket prefix count so the walk no longer dereferences a stale bucket prefix slot (this is a very minimal workaround).

Patch 2/2 addresses a SIGABRT from an out_limit counter underflow; please see the trace below:

```console
$ colmena exec --verbose --on butte,timah -- sudo coredumpctl list
timah |
butte |
timah | TIME PID UID GID SIG COREFILE EXE SIZE
timah | Wed 2026-06-03 01:50:12 UTC 1177 999 999 SIGABRT present /nix/store/xyxkk8dfzz668lmksbb73rp5qzz88r2a-bird-3.3.0/bin/bird 109.3M
timah | Wed 2026-06-03 09:31:36 UTC 1429113 999 999 SIGABRT present /nix/store/xyxkk8dfzz668lmksbb73rp5qzz88r2a-bird-3.3.0/bin/bird 116M
timah | Succeeded
butte | TIME PID UID GID SIG COREFILE EXE SIZE
butte | Wed 2026-06-03 07:49:46 UTC 1569499 993 991 SIGABRT present /nix/store/xyxkk8dfzz668lmksbb73rp5qzz88r2a-bird-3.3.0/bin/bird 124.2M
butte | Wed 2026-06-03 08:09:28 UTC 1630583 993 991 SIGABRT present /nix/store/xyxkk8dfzz668lmksbb73rp5qzz88r2a-bird-3.3.0/bin/bird 120.2M
butte | Wed 2026-06-03 08:14:44 UTC 1638833 993 991 SIGABRT present /nix/store/xyxkk8dfzz668lmksbb73rp5qzz88r2a-bird-3.3.0/bin/bird 119.9M
butte | Wed 2026-06-03 08:40:10 UTC 1641543 993 991 SIGABRT present /nix/store/xyxkk8dfzz668lmksbb73rp5qzz88r2a-bird-3.3.0/bin/bird 121.6M
butte | Wed 2026-06-03 08:49:22 UTC 1652085 993 991 SIGABRT present /nix/store/xyxkk8dfzz668lmksbb73rp5qzz88r2a-bird-3.3.0/bin/bird 120.1M
butte | Wed 2026-06-03 10:37:51 UTC 1656139 993 991 SIGABRT present /nix/store/xyxkk8dfzz668lmksbb73rp5qzz88r2a-bird-3.3.0/bin/bird 125M
butte | Succeeded
| All done!
```

```console
$ sudo coredumpctl info
PID: 1656139 (bird)
UID: 993 (bird)
GID: 991 (bird)
Signal: 6 (ABRT)
Timestamp: Wed 2026-06-03 10:37:05 UTC (2h 15min ago)
Command Line: /nix/store/xyxkk8dfzz668lmksbb73rp5qzz88r2a-bird-3.3.0/bin/bird -c /etc/bird/bird.conf
Executable: /nix/store/xyxkk8dfzz668lmksbb73rp5qzz88r2a-bird-3.3.0/bin/bird
Control Group: /system.slice/bird.service
Unit: bird.service
Slice: system.slice
Boot ID: dfa9f57ca76345e98c08abe21afcc0bd
Machine ID: 7d543c30c6554cebbc9c4c6e94f78247
Hostname: butte
Storage: /var/lib/systemd/coredump/core.bird.993.dfa9f57ca76345e98c08abe21afcc0bd.1656139.1780483025000000.zst (present)
Size on Disk: 125M
Message: Process 1656139 (bird) of user 993 dumped core.

Module /nix/store/xyxkk8dfzz668lmksbb73rp5qzz88r2a-bird-3.3.0/bin/bird without build-id.
Module libz.so.1 without build-id.
Module libssh.so.4 without build-id.
Stack trace of thread 1656139:
#0 0x000079eb7089fdcc __pthread_kill_implementation (libc.so.6 + 0x9fdcc)
#1 0x000079eb7084265e raise (libc.so.6 + 0x4265e)
#2 0x000079eb70829350 abort (libc.so.6 + 0x29350)
#3 0x0000595bc856aeca bug (/nix/store/xyxkk8dfzz668lmksbb73rp5qzz88r2a-bird-3.3.0/bin/bird + 0x13beca)
#4 0x0000595bc84d9f85 limit_pop.part.0.lto_priv.0 (/nix/store/xyxkk8dfzz668lmksbb73rp5qzz88r2a-bird-3.3.0/bin/bird + 0xaaf85)
#5 0x0000595bc84cc5e4 do_rt_notify (/nix/store/xyxkk8dfzz668lmksbb73rp5qzz88r2a-bird-3.3.0/bin/bird + 0x9d5e4)
#6 0x0000595bc84ccbed rt_notify_basic (/nix/store/xyxkk8dfzz668lmksbb73rp5qzz88r2a-bird-3.3.0/bin/bird + 0x9dbed)
#7 0x0000595bc84cd07b channel_notify_optimal_req (/nix/store/xyxkk8dfzz668lmksbb73rp5qzz88r2a-bird-3.3.0/bin/bird + 0x9e07b)
#8 0x0000595bc84cd3c0 channel_notify_optimal (/nix/store/xyxkk8dfzz668lmksbb73rp5qzz88r2a-bird-3.3.0/bin/bird + 0x9e3c0)
#9 0x0000595bc8492f35 ev_run_list_limited (/nix/store/xyxkk8dfzz668lmksbb73rp5qzz88r2a-bird-3.3.0/bin/bird + 0x63f35)
#10 0x0000595bc8447001 main (/nix/store/xyxkk8dfzz668lmksbb73rp5qzz88r2a-bird-3.3.0/bin/bird + 0x18001)
#11 0x000079eb7082b285 __libc_start_call_main (libc.so.6 + 0x2b285)
#12 0x000079eb7082b338 __libc_start_main@@GLIBC_2.34 (libc.so.6 + 0x2b338)
#13 0x0000595bc8447885 _start (/nix/store/xyxkk8dfzz668lmksbb73rp5qzz88r2a-bird-3.3.0/bin/bird + 0x18885)

Stack trace of thread 1656141:
#0 0x000079eb708a6922 __syscall_cancel_arch (libc.so.6 + 0xa6922)
#1 0x000079eb7089a00c __internal_syscall_cancel (libc.so.6 + 0x9a00c)
#2 0x000079eb7089a084 __syscall_cancel (libc.so.6 + 0x9a084)
#3 0x000079eb7091778e __poll (libc.so.6 + 0x11778e)
#4 0x0000595bc8566b3e bird_thread_main.lto_priv.0 (/nix/store/xyxkk8dfzz668lmksbb73rp5qzz88r2a-bird-3.3.0/bin/bird + 0x137b3e)
#5 0x000079eb7089dd53 start_thread (libc.so.6 + 0x9dd53)
#6 0x000079eb7092563c __clone3 (libc.so.6 + 0x12563c)
ELF object binary architecture: AMD x86-64
```

Yifei Sun (2):
  BGP: skip bgp_bucket_pending() in show proto info
  Table export: don't pop out_limit for a never-exported old route

 nest/rt-table.c |  4 ++++
 proto/bgp/bgp.c | 10 ++++------
 2 files changed, 8 insertions(+), 6 deletions(-)

--
2.54.0
bgp_show_proto_info() walks each BGP channel's tx->bucket_queue under BGP_PTX_LOCK and calls bgp_bucket_pending() per bucket to count the prefixes still queued for transmission. bgp_bucket_pending() descends into the bucket's prefix tree (proto/bgp/attrs.c) and dereferences row->array[pos].pref->cur_buck without tolerating slots whose .pref is NULL or holds a stale/torn value.

In production we see this crash continuously on bird 3.3.0 when prometheus-bird-exporter scrapes the daemon every ~30 seconds. All recorded SIGSEGV cores across multiple hosts land in this code path (verified via gdb against fresh core files).

The simplest defensive change is to drop the prefix count from "show protocols all". The bucket count is still useful, and the WALK_LIST itself just walks list nodes under the lock, which is safe. Operators lose the "with total N prefixes to send" detail but gain a daemon that does not segfault on every scrape.

Downstream stop-gap until the underlying race in the prefix tree walker is properly diagnosed and fixed upstream.
---
 proto/bgp/bgp.c | 10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/proto/bgp/bgp.c b/proto/bgp/bgp.c
index 4f726d7ee..6fcdc5f98 100644
--- a/proto/bgp/bgp.c
+++ b/proto/bgp/bgp.c
@@ -4274,17 +4274,15 @@ bgp_show_proto_info(struct proto *P)
         BGP_PTX_LOCK(c->tx, tx);
+        /* bgp_bucket_pending() walks the prefix tree and dereferences slots
+         * whose .pref can be stale during concurrent TX, segfaulting bird on
+         * every "show protocols all" scrape. Skip the prefix count. */
         uint bucket_cnt = 0;
-        uint prefix_cnt = 0;
         struct bgp_bucket *buck;
         WALK_LIST(buck, tx->bucket_queue)
-        {
           bucket_cnt++;
-          prefix_cnt += bgp_bucket_pending(buck);
-        }
-        cli_msg(-1006, " Pending %u attribute sets with total %u prefixes to send",
-            bucket_cnt, prefix_cnt);
+        cli_msg(-1006, " Pending %u attribute sets to send", bucket_cnt);
       }
     }
   }
--
2.54.0
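To make the failure mode in the commit message above concrete, here is a minimal C sketch of a slot walk that tolerates empty slots. Everything in it is an assumption for illustration: `toy_bucket`, `toy_prefix`, `toy_slot` and `toy_bucket_pending()` are hypothetical stand-ins, not BIRD's real structures or its bgp_bucket_pending(). Note that a NULL check only covers the empty-slot case; a stale or torn pointer cannot be detected by the reader and would need proper synchronization with the writer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; NOT BIRD's real structures. */
struct toy_bucket { int id; };
struct toy_prefix { struct toy_bucket *cur_buck; };
struct toy_slot   { struct toy_prefix *pref; };

/* Count the slots whose prefix currently belongs to `buck`.
 * Empty (NULL) slots are skipped instead of being dereferenced.
 * This guards against NULL only: a stale or torn pointer can only be
 * ruled out by synchronizing with whoever updates the slots. */
static unsigned
toy_bucket_pending(const struct toy_slot *slots, size_t n,
                   const struct toy_bucket *buck)
{
  assert(slots != NULL || n == 0);

  unsigned cnt = 0;
  for (size_t i = 0; i < n; i++)
  {
    const struct toy_prefix *p = slots[i].pref;
    if (p == NULL)
      continue;
    if (p->cur_buck == buck)
      cnt++;
  }
  return cnt;
}
```

The patch above takes the more conservative route of not walking the prefixes at all, which avoids the question of whether the reader holds the right lock for the slot contents.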
Reverts commit 682d83eaa37893dcaf7527c326fd4379ddff4d37
  Table export: Drop redundant not-seen old route nullification

Forward-ported past refactor 34a8a2749b1cab415c48b2cbdb05bc8faf345374
  Table: Optimal and Any Export refactoring
---
 nest/rt-table.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/nest/rt-table.c b/nest/rt-table.c
index 2b35bf110..e1c32b8d7 100644
--- a/nest/rt-table.c
+++ b/nest/rt-table.c
@@ -1431,6 +1431,10 @@ rt_notify_basic(struct channel *c, const rte *new, const rte *old, const rte *tr
       /* Treat old rejected as never seen. */
       old = NULL;
     }
+    else if (!bmap_test(&c->export_accepted_map, old->id))
+      /* In neither map => never exported on this channel, so do_rt_notify()
+       * must not pop out_limit for it (would underflow: limit.h:35). */
+      old = NULL;
     /* Accepted bit is dropped in do_rt_notify() */
   }
--
2.54.0
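The guard added by this patch can be illustrated with a toy model of the accounting. This is a sketch under stated assumptions, not BIRD code: `toy_channel`, `toy_map_test()` and `toy_withdraw()` are hypothetical names, the two maps are plain 64-bit words rather than BIRD's bitmaps, and the `assert` merely stands in for the abort seen in the backtrace (bug() reached from limit_pop via do_rt_notify).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model for illustration; NOT BIRD's real types. */
struct toy_channel {
  uint64_t accepted_map;   /* bit i set: route id i was exported */
  uint64_t rejected_map;   /* bit i set: route id i was rejected */
  unsigned out_count;      /* routes currently counted against the limit */
};

static bool
toy_map_test(uint64_t map, unsigned id)
{
  return id < 64 && ((map >> id) & 1u);
}

/* Handle the withdrawal of old route `old_id`.
 * Returns true if the export counter was decremented. */
static bool
toy_withdraw(struct toy_channel *c, unsigned old_id)
{
  if (toy_map_test(c->rejected_map, old_id))
  {
    /* Old route was rejected: forget it, nothing was counted. */
    c->rejected_map &= ~(UINT64_C(1) << old_id);
    return false;
  }

  if (!toy_map_test(c->accepted_map, old_id))
    /* In neither map: never exported, so there is nothing to pop. */
    return false;

  /* Old route was exported: drop the accepted bit and pop the counter.
   * Without the check above, an unseen route could reach this point
   * with out_count == 0 and underflow. */
  assert(c->out_count > 0);
  c->accepted_map &= ~(UINT64_C(1) << old_id);
  c->out_count--;
  return true;
}
```

In this model, withdrawing an id that is in neither map leaves `out_count` untouched, which is the invariant the four added lines restore.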