Hi,

We are an IXP running 2x route servers with BIRD, each running separate daemons for IPv4 and IPv6. We are running BIRD 2.0.8-1 on Debian 10 and have around 250 peers, ~150k routes on v4 and ~50k routes on v6.

Since upgrading to BIRD 2 nearly 3 years ago, it was really stable until May this year. Since then we've had 3 crashes of the v4 daemon on one of the servers. The v6 daemon on that server has been fine, as has the second route server, which runs the same setup with the same peers and therefore, in theory, the same routes.

The first two of these crashes happened a week apart, after which I rebooted the VM to ensure everything was clean. It was then fine for 90 days, but crashed in the same way yesterday.

Our BIRD configuration is generated by IXP Manager and updated hourly. We then run a "bird re-validate" cron job every hour (at twenty past the hour):

/usr/sbin/birdc -s /run/bird-ipv6.ctl reload in all > /dev/null ; /usr/sbin/birdc -s /run/bird-ipv4.ctl reload in all

Interestingly, all 3 crashes have happened just after twenty past the hour, i.e. soon after this cron job has run.

It looks like the following in the logs:

Aug 17 17:20:01 rs1 CRON[29229]: (root) CMD (/usr/sbin/birdc -s /run/bird-ipv6.ctl reload in all > /dev/null ; /usr/sbin/birdc -s /run/bird-ipv4.ctl reload in all > /dev/null)
Aug 17 17:20:01 rs1 bird: Reloading protocol device1
Aug 17 17:20:01 rs1 bird: Reloading protocol pp_0121_asxx
..etc..
Aug 17 17:20:01 rs1 bird: Reloading protocol pp_1082_asxxxxxx
Aug 17 17:20:01 rs1 bird: Reloading protocol pb_1082_asxxxxxx
Aug 17 17:20:01 rs1 bird: Tagging invalid ROA 2001:xxxx:xxxx::/48 for ASN xxxxx
..etc..
Aug 17 17:21:17 rs1 bird: Tagging invalid ROA x.x.x.x/23 for ASN xxxx
Aug 17 17:21:19 rs1 kernel: [7811815.959943] bird[586]: segfault at f30021 ip 000055a1bf450fc3 sp 00007ffe64f3da98 error 4 in bird[55a1bf42a000+d8000]
Aug 17 17:21:19 rs1 kernel: [7811815.966760] Code: 95 78 01 00 00 5b 5d 41 5c c3 0f 1f 00 66 2e 0f 1f 84 00 00 00 00 00 48 85 ff b8 01 00 00 00 74 15 48 85 f6 0f 84 a6 00 00 00 <0f> b6 46 21 0f b6 57 21 29 d0 74 11 f3 c3 0f 1f 44 00 00 66 2e 0f
Aug 17 17:21:19 rs1 systemd[1]: bird-ipv4.service: Main process exited, code=killed, status=11/SEGV
Aug 17 17:21:19 rs1 systemd[1]: bird-ipv4.service: Failed with result 'signal'.
Aug 17 17:21:19 rs1 systemd[1]: bird-ipv4.service: Service RestartSec=100ms expired, scheduling restart.
Aug 17 17:21:19 rs1 systemd[1]: bird-ipv4.service: Scheduled restart job, restart counter is at 1.
Aug 17 17:21:19 rs1 systemd[1]: Stopped BIRD - ipv4.
Aug 17 17:21:19 rs1 systemd[1]: Starting BIRD - ipv4...
Aug 17 17:21:22 rs1 systemd[1]: Started BIRD - ipv4.
Aug 17 17:21:22 rs1 bird: Started

When the second crash happened, we happened to be at RIPE84, so we chatted to Maria in person. She said that it was possible to debug it, but that she would need a core dump.

After looking into this, I did:

ulimit -S -c unlimited

and installed the systemd-coredump package, which was supposed to dump a core file if a process crashed. I tested this by killing a sleep command from the shell with kill -s 6 and it worked.

When the crash happened again yesterday, I hoped to have a core file to send, but there is no sign of it having generated one :(

Testing on a test server, killing sleep generates a core file, but killing bird does not.

So two things - has anyone experienced similar crashes, or does anyone have any ideas why we might be seeing this?

Can anyone advise how to reliably get a core dump if bird crashes?

Thanks!

Ian
Hi Ian, all,

Ian Chilton wrote on 18/08/2022 16:57:
We then run a "bird re-validate" cron job every hour (at twenty past the hour): /usr/sbin/birdc -s /run/bird-ipv6.ctl reload in all > /dev/null ; /usr/sbin/birdc -s /run/bird-ipv4.ctl reload in all
Interestingly all 3 crashes have happened at just after twenty past the hour, i.e soon after this cron job has run.
As you're running Bird 2.0.8, this should no longer be necessary. Per 2.0.8's release logs:

Version 2.0.8 (2021-03-18)
  o Automatic channel reloads based on RPKI changes

So, given that all three crashes appear linked to this, stopping those manual reloads should, hopefully, return you to stability.

You're also two bugfix releases behind. At INEX we've been running 2.0.9 for ~5/6 months now without issue. There appear to be a lot of bugfixes between 2.0.8 and 2.0.10, so it might be worthwhile updating, or checking the git commit logs to see if there's anything relevant to RPKI in there.

hth,
- Barry
--
Kind regards,
Barry O'Donovan
Consultant
For and on behalf of INEX
https://www.inex.ie/support/
+353 1 531 3339
Hi Barry!

On Thu, 18 Aug 2022, at 6:08 PM, Barry O'Donovan (INEX) wrote:
As you're running Bird 2.0.8 this should be no longer necessary. Per 2.0.8's release logs:
Version 2.0.8 (2021-03-18) o Automatic channel reloads based on RPKI changes
Ah, that's interesting! So you've just removed that cron job on yours and it's all fine? Nothing needed to enable that?

Maybe it's some conflict between doing a manual reload and the newly added automatic stuff that's the issue.

The odd thing is, all 3 of our crashes have been on our 'rs1'. Thankfully, but oddly, 'rs2' has been fine, with the exact same config/peers.
You're also two bugfix releases behind. At INEX we've been running 2.0.9 for ~5/6 months now without issue.
When I spoke to Maria in May, I understood her to say that there was nothing major in 2.0.9 that would have an effect on this... but 2.0.10 has been released since then (June), so we are a bit behind now.

I also need to upgrade to Debian 11, so I'll have to arrange a maintenance window to drop in a newly built, up-to-date VM!

Thanks,
Ian
On Fri, Aug 19, 2022 at 09:13:55AM +0100, Ian Chilton wrote:
Hi Barry!
On Thu, 18 Aug 2022, at 6:08 PM, Barry O'Donovan (INEX) wrote:
As you're running Bird 2.0.8 this should be no longer necessary. Per 2.0.8's release logs:
Version 2.0.8 (2021-03-18) o Automatic channel reloads based on RPKI changes
Ah that's interesting! - so you've just removed that cron job on yours and it's all fine? - nothing needed to enable that?
Maybe it's some conflict between doing a manual reload and the newly added automatic stuff that's the issue. The odd thing is, all 3 of our crashes have been on our 'rs1'. Thankfully, but oddly, 'rs2' has been fine, with the exact same config/peers.
Hi,

Just a clarification on automatic reload based on RPKI changes: it is enabled by default (controlled by the option 'rpki reload'), but for BGP it requires 'import table', which is not enabled by default.

If you filter RPKI on a pipe from one table to another, you do not need to enable anything, but if you filter RPKI in the BGP import filter, you have to enable 'import table' or do the reload manually as before.

https://bird.network.cz/?get_doc&v=20&f=bird-3.html#proto-rpki-reload
https://bird.network.cz/?get_doc&v=20&f=bird-6.html#bgp-import-table

--
Elen sila lumenn' omentielvo

Ondrej 'Santiago' Zajicek (email: santiago@crfreenet.org)
OpenPGP encrypted e-mails preferred (KeyID 0x11DEADC3, wwwkeys.pgp.net)
"To err is human -- to blame it on a computer is even more so."
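The 'import table' requirement described above can be illustrated with a short sketch. The shell function below only prints an illustrative BIRD 2 BGP fragment; the protocol name, filter name, AS numbers and neighbor address are all hypothetical placeholders, and the fragment is not a complete or tested route-server configuration.

```shell
#!/bin/sh
# Print an illustrative BIRD 2 BGP fragment with the import table enabled.
# Every name and number here is a placeholder, not taken from the thread.
bgp_fragment() {
    cat <<'EOF'
protocol bgp pb_example {
    local as 64496;
    neighbor 192.0.2.10 as 64511;
    ipv4 {
        import table yes;                # keep received routes so they can be re-filtered
        import filter f_import_example;  # hypothetical filter doing the ROA check
        export none;
    };
}
EOF
}

bgp_fragment
```

With the import table in place, the automatic reload has the pre-filter routes available to re-run through the import filter, which is what the hourly `reload in all` was doing by hand.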
On 2022-08-18 17:57, Ian Chilton wrote:
When the crash happened again yesterday, I hoped to have a core file to send, but there is no sign of it having generated one :(
This works for me. What is "coredumpctl" saying about the crash ("coredumpctl info -1")? If you installed bird from a package, you may also want to install bird-dbgsym to help debugging (but this is not necessary to get the coredump).
Hi Vincent,

It doesn't - it just shows the test I did by killing sleep, which is the only thing that `coredumpctl list` shows (and there is only that one file in /var/lib/systemd/coredump/).

We start BIRD from systemd, with some (custom) unit files.

Did you just install systemd-coredump and now it creates core dumps for all processes, including your bird processes, without any further changes? (That's what my reading implied.)

Thanks,
Ian

On Thu, 18 Aug 2022, at 7:00 PM, Vincent Bernat wrote:
This works for me. What is "coredumpctl" saying about the crash ("coredumpctl info -1")? If you installed bird from a package, you may also want to install bird-dbgsym to help debugging (but this is not necessary to get the coredump).
On 2022-08-19 10:24, Ian Chilton wrote:
It doesn't - it just shows the test I did by killing sleep, which is the only thing that `coredumpctl list` shows (and there is only that one file in /var/lib/systemd/coredump/).
We start BIRD from systemd, with some (custom) unit files.
Did you just install systemd-coredump and now it creates core dumps for all processes, including your bird processes without any further changes (that's what I read implied).
I have had systemd-coredump installed for quite some time, and yes, it creates core dumps for all processes, with a few exceptions (ptraced processes, for example). I am using its default configuration.
On Thu, Aug 18, 2022 at 04:57:44PM +0100, Ian Chilton wrote:
Hi,
After looking in to this, I did:
ulimit -S -c unlimited and installed the systemd-coredump package.
...which was supposed to dump a core file if a process crashed. I tested this by killing a sleep command from the shell with kill -s 6 and it worked.
Hi,

This does not work, because ulimit does not set system-wide limits, only the limits of the current shell (and its subprocesses). That is why it worked for the sleep command.

But if you run BIRD from systemd, then the systemctl command to start/restart BIRD just tells systemd to start it as a child of init (systemd), not as a child of the current shell. So it will not inherit the ulimit of the current shell. You likely need something like modifying the BIRD unit to change the ulimit.

--
Elen sila lumenn' omentielvo

Ondrej 'Santiago' Zajicek (email: santiago@crfreenet.org)
OpenPGP encrypted e-mails preferred (KeyID 0x11DEADC3, wwwkeys.pgp.net)
"To err is human -- to blame it on a computer is even more so."
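The inheritance behaviour described above can be checked directly. This is a minimal sketch assuming Linux, where the limits a process actually runs with are readable from /proc; `core_limit_of` is a hypothetical helper name, not part of BIRD or systemd.

```shell
#!/bin/sh
# Print the soft and hard core-file limits a process is running with.
# Reads /proc/<pid>/limits (Linux only). core_limit_of is a hypothetical helper.
core_limit_of() {
    awk '/^Max core file size/ { print "soft=" $5, "hard=" $6 }' "/proc/$1/limits"
}

# A limit set with ulimit applies to this shell and is inherited by its children:
( ulimit -S -c 0; sh -c 'ulimit -S -c' )   # the child reports a soft limit of 0

# A daemon started by systemd is not a child of this shell, so it has to be
# inspected by PID instead, for example:
#   core_limit_of "$(systemctl show -p MainPID --value bird-ipv4.service)"
core_limit_of "$$"
```

If the value reported for the running bird process is 0, no amount of `ulimit` in an interactive shell will change it; the limit has to come from the unit that starts the daemon.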
Hi Ondrej,

That makes sense! We do use systemd to manage the bird processes.

Do you have it successfully creating core dumps on crashes with systemd? Would you mind sharing your unit file?

Thanks!

On Fri, 19 Aug 2022, at 12:48 AM, Ondrej Zajicek wrote:
This does not work, because ulimit does not set system-wide limits, only the limits of the current shell (and its subprocesses). That is why it worked for the sleep command.
But if you run BIRD from systemd, then the systemctl command to start/restart BIRD just tells systemd to start it as a child of init (systemd), not as a child of the current shell. So it will not inherit the ulimit of the current shell. You likely need something like modifying the BIRD unit to change the ulimit.
Hi Ian,

On Fri, Aug 19, 2022 at 09:26:16AM +0100, Ian Chilton wrote:
We do use systemd to manage the bird processes. Do you have it successfully creating core dumps on crashes with systemd? - would you mind sharing your unit file?

You have to set ``LimitCORE=unlimited'' (or a similar value). I tested it with ``pkill -11 bird'' and got a coredump:

TIME                          PID    UID  GID  SIG      COREFILE  EXE            SIZE
Tue 2022-08-23 18:15:47 CEST  85265  976  976  SIGSEGV  present   /usr/bin/bird  429.9K

The config I'm currently using is below (I only changed the '-R' and added the LimitCORE for this test):

-----------------------------------------------------------------------
# /etc/systemd/system/bird.service
[Unit]
Description=BIRD routing daemon
After=network.target

[Service]
Type=forking
LimitCORE=unlimited
ExecStart=/usr/bin/bird -R
ExecReload=/usr/bin/birdc configure
ExecStop=/usr/bin/birdc down
RuntimeDirectory=bird
RuntimeDirectoryMode=0750
DynamicUser=true
User=bird
ProtectSystem=strict
ProtectHome=true
ProtectKernelTunables=true
ProtectControlGroups=true
PrivateTmp=true
PrivateDevices=true
CapabilityBoundingSet=CAP_NET_ADMIN CAP_NET_BIND_SERVICE CAP_NET_RAW
AmbientCapabilities=CAP_NET_ADMIN CAP_NET_BIND_SERVICE CAP_NET_RAW

[Install]
WantedBy=multi-user.target
-----------------------------------------------------------------------

Greetings
Inrin
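For an existing custom unit, the same setting can be added with a drop-in rather than by editing the unit file. The sketch below is one way to do that, under stated assumptions: `write_core_dropin` and the file name `coredump.conf` are hypothetical, the unit name `bird-ipv4.service` is taken from the logs earlier in the thread, and `infinity` is used because it is the value systemd.exec documents for "no limit".

```shell
#!/bin/sh
# Write a systemd drop-in that raises the core-file limit for one unit.
# write_core_dropin is a hypothetical helper. $1 is the systemd configuration
# root (normally /etc/systemd/system), $2 is the unit name.
write_core_dropin() {
    root=$1
    unit=$2
    dir="$root/$unit.d"
    mkdir -p "$dir" || return 1
    cat > "$dir/coredump.conf" <<'EOF'
[Service]
LimitCORE=infinity
EOF
}

# Typical use, as root:
#   write_core_dropin /etc/systemd/system bird-ipv4.service
#   systemctl daemon-reload
#   systemctl restart bird-ipv4.service   # restarts the daemon; pick a quiet moment
#   coredumpctl list bird                 # after a crash, check that a core was stored
```

The limit only applies to processes started after the change, so the daemon has to be restarted once for it to take effect.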
participants (5)
- Barry O'Donovan (INEX)
- Ian Chilton
- Inrin
- Ondrej Zajicek
- Vincent Bernat