Inserting fulltable into kernel FIB makes bird crazy
Hello,

Now that the IPv6 bug is supposed to be resolved since 5.8, I tried to upgrade a router from 4.14 to 5.10.

Bird starts, however while inserting routes into the FIB I get long I/O loop cycles, and at some point bird is unable to keep up. I already recompiled bird in case of a header change or something like that, and switched to a pre-compiled kernel; neither had any effect.

When bird begins to lose track of itself, I get this kind of message:

Sep 24 08:44:43 edge04-hostzealot bird: Netlink: File exists
Sep 24 08:44:43 edge04-hostzealot bird: Netlink: File exists
Sep 24 08:44:43 edge04-hostzealot bird: Netlink: File exists
Sep 24 08:44:43 edge04-hostzealot bird: Netlink: File exists
Sep 24 08:44:43 edge04-hostzealot bird: Netlink: File exists
Sep 24 08:44:43 edge04-hostzealot bird: ...
Sep 24 08:44:43 edge04-hostzealot bird: I/O loop cycle took 28703 ms for 1 events
Sep 24 08:44:43 edge04-hostzealot bird: Kernel dropped some netlink messages, will resync on next scan.
Sep 24 08:45:50 edge04-hostzealot bird: Netlink: File exists
Sep 24 08:45:50 edge04-hostzealot bird: Netlink: File exists
Sep 24 08:45:50 edge04-hostzealot bird: Netlink: File exists
Sep 24 08:45:50 edge04-hostzealot bird: Netlink: File exists
Sep 24 08:45:50 edge04-hostzealot bird: Netlink: File exists
Sep 24 08:45:50 edge04-hostzealot bird: ...
Sep 24 08:45:51 edge04-hostzealot bird: I/O loop cycle took 36201 ms for 1 events

And then OSPF begins to flap and routes are re-calculated based on the remaining BGP ones.

Sep 24 08:46:54 edge04-hostzealot bird: Next hop address 185.107.95.180 resolvable through recursive route for 185.107.92.0/22

(I have a way more specific route in OSPF.)

I activated the debug, and I can see that bird is re-scanning the entire kernel table when the “I/O loop” message appears:

Sep 24 09:07:30 edge04-hostzealot bird: kernel_grt_ipv4: 1.0.0.0/24: seen
Sep 24 09:07:30 edge04-hostzealot bird: kernel_grt_ipv4: 1.0.4.0/24: seen

And it tries to insert routes that are already inserted:

Sep 24 09:08:04 edge04-hostzealot bird: kernel_grt_ipv4: 122.76.248.0/23: installing
Sep 24 09:08:04 edge04-hostzealot bird: Netlink: File exists

And then OSPF clearly goes down:

Sep 24 09:08:04 edge04-hostzealot bird: ospf_ipv4: Inactivity timer expired for nbr 45.91.126.248 on gre4
Sep 24 09:08:04 edge04-hostzealot bird: ospf_ipv4: Neighbor 45.91.126.248 on gre4 changed state from Full to Down
Sep 24 09:08:04 edge04-hostzealot bird: ospf_ipv4: Neighbor 45.91.126.248 on gre4 removed

Here are some more detailed logs:
https://paste.swordarmor.fr/raw/HX45
https://paste.swordarmor.fr/raw/oM9s

This server isn’t the fastest one on the market, but it is well enough equipped to handle full views. And with an older kernel it works very well.

I have RRs running on 5.10 kernels, so it’s more likely a kernel issue, but I’m not able to determine if it’s caused by the kernel itself or by the way bird is using netlink.

I’m using bird 2.0.8, I didn’t try an older version.

Thanks a lot,
--
Alarig Le Lay
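For anyone trying to compare runs (for example before and after a kernel or bird upgrade), the symptoms above can be counted straight from syslog. The following is only a minimal sketch in Python: the function name `summarize` and the `SAMPLE` variable are made up for illustration, and the sample lines are a few of the ones quoted in the report. It counts the "Netlink: File exists" and "Kernel dropped some netlink messages" lines and collects the reported I/O loop durations.

```python
import re
from collections import Counter

# A few lines copied from the report above; in practice you would read
# them from your own syslog file instead (path depends on your system).
SAMPLE = """\
Sep 24 08:44:43 edge04-hostzealot bird: Netlink: File exists
Sep 24 08:44:43 edge04-hostzealot bird: Netlink: File exists
Sep 24 08:44:43 edge04-hostzealot bird: I/O loop cycle took 28703 ms for 1 events
Sep 24 08:44:43 edge04-hostzealot bird: Kernel dropped some netlink messages, will resync on next scan.
Sep 24 08:45:50 edge04-hostzealot bird: Netlink: File exists
Sep 24 08:45:51 edge04-hostzealot bird: I/O loop cycle took 36201 ms for 1 events
"""

# Matches the duration in bird's "I/O loop cycle took N ms for M events" warning.
IO_LOOP = re.compile(r"I/O loop cycle took (\d+) ms for (\d+) events")


def summarize(lines):
    """Return (counts, stalls_ms) for an iterable of syslog lines.

    counts is a Counter with the keys "file_exists" and "dropped";
    stalls_ms is the list of I/O loop durations in milliseconds, in log order.
    """
    counts = Counter()
    stalls_ms = []
    for line in lines:
        if "Netlink: File exists" in line:
            counts["file_exists"] += 1
        elif "Kernel dropped some netlink messages" in line:
            counts["dropped"] += 1
        else:
            match = IO_LOOP.search(line)
            if match:
                stalls_ms.append(int(match.group(1)))
    return counts, stalls_ms


counts, stalls_ms = summarize(SAMPLE.splitlines())
print("File exists errors:", counts["file_exists"])
print("Dropped-message warnings:", counts["dropped"])
print("Longest I/O loop cycle (ms):", max(stalls_ms))
```

Running the same summary over the logs of a working and a failing setup gives a rough, comparable measure of how bad the stalls are; it does not say anything about the cause.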
Hello,

I gave the 2.0.9 git snapshot (71c9484b00b4428ae6c7d7c8eea6d96073683a54) a try tonight, and it seems to fix the issue for me. I’ve not tested on 5.10 though, as the LTS is now 5.15. However, I did test 5.15 with 2.0.8 and I had the same behaviour.

The VM has been up for 6h now and everything is stable. Before, the logs were flooded within an hour.

On Fri 24 Sep 2021 23:29:25 GMT, Alarig Le Lay wrote:
Hello,
Now that the IPv6 bug is supposed to be resolved since 5.8, I tried to upgrade a router from 4.14 to 5.10
Bird starts, however while inserting routes to FIB, I have long I/O loop cycles and at some point bird is unable to keep up. I already recompiled bird in case of a header change or something like that, and to switch to a pre-compiled kernel, neither have any effect.
On Sat, Feb 19, 2022 at 12:44:11AM +0100, Alarig Le Lay wrote:
Hello,
I gave the 2.0.9 git snapshot (71c9484b00b4428ae6c7d7c8eea6d96073683a54) a try tonight, and it seems to fix the issue for me. I’ve not tested on 5.10 though, as the LTS is now 5.15. However, I did test 5.15 with 2.0.8 and I had the same behaviour.
The VM is up for 6h now and everything is stable. Before, the logs were flooded within an hour.
Hello

Thanks for confirming it. That is likely an issue investigated and fixed by Tomas Hlavacek:

http://trubka.network.cz/pipermail/bird-users/2022-January/015909.html
--
Elen sila lumenn' omentielvo

Ondrej 'Santiago' Zajicek (email: santiago@crfreenet.org)
OpenPGP encrypted e-mails preferred (KeyID 0x11DEADC3, wwwkeys.pgp.net)
"To err is human -- to blame it on a computer is even more so."
Hello,

Thanks for giving me the original patch! I backported it along with some of the following commits to 2.0.8 and it seems to work too. The whole diff is https://git.grifon.fr/alarig/SwordArMor-gentoo-overlay/src/branch/master/net...

I will test it on old kernels too, and if it works, I’m planning to include it in the gentoo package.

On Sat 19 Feb 2022 01:44:43 GMT, Ondrej Zajicek wrote:
Hello
Thanks for confirming it. That is likely an issue investigated and fixed by Tomas Hlavacek:
http://trubka.network.cz/pipermail/bird-users/2022-January/015909.html