Dropped netlink updates during scans
Hi,
We’re using BIRD to redistribute routes that are programmed into the Linux kernel for routing to local containers or VMs. We set a scan time in the kernel section of the config in order to notice when routes are removed.
Normally, BIRD picks up routes that are added extremely quickly. However, if a route is added during a scan, it seems to be missed and it is not picked up until the next scan, many seconds later.
Is this a known issue; is there a workaround? For example, is there a way to tell BIRD that a route was removed from my external code rather than having it poll?
Thanks,
-Shaun
On Thu, Aug 13, 2015 at 03:12:26PM +0000, Shaun Crampton wrote:
Hi,
We’re using BIRD to redistribute routes that are programmed into the Linux kernel for routing to local containers or VMs. We set a scan time in the kernel section of the config in order to notice when routes are removed.
Normally, BIRD picks up routes that are added extremely quickly. However, if a route is added during a scan, it seems to be missed and it is not picked up until the next scan, many seconds later.
Hi
I was not aware of this issue, but the cause is pretty clear: scans are implemented in a synchronous way and BIRD ignores all unrelated messages during these scans.
The proper solution would be to make the BIRD netlink code fully asynchronous, but that means rewriting half of the netlink and route scanning code. As a workaround we could just queue these asynchronous messages and process them after scans (and other netlink operations).
BTW, the issue is likely not limited to route scans and may happen with any netlink operation (like a request for a route change), but other operations are probably too quick to cause the problem in practice.
Do you have a simple way to trigger the issue?
-- Elen sila lumenn' omentielvo
Ondrej 'Santiago' Zajicek (email: santiago@crfreenet.org) OpenPGP encrypted e-mails preferred (KeyID 0x11DEADC3, wwwkeys.pgp.net) "To err is human -- to blame it on a computer is even more so."
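To make the proposed workaround concrete, here is a minimal sketch of the queueing idea in Python. It is a toy model, not BIRD's code: the `(seq, kind, payload)` message tuples, the `run_scan` function, and the callback names are all hypothetical, and real netlink handling involves details this ignores. It only shows that messages unrelated to the running scan are kept and replayed afterwards instead of being dropped.

```python
from collections import deque


def run_scan(incoming, scan_seq, on_route, on_async):
    """Consume messages until the dump identified by scan_seq is finished.

    incoming: iterable of (seq, kind, payload) tuples, a toy stand-in for
    messages read from a netlink socket (hypothetical format).
    """
    deferred = deque()
    for seq, kind, payload in incoming:
        if seq != scan_seq:
            # Not part of the scan reply: keep it instead of ignoring it.
            deferred.append((kind, payload))
            continue
        if kind == "done":
            break
        on_route(payload)
    # The scan is over: replay deferred messages in arrival order.
    while deferred:
        kind, payload = deferred.popleft()
        on_async(kind, payload)


scanned, notified = [], []
messages = [
    (7, "route", "10.0.0.1/32"),
    (0, "newroute", "10.0.0.9/32"),  # asynchronous update arriving mid-scan
    (7, "route", "10.0.0.2/32"),
    (7, "done", None),
]
run_scan(iter(messages), 7, scanned.append,
         lambda kind, payload: notified.append((kind, payload)))
```

In this model the route added mid-scan reaches `on_async` as soon as the scan completes, rather than waiting for the next scan interval.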
We set our scan time to 2s and then create many routes using "ip route add". With many (say 50k) routes, the scan starts taking a second or more, so BIRD is ignoring about 50% of the route updates and only picking them up on the next scan.
We’re running 100ms interval pings between the containers that we’re routing, so we see a cluster of ping times around 2s and another cluster around 0ms.
-Shaun
On 13/08/2015 17:00, "Ondrej Zajicek" <santiago@crfreenet.org> wrote:
On Thu, Aug 13, 2015 at 03:12:26PM +0000, Shaun Crampton wrote:
Hi,
We’re using BIRD to redistribute routes that are programmed into the Linux kernel for routing to local containers or VMs. We set a scan time in the kernel section of the config in order to notice when routes are removed.
Normally, BIRD picks up routes that are added extremely quickly. However, if a route is added during a scan, it seems to be missed and it is not picked up until the next scan, many seconds later.
Hi
I was not aware of this issue, but the cause is pretty clear: scans are implemented in a synchronous way and BIRD ignores all unrelated messages during these scans.
The proper solution would be to make the BIRD netlink code fully asynchronous, but that means rewriting half of the netlink and route scanning code. As a workaround we could just queue these asynchronous messages and process them after scans (and other netlink operations).
BTW, the issue is likely not limited to route scans and may happen with any netlink operation (like a request for a route change), but other operations are probably too quick to cause the problem in practice.
Do you have a simple way to trigger the issue?
-- Elen sila lumenn' omentielvo
Ondrej 'Santiago' Zajicek (email: santiago@crfreenet.org) OpenPGP encrypted e-mails preferred (KeyID 0x11DEADC3, wwwkeys.pgp.net) "To err is human -- to blame it on a computer is even more so."
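For anyone wanting to try the reproduction Shaun describes (a 2s scan time plus tens of thousands of routes added with "ip route add"), a rough sketch of a generator for the route commands follows. The base address `10.200.0.0`, the interface name `dummy0`, and the function name are assumptions chosen for illustration; the thread does not say which prefixes or devices were used.

```python
import ipaddress


def route_batch_lines(count, base="10.200.0.0", dev="dummy0"):
    """Return `count` lines in `ip -batch` syntax, one /32 route per line.

    base and dev are placeholders; adjust them to a range and an interface
    that are safe to use on the test machine.
    """
    first = int(ipaddress.IPv4Address(base))
    return [
        f"route add {ipaddress.IPv4Address(first + i)}/32 dev {dev}"
        for i in range(count)
    ]


# Small demonstration; use a count around 50000 to approach the scale
# mentioned in the thread.
lines = route_batch_lines(3)
```

Writing the lines to a file and running `ip -batch` on that file as root should add the routes in bulk, provided the chosen interface exists; pacing the additions so that some land while a scan is in progress is left to the tester.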
On Thu, Aug 13, 2015 at 04:31:53PM +0000, Shaun Crampton wrote:
We set our scan time to 2s and then create many routes using "ip route add". With many (say 50k) routes, the scan starts taking a second or more so BIRD is ignoring about 50% of the route updates and only picking them up on the scan.
Do you see any warnings in log when a route is missed?
-- Elen sila lumenn' omentielvo
Ondrej 'Santiago' Zajicek (email: santiago@crfreenet.org) OpenPGP encrypted e-mails preferred (KeyID 0x11DEADC3, wwwkeys.pgp.net) "To err is human -- to blame it on a computer is even more so."
I don’t think so, but the test I’m running is very big so it’s hard to catch one of these issues as it happens. I have the logging turned down to a low level to achieve the scale I need.
-Shaun
On 13/08/2015 17:37, "Ondrej Zajicek" <santiago@crfreenet.org> wrote:
On Thu, Aug 13, 2015 at 04:31:53PM +0000, Shaun Crampton wrote:
We set our scan time to 2s and then create many routes using "ip route add". With many (say 50k) routes, the scan starts taking a second or more so BIRD is ignoring about 50% of the route updates and only picking them up on the scan.
Do you see any warnings in log when a route is missed?
-- Elen sila lumenn' omentielvo
Ondrej 'Santiago' Zajicek (email: santiago@crfreenet.org) OpenPGP encrypted e-mails preferred (KeyID 0x11DEADC3, wwwkeys.pgp.net) "To err is human -- to blame it on a computer is even more so."