Long io loop / missing netlink messages
Good morning, on a fresh new router running the full routing table, with Alpine, Linux 4.19.80-0-vanilla, bird-2.0.7 I see a lot of these messages: Dec 21 12:37:15 router1 daemon.warn bird: Kernel dropped some netlink messages, will resync on next scan. Dec 21 12:37:34 router1 daemon.warn bird: I/O loop cycle took 5000 ms for 1 events Dec 21 12:38:14 router1 daemon.warn bird: Kernel dropped some netlink messages, will resync on next scan. Dec 21 12:38:35 router1 daemon.warn bird: I/O loop cycle took 5328 ms for 1 events Dec 21 12:38:54 router1 daemon.warn bird: I/O loop cycle took 5013 ms for 1 events Dec 21 12:39:07 router1 daemon.warn bird: Kernel dropped some netlink messages, will resync on next scan. Dec 21 12:39:14 router1 daemon.warn bird: I/O loop cycle took 5053 ms for 1 events Dec 21 12:39:34 router1 daemon.warn bird: Kernel dropped some netlink messages, will resync on next scan. Dec 21 12:40:14 router1 daemon.warn bird: I/O loop cycle took 5041 ms for 1 events [12:49] router1.place6:~# ip -6 r | wc -l; ip r | wc -l 78212 779342 With "debug latency;" I get the following additional messages: Dec 21 12:54:31 router1 daemon.warn bird: Event 0x000055a21afb8144 0x0000000000000000 took 4449 ms Dec 21 12:54:52 router1 daemon.warn bird: Event 0x000055a21afb8144 0x0000000000000000 took 5608 ms The system is overall idle with bird spiking to 50-100% cpu usage every couple of seconds. I first thougt they are only logged after stating bird (where it might make sense), but the events continue to be logged around every 30s: Dec 21 13:04:11 router1 daemon.warn bird: Event 0x000055a21afb8144 0x0000000000000000 took 4596 ms Dec 21 13:04:32 router1 daemon.warn bird: Event 0x000055a21afb8144 0x0000000000000000 took 5096 ms Dec 21 13:04:52 router1 daemon.warn bird: Event 0x000055a21afb8144 0x0000000000000000 took 5102 ms Dec 21 13:05:11 router1 daemon.warn bird: Event 0x000055a21afb8144 0x0000000000000000 took 4676 ms Dec 21 13:05:31 router1 daemon.warn bird: Event 0x000055a21afb8144 0x0000000000000000 took 4645 ms Which might loosely correlate to the scan time "20" that is setup for device and kernel protocols. How do I best debug this issue? Best, Nico -- Modern, affordable, Swiss Virtual Machines. Visit www.datacenterlight.ch
Hello! How much time does it take to list the kernel table? time ip r > /dev/null How many routes do you have in bird table? show route count And what export filter do you have for the kernel protocol in bird? Thanks Maria On December 21, 2019 1:07:25 PM GMT+01:00, Nico Schottelius <nico.schottelius@ungleich.ch> wrote:
Good morning,
on a fresh new router running the full routing table, with Alpine, Linux 4.19.80-0-vanilla, bird-2.0.7
I see a lot of these messages:
Dec 21 12:37:15 router1 daemon.warn bird: Kernel dropped some netlink messages, will resync on next scan. Dec 21 12:37:34 router1 daemon.warn bird: I/O loop cycle took 5000 ms for 1 events Dec 21 12:38:14 router1 daemon.warn bird: Kernel dropped some netlink messages, will resync on next scan. Dec 21 12:38:35 router1 daemon.warn bird: I/O loop cycle took 5328 ms for 1 events Dec 21 12:38:54 router1 daemon.warn bird: I/O loop cycle took 5013 ms for 1 events Dec 21 12:39:07 router1 daemon.warn bird: Kernel dropped some netlink messages, will resync on next scan. Dec 21 12:39:14 router1 daemon.warn bird: I/O loop cycle took 5053 ms for 1 events Dec 21 12:39:34 router1 daemon.warn bird: Kernel dropped some netlink messages, will resync on next scan. Dec 21 12:40:14 router1 daemon.warn bird: I/O loop cycle took 5041 ms for 1 events
[12:49] router1.place6:~# ip -6 r | wc -l; ip r | wc -l 78212 779342
With "debug latency;" I get the following additional messages:
Dec 21 12:54:31 router1 daemon.warn bird: Event 0x000055a21afb8144 0x0000000000000000 took 4449 ms Dec 21 12:54:52 router1 daemon.warn bird: Event 0x000055a21afb8144 0x0000000000000000 took 5608 ms
The system is overall idle with bird spiking to 50-100% cpu usage every couple of seconds. I first thougt they are only logged after stating bird (where it might make sense), but the events continue to be logged around every 30s:
Dec 21 13:04:11 router1 daemon.warn bird: Event 0x000055a21afb8144 0x0000000000000000 took 4596 ms Dec 21 13:04:32 router1 daemon.warn bird: Event 0x000055a21afb8144 0x0000000000000000 took 5096 ms Dec 21 13:04:52 router1 daemon.warn bird: Event 0x000055a21afb8144 0x0000000000000000 took 5102 ms Dec 21 13:05:11 router1 daemon.warn bird: Event 0x000055a21afb8144 0x0000000000000000 took 4676 ms Dec 21 13:05:31 router1 daemon.warn bird: Event 0x000055a21afb8144 0x0000000000000000 took 4645 ms
Which might loosely correlate to the scan time "20" that is setup for device and kernel protocols.
How do I best debug this issue?
Best,
Nico
-- Modern, affordable, Swiss Virtual Machines. Visit www.datacenterlight.ch
-- Sent from my Android device with K-9 Mail. Please excuse my brevity.
Hello Maria, Maria Matějka <maria.matejka@nic.cz> writes:
Hello!
How much time does it take to list the kernel table?
time ip r > /dev/null
[16:47] router1.place6:~# time ip r >/dev/null real 0m 6.06s user 0m 4.90s sys 0m 1.14s [16:47] router1.place6:~# time ip r >/dev/null real 0m 5.92s user 0m 4.73s sys 0m 1.17s [16:47] router1.place6:~# Question: is a 6 second lookup expected with full IPv4 table?
How many routes do you have in bird table?
show route count
[16:47] router1.place6:~# birdc show route count BIRD 2.0.7 ready. 778706 of 778706 routes for 778704 networks in table master4 89166 of 89166 routes for 78105 networks in table master6 Total: 867872 of 867872 routes for 856809 networks in 2 tables
And what export filter do you have for the kernel protocol in bird?
In short: every route <= /24 (IPv4) and <= /48 (IPv6) of the global routing table. In long: define net_to_router_v4 = [ 0.0.0.0/0+ ]; define net_to_router_v6 = [ ::/0+ ]; protocol kernel kernel_v6 { metric 64; learn; scan time 20; ipv6 { import filter ungleich_networks; export filter to_ungleich_routers; }; } function is_for_router_kernel() { if ((net.type = NET_IP6 && net ~ net_to_router_v6) || (net.type = NET_IP4 && net ~ net_to_router_v4)) then { return true; } if ((net.type = NET_IP6 && net.len <= 29) || (net.type = NET_IP4 && net.len <= 10)) then { return true; } return false; } /* skipping functions that are very small routes */ filter to_ungleich_routers { if(is_inside_ungleich()) then accept; if(is_from_netstream()) then accept; if(is_from_sunrise()) then accept; if(is_default_route()) then accept; if(is_for_router_kernel()) then accept; reject; } Best, Nico
On December 21, 2019 1:07:25 PM GMT+01:00, Nico Schottelius <nico.schottelius@ungleich.ch> wrote:
Good morning,
on a fresh new router running the full routing table, with Alpine, Linux 4.19.80-0-vanilla, bird-2.0.7
I see a lot of these messages:
Dec 21 12:37:15 router1 daemon.warn bird: Kernel dropped some netlink messages, will resync on next scan. Dec 21 12:37:34 router1 daemon.warn bird: I/O loop cycle took 5000 ms for 1 events Dec 21 12:38:14 router1 daemon.warn bird: Kernel dropped some netlink messages, will resync on next scan. Dec 21 12:38:35 router1 daemon.warn bird: I/O loop cycle took 5328 ms for 1 events Dec 21 12:38:54 router1 daemon.warn bird: I/O loop cycle took 5013 ms for 1 events Dec 21 12:39:07 router1 daemon.warn bird: Kernel dropped some netlink messages, will resync on next scan. Dec 21 12:39:14 router1 daemon.warn bird: I/O loop cycle took 5053 ms for 1 events Dec 21 12:39:34 router1 daemon.warn bird: Kernel dropped some netlink messages, will resync on next scan. Dec 21 12:40:14 router1 daemon.warn bird: I/O loop cycle took 5041 ms for 1 events
[12:49] router1.place6:~# ip -6 r | wc -l; ip r | wc -l 78212 779342
With "debug latency;" I get the following additional messages:
Dec 21 12:54:31 router1 daemon.warn bird: Event 0x000055a21afb8144 0x0000000000000000 took 4449 ms Dec 21 12:54:52 router1 daemon.warn bird: Event 0x000055a21afb8144 0x0000000000000000 took 5608 ms
The system is overall idle with bird spiking to 50-100% cpu usage every couple of seconds. I first thougt they are only logged after stating bird (where it might make sense), but the events continue to be logged around every 30s:
Dec 21 13:04:11 router1 daemon.warn bird: Event 0x000055a21afb8144 0x0000000000000000 took 4596 ms Dec 21 13:04:32 router1 daemon.warn bird: Event 0x000055a21afb8144 0x0000000000000000 took 5096 ms Dec 21 13:04:52 router1 daemon.warn bird: Event 0x000055a21afb8144 0x0000000000000000 took 5102 ms Dec 21 13:05:11 router1 daemon.warn bird: Event 0x000055a21afb8144 0x0000000000000000 took 4676 ms Dec 21 13:05:31 router1 daemon.warn bird: Event 0x000055a21afb8144 0x0000000000000000 took 4645 ms
Which might loosely correlate to the scan time "20" that is setup for device and kernel protocols.
How do I best debug this issue?
Best,
Nico
-- Modern, affordable, Swiss Virtual Machines. Visit www.datacenterlight.ch
-- Modern, affordable, Swiss Virtual Machines. Visit www.datacenterlight.ch
participants (2)
-
Maria Matějka -
Nico Schottelius