Hello, I have three Ubuntu physical machines with quagga (ospf area0/bgp with full route table on 2-4 uplink) and one whit bird (ospf only). On one of the routers (ospf/bgp) replace quagga on bird. Configuration bird attachment. After replacing the CPU load there is increased by 10-15%. On top Bird periodically consumes up to 90% of the CPU, and at a given time is restarted ospf: Inactivity timer expired for nbr. On one interfaces eth1.2508 to another bird router, I increased the Dead interval to 10 seconds. And it helped. What could be my problem? Large filters BGP, smallest ospf hello/dead interval or something else? One of restarts occurred at 13:13:13, OSPF trace in attachment, OSPF net packet in attachment. -- Кицул Андрей администратор Unix-систем ЗАО "НГС" +7-923-111-3940
When I disabled bgp_avantel & bgp_megacop then CPU load decreased and OSPF is stabilized. 26.10.2015 14:47, Andrey Kitsul пишет:
Hello,
I have three Ubuntu physical machines with quagga (ospf area0/bgp with full route table on 2-4 uplink) and one whit bird (ospf only). On one of the routers (ospf/bgp) replace quagga on bird. Configuration bird attachment. After replacing the CPU load there is increased by 10-15%. On top Bird periodically consumes up to 90% of the CPU, and at a given time is restarted ospf: Inactivity timer expired for nbr. On one interfaces eth1.2508 to another bird router, I increased the Dead interval to 10 seconds. And it helped. What could be my problem? Large filters BGP, smallest ospf hello/dead interval or something else? One of restarts occurred at 13:13:13, OSPF trace in attachment, OSPF net packet in attachment.
-- Кицул Андрей администратор Unix-систем ЗАО "НГС" +7-923-111-3940
-- Кицул Андрей администратор Unix-систем ЗАО "НГС" +7-923-111-3940
On Mon, Oct 26, 2015 at 02:47:56PM +0600, Andrey Kitsul wrote:
Hello,
I have three Ubuntu physical machines with quagga (ospf area0/bgp with full route table on 2-4 uplink) and one whit bird (ospf only). On one of the routers (ospf/bgp) replace quagga on bird. Configuration bird attachment. After replacing the CPU load there is increased by 10-15%. On top Bird periodically consumes up to 90% of the CPU, and at a given time is restarted ospf: Inactivity timer expired for nbr. On one interfaces eth1.2508 to another bird router, I increased the Dead interval to 10 seconds. And it helped. What could be my problem? Large filters BGP, smallest ospf hello/dead interval or something else?
Hello You could try to enable 'debug latency' option and optionally set 'debug latency limit' to see which events cause latency issues. If you export complete BGP kernel table to the kernel, i would guess the issue is caused by periodic scan of kernel routing table by the kernel protocol. For full BGP table it may took time in order of seconds, which may interfere with low OSPF dead time limits. -- Elen sila lumenn' omentielvo Ondrej 'Santiago' Zajicek (email: santiago@crfreenet.org) OpenPGP encrypted e-mails preferred (KeyID 0x11DEADC3, wwwkeys.pgp.net) "To err is human -- to blame it on a computer is even more so."
Hello, If I do not need to import routes from kernel to bird, can I disable the import and learn for kernel protocol? What more will it affect or periodic scans will still cause a load on the CPU? 26.10.2015 16:32, Ondrej Zajicek пишет:
On Mon, Oct 26, 2015 at 02:47:56PM +0600, Andrey Kitsul wrote:
Hello,
I have three Ubuntu physical machines with quagga (ospf area0/bgp with full route table on 2-4 uplink) and one whit bird (ospf only). On one of the routers (ospf/bgp) replace quagga on bird. Configuration bird attachment. After replacing the CPU load there is increased by 10-15%. On top Bird periodically consumes up to 90% of the CPU, and at a given time is restarted ospf: Inactivity timer expired for nbr. On one interfaces eth1.2508 to another bird router, I increased the Dead interval to 10 seconds. And it helped. What could be my problem? Large filters BGP, smallest ospf hello/dead interval or something else? Hello
You could try to enable 'debug latency' option and optionally set 'debug latency limit' to see which events cause latency issues.
If you export complete BGP kernel table to the kernel, i would guess the issue is caused by periodic scan of kernel routing table by the kernel protocol. For full BGP table it may took time in order of seconds, which may interfere with low OSPF dead time limits.
On Mon, Oct 26, 2015 at 04:48:13PM +0600, Andrey Kitsul wrote:
Hello,
If I do not need to import routes from kernel to bird, can I disable the import and learn for kernel protocol?
Yes, but it does not help you. Periodic scans took time because you have high number of your (BGP) routes in kernel FIB. One solution would be to set OSPF dead timers to tens of seconds and use BFD for fast failure detection. BFD runs in separate thread and is not affected by the kernel scans. -- Elen sila lumenn' omentielvo Ondrej 'Santiago' Zajicek (email: santiago@crfreenet.org) OpenPGP encrypted e-mails preferred (KeyID 0x11DEADC3, wwwkeys.pgp.net) "To err is human -- to blame it on a computer is even more so."
Is there a possibility to run for OSPF bird and bird separately for BGP? Or you can change the priority of kernel periodic scan? 26.10.2015 17:01, Ondrej Zajicek пишет:
On Mon, Oct 26, 2015 at 04:48:13PM +0600, Andrey Kitsul wrote:
Hello,
If I do not need to import routes from kernel to bird, can I disable the import and learn for kernel protocol? Yes, but it does not help you. Periodic scans took time because you have high number of your (BGP) routes in kernel FIB.
One solution would be to set OSPF dead timers to tens of seconds and use BFD for fast failure detection. BFD runs in separate thread and is not affected by the kernel scans.
-- Кицул Андрей администратор Unix-систем ЗАО "НГС" +7-923-111-3940
On Mon, Oct 26, 2015 at 05:32:47PM +0600, Andrey Kitsul wrote:
Is there a possibility to run for OSPF bird and bird separately for BGP? Or you can change the priority of kernel periodic scan?
Perhaps you could, but OSPF instance must not use kernel proto and you have to somehow propagate OSPF routes to the BGP instance. Probably not a good idea. -- Elen sila lumenn' omentielvo Ondrej 'Santiago' Zajicek (email: santiago@crfreenet.org) OpenPGP encrypted e-mails preferred (KeyID 0x11DEADC3, wwwkeys.pgp.net) "To err is human -- to blame it on a computer is even more so."
Hello, I see in the logs:
2015-10-26 16:52:11 <WARN> Event 0x0000000000442440 0x0000000000000000 took 1418 ms 2015-10-26 16:52:26 <WARN> Event 0x0000000000442440 0x0000000000000000 took 1523 ms 2015-10-26 16:52:41 <WARN> Event 0x0000000000442440 0x0000000000000000 took 1452 ms 2015-10-26 16:52:56 <WARN> Event 0x0000000000442440 0x0000000000000000 took 1429 ms 2015-10-26 16:53:11 <WARN> Event 0x0000000000442440 0x0000000000000000 took 1405 ms 2015-10-26 16:53:26 <WARN> Event 0x0000000000442440 0x0000000000000000 took 1428 ms 2015-10-26 16:53:41 <WARN> Event 0x0000000000442440 0x0000000000000000 took 1493 ms 2015-10-26 16:53:56 <WARN> Event 0x0000000000442440 0x0000000000000000 took 1455 ms How to interpret?
26.10.2015 16:32, Ondrej Zajicek пишет:
You could try to enable 'debug latency' option and optionally set 'debug latency limit' to see which events cause latency issues.
-- Кицул Андрей администратор Unix-систем ЗАО "НГС" +7-923-111-3940
On Mon, Oct 26, 2015 at 04:57:08PM +0600, Andrey Kitsul wrote:
Hello,
I see in the logs:
2015-10-26 16:52:11 <WARN> Event 0x0000000000442440 0x0000000000000000 took 1418 ms 2015-10-26 16:52:26 <WARN> Event 0x0000000000442440 0x0000000000000000 took 1523 ms 2015-10-26 16:52:41 <WARN> Event 0x0000000000442440 0x0000000000000000 took 1452 ms 2015-10-26 16:52:56 <WARN> Event 0x0000000000442440 0x0000000000000000 took 1429 ms 2015-10-26 16:53:11 <WARN> Event 0x0000000000442440 0x0000000000000000 took 1405 ms 2015-10-26 16:53:26 <WARN> Event 0x0000000000442440 0x0000000000000000 took 1428 ms 2015-10-26 16:53:41 <WARN> Event 0x0000000000442440 0x0000000000000000 took 1493 ms 2015-10-26 16:53:56 <WARN> Event 0x0000000000442440 0x0000000000000000 took 1455 ms How to interpret?
You would have to use gdb or perhpaps addr2line to translate addresses to event function names but with 15 s period it is pretty obvious that it is kernel periodic scan. You can confirm that by enabling debug { events } in the kernel proto and see kernel periodic events. -- Elen sila lumenn' omentielvo Ondrej 'Santiago' Zajicek (email: santiago@crfreenet.org) OpenPGP encrypted e-mails preferred (KeyID 0x11DEADC3, wwwkeys.pgp.net) "To err is human -- to blame it on a computer is even more so."
Yes, it is kernel scan:
2015-10-26 17:27:56 <TRACE> kernel1: Scanning routing table 2015-10-26 17:27:56 <TRACE> kernel1: Pruning table master 2015-10-26 17:27:56 <TRACE> kernel1: Pruning inherited routes 2015-10-26 17:27:56 <WARN> Event 0x0000000000442440 0x0000000000000000 took 1830 ms
26.10.2015 17:09, Ondrej Zajicek пишет:
On Mon, Oct 26, 2015 at 04:57:08PM +0600, Andrey Kitsul wrote:
Hello,
I see in the logs:
2015-10-26 16:52:11 <WARN> Event 0x0000000000442440 0x0000000000000000 took 1418 ms 2015-10-26 16:52:26 <WARN> Event 0x0000000000442440 0x0000000000000000 took 1523 ms 2015-10-26 16:52:41 <WARN> Event 0x0000000000442440 0x0000000000000000 took 1452 ms 2015-10-26 16:52:56 <WARN> Event 0x0000000000442440 0x0000000000000000 took 1429 ms 2015-10-26 16:53:11 <WARN> Event 0x0000000000442440 0x0000000000000000 took 1405 ms 2015-10-26 16:53:26 <WARN> Event 0x0000000000442440 0x0000000000000000 took 1428 ms 2015-10-26 16:53:41 <WARN> Event 0x0000000000442440 0x0000000000000000 took 1493 ms 2015-10-26 16:53:56 <WARN> Event 0x0000000000442440 0x0000000000000000 took 1455 ms How to interpret? You would have to use gdb or perhpaps addr2line to translate addresses to event function names but with 15 s period it is pretty obvious that it is kernel periodic scan. You can confirm that by enabling debug { events } in the kernel proto and see kernel periodic events.
-- Кицул Андрей администратор Unix-систем ЗАО "НГС" +7-923-111-3940
Why ospf restart? Dead interval 4sec not been reached.
2015-10-26 17:35:11 <TRACE> kernel1: Scanning routing table 2015-10-26 17:35:11 <TRACE> kernel1: Pruning table master 2015-10-26 17:35:11 <TRACE> kernel1: Pruning inherited routes 2015-10-26 17:35:11 <WARN> Event 0x0000000000442440 0x0000000000000000 took 1873 ms 2015-10-26 17:35:13 <TRACE> ospf_area0: Inactivity timer expired for nbr 195.x.y.z2 on eth0.1889 2015-10-26 17:35:13 <TRACE> ospf_area0: Neighbor 195.x.y.z2 on eth0.1889 changed state from Full to Down 2015-10-26 17:35:13 <TRACE> ospf_area0: Neighbor 195.x.y.z2 on eth0.1889 removed 2015-10-26 17:35:13 <TRACE> ospf_area0: New neighbor 195.x.y.z2 on eth0.1889, IP address 195.x.y.z2 2015-10-26 17:35:13 <TRACE> ospf_area0: Neighbor 195.x.y.z2 on eth0.1889 changed state from Down to Init 2015-10-26 17:35:13 <TRACE> ospf_area0: Neighbor 195.x.y.z2 on eth0.1889 changed state from Init to 2-Way 2015-10-26 17:35:13 <TRACE> ospf_area0: Neighbor 195.x.y.z2 on eth0.1889 changed state from 2-Way to ExStart 2015-10-26 17:35:13 <TRACE> ospf_area0: Neighbor 195.x.y.z2 on eth0.1889 changed state from ExStart to Exchange 2015-10-26 17:35:13 <TRACE> ospf_area0: Scheduling routing table calculation 2015-10-26 17:35:13 <TRACE> ospf_area0: Neighbor 195.x.y.z2 on eth0.1889 changed state from Exchange to Full
26.10.2015 17:09, Ondrej Zajicek пишет:
On Mon, Oct 26, 2015 at 04:57:08PM +0600, Andrey Kitsul wrote:
Hello,
I see in the logs:
2015-10-26 16:52:11 <WARN> Event 0x0000000000442440 0x0000000000000000 took 1418 ms 2015-10-26 16:52:26 <WARN> Event 0x0000000000442440 0x0000000000000000 took 1523 ms 2015-10-26 16:52:41 <WARN> Event 0x0000000000442440 0x0000000000000000 took 1452 ms 2015-10-26 16:52:56 <WARN> Event 0x0000000000442440 0x0000000000000000 took 1429 ms 2015-10-26 16:53:11 <WARN> Event 0x0000000000442440 0x0000000000000000 took 1405 ms 2015-10-26 16:53:26 <WARN> Event 0x0000000000442440 0x0000000000000000 took 1428 ms 2015-10-26 16:53:41 <WARN> Event 0x0000000000442440 0x0000000000000000 took 1493 ms 2015-10-26 16:53:56 <WARN> Event 0x0000000000442440 0x0000000000000000 took 1455 ms How to interpret? You would have to use gdb or perhpaps addr2line to translate addresses to event function names but with 15 s period it is pretty obvious that it is kernel periodic scan. You can confirm that by enabling debug { events } in the kernel proto and see kernel periodic events.
-- Кицул Андрей администратор Unix-систем ЗАО "НГС" +7-923-111-3940
On Mon, Oct 26, 2015 at 05:40:08PM +0600, Andrey Kitsul wrote:
Why ospf restart? Dead interval 4sec not been reached.
Well, it is counted from the last received hello, so you have to add 'packets' to OSPF debug to see when it started. Also timers in BIRD have generally second granularity, so 4 sec timer may timeout in some cases in 3.1 sec. And there is also one minor bug that may cause some timers to te triggered ~1 s earlier. So all these factors together, 4 s (or smaller) dead timers in BIRD may be unreliable. I would suggest at least 8-10 s. -- Elen sila lumenn' omentielvo Ondrej 'Santiago' Zajicek (email: santiago@crfreenet.org) OpenPGP encrypted e-mails preferred (KeyID 0x11DEADC3, wwwkeys.pgp.net) "To err is human -- to blame it on a computer is even more so."
Hello, Packets debug in attachment, problem: "2015-10-27 10:02:28 <TRACE> ospf_area0: Inactivity timer expired for nbr 195.x.y.z2 on eth0.1889" In pcap dump I see ospf Hello packets per second from other ospf nbr. 26.10.2015 19:47, Ondrej Zajicek пишет:
On Mon, Oct 26, 2015 at 05:40:08PM +0600, Andrey Kitsul wrote:
Why ospf restart? Dead interval 4sec not been reached. Well, it is counted from the last received hello, so you have to add 'packets' to OSPF debug to see when it started.
Also timers in BIRD have generally second granularity, so 4 sec timer may timeout in some cases in 3.1 sec. And there is also one minor bug that may cause some timers to te triggered ~1 s earlier.
So all these factors together, 4 s (or smaller) dead timers in BIRD may be unreliable. I would suggest at least 8-10 s.
-- Кицул Андрей администратор Unix-систем ЗАО "НГС" +7-923-111-3940
Hello, Packets debug in attachment (TRACE + WARN), problem: "2015-10-27 10:02:28 <TRACE> ospf_area0: Inactivity timer expired for nbr 195.x.y.z2 on eth0.1889" In pcap dump I see ospf Hello packets per second from other ospf nbr. 26.10.2015 19:47, Ondrej Zajicek пишет:
On Mon, Oct 26, 2015 at 05:40:08PM +0600, Andrey Kitsul wrote:
Why ospf restart? Dead interval 4sec not been reached. Well, it is counted from the last received hello, so you have to add 'packets' to OSPF debug to see when it started.
Also timers in BIRD have generally second granularity, so 4 sec timer may timeout in some cases in 3.1 sec. And there is also one minor bug that may cause some timers to te triggered ~1 s earlier.
So all these factors together, 4 s (or smaller) dead timers in BIRD may be unreliable. I would suggest at least 8-10 s.
-- Кицул Андрей администратор Unix-систем ЗАО "НГС" +7-923-111-3940
On Tue, Oct 27, 2015 at 10:25:06AM +0600, Andrey Kitsul wrote:
Hello,
Packets debug in attachment (TRACE + WARN), problem: "2015-10-27 10:02:28 <TRACE> ospf_area0: Inactivity timer expired for nbr 195.x.y.z2 on eth0.1889" In pcap dump I see ospf Hello packets per second from other ospf nbr.
2015-10-27 10:02:25 <TRACE> ospf_area0: HELLO packet received from nbr 195.x.y.z4 on greegw4 2015-10-27 10:02:25 <TRACE> ospf_area0: HELLO packet received from nbr 195.x.y.z2 on greegw2 2015-10-27 10:02:25 <TRACE> ospf_area0: HELLO packet received from nbr 195.a.b.c1 on eth1.2508 2015-10-27 10:02:25 <TRACE> ospf_area0: HELLO packet received from nbr 195.x.y.z4 on eth0.1889
2015-10-27 10:02:26 <TRACE> kernel1: Scanning routing table 2015-10-27 10:02:26 <TRACE> kernel1: Pruning table master 2015-10-27 10:02:26 <TRACE> kernel1: Pruning inherited routes 2015-10-27 10:02:26 <WARN> Event 0x0000000000442440 0x0000000000000000 took 1619 ms 2015-10-27 10:02:26 <TRACE> ospf_area0: HELLO packet sent via greegw4 2015-10-27 10:02:26 <TRACE> ospf_area0: HELLO packet sent via greegw2 2015-10-27 10:02:26 <TRACE> ospf_area0: HELLO packet sent via eth1.2508 2015-10-27 10:02:26 <TRACE> ospf_area0: HELLO packet sent via eth0.1889 2015-10-27 10:02:28 <TRACE> ospf_area0: HELLO packet sent via eth0.1889 2015-10-27 10:02:28 <TRACE> ospf_area0: HELLO packet sent via eth1.2508 2015-10-27 10:02:28 <TRACE> ospf_area0: HELLO packet sent via greegw2 2015-10-27 10:02:28 <TRACE> ospf_area0: HELLO packet sent via greegw4 2015-10-27 10:02:28 <TRACE> ospf_area0: Inactivity timer expired for nbr 195.x.y.z2 on eth0.1889
Hi The long kernel scan caused remaining events to be queued and this sequence of timer events (containing neighbor expiration event) was processed before BIRD started to process incoming socket events (e.g. the incoming HELLO). The incoming hello was processed immediately after that:
2015-10-27 10:02:28 <TRACE> ospf_area0: HELLO packet received from nbr 195.x.y.z2 on eth0.1889 2015-10-27 10:02:28 <TRACE> ospf_area0: New neighbor 195.x.y.z2 on eth0.1889, IP address 195.x.y.z2
-- Elen sila lumenn' omentielvo Ondrej 'Santiago' Zajicek (email: santiago@crfreenet.org) OpenPGP encrypted e-mails preferred (KeyID 0x11DEADC3, wwwkeys.pgp.net) "To err is human -- to blame it on a computer is even more so."
Hello, Ondrej, Is it possible to increase priority the processing ospf packets over other processes? 26.10.2015 19:47, Ondrej Zajicek пишет:
On Mon, Oct 26, 2015 at 05:40:08PM +0600, Andrey Kitsul wrote:
Why ospf restart? Dead interval 4sec not been reached.
Well, it is counted from the last received hello, so you have to add 'packets' to OSPF debug to see when it started.
Also timers in BIRD have generally second granularity, so 4 sec timer may timeout in some cases in 3.1 sec. And there is also one minor bug that may cause some timers to te triggered ~1 s earlier.
So all these factors together, 4 s (or smaller) dead timers in BIRD may be unreliable. I would suggest at least 8-10 s.
-- Andrey Kitsul
On Wed, Oct 28, 2015 at 02:23:33PM +0600, Andrey Kitsul wrote:
Hello, Ondrej,
Is it possible to increase priority the processing ospf packets over other processes?
No, BIRD does not differentiate events based on associated protocol but based on event type (timer, socket, ..) and timer events have more priority than socket events. We definitely should fix this, but i don't see any simple immediate hotfix (short of increasing dead time) for this problem. Perhaps optional disabling of periodic kernel scanning. -- Elen sila lumenn' omentielvo Ondrej 'Santiago' Zajicek (email: santiago@crfreenet.org) OpenPGP encrypted e-mails preferred (KeyID 0x11DEADC3, wwwkeys.pgp.net) "To err is human -- to blame it on a computer is even more so."
participants (2)
-
Andrey Kitsul -
Ondrej Zajicek