Hello, We have 1.4.5 running on ~50 CentOS 7 servers and we have observed that Bird daemon stops responding on BFD messages which causes the BGP peering to be stopped and started again. Some details on our setup. Servers have 2 interfaces (north and south) and advertise /32 prefixes to the north and south for IPs assigned to loopback interface. Bird receives 'Received: Other configuration change' message over BGP from both peers, which are 2 different arista switches, at the same time. Tracing on the switches shows that Bird didn't respond on 3 BFD messages and arista informed Bird about it. It is very unlikely that switches or cables are the problem here. Bird will reestablish BGP peering with ~50secs and server will start receive traffic again. So, we have an outage that for about ~50seconds. We have seen these to different servers connecting to different switches. The frequency of the problem is very low, once per 2 days across 50 servers. Furthermore, there is any specific pattern in CPU, traffic, interface/TCP errors and something else which can help us to debug. I have seen in 2-3 cases that CPU usage at system jumps to 70% at the timestamp of the log that says 'BGP1: Received: Other configuration change' but the main software on the servers (HAProxy) which consumes most of the resources doesn't have any spike in CPU. Has anyone seen this behavior? Nov 01 16:23:00 bird[1376]: BGP1: Received: Other configuration change Nov 01 16:23:00 bird[1376]: BGP1: BGP session closed Nov 01 16:23:00 bird[1376]: BGP1: State changed to stop Nov 01 16:23:00 bird[1376]: BGP1: Down Nov 01 16:23:00 bird[1376]: bfd1: Session to 2.2.2.2 removed Nov 01 16:23:00 bird[1376]: BGP1: State changed to down Nov 01 16:23:00 bird[1376]: BGP1: Starting Nov 01 16:23:00 bird[1376]: BGP1: State changed to start Nov 01 16:23:00 bird[1376]: bfd1: Session to 2.2.2.2 added Nov 01 16:23:00 bird[1376]: BGP1: Startup delayed by 60 seconds Nov 01 16:23:00 bird[1376]: BGP2: Received: Other configuration change Nov 01 16:23:00 bird[1376]: BGP2: BGP session closed Nov 01 16:23:00 bird[1376]: BGP2: State changed to stop Nov 01 16:23:00 bird[1376]: BGP2: Down Nov 01 16:23:00 bird[1376]: bfd1: Session to 1.1.1.1 removed Nov 01 16:23:00 bird[1376]: BGP2: State changed to down Nov 01 16:23:00 bird[1376]: BGP2: Starting Nov 01 16:23:00 bird[1376]: BGP2: State changed to start Nov 01 16:23:00 bird[1376]: bfd1: Session to 1.1.1.1 added Nov 01 16:23:00 bird[1376]: BGP2: Startup delayed by 60 seconds Nov 01 16:23:00 bird[1376]: bfd1: Bad packet from 1.1.1.1 - unknown session id (3840383752) Nov 01 16:23:00 bird[1376]: bfd1: Bad packet from 2.2.2.2 - unknown session id (3632088877) Nov 01 16:23:00 bird[1376]: bfd1: Bad packet from 1.1.1.1 - unknown session id (3840383752) Nov 01 16:23:00 bird[1376]: bfd1: Bad packet from 2.2.2.2 - unknown session id (3632088877) Nov 01 16:23:00 bird[1376]: bfd1: Bad packet from 1.1.1.1 - unknown session id (3840383752) Nov 01 16:23:00 bird[1376]: bfd1: Bad packet from 2.2.2.2 - unknown session id (3632088877) Nov 01 16:23:00 bird[1376]: bfd1: Bad packet from 1.1.1.1 - unknown session id (3840383752) Nov 01 16:23:00 bird[1376]: bfd1: Bad packet from 2.2.2.2 - unknown session id (3632088877) Nov 01 16:23:00 bird[1376]: bfd1: Bad packet from 1.1.1.1 - unknown session id (3840383752) bird.conf: log syslog { debug, trace, info, remote, warning, error, auth, fatal, bug }; include "/etc/bird.d/*.conf"; router id 1.1.1.2; protocol device { scan time 10; } protocol static { disabled yes; } protocol direct direct1 { interface "lo"; debug all; export none; import where net ~ ANYCAST_NETWORKS; } protocol bfd bfd1 { debug { states, routes, filters, interfaces, events }; interface "north", "south" { min rx interval 400 ms; min tx interval 400 ms; idle tx interval 1000 ms; multiplier 3; }; } protocol bgp BGP1 { disabled no; description "Peer-BGP1"; neighbor 1.1.1.1 as 64824; source address 1.1.1.2; bfd on; debug all; import none; export where match_route_north(); direct; hold time 10; startup hold time 240; connect retry time 120; keepalive time 3; start delay time 5; error wait time 60, 300; error forget time 300; disable after error off; next hop self; path metric 1; default bgp_med 0; default bgp_local_pref 0; local as 64825; } protocol bgp BGP2 { disabled no; description "Peer-BGP2"; neighbor 2.2.2.2 as 64827; source address 2.2.2.1; bfd on; debug all; import none; export where match_route_south(); direct; hold time 10; startup hold time 240; connect retry time 120; keepalive time 3; start delay time 5; error wait time 60, 300; error forget time 300; disable after error off; next hop self; path metric 1; default bgp_med 0; default bgp_local_pref 0; local as 64825; } Cheers, Pavlos
On Tue, Nov 01, 2016 at 11:03:15PM +0100, Pavlos Parissis wrote:
Hello,
We have 1.4.5 running on ~50 CentOS 7 servers and we have observed that Bird daemon stops responding on BFD messages which causes the BGP peering to be stopped and started again.
Some details on our setup. Servers have 2 interfaces (north and south) and advertise /32 prefixes to the north and south for IPs assigned to loopback interface.
Bird receives 'Received: Other configuration change' message over BGP from both peers, which are 2 different arista switches, at the same time. Tracing on the switches shows that Bird didn't respond on 3 BFD messages and arista informed Bird about it. It is very unlikely that switches or cables are the problem here.
Hi I have no explanation ready, but one thing seems strange to me - there are these lines in log message:
Nov 01 16:23:00 bird[1376]: bfd1: Bad packet from 1.1.1.1 - unknown session id (3840383752)
Which means that arista switches send BFD packets with (presumably old) BIRD session ID, although if BFD on arista detected session down, it should reset the old session ID and should start with zero. I see three possible explanations: 1) The issue was not caused by BFD session going down on Arista 2) Arista did not correctly reset its remote session id state when session went down 3) BFD packets from BIRD to Arista and BGP shutdown from Arista to BIRD were processed simultaneously, which means that after BFD/BGP session drop Arista relearned old BIRD session id from a BFD packet that was sent before BIRD noticed the session went down. It would be useful to see BFD state change logs from Arista. -- Elen sila lumenn' omentielvo Ondrej 'Santiago' Zajicek (email: santiago@crfreenet.org) OpenPGP encrypted e-mails preferred (KeyID 0x11DEADC3, wwwkeys.pgp.net) "To err is human -- to blame it on a computer is even more so."
On 02/11/2016 12:14 μμ, Ondrej Zajicek wrote:
On Tue, Nov 01, 2016 at 11:03:15PM +0100, Pavlos Parissis wrote:
Hello,
We have 1.4.5 running on ~50 CentOS 7 servers and we have observed that Bird daemon stops responding on BFD messages which causes the BGP peering to be stopped and started again.
Some details on our setup. Servers have 2 interfaces (north and south) and advertise /32 prefixes to the north and south for IPs assigned to loopback interface.
Bird receives 'Received: Other configuration change' message over BGP from both peers, which are 2 different arista switches, at the same time. Tracing on the switches shows that Bird didn't respond on 3 BFD messages and arista informed Bird about it. It is very unlikely that switches or cables are the problem here.
Hi
I have no explanation ready, but one thing seems strange to me - there are these lines in log message:
I had another run of investigation on a server and noticed that just few seconds before bird logged the message about 'Received: Other configuration change': 1. CPU at system level spiked to ~90% 2. A lot of memory allocation/deallocation 3. Some swapping but very minimal, but under normal circumstances we have zero swapping as the box has enough memory No external factors caused that, I saw zero increase of incoming traffic to HAProxy. I was thinking that some process was doing garbage collection, we run some perl/python stuff and some daemon from HP for hardware monitoring. But, I can't still explain why this could cause kernel to drop the BFD messages. I am thinking that bird may never have seen BFD messages from arista, but Arista sends 3 with interval of 400ms, so it is like we had kernel dropping BFD messages for 1.2seconds, which I find it very unlikely.
Nov 01 16:23:00 bird[1376]: bfd1: Bad packet from 1.1.1.1 - unknown session id (3840383752)
Which means that arista switches send BFD packets with (presumably old) BIRD session ID, although if BFD on arista detected session down, it should reset the old session ID and should start with zero. I see three possible explanations:
1) The issue was not caused by BFD session going down on Arista
2) Arista did not correctly reset its remote session id state when session went down
3) BFD packets from BIRD to Arista and BGP shutdown from Arista to BIRD were processed simultaneously, which means that after BFD/BGP session drop Arista relearned old BIRD session id from a BFD packet that was sent before BIRD noticed the session went down.
It would be useful to see BFD state change logs from Arista.
We will try to enable this, but we need a bit careful. Thanks a lot for your feedback, Pavlos
participants (2)
-
Ondrej Zajicek -
Pavlos Parissis