On 02/11/2016 12:14 PM, Ondrej Zajicek wrote:
On Tue, Nov 01, 2016 at 11:03:15PM +0100, Pavlos Parissis wrote:
Hello,
We have Bird 1.4.5 running on ~50 CentOS 7 servers, and we have observed that the Bird daemon stops responding to BFD messages, which causes the BGP peering to go down and come back up.
Some details on our setup: the servers have 2 interfaces (north and south) and advertise /32 prefixes to the north and south for IPs assigned to the loopback interface.
Bird receives a 'Received: Other configuration change' message over BGP from both peers, which are 2 different Arista switches, at the same time. Tracing on the switches shows that Bird didn't respond to 3 BFD messages and Arista informed Bird about it. It is very unlikely that the switches or cables are the problem here.
Hi
I have no explanation ready, but one thing seems strange to me - there are these lines in the log:
I had another round of investigation on a server and noticed that, just a few seconds before bird logged the 'Received: Other configuration change' message:

1. CPU at system level spiked to ~90%
2. There was a lot of memory allocation/deallocation
3. There was some swapping, very minimal, but under normal circumstances we have zero swapping as the box has enough memory

No external factors caused that; I saw zero increase in incoming traffic to HAProxy. I was thinking that some process was doing garbage collection, as we run some perl/python stuff and a daemon from HP for hardware monitoring. But I still can't explain why this would cause the kernel to drop the BFD messages. I am thinking that bird may never have seen the BFD messages from Arista, but Arista sends 3 with an interval of 400ms, so the kernel would have had to drop BFD messages for 1.2 seconds, which I find very unlikely.
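To make the 1.2 second figure explicit, here is a minimal sketch of the BFD detection time calculation in asynchronous mode as described in RFC 5880. It assumes both sides use a 400 ms interval and a multiplier of 3, which is what the numbers above suggest; the function and parameter names are made up for illustration and are not taken from BIRD or Arista.

```python
def detection_time_ms(remote_detect_mult: int,
                      local_required_min_rx_ms: int,
                      remote_desired_min_tx_ms: int) -> int:
    """Time without a received packet after which the session is declared down.

    Per RFC 5880 (asynchronous mode): the remote multiplier times the
    larger of our required RX interval and the peer's desired TX interval.
    """
    agreed_interval_ms = max(local_required_min_rx_ms, remote_desired_min_tx_ms)
    return remote_detect_mult * agreed_interval_ms


# Assumed values: 3 packets at 400 ms, as mentioned in this thread.
print(detection_time_ms(3, 400, 400))
```

With these assumed values, the receiving side has to see no BFD packets at all for 1200 ms before it declares the session down.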
Nov 01 16:23:00 bird[1376]: bfd1: Bad packet from 1.1.1.1 - unknown session id (3840383752)
Which means that the Arista switches send BFD packets with a (presumably old) BIRD session ID, although if BFD on the Arista had detected the session going down, it should have reset the old session ID and started again with zero. I see three possible explanations:
1) The issue was not caused by BFD session going down on Arista
2) Arista did not correctly reset its remote session id state when session went down
3) BFD packets from BIRD to Arista and BGP shutdown from Arista to BIRD were processed simultaneously, which means that after BFD/BGP session drop Arista relearned old BIRD session id from a BFD packet that was sent before BIRD noticed the session went down.
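For readers not familiar with how the session ID is used, the following is a rough sketch of the receive-side lookup that RFC 5880 describes, not BIRD's actual code. The function name, the dictionaries and the ID 1111 are hypothetical; it also assumes that BIRD had already replaced the session with one using a new local ID when the old ID arrived.

```python
from typing import Dict, Optional, Tuple


def demux_bfd_packet(your_disc: int, src_addr: str,
                     by_local_disc: Dict[int, str],
                     by_peer_addr: Dict[str, str]) -> Tuple[str, Optional[str]]:
    """Pick the session for a received BFD control packet."""
    if your_disc != 0:
        # A nonzero "Your Discriminator" must match one of our local IDs.
        session = by_local_disc.get(your_disc)
        if session is None:
            # This is the case behind "unknown session id" in the log.
            return ("drop", None)
        return ("accept", session)
    # Zero means the peer has not learned our ID yet (fresh session),
    # so fall back to matching on the source address.
    session = by_peer_addr.get(src_addr)
    if session is None:
        return ("drop", None)
    return ("accept", session)


# Hypothetical state after the session was recreated with a new local ID.
by_local_disc = {1111: "session-to-1.1.1.1"}
by_peer_addr = {"1.1.1.1": "session-to-1.1.1.1"}

# Peer still uses the old ID from the log line: packet is dropped.
print(demux_bfd_packet(3840383752, "1.1.1.1", by_local_disc, by_peer_addr))
# Peer reset its state and sends zero: packet is matched by address.
print(demux_bfd_packet(0, "1.1.1.1", by_local_disc, by_peer_addr))
```

In this sketch, a peer that had properly reset its state would send zero and be matched by address, while a peer that kept or relearned the old ID keeps hitting the drop branch, which is the situation explanations 2) and 3) describe.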
It would be useful to see BFD state change logs from Arista.
We will try to enable this, but we need to be a bit careful.

Thanks a lot for your feedback,
Pavlos