Bird stops responding to BFD messages
pavlos.parissis at gmail.com
Wed Nov 2 16:34:45 CET 2016
On 02/11/2016 12:14 PM, Ondrej Zajicek wrote:
> On Tue, Nov 01, 2016 at 11:03:15PM +0100, Pavlos Parissis wrote:
>> We have 1.4.5 running on ~50 CentOS 7 servers and we have observed that the Bird
>> daemon stops responding to BFD messages, which causes the BGP peering to be stopped
>> and started again.
>> Some details on our setup.
>> Servers have 2 interfaces (north and south) and advertise /32 prefixes to the
>> north and south for IPs assigned to loopback interface.
>> Bird receives a 'Received: Other configuration change' message over BGP from both
>> peers, which are 2 different Arista switches, at the same time. Tracing on the
>> switches shows that Bird didn't respond to 3 BFD messages and Arista informed Bird
>> about it. It is very unlikely that switches or cables are the problem here.
> I have no explanation ready, but one thing seems strange to me - there are
> these lines in log message:
I did another round of investigation on one server and noticed that, just a few
seconds before Bird logged the 'Received: Other configuration change' message:
1. CPU at system level spiked to ~90%
2. A lot of memory allocation/deallocation
3. Some swapping, very minimal, although under normal circumstances we have zero
swapping as the box has enough memory
No external factors caused that; I saw zero increase in incoming traffic to
HAProxy. I was thinking that some process was doing garbage collection, as we run
some perl/python stuff and a daemon from HP for hardware monitoring.
But I still can't explain why this would cause the kernel to drop the BFD messages.
I am thinking that Bird may never have seen the BFD messages from Arista, but Arista
sends 3 at an interval of 400 ms, so it would mean the kernel was dropping BFD
messages for 1.2 seconds, which I find very unlikely.
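
To spell out where the 1.2 seconds comes from, here is a tiny sketch of the
arithmetic. It assumes the simple case where the detection time is just the
interval times the multiplier; the 400 ms and 3 are the values from our Arista
setup, and the function name is made up for illustration.

```python
def bfd_detection_time_ms(interval_ms: int, detect_mult: int) -> int:
    """Silence period after which the peer declares the BFD session down."""
    return interval_ms * detect_mult

# Values from our setup: 400 ms interval, 3 missed packets tolerated.
print(bfd_detection_time_ms(400, 3))  # 1200
```

So every packet in a 1.2 second window would have to be lost or left
unprocessed for Arista to declare the session down.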
>> Nov 01 16:23:00 bird: bfd1: Bad packet from 184.108.40.206 - unknown session id
> Which means that arista switches send BFD packets with (presumably old)
> BIRD session ID, although if BFD on arista detected session down, it
> should reset the old session ID and should start with zero. I see three
> possible explanations:
> 1) The issue was not caused by BFD session going down on Arista
> 2) Arista did not correctly reset its remote session id state when session went down
> 3) BFD packets from BIRD to Arista and BGP shutdown from Arista to BIRD
> were processed simultaneously, which means that after BFD/BGP session
> drop Arista relearned old BIRD session id from a BFD packet that was sent
> before BIRD noticed the session went down.
> It would be useful to see BFD state change logs from Arista.
We will try to enable this, but we need to be a bit careful.
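
On the 'unknown session id' message: as far as I understand RFC 5880
(section 6.8.6), a receiver picks the session from the Your Discriminator
field when it is nonzero and discards the packet if nothing matches. Below is a
rough sketch of that lookup, to check my understanding. It is not BIRD's actual
code; the names, address and discriminator values are made up, and the state
checks the RFC requires for the zero case are left out.

```python
from typing import Dict, Optional

def select_session(by_discriminator: Dict[int, str],
                   by_peer_addr: Dict[str, str],
                   src_addr: str,
                   your_discr: int) -> Optional[str]:
    """Return the matching session name, or None if the packet must be dropped."""
    if your_discr != 0:
        # Nonzero Your Discriminator: it alone selects the session.
        return by_discriminator.get(your_discr)
    # Zero: the peer does not know our discriminator yet, so fall back
    # to the source address (simplified, no session state check here).
    return by_peer_addr.get(src_addr)

# Hypothetical case: the session was re-created with a new local
# discriminator (0x2222) while the peer still echoes the old one (0x1111).
by_discriminator = {0x2222: "arista-north"}
by_peer_addr = {"192.0.2.1": "arista-north"}

print(select_session(by_discriminator, by_peer_addr, "192.0.2.1", 0x1111))  # None
print(select_session(by_discriminator, by_peer_addr, "192.0.2.1", 0))       # arista-north
```

If that is right, a packet carrying an old session ID would be dropped in the
same way in all three of your scenarios, so the Arista logs are what can tell
them apart.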
Thanks a lot for your feedback,