On Wed, Aug 20, 2014 at 01:02:27AM +0000, Chris Caputo wrote:
At the Seattle IX we are using BIRD 1.4.4 for our native (non-VM) route servers.
With one particular IPv4 peer, on two different route servers, I am seeing the "Keepalive timer" count down to zero and then become wedged/stalled. Tcpdump shows no keepalive messages being sent, while it does show them being received from the peer. ... with Hold timer getting updated over time, but the Keepalive timer doesn't change after its initial countdown to zero. The peer eventually signals "ex: Received: Hold timer expired" once it goes 180 seconds without a BGP update, since it also hasn't gotten any keepalive messages.
I've looked at the code and haven't found a problem. The other 64 similarly configured peers on the route server are working fine.
Has anyone seen this or have any suggestions?
Hi

I would guess that the problem is in the TCP connection to the peer: BGP packets are sent but not acknowledged, the TX queue becomes full, and the TX hook is no longer called (the Keepalive timer is restarted in the TX hook when the previously scheduled Keepalive is sent). You should check whether other packets are propagated (e.g. updates from both sides), especially when the connection is already in the keepalive 0/60 state.

-- 
Elen sila lumenn' omentielvo

Ondrej 'Santiago' Zajicek (email: santiago@crfreenet.org)
OpenPGP encrypted e-mails preferred (KeyID 0x11DEADC3, wwwkeys.pgp.net)
"To err is human -- to blame it on a computer is even more so."
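
One way to check the "sent but not acknowledged" theory on a Linux route server is to look at the kernel's send queue for the BGP session. The sketch below is only an illustration, not part of BIRD: it parses text in the format of /proc/net/tcp and reports the tx_queue and rx_queue values for IPv4 sockets using port 179. The function names are hypothetical, and the address decoding assumes a little-endian host. A tx_queue that stays nonzero or keeps growing for the affected peer would be consistent with the peer not acknowledging data, though it does not prove it.

```python
import socket
import struct

BGP_PORT = 179  # standard BGP TCP port


def _decode_endpoint(hex_endpoint):
    """Turn 'AABBCCDD:PPPP' from /proc/net/tcp into (ip, port).

    Assumption: little-endian host, so the IPv4 address bytes are reversed.
    """
    addr_hex, port_hex = hex_endpoint.split(":")
    addr = socket.inet_ntoa(struct.pack("<I", int(addr_hex, 16)))
    return addr, int(port_hex, 16)


def bgp_send_queues(proc_net_tcp_text, port=BGP_PORT):
    """Return queue sizes for every IPv4 TCP socket that uses `port`.

    `proc_net_tcp_text` is the full content of /proc/net/tcp,
    including its header line.
    """
    results = []
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header
        fields = line.split()
        if len(fields) < 5:
            continue
        local = _decode_endpoint(fields[1])
        remote = _decode_endpoint(fields[2])
        if port not in (local[1], remote[1]):
            continue
        # Field 4 is 'tx_queue:rx_queue', both in hex (bytes).
        tx_hex, rx_hex = fields[4].split(":")
        results.append({
            "local": local,
            "remote": remote,
            "state": fields[3],  # '01' means ESTABLISHED
            "tx_queue": int(tx_hex, 16),
            "rx_queue": int(rx_hex, 16),
        })
    return results
```

To use it, read /proc/net/tcp on the route server, pass the text to bgp_send_queues(), and compare the tx_queue of the stalled peer with that of a healthy one a few times over a minute or two.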