BGP Keepalive timer wedging
At the Seattle IX we are using BIRD 1.4.4 for our native (non-VM) route servers. With one particular IPv4 peer, on two different route servers, I am seeing "Keepalive timer" count down to zero and then becoming wedged/stalled. Tcpdump fails to show a keepalive message being sent, while it does show them being received from the peer. We are using the default timer values.

  2014-08-20 00:30:08 <TRACE> ex: Incoming connection from 206.81.80.xx (port 25663) accepted
  2014-08-20 00:30:08 <TRACE> ex: Sending OPEN(ver=4,as=33108,hold=240,id=xxxxxxxx)
  2014-08-20 00:30:09 <TRACE> ex: Got OPEN(as=xxxxx,hold=180,id=xxxxxxxx)
  2014-08-20 00:30:09 <TRACE> ex: Sending KEEPALIVE
  2014-08-20 00:30:09 <TRACE> ex: Got KEEPALIVE
  2014-08-20 00:30:09 <TRACE> ex: BGP session established
  2014-08-20 00:30:09 <TRACE> ex: Connected to table master
  2014-08-20 00:30:09 <TRACE> ex: State changed to feed

In the above, "Sending KEEPALIVE" does correspond to an outgoing keepalive packet, per tcpdump. Actually, two are sent on session startup, and then no more.

"show protocols all ex" shows:

  Hold timer: 126/180
  Keepalive timer: 0/60

etc.:

  Hold timer: 119/180
  Keepalive timer: 0/60

with Hold timer getting updated over time, but the Keepalive timer doesn't change after it has its initial countdown to zero. The peer eventually signals "ex: Received: Hold timer expired" once it goes 180 seconds without a BGP update, since it also hasn't gotten any keepalive messages.

I've looked at the code and haven't found a problem. The other 64 similarly configured peers on the route server are working fine.

Has anyone seen this or have any suggestions?

Thanks,
Chris
Seattle Internet Exchange
On Wed, Aug 20, 2014 at 01:02:27AM +0000, Chris Caputo wrote:
At the Seattle IX we are using BIRD 1.4.4 for our native (non-VM) route servers.
With one particular IPv4 peer, on two different route servers, I am seeing "Keepalive timer" count down to zero and then becoming wedged/stalled. Tcpdump fails to show a keepalive message being sent, while it does show them being received from the peer. ... with Hold timer getting updated over time, but the Keepalive timer doesn't change after it has its initial countdown to zero. The peer eventually signals "ex: Received: Hold timer expired" once it goes 180 seconds without a BGP update, since it also hasn't gotten any keepalive messages.
I've looked at the code and haven't found a problem. The other 64 similarly configured peers on the route server are working fine.
Has anyone seen this or have any suggestions?
Hi

I would guess that the problem is in the TCP connection to the peer: BGP packets are sent but not acknowledged, the TX queue becomes full, and the TX hook is no longer called (the Keepalive timer is restarted in the TX hook when the previously scheduled Keepalive is sent). You should check whether other packets are propagated (e.g. updates from both sides), especially when the connection is already in the keepalive 0/60 state.

--
Elen sila lumenn' omentielvo
Ondrej 'Santiago' Zajicek (email: santiago@crfreenet.org)
OpenPGP encrypted e-mails preferred (KeyID 0x11DEADC3, wwwkeys.pgp.net)
"To err is human -- to blame it on a computer is even more so."
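The mechanism described above can be illustrated with a toy model. This is a sketch only and is not BIRD's code: the class ToySession, its methods, and the two-message queue capacity are all hypothetical, chosen just to show how a timer that is restarted only on a successful write stays at 0/60 once the send queue is full and the peer stops acknowledging data.

```python
class ToySession:
    """Toy model (not BIRD code): the keepalive timer is restarted only
    when a scheduled KEEPALIVE is actually written to the send queue."""

    def __init__(self, keepalive_interval=60, send_queue_capacity=2):
        self.keepalive_interval = keepalive_interval
        self.send_queue_capacity = send_queue_capacity  # hypothetical size
        self.send_queue = 0              # messages written but not yet ACKed
        self.keepalive_scheduled = False
        self.timer = keepalive_interval  # seconds left; 0 means expired
        self.keepalives_written = 0

    def tick(self, seconds):
        """Advance time; the timer stops at zero until something restarts it."""
        self.timer = max(0, self.timer - seconds)
        if self.timer == 0:
            self.keepalive_scheduled = True
        self._try_tx()

    def _try_tx(self):
        """Stand-in for the TX hook: does work only if the queue has room."""
        if self.keepalive_scheduled and self.send_queue < self.send_queue_capacity:
            self.send_queue += 1
            self.keepalives_written += 1
            self.keepalive_scheduled = False
            self.timer = self.keepalive_interval  # the only restart point

    def peer_acks(self, count):
        """The peer acknowledges data, freeing queue space and waking TX."""
        self.send_queue = max(0, self.send_queue - count)
        self._try_tx()

    def status(self):
        return f"Keepalive timer: {self.timer}/{self.keepalive_interval}"


if __name__ == "__main__":
    session = ToySession()
    for _ in range(4):           # four intervals pass, peer never ACKs
        session.tick(60)
    print(session.status(), "written:", session.keepalives_written)
    session.peer_acks(1)         # queue drains, the pending keepalive goes out
    print(session.status(), "written:", session.keepalives_written)
```

In this model the timer is not broken; it simply has nothing to restart it while the queue is full, which is why the display stays at zero rather than counting down again.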
On Wed, 20 Aug 2014, Ondrej Zajicek wrote:
Hi
I would guess that the problem is in the TCP connection to the peer - BGP packets are sent, not acknowledged, TX queue became full and TX hook is not called anymore (Keepalive timer is restarted in TX hook when previously scheduled Keepalive is sent). You should check whether other packets are propagated (e.g. updates from both sides), esp. when the connection is already in keepalive 0/60 state.
Ondrej,

You are correct:

  Proto Recv-Q Send-Q Local Address           Foreign Address         State
  tcp        0  42340 206.81.80.2:179         206.81.80.xx:35237      ESTABLISHED

I should have caught that.

Thank you,
Chris
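A check like the one above could be scripted. The following is a minimal sketch that assumes the six-column layout shown in the netstat output above (Proto, Recv-Q, Send-Q, Local Address, Foreign Address, State); the function name stuck_send_queues and its parameters are hypothetical, and the sample input is the line from this thread.

```python
def stuck_send_queues(netstat_output, port=179, min_send_q=1):
    """Return (foreign_address, send_q) for established TCP connections on
    the given port whose Send-Q is at least min_send_q bytes.

    Assumes the six-column netstat layout shown in the thread.
    """
    results = []
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip headers and anything that is not a TCP connection row.
        if len(fields) < 6 or fields[0] not in ("tcp", "tcp6"):
            continue
        _proto, recv_q, send_q, local, foreign, state = fields[:6]
        if not (recv_q.isdigit() and send_q.isdigit()):
            continue
        # The port is whatever follows the last colon of each address.
        local_port = local.rsplit(":", 1)[-1]
        foreign_port = foreign.rsplit(":", 1)[-1]
        if str(port) not in (local_port, foreign_port):
            continue
        if state == "ESTABLISHED" and int(send_q) >= min_send_q:
            results.append((foreign, int(send_q)))
    return results


SAMPLE = """\
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0  42340 206.81.80.2:179         206.81.80.xx:35237      ESTABLISHED
"""

if __name__ == "__main__":
    for foreign, send_q in stuck_send_queues(SAMPLE):
        print(f"{foreign}: {send_q} bytes unacknowledged in Send-Q")
```

A briefly non-zero Send-Q is normal while data is in flight, so a single sample proves little; a value that stays large across repeated samples is what points to a peer that has stopped acknowledging.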