BGP Keepalive timer wedging

20 Aug 2014

At the Seattle IX we are using BIRD 1.4.4 for our native (non-VM) route 
servers.

With one particular IPv4 peer, on two different route servers, I am seeing 
the "Keepalive timer" count down to zero and then become wedged/stalled.  
Tcpdump shows no keepalive messages being sent, while it does show them 
being received from the peer.
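
For anyone who wants to check a raw capture rather than rely on tcpdump's 
decoding: a BGP KEEPALIVE is just the 19-byte header (16 marker bytes of 
0xFF, a 2-byte length of 19, and a type byte of 4). The following is a 
minimal sketch, not anything taken from BIRD; the function name and the 
idea of feeding it a reassembled TCP payload are assumptions for 
illustration.

```python
import struct

BGP_MARKER = b"\xff" * 16          # all-ones marker defined by RFC 4271
BGP_HEADER_LEN = 19                # marker (16) + length (2) + type (1)
BGP_TYPES = {1: "OPEN", 2: "UPDATE", 3: "NOTIFICATION", 4: "KEEPALIVE"}


def bgp_message_types(payload: bytes) -> list:
    """List the BGP message types found back to back in a TCP payload.

    Hypothetical helper: it assumes the payload starts on a message
    boundary and stops at the first header that does not look valid.
    """
    types = []
    offset = 0
    while offset + BGP_HEADER_LEN <= len(payload):
        marker, length, msg_type = struct.unpack_from("!16sHB", payload, offset)
        if marker != BGP_MARKER or length < BGP_HEADER_LEN:
            break
        types.append(BGP_TYPES.get(msg_type, "UNKNOWN(%d)" % msg_type))
        offset += length
    return types


# A KEEPALIVE carries no body, so its length field is 19 (0x0013).
keepalive = BGP_MARKER + struct.pack("!HB", 19, 4)
print(bgp_message_types(keepalive + keepalive))
```

Run over the payloads in each direction, this makes it easy to confirm 
which side has stopped sending keepalives.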

We are using the default timer values.

2014-08-20 00:30:08 <TRACE> ex: Incoming connection from 206.81.80.xx (port 25663) accepted
2014-08-20 00:30:08 <TRACE> ex: Sending OPEN(ver=4,as=33108,hold=240,id=xxxxxxxx)
2014-08-20 00:30:09 <TRACE> ex: Got OPEN(as=xxxxx,hold=180,id=xxxxxxxx)
2014-08-20 00:30:09 <TRACE> ex: Sending KEEPALIVE
2014-08-20 00:30:09 <TRACE> ex: Got KEEPALIVE
2014-08-20 00:30:09 <TRACE> ex: BGP session established
2014-08-20 00:30:09 <TRACE> ex: Connected to table master
2014-08-20 00:30:09 <TRACE> ex: State changed to feed

In the above, "Sending KEEPALIVE" does correspond to an outgoing keepalive 
packet, per tcpdump.  Actually, two are sent on session startup, and then 
no more.
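
As background for the timer values, the two OPEN messages above carry 
hold=240 (ours) and hold=180 (the peer's). Under RFC 4271 the session uses 
the smaller of the two, and the keepalive interval is conventionally one 
third of the hold time, which is also BIRD's documented default. A minimal 
sketch of that arithmetic, with a hypothetical function name (this is not 
BIRD's code):

```python
def negotiate_timers(local_hold: int, peer_hold: int) -> tuple:
    """Return (hold_time, keepalive_interval) in seconds.

    Assumes the keepalive interval is left at its default of one third
    of the negotiated hold time rather than configured explicitly.
    """
    hold = min(local_hold, peer_hold)
    if hold == 0:
        return 0, 0  # a hold time of zero disables both timers
    if hold < 3:
        raise ValueError("hold time must be 0 or at least 3 seconds")
    return hold, hold // 3


hold, keepalive = negotiate_timers(240, 180)
print(hold, keepalive)  # 180 60
```

So for this session a keepalive should be going out roughly every 60 
seconds.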

"show protocols all ex" shows:

    Hold timer:       126/180
    Keepalive timer:  0/60

and later:

    Hold timer:       119/180
    Keepalive timer:  0/60

The Hold timer keeps getting updated over time, but the Keepalive timer 
doesn't change after its initial countdown to zero.  The peer eventually 
signals "ex: Received: Hold timer expired" once it has gone 180 seconds 
without receiving a BGP update or a keepalive message from us.
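
The symptom can also be checked mechanically from successive "show 
protocols all ex" outputs. This is a rough sketch, assuming the timer lines 
look exactly like the ones quoted above; the function names are 
hypothetical, and collecting the samples (for example, a few seconds apart 
via birdc) is left out.

```python
import re

# Matches lines such as "    Hold timer:       126/180"
TIMER_RE = re.compile(r"^\s*(Hold|Keepalive) timer:\s+(\d+)/(\d+)\s*$", re.MULTILINE)


def parse_timers(text):
    """Return {'Hold': (remaining, total), 'Keepalive': (remaining, total)}."""
    return {name: (int(rem), int(total)) for name, rem, total in TIMER_RE.findall(text)}


def keepalive_wedged(samples):
    """True if the keepalive timer reads 0 in every sample while the
    hold timer keeps changing, i.e. the session is alive but the
    keepalive timer never restarted."""
    timers = [parse_timers(s) for s in samples]
    if len(timers) < 2 or any("Hold" not in t or "Keepalive" not in t for t in timers):
        return False
    keepalive_stuck = all(t["Keepalive"][0] == 0 for t in timers)
    hold_moving = len({t["Hold"][0] for t in timers}) > 1
    return keepalive_stuck and hold_moving


samples = [
    "    Hold timer:       126/180\n    Keepalive timer:  0/60\n",
    "    Hold timer:       119/180\n    Keepalive timer:  0/60\n",
]
print(keepalive_wedged(samples))  # True
```

A healthy session would show the keepalive value being reset toward 60 
between samples, so the check returns False there.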

I've looked at the code and haven't found a problem.  The other 64 
similarly configured peers on the route server are working fine.

Has anyone seen this, or does anyone have any suggestions?

Thanks,
Chris
Seattle Internet Exchange

Chris Caputo
