BGP Connection reset on fast timers
Hi All,

I am testing BGP stability with extremely aggressive timers: 1 second keepalive and 3 seconds hold (1/3). I see BGP flapping 2-3 times every 24 hours.

Attached is a packet capture taken on the system running bird. 10.194.8.11 is a Cisco ASR; 10.194.8.13 is the bird instance.

I see multiple keepalives coming in from .11, but the bird instance responds with "Hold Timer Expired". This happened after about 12 hours in the Established state. The connection opens right back up again.

I don't see the same issue after raising the timers to 3/9 seconds.

Has anyone seen anything like this before?

Thanks,
Saksham
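To make the 1/3 versus 3/9 comparison concrete, here is a minimal Python sketch of the hold timer rule: the timer is re-armed each time a KEEPALIVE (or UPDATE) is processed, and the session drops if nothing is processed for a full hold time. This is not BIRD's code. The function names, the 30-second window and the 3-second gap starting at t=12.5 are assumptions chosen for illustration only.

```python
def processed_keepalives(keepalive, duration, gap_start, gap_len):
    """Times (s) at which peer keepalives are processed by the receiver.

    The peer sends one keepalive every `keepalive` seconds. Keepalives that
    fall inside the hypothetical gap [gap_start, gap_start + gap_len) are
    treated as never processed.
    """
    times = []
    for i in range(1, int(duration // keepalive) + 1):
        t = float(i * keepalive)
        if gap_start <= t < gap_start + gap_len:
            continue  # lost or not handled during the gap
        times.append(t)
    return times


def hold_expiry(rx_times, hold_time):
    """Return the time at which the hold timer fires, or None if it never does."""
    deadline = float(hold_time)  # timer armed at t = 0
    for t in rx_times:
        if t >= deadline:
            return deadline  # nothing was processed in time
        deadline = t + hold_time  # each processed message re-arms the timer
    return None


# Same 3-second gap, two timer settings (keepalive/hold).
fast = hold_expiry(processed_keepalives(1, 30, 12.5, 3.0), hold_time=3)
slow = hold_expiry(processed_keepalives(3, 30, 12.5, 3.0), hold_time=9)
print("1/3:", fast)
print("3/9:", slow)
```

With 1/3, missing three consecutive keepalives is already enough to expire the timer; with 3/9 the same gap costs only one keepalive and the session survives.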
Hello!

It could be anything, including waiting for Netlink, the CLI socket or other BIRDs, or even something totally unrelated that eats the three seconds. I'm quite surprised that it flaps only a few times a day.

If I remember correctly, somebody used a 5/15 setup and still had to take a lot of care to keep the links up.

By the way, is there any good reason to have such short timeouts?

Maria

On June 11, 2018 8:51:06 PM GMT+02:00, saksham <saksham.manchanda@secure64.com> wrote:
Hi All,
I am testing BGP stability with extremely aggressive timers. The timers are 1/3. I see BGP flapping 2-3 times every 24 hours. Attached is a packet capture, captured on the system running bird.
Just a comment: here we use 5/15 on some 10GE links between Redback/Ericsson/SmartEdge and Cisco routers (so, unrelated to BIRD and Linux) with success (it never flaps if the link is OK). These links carry L2TP tunnel traffic.

The use case was:
1) There are some intermediate switches on the links (so a cut cannot always be detected quickly).
2) L2TP timers are aggressive, and it matters to switch to another path quickly enough to avoid L2TP tunnel disconnections, which in turn would disconnect several tens of thousands of PPP sessions and users.
3) BFD wasn't an option (between two different operators).

Olivier
On June 12, 2018 at 11:09, Maria Jan Matějka <jan.matejka@nic.cz> wrote:
If I remember it correctly, there was somebody who used a 5/15 setup and still had to take a lot of care to keep the links up. By the way, is there any good reason to have so short timeouts?
Packet 35 shows .13, which is the BIRD instance running on VMware (sorry about that), and it clearly thinks the hold time expired:

Major error Code: Hold Timer Expired (4)
Minor error Code (Hold Timer Expired): 0

It might be worth running bird with debugging enabled to see what else it says. Have you considered BFD? Maybe try a different virtualization platform (e.g. KVM), or no virtualization at all.

On Tue, Jun 12, 2018 at 3:42 AM, Olivier Benghozi <olivier.benghozi@wifirst.fr> wrote:
Just a comment:
here we use 5/15 on some 10GE links between Redback/Ericsson/SmartEdge and Cisco routers (so, unrelated to BIRD and Linux) with success (never flaps if the link is OK). These links are used to receive/transmit L2TP tunnels traffic.
-- Regards, Dave Seddon +1 415 310 4086
Thanks for the reply. I have been able to reproduce this on KVM too.

This is the output from running in debug mode. The RX hook lines stop after 15:06:12 (that part was highlighted in the original message). I've added some additional debug output in the I/O loop and am running a simultaneous packet capture. I'll report back when I have more.

Jun 13 15:06:06 localhost bird: BGP: Scheduling packet type 4
Jun 13 15:06:06 localhost bird: BGP: kicking TX
Jun 13 15:06:06 localhost bird: BGP: RX hook: Got 19 bytes
Jun 13 15:06:06 localhost bird: BGP: Got packet 04 (19 bytes)
Jun 13 15:06:07 localhost bird: BGP: Keepalive timer
Jun 13 15:06:07 localhost bird: BGP: Scheduling packet type 4
Jun 13 15:06:07 localhost bird: BGP: kicking TX
Jun 13 15:06:07 localhost bird: BGP: RX hook: Got 19 bytes
Jun 13 15:06:07 localhost bird: BGP: Got packet 04 (19 bytes)
Jun 13 15:06:08 localhost bird: BGP: Keepalive timer
Jun 13 15:06:08 localhost bird: BGP: Scheduling packet type 4
Jun 13 15:06:08 localhost bird: BGP: kicking TX
Jun 13 15:06:08 localhost bird: BGP: RX hook: Got 19 bytes
Jun 13 15:06:08 localhost bird: BGP: Got packet 04 (19 bytes)
Jun 13 15:06:09 localhost bird: BGP: Keepalive timer
Jun 13 15:06:09 localhost bird: BGP: Scheduling packet type 4
Jun 13 15:06:09 localhost bird: BGP: kicking TX
Jun 13 15:06:09 localhost bird: BGP: RX hook: Got 19 bytes
Jun 13 15:06:09 localhost bird: BGP: Got packet 04 (19 bytes)
Jun 13 15:06:10 localhost bird: BGP: Keepalive timer
Jun 13 15:06:10 localhost bird: BGP: Scheduling packet type 4
Jun 13 15:06:10 localhost bird: BGP: kicking TX
Jun 13 15:06:10 localhost bird: BGP: RX hook: Got 19 bytes
Jun 13 15:06:10 localhost bird: BGP: Got packet 04 (19 bytes)
Jun 13 15:06:11 localhost bird: BGP: Keepalive timer
Jun 13 15:06:11 localhost bird: BGP: Scheduling packet type 4
Jun 13 15:06:11 localhost bird: BGP: kicking TX
Jun 13 15:06:11 localhost bird: BGP: RX hook: Got 19 bytes
Jun 13 15:06:11 localhost bird: BGP: Got packet 04 (19 bytes)
Jun 13 15:06:12 localhost bird: BGP: Keepalive timer
Jun 13 15:06:12 localhost bird: BGP: Scheduling packet type 4
Jun 13 15:06:12 localhost bird: BGP: kicking TX
Jun 13 15:06:12 localhost bird: BGP: RX hook: Got 19 bytes
Jun 13 15:06:12 localhost bird: BGP: Got packet 04 (19 bytes)
Jun 13 15:06:13 localhost bird: BGP: Keepalive timer
Jun 13 15:06:13 localhost bird: BGP: Scheduling packet type 4
Jun 13 15:06:13 localhost bird: BGP: kicking TX
Jun 13 15:06:14 localhost bird: BGP: Keepalive timer
Jun 13 15:06:14 localhost bird: BGP: Scheduling packet type 4
Jun 13 15:06:14 localhost bird: BGP: kicking TX
Jun 13 15:06:15 localhost bird: BGP: Keepalive timer
Jun 13 15:06:15 localhost bird: BGP: Scheduling packet type 4
Jun 13 15:06:15 localhost bird: BGP: kicking TX
Jun 13 15:06:15 localhost bird: BGP: Hold timeout
Jun 13 15:06:15 localhost bird: BGP: Scheduling packet type 3
Jun 13 15:06:15 localhost bird: BGP: Updating startup delay
Jun 13 15:06:15 localhost bird: bgp1: Error: Hold timer expired

On 06/14/2018 11:10 PM, dave seddon wrote:
Might be worth trying to run bird debugging to see what else it says. Have you considered BFD? Maybe try running different virtualization (e.g. KVM), or no virtualization.
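For anyone who wants to check longer debug logs for the same pattern, the following Python sketch measures how long before a "Hold timeout" line the last "RX hook" line was logged. It assumes the syslog line layout shown in the debug output above and takes the year from the thread dates, since syslog timestamps omit it. The function name and the SAMPLE excerpt are illustrative, not part of BIRD.

```python
from datetime import datetime


def rx_gap_before_hold_timeout(lines, year=2018):
    """Seconds between the last 'RX hook' line and the first 'Hold timeout' line.

    Returns None if no hold timeout (or no earlier RX) is found. Timestamps
    are the first 15 characters of each line, e.g. 'Jun 13 15:06:12'.
    """
    last_rx = None
    for line in lines:
        # Prepend the assumed year so the date parses unambiguously.
        stamp = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        if "RX hook" in line:
            last_rx = stamp
        elif "Hold timeout" in line:
            if last_rx is None:
                return None
            return (stamp - last_rx).total_seconds()
    return None


# Excerpt copied from the debug output in this thread.
SAMPLE = [
    "Jun 13 15:06:12 localhost bird: BGP: RX hook: Got 19 bytes",
    "Jun 13 15:06:12 localhost bird: BGP: Got packet 04 (19 bytes)",
    "Jun 13 15:06:13 localhost bird: BGP: Keepalive timer",
    "Jun 13 15:06:14 localhost bird: BGP: Keepalive timer",
    "Jun 13 15:06:15 localhost bird: BGP: Keepalive timer",
    "Jun 13 15:06:15 localhost bird: BGP: Hold timeout",
]

print(rx_gap_before_hold_timeout(SAMPLE))
```

Syslog timestamps have one-second resolution, so the result is only approximate. Comparing it against the simultaneous packet capture would show whether keepalives were still arriving on the wire during that gap.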
participants (4)
- dave seddon
- Maria Jan Matějka
- Olivier Benghozi
- saksham