BIRD drops specific IPv6 session for no reason
Hi BIRD community,
We are investigating a strange customer issue involving our BIRD route servers (version 1.6.3) and one specific IPv6 session. The customer reports sudden drops of his IPv6 session, and so far we have not been able to relate those drops to any issue or instability. Everything seems normal, and no other customer complained at the time of the incidents.
After capturing packets during one of the events, we discovered that BIRD does not send any messages in response to the customer’s BGP keepalive messages (see attached picture), which causes the BGP hold timer to expire and the session to be dropped. We observed this anomaly on both route servers, but at different times. The tcpdump capture was running on the interface through which BIRD sends all BGP traffic to customers. We were not doing any maintenance or other RS-related work at the time of the event.
Has any of you experienced this in the past? If so, how did you solve it? Any feedback is welcome.
Best regards,
Stavros Konstantaras | Sr. Network Engineer | AMS-IX M +31 (0) 620 89 51 04 | T +31 20 305 8999 ams-ix.net
Double-check that your router has an ARP entry and a route for that peer when this happens. For example, if your router picks up a wrong route for the peer, it can send the response packets (or, in some cases, ARP requests) out of the wrong interface. So capture on your other interfaces at the same time and you will see what it does. Watching the route and ARP tables with a suitable grep probably helps too, and -n is also your friend if this happens very often.
-- F-Solutions Oy Tapio Haapala PL7, 90571 Oulu GSM +358400998371 Skype burner- IRC Burner@ircnet
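The neighbour-table check suggested above can be scripted. For an IPv6 session the kernel's neighbour (NDP) table plays the role that ARP plays for IPv4. The following is a minimal sketch, assuming the usual one-line-per-entry output of `ip -6 neigh show` from iproute2; the sample text, addresses and interface names are hypothetical and not taken from the thread.

```python
import ipaddress

# Hypothetical sample of `ip -6 neigh show` output (documentation addresses).
SAMPLE = """\
2001:db8::1 dev eth1 lladdr 00:11:22:33:44:55 router REACHABLE
2001:db8::2 dev eth1 FAILED
"""

def parse_neigh(output):
    """Map each neighbour address to its device, link-layer address and state."""
    entries = {}
    for line in output.splitlines():
        tokens = line.split()
        # Need at least "<addr> dev <ifname> <STATE>", with "dev" not last.
        if len(tokens) < 4 or "dev" not in tokens[:-1]:
            continue
        try:
            addr = ipaddress.ip_address(tokens[0])
        except ValueError:
            continue
        dev = tokens[tokens.index("dev") + 1]
        lladdr = None
        if "lladdr" in tokens[:-1]:
            lladdr = tokens[tokens.index("lladdr") + 1]
        entries[addr] = {"dev": dev, "lladdr": lladdr, "state": tokens[-1]}
    return entries

def peer_resolved(entries, peer, expected_dev):
    """True if the peer has a usable entry on the interface we expect."""
    entry = entries.get(ipaddress.ip_address(peer))
    return (
        entry is not None
        and entry["dev"] == expected_dev
        and entry["lladdr"] is not None
        and entry["state"] not in ("FAILED", "INCOMPLETE")
    )
```

Feeding it the real command output at regular intervals (and logging whenever `peer_resolved` turns false) would show whether the entry disappears or moves to another interface around the time of a drop.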
Hi Tapio,
Good point as well, but I don’t have access to the customer’s router. I can only touch my Linux server, and from there the ARP entry appears to be present, since the BGPv4 session stays up (which means the switches in between still have a valid entry in their MAC tables). Only the BGPv6 session drops, and when it does, the log output does not really help:
Feb 20 21:46:11 rs1-mng bird6: 2001:7F8:1::A500:19:7727:1: Received: Hold timer expired
Feb 20 21:46:11 rs1-mng bird6: 2001:7F8:1::A500:19:7727:1: BGP session closed
Feb 20 21:46:11 rs1-mng bird6: 2001:7F8:1::A500:19:7727:1: State changed to stop
Feb 20 21:46:11 rs1-mng bird6: 2001:7F8:1::A500:19:7727:1: Down
Feb 20 21:46:11 rs1-mng bird6: 2001:7F8:1::A500:19:7727:1: State changed to down
Best regards,
Stavros Konstantaras | Sr. Network Engineer | AMS-IX
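To see how often and for which sessions this happens, the hold-timer lines can be pulled out of the log. This is a small sketch that assumes the syslog layout shown in the excerpt above (timestamp, host, daemon, BIRD protocol name, message); the regular expression and function name are illustrative, not part of BIRD.

```python
import re

# Layout assumed from the log excerpt: "<Mon DD HH:MM:SS> <host> bird6: <proto>: <message>"
LOG_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s[\d:]{8})\s(?P<host>\S+)\s(?P<daemon>bird6?):\s"
    r"(?P<proto>\S+):\s(?P<msg>.*)$"
)

def hold_timer_events(lines):
    """Return (timestamp, protocol name, message) for every hold-timer expiry."""
    events = []
    for line in lines:
        match = LOG_RE.match(line.strip())
        if match and "Hold timer expired" in match.group("msg"):
            events.append((match.group("ts"), match.group("proto"), match.group("msg")))
    return events

log = [
    "Feb 20 21:46:11 rs1-mng bird6: 2001:7F8:1::A500:19:7727:1: Received: Hold timer expired",
    "Feb 20 21:46:11 rs1-mng bird6: 2001:7F8:1::A500:19:7727:1: BGP session closed",
    "Feb 20 21:46:11 rs1-mng bird6: 2001:7F8:1::A500:19:7727:1: State changed to down",
]
events = hold_timer_events(log)
```

Collecting these tuples over a few months gives the timestamps needed to line the drops up with packet captures or other system events.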
Hi Stavros,
On Fri 28 Feb 12:41:24 2020, Stavros Konstantaras wrote:
Has any of you experienced this in the past? If yes, how did you solve this? Any related feedback is welcomed.
Do you know the MSS that was negotiated when the session was established? I had a session with Edgecast (Verizon Media) flapping on AMS-IX because both sides had an MTU of 9216 on the port. Some switch along the path did not handle that well, and occasionally a packet was lost. If the lost packet is the one carrying the keepalive, the session goes down.
I resolved it by setting an MTU of 1514 on my side (which is what it should have been all along).
Also, note that I’m not directly connected to the IXP; I’m going through a reseller.
Regards,
-- Alarig
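For reference, the MSS a TCP endpoint advertises follows from the interface MTU by subtracting the fixed IP and TCP header sizes. The sketch below works through the numbers mentioned above; it assumes no IP or TCP options, and it assumes that the 1514 figure is an Ethernet frame size (1500 bytes of IP plus a 14-byte Ethernet header), which the message does not state explicitly.

```python
IPV4_HEADER = 20      # bytes, without options
IPV6_HEADER = 40      # bytes, fixed header only
TCP_HEADER = 20       # bytes, without options
ETHERNET_HEADER = 14  # bytes, untagged frame without FCS

def tcp_mss(ip_mtu, ipv6=True):
    """Largest TCP payload that fits in one packet of the given IP MTU."""
    ip_header = IPV6_HEADER if ipv6 else IPV4_HEADER
    return ip_mtu - ip_header - TCP_HEADER

jumbo_mss = tcp_mss(9216)                       # IPv6 over a 9216-byte MTU
standard_mss = tcp_mss(1514 - ETHERNET_HEADER)  # assumption: 1514 is the frame size
```

If the two ends of a session advertise very different MSS values in their SYN packets, that is a hint that large segments may be crossing a path that cannot carry them.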
Hi Alarig,
Thank you for sharing your experience. I don’t have the MSS at the moment, but if that were the cause, wouldn’t we have seen the drops more frequently? Currently it happens about once per month (roughly 0.8 times per month), and unlike your case, which was 100% network related, in ours we don’t even see the reply packet being generated and leaving the box.
What also puzzles me, based on the capture, is that I don’t see any TCP ACKs being sent to the customer. If BIRD opens a TCP socket (not a plain RAW socket), I assume the TCP connection is handled by the OS and BIRD just pushes data segments (BGP keepalive messages) when they are ready. But in the capture I don’t see the TCP ACKs at all. Is BIRD handling the TCP communication itself as well?
Good point about the MSS, though; I will try to check it at the next incident.
Thanks,
Best regards,
Stavros Konstantaras | Sr. Network Engineer | AMS-IX
On Fri, Feb 28, 2020 at 03:33:06PM +0100, Stavros Konstantaras wrote:
What puzzles me also and based on the capture, is that I don’t see the TCP-ACK messages being sent to the customer. If BIRD opens a TCP socket (not a simple RAW socket), I assume that the TCP connection will be handled by the OS and BIRD will push data segments (BGP keep alive messages) when ready.
But as per output, I don’t see the TCP ack messages at all. Is BIRD handling the TCP communication as well?
Hi,
That is a good point. BIRD uses a regular TCP socket, so if you do not see TCP ACKs, it is likely an issue in the underlying kernel. There were some reports of IPv6 issues in recent kernels [*].
Also, the log message:
Feb 20 21:46:11 rs1-mng bird6: 2001:7F8:1::A500:19:7727:1: Received: Hold timer expired
shows that the notification message was received by BIRD, and the packet dump shows that keepalives were not sent from the BIRD side. You could enable 'debug all' for the given peer to see whether BIRD tries to send keepalives. You could also monitor the state of the socket using the 'ss' tool.
[*] https://bird.network.cz/pipermail/bird-users/2020-February/014270.html
-- Elen sila lumenn' omentielvo
Ondrej 'Santiago' Zajicek (email: santiago@crfreenet.org) OpenPGP encrypted e-mails preferred (KeyID 0x11DEADC3, wwwkeys.pgp.net) "To err is human -- to blame it on a computer is even more so."
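The division of labour described above can be illustrated on loopback: with a regular TCP socket the kernel completes the handshake and queues incoming data before the application has even called accept(). This is a generic sketch and not BIRD code; the function name is made up, and it uses IPv4 loopback only so that it runs on hosts without IPv6. The payload is a BGP KEEPALIVE, which per RFC 4271 is just the 19-byte header.

```python
import socket

# BGP KEEPALIVE: 16-byte all-ones marker, 2-byte length (19), 1-byte type (4).
KEEPALIVE = b"\xff" * 16 + (19).to_bytes(2, "big") + bytes([4])

def send_before_accept():
    """Send a keepalive to a listener whose application has not accepted yet."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))   # port 0: let the kernel pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]

    # The kernel answers the SYN and queues the data on behalf of the listener;
    # no application code on the listening side has run for this connection yet.
    client = socket.create_connection(("127.0.0.1", port), timeout=5)
    client.sendall(KEEPALIVE)

    # Only now does the "application" pick the connection up and read the bytes.
    conn, _ = listener.accept()
    conn.settimeout(5)
    data = b""
    while len(data) < len(KEEPALIVE):
        chunk = conn.recv(len(KEEPALIVE) - len(data))
        if not chunk:
            break
        data += chunk

    conn.close()
    client.close()
    listener.close()
    return data
```

The same reasoning applies to an established session: acknowledgements for received segments come from the kernel, so their absence in a capture points below the application.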
Hi,
Could it be an I/O issue? We had similar problems with BIRD staying in one iteration of its I/O loop for so long that hold timers had expired by the time it finished. It was probably caused by writing a log file to a busy HDD. We saw the same thing with syslog too, because that write also blocks BIRD.
Nevertheless, in your case the OS should still have been replying something on the TCP session: either acknowledging the segments or signalling that the window is full. As far as I know BIRD does not have its own TCP stack, so the OS is to blame for that part. It may be stuck because of some bug, or, as others suggested, it may be sending the packets somewhere else or not know where to send them.
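The effect of a blocking call on keepalives can be shown with a toy model of a single-threaded loop running on a virtual clock. This is not BIRD's actual event loop; the 30-second keepalive interval, 90-second hold time and 100-second stall are illustrative assumptions, and the function names are made up.

```python
def keepalive_send_times(interval, stalls, until):
    """Times at which keepalives actually leave a single-threaded loop.

    stalls is a list of (start, duration) pairs during which the loop is
    stuck in a blocking call and cannot service its timers.
    """
    sent = []
    due = 0.0
    stalls = sorted(stalls)
    i = 0
    while due <= until:
        send_at = due
        # A stall that began before the timer fires delays the send until it ends.
        while i < len(stalls) and stalls[i][0] <= send_at:
            start, duration = stalls[i]
            send_at = max(send_at, start + duration)
            i += 1
        sent.append(send_at)
        due = send_at + interval
    return sent

def longest_gap(times):
    """Longest silence between two consecutive keepalives."""
    return max(later - earlier for earlier, later in zip(times, times[1:]))

HOLD_TIME = 90  # seconds, assumed value negotiated with the peer
times = keepalive_send_times(interval=30, stalls=[(40, 100)], until=200)
peer_would_time_out = longest_gap(times) > HOLD_TIME
```

In this model a single 100-second stall produces a 110-second silence, which is longer than the hold time, so the peer would send a "Hold timer expired" notification even though nothing is wrong on the network.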
Hi Alexander,
In general we try to keep the RS as light as possible, which means we do not run unnecessary applications or packet captures on it. We didn’t observe any busy-HDD issues, but that is a valid point too. Based on Ondrej’s feedback, it seems to be a kernel issue, maybe a bug or a scalability problem, so I will schedule a maintenance window to update the kernel on the servers and see whether that fixes the problem.
Best regards,
Stavros Konstantaras | Sr. Network Engineer | AMS-IX
Hi Ondrej,
Great feedback, thanks a lot for sharing it with us. We are currently running kernel 3.16.39-1+deb8u2; do you remember whether this kernel was among those complained about? We will schedule a maintenance window to update the kernel on both route servers and see whether that solves the issue.
Thank you for the “debug” tip as well; I wasn’t aware that it gives such detailed output, down to the keepalives sent to a peer. I will give it a shot.
Best regards,
Stavros Konstantaras | Sr. Network Engineer | AMS-IX
On Tue, Mar 03, 2020 at 09:09:58AM +0100, Stavros Konstantaras wrote:
Hi Ondrej,
Great feedback, thanks a lot for sharing it with us. We are currently running kernel 3.16.39-1+deb8u2; do you remember whether this kernel was among those complained about?
Hi,
See the link I sent you in the previous mail. It is an issue in newer kernels (Debian 10); people noted that it can be avoided by downgrading to 4.9, so it is unlikely to be the issue in your case.
-- Ondrej 'Santiago' Zajicek (email: santiago@crfreenet.org)
participants (5)
- Alarig Le Lay
- Alexander Zubkov
- Ondrej Zajicek
- Stavros Konstantaras
- Tapio Haapala