Hi Alarig,

Thank you for sharing your experience. I don’t have the MSS at hand, but if that were the cause, wouldn’t we have experienced the drops more frequently?
Currently it happens about once per month (roughly 0.8 times per month) and, contrary to your case, which was 100% network related, in our case we don’t even see the
reply packet being generated and leaving the box.

What also puzzles me, based on the capture, is that I don’t see any TCP ACK segments being sent to the customer. If BIRD opens a TCP socket
(not a raw socket), I assume that the TCP connection is handled by the OS and that BIRD only pushes data segments (BGP keepalive messages) when ready.

But in the capture output I don’t see the TCP ACKs at all. Is BIRD handling the TCP communication itself as well?
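
To illustrate the assumption above, here is a minimal sketch in Python (not BIRD code) of how an ordinary stream socket behaves: the kernel completes the handshake and buffers incoming data before the application has called accept() or recv(), and acknowledging received segments is likewise the kernel’s job. The function name and the loopback setup are hypothetical; the payload is the 19-byte BGP KEEPALIVE as defined in RFC 4271.

```python
import socket

# A BGP KEEPALIVE is just the 19-byte header: 16-byte marker, length 19, type 4.
BGP_KEEPALIVE = b"\xff" * 16 + b"\x00\x13" + b"\x04"

def send_before_accept(payload: bytes = BGP_KEEPALIVE) -> bytes:
    """Send payload to a listener that has not yet called accept(); return what it reads later."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))   # port 0: let the OS pick a free port
    listener.listen(1)
    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client.settimeout(5)
    try:
        # connect() returns once the kernel has finished the three-way
        # handshake; the listening application has done nothing yet.
        client.connect(listener.getsockname())
        client.sendall(payload)       # queued and transmitted by the kernel
        # Only now does the "application" side get involved.
        listener.settimeout(5)
        conn, _ = listener.accept()
        conn.settimeout(5)
        try:
            received = b""
            while len(received) < len(payload):
                chunk = conn.recv(1024)
                if not chunk:
                    break
                received += chunk
            return received
        finally:
            conn.close()
    finally:
        client.close()
        listener.close()

if __name__ == "__main__":
    print(send_before_accept() == BGP_KEEPALIVE)
```

If BIRD uses a regular stream socket in the same way, missing ACKs in the capture would point at something below the daemon (kernel, NIC, or the capture point) rather than at BIRD itself; that is an inference, not something the capture proves.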


But the MSS is a good point; I will check it as well at the next incident. Thanks.
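
For reference when checking, a small sketch of the MSS one would expect for a given MTU, assuming the usual rule that the advertised MSS is the MTU minus the fixed IP and TCP headers (no extension headers or TCP options counted). The function name is hypothetical, and 1500 is used here only as the standard Ethernet MTU for comparison.

```python
IPV4_HEADER = 20   # bytes, fixed IPv4 header without options
IPV6_HEADER = 40   # bytes, fixed IPv6 header without extension headers
TCP_HEADER = 20    # bytes, fixed TCP header without options

def expected_mss(mtu: int, ipv6: bool = True) -> int:
    """Return the MSS a host would normally advertise for this interface MTU."""
    ip_header = IPV6_HEADER if ipv6 else IPV4_HEADER
    return mtu - ip_header - TCP_HEADER

if __name__ == "__main__":
    for mtu in (1500, 9216):
        print(mtu, expected_mss(mtu))
```

For an IPv6 session this gives 1440 for an MTU of 1500 and 9156 for an MTU of 9216, so an MSS far above 1440 in the SYN of the affected session would be a hint that jumbo frames are in play.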


Best regards,

Stavros Konstantaras | Sr. Network Engineer | AMS-IX 
M +31 (0) 620 89 51 04 | T +31 20 305 8999
ams-ix.net




On 28 Feb 2020, at 14:12, Alarig Le Lay <alarig@swordarmor.fr> wrote:

Hi Stavros,

On Fri. 28 Feb. 12:41:24 2020, Stavros Konstantaras wrote:
Hi Bird community,

We are investigating a weird customer issue regarding our Bird Route
Servers (version 1.6.3) and a specific IPv6 session. The customer reports
sudden drops of his IPv6 session and, until now, we could not relate
those drops to any issue or instability. Everything seems normal and
no other customer complained at the moment of the incident.



After some packet capturing at the moment of the event, we discovered
that BIRD does not send response messages to the customer’s BGP
keepalive messages (see attached picture), which results in the BGP
hold timer expiring and the session being dropped. We observed this
anomaly on both RSs, but at different time slots, and the tcpdump
capture was running on the interface where Bird sends all BGP
traffic for customers. At the moment of the event, we weren’t doing any
maintenance or other RS-related work.

Have any of you experienced this in the past? If yes, how did you solve
it?
Any related feedback is welcome.

Do you have the MSS used to establish the session? I had an issue with
a session to edgecast (verizonmedia) flapping on AMS-IX because both
of us had an MTU of 9216 on our ports. But some switch didn’t handle
it well and sometimes a packet was lost. If it’s the one containing the
keepalive, the session goes down.

I resolved it by setting an MTU of 1514 on my side (which is what it
should have been all along).

Also, note that I’m not directly connected to the IXP, I’m using a
reseller.

Regards,
--
Alarig