On the possibility of updating BGP passwords without network disruption.
Hi BIRD users,

Does anyone know whether a BGP shared secret can be rotated without incurring any network downtime? I did some testing with the BGP password functionality offered and it appears that any update to the BGP password configuration incurs a brief network outage with both existing/new connections. It seems like something about the way BIRD is restarting is leading to it pulling down learned routes immediately as opposed to letting them live according to the timeout setting. Does BIRD flush all routes it has learned when this configuration changes? Here is a brief excerpt to demonstrate the outage. Take note that the network disruption precisely matches the timestamp at which BIRD is reconfigured:

# Logs from calico-node-4w4bv (10.95.14.104)
2022-06-29 18:26:18.836 [INFO][60] confd/client.go 1415: Trigger to recheck BGP peers following possible password update
2022-06-29 18:26:18.836 [INFO][60] confd/client.go 250: Recompute v1 BGP peerings
2022-06-29 18:26:18.836 [INFO][60] confd/client.go 949: Recompute BGP peerings:
bird: Reconfiguration requested by SIGHUP
2022-06-29 18:26:18.844 [INFO][60] confd/resource.go 278: Target config /etc/calico/confd/config/bird.cfg has been updated due to change in key: /calico/bgp/v1/global
bird: Reconfiguring
bird: device1: Reconfigured
bird: direct1: Reconfigured
bird: Restarting protocol Mesh_10_95_14_105
bird: Mesh_10_95_14_105: Shutting down
bird: Mesh_10_95_14_105: State changed to stop
bird: Restarting protocol Mesh_10_95_14_110
bird: Mesh_10_95_14_110: Shutting down
bird: Mesh_10_95_14_110: State changed to stop
bird: Mesh_10_95_14_105: State changed to down
bird: Mesh_10_95_14_105: Initializing
bird: Mesh_10_95_14_105: Starting
bird: Mesh_10_95_14_105: State changed to start
bird: Mesh_10_95_14_110: State changed to down
bird: Mesh_10_95_14_110: Initializing
bird: Mesh_10_95_14_110: Starting
bird: Mesh_10_95_14_110: State changed to start
bird: Reconfigured
bird: Mesh_10_95_14_110: Connected to table master
bird: Mesh_10_95_14_110: State changed to feed
bird: Mesh_10_95_14_110: State changed to up
bird: Mesh_10_95_14_105: Connected to table master
bird: Mesh_10_95_14_105: State changed to feed
bird: Mesh_10_95_14_105: State changed to up
2022-06-29 18:26:36.079 [INFO][58] monitor-addresses/autodetection_methods.go 117: Using autodetected IPv4 address 10.95.14.104/26 on matching interface eth0
2022-06-29 18:26:54.537 [INFO][62] felix/summary.go 100: Summarising 9 dataplane reconciliation loops over 1m2.1s: avg=5ms longest=21ms (resync-filter-v4,resync-nat-v4)

# Connection Tester Daemonset (hitting an echoserver twice a second or so)
Wed Jun 29 18:26:17 UTC 2022
Successful echo server connection
Wed Jun 29 18:26:17 UTC 2022
Wed Jun 29 18:26:18 UTC 2022
Successful echo server connection
Wed Jun 29 18:26:18 UTC 2022
Wed Jun 29 18:26:18 UTC 2022
curl: (28) Connection timed out after 300 milliseconds
Failed to connect to echo server
Wed Jun 29 18:26:19 UTC 2022
Wed Jun 29 18:26:19 UTC 2022
curl: (28) Connection timed out after 300 milliseconds
Failed to connect to echo server
Wed Jun 29 18:26:20 UTC 2022
Wed Jun 29 18:26:20 UTC 2022
Successful echo server connection
Wed Jun 29 18:26:20 UTC 2022

# For peer /host/10.95.14.81/ip_addr_v4
protocol bgp Mesh_10_95_14_81 from bgp_template {
    neighbor 10.95.14.81 as 64512;
    source address 10.95.14.82; # The local address we use for the TCP connection
    graceful restart time 1800; # This parameter seems to make no difference when changing BGP passwords
    password "LJiKASiglY+KafEwEn/cSmkiok0zHgpQq5EtYhYgoDcSQwKIpX22Tz7jOzX+";
}

I have perused the RFCs for both BGP Graceful Restart (4724) & Secure BGP Sessions (2385) but haven't found a solid answer yet. When the password is changed it makes complete sense that any peers with the new password will refuse to accept any NEW routes received from peers using the old password and vice versa. I don't see the fundamental reason why TCP segments arriving with some unexpected hash necessitates that previously learned routes from that peer need to be flushed with no TTL, but the observance of the outage suggests that is what is happening. One would think that, in principle, it could wait to tear down existing routes until a configurable timeout (say the graceful restart) expires, providing a window in which we can change the password and maintain stable routing.

I am relatively new to BGP and am using BIRD indirectly via Calico for container networking inside Kubernetes. I will of course take things up with the guys behind Calico, but is there anything in the BGP spec/BIRD implementation which fundamentally prevents network disruption free secret rotation? Let me know if there is any place I should look for more information on this or any debug logs which would be helpful.

Thanks,
Calvin
Hello,
Do you have graceful restart enabled for your session on both ends? In any case, I'm not sure that BIRD uses graceful session restart when it restarts protocols due to reconfiguration. Maybe someone has already tested it, or the BIRD developers can say for sure.
On Mon, Aug 8, 2022 at 5:06 PM Calvin Zachman <calvin.zachman@ibm.com> wrote:
Hi BIRD users,
Does anyone know whether a BGP shared secret can be rotated without incurring any network downtime? [...]
On 08.08.22 16:58, Calvin Zachman wrote:
Hi BIRD users,
Does anyone know whether a BGP shared secret can be rotated without incurring any network downtime? [...]
Hey Calvin,
It is not explicitly mentioned in the user documentation, but for Babel, BFD, OSPF, and others, you can do something like:
```
password "<text>";
password "<text>" {
    id <num>;
    generate from "<date>";
    generate to "<date>";
    accept from "<date>";
    accept to "<date>";
    from "<date>";
    to "<date>";
};
```
The OSPF section contains the following example:
```
password "abc" {
    id 1;
    generate to "22-04-2003 11:00:06";
    accept from "17-01-2001 12:01:05";
};
password "def" {
    id 2;
    generate to "22-07-2005 17:03:21";
    accept from "22-02-2001 11:34:06";
};
```
A while ago I tested it with OSPF and BFD, used `include` statements for the passwords, and used `birdc configure` for a "soft" reload. As far as I remember, this just worked(tm). But no warranties that this is implemented for BGP, too. It's just wild guessing.
Best and good luck,
Bernd
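To illustrate the idea behind the `generate`/`accept` windows in the example above, here is a minimal Python sketch of how two keys with overlapping windows could hand over without a gap. It is a simplified model and not BIRD's code: the `Key` class, the helper names, the secrets, the dates, and the lowest-id tie-break are all assumptions made for illustration, and nothing here implies that the BGP `password` option supports such windows.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional

FMT = "%d-%m-%Y %H:%M:%S"  # date format used in the quoted OSPF example


def parse(text: str) -> datetime:
    return datetime.strptime(text, FMT)


@dataclass
class Key:
    # Hypothetical model of one password entry with its validity windows.
    key_id: int
    secret: str
    generate_from: datetime
    generate_to: datetime
    accept_from: datetime
    accept_to: datetime


def signing_key(keys: List[Key], now: datetime) -> Optional[Key]:
    """Return a key whose generate window contains `now`, or None.

    Picking the lowest id when several match is an arbitrary choice
    made for this sketch, not documented BIRD behaviour.
    """
    active = [k for k in keys if k.generate_from <= now <= k.generate_to]
    return min(active, key=lambda k: k.key_id) if active else None


def accepted_keys(keys: List[Key], now: datetime) -> List[Key]:
    """Return every key whose accept window contains `now`."""
    return [k for k in keys if k.accept_from <= now <= k.accept_to]


# Hypothetical rollover plan: the new key is accepted a week before it is
# used for signing, and the old key stays accepted a week after it retires.
KEYS = [
    Key(1, "old-secret",
        parse("01-01-2022 00:00:00"), parse("01-07-2022 00:00:00"),
        parse("01-01-2022 00:00:00"), parse("08-07-2022 00:00:00")),
    Key(2, "new-secret",
        parse("01-07-2022 00:00:00"), parse("01-01-2023 00:00:00"),
        parse("24-06-2022 00:00:00"), parse("01-01-2023 00:00:00")),
]

for stamp in ("29-06-2022 18:26:18", "02-07-2022 00:00:00", "10-07-2022 00:00:00"):
    now = parse(stamp)
    signer = signing_key(KEYS, now)
    accepted = [k.key_id for k in accepted_keys(KEYS, now)]
    print(stamp, "sign with:", signer.key_id if signer else None, "accept:", accepted)
```

As long as both peers load the same plan before the switchover time, each side always accepts whichever key the other side is currently signing with, which is what makes the rotation hitless in this model.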
On Mon, 8 Aug 2022, 16:58 Calvin Zachman, <calvin.zachman@ibm.com> wrote:
Hi BIRD users,
Does anyone know whether a BGP shared secret can be rotated without incurring any network downtime? [...]
Rotating MD5 passwords for BGP sessions has _never_ been hitless. It _will_ force the session down and require it to be re-established, because changing the password changes the session parameters and requires a full session negotiation from scratch.
What you are looking for is TCP-AO support.
https://tcp-ao.net/
https://duckduckgo.com/?q=tcp-ao
https://blog.apnic.net/2021/07/28/its-time-to-replace-md5-with-tcp-ao/
TCP-AO implements logic that is (in simple terms) similar to the key chains you may be used to when configuring e.g. RIP, OSPF, or Babel on most routing platforms: each key has a specified lifetime, and one key is in use at a time, but multiple keys are allowed in order to permit key rotation.
Both BIRD (on the mailing list, see the list archives) and FRRouting (see the project's GitHub issue tracker) have open questions about when this feature will be ready. Both projects, though, depend on a Linux kernel implementation being mainlined before they can support it.
If you have ever used NOS releases from the bigger players: Juniper, Cisco, and Nokia (that I know of) have been shipping support for TCP-AO in their newer releases.
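As a rough illustration of why an established TCP connection cannot survive a one-sided MD5 password change: under RFC 2385 each segment carries an MD5 digest computed over the TCP pseudo-header, the TCP header without options (checksum zeroed), the segment data, and the shared key, so a receiver holding a different key computes a different digest and discards the segment. The Python sketch below is a simplified model, not wire-accurate: it ignores TCP options when computing the segment length, and the function names, ports, sequence numbers, and keys are assumptions chosen for illustration. It says nothing about BIRD's own reconfiguration logic, which, per the log in the original post, restarts the protocol.

```python
import hashlib
import socket
import struct


def basic_tcp_header(src_port: int, dst_port: int, seq: int, ack: int) -> bytes:
    """Build a 20-byte TCP header with no options and a zeroed checksum."""
    data_offset = 5 << 4   # header length of 5 32-bit words, upper nibble
    flags = 0x18           # PSH + ACK
    window, checksum, urgent = 65535, 0, 0
    return struct.pack("!HHIIBBHHH", src_port, dst_port, seq, ack,
                       data_offset, flags, window, checksum, urgent)


def tcp_md5_digest(src_ip: str, dst_ip: str, tcp_header: bytes,
                   payload: bytes, key: bytes) -> bytes:
    """Digest in the RFC 2385 field order: pseudo-header, header, data, key."""
    # Simplification: the segment length ignores TCP options.
    segment_len = len(tcp_header) + len(payload)
    pseudo_header = (socket.inet_aton(src_ip) + socket.inet_aton(dst_ip)
                     + struct.pack("!BBH", 0, socket.IPPROTO_TCP, segment_len))
    return hashlib.md5(pseudo_header + tcp_header + payload + key).digest()


# A BGP KEEPALIVE: 16-byte marker, length 19, type 4.
keepalive = b"\xff" * 16 + struct.pack("!HB", 19, 4)
header = basic_tcp_header(src_port=40000, dst_port=179, seq=1000, ack=2000)
src, dst = "10.95.14.82", "10.95.14.81"

# The sender has already switched to the new (hypothetical) key.
signed_with_new = tcp_md5_digest(src, dst, header, keepalive, b"new-secret")
# The receiver verifies against whichever key it currently holds.
expected_with_old = tcp_md5_digest(src, dst, header, keepalive, b"old-secret")
expected_with_new = tcp_md5_digest(src, dst, header, keepalive, b"new-secret")

print("receiver still on old key accepts:", signed_with_new == expected_with_old)
print("receiver on new key accepts:", signed_with_new == expected_with_new)
```

With a single key per connection there is no interval during which both the old and the new key verify, which is the gap that key chains and TCP-AO are meant to close.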
participants (4)
- Alexander Zubkov
- Bernd Naumann
- Calvin Zachman
- ch@ntrv.dk