[PATCH] babel: Send out low-interval hello on shutdown
When shutting down a Babel instance we send a wildcard retraction to make sure all peers can quickly switch to other route origins. Add another small optimisation borrowed from babeld: sending a Hello message (along with the retraction) with a very low interval. This will cause neighbours to modify their expiry timers for the node's state to quickly time it out, thus conserving resources in the network. Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk> --- proto/babel/babel.c | 13 +++++++++---- 1 file changed, 9 insertions(+), 4 deletions(-) diff --git a/proto/babel/babel.c b/proto/babel/babel.c index 174fc9e26654..4c7f70c3ff66 100644 --- a/proto/babel/babel.c +++ b/proto/babel/babel.c @@ -842,14 +842,14 @@ babel_send_ihus(struct babel_iface *ifa) } static void -babel_send_hello(struct babel_iface *ifa) +babel_send_hello(struct babel_iface *ifa, uint interval) { struct babel_proto *p = ifa->proto; union babel_msg msg = {}; msg.type = BABEL_TLV_HELLO; msg.hello.seqno = ifa->hello_seqno++; - msg.hello.interval = ifa->cf->hello_interval; + msg.hello.interval = interval ?: ifa->cf->hello_interval; TRACE(D_PACKETS, "Sending hello on %s with seqno %d interval %t", ifa->ifname, msg.hello.seqno, (btime) msg.hello.interval); @@ -1557,7 +1557,7 @@ babel_iface_timer(timer *t) if (now_ >= ifa->next_hello) { - babel_send_hello(ifa); + babel_send_hello(ifa, 0); ifa->next_hello += hello_period * (1 + (now_ - ifa->next_hello) / hello_period); } @@ -1604,7 +1604,7 @@ babel_iface_start(struct babel_iface *ifa) tm_start(ifa->timer, 100 MS); ifa->up = 1; - babel_send_hello(ifa); + babel_send_hello(ifa, 0); babel_send_wildcard_retraction(ifa); babel_send_wildcard_request(ifa); babel_send_update(ifa, 0); /* Full update */ @@ -2417,6 +2417,11 @@ babel_iface_shutdown(struct babel_iface *ifa) { if (ifa->sk) { + /* + * Retract all our routes and lower the hello interval so peers' neighbour + * state expires quickly + */ + babel_send_hello(ifa, BABEL_TIME_UNITS); babel_send_wildcard_retraction(ifa); babel_send_queue(ifa); } -- 2.36.0
On Wed, Apr 20, 2022 at 01:43:21AM +0200, Toke Høiland-Jørgensen wrote:
When shutting down a Babel instance we send a wildcard retraction to make sure all peers can quickly switch to other route origins. Add another small optimisation borrowed from babeld: sending a Hello message (along with the retraction) with a very low interval.
This will cause neighbours to modify their expiry timers for the node's state to quickly time it out, thus conserving resources in the network.
Hi Thanks, merged. Just changed BABEL_TIME_UNITS to BABEL_MIN_INTERVAL. BTW, when we added CI tests for Babel authentication, we noticed that it has rather slow convergence after reconfiguration. The reason is that when authentication changed to become non-matching, it took many missed (misauthenticated) hellos to sufficiently clean up hello_map for neighbor to go down. Perhaps there could be some decision in iface reconfiguration that the change is significant to affect reachability of neighbors and in such case deprecate some/most items in hello_map.
Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk> --- proto/babel/babel.c | 13 +++++++++---- 1 file changed, 9 insertions(+), 4 deletions(-)
diff --git a/proto/babel/babel.c b/proto/babel/babel.c index 174fc9e26654..4c7f70c3ff66 100644 --- a/proto/babel/babel.c +++ b/proto/babel/babel.c @@ -842,14 +842,14 @@ babel_send_ihus(struct babel_iface *ifa) }
static void -babel_send_hello(struct babel_iface *ifa) +babel_send_hello(struct babel_iface *ifa, uint interval) { struct babel_proto *p = ifa->proto; union babel_msg msg = {};
msg.type = BABEL_TLV_HELLO; msg.hello.seqno = ifa->hello_seqno++; - msg.hello.interval = ifa->cf->hello_interval; + msg.hello.interval = interval ?: ifa->cf->hello_interval;
TRACE(D_PACKETS, "Sending hello on %s with seqno %d interval %t", ifa->ifname, msg.hello.seqno, (btime) msg.hello.interval); @@ -1557,7 +1557,7 @@ babel_iface_timer(timer *t)
if (now_ >= ifa->next_hello) { - babel_send_hello(ifa); + babel_send_hello(ifa, 0); ifa->next_hello += hello_period * (1 + (now_ - ifa->next_hello) / hello_period); }
@@ -1604,7 +1604,7 @@ babel_iface_start(struct babel_iface *ifa) tm_start(ifa->timer, 100 MS); ifa->up = 1;
- babel_send_hello(ifa); + babel_send_hello(ifa, 0); babel_send_wildcard_retraction(ifa); babel_send_wildcard_request(ifa); babel_send_update(ifa, 0); /* Full update */ @@ -2417,6 +2417,11 @@ babel_iface_shutdown(struct babel_iface *ifa) { if (ifa->sk) { + /* + * Retract all our routes and lower the hello interval so peers' neighbour + * state expires quickly + */ + babel_send_hello(ifa, BABEL_TIME_UNITS); babel_send_wildcard_retraction(ifa); babel_send_queue(ifa); } -- 2.36.0
-- Elen sila lumenn' omentielvo Ondrej 'Santiago' Zajicek (email: santiago@crfreenet.org) OpenPGP encrypted e-mails preferred (KeyID 0x11DEADC3, wwwkeys.pgp.net) "To err is human -- to blame it on a computer is even more so."
Ondrej Zajicek <santiago@crfreenet.org> writes:
On Wed, Apr 20, 2022 at 01:43:21AM +0200, Toke Høiland-Jørgensen wrote:
When shutting down a Babel instance we send a wildcard retraction to make sure all peers can quickly switch to other route origins. Add another small optimisation borrowed from babeld: sending a Hello message (along with the retraction) with a very low interval.
This will cause neighbours to modify their expiry timers for the node's state to quickly time it out, thus conserving resources in the network.
Hi
Thanks, merged. Just changed BABEL_TIME_UNITS to BABEL_MIN_INTERVAL.
Awesome! I noticed we had that define as well and did the same rename in my local tree :)
BTW, when we added CI tests for Babel authentication, we noticed that it has rather slow convergence after reconfiguration. The reason is that when authentication changed to become non-matching, it took many missed (misauthenticated) hellos to sufficiently clean up hello_map for neighbor to go down.
Just to make sure I'm understanding the scenario correctly: A node is reconfigured to turn on authentication, but not all peers use it; so now some peers are essentially cut off (basically like if they just dropped off the network). However, their hello history remain, so their routes stay active until they time out. Right?
Perhaps there could be some decision in iface reconfiguration that the change is significant to affect reachability of neighbors and in such case deprecate some/most items in hello_map.
Hmm, yeah, we could do something like that I suppose. I'm wondering if it should really be stronger, though? Enabling auth on an already-running instance is an increase in the "security level" of the interface, so should we really be keeping unauthenticated data around at all? I.e., maybe we should simply flush all neighbour entries on an interface when enabling auth on that interface? Or another, slightly less disruptive, option is to flush a neighbour if we receive a packet from it that fails auth, and that neighbour doesn't have the 'auth_passed' flag set? For existing neighbours that succeeds auth, that flag should be set immediately on the next packet we receive from that neighbour, whereas this would quickly clear out neighbours that fail it. We could maybe speed things up further by immediately issuing auth challenges to all neighbours when the config changes? -Toke
On Sat, Apr 23, 2022 at 10:31:26PM +0200, Toke Høiland-Jørgensen wrote:
Ondrej Zajicek <santiago@crfreenet.org> writes:
BTW, when we added CI tests for Babel authentication, we noticed that it has rather slow convergence after reconfiguration. The reason is that when authentication changed to become non-matching, it took many missed (misauthenticated) hellos to sufficiently clean up hello_map for neighbor to go down.
Just to make sure I'm understanding the scenario correctly: A node is reconfigured to turn on authentication, but not all peers use it; so now some peers are essentially cut off (basically like if they just dropped off the network).
See: https://gitlab.nic.cz/labs/bird-tools/-/tree/master/netlab/cf-babel-auth We just change keys to be matching/mismatched (although enabling / disabling auth is a similar case)
However, their hello history remain, so their routes stay active until they time out. Right?
Yes
Perhaps there could be some decision in iface reconfiguration that the change is significant to affect reachability of neighbors and in such case deprecate some/most items in hello_map.
Hmm, yeah, we could do something like that I suppose. I'm wondering if it should really be stronger, though? Enabling auth on an already-running instance is an increase in the "security level" of the interface, so should we really be keeping unauthenticated data around at all? I.e., maybe we should simply flush all neighbour entries on an interface when enabling auth on that interface?
I would prefer to not make it that disruptive. Also it is not just enabling auth, but in general changing keys.
Or another, slightly less disruptive, option is to flush a neighbour if we receive a packet from it that fails auth, and that neighbour doesn't have the 'auth_passed' flag set? For existing neighbours that succeeds auth, that flag should be set immediately on the next packet we receive from that neighbour, whereas this would quickly clear out neighbours that fail it. We could maybe speed things up further by immediately issuing auth challenges to all neighbours when the config changes?
I thing that the current approach where there is some leeway between reconfiguration and dropping a neighbor is basically ok and it is better to just ignore failed auths than trigger some action based on that. Just that with full hello history it took too long. So perhaps just removing 3/4 of hellos from hello history? -- Elen sila lumenn' omentielvo Ondrej 'Santiago' Zajicek (email: santiago@crfreenet.org) OpenPGP encrypted e-mails preferred (KeyID 0x11DEADC3, wwwkeys.pgp.net) "To err is human -- to blame it on a computer is even more so."
Ondrej Zajicek <santiago@crfreenet.org> writes:
On Sat, Apr 23, 2022 at 10:31:26PM +0200, Toke Høiland-Jørgensen wrote:
Ondrej Zajicek <santiago@crfreenet.org> writes:
BTW, when we added CI tests for Babel authentication, we noticed that it has rather slow convergence after reconfiguration. The reason is that when authentication changed to become non-matching, it took many missed (misauthenticated) hellos to sufficiently clean up hello_map for neighbor to go down.
Just to make sure I'm understanding the scenario correctly: A node is reconfigured to turn on authentication, but not all peers use it; so now some peers are essentially cut off (basically like if they just dropped off the network).
See:
https://gitlab.nic.cz/labs/bird-tools/-/tree/master/netlab/cf-babel-auth
We just change keys to be matching/mismatched (although enabling / disabling auth is a similar case)
However, their hello history remain, so their routes stay active until they time out. Right?
Yes
Perhaps there could be some decision in iface reconfiguration that the change is significant to affect reachability of neighbors and in such case deprecate some/most items in hello_map.
Hmm, yeah, we could do something like that I suppose. I'm wondering if it should really be stronger, though? Enabling auth on an already-running instance is an increase in the "security level" of the interface, so should we really be keeping unauthenticated data around at all? I.e., maybe we should simply flush all neighbour entries on an interface when enabling auth on that interface?
I would prefer to not make it that disruptive. Also it is not just enabling auth, but in general changing keys.
Or another, slightly less disruptive, option is to flush a neighbour if we receive a packet from it that fails auth, and that neighbour doesn't have the 'auth_passed' flag set? For existing neighbours that succeeds auth, that flag should be set immediately on the next packet we receive from that neighbour, whereas this would quickly clear out neighbours that fail it. We could maybe speed things up further by immediately issuing auth challenges to all neighbours when the config changes?
I thing that the current approach where there is some leeway between reconfiguration and dropping a neighbor is basically ok and it is better to just ignore failed auths than trigger some action based on that. Just that with full hello history it took too long. So perhaps just removing 3/4 of hellos from hello history?
Hmm, right, that sounds reasonable. Something like the below, then? -Toke diff --git a/proto/babel/babel.c b/proto/babel/babel.c index 4a7d550f5efe..55ae5f972573 100644 --- a/proto/babel/babel.c +++ b/proto/babel/babel.c @@ -1861,8 +1861,25 @@ babel_reconfigure_iface(struct babel_proto *p, struct babel_iface *ifa, struct b babel_iface_update_buffers(ifa); - if ((new->auth_type != BABEL_AUTH_NONE) && (new->auth_type != old->auth_type)) - babel_auth_reset_index(ifa); + if (new->auth_type != BABEL_AUTH_NONE) + { + struct babel_neighbor *n; + + if (new->auth_type != old->auth_type) + babel_auth_reset_index(ifa); + + WALK_LIST(n, ifa->neigh_list) + { + /* trim the hello history on auth change to make sure neighbours expire + * quickly if they no longer pass auth + */ + if (n->hello_cnt < 2) + continue; + + n->hello_cnt = MAX(2, n->hello_cnt / 4); + n->hello_map &= 0xFFFF >> (16 - n->hello_cnt); + } + } if (ipa_zero(ifa->next_hop_ip4) && p->ip4_channel) log(L_WARN "%s: Missing IPv4 next hop address for %s", p->p.name, ifa->ifname);
participants (2)
-
Ondrej Zajicek -
Toke Høiland-Jørgensen