Ondrej Zajicek <santiago@crfreenet.org> writes:
On Sat, Apr 23, 2022 at 10:31:26PM +0200, Toke Høiland-Jørgensen wrote:
Ondrej Zajicek <santiago@crfreenet.org> writes:
BTW, when we added CI tests for Babel authentication, we noticed that it has rather slow convergence after reconfiguration. The reason is that when authentication changed to become non-matching, it took many missed (misauthenticated) hellos to sufficiently clean up hello_map for neighbor to go down.
Just to make sure I'm understanding the scenario correctly: A node is reconfigured to turn on authentication, but not all peers use it; so now some peers are essentially cut off (basically like if they just dropped off the network).
See:
https://gitlab.nic.cz/labs/bird-tools/-/tree/master/netlab/cf-babel-auth
We just change keys to be matching/mismatched (although enabling / disabling auth is a similar case)
However, their hello history remain, so their routes stay active until they time out. Right?
Yes
Perhaps there could be some decision in iface reconfiguration that the change is significant to affect reachability of neighbors and in such case deprecate some/most items in hello_map.
Hmm, yeah, we could do something like that I suppose. I'm wondering if it should really be stronger, though? Enabling auth on an already-running instance is an increase in the "security level" of the interface, so should we really be keeping unauthenticated data around at all? I.e., maybe we should simply flush all neighbour entries on an interface when enabling auth on that interface?
I would prefer to not make it that disruptive. Also it is not just enabling auth, but in general changing keys.
Or another, slightly less disruptive, option is to flush a neighbour if we receive a packet from it that fails auth, and that neighbour doesn't have the 'auth_passed' flag set? For existing neighbours that succeeds auth, that flag should be set immediately on the next packet we receive from that neighbour, whereas this would quickly clear out neighbours that fail it. We could maybe speed things up further by immediately issuing auth challenges to all neighbours when the config changes?
I thing that the current approach where there is some leeway between reconfiguration and dropping a neighbor is basically ok and it is better to just ignore failed auths than trigger some action based on that. Just that with full hello history it took too long. So perhaps just removing 3/4 of hellos from hello history?
Hmm, right, that sounds reasonable. Something like the below, then? -Toke diff --git a/proto/babel/babel.c b/proto/babel/babel.c index 4a7d550f5efe..55ae5f972573 100644 --- a/proto/babel/babel.c +++ b/proto/babel/babel.c @@ -1861,8 +1861,25 @@ babel_reconfigure_iface(struct babel_proto *p, struct babel_iface *ifa, struct b babel_iface_update_buffers(ifa); - if ((new->auth_type != BABEL_AUTH_NONE) && (new->auth_type != old->auth_type)) - babel_auth_reset_index(ifa); + if (new->auth_type != BABEL_AUTH_NONE) + { + struct babel_neighbor *n; + + if (new->auth_type != old->auth_type) + babel_auth_reset_index(ifa); + + WALK_LIST(n, ifa->neigh_list) + { + /* trim the hello history on auth change to make sure neighbours expire + * quickly if they no longer pass auth + */ + if (n->hello_cnt < 2) + continue; + + n->hello_cnt = MAX(2, n->hello_cnt / 4); + n->hello_map &= 0xFFFF >> (16 - n->hello_cnt); + } + } if (ipa_zero(ifa->next_hop_ip4) && p->ip4_channel) log(L_WARN "%s: Missing IPv4 next hop address for %s", p->p.name, ifa->ifname);