[PATCH v3] Babel: Replace internal route selection by bird's nest

Toke Høiland-Jørgensen toke at toke.dk
Tue Jan 31 17:35:42 CET 2023


Daniel Gröber <dxld at darkboxed.org> writes:

> Hi Toke,
>
> Thanks for the comprehensive review! See below.
>
> On Tue, Jan 31, 2023 at 12:38:25PM +0100, Toke Høiland-Jørgensen wrote:
>> Daniel Gröber <dxld at darkboxed.org> writes:
>> > This appears to not actually be a breaking change as route announcement was
>> > already based on which routes bird would send back to babel_rt_notify
>> > filters shouldn't be able to tell the difference between a single and
>> > multiple routes AFAIK.
>> 
>> Can we be a bit more sure than "appears"? :)
>
> Testers welcome :P
>
>> Just to make sure I'm understanding correctly: this relies on the nest
>> using babel_rte_better() to select the route with the lowest metric from
>> all the ones we announce, right? And if there's no filter, functionality
>> will be equivalent to before this patch?
>
> Right, I have yet to actually compare the before/after behaviour more
> comprehensively. Hence the "appears" :)
>
> Also note the feasibility condition is checked before announcing to nest so
> rte_better will never see ones that don't satisfy it.

That I did catch!

> Furthermore metric smoothing is going to be even fun to implement with
> this ;)

Add the smoothed metric as a separate attribute? But yeah, the
hysteresis itself may have to be moved into the nest, I guess? Let's
cross that bridge when we get to it :)

> If there is no filter everything works as ever, what we still have to worry
> about is the with filter case. I'm not sure if there are any differences
> observable from the filter POV.

On the subject of filters, this capability makes more sense if the
router ID is also available to filters, doesn't it? This has been
requested a few times already, the last time just a few weeks ago:

https://bird.network.cz/pipermail/bird-users/2022-May/016160.html
https://bird.network.cz/pipermail/bird-users/2023-January/016518.html

>> > +static void
>> > +babel_announce_rte_unreachable(struct babel_proto *p, struct babel_entry *e, u8 unreachable)
>> > +{
>> > +  struct channel *c = (e->n.addr->type == NET_IP4) ? p->ip4_channel : p->ip6_channel;
>> > +  rte *rte = NULL;
>> > +
>> > +  if (unreachable) {
>> > +    /* Unreachable */
>> > +    rta a0 = {
>> > +      .source = RTS_BABEL,
>> > +      .scope = SCOPE_UNIVERSE,
>> > +      .dest = RTD_UNREACHABLE,
>> > +      .pref = 1,
>> > +    };
>> > +
>> > +    rta *a = rta_lookup(&a0);
>> > +    rte = rte_get_temp(a, p->p.main_source);
>> > +  }
>> > +
>> > +  e->unreachable = unreachable;
>> > +
>> > +  /* Unlike the many neighbour routes we only want one unreachable route, for
>> > +   * one it's ugly to have lots of unreachable routes (if we just did this right
>> > +   * in babel_announce_rte) but we also loose access to the neibour rte_src id
>> > +   * when the neighbour is deallocated. This happens on a different schedule
>> > +   * than unreachable route removal. */
>> > +  rte_update2(c, e->n.addr, rte, p->p.main_source);
>> > +}
>> 
>> OK, so this function took me a while to understand: There's a special
>> "unreachable" route we announce whenever we lose a route, and this
>> function announces or retract that particular route based on the
>> "unreachable" argument. Could you please add a comment to the top of the
>> function explaining all this (the comment at the bottom is only a
>> partial explanation).
>> 
>> Also s/loose/lose/
>
> ACK.
>
>> > -    TRACE(D_EVENTS, "Lost feasible route for prefix %N", e->n.addr);
>> > -    if (e->valid && (e->selected->router_id == e->router_id))
>> > -      babel_add_seqno_request(p, e, e->selected->router_id, e->selected->seqno + 1, 0, NULL);
>> 
>> With your patch we no longer send seqno requests. That seems bad? :)
>
> Uups, I've been observing routes staying unfeasible for (seemingly) much
> longer than with the old code during testing and was worried it might have
> something to do with the seqno reqs :)

Haha, yeah, that seems likely ;)

>> >      e = babel_get_entry(p, net->n.addr);
>> >  
>> > +    while (internal) {
>> > +      ifa = babel_find_iface(p, new->attrs->nh.iface);
>> > +      if (!ifa) break;
>> > +      n = babel_find_neighbor(ifa, new->attrs->from);
>> > +      if (!n) break;
>> > +      e->selected_nbr = n;
>> > +      break;
>> > +    }
>> > +    if (!internal) {
>> > +      e->selected_nbr = NULL;
>> > +    }
>> 
>> Using a loop+break statements as some kind of weird 'goto'? Yuck!
>> Maybe just make this a helper function?
>> 
>> babel_entry_update_selected(p, e, new->attrs->nh.iface, internal ? new->attrs->from : NULL)
>> 
>> Also, if we can't find the iface or neighbour object (can that ever
>> happen) shouldn't we be resetting e->selected_nbr to NULL instead of
>> keeping whatever old value is there?
>
> Yeah I'm not happy with this code either. It seems like the NULL cases
> should never be able to happen but I'm not convinced, hence this quick
> "beauty".
>
> I think you have a point though selected_nbr should be reset when the
> lookups fail as nest is still telling us: this is the optimal route now. I
> don't like that at all though as it means our view of the routing table is
> out of sync with what bird is doing and so we'd probably violate protocol
> in any of the cases that need to check if a route is "best".
>
> Is there a way we can easily trigger a protocol restart instead of crashing
> all of bird with an assertion here if this does happen against all odds?

Hmm, no, I don't think there's any way that the protocol can signal to
the nest that it has gotten itself into an inconsistent state and needs
to be restarted. Not sure if it's better to just crash in this case, or
if we should add such a mechanism? Hopefully Ondrej has an opinion...

-Toke



More information about the Bird-users mailing list