[PATCH v3] Babel: Replace internal route selection by bird's nest

Daniel Gröber dxld at darkboxed.org
Tue Jan 31 13:58:31 CET 2023


Hi Toke,

Thanks for the comprehensive review! See below.

On Tue, Jan 31, 2023 at 12:38:25PM +0100, Toke Høiland-Jørgensen wrote:
> Daniel Gröber <dxld at darkboxed.org> writes:
> > This appears to not actually be a breaking change as route announcement was
> > already based on which routes bird would send back to babel_rt_notify
> > filters shouldn't be able to tell the difference between a single and
> > multiple routes AFAIK.
> 
> Can we be a bit more sure than "appears"? :)

Testers welcome :P

> Just to make sure I'm understanding correctly: this relies on the nest
> using babel_rte_better() to select the route with the lowest metric from
> all the ones we announce, right? And if there's no filter, functionality
> will be equivalent to before this patch?

Right, I have yet to actually compare the before/after behaviour more
comprehensively. Hence the "appears" :)

Also note the feasibility condition is checked before announcing to nest so
rte_better will never see ones that don't satisfy it. Furthermore metric
smoothing is going to be even fun to implement with this ;)

If there is no filter everything works as ever, what we still have to worry
about is the with filter case. I'm not sure if there are any differences
observable from the filter POV.

In particular import/export filters that fiddle with the metric need to be
tested still.

> > +static struct babel_route *
> > +babel_get_selected_route(struct babel_proto *p, struct babel_entry *e)
> > +{
> > +  if (e->selected_nbr)
> > +    return babel_get_route(p, e, e->selected_nbr);
> > +  return NULL;
> > +}
> 
> Why the switch to storing the neighbour and resolving that to a route
> here? Couldn't we just keep the existing e->selected route and populate
> that where e->selected_nbr is currently set?

Just seemed more natural to me at the time, but looking at it now I wonder
if it isn't better for performance to cache this after all, since
get_selected_route does get called in handle_update. I think I'll roll that
back.

> > +  for (struct babel_route *r_next, *r = expire_head; r ; r = r_next) {
> > +    r_next = r->expire_next;
> > +    babel_expire_route(p, r);
> > +  }
> > +
> > +  for (struct babel_entry *e = retract_head; e ; e = e->retract_next) {
> > +    babel_announce_retraction(p, e);
> > +  }
> > +
> > +  for (struct babel_entry *e_next, *e = delete_head; e ; e = e_next) {
> > +    e_next = e->delete_next;
> > +    fib_delete(rtable, e);
> > +  }
> 
> These also be clearing the ->{expire,delete,rectract}_next pointers? I
> know they're not used anywhere else, but they'll be left pointing to
> random memory in some cases here...

Yeah, I had that for all the loops then removed it because it complicated
the expire case as that one might deallocate the route. I'll give it
another go.

> > +static void
> > +babel_announce_rte_unreachable(struct babel_proto *p, struct babel_entry *e, u8 unreachable)
> > +{
> > +  struct channel *c = (e->n.addr->type == NET_IP4) ? p->ip4_channel : p->ip6_channel;
> > +  rte *rte = NULL;
> > +
> > +  if (unreachable) {
> > +    /* Unreachable */
> > +    rta a0 = {
> > +      .source = RTS_BABEL,
> > +      .scope = SCOPE_UNIVERSE,
> > +      .dest = RTD_UNREACHABLE,
> > +      .pref = 1,
> > +    };
> > +
> > +    rta *a = rta_lookup(&a0);
> > +    rte = rte_get_temp(a, p->p.main_source);
> > +  }
> > +
> > +  e->unreachable = unreachable;
> > +
> > +  /* Unlike the many neighbour routes we only want one unreachable route, for
> > +   * one it's ugly to have lots of unreachable routes (if we just did this right
> > +   * in babel_announce_rte) but we also loose access to the neibour rte_src id
> > +   * when the neighbour is deallocated. This happens on a different schedule
> > +   * than unreachable route removal. */
> > +  rte_update2(c, e->n.addr, rte, p->p.main_source);
> > +}
> 
> OK, so this function took me a while to understand: There's a special
> "unreachable" route we announce whenever we lose a route, and this
> function announces or retract that particular route based on the
> "unreachable" argument. Could you please add a comment to the top of the
> function explaining all this (the comment at the bottom is only a
> partial explanation).
> 
> Also s/loose/lose/

ACK.

> > -    TRACE(D_EVENTS, "Lost feasible route for prefix %N", e->n.addr);
> > -    if (e->valid && (e->selected->router_id == e->router_id))
> > -      babel_add_seqno_request(p, e, e->selected->router_id, e->selected->seqno + 1, 0, NULL);
> 
> With your patch we no longer send seqno requests. That seems bad? :)

Uups, I've been observing routes staying unfeasible for (seemingly) much
longer than with the old code during testing and was worried it might have
something to do with the seqno reqs :)

> >      e = babel_get_entry(p, net->n.addr);
> >  
> > +    while (internal) {
> > +      ifa = babel_find_iface(p, new->attrs->nh.iface);
> > +      if (!ifa) break;
> > +      n = babel_find_neighbor(ifa, new->attrs->from);
> > +      if (!n) break;
> > +      e->selected_nbr = n;
> > +      break;
> > +    }
> > +    if (!internal) {
> > +      e->selected_nbr = NULL;
> > +    }
> 
> Using a loop+break statements as some kind of weird 'goto'? Yuck!
> Maybe just make this a helper function?
> 
> babel_entry_update_selected(p, e, new->attrs->nh.iface, internal ? new->attrs->from : NULL)
> 
> Also, if we can't find the iface or neighbour object (can that ever
> happen) shouldn't we be resetting e->selected_nbr to NULL instead of
> keeping whatever old value is there?

Yeah I'm not happy with this code either. It seems like the NULL cases
should never be able to happen but I'm not convinced, hence this quick
"beauty".

I think you have a point though selected_nbr should be reset when the
lookups fail as nest is still telling us: this is the optimal route now. I
don't like that at all though as it means our view of the routing table is
out of sync with what bird is doing and so we'd probably violate protocol
in any of the cases that need to check if a route is "best".

Is there a way we can easily trigger a protocol restart instead of crashing
all of bird with an assertion here if this does happen against all odds?

--Daniel


More information about the Bird-users mailing list