[PATCH 0/3] babel: Add support for the RTT extension

Fri Mar 3 01:08:51 CET 2023

Hi Maria,
Hi Toke,

On Tue, Feb 28, 2023 at 02:07:06PM +0100, Maria Matejka via Bird-users wrote:
> > > I think it's probably simpler to just re-announce any route that's still
> > > converging every time we go through the routing table.
> > 
> > Simpler, yes, but I do want to be able to maintain internet scale routing
> > tables through babel eventually so slashing every little bit helps :)
> 
> In version 2, update of non-best route is propagated only to some protocols
> like pipes, add-path BGPs and alike.

Ok, that's good for either approach.

> In version 3, this is even more smoothed as all updates of one prefix are
> exported asynchronously to each protocol, being notified after your Babel
> ends the task (event, socket, timer), dampening best route oscillation or
> other flaps.

I don't quite understand why this would damp oscillations. Do you mean
there's explicit route flap damping support in v3 or just that this is a
side-effect of the new async world? I'd like to know more about either :)

If we do have actual BGP-style damping in nest in v3, I'm not sure there's
much point in doing essentially the same thing in our proto. At the very
least that would be a good reason to keep the babel-specific damping easy to
remove if it's about to be superseded by direct nest support anyway.

> This way, I'm not so scared about Babel periodically updating many routes.
> BIRD has to withstand it.

I still think doing unnecessary work/computation is just dumb if we can
avoid it :)

On Tue, Feb 28, 2023 at 04:45:35PM +0100, Toke Høiland-Jørgensen wrote:
> Right, sure, that's a nice property, but I'm not actually sure how what
> you sketched out above accomplishes that?

Don't worry, I'll send an RFC patch soon if I can make it work out. I just
got tied up by some (mild) covid.

> > Simpler, yes, but I do want to be able to maintain internet scale routing
> > tables through babel eventually so slashing every little bit helps :)
> 
> Heh, yeah, I would like to eventually be able to do that as well, but I
> think there are other optimisations we need to do first. For instance,
> walking the entire routing table every second is not going to work in
> the first place in this case :)

True, but might as well throw myself at this RTT stuff while I have the
time and energy. Large scale route table performance testing will have to
wait for another day since there's not much point making it performant if
the features I want/need aren't supported (and performant! haha).

> >> Bear in mind that the currently selected route can also be converging, so
> >> predicting when two routes "cross" gets complicated quickly. Simpler to
> >> just do a periodic update and redo the comparison every time this update
> >> happens.
> >
> > I feel like that's an artifact specific to the "metric smoothing" approach
> > to route dampening not a feature though. The way I see it the behaviour we
> > really want is to delay any change in selected route for a time related to
> > the metric difference.
> >
> > Think back to what the purpose of the metric smoothing is in the first
> > place: to limit oscillations of the selected route, which this will do just
> > as well.
> 
> I'm OK with finding another solution, but I think you're going to have
> to explain in more detail how what you propose actually represents such
> a solution, then :)

Will do. I've been looking through some of the literature on network
stability under dynamic routing to see if there's any well-founded science
we can apply here. Haven't really found anything good yet.

The RTT paper does admit that "we lack an in-depth theoretical
understanding of the performance of our algorithm, in particular of its
stability." ;)

There is one thing I'm unsure about: does the delay before propagating a
route change to the kernel FIB actually have to depend on the metric
difference to provide the network stability properties we're looking for?

I think just strictly for stability a fixed delay should be fine, despite
not being optimal in terms of convergence time.
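
To make the two variants concrete, here is a rough sketch (Python rather
than BIRD C, not taken from any patch; every name and constant in it is
made up for illustration). A better route only replaces the selected one
after it has been continuously better for some hold time, which is either
fixed or shrinks as the metric gain grows:

```python
# Sketch only: hold-down before switching the selected route.
# All names and constants are hypothetical, not BIRD or Babel APIs.

def fixed_delay(selected_metric, candidate_metric, delay=4.0):
    """Fixed hold time; None means the candidate is not better at all."""
    return delay if candidate_metric < selected_metric else None

def metric_delay(selected_metric, candidate_metric,
                 max_delay=16.0, min_delay=1.0, scale=64):
    """Hold time that shrinks as the metric gain grows (assumed formula)."""
    gain = selected_metric - candidate_metric
    if gain <= 0:
        return None
    return max(min_delay, max_delay * scale / (scale + gain))

class HoldDownSelector:
    def __init__(self, delay_fn):
        self.delay_fn = delay_fn
        self.selected = None   # next hop currently installed
        self.pending = None    # (next hop, time it first looked better)

    def update(self, now, routes):
        """routes maps next hop -> metric; returns the installed next hop."""
        best = min(routes, key=routes.get)
        if self.selected is None or self.selected not in routes:
            # Nothing usable installed: take the best route immediately.
            self.selected, self.pending = best, None
            return self.selected
        if best == self.selected:
            # Challenger lost its lead, so the hold timer is reset.
            self.pending = None
            return self.selected
        if self.pending is None or self.pending[0] != best:
            self.pending = (best, now)
        delay = self.delay_fn(routes[self.selected], routes[best])
        if delay is not None and now - self.pending[1] >= delay:
            self.selected, self.pending = best, None
        return self.selected
```

With fixed_delay a route that flaps faster than the hold time never gets
installed; metric_delay behaves the same for small gains but lets a much
better route through sooner.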

> > I don't agree with that. It's not as if I want per-hop information. Just a
> > sum of RTTs along the path and a sum of administrative metric along the
> > path rather than have those jumbled together into one number.
> >
> > Since babel is quite flexible in the actual metric math that would allow
> > interesting ways of weighing each metric component rather than just having
> > everything be linear.
> 
> It also introduces dependencies, though. I.e., with the current approach
> you can have a subset of the routers speak the RTT extension, and other
> parts of the network will just have that incorporated into the metric.
> Whereas if it is carried as a separate metric your entire network has to
> know about the extension for it to be useful.

I don't see why you couldn't do both. Incorporate the rtt (or other
measures) into the metric for oblivious nodes and expose optional TLVs for
ones that care about the different components.
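
Roughly what I have in mind, as a sketch (Python, purely illustrative; the
optional component fields are hypothetical and not part of any Babel spec
or existing TLV layout). The combined metric is always maintained, so
oblivious nodes keep working, and the per-component sums only survive as
long as every hop on the path understands them:

```python
# Sketch only: combined metric plus optional per-component sums.
# Field and function names are hypothetical.
from dataclasses import dataclass
from typing import Optional

INFINITY = 0xFFFF  # Babel's 16-bit "infinite" metric

@dataclass(frozen=True)
class Update:
    metric: int                      # combined metric, always present
    rtt_sum: Optional[int] = None    # optional: RTT-derived cost so far
    admin_sum: Optional[int] = None  # optional: administrative cost so far

def forward_oblivious(u, link_cost):
    """Node without the extension: adds its cost, drops the components."""
    return Update(metric=min(u.metric + link_cost, INFINITY))

def forward_aware(u, admin_cost, rtt_cost):
    """Node with the extension: updates metric and, if present, both sums."""
    metric = min(u.metric + admin_cost + rtt_cost, INFINITY)
    if u.rtt_sum is None or u.admin_sum is None:
        # Components were lost upstream; a partial sum would be misleading.
        return Update(metric=metric)
    return Update(metric, u.rtt_sum + rtt_cost, u.admin_sum + admin_cost)
```

An aware receiver can then weigh rtt_sum and admin_sum however it likes
when both are present, and falls back to the plain metric otherwise.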

> > For debugging this would be useful as you can see that this path in front
> > of you actually has a crazy RTT rather than someone just having fiddled
> > with their rxcost.
> 
> Meh, not convinced that the routing protocol is the right place to get
> such debugging information. I'd rather just monitor the actual traffic :)

I just put myself into the mindset of "what if babel were used on the
internet instead of eBGP". In that world you'd have to convince lazy
admins to run some weird additional software on their nodes, and deal with
black box vendors not cooperating because they want to sell everyone their
full network observability platform instead.

Seems preferable to just have some more debuggability right in the protocol
instead, no?

If you're getting at the fact that you'd just do some passive TCP header
sniffing, do consider what happens when QUIC is widely deployed and that
gets a whole lot harder :P

> > yikes. Don't want to go down that road, got enough of these lookups in
> > rt_notify already :)
> 
> Right, but then we do need to put the smoothed metric into an attribute
> if it's to be used in the comparison. But maybe you can explain how
> that's not really needed cf the above.

Right, and my first attempt was doing that, before I came up with the new
approach.

--Daniel