Stefan Haller <stefan.haller@stha.de> writes:
Hi Toke,
On Fri, Apr 22, 2022 at 01:48:46AM +0200, Toke Høiland-Jørgensen wrote:
I've implemented the Babel RTT extension specified in draft-ietf-babel-rtt-extension in Bird. I've tested that it talks to babeld on a single link and that the two implementations agree on each other's (smoothed) RTT values. However, I'd like to subject the code to some more torture testing before submitting it to upstream Bird. So I'm sending this note as a request for testing.
Nice work! I replaced the bird binary and changed the interface type to "tunnel" on a mesh of four hosts. Works great so far!
Things I noticed:
Thanks for testing! This is just the kind of feedback I was looking for!
(1) When I forgot to change the config file (one side was type tunnel, the other side was type wired), the Babel neighbor metric was stuck on 65535. I think this happens because the expected timestamp was not received and then the metric computation does not work. While I understand that such a "broken" setup is not really supported, it was not exactly clear where to locate the problem.
No, this should definitely be supported! The reason it breaks is simply that I misremembered the semantics of the subtlv parsing code, so it ignored the whole IHU TLV instead of just the timestamp. Will fix!
(2) Also I think it would be neat if "birdc show babel neigh" would show latency info (current latency + smoothed value).
Yup, definitely, will add!
(3) Due to route flapping I tried to increase "metric decay" to 60s. After running "birdc configure" the values became very large for one link (on one side only).
Which values become very large? And is this persistent? What kind of route flapping were you seeing?
bird: babel1: RTT sample for neighbour fe80::3 on wg2: 4294966323 us (srtt 99189.162 ms)
bird: babel1: Added RTT cost 96 to nbr fe80::3 on wg2 with srtt 99189.162 ms
Nothing changed after >1h. The opposite side was reporting sensible RTT numbers. After I restarted the daemon, the smoothed value was still off for this one link:
bird: babel1: RTT sample for neighbour fe80::3 on wg2: 1241 us (srtt 69656.646 ms)
bird: babel1: Added RTT cost 96 to nbr fe80::3 on wg2 with srtt 69656.646 ms
The srtt value did not converge after >1h. For all other links the smoothing works, e.g. for wg1 on the very same host:
bird: babel1: RTT sample for neighbour fe80::1 on wg0: 14570 us (srtt 15.876 ms)
bird: babel1: Added RTT cost 5 to nbr fe80::1 on wg0 with srtt 15.876 ms
After restarting bird once more (without changing anything), it has been working ever since:
bird: babel1: RTT sample for neighbour fe80::3 on wg2: 1261 us (srtt 1.313 ms)
In this setup wg2 was a tunnel over the local LAN, so latency was often < 1000 us. Maybe there is a problem for tiny latencies and/or larger values of "metric decay"? I did not find a way to reliably reproduce the problem.
Hmm, so that initial value is '-973' as a 32-bit unsigned int. So looks like there's an overflow in the RTT calculation. I'll add some sanity checks to make sure this doesn't happen (as indeed babeld has as well).
I force-pushed an updated version to the same branch which fixes the issues you noted above apart from the route flap one.
Inspired by the first issue noted above, I also added a separate config parameter to control the *sending* of timestamps, which defaults to on.
-Toke
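To make the '-973' observation concrete, here is a minimal Python sketch of the wraparound. It assumes the RTT sample is the locally measured elapsed time minus the time the neighbour reports having held the packet, both in microseconds and both stored as 32-bit unsigned values. The function names, the example timings, and the 10-minute validity window are hypothetical; this is not the code in Bird or babeld, only an illustration of the kind of sanity check mentioned above.

```python
U32 = 0xFFFFFFFF           # mask emulating 32-bit unsigned arithmetic
MAX_WAIT_US = 600_000_000  # hypothetical validity window (10 minutes)


def rtt_unchecked(local_wait_us: int, remote_wait_us: int) -> int:
    """Subtract the way a C uint32_t would: a negative result wraps around."""
    return (local_wait_us - remote_wait_us) & U32


def rtt_checked(local_wait_us: int, remote_wait_us: int):
    """Return an RTT sample in microseconds, or None if it is implausible."""
    # Reject intervals that are too old to be meaningful.
    if local_wait_us > MAX_WAIT_US or remote_wait_us > MAX_WAIT_US:
        return None
    # Reject samples where the neighbour claims to have held the packet
    # longer than the whole exchange took; this would wrap if subtracted.
    if remote_wait_us > local_wait_us:
        return None
    return local_wait_us - remote_wait_us


# Hypothetical timings: the remote hold time exceeds the local elapsed
# time by 973 us, which can plausibly happen on a sub-millisecond link.
print(rtt_unchecked(4_000_000, 4_000_973))  # 4294966323
print(rtt_checked(4_000_000, 4_000_973))    # None
print(rtt_checked(4_001_261, 4_000_000))    # 1261
```

The first printed value matches the sample in the log above (4294966323 = 2^32 - 973). Dropping the implausible sample, rather than feeding it into the smoothed value, is one possible policy; clamping it to zero would be another.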