Hi Toke,

On Fri, Apr 22, 2022 at 01:48:46AM +0200, Toke Høiland-Jørgensen wrote:
> I've implemented the Babel RTT extension specified in
> draft-ietf-babel-rtt-extension in Bird. I've tested that it talks to
> babeld on a single link and that the two implementations agree on each
> others' (smoothed) RTT values. However, I'd like to subject the code to
> some more tortured testing before submitting it to upstream Bird. So I'm
> sending this note as a request for testing.
Nice work! I replaced the bird binary and changed the interface type to "tunnel" on a mesh of four hosts. Works great so far!

Things I noticed:

(1) When I forgot to change the config file on one host (one side was type tunnel, one side was type wired), the Babel neighbor metric was stuck at 65535. I think this happens because the expected timestamp is never received, so the metric computation does not work. While I understand that such a "broken" setup is not really supported, it was not obvious where to look for the problem.

(2) Also, I think it would be neat if "birdc show babel neigh" showed latency info (current latency + smoothed value).

(3) Due to route flapping I tried to increase "metric decay" to 60s. After running "birdc configure" the values became very large for one link (on one side only):
bird: babel1: RTT sample for neighbour fe80::3 on wg2: 4294966323 us (srtt 99189.162 ms)
bird: babel1: Added RTT cost 96 to nbr fe80::3 on wg2 with srtt 99189.162 ms
Nothing changed after >1h. The opposite side was reporting sensible RTT numbers. After I restarted the daemon, the smoothed value was still off for this one link:
bird: babel1: RTT sample for neighbour fe80::3 on wg2: 1241 us (srtt 69656.646 ms)
bird: babel1: Added RTT cost 96 to nbr fe80::3 on wg2 with srtt 69656.646 ms
The srtt value did not converge after >1h. For all other links the smoothing works, e.g. for wg1 on the very same host:
bird: babel1: RTT sample for neighbour fe80::1 on wg0: 14570 us (srtt 15.876 ms)
bird: babel1: Added RTT cost 5 to nbr fe80::1 on wg0 with srtt 15.876 ms
After restarting bird once more (without changing anything), it has worked ever since:
bird: babel1: RTT sample for neighbour fe80::3 on wg2: 1261 us (srtt 1.313 ms)
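One observation on the first log excerpt: 4294966323 is exactly 2^32 - 973, so that sample looks like a small negative RTT (about -973 us) that wrapped around in an unsigned 32-bit value. I have not checked how the patch actually computes the sample, so the following is only a sketch of that assumption. The function names and the input values are made up for illustration and are not taken from Bird:

```c
#include <stdint.h>

/*
 * Hypothetical illustration, NOT Bird's code.
 * Assumption: the RTT sample is the time elapsed locally minus the time
 * the neighbour reports having spent between receiving and replying,
 * both in microseconds and both held in unsigned 32-bit integers.
 */
static uint32_t rtt_sample_us(uint32_t local_elapsed_us, uint32_t remote_elapsed_us)
{
  /* Wraps modulo 2^32 when remote_elapsed_us > local_elapsed_us. */
  return local_elapsed_us - remote_elapsed_us;
}

/* Same computation, but a negative result is clamped to zero. */
static uint32_t rtt_sample_clamped_us(uint32_t local_elapsed_us, uint32_t remote_elapsed_us)
{
  return (local_elapsed_us > remote_elapsed_us)
       ? local_elapsed_us - remote_elapsed_us
       : 0;
}
```

With the made-up inputs rtt_sample_us(1000, 1973), the result is 4294966323, the same number as in the log, while rtt_sample_clamped_us(1000, 1973) gives 0. On a link with sub-millisecond latency, a small clock or timestamp error would be enough to push the difference below zero.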
In this setup wg2 was a tunnel over the local LAN, so latency was often < 1000 us. Maybe there is a problem for tiny latencies and/or larger values of "metric decay"? I did not find a way to reliably reproduce the problem.

Best regards,
Stefan Haller