Hi everyone

I've implemented the Babel RTT extension specified in draft-ietf-babel-rtt-extension in Bird. I've tested that it talks to babeld on a single link and that the two implementations agree on each other's (smoothed) RTT values. However, I'd like to subject the code to some more tortured testing before submitting it to upstream Bird. So I'm sending this note as a request for testing.

The code currently lives here:

https://github.com/tohojo/bird/tree/babel-rtt-01

To compile it, do:

  git clone -b babel-rtt-01 https://github.com/tohojo/bird
  cd bird && autoreconf && ./configure && make

To run it, create a config file enabling the RTT extension; the simplest way to do this is to set the interface type to 'tunnel'. A sample minimal config is included below (just change the interface name from "veth0"). Save this and run Bird in the foreground in debug mode like:

  ./bird -c sample.conf -d

Any feedback (successes or failures) greatly appreciated!

-Toke

debug protocols all;
router id 1.2.3.4;

ipv4 table master4;
ipv6 table master6;

protocol device {
    scan time 10;
}

protocol kernel kernel4 {
    learn;
    ipv4 {
        export all;
        import all;
    };
}

protocol kernel kernel6 {
    learn;
    ipv6 {
        export all;
        import all;
    };
}

protocol babel {
    ipv4 {
        import all;
        export all;
    };
    ipv6 {
        import all;
        export all;
    };
    interface "veth0" {
        type tunnel;
    };
}
Hi Toke,

On Fri, Apr 22, 2022 at 01:48:46AM +0200, Toke Høiland-Jørgensen wrote:
I've implemented the Babel RTT extension specified in draft-ietf-babel-rtt-extension in Bird. I've tested that it talks to babeld on a single link and that the two implementations agree on each others' (smoothed) RTT values. However, I'd like to subject the code to some more tortured testing before submitting it to upstream Bird. So I'm sending this note as a request for testing.
Nice work! I replaced the bird binary and changed the interface type to "tunnel" on a mesh of four hosts. Works great so far!

Things I noticed:

(1) When I forgot to change the config file (one side was type tunnel, one side was type wired), the Babel neighbor metric was stuck on 65535. I think this happens because the expected time stamp was not received and then the metric computation does not work. While I understand that such a "broken" setup is not really supported, it was not exactly clear where to locate the problem.

(2) Also I think it would be neat if "birdc show babel neigh" would show latency info (current latency + smoothed value).

(3) Due to route flapping I tried to increase "metric decay" to 60s. After running "birdc configure" the values became very large for one link (on one side only).
bird: babel1: RTT sample for neighbour fe80::3 on wg2: 4294966323 us (srtt 99189.162 ms)
bird: babel1: Added RTT cost 96 to nbr fe80::3 on wg2 with srtt 99189.162 ms
Nothing changed after >1h. The opposite side was reporting sensible RTT numbers. After I restarted the daemon, the smoothed value was still off for this one link:
bird: babel1: RTT sample for neighbour fe80::3 on wg2: 1241 us (srtt 69656.646 ms)
bird: babel1: Added RTT cost 96 to nbr fe80::3 on wg2 with srtt 69656.646 ms
The srtt value did not converge after >1h. For all other links the smoothing works, e.g. for wg1 on the very same host:
bird: babel1: RTT sample for neighbour fe80::1 on wg0: 14570 us (srtt 15.876 ms)
bird: babel1: Added RTT cost 5 to nbr fe80::1 on wg0 with srtt 15.876 ms
After restarting bird once more (without changing anything) it has worked ever since:
bird: babel1: RTT sample for neighbour fe80::3 on wg2: 1261 us (srtt 1.313 ms)
In this setup wg2 was a tunnel over the local LAN, so latency was often < 1000 us. Maybe there is a problem for tiny latencies and/or larger values of "metric decay"? I did not find a way to reliably reproduce the problem.

Best regards,
Stefan Haller
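The "Added RTT cost" log lines above are consistent with a linear mapping from smoothed RTT to an additive cost that is clamped at both ends. The following is a minimal sketch of such a mapping, not the code from the branch. The constants (10 ms lower bound, 120 ms upper bound, maximum cost 96) and the function name `rtt_cost` are assumptions chosen because they reproduce the logged values; the actual parameters and rounding in Bird may differ.

```python
# Hypothetical parameters, picked to match the log lines in this thread.
RTT_MIN_US = 10_000    # at or below this smoothed RTT, no cost is added
RTT_MAX_US = 120_000   # at or above this smoothed RTT, the full cost is added
MAX_RTT_COST = 96      # largest cost the RTT can contribute


def rtt_cost(srtt_us: int) -> int:
    """Map a smoothed RTT in microseconds to an additive link cost."""
    if srtt_us <= RTT_MIN_US:
        return 0
    if srtt_us >= RTT_MAX_US:
        return MAX_RTT_COST
    # Linear interpolation between the two bounds, rounded down.
    return MAX_RTT_COST * (srtt_us - RTT_MIN_US) // (RTT_MAX_US - RTT_MIN_US)


print(rtt_cost(15_876))      # 5, as in the wg0 log line (srtt 15.876 ms)
print(rtt_cost(99_189_162))  # 96, as in the wg2 log line (srtt 99189.162 ms)
print(rtt_cost(1_313))       # 0, a sub-10 ms LAN tunnel adds no cost
```

Under these assumptions, a link whose smoothed RTT is stuck at tens of seconds is pinned to the maximum cost regardless of what the individual samples look like, which matches the wg2 behaviour described above.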
Stefan Haller <stefan.haller@stha.de> writes:
Hi Toke,
On Fri, Apr 22, 2022 at 01:48:46AM +0200, Toke Høiland-Jørgensen wrote:
I've implemented the Babel RTT extension specified in draft-ietf-babel-rtt-extension in Bird. I've tested that it talks to babeld on a single link and that the two implementations agree on each others' (smoothed) RTT values. However, I'd like to subject the code to some more tortured testing before submitting it to upstream Bird. So I'm sending this note as a request for testing.
Nice work! I replaced the bird binary and changed the interface type to "tunnel" on a mesh of four hosts. Works great so far!
Things I noticed:
Thanks for testing! This is just the kind of feedback I was looking for!
(1) When I forgot to change the config file (one side was type tunnel, one side was type wired), the Babel neighbor metric was stuck on 65535. I think this happens because the expected time stamp was not received and then the metric computation does not work. While I understand that such a "broken" setup is not really supported, it was not exactly clear where to locate the problem.
No, this should definitely be supported! The reason it breaks is simply that I misremembered the semantics of the subtlv parsing code, so it ignored the whole IHU TLV instead of just the timestamp. Will fix!
(2) Also I think it would be neat if "birdc show babel neigh" would show latency info (current latency + smoothed value).
Yup, definitely, will add!
(3) Due to route flapping I tried to increase "metric decay" to 60s. After running "birdc configure" the values became very large for one link (on one side only).
Which values become very large? And is this persistent? What kind of route flapping were you seeing?
bird: babel1: RTT sample for neighbour fe80::3 on wg2: 4294966323 us (srtt 99189.162 ms)
bird: babel1: Added RTT cost 96 to nbr fe80::3 on wg2 with srtt 99189.162 ms
Nothing changed after >1h. The opposite side was reporting sensible RTT numbers. After I restarted the daemon, the smoothed value was still off for this one link:
bird: babel1: RTT sample for neighbour fe80::3 on wg2: 1241 us (srtt 69656.646 ms)
bird: babel1: Added RTT cost 96 to nbr fe80::3 on wg2 with srtt 69656.646 ms
The srtt value did not converge after >1h. For all other links the smoothing works, e.g. for wg1 on the very same host:
bird: babel1: RTT sample for neighbour fe80::1 on wg0: 14570 us (srtt 15.876 ms)
bird: babel1: Added RTT cost 5 to nbr fe80::1 on wg0 with srtt 15.876 ms
After restarting bird once more (without changing anything) it has worked ever since:
bird: babel1: RTT sample for neighbour fe80::3 on wg2: 1261 us (srtt 1.313 ms)
In this setup wg2 was a tunnel over the local LAN, so latency was often < 1000 us. Maybe there is a problem for tiny latencies and/or larger values of "metric decay"? I did not find a way to reliably reproduce the problem.
Hmm, so that initial value is '-973' as a 32-bit unsigned int. So looks like there's an overflow in the RTT calculation. I'll add some sanity checks to make sure this doesn't happen (as indeed babeld has as well).

I force-pushed an updated version to the same branch which fixes the issues you noted above apart from the route flap one. Inspired by the first issue noted above, I also added a separate config parameter to control the *sending* of timestamps, which defaults to on.

-Toke
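The arithmetic behind the '-973' observation can be reproduced in a few lines. This is only an illustrative sketch: the function names and the way the two intervals are obtained are hypothetical, and it does not reproduce the actual check added to the branch or the one in babeld. It assumes the RTT sample is computed as the locally measured elapsed time minus the processing delay reported by the neighbour, both in microseconds.

```python
U32_MASK = 0xFFFFFFFF  # emulate 32-bit unsigned arithmetic in Python


def rtt_unchecked_u32(local_elapsed_us: int, remote_delay_us: int) -> int:
    """Subtract as a 32-bit unsigned value; a negative result wraps around."""
    return (local_elapsed_us - remote_delay_us) & U32_MASK


def rtt_checked(local_elapsed_us: int, remote_delay_us: int):
    """Return the RTT sample, or None when the result would be negative."""
    if remote_delay_us > local_elapsed_us:
        return None  # implausible sample: drop it instead of smoothing it in
    return local_elapsed_us - remote_delay_us


# A difference of -973 us wraps to the value seen in the log.
print(rtt_unchecked_u32(1_000, 1_973))  # 4294966323
print(rtt_checked(1_000, 1_973))        # None
print(rtt_checked(2_261, 1_000))        # 1261
```

Dropping the implausible sample keeps a single bad measurement from being fed into the smoothed RTT, which is the kind of sanity check described above.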
On Sat, Apr 23, 2022 at 11:20:57PM +0200, Toke Høiland-Jørgensen wrote:
(3) Due to route flapping I tried to increase "metric decay" to 60s. After running "birdc configure" the values became very large for one link (on one side only).
Which values become very large? And is this persistent? What kind of route flapping were you seeing?
The routes were "flapping" simply because the RTTs were very similar, so sometimes one next hop wins and sometimes the other. I don't think this is a problem, but expected behaviour. I was just trying out the "metric decay" setting to check whether routing tables change less often. Seems to work.
Hmm, so that initial value is '-973' as a 32-bit unsigned int. So looks like there's an overflow in the RTT calculation. I'll add some sanity checks to make sure this doesn't happen (as indeed babeld has as well).
I did not observe this problem with the new version anymore.
I force-pushed an updated version to the same branch which fixes the issues you noted above apart from the route flap one. Inspired by the first issue noted above, I also added a separate config parameter to control the *sending* of timestamps, which defaults to on.
I have been running the new version since Tuesday and everything looks good so far.

Best regards,
Stefan Haller
Stefan Haller <stefan.haller@stha.de> writes:
On Sat, Apr 23, 2022 at 11:20:57PM +0200, Toke Høiland-Jørgensen wrote:
(3) Due to route flapping I tried to increase "metric decay" to 60s. After running "birdc configure" the values became very large for one link (on one side only).
Which values become very large? And is this persistent? What kind of route flapping were you seeing?
The routes were "flapping" simply because the RTTs were very similar, so sometimes one next hop wins and sometimes the other. I don't think this is a problem, but expected behaviour. I was just trying out the "metric decay" setting to check whether routing tables change less often. Seems to work.
Ah, right. When you said 'flapping' I took that to mean 'at very short timescales', but if the damping is working that's great!
Hmm, so that initial value is '-973' as a 32-bit unsigned int. So looks like there's an overflow in the RTT calculation. I'll add some sanity checks to make sure this doesn't happen (as indeed babeld has as well).
I did not observe this problem with the new version anymore.
I force-pushed an updated version to the same branch which fixes the issues you noted above apart from the route flap one. Inspired by the first issue noted above, I also added a separate config parameter to control the *sending* of timestamps, which defaults to on.
I have been running the new version since Tuesday and everything looks good so far.
Excellent! Thanks again for testing - let me know if you run into any other issues :)

-Toke
On Thu, Apr 28, 2022 at 03:20:24PM +0200, Toke Høiland-Jørgensen wrote:
Excellent! Thanks again for testing - let me know if you run into any other issues :)
While running your `babel-rtt-01` branch (commit bb858a8673c5a3c) I received a SIGSEGV after ~2 weeks uptime (FreeBSD arm64). Here is the backtrace:
# lldb --core bird.core bird
(lldb) target create "bird" --core "bird.core"
Core file '/root/bird/bird.core' (aarch64) was loaded.
(lldb) bt all
* thread #1, name = 'bird', stop reason = signal SIGSEGV
  * frame #0: 0x00000000002b2860 bird`babel_add_seqno_request [inlined] rem_node(n=0x0000000040322030) at lists.c:149:11
    frame #1: 0x00000000002b2860 bird`babel_add_seqno_request(p=0x0000000040c282a0, e=0x000000004031b1a0, router_id=4919093121465068033, seqno=39, hop_count='\0', nbr=0x0000000040c9b020) at babel.c:414:7
    frame #2: 0x00000000002b2470 bird`babel_handle_update(m=0x0000000040324d50, ifa=<unavailable>) at babel.c:1454:5
    frame #3: 0x00000000002b6b84 bird`babel_rx_hook [inlined] babel_process_packet(ifa=0x0000000040c207a0, pkt=<unavailable>, len=<unavailable>, saddr=<unavailable>, sport=<unavailable>, daddr=<unavailable>, dport=6696) at packets.c:1644:7
    frame #4: 0x00000000002b696c bird`babel_rx_hook(sk=<unavailable>, len=<unavailable>) at packets.c:1703:3
    frame #5: 0x00000000002e97f0 bird`sk_read(s=0x0000000040c208c0, revents=<unavailable>) at io.c:1914:7
    frame #6: 0x00000000002eab7c bird`io_loop at io.c:2349:5
    frame #7: 0x00000000002edf80 bird`main(argc=<unavailable>, argv=<unavailable>) at main.c:940:3
    frame #8: 0x000000000026ceb4 bird`__start(argc=4, argv=0x0000ffffffffe9f8, env=0x0000ffffffffea20, cleanup=<unavailable>) at crt1_c.c:70:7
    frame #9: 0x0000000040326018 ld-elf.so.1`___lldb_unnamed_symbol27 + 24

At first glance it seems unlikely that the RTT changes are the root cause. I don't know if it helps, but I could make the binary and core dump available.

Best regards,
Stefan Haller
Stefan Haller <stefan.haller@stha.de> writes:
On Thu, Apr 28, 2022 at 03:20:24PM +0200, Toke Høiland-Jørgensen wrote:
Excellent! Thanks again for testing - let me know if you run into any other issues :)
At first glance it seems unlikely that the RTT changes are the root cause. I don't know if it helps, but I could make the binary and core dump available.
Not directly, but you may be hitting a condition that's rare without it; I think I see the problem, will send a patch. Thank you for the report!

-Toke
Toke Høiland-Jørgensen <toke@toke.dk> writes:
Not directly, but you may be hitting a condition that's rare without it; I think I see the problem, will send a patch. Thank you for the report!
Alright, sent a patch; also force-pushed the babel-rtt-01 branch to incorporate it, so you can pull and reset if you want to incorporate it in your system :)

-Toke
On Fri, Apr 22, 2022 at 01:48:46AM +0200, Toke Høiland-Jørgensen wrote:
Hi everyone
I've implemented the Babel RTT extension specified in draft-ietf-babel-rtt-extension in Bird. I've tested that it talks to babeld on a single link and that the two implementations agree on each others' (smoothed) RTT values. However, I'd like to subject the code to some more tortured testing before submitting it to upstream Bird. So I'm sending this note as a request for testing.
Hi

That seems like an interesting idea, especially for things like automatically switching between multiple Wireguard tunnel concentrators.

I have not yet checked in the code how smoothing is done here, but it seems to me that, considering:

1) There is baseline RTT from distance / speed of propagation
2) There is one-sided noise from congestion
3) The metric should be based on 1) and suppress effects of 2)

it would make sense to use something like a running minimum instead of a running average.
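To illustrate the difference between the two estimators, here is a small sketch comparing an exponentially weighted running average with a windowed running minimum on samples where congestion only ever adds delay. The sample values, the smoothing weight and the window size are made up for the example, and neither function is taken from Bird or babeld.

```python
from collections import deque


def running_average(samples, weight=0.25):
    """Exponentially weighted average; `weight` is the share of each new sample."""
    estimate = None
    for sample in samples:
        if estimate is None:
            estimate = sample
        else:
            estimate = (1 - weight) * estimate + weight * sample
    return estimate


def running_minimum(samples, window=8):
    """Smallest sample among the most recent `window` samples."""
    recent = deque(maxlen=window)
    for sample in samples:
        recent.append(sample)
    return min(recent)


# Hypothetical RTT samples in microseconds: a baseline of about 10 ms
# with occasional congestion spikes that only ever add delay.
samples_us = [10_000, 10_200, 48_000, 10_100, 35_000, 10_050, 10_000, 60_000]

print(running_average(samples_us))  # pulled well above the 10 ms baseline
print(running_minimum(samples_us))  # 10000
```

The minimum tracks the propagation baseline as long as at least one uncongested sample falls inside the window, while the average moves with every spike. The trade-off is that a minimum only reacts to a genuine increase in path delay once the older, smaller samples have left the window.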
The code currently lives here:
https://github.com/tohojo/bird/tree/babel-rtt-01
To compile it, do:
  git clone -b babel-rtt-01 https://github.com/tohojo/bird
  cd bird && autoreconf && ./configure && make
To run it, create a config file enabling the RTT extension; the simplest way to do this is to set the interface type to 'tunnel'.
There is one thing that is IMHO a bit strange: the type wired/wireless/tunnel option is just an indirect way to set the k-from-j / ETX / RTT-based cost algorithm. But RTT-based cost has applications that are unrelated to tunnels. It seems to me that it would make sense to have a direct option to set the cost algorithm (and just that), while the 'type' option would provide a reasonable default for that (and possibly other options).

--
Elen sila lumenn' omentielvo

Ondrej 'Santiago' Zajicek (email: santiago@crfreenet.org)
OpenPGP encrypted e-mails preferred (KeyID 0x11DEADC3, wwwkeys.pgp.net)
"To err is human -- to blame it on a computer is even more so."