Baptiste Jonglez <baptiste@bitsofnetworks.org> writes:
On Sat, Apr 30, 2016 at 01:20:14AM +0200, Toke Høiland-Jørgensen wrote:
On 29 April 2016 20:24:43 CEST, Ondrej Zajicek <santiago@crfreenet.org> wrote:
I would like to thank Toke Høiland-Jørgensen for the Babel protocol implementation, which was finally merged despite my tardy code reviews.
Yay, awesome! You're very welcome, and congratulations on the release! :)
Congrats, Toke!
I have just tested your implementation (from a pure user perspective, I haven't looked at the code), and it looks great!
Thanks, and thanks for taking it for a spin! Some comments below.
Thanks to you, I will have to switch my test overlay network to IPv6-only...
Sorry about that ;)
=== Configuration ===
When the quotes are missing around the interface name, the error message is misleading. This configuration:
protocol babel { interface eth0; }
gives:
/etc/bird6.conf, line 240: IP address expected
which is very strange, because an IP address does not make sense here (and something like "interface 2001:db8::42" obviously gives an error).
Not sure why this is. The inner workings of the configuration parsing still has some a good portion of black magic...
=== Filtering ===
When updating a filter that removes some routes (e.g. switching from "export all" to "export none"), the routes stay in the Babel table, even though they are no longer exported to the protocol:
bird> show babel entries babel1 babel1: Prefix Router ID Metric Seqno Expires Sources 2001:db8:42::/64 <pending> 1 bird> show route export babel1 bird>
These are routes coming from the box itself (not from peers)? They should expire after BABEL_HOLD_TIME (10 seconds)...
=== Hello interval range checking ===
The hello interval cannot be set below 1 second: the parser seems to expect an integer. The packet format encodes intervals in centiseconds, so it would make sense to allow any fractional Hello interval down to 0.01 seconds.
I just checked, babeld's parser allows to go as low as 0.001s (which is a bit scary in itself, since it generates about 1 kpps / 1.5 Mbps of control traffic).
Hmm, yes. However, since the internal Bird timers run at a granularity of seconds only there's not much point in having the ability to configure smaller values.
On a related note, there is no overflow check on the hello interval. For instance, setting the hello interval to 800 seconds produces this in tcpdump:
Hello seqno 1 interval 144.64s
Right, yeah, the overflow check is done on the specified values not their centisecond equivalents. Oops.
=== Interaction between reconfiguration and timers ===
Changing the hello interval and telling bird to reconfigure itself leads to a somewhat inconsistent behaviour:
- when decreasing the hello interval, the old interval is still used for the next Hello message, but the new interval is immediately taken into account for IHU and Update messages. I don't think this is a real interoperability issue, but it's just strange to see a lot of Updates and IHU without any Hello (when decreasing the hello interval substantially, for instance from one minute to 3 seconds)
Increasing the hello interval works as expected (the old interval is still used for the next Hello, IHU and Update messages). This looks correct and consistent with the RFC, section 3.4.1: the interval change is actually taken into account when sending the next Hello message, because otherwise our neighbours could think we are dead.
Hmm, is this consistent? Reconfiguration simply replaces the struct with the values in the internal data structures, so the only reason I can see why you would get that behaviour is because there's a TLV that happens to be queued at the time you reconfigure. You can verify this by looking at the debug output when running with TRACE level debugging enabled; that outputs messages when the TLVs are generated, not when they are sent out. I do seem to have forgotten to generate a new hello when reconfiguring (RFC section 3.4.1: "Equivalently, a node SHOULD send an unscheduled Hello immediately after increasing its Hello interval."). I do believe just adding that should resolve the issues with inconsistent behaviour?
=== TLV parsing error ===
When a neighbouring babeld starts or stop, Bird complains about an error parsing a TLV.
When starting babeld:
avril 30 12:23:38 lud bird6[17307]: babel1: Bad TLV from fe80::e8db:78ff:fe05:8a64 via tap-fastd type 8 pos 18 - parse error avril 30 12:23:38 lud bird6[17307]: babel1: Bad TLV from fe80::e8db:78ff:fe05:8a64 via tap-fastd type 8 pos 18 - parse error
When stopping babeld:
avril 30 12:23:59 lud bird6[17307]: babel1: Bad TLV from fe80::e8db:78ff:fe05:8a64 via tap-fastd type 8 pos 4 - parse error avril 30 12:23:59 lud bird6[17307]: babel2: Bad TLV from fe80::e8db:78ff:fe05:8a64 via tap-fastd type 8 pos 4 - parse error
Packet captures are attached (babeld-start-parse-error.pcap and babeld-stop-parse-error.pcap). The local node running Bird is fe80::c07d:62ff:fea6:b8f4, all other speakers are babeld 1.5.0.
Type 8 is the Update TLV, I think Bird complains about this particular message (here seen by tcpdump):
Update/id ::/0 metric 65535 seqno 21902 interval infinity
I remember running into this. What happens here is that babeld sends an update without a preceding router_id TLV, with a wildcard address, but flag 0x40 set (meaning "infer the router ID from the address"). While I'm not sure what the purpose of this is (a null update with a null router ID with infinity metric and interval?) it *is* technically in spec. I think the reason why Bird complains is that Ondrej's cleanup of my (admittedly messy) packet parsing code inadvertently moved the check for the 0x40 flag inside the case branch for AE_IP6. If you turn on debugging you should get a "No router ID seen before update" message which will confirm that this is indeed the issue. Most of these issues are fairly trivial fixes. I'll produce a patch once I'm done grokking Ondrej's code cleanup changes :) -Toke