Hi Toke (and others) I updated Babel to use microsecond timers [1] and during testing i noticed several unrelated issues. 1) Hello expiration was not properly implemented, as i understand RFC, it should reset the timer based on last packet hello interval and time-out repeatedly until hello map is zero. Fixed in [2]. 2) In some setups unicast seqno requests were sent repeatedly and seqno was increased indefinitely due to last case in babel_handle_update(), when the RFC specifies it should be done only if unfeasible route would otherwise become best. Fixed in [3], by this condition handling all unicast seqno requests: if (!feasible && (metric != BABEL_INFINITY) && (!best || (r == best) || (metric < best->metric))) babel_unicast_seqno_request(e, s, nbr); It also changes the behavior by sending seqno request when receiving infeasible update if there is no previous route from that neighbor. 3) Seems like the fix of hello expiration caused that when neighbor disappears, it is no longer removed timely and routes have to wait for route timeout, which is about a minute for 4 s hello interval. It seems that by RFC, we should recompute route metrics whanever rxcost reported by IHU is changed, so IHU expiration should be enough to make all related routes unreachable. Apparently, this is not done in BIRD. 4) Seems like we do not trigger updates when a route update with just increased seqno due to forwarded unicast request is received. That also slows convergence. IMHO we should do in babel_handle_update() something like: if (route_is_selected() && matching_entry_in_seqno_request_cache()) babel_trigger_update() Do you think that my interpretation of these issues and the fixes are correct? [1] https://gitlab.labs.nic.cz/labs/bird/commit/8a76785a6fb23a64fcc93f998336794f... [2] https://gitlab.labs.nic.cz/labs/bird/commit/c7fa4efd701716a9c40620f2e3e8f065... [3] https://gitlab.labs.nic.cz/labs/bird/commit/c54526afeb4187d55993c3ab777fcfe5... -- Elen sila lumenn' omentielvo Ondrej 'Santiago' Zajicek (email: santiago@crfreenet.org) OpenPGP encrypted e-mails preferred (KeyID 0x11DEADC3, wwwkeys.pgp.net) "To err is human -- to blame it on a computer is even more so."
Ondrej Zajicek <santiago@crfreenet.org> writes:
Hi Toke (and others)
I updated Babel to use microsecond timers [1] and during testing i noticed several unrelated issues.
Ah, awesome, thanks for doing that! :)
1) Hello expiration was not properly implemented, as i understand RFC, it should reset the timer based on last packet hello interval and time-out repeatedly until hello map is zero. Fixed in [2].
Yeah, I think you are right. However, your fix doesn't apply BABEL_HELLO_EXPIRY_FACTOR to the value stored in last_hello_int, which is probably a bug, right?
2) In some setups unicast seqno requests were sent repeatedly and seqno was increased indefinitely due to last case in babel_handle_update(), when the RFC specifies it should be done only if unfeasible route would otherwise become best. Fixed in [3], by this condition handling all unicast seqno requests:
if (!feasible && (metric != BABEL_INFINITY) && (!best || (r == best) || (metric < best->metric))) babel_unicast_seqno_request(e, s, nbr);
It also changes the behavior by sending seqno request when receiving infeasible update if there is no previous route from that neighbor.
Yup, this seams reasonable as well.
3) Seems like the fix of hello expiration caused that when neighbor disappears, it is no longer removed timely and routes have to wait for route timeout, which is about a minute for 4 s hello interval.
It seems that by RFC, we should recompute route metrics whanever rxcost reported by IHU is changed, so IHU expiration should be enough to make all related routes unreachable. Apparently, this is not done in BIRD.
Hmm, yeah, babel_expire_ihu() should probably do something similar to babel_flush_neighbor(), i.e., loop through all selected routes from that neighbor and re-run the route selection on them?
4) Seems like we do not trigger updates when a route update with just increased seqno due to forwarded unicast request is received. That also slows convergence. IMHO we should do in babel_handle_update() something like:
if (route_is_selected() && matching_entry_in_seqno_request_cache()) babel_trigger_update()
Not sure we should use the request cache for this? We might not have seen the request, but a downstream still needs the updated seqno; so maybe just do something like: diff --git a/proto/babel/babel.c b/proto/babel/babel.c index c5d695c9..06e1d3ad 100644 --- a/proto/babel/babel.c +++ b/proto/babel/babel.c @@ -1219,6 +1219,11 @@ babel_handle_update(union babel_msg *m, struct babel_iface *ifa) } else { + /* Trigger an update if we get a seqno increase for a selected route, to + improve convergence time */ + if (r->seqno < msg->seqno && r == best) + babel_trigger_update(p); + /* Last paragraph above - update the entry */ r->advert_metric = msg->metric; r->metric = metric;
Do you think that my interpretation of these issues and the fixes are correct?
Think so, yeah; nice job. And thanks for taking the time to go over this :) -Toke
On Thu, Oct 19, 2017 at 01:23:45PM +0200, Toke Høiland-Jørgensen wrote:
Ondrej Zajicek <santiago@crfreenet.org> writes:
1) Hello expiration was not properly implemented, as i understand RFC, it should reset the timer based on last packet hello interval and time-out repeatedly until hello map is zero. Fixed in [2].
Yeah, I think you are right. However, your fix doesn't apply BABEL_HELLO_EXPIRY_FACTOR to the value stored in last_hello_int, which is probably a bug, right?
I think it is correct, see section A.1: If the Hello history is empty (it contains 0 bits only), the neighbour entry is flushed; otherwise, it resets the neighbour's Hello timer to the value advertised in the last Hello received from this neighbour (no extra margin is necessary in this case). The BABEL_HELLO_EXPIRY_FACTOR is the extra margin for jitter.
3) Seems like the fix of hello expiration caused that when neighbor disappears, it is no longer removed timely and routes have to wait for route timeout, which is about a minute for 4 s hello interval.
It seems that by RFC, we should recompute route metrics whanever rxcost reported by IHU is changed, so IHU expiration should be enough to make all related routes unreachable. Apparently, this is not done in BIRD.
Hmm, yeah, babel_expire_ihu() should probably do something similar to babel_flush_neighbor(), i.e., loop through all selected routes from that neighbor and re-run the route selection on them?
I think route selection re-run should be done when IHU changes rxcost (either explicitly, or by expiration). It is explained in section 3.4.2.: After updating a neighbour's txcost, the receiving node recomputes the neighbour's cost (Section 3.4.3) and runs the route selection procedure (Section 3.6). And it seems that BIRD should re-run the route selection even if effective cost is changed by Hello receipt or expiration: After receiving a Hello, or determining that it has missed one, the node recomputes the association's cost (Section 3.4.3) and runs the route selection procedure (Section 3.6).
4) Seems like we do not trigger updates when a route update with just increased seqno due to forwarded unicast request is received. That also slows convergence. IMHO we should do in babel_handle_update() something like:
if (route_is_selected() && matching_entry_in_seqno_request_cache()) babel_trigger_update()
Not sure we should use the request cache for this? We might not have seen the request, but a downstream still needs the updated seqno; so maybe just do something like:
IMHO the fast propagation is necessary just for the originator of the seqno request and there exist at least one path of routers that propagates such request. Others learn updated seqno during regular updates. It is mentioned in section 3.7.2: A node SHOULD NOT send triggered updates for other reasons, such as when there is a minor fluctuation in a route's metric, when the selected next hop changes, or to propagate a new sequence number (except to satisfy a request, as specified in Section 3.8).
Do you think that my interpretation of these issues and the fixes are correct?
Think so, yeah; nice job. And thanks for taking the time to go over this :)
You are welcome. -- Elen sila lumenn' omentielvo Ondrej 'Santiago' Zajicek (email: santiago@crfreenet.org) OpenPGP encrypted e-mails preferred (KeyID 0x11DEADC3, wwwkeys.pgp.net) "To err is human -- to blame it on a computer is even more so."
Ondrej Zajicek <santiago@crfreenet.org> writes:
On Thu, Oct 19, 2017 at 01:23:45PM +0200, Toke Høiland-Jørgensen wrote:
Ondrej Zajicek <santiago@crfreenet.org> writes:
1) Hello expiration was not properly implemented, as i understand RFC, it should reset the timer based on last packet hello interval and time-out repeatedly until hello map is zero. Fixed in [2].
Yeah, I think you are right. However, your fix doesn't apply BABEL_HELLO_EXPIRY_FACTOR to the value stored in last_hello_int, which is probably a bug, right?
I think it is correct, see section A.1:
If the Hello history is empty (it contains 0 bits only), the neighbour entry is flushed; otherwise, it resets the neighbour's Hello timer to the value advertised in the last Hello received from this neighbour (no extra margin is necessary in this case).
The BABEL_HELLO_EXPIRY_FACTOR is the extra margin for jitter.
Ah, right, gotcha.
3) Seems like the fix of hello expiration caused that when neighbor disappears, it is no longer removed timely and routes have to wait for route timeout, which is about a minute for 4 s hello interval.
It seems that by RFC, we should recompute route metrics whanever rxcost reported by IHU is changed, so IHU expiration should be enough to make all related routes unreachable. Apparently, this is not done in BIRD.
Hmm, yeah, babel_expire_ihu() should probably do something similar to babel_flush_neighbor(), i.e., loop through all selected routes from that neighbor and re-run the route selection on them?
I think route selection re-run should be done when IHU changes rxcost (either explicitly, or by expiration). It is explained in section 3.4.2.:
After updating a neighbour's txcost, the receiving node recomputes the neighbour's cost (Section 3.4.3) and runs the route selection procedure (Section 3.6).
And it seems that BIRD should re-run the route selection even if effective cost is changed by Hello receipt or expiration:
After receiving a Hello, or determining that it has missed one, the node recomputes the association's cost (Section 3.4.3) and runs the route selection procedure (Section 3.6).
Ah, right, of course it should be done when the metric changes in either direction; silly me. Well, a new babel_neighbor_update_routes() function is warranted, I guess...
4) Seems like we do not trigger updates when a route update with just increased seqno due to forwarded unicast request is received. That also slows convergence. IMHO we should do in babel_handle_update() something like:
if (route_is_selected() && matching_entry_in_seqno_request_cache()) babel_trigger_update()
Not sure we should use the request cache for this? We might not have seen the request, but a downstream still needs the updated seqno; so maybe just do something like:
IMHO the fast propagation is necessary just for the originator of the seqno request and there exist at least one path of routers that propagates such request. Others learn updated seqno during regular updates. It is mentioned in section 3.7.2:
A node SHOULD NOT send triggered updates for other reasons, such as when there is a minor fluctuation in a route's metric, when the selected next hop changes, or to propagate a new sequence number (except to satisfy a request, as specified in Section 3.8).
My thinking was that there may be a case where the request made it back to the source in a roundabout way and that we could be helping by just flushing the route with the updated seqno as soon as possible. But OK, if the RFC says not to do that, better check the request cache first ;) -Toke
participants (2)
-
Ondrej Zajicek -
Toke Høiland-Jørgensen