Hi

During testing of Babel we noticed some strange behavior, and we are not sure whether it is a problem in the implementation or in the specification. The sequence of events is:

1) Router A is running, has originated some routes with seqno old_seqno, and has propagated them to router B.
2) Router A (or its Babel protocol) is restarted, and the seqno counter used for originating routes is reset to 1.
3) Routes are then originated with seqno 1 and propagated to router B.
4) Router B already has these routes with the old, higher seqno, so the update is considered unfeasible and, because it is for selected entries, it is ignored (3.5.4).
5) The routes in router B's table slowly expire.
6) When the routes expire, router B sends a seqno request (3.8.2.1) with old_seqno+1.
7) Router A receives the seqno request, increases its seqno to 2 (3.8.1.2), originates routes with seqno 2 and sends them to router B.
8) Router B still ignores them, as 2 < old_seqno.
9) Subsequent updates are still ignored, because router B keeps the old timed-out entry as the selected entry with an unreachable metric (as there is no other one) until GC time.

I have two questions w.r.t. this sequence of events:

1) How is router restart and seqnos supposed to be handled without waiting for route timeout?
2) If a route is selected, then becomes unreachable/retracted, and there is no other route to be selected, is it still considered selected? I would say that no as the selection process (3.6) forbids retracted routes to be selected, but the BIRD implementation keeps the old selected route (now unreachable) in this case.

--
Elen sila lumenn' omentielvo

Ondrej 'Santiago' Zajicek (email: santiago@crfreenet.org)
OpenPGP encrypted e-mails preferred (KeyID 0x11DEADC3, wwwkeys.pgp.net)
"To err is human -- to blame it on a computer is even more so."
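Steps 4 and 8 of the report follow from the feasibility check. Below is a minimal C sketch of that check, assuming the feasibility condition and the modulo-2^16 seqno comparison as described in the Babel RFC. The names `seqno_newer`, `source_entry` and `update_feasible` are made up for illustration and are not taken from BIRD or babeld.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define METRIC_INFINITY 0xFFFF

/* Seqnos are compared modulo 2^16: a is newer than b when the
 * forward distance from b to a is between 1 and 32767. */
static bool seqno_newer(uint16_t a, uint16_t b)
{
    uint16_t diff = (uint16_t)(a - b);
    return diff != 0 && diff < 0x8000;
}

/* Feasibility distance kept for one source (hypothetical layout). */
struct source_entry {
    uint16_t seqno;
    uint16_t metric;
};

/* An update is feasible if it is a retraction, if no source entry
 * exists, if its seqno is strictly newer than the stored one, or if
 * the seqno is equal and the metric is strictly smaller. */
static bool update_feasible(const struct source_entry *src,
                            uint16_t seqno, uint16_t metric)
{
    if (metric == METRIC_INFINITY || src == NULL)
        return true;
    if (seqno_newer(seqno, src->seqno))
        return true;
    return seqno == src->seqno && metric < src->metric;
}
```

With a stored source entry at seqno 100, updates carrying seqno 1 or 2 are rejected, while an update carrying seqno 101 is accepted. Because the comparison is modular, the outcome would differ if old_seqno exceeded the new seqno by more than 32768.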
Ondrej Zajicek <santiago@crfreenet.org> writes:
I have two questions w.r.t. this sequence of events:
1) How is router restart and seqnos supposed to be handled without waiting for route timeout?
This is indeed a problem. Babeld sends a wildcard retraction before shutting down (which I just taught Bird to handle; will send off the patch tonight after testing it properly). I can add in sending such retractions when an interface goes away as well. But this problem can still occur on a crash, or if the retraction was lost for whatever reason. In that case, waiting for things to expire is the only option.
2) If a route is selected, then becomes unreachable/retracted, and there is no other route to be selected, is it still considered selected? I would say that no as the selection process (3.6) forbids retracted routes to be selected, but the BIRD implementation keeps the old selected route (now unreachable) in this case.
It is kept for a while (and installed as unreachable) to avoid transient routing loops. This is described in section 2.8 of the RFC. -Toke
On Mon, May 02, 2016 at 03:59:35PM +0200, Toke Høiland-Jørgensen wrote:
Ondrej Zajicek <santiago@crfreenet.org> writes:
I have two questions w.r.t. this sequence of events:
1) How is router restart and seqnos supposed to be handled without waiting for route timeout?
This is indeed a problem. Babeld sends a wildcard retraction before shutting down (which I just taught Bird to handle; will send off the patch tonight after testing it properly). I can add in sending such retractions when an interface goes away as well.
But this problem can still occur on a crash, or if the retraction was lost for whatever reason. In that case, waiting for things to expire is the only option.
Perhaps sending wildcard retraction not only in the last packet, but also in the first one?
2) If a route is selected, then becomes unreachable/retracted, and there is no other route to be selected, is it still considered selected? I would say that no as the selection process (3.6) forbids retracted routes to be selected, but the BIRD implementation keeps the old selected route (now unreachable) in this case.
It is kept for a while (and installed as unreachable) to avoid transient routing loops. This is described in section 2.8 of the RFC.
Well, section 2.8 (and in more detail section 3.5.5) specifies that we should keep unreachable entries, but IMHO it does not specify that the old route is considered selected/installed for the purpose of the conditions in section 3.5.4. The unreachable entry after retraction could be understood as a special case, unrelated to any route.
Ondrej Zajicek <santiago@crfreenet.org> writes:
Perhaps sending wildcard retraction not only in the last packet, but also in the first one?
Don't see why not (and I think babeld does the same, actually). Currently we are doing this when an iface comes up:

    babel_send_hello(ifa, 0);
    babel_send_wildcard_request(ifa);
    babel_send_update(ifa, 0); /* Full update */

Guess sticking a wildcard retraction in after the hello wouldn't hurt.
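The ordering suggested here could look roughly like the sketch below. It is only an illustration: the three existing function names come from the snippet above, but their signatures and bodies here are stand-in stubs that record what was sent, and `babel_send_wildcard_retraction`, `struct babel_iface`, `iface_up` and the `msg` enum are assumed names, not confirmed BIRD code.

```c
#include <stddef.h>

/* Stand-in interface type; the real structure is not shown in the thread. */
struct babel_iface { const char *name; };

enum msg { MSG_HELLO, MSG_WILDCARD_RETRACTION, MSG_WILDCARD_REQUEST, MSG_UPDATE };

/* Log of messages "sent", so the ordering can be inspected. */
static enum msg sent[8];
static size_t n_sent;

static void record(enum msg m)
{
    if (n_sent < sizeof(sent) / sizeof(sent[0]))
        sent[n_sent++] = m;
}

/* Stubs: each one only records that it was called. */
static void babel_send_hello(struct babel_iface *ifa, int interval)
{ (void)ifa; (void)interval; record(MSG_HELLO); }

static void babel_send_wildcard_retraction(struct babel_iface *ifa) /* assumed name */
{ (void)ifa; record(MSG_WILDCARD_RETRACTION); }

static void babel_send_wildcard_request(struct babel_iface *ifa)
{ (void)ifa; record(MSG_WILDCARD_REQUEST); }

static void babel_send_update(struct babel_iface *ifa, int changed)
{ (void)ifa; (void)changed; record(MSG_UPDATE); }

/* Interface-up sequence with the retraction placed right after the hello. */
static void iface_up(struct babel_iface *ifa)
{
    babel_send_hello(ifa, 0);
    babel_send_wildcard_retraction(ifa); /* proposed addition */
    babel_send_wildcard_request(ifa);
    babel_send_update(ifa, 0);           /* Full update */
}
```

This only shows where the retraction would go in the sequence; it says nothing about how a neighbour that still holds old source entries reacts to it.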
2) If a route is selected, then becomes unreachable/retracted, and there is no other route to be selected, is it still considered selected? I would say that no as the selection process (3.6) forbids retracted routes to be selected, but the BIRD implementation keeps the old selected route (now unreachable) in this case.
It is kept for a while (and installed as unreachable) to avoid transient routing loops. This is described in section 2.8 of the RFC.
Well, section 2.8 (and in more detail section 3.5.5) specifies that we should keep unreachable entries, but IMHO it does not specify that the old route is considered selected/installed for the purpose of the conditions in section 3.5.4. The unreachable entry after retraction could be understood as a special case, unrelated to any route.
Hmm, you mean babel_select_route() should clear selected_in after announcing the unreachable route to the core? Yes, that would appear to be needed for the retraction to have any effect if the same router comes back up before the entries are garbage collected... -Toke
Well, section 2.8 (and in more detail section 3.5.5) specifies that we should keep unreachable entries, but IMHO it does not specify that the old route is considered selected/installed for the purpose of the conditions in section 3.5.4. The unreachable entry after retraction could be understood as a special case, unrelated to any route.
Why does it matter? Isn't the behaviour the same with both interpretations? -- Juliusz
On Thu, May 05, 2016 at 03:04:21PM +0200, Juliusz Chroboczek wrote:
Well, section 2.8 (and in more detail section 3.5.5) specifies that we should keep unreachable entries, but IMHO it does not specify that the old route is considered selected/installed for the purpose of the conditions in section 3.5.4. The unreachable entry after retraction could be understood as a special case, unrelated to any route.
Why does it matter? Isn't the behaviour the same with both interpretations?
You are right, in this case the difference does not matter. (I was confused by mixing up the route timeout with the source timeout.)

BTW, why does Babel accept unfeasible updates of non-selected routes? It will not cause problems, as such a route cannot be selected later (due to its unfeasibility), but it seems strange.
BTW, why does Babel accept unfeasible updates of non-selected routes? It will not cause problems, as such a route cannot be selected later (due to its unfeasibility), but it seems strange.
Yeah, very good question. It's counterintuitive for me too, but it turns out to work better that way:

1. Having an unfeasible route available makes it possible to use it for fallback after a single seqno increase. If the route were not in your routing table at all, you'd need to acquire it after your selected route disappears, which may take some time, and will require even more time for things like link quality and hysteresis to converge.

2. Having an unfeasible route available makes it available for sending unicast requests (Section 3.8.1.2, fourth paragraph).

3. If the metric is non-isotonic, the best route might actually be unfeasible (in which case you'll need to send a request to select it, but everything will work out in the end). Yeah, non-isotonic metrics are strange, but they do exist in nature.

-- Juliusz
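The first point can be pictured with a small C sketch. It assumes a simplified route table in which each stored route carries a precomputed feasibility flag; `struct route`, `enum action` and `pick` are hypothetical names, and real selection also involves hysteresis and link quality, which are left out here.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One stored route to a prefix (hypothetical, heavily simplified). */
struct route {
    uint16_t metric;
    bool feasible;
};

enum action { USE_ROUTE, SEND_SEQNO_REQUEST, NO_ROUTE };

/* Pick the lowest-metric feasible route and store its index in *best.
 * If only unfeasible routes are stored, report that a seqno request
 * could make one of them usable; if nothing is stored, there is no
 * route at all and it would first have to be learned again. */
static enum action pick(const struct route *r, size_t n, size_t *best)
{
    bool have_unfeasible = false;
    bool found = false;

    for (size_t i = 0; i < n; i++) {
        if (!r[i].feasible) {
            have_unfeasible = true;
            continue;
        }
        if (!found || r[i].metric < r[*best].metric) {
            *best = i;
            found = true;
        }
    }
    if (found)
        return USE_ROUTE;
    return have_unfeasible ? SEND_SEQNO_REQUEST : NO_ROUTE;
}
```

When the selected route disappears and only an unfeasible one remains, the sketch returns `SEND_SEQNO_REQUEST` rather than `NO_ROUTE`, which is the difference between waiting for one seqno increase and having to acquire the route from scratch.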
On Wed, May 11, 2016 at 03:48:19PM +0200, Juliusz Chroboczek wrote:
BTW, why does Babel accept unfeasible updates of non-selected routes? It will not cause problems, as such a route cannot be selected later (due to its unfeasibility), but it seems strange.
Yeah, very good question. It's counterintuitive for me too, but it turns out to work better that way:
1. Having an unfeasible route available makes it possible to use it for fallback after a single seqno increase. If the route were not in your routing table at all, you'd need to acquire it after your selected route disappears, which may take some time, and will require even more time for things like link quality and hysteresis to converge. ...
Thanks, that makes sense. But now I wonder why not accept unfeasible updates of selected routes? (At least in the case where the router ids differ and the update is handled as a retraction.) Obviously, that would cause the route to be de-selected.
But now I wonder why not accept unfeasible updates of selected routes? (At least in the case where the router ids differ and the update is handled as a retraction.) Obviously, that would cause the route to be de-selected.
I think you're right, that's a possible improvement. -- Juliusz
What you describe is perfectly correct.
I have two questions w.r.t. this sequence of events:
1) How is router restart and seqnos supposed to be handled without waiting for route timeout?
It's worse than that, actually -- it's not the route timeout, it's the source GC time.

The issue is a consequence of having a stateful loop-avoidance algorithm: if the state is lost, the loop-avoidance algorithm gets confused, and only recovers after the state has expired.

Babeld currently has two workarounds:

- it stores the current seqno on disk when it shuts down, so that it can use the same seqno when it restarts;
- it can optionally draw a random router-id at startup, so that the old and new states don't interfere.

It would be great to design a procedure to recover from this case without a timeout, but I haven't given it much thought yet. So for now consider it as a flaw in the protocol.

-- Juliusz
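The first workaround, keeping the seqno across restarts, might be sketched as follows. This is an illustration under assumptions, not babeld's actual code: the one-number text file format and the helper names `seqno_save` and `seqno_load` are invented for the example.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Write the current seqno to a state file at shutdown.
 * Returns false on any I/O error. */
static bool seqno_save(const char *path, uint16_t seqno)
{
    FILE *f = fopen(path, "w");
    if (f == NULL)
        return false;
    bool ok = fprintf(f, "%u\n", (unsigned)seqno) > 0;
    return fclose(f) == 0 && ok;
}

/* Read the saved seqno at startup. The fallback is returned when the
 * file is missing or malformed, e.g. after a first start or a crash
 * that happened before any state was written. */
static uint16_t seqno_load(const char *path, uint16_t fallback)
{
    FILE *f = fopen(path, "r");
    unsigned value = 0;

    if (f == NULL)
        return fallback;
    int n = fscanf(f, "%u", &value);
    fclose(f);
    return (n == 1 && value <= 0xFFFF) ? (uint16_t)value : fallback;
}
```

A state file written only at shutdown does not cover a crash, which is the case where the restart problem described in this thread remains.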
On Thu, May 05, 2016 at 03:02:55PM +0200, Juliusz Chroboczek wrote:
What you describe is perfectly correct.
I have two questions w.r.t. this sequence of events:
1) How is router restart and seqnos supposed to be handled without waiting for route timeout?
It's worse than that, actually -- it's not the route timeout, it's the source GC time.
Yes, you are right, I missed this.
The issue is a consequence of having a stateful loop-avoidance algorithm: if the state is lost, the loop-avoidance algorithm gets confused, and only recovers after the state has expired.
Babeld currently has two workarounds:
- it stores the current seqno on disk when it shuts down, so that it can use the same seqno when it restarts;
- it can optionally draw a random router-id at startup, so that the old and new states don't interfere.
It would be great to design a procedure to recover from this case without a timeout, but I haven't given it much thought yet. So for now consider it as a flaw in the protocol.
Using a random router-id seems like a good idea. Perhaps even a TLV that describes the 'nominal' configured router-id, so the regular router-id could be random, but routes could still contain the configured router-id for admin purposes. Unfortunately, Babel does not have support for something like Opaque-LSA.

Would it not help with this issue to simply allow increasing the seqno by more than 1 in reaction to a received seqno request (3.8.1.2), to the value rcv_seqno+1?
Using a random router-id seems like a good idea. Perhaps even a TLV that describes the 'nominal' configured router-id, so the regular router-id could be random, but routes could still contain the configured router-id for admin purposes. Unfortunately, Babel does not have support for something like Opaque-LSA.
Right on all counts.
Would it not help with this issue to simply allow increasing the seqno by more than 1 in reaction to a received seqno request (3.8.1.2), to the value rcv_seqno+1?
I'm not sure it doesn't break loop avoidance, I'd need to think it over. -- Juliusz
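To make the difference between the two behaviours concrete, here is a small C sketch. It compares the rule from step 7 of the original report (increase the seqno by one) with the variant proposed above (jump to rcv_seqno+1). The function names are hypothetical, and, as noted in the reply, it is an open question whether the second variant preserves loop avoidance.

```c
#include <stdbool.h>
#include <stdint.h>

/* Seqnos are compared modulo 2^16. */
static bool seqno_newer(uint16_t a, uint16_t b)
{
    uint16_t diff = (uint16_t)(a - b);
    return diff != 0 && diff < 0x8000;
}

/* Behaviour from step 7 of the report: a request for a seqno newer
 * than our own bumps our seqno by exactly one. */
static uint16_t on_request_increment(uint16_t own, uint16_t rcv_seqno)
{
    return seqno_newer(rcv_seqno, own) ? (uint16_t)(own + 1) : own;
}

/* Proposed variant: jump straight to rcv_seqno + 1.
 * Not validated against the loop-avoidance argument. */
static uint16_t on_request_jump(uint16_t own, uint16_t rcv_seqno)
{
    return seqno_newer(rcv_seqno, own) ? (uint16_t)(rcv_seqno + 1) : own;
}
```

For a restarted router at seqno 1 that receives a request for 101 (old_seqno = 100), the first function yields 2, which is still rejected by the neighbour, while the second yields 102.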
participants (3)
- Juliusz Chroboczek
- Ondrej Zajicek
- Toke Høiland-Jørgensen