[PATCH] Fix Bird/Bird6 wrong LSA collisions detection

Ondrej Zajicek santiago at crfreenet.org
Tue Nov 6 18:19:45 CET 2018


On Tue, Nov 06, 2018 at 04:15:34PM +0100, Ondrej Zajicek wrote:
> On Tue, Nov 06, 2018 at 08:53:35AM +0000, Mikkelsen, Asbjorn wrote:
> > Bird/Bird6 can wrongly report LSA collisions and stop working. A route
> > fib can in certain cases be released (believing nobody is using it) and
> > still be used (but not dereferenced).
> > 
> > A PR can be found here:
> > https://github.com/BIRD/bird/pull/3
> 
> Hi
> 
> Thanks for the patch, I have some questions:
> 
> 1) Do I understand the cause correctly: there is a struct top_hash_entry
> and an associated struct ort entry, but the struct ort entry got removed
> while the struct top_hash_entry still links to it, so when a new struct
> ort for the given network is allocated, it is technically different from
> the old one, and that causes the collision?

One more question, about the scenario from your PR:

    1. Bird OSPF is running fine

    2. Cut the network connectivity

    3. Then quickly enough, and before Bird realizes that the network is down, flush bird OSPF routes

    4. Bird will flush its routes but none of them will be removed because there are no OSPF routers to ACK this route flushing

    5. Wait long enough for Bird to realize that all neighbors are gone (~5 minutes)

    6. Bird will delete the corresponding route fibs, believing nobody is using them

    7. Re-add Bird OSPF routes and then Bird will try to read the corresponding deleted fib

    8. Bird OSPF does not work anymore and we see this in the log:
    <ERR> ospf1: LSA ID collision for X.X.X.X/32

It seems to me that this can happen after the routes are flushed and the
ort entries are removed (in rt_sync()), but before the LSA entries are
removed (by ospf_clear_lsa()). After ospf_clear_lsa(), en->lsa_body is
NULL, so the issue can no longer happen.

If I understand it correctly, this can happen after the flush but
*before* all neighbors are gone (because once all neighbors are gone,
the LSA entries are removed from ospf_update_lsadb() as if they had been
ACKed). That is slightly different from the scenario in the PR.

I think that the proper fix is to reset the en->nf field in
ospf_flush_lsa(). Patch attached, could you try it?
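
To illustrate the idea, here is a minimal standalone sketch. It is not the
attached patch and not actual BIRD code: the struct and function names
(ort_sketch, lsa_entry_sketch, originate(), flush()) are hypothetical
stand-ins, and the collision check is only assumed to compare the stored
en->nf pointer against the ort of the route being exported.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct ort and struct top_hash_entry. */
struct ort_sketch { int prefix; };

struct lsa_entry_sketch {
  const struct ort_sketch *nf;  /* ort this LSA was originated for */
  const void *lsa_body;         /* stays non-NULL until the LSA is cleared */
};

/* Assumed collision check: an entry that still has a body and links to a
 * different ort is treated as belonging to another network. */
static bool originate(struct lsa_entry_sketch *en,
                      const struct ort_sketch *nf, const void *body)
{
  assert(en != NULL && nf != NULL && body != NULL);

  if (en->nf == NULL || en->lsa_body == NULL)
    en->nf = nf;                /* entry is free, claim it */

  if (en->nf != nf)
    return false;               /* would be logged as "LSA ID collision" */

  en->lsa_body = body;
  return true;
}

/* Flush: the body is kept until the flush is ACKed. With reset_nf set,
 * the link to the (soon to be removed) ort is dropped as well. */
static void flush(struct lsa_entry_sketch *en, bool reset_nf)
{
  if (reset_nf)
    en->nf = NULL;
}

/* Returns true if re-exporting the same network is reported as a collision. */
static bool false_collision_after_flush(bool reset_nf)
{
  static const char body[] = "lsa";
  /* Same network, but two distinct allocations (old ort and new ort). */
  struct ort_sketch old_ort = { 42 }, new_ort = { 42 };
  struct lsa_entry_sketch en = { NULL, NULL };

  originate(&en, &old_ort, body);
  flush(&en, reset_nf);         /* old_ort counts as removed from here on */
  return !originate(&en, &new_ort, body);
}
```

In this model, false_collision_after_flush(false) reports the bogus
collision, because the entry still links to the old ort, while
false_collision_after_flush(true) lets the new ort claim the entry.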

Your patch should also work, but there is a corner case. Suppose route R1
is exported, then flushed (but not ACKed), then exported again shortly
after the first export, which triggers the MinLSInterval check in
ospf_do_originate_lsa(). Now we have an LSA entry that is MAXAGE but has
a scheduled LSA in en->next_lsa_body. If a different route R2 with a real
LSA ID collision is then exported, the collision should be reported, but
it is ignored by the condition in the patch.
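
The corner case can be sketched in a small standalone model. This is only
an illustration under stated assumptions: all names are hypothetical, and
the PR's condition is assumed (not confirmed) to amount to "skip the
collision check when the existing entry is at MaxAge". The alternative
modelled next to it is resetting en->nf on flush.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAXAGE_SKETCH 3600      /* OSPF MaxAge in seconds */

struct net_sketch { int prefix; };

struct entry_sketch {
  const struct net_sketch *nf;  /* network that owns this LSA ID */
  const void *lsa_body;         /* installed (possibly flushed) LSA */
  const void *next_lsa_body;    /* origination postponed by MinLSInterval */
  int age;
};

enum check_sketch { SKIP_WHEN_MAXAGE, RESET_NF_ON_FLUSH };

/* Hypothetical collision check under the two approaches. */
static bool is_collision(const struct entry_sketch *en,
                         const struct net_sketch *nf, enum check_sketch how)
{
  assert(en != NULL && nf != NULL);

  if (en->nf == NULL || en->nf == nf)
    return false;               /* free entry, or same owner */
  if (how == SKIP_WHEN_MAXAGE && en->age >= MAXAGE_SKETCH)
    return false;               /* ASSUMED shape of the PR's condition */
  return en->lsa_body != NULL;
}

/* Replays the R1/R2 scenario; returns true if R2's collision is reported. */
static bool real_collision_detected(enum check_sketch how)
{
  static const char body[] = "lsa";
  struct net_sketch r1 = { 1 }, r2 = { 2 };
  struct entry_sketch en = { &r1, body, NULL, 0 };  /* R1 exported */

  /* R1 is flushed but the flush is not ACKed yet. */
  en.age = MAXAGE_SKETCH;
  if (how == RESET_NF_ON_FLUSH)
    en.nf = NULL;

  /* R1 is exported again within MinLSInterval: origination is postponed. */
  if (!is_collision(&en, &r1, how)) {
    en.nf = &r1;
    en.next_lsa_body = body;
  }

  /* R2 maps to the same LSA ID: this is a real collision. */
  return is_collision(&en, &r2, how);
}
```

Under these assumptions, the MaxAge-based condition lets R2 through even
though R1 has an LSA scheduled, while the entry with en->nf re-claimed by
R1 still reports the collision.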

-- 
Elen sila lumenn' omentielvo

Ondrej 'Santiago' Zajicek (email: santiago at crfreenet.org)
OpenPGP encrypted e-mails preferred (KeyID 0x11DEADC3, wwwkeys.pgp.net)
"To err is human -- to blame it on a computer is even more so."
-------------- next part --------------
A non-text attachment was scrubbed...
Name: fix-ospf-lsa-collision.patch
Type: text/x-diff
Size: 391 bytes
Desc: not available
URL: <http://trubka.network.cz/pipermail/bird-users/attachments/20181106/96278416/attachment.bin>

