Optimizing large BGP -> Linux kernel additions/removals
When a BGP sessions comes up/down, there is a period of high CPU while all the new best-routes are computed -- this is understandable. However, on boxes I have where this table is then promoted to the Linux kernel, the CPU usage stays high for some considerable time. For a full (~600,000 prefix)
From the looks of things, a bunch of netlink messages are being generated to the kernel to RTM_DELROUTE then RTM_NEWROUTE -- each of which is causing the kernel's trie to rebalance which is a fairly costly operation.
I was wondering if anyone had experimented with anything such as implementing either: 1. a corking mechanism (i.e. stop balancing the trie until a signal is sent to uncork) 2. fib_trie garbage collection (i.e. only rebalance the trie once per time interval) 3. "double buffering" (i.e. for a given operation such as protocol flap, memcpy the trie, perform operations, then update the root node pointer to the new optimised trie) Any and all of these ideas may be horrific, I'm just interested whether anyone's running full tables in linux, filled by (e.g.) BIRD, and have encountered this issue. Unfortunately there doesn't appear to be an RTM_CHANGE or similar in Linux, so the DELROUTE will seemingly cause a tree to either be pruned or re-branched, followed by the NEWROUTE causing a full rebalance run -- whereas a CHANGE would (could) hopefully just over-write the value. Many thanks in advance, Matthew Walster
On Mon, Aug 17, 2015 at 03:41:00PM -0700, Matthew Walster wrote:
When a BGP sessions comes up/down, there is a period of high CPU while all the new best-routes are computed -- this is understandable.
However, on boxes I have where this table is then promoted to the Linux kernel, the CPU usage stays high for some considerable time. For a full (~600,000 prefix)
From the looks of things, a bunch of netlink messages are being generated to the kernel to RTM_DELROUTE then RTM_NEWROUTE -- each of which is causing the kernel's trie to rebalance which is a fairly costly operation. ... Unfortunately there doesn't appear to be an RTM_CHANGE or similar in Linux, so the DELROUTE will seemingly cause a tree to either be pruned or re-branched, followed by the NEWROUTE causing a full rebalance run -- whereas a CHANGE would (could) hopefully just over-write the value.
Hi I don't know much of kernel internal code, so i cannot answer the further questions, although it seems silly to me to not have some kind of amortized behavior for trie rebalancing. Note that there is NLM_F_REPLACE flag that with RTM_NEWROUTE works like RTM_CHANGE. And BIRD used that in some historic versions. But there is a semantic problem with these replacements - route options works like selector for RTM_DELROUTE, but not for replacement (where they just represent new values). Namely BIRD can specify rtm_protocol RTPROT_BIRD for RTM_DELROUTE and only BIRD routes are affected (and RTM_NEWROUTE would succeed only if the RTM_DELROUTE succeded), while for replacement specifying rtm_protocol RTPROT_BIRD just means that the current route (regardless its rtm_protocol) is replaced with the new one with that rtm_protocol value, which is undesirable. -- Elen sila lumenn' omentielvo Ondrej 'Santiago' Zajicek (email: santiago@crfreenet.org) OpenPGP encrypted e-mails preferred (KeyID 0x11DEADC3, wwwkeys.pgp.net) "To err is human -- to blame it on a computer is even more so."
participants (2)
-
Matthew Walster -
Ondrej Zajicek