Mitigations/tunables for reducing netlink loss?

20 Sep 2021

Hi,

I'm seeing these warnings quite frequently on a system that has a
full Internet routing table programmed into the Linux kernel.

  Sep 20 11:50:48 ganges bird: Kernel dropped some netlink messages, will resync on next scan.
  Sep 20 11:50:57 ganges bird: Kernel dropped some netlink messages, will resync on next scan.
  Sep 20 11:50:59 ganges bird: Kernel dropped some netlink messages, will resync on next scan.
  Sep 20 11:51:00 ganges bird: Kernel dropped some netlink messages, will resync on next scan.
  Sep 20 11:51:01 ganges bird: Kernel dropped some netlink messages, will resync on next scan.
  Sep 20 11:51:05 ganges bird: Kernel dropped some netlink messages, will resync on next scan.
  Sep 20 11:51:05 ganges bird: Kernel dropped some netlink messages, will resync on next scan.
  Sep 20 11:51:07 ganges bird: Kernel dropped some netlink messages, will resync on next scan.
  Sep 20 11:51:10 ganges bird: Kernel dropped some netlink messages, will resync on next scan.
  Sep 20 11:51:10 ganges bird: Kernel dropped some netlink messages, will resync on next scan.
  Sep 20 11:51:14 ganges bird: Kernel dropped some netlink messages, will resync on next scan.
  Sep 20 11:51:17 ganges bird: Kernel dropped some netlink messages, will resync on next scan.
  Sep 20 11:51:19 ganges bird: Kernel dropped some netlink messages, will resync on next scan.
  Sep 20 11:51:19 ganges bird: Kernel dropped some netlink messages, will resync on next scan.
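
To put a number on "quite frequently", a small sketch like the one
below can count the warnings in a syslog excerpt and report the time
they span. The function name drop_stats and the hard-coded year are
assumptions for illustration only (syslog timestamps carry no year).

```python
from datetime import datetime

MARKER = "Kernel dropped some netlink messages"
YEAR = "2021"  # assumption: syslog lines omit the year, so one is supplied


def drop_stats(lines):
    """Return (count, span_seconds) for the bird warning lines in `lines`."""
    stamps = []
    for line in lines:
        if MARKER not in line:
            continue
        # The first 15 characters are the syslog timestamp, e.g. "Sep 20 11:50:48".
        text = line.strip()[:15]
        stamps.append(datetime.strptime(YEAR + " " + text, "%Y %b %d %H:%M:%S"))
    if not stamps:
        return 0, 0.0
    return len(stamps), (max(stamps) - min(stamps)).total_seconds()
```

Fed the excerpt above, this counts the warning lines and the seconds
between the first and last one, which gives a rough drops-per-minute
figure to compare before and after any tuning.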

First off, I'm not expecting wonders here. I don't expect close to
1M routes in Linux to work flawlessly; I'm just looking to improve
the situation.

My setup: bird 1.6.6 (Debian 10) with Linux 4.9.xxx

Internet routes are segregated into a dedicated kernel routing table
(#99 below), no non-bird routes are in this table.

The kernel protocol is export-only (again, there are no non-bird
routes in this kernel table).

bird table "internet" contains a protocol kernel, a protocol pipe, and
3 protocol bgps (for 3 iBGP peers).

prot kernel config is:

  protocol kernel kernel_internet {
        debug { states, interfaces };
        table internet;
        kernel table 99;
        scan time 60;
        persist;
        learn off;
        graceful restart on;
        import none;
        export filter {
                if net ~ IP_MY_NET_PLUS then reject;
                if net ~ IP_CORE_NET then reject;
                accept;
        };
  }

I'm seeing netlink drops when upstream internet churn exceeds roughly
200 updates/sec. That is not a huge rate, but it happens quite
frequently and can continue for minutes or hours.

Some items I've investigated so far:

Increasing net.core.rmem_max and net.core.wmem_max sysctls doesn't
seem to help much, strace of bird doesn't indicate any EAGAIN or
blocking when writing to the netlink sockets.
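
For context on the sysctls: on Linux, net.core.rmem_max is only the
ceiling for what a process may request with SO_RCVBUF; a socket that
never asks keeps the size from net.core.rmem_default. Also, per
netlink(7) the drop is reported on the receive side (ENOBUFS), so a
clean write path in strace would be expected. The sketch below shows
how to check what a socket actually gets. The helper names are
assumptions for illustration; whether bird 1.6.6 requests a larger
buffer itself is not something I have verified.

```python
import socket


def request_rcvbuf(sock, nbytes):
    """Ask for an nbytes receive buffer; return (before, after) as the kernel reports them."""
    before = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    # Linux silently caps the request at net.core.rmem_max and then
    # doubles the stored value to account for bookkeeping overhead.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, nbytes)
    after = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    return before, after


def open_route_socket():
    """Open an rtnetlink socket (Linux only), the kind a routing daemon listens on."""
    return socket.socket(socket.AF_NETLINK, socket.SOCK_RAW, socket.NETLINK_ROUTE)
```

Calling request_rcvbuf(open_route_socket(), 8 * 1024 * 1024) before
and after raising rmem_max shows whether the larger limit is reachable
at all. If "before" stays at the default, the sysctl change alone
cannot have altered the buffer of an already running daemon.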

strace shows some room for optimization in the kernel protocol (these
would obviously be code changes).  For example, when a route changes
next-hop or interface, 2 netlink messages are sent, a delete followed
by an add, instead of a single change/replace (this would complicate
bird, but would halve the number of netlink messages for updates).
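
To illustrate the delete+add versus replace idea, here is a minimal
sketch that collapses a delete immediately followed by an add for the
same prefix into one replace. It is a model of the idea only, not
bird's code: the tuple format and the name coalesce are assumptions.
On the wire, a replace would correspond to RTM_NEWROUTE with
NLM_F_REPLACE set, which is what "ip route replace" uses.

```python
def coalesce(ops):
    """Merge ('del', prefix) directly followed by ('add', prefix, nexthop)
    into a single ('replace', prefix, nexthop); pass everything else through."""
    out = []
    i = 0
    while i < len(ops):
        cur = ops[i]
        nxt = ops[i + 1] if i + 1 < len(ops) else None
        if (cur[0] == "del" and nxt is not None
                and nxt[0] == "add" and nxt[1] == cur[1]):
            # Same prefix withdrawn and re-announced back to back: one message.
            out.append(("replace", nxt[1], nxt[2]))
            i += 2
        else:
            out.append(cur)
            i += 1
    return out
```

For a next-hop change this turns two messages into one, so the kernel
echoes half as many notifications back to the listening socket.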

There are plenty of CPU cycles available; bird is at <1% CPU.

Any pointers on tuning or config changes that may help here are
appreciated.

-- 
Dave
