Mitigations/tunables for reducing netlink loss?
Hi,

I'm seeing these warnings quite frequently on a system that has a full Internet table programmed into the Linux kernel:

Sep 20 11:50:48 ganges bird: Kernel dropped some netlink messages, will resync on next scan.
Sep 20 11:50:57 ganges bird: Kernel dropped some netlink messages, will resync on next scan.
Sep 20 11:50:59 ganges bird: Kernel dropped some netlink messages, will resync on next scan.
Sep 20 11:51:00 ganges bird: Kernel dropped some netlink messages, will resync on next scan.
Sep 20 11:51:01 ganges bird: Kernel dropped some netlink messages, will resync on next scan.
Sep 20 11:51:05 ganges bird: Kernel dropped some netlink messages, will resync on next scan.
Sep 20 11:51:05 ganges bird: Kernel dropped some netlink messages, will resync on next scan.
Sep 20 11:51:07 ganges bird: Kernel dropped some netlink messages, will resync on next scan.
Sep 20 11:51:10 ganges bird: Kernel dropped some netlink messages, will resync on next scan.
Sep 20 11:51:10 ganges bird: Kernel dropped some netlink messages, will resync on next scan.
Sep 20 11:51:14 ganges bird: Kernel dropped some netlink messages, will resync on next scan.
Sep 20 11:51:17 ganges bird: Kernel dropped some netlink messages, will resync on next scan.
Sep 20 11:51:19 ganges bird: Kernel dropped some netlink messages, will resync on next scan.
Sep 20 11:51:19 ganges bird: Kernel dropped some netlink messages, will resync on next scan.

First off, I'm not expecting wonders here. I don't expect close to 1M routes in Linux to work flawlessly; I'm just looking to improve the situation.

My setup:

- bird 1.6.6 (Debian 10) with Linux 4.9.xxx
- Internet routes are segregated into a dedicated kernel routing table (#99 below); no non-bird routes are in this table.
- prot kernel is export only (again, no non-bird routes in this kernel table).
- bird table "internet" contains a protocol kernel, a protocol pipe, and 3 protocol bgps (for 3 iBGP peers).

prot kernel config is:

protocol kernel kernel_internet {
    debug { states, interfaces };
    table internet;
    kernel table 99;
    scan time 60;
    persist;
    learn off;
    graceful restart on;
    import none;
    export filter {
        if net ~ IP_MY_NET_PLUS then reject;
        if net ~ IP_CORE_NET then reject;
        accept;
    };
}

I'm seeing netlink drops when upstream internet churn is, say, more than 200 updates/sec or so; not huge, but quite frequent, and it can continue for minutes or hours.

Some items I've investigated so far:

- Increasing the net.core.rmem_max and net.core.wmem_max sysctls doesn't seem to help much, and strace of bird doesn't indicate any EAGAIN or blocking when writing to the netlink sockets.
- strace shows some room for optimization in the prot kernel (these would obviously be code changes). For example, when a route changes next-hop/interface, 2 netlink messages are sent, delete followed by add, instead of a single change/replace (this would complicate bird, but reduce netlink messages by half for updates).
- There are plenty of CPU cycles available, bird is at <1%, etc.

Any pointers on tuning or config changes that may help here are appreciated.

-- Dave
Hello!
Sep 20 11:50:48 ganges bird: Kernel dropped some netlink messages, will resync on next scan. [...] Sep 20 11:51:19 ganges bird: Kernel dropped some netlink messages, will resync on next scan.
This is somewhat inevitable, as the netlink manpage states:

"However, reliable transmissions from kernel to user are impossible in any case. The kernel can't send a netlink message if the socket buffer is full: the message will be dropped and the kernel and the user-space process will no longer have the same view of kernel state. It is up to the application to detect when this happens (via the ENOBUFS error returned by recvmsg(2)) and resynchronize."

This unreliability is also a good reason to have periodic table scans, just to be sure that the kernel is in sync with BIRD.
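The detection step the manpage describes can be sketched in C as follows. This is a minimal illustration and not BIRD's actual code: the enum and the function names `nl_classify_rx` and `nl_read_once` are hypothetical, and the sketch only classifies the result of a non-blocking recv() so that ENOBUFS can trigger a resync.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical result codes for this sketch; these are not BIRD's names. */
enum nl_rx_status { NL_RX_DATA, NL_RX_EMPTY, NL_RX_OVERRUN, NL_RX_ERROR };

/* Map the return value and errno of a recv() on a netlink socket to an action. */
enum nl_rx_status nl_classify_rx(ssize_t n, int err)
{
  if (n >= 0)
    return NL_RX_DATA;      /* a datagram was read (possibly empty) */
  if (err == EAGAIN || err == EWOULDBLOCK || err == EINTR)
    return NL_RX_EMPTY;     /* nothing to read right now, try again later */
  if (err == ENOBUFS)
    return NL_RX_OVERRUN;   /* kernel dropped messages: a full rescan is needed */
  return NL_RX_ERROR;       /* any other failure */
}

/* Read one datagram without blocking and report what happened. */
enum nl_rx_status nl_read_once(int fd, void *buf, size_t len, ssize_t *out_len)
{
  ssize_t n = recv(fd, buf, len, MSG_DONTWAIT);
  int err = errno;          /* save errno before any other call can change it */

  if (out_len)
    *out_len = n;
  return nl_classify_rx(n, err);
}
```

A caller would treat NL_RX_OVERRUN as "state may be stale" and schedule a table scan rather than trusting the messages that did arrive.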
I'm seeing netlink drops when upstream internet churn is say more than 200 updates/sec or so, not huge, but quite frequent and can continue for minutes/hours.
Yes, this is a well-known situation. We can't do much about it in single-threaded BIRD: the ENOBUFS error signals that the netlink socket's receive buffer had no more room for the kernel's route update messages. (See the more thorough explanation below.)
Some items I've investigated so far:
Increasing net.core.rmem_max and net.core.wmem_max sysctls doesn't seem to help much, strace of bird doesn't indicate any EAGAIN or blocking when writing to the netlink sockets.
Here somebody suggests increasing net.core.rmem_default before starting BIRD:
https://bird.network.cz/pipermail/bird-users/2017-September/011541.html
strace shows some room for optimization in the prot kernel (these would obviously be code changes). For example, when a route changes next-hop/interface, 2 netlink messages are sent, delete followed by add, instead of a single change/replace (this would complicate bird, but reduce netlink messages by half for updates).
This would be feasible in a world with one single kernel and no bugs. In practice, there have been quite a few kernel bugs that needed workarounds, and we have no useful mechanism to detect whether a given kernel version suffers from such a bug. (There are still people running new BIRD on old kernels.)
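For readers unfamiliar with the distinction being discussed, the difference between "add" and "replace" at the netlink level is only a matter of request flags. The sketch below is an illustration, not BIRD's code: the `SKETCH_*` macros and `nl_route_flags` are hypothetical names, and the use of NLM_F_ACK is an assumption about how a sender might ask for confirmation. The add flags mirror the NL_OP_ADD definition visible in the patch later in this thread.

```c
#include <linux/netlink.h>
#include <linux/rtnetlink.h>

/* RTM_NEWROUTE that fails if the route already exists (plain "add"). */
#define SKETCH_OP_ADD     (NLM_F_CREATE | NLM_F_EXCL)
/* RTM_NEWROUTE that creates the route or overwrites an existing one. */
#define SKETCH_OP_REPLACE (NLM_F_CREATE | NLM_F_REPLACE)

/* Build nlmsg_flags for a route request; replace != 0 selects the
 * single-message update instead of the delete-then-add pair. */
unsigned short nl_route_flags(int replace)
{
  return (unsigned short)
    (NLM_F_REQUEST | NLM_F_ACK | (replace ? SKETCH_OP_REPLACE : SKETCH_OP_ADD));
}

/* Number of requests needed to move a route to a new next hop. */
int nl_msgs_for_update(int replace)
{
  return replace ? 1 : 2;   /* replace: one message; otherwise delete + add */
}
```

Whether the replace path is safe to use depends on the kernel behaving correctly, which is exactly the concern raised above.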
There are plenty of CPU cycles available, bird is at <1%, etc.
Any pointers on tuning or config changes that may help here are appreciated.
Well, to be honest, I think this may be fixed by having a separate netlink thread (which is a work-almost-in-progress), yet without that, it is almost impossible. The reason is how it works now:

1) BGP receives a packet (quite a big one, or several of them).
2) BGP parses the input data and, for each single route:
   2A) the import filter is run;
   2B) the best route in the table is recalculated;
   2C) all exports are run; in the case of the kernel protocol, the netlink message is sent;
   2D) the kernel generates a netlink message in response, confirming the route update.
   (Repeat this for all the data.)
3) BGP is done and another socket is read. For simplicity, let's assume it is the netlink receive socket.
4) Netlink parses the incoming messages, gets ENOBUFS, realizes that some more updates didn't fit into the receive buffer, and issues that warning.
5) After a while, a netlink scan is issued, successfully checking that all routes are there.

The actual reason for BIRD showing these warnings in tables where only BIRD writes is simply the impossibility of reading the netlink socket while exporting routes from another protocol. This will be fixed in future BIRD versions supporting multithreaded execution, where the netlink thread should have enough time to read the netlink socket and the exports for netlink (and all other protocols) will properly queue and wait to be processed until the protocol decides to actually export.

Maria
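The steps above can be illustrated with a toy model in C. It is a deliberate simplification and not a description of real kernel accounting: the function name `nl_toy_drops`, the fixed per-message cost, and the idea that every export produces exactly one queued reply are all assumptions made for the sketch. The point is only that replies pile up during step 2 and nothing is drained until step 3.

```c
#include <stddef.h>

/* Toy model of one pass through a single-threaded event loop.
 * routes_in_batch: routes processed in step 2 before any socket is read
 * bytes_per_msg:   assumed buffer cost of one kernel reply (hypothetical)
 * rcvbuf_bytes:    assumed capacity of the netlink receive buffer
 * Returns how many kernel replies would not fit and be dropped. */
size_t nl_toy_drops(size_t routes_in_batch, size_t bytes_per_msg,
                    size_t rcvbuf_bytes)
{
  size_t queued = 0;    /* bytes waiting in the receive buffer */
  size_t dropped = 0;   /* replies the kernel could not enqueue */

  for (size_t i = 0; i < routes_in_batch; i++) {
    /* Steps 2C/2D: the export sends a request, the kernel queues a reply. */
    if (queued + bytes_per_msg > rcvbuf_bytes)
      dropped++;        /* buffer full: this reply is lost (ENOBUFS later) */
    else
      queued += bytes_per_msg;
  }

  /* Steps 3/4 happen only after the loop: the socket is read this late. */
  return dropped;
}
```

With an assumed cost of 256 bytes per reply and a 256 KiB buffer, a batch of 1000 routes fits, while a batch of 2000 loses 976 replies in this model. A larger buffer raises the batch size that fits, and a separate reader thread would drain the buffer during the loop instead of after it.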
On 2021-09-21 09:53, Maria Matejka wrote:
Here somebody suggests increasing net.core.rmem_default before starting BIRD.
https://bird.network.cz/pipermail/bird-users/2017-September/011541.html
Why not add an option with socket buffer size and force for netlink socket when specified? Like:

setsockopt(fd, SOL_SOCKET, SO_RCVBUFFORCE, &buffer_size, sizeof(buffer_size));

SO_RCVBUFFORCE (available in kernels since 2.6.14) ignores limits in net.core.rmem_max so there is no need to mangle with settings, especially default settings (rmem_default) as it will affect *all* applications.

Those who are lucky enough to run recent kernels without bugs could simply start bird with custom buffer size.

/Al
Hi all,

We are using the following patch to increase the netlink receive buffer size to 2 MB. It has certainly helped reduce the instances of buffer overruns in either kernel async notifications or ack processing.

Since more folks are having the same issue, maybe this patch could be added upstream?

Thanks,
Trisha

diff --git a/lib/socket.h b/lib/socket.h
index 96fedeeb..71fdcc1e 100644
--- a/lib/socket.h
+++ b/lib/socket.h
@@ -93,6 +93,7 @@ void sk_set_rbsize(sock *s, uint val); /* Resize RX buffer */
 void sk_set_tbsize(sock *s, uint val); /* Resize TX buffer, keeping content */
 void sk_set_tbuf(sock *s, void *tbuf); /* Switch TX buffer, NULL-> return to internal */
 void sk_dump_all(void);
+void sk_set_rcvbuf(int fd, int val); /* Set socket receive buffer size */
 
 int sk_is_ipv4(sock *s); /* True if socket is IPv4 */
 int sk_is_ipv6(sock *s); /* True if socket is IPv6 */
diff --git a/sysdep/linux/netlink.c b/sysdep/linux/netlink.c
index fdf3f2db..89bb5a81 100644
--- a/sysdep/linux/netlink.c
+++ b/sysdep/linux/netlink.c
@@ -131,6 +131,7 @@ struct nl_sock
 };
 
 #define NL_RX_SIZE 8192
+#define RCVBUF_SIZE 2*1024*1024
 
 #define NL_OP_DELETE 0
 #define NL_OP_ADD (NLM_F_CREATE|NLM_F_EXCL)
@@ -154,6 +155,7 @@ nl_open_sock(struct nl_sock *nl)
       nl->rx_buffer = xmalloc(NL_RX_SIZE);
       nl->last_hdr = NULL;
       nl->last_size = 0;
+      sk_set_rcvbuf(nl->fd, RCVBUF_SIZE);
     }
 }
 
@@ -2014,6 +2016,7 @@ nl_open_async(void)
       log(L_ERR "Unable to open asynchronous rtnetlink socket: %m");
       return;
     }
+  sk_set_rcvbuf(fd, RCVBUF_SIZE);
 
   bzero(&sa, sizeof(sa));
   sa.nl_family = AF_NETLINK;
diff --git a/sysdep/linux/sysio.h b/sysdep/linux/sysio.h
index e21ff487..93b5de7f 100644
--- a/sysdep/linux/sysio.h
+++ b/sysdep/linux/sysio.h
@@ -266,3 +266,10 @@ sk_set_priority(sock *s, int prio)
   return 0;
 }
 
+void
+sk_set_rcvbuf(int fd, int val)
+{
+  int len = val;
+  if (setsockopt(fd, SOL_SOCKET, SO_RCVBUFFORCE, &len, sizeof(len)) < 0)
+    log(L_WARN "sk_set_rcvbuf: Could not set RCVBUF to %d", len);
+}

On Tue, Sep 21, 2021 at 7:27 AM Alexander <aldem-bird.201704@nk7.net> wrote:
On 2021-09-21 09:53, Maria Matejka wrote:
Here somebody suggests increasing net.core.rmem_default before starting BIRD.
https://bird.network.cz/pipermail/bird-users/2017-September/011541.html
Why not add an option with socket buffer size and force for netlink socket when specified? Like:
setsockopt(fd, SOL_SOCKET, SO_RCVBUFFORCE, &buffer_size, sizeof(buffer_size));
SO_RCVBUFFORCE (available in kernels since 2.6.14) ignores limits in net.core.rmem_max so there is no need to mangle with settings, especially default settings (rmem_default) as it will affect *all* applications.
Those who are lucky enough to run recent kernels without bugs could simply start bird with custom buffer size.
/Al
On 2021-09-21 17:51, Trisha Biswas wrote:
We are using the following patch to increase the netlink receive buffer size to 2 MB.
I am afraid that this might be insufficient, as it depends on CPU power, system load and probably some other factors like the number of routes etc.

In my case (multi-CPU dedicated border router), where bird is the only significant consumer of CPU power and running only BGP (with very few OSPF nodes), I had to set it to 128MB to get rid of any loss; 64MB was barely sufficient.

Therefore I believe it makes sense to make the buffer size configurable.

/Al
Thanks all,

I increased the system-wide default to 2MB and it seems to be better (the default was 256K).

On code changes, I completely agree this should be configurable, either globally or per protocol kernel; having to set a system-wide default when a socket option is available isn't great.

On the previously posted patch, I don't think SO_RCVBUFFORCE would be the best idea. Generally, the system max setting should be honored, and SO_RCVBUF used (capped at the lesser of the user-configured value and the system-wide max).

Alexander writes:
On 2021-09-21 17:51, Trisha Biswas wrote:
We are using the following patch to increase the netlink receive buffer size to 2 MB.
I am afraid that this might be insufficient - as it depends on CPU power, system load and probably some other factors like number of routes etc.
In my case (multi-CPU dedicated border router) where bird is the only significant consumer of CPU power and running only BGP (with very few OSPF nodes) I had to set it to 128MB to get rid of any loss, 64MB was barely sufficient.
Therefore I believe it makes sense to make the buffer size configurable.
/Al
-- Dave
On 2021-09-21 23:19, Dave Johnson wrote:
On the previously posted patch, I don't think SO_RCVBUFFORCE would be the best idea. Generally, max system setting should be honored, and SO_RCVBUF used (capped at the lesser of user configured value and the system-wide max).
The problem is that rmem_max is also global (kind of), as it allows any user/app (even without the CAP_NET_ADMIN capability) to use it, and this is not always desirable.

/Al
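For comparison, the SO_RCVBUF approach quoted above (honor the system maximum and take the lesser of the configured value and net.core.rmem_max) could be sketched as below. All three function names are hypothetical. Note that the kernel already clamps SO_RCVBUF requests to rmem_max on its own, so the explicit cap mainly lets the program know, and log, which size it really asked for.

```c
#include <stdio.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Read net.core.rmem_max from procfs; returns -1 if it cannot be read. */
long read_rmem_max(void)
{
  long max = -1;
  FILE *f = fopen("/proc/sys/net/core/rmem_max", "r");

  if (!f)
    return -1;
  if (fscanf(f, "%ld", &max) != 1)
    max = -1;
  fclose(f);
  return max;
}

/* Lesser of the configured size and the system maximum.
 * If the maximum is unknown (<= 0), keep the configured size. */
int cap_rcvbuf(int requested, long sys_max)
{
  if (sys_max > 0 && requested > sys_max)
    return (int) sys_max;
  return requested;
}

/* Apply the capped size with unprivileged SO_RCVBUF.
 * Returns the size that was requested from the kernel, or -1 on failure. */
int sk_set_rcvbuf_capped(int fd, int requested)
{
  int val = cap_rcvbuf(requested, read_rmem_max());

  if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &val, sizeof(val)) < 0)
    return -1;
  return val;
}
```

The trade-off raised in the reply above still applies: with this approach, getting a 128MB buffer means raising rmem_max for every process on the system, whereas SO_RCVBUFFORCE limits the larger buffer to one privileged socket.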
participants (4)
- Alexander
- Dave Johnson
- Maria Matejka
- Trisha Biswas