netlink filtering to avoid costly FNHE table dumps on Linux
Hi!

This is not a direct follow-up on the thread "BIRD trying to reinsert existing kernel routes, netlink issue?" ( https://www.mail-archive.com/bird-users@network.cz/msg05429.html ) but I think that the cause of my issue may be the same.

My scenario: I have a server in a lab that has a public IP on a GRE tunnel interface, and the tunnel goes to a router in the Internet. BIRD receives a full BGP feed over the GRE tunnel. The tunnel is terminated in a VRF and BIRD exports the BGP feed to the kernel routing table associated with the VRF.

I observed pretty much the same set of issues that Sasha and others described in the linked thread two years ago: BIRD starts normally and the BGP session comes up, but after 3 to 5 minutes it maxes out one CPU, mostly in sys (doing the netlink I/O). Then BIRD stops responding normally over birdc - each command takes several minutes to complete - and I also saw log messages like:

  Jan 2 00:00:35 lab bird: Netlink: File exists
  Jan 2 00:00:35 lab bird: Netlink: File exists
  Jan 2 00:00:35 lab bird: ...
  Jan 2 00:00:35 lab bird: I/O loop cycle took 73127 ms for 11 events

I realized that BIRD actually transfers huge amounts of data from the kernel over netlink (several gigabytes of messages per minute or so, which translates to hundreds of millions of FIB records) at the beginning of each sync. This ultimately stalls BIRD I/O long enough to miss BGP keepalive deadlines, and the sessions start flapping, which makes the situation hard to understand using conventional profiling tools, tcpdump on netlink, etc.

FYI I am attaching my quick and dirty patch to BIRD that I used to collect stats from netlink interactions to understand the problem, and finally to add an experimental PoC fix - see nl_request_rt_dump() in the patch.

The large table that BIRD pulled from the kernel was a FNHE table where Linux collects PMTU records for *all* destination IPs that are routed to the tunnel (which does not seem to be right and I will discuss it in LKML shortly). These records have a (default) 600 s expiration time, and in my scenario I happen to receive some backscatter traffic that in most cases gets ICMP or TCP reset responses, which could ultimately create millions of these records in a few minutes.

The reason why this problem occurred only in Linux ~5.2+ lies in the patch https://patchwork.ozlabs.org/project/netdev/patch/8d3b68cd37fb5fddc470904cdd... that changed the semantics of netlink dump requests. Now the kernel dumps the FIB Next Hop Exceptions table (previously known as the route cache) alongside the routing table unless the requester sets the sockopt NETLINK_GET_STRICT_CHK and clears the flag RTM_F_CLONED in the dump request. BIRD does not apply the filters, so the kernel dumps everything. But iproute2 and other programs that use netlink utilize the filters, so no similar performance issue occurs unless I explicitly dump the FNHE table (ip route show cache).

I believe that many different types of Linux tunnels create the PMTU records for all packets transmitted over the tunnel as well. And it has worked like that for a long time - the code that creates the route cache records (route cache at that time, now the FNHE table) was introduced in Linux 3.10 (https://elixir.bootlin.com/linux/v3.10/source/net/ipv4/ip_tunnel.c#L591).

Regardless of what may or may not happen on the kernel side, I think that implementing the netlink filter in BIRD to avoid the described situation makes sense. I am almost certain that my experimental fix breaks other things (most likely OSPF) but I would be glad to help make it right.

What do you think?

Best regards,
Tomas
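
The dump request described in this message can be illustrated from outside BIRD with a small Python sketch. It is not BIRD code (BIRD is written in C) and not the attached patch. It builds an RTM_GETROUTE dump request whose rtm_flags either leave RTM_F_CLONED clear (regular routes) or set it (exceptions), and shows where the NETLINK_GET_STRICT_CHK socket option would be enabled. The numeric constants are copied by hand from <linux/netlink.h> and <linux/rtnetlink.h>, and the function names are hypothetical.

```python
import struct

# Values copied from the Linux UAPI headers; Python's socket module
# does not export all of them, so they are spelled out here.
NLM_F_REQUEST = 0x001
NLM_F_DUMP = 0x300            # NLM_F_ROOT | NLM_F_MATCH
RTM_GETROUTE = 26
RTM_F_CLONED = 0x200
SOL_NETLINK = 270
NETLINK_GET_STRICT_CHK = 12
AF_INET = 2

NLMSGHDR = struct.Struct("=IHHII")    # len, type, flags, seq, pid
RTMSG = struct.Struct("=BBBBBBBBI")   # struct rtmsg, rtm_flags is last


def build_route_dump_request(family, seq, want_exceptions=False):
    """Return the bytes of an RTM_GETROUTE dump request.

    With strict checking enabled on the socket, a clear RTM_F_CLONED
    asks for regular routes and a set RTM_F_CLONED asks for exceptions.
    All other rtmsg fields are left at zero.
    """
    rtm_flags = RTM_F_CLONED if want_exceptions else 0
    body = RTMSG.pack(family, 0, 0, 0, 0, 0, 0, 0, rtm_flags)
    header = NLMSGHDR.pack(NLMSGHDR.size + len(body), RTM_GETROUTE,
                           NLM_F_REQUEST | NLM_F_DUMP, seq, 0)
    return header + body


def open_strict_route_socket():
    """Open a NETLINK_ROUTE socket with strict dump checking (Linux only)."""
    import socket
    sock = socket.socket(socket.AF_NETLINK, socket.SOCK_RAW,
                         socket.NETLINK_ROUTE)
    sock.setsockopt(SOL_NETLINK, NETLINK_GET_STRICT_CHK, 1)
    return sock
```

On a Linux host, sending build_route_dump_request(AF_INET, seq=1) over the socket returned by open_strict_route_socket() should request a dump without the exception entries, assuming the kernel is recent enough to know the socket option.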
On Sat, Jan 08, 2022 at 12:03:52AM +0100, Tomas Hlavacek wrote:
> Hi!
>
> The large table that BIRD pulled from the kernel was a FNHE table where Linux collects PMTU records for *all* destination IPs that are routed to the tunnel (which does not seem to be right and I will discuss it in LKML shortly). These records have (default) 600s expiration time and in my scenario I happen to receive some backscatter traffic that in most cases gets ICMP or TCP reset responses that could ultimately create millions of these records in a few minutes.
>
> The reason why this problem occurred only in Linux ~5.2+ lies in the patch https://patchwork.ozlabs.org/project/netdev/patch/8d3b68cd37fb5fddc470904cdd... that changed the semantics of netlink dump requests. Now the kernel dumps the FIB Next Hop Exceptions table (previously known as route cache) alongside the RT unless the requester sets sockopt NETLINK_GET_STRICT_CHK and clears the flag RTM_F_CLONED in the dump request. BIRD does not apply the filters so the kernel dumps everything. But iproute2 and other programs that use netlink utilize the filters, so no similar performance issue occurs unless I explicitly dump the FNHE table (ip route show cache).

Hi

Thanks, that seems like a plausible explanation. Being spammed by PMTU cache entries when requesting route table dumps is a creative interpretation of a stable API commitment :-(

> I believe that many different types of Linux tunnels create the PMTU records for all packets transmitted over the tunnel as well. And it works like that for a long time - the code that creates the route cache (at that time, now it is FNHE table) records has been introduced in Linux 3.10 (https://elixir.bootlin.com/linux/v3.10/source/net/ipv4/ip_tunnel.c#L591).

If I understand it correctly, these PMTU records can also be a result of regular TCP communication from/to the router even if there are no tunnels?

> Regardless of what may or may not happen on the kernel side I think that implementing the netlink filter in BIRD to avoid the described situation makes sense. I am almost certain that my experimental fix breaks other things (most likely OSPF) but I would be glad to help make it right.

How could OSPF be affected by filters on a netlink socket?

--
Elen sila lumenn' omentielvo

Ondrej 'Santiago' Zajicek (email: santiago@crfreenet.org)
OpenPGP encrypted e-mails preferred (KeyID 0x11DEADC3, wwwkeys.pgp.net)
"To err is human -- to blame it on a computer is even more so."
Hi Ondrej, all,

On Sat, Jan 8, 2022 at 5:56 AM Ondrej Zajicek <santiago@crfreenet.org> wrote:

> > I believe that many different types of Linux tunnels create the PMTU records for all packets transmitted over the tunnel as well. And it works like that for a long time - the code that creates the route cache (at that time, now it is FNHE table) records has been introduced in Linux 3.10 (https://elixir.bootlin.com/linux/v3.10/source/net/ipv4/ip_tunnel.c#L591).
>
> If i understand it correctly, these PMTU records can also be a result of regular TCP communication from/to the router even if there are no tunnels?

Yes, but in most cases the kernel should not create that many PMTU records. Even with the 600 s expiration I would expect several thousand records, or hundreds of thousands at most. I still do not fully understand why I saw over 130M PMTU records received by BIRD in one scan. Either there is some multiplication within the dump or something was very wrong. Anyway, I am going to analyze the kernel part in more detail and I will address this in LKML.
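
One way to check how much of a dump consists of exception entries is to classify the RTM_NEWROUTE replies by the RTM_F_CLONED bit in rtm_flags. The Python sketch below does that for one receive buffer. It is an illustration only, not the stats patch mentioned in this thread: the function name is hypothetical, the constants are copied by hand from the Linux UAPI headers, and it assumes that exception entries carry RTM_F_CLONED in the dump.

```python
import struct

# Values copied from <linux/netlink.h> and <linux/rtnetlink.h>.
NLMSG_ERROR = 2
NLMSG_DONE = 3
RTM_NEWROUTE = 24
RTM_F_CLONED = 0x200

NLMSGHDR = struct.Struct("=IHHII")    # len, type, flags, seq, pid
RTMSG = struct.Struct("=BBBBBBBBI")   # struct rtmsg, rtm_flags is last


def count_route_messages(buf):
    """Count regular and cloned route messages in one netlink buffer.

    Returns (regular, cloned, done), where done is True once the
    NLMSG_DONE message that terminates a dump has been seen.
    """
    regular = cloned = 0
    done = False
    offset = 0
    while offset + NLMSGHDR.size <= len(buf):
        msg_len, msg_type, _, _, _ = NLMSGHDR.unpack_from(buf, offset)
        if msg_len < NLMSGHDR.size or offset + msg_len > len(buf):
            break                      # truncated or malformed message
        payload = offset + NLMSGHDR.size
        if msg_type == NLMSG_DONE:
            done = True
            break
        if msg_type == NLMSG_ERROR and msg_len >= NLMSGHDR.size + 4:
            errno_neg = struct.unpack_from("=i", buf, payload)[0]
            if errno_neg != 0:         # zero would be a plain ACK
                raise OSError(-errno_neg, "netlink error reply")
        elif (msg_type == RTM_NEWROUTE
              and msg_len >= NLMSGHDR.size + RTMSG.size):
            rtm_flags = RTMSG.unpack_from(buf, payload)[8]
            if rtm_flags & RTM_F_CLONED:
                cloned += 1
            else:
                regular += 1
        offset += (msg_len + 3) & ~3   # messages are 4-byte aligned
    return regular, cloned, done
```

Calling this on every buffer returned by recv() until done is True, and summing the two counters, gives the split between routing table entries and exception entries for one scan.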
> > Regardless of what may or may not happen on the kernel side I think that implementing the netlink filter in BIRD to avoid the described situation makes sense. I am almost certain that my experimental fix breaks other things (most likely OSPF) but I would be glad to help make it right.
>
> How could OSPF be affected by filters on netlink socket?

My experimental patch actually broke kif_do_scan(). It turned out that some (all?) link records go missing because of the NETLINK_GET_STRICT_CHK sockopt. I guess that breaks the device protocol, which in turn breaks OSPF. In any case, OSPF did not start on the GRE interface (it didn't send or receive any messages) until I fixed kif_do_scan().

I think there is an easy way out without needing larger changes: we can enable NETLINK_GET_STRICT_CHK only for krt_do_scan(). I'll send a new RFC patch shortly.

Best regards,
Tomas
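
The idea of enabling strict checking only around the route scan can be sketched as a scoped toggle of the socket option, so that link and address dumps issued on the same socket keep their old behaviour. The Python context manager below is a hypothetical illustration of that pattern, not the RFC patch and not BIRD's API. It assumes the option can be switched off again with a value of 0, and the constants are copied by hand from <linux/netlink.h>.

```python
from contextlib import contextmanager

# Values copied from the Linux UAPI headers.
SOL_NETLINK = 270
NETLINK_GET_STRICT_CHK = 12


@contextmanager
def strict_dump_checking(sock):
    """Enable NETLINK_GET_STRICT_CHK on sock for the duration of a block.

    The option is switched on before the block runs and switched off
    afterwards, even if the block raises, so other dump requests on
    the same socket are not affected.
    """
    sock.setsockopt(SOL_NETLINK, NETLINK_GET_STRICT_CHK, 1)
    try:
        yield sock
    finally:
        sock.setsockopt(SOL_NETLINK, NETLINK_GET_STRICT_CHK, 0)
```

A route scan would then send its RTM_GETROUTE dump request and read all replies inside a "with strict_dump_checking(sock):" block, while the interface scan runs outside of it. Using two separate netlink sockets, one strict and one not, would be an alternative with the same effect.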