100% CPU load with device scanning enabled
Hi list,

We're in the process of replacing Quagga with BIRD but have stumbled upon a little problem. When device scanning is on (the default, obviously), our testing machine completely fills up a CPU core. The culprit isn't BIRD itself but an Open vSwitch daemon. After disabling the device protocol and restarting BIRD, everything goes back to its quiet state.

BIRD (1.6.3-2) and Open vSwitch (2.6.2~pre+git20161223-3) were both installed as Debian stable packages. The configuration is as simple as:
# This is a minimal configuration file, which allows the bird daemon to start
# but will not cause anything else to happen.
#
# Please refer to the documentation in the bird-doc package or BIRD User's
# Guide on http://bird.network.cz/ for more information on configuring BIRD and
# adding routing protocols.

# Change this into your BIRD router ID. It's a world-wide unique identification
# of your router, usually one of router's IPv4 addresses.
router id 1.2.3.4;

# The Device protocol is not a real routing protocol. It doesn't generate any
# routes and it only serves as a module for getting information about network
# interfaces from the kernel.
protocol device {
}

# The Kernel protocol is not a real routing protocol. Instead of communicating
# with other routers in the network, it performs synchronization of BIRD's
# routing tables with the OS kernel.
protocol kernel {
    metric 64;      # Use explicit kernel route metric to avoid collisions
                    # with non-BIRD routes in the kernel routing table
    import none;
    export all;     # Actually insert routes into the kernel routing table
}

protocol bgp test {
    description "BGP test";
    local as REDACTED;
    neighbor 1.2.3.4 as REDACTED;
    direct;
    next hop self;
    deterministic med on;
    export none;
    import all;
}
Meanwhile, log messages such as the following appear:
bird: Kernel dropped some netlink messages, will resync on next scan.
As a test I deleted all existing Open vSwitch bridges and the load dropped again. After adding a new, empty bridge, the load spiked again in an instant.

This is unexpected behaviour. Maybe it's an implementation problem in Open vSwitch or maybe in BIRD. Anyway, it should happen I guess.

Any clues? Thanks in advance!

Regards,
Kees
Hi again,

Sorry: shouldn't happen.

Meanwhile we tested BIRD 2.0.4 as well (compiled from source) with the same result. The ovs-vswitchd process completely consumes a CPU thread.

When we disable the export of the routes (a full feed, so 700k+ routes), the load drops back to nothing.

Regards,
Kees

On 06-05-19 19:30, Kees Meijs wrote:
> This is unexpected behaviour. Maybe it's an implementation problem in Open vSwitch or maybe in BIRD. Anyway, it should happen I guess.
Update: strace(1) shows Open vSwitch repeating the following in an endless loop:
socket(AF_UNIX, SOCK_DGRAM|SOCK_CLOEXEC, 0) = 40
ioctl(40, SIOCGIFNAME, {ifr_index=6, ifr_name="vlan1105"}) = 0
close(40) = 0
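For context, this three-call pattern (open a throwaway socket, issue SIOCGIFNAME, close it) is what glibc's if_indextoname(3) does for a single index-to-name lookup, so each iteration is most likely one such lookup inside ovs-vswitchd. That is an inference from the trace, not something confirmed from the Open vSwitch source. A minimal Python sketch of the same lookup, assuming a Linux host; the helper name ifname_for_index is made up for illustration, and index 6 is simply the one from the trace:

```python
import socket


def ifname_for_index(index):
    """Return the interface name for a kernel ifindex, or None if it cannot be resolved."""
    try:
        # Thin wrapper around the C library's if_indextoname(3).
        return socket.if_indextoname(index)
    except OSError:
        # Raised when no interface has this index (or the lookup is not permitted).
        return None


if __name__ == "__main__":
    # 6 is the ifr_index seen in the strace output; on another machine it may
    # map to a different interface or to none at all.
    print(ifname_for_index(6))
```

On a glibc system, running this under strace should show one socket/ioctl/close triple per call, which makes it easy to compare against the loop above.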
This loop doesn't make any sense (at least, to me), since the interface vlan1105 isn't related to any bridge configuration at all. Only outbound routes (originating from BGP) point to an address via interface vlan1105. As the name implies, this is just a VLAN on a direct physical network port.

And BIRD (2.0.4-1):
poll([{fd=4, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}], 4, 3000) = 1 ([{fd=7, revents=POLLIN}])
recvmsg(7, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=0x000040}, msg_namelen=12, msg_iov=[{iov_base=[{{len=68, type=0x19 /* NLMSG_??? */, flags=0, seq=806042, pid=2639368684}, "\2\26\0\0\376\f\0\1\0\0\0\0\10\0\17\0\376\0\0\0\10\0\1\0X\335d\0\10\0\6\0"...}, {{len=0, type=0 /* NLMSG_??? */, flags=0, seq=0, pid=0}}], iov_len=8192}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 68
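The netlink message in that recvmsg() call can be decoded by hand from the bytes strace printed. The sketch below is a reading aid only: the constant names and values are taken from <linux/rtnetlink.h> as I understand them, the variable names are made up, a little-endian capture host is assumed, and only the bytes visible before strace's "..." truncation are used.

```python
import socket
import struct

# Bytes copied from the strace output (octal escapes exactly as strace printed them).
# Anything after the "..." truncation is unknown and deliberately left out.
rtmsg_bytes = b"\2\26\0\0\376\f\0\1\0\0\0\0"
attr_bytes = b"\10\0\17\0\376\0\0\0" b"\10\0\1\0X\335d\0"

# Values from <linux/rtnetlink.h>, limited to the ones needed here.
NLMSG_TYPES = {24: "RTM_NEWROUTE", 25: "RTM_DELROUTE", 26: "RTM_GETROUTE"}
RTA_DST, RTA_TABLE = 1, 15
RTPROT_BIRD = 12
RT_TABLE_MAIN = 254

# strace shows type=0x19 in the netlink header.
msg_type = NLMSG_TYPES[0x19]

# struct rtmsg: eight one-byte fields followed by a 32-bit flags word.
(family, dst_len, src_len, tos,
 table, protocol, scope, route_type, flags) = struct.unpack("<8BI", rtmsg_bytes)

# Each struct rtattr starts with a 16-bit length and a 16-bit type.
# Both attributes here are 8 bytes long, so no alignment padding is involved.
attrs = {}
offset = 0
while offset < len(attr_bytes):
    rta_len, rta_type = struct.unpack_from("<HH", attr_bytes, offset)
    attrs[rta_type] = attr_bytes[offset + 4:offset + rta_len]
    offset += rta_len

destination = socket.inet_ntoa(attrs[RTA_DST])
prefix = f"{destination}/{dst_len}"
table_attr = struct.unpack("<I", attrs[RTA_TABLE])[0]

if __name__ == "__main__":
    print(msg_type, prefix)
    print("family:", family, "table:", table_attr, "protocol:", protocol)
```

Under these assumptions the message reads as an RTM_DELROUTE notification for an IPv4 route (family 2) in the main table (254) whose protocol field is 12, the value assigned to RTPROT_BIRD. In other words, it is a route-change notification for a route BIRD itself installed, received on a socket subscribed to the IPv4 route group (nl_groups=0x40).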
There's just a little polling going on, but it seems as if every poll results in ovs-vswitchd opening that socket, once for each route present. And again, and again.

Regards,
Kees

On 06-05-19 20:22, Kees Meijs wrote:
> Sorry: shouldn't happen.
>
> Meanwhile we tested BIRD 2.0.4 as well (compiled from source) with the same result. The process ovs-vswitchd completely consumes a CPU thread.
>
> When disabling the exportation of the routes (full feed, so 700k+ routes) the load drops back to nothing.