Re: 100% CPU load with device scanning enabled
Hi,
this is an OVS issue, already discussed:
https://mail.openvswitch.org/pipermail/ovs-discuss/2016-November/043007.html
...
https://mail.openvswitch.org/pipermail/ovs-discuss/2016-November/043063.html
Official OVS quote:
> We'd accept patches to improve OVS's routing table code. It's not
> designed to scale to 1,800,000 routes. We'd also take code to suppress
> the routing table code in cases where it isn't actually needed, since
> it's not always needed. But we can't take a patch to just delete it;
> I'm sure you understand.
I tried to apply this patch at that time, but it was already useless for newer versions:
https://mail.openvswitch.org/pipermail/ovs-discuss/attachments/20161123/5379b333/attachment.bin
Our workaround was to scale the VM to 3 vCPUs, since our average system load for BGP is 1.5.
You can see what is happening:
    [root@bgp1 ~]# top
    ...
      PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
      654 root      10 -10 1284492   1.0g  20276 R  98.0 27.0   2513:01 ovs-vswitchd
       16 root      20   0       0      0      0 S   2.0  0.0  24:45.60 ksoftirqd/1
    [root@bgp1 ~]# ip route show
    ...
    1.0.0.0/24 via 89.212.47.185 dev t2-v24-ha proto bird
    1.0.4.0/24 via 89.212.47.185 dev t2-v24-ha proto bird
    1.0.4.0/22 via 89.212.47.185 dev t2-v24-ha proto bird
    1.0.5.0/24 via 89.212.47.185 dev t2-v24-ha proto bird
Routes are constantly being added and deleted:
    [root@bgp1 ~]# ip monitor
    ...
    Deleted 2620:11d:6000::/42 via 2a01:260:1021::1 dev t2-v26-ha proto bird metric 1024 pref medium
    2620:11d:6000::/42 via 2a01:260:1021::1 dev t2-v26-ha proto bird metric 1024 pref medium
    Deleted 2620:11d:6000::/42 via 2a01:260:1021::1 dev t2-v26-ha proto bird metric 1024 pref medium
    2620:11d:6000::/42 via 2a01:260:1021::1 dev t2-v26-ha proto bird metric 1024 pref medium
    Deleted 2620:11d:6000::/42 via 2a01:260:1021::1 dev t2-v26-ha proto bird metric 1024 pref medium
    2620:11d:6000::/42 via 2a01:260:1021::1 dev t2-v26-ha proto bird metric 1024 pref medium
    Deleted 68.69.37.0/24 via 89.212.47.185 dev t2-v24-ha proto bird
    68.69.37.0/24 via 89.212.47.185 dev t2-v24-ha proto bird
    Deleted 103.115.180.0/22 via 89.212.47.185 dev t2-v24-ha proto bird
    103.115.180.0/22 via 89.212.47.185 dev t2-v24-ha proto bird
    Deleted 103.115.180.0/22 via 89.212.47.185 dev t2-v24-ha proto bird
    103.115.180.0/22 via 89.212.47.185 dev t2-v24-ha proto bird
    Deleted 2.16.70.0/23 via 89.212.47.185 dev t2-v24-ha proto bird
    Deleted 88.221.28.0/22 via 89.212.47.185 dev t2-v24-ha proto bird
    Deleted 23.50.188.0/22 via 89.212.47.185 dev t2-v24-ha proto bird
    Deleted 92.122.68.0/22 via 89.212.47.185 dev t2-v24-ha proto bird
    Deleted 88.221.100.0/22 via 89.212.47.185 dev t2-v24-ha proto bird
    Deleted 92.123.208.0/22 via 89.212.47.185 dev t2-v24-ha proto bird
    .....
Regards, saso
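To put a number on the churn visible in the `ip monitor` output above, the add and delete events can be tallied per prefix. The following is only a minimal sketch and not something used in this thread: the function names `parse_event` and `churn` are made up for illustration, and it assumes the line format shown above (a leading `Deleted` for withdrawals, the prefix as the first field otherwise).

```python
from collections import Counter


def parse_event(line):
    """Return (action, prefix) for one route line, or None if it is not one.

    Assumed format (as in the output above):
      "Deleted <prefix> via ..."  -> ("del", prefix)
      "<prefix> via ..."          -> ("add", prefix)
    """
    fields = line.split()
    if not fields:
        return None
    if fields[0] == "Deleted":
        action, fields = "del", fields[1:]
    else:
        action = "add"
    # Only lines whose first field is an explicit prefix are counted;
    # link/address/neighbour events and "default" routes are skipped.
    if not fields or "/" not in fields[0]:
        return None
    return action, fields[0]


def churn(lines):
    """Count add and delete events per prefix."""
    counts = {"add": Counter(), "del": Counter()}
    for line in lines:
        event = parse_event(line)
        if event is not None:
            action, prefix = event
            counts[action][prefix] += 1
    return counts


if __name__ == "__main__":
    # Three lines copied from the output above.
    sample = [
        "Deleted 68.69.37.0/24 via 89.212.47.185 dev t2-v24-ha proto bird",
        "68.69.37.0/24 via 89.212.47.185 dev t2-v24-ha proto bird",
        "Deleted 2.16.70.0/23 via 89.212.47.185 dev t2-v24-ha proto bird",
    ]
    result = churn(sample)
    for prefix, n in result["del"].most_common():
        print(prefix, "deleted", n, "re-added", result["add"][prefix])
```

With a capture written by `ip monitor route > churn.log`, calling `churn(open("churn.log"))` would show which prefixes flap the most.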
On 6 May 2019, at 19:30, Kees Meijs <kees@nefos.nl> wrote:
Hi list,
We're in the process of replacing Quagga with BIRD but have stumbled upon a little problem.
When device scanning is on (the default, obviously), our testing machine completely fills up a CPU core. The culprit isn't BIRD itself but an Open vSwitch daemon.
After disabling the device protocol and restarting BIRD, everything goes back to its quiet state.
BIRD (1.6.3-2) and Open vSwitch (2.6.2~pre+git20161223-3) both were installed as Debian stable packages.
The configuration is as simple as:
    # This is a minimal configuration file, which allows the bird daemon to start
    # but will not cause anything else to happen.
    #
    # Please refer to the documentation in the bird-doc package or BIRD User's
    # Guide on http://bird.network.cz/ for more information on configuring BIRD and
    # adding routing protocols.

    # Change this into your BIRD router ID. It's a world-wide unique identification
    # of your router, usually one of router's IPv4 addresses.
    router id 1.2.3.4;

    # The Device protocol is not a real routing protocol. It doesn't generate any
    # routes and it only serves as a module for getting information about network
    # interfaces from the kernel.
    protocol device {
    }

    # The Kernel protocol is not a real routing protocol. Instead of communicating
    # with other routers in the network, it performs synchronization of BIRD's
    # routing tables with the OS kernel.
    protocol kernel {
        metric 64;      # Use explicit kernel route metric to avoid collisions
                        # with non-BIRD routes in the kernel routing table
        import none;
        export all;     # Actually insert routes into the kernel routing table
    }

    protocol bgp test {
        description "BGP test";
        local as REDACTED;
        neighbor 1.2.3.4 as REDACTED;
        direct;
        next hop self;
        deterministic med on;
        export none;
        import all;
    }
Meanwhile, log messages such as the following appear:
    bird: Kernel dropped some netlink messages, will resync on next scan.
As a test, I deleted all existing Open vSwitch bridges and the load dropped again. After adding an empty new bridge, the load spiked again instantly.
This is unexpected behaviour. Maybe it's an implementation problem in Open vSwitch or maybe in BIRD. Either way, it shouldn't happen, I guess.
Any clues?
Thanks in advance!
Regards, Kees
Hi Saso,
Thank you very much. OVS is new in the mix as well (we're not replacing Quagga alone). Obviously we didn't expect this to happen.
I'll see if patching OVS in Debian in a similar way works for us or if another approach fits better (i.e. maybe not using OVS at all).
If you know of a better, more upgrade- and maintenance-proof solution, I would welcome more information.
Regards, Kees
On 06-05-19 20:40, Saso Tavcar wrote:
[...]
The best solution would be a good OVS routing table patch as quoted.
Maybe BIRD developers can help, since they are native C developers.
We also tried BIRD on native (K)VM network interfaces. Since they are some kind of software emulation too, we hit unrecoverable network IRQ problems, so an overloaded OVS is still the better solution for us.
Regards, saso
On 6 May 2019, at 21:01, Kees Meijs <kees@nefos.nl> wrote:
[...]
Hello!
Just shortly, I was trying BIRD in QEMUs connected via OVS bridges several years ago. It was even worse; I ran into some segfaults in OVS. (A week later, one of my friends told me that it was an embargoed bug.) I didn't try any more, it wasn't feasible.
I'd like to promise that I'll look into it, but I'm very busy and I'll surely forget your issue in ten minutes. If you don't hear from me by next Monday, please ping me.
Maria
On 5/6/19 9:33 PM, Saso Tavcar wrote:
[...]
Thanks Maria.
I don't want to eat up your precious time so I'll try the VRF approach first. If that works we're good.
K.
On 06-05-19 23:04, Maria Matejka wrote:
[...]
Fine! Anyway, I can offer you this: most of the time would be spent on configuring OVS, as I'm not familiar with it. If you could provide me with a script that I could just run on a machine with no OVS configured beforehand, and that makes the bug manifest… whoa, it would really help me.
Maria
On May 7, 2019 6:53:42 AM GMT+02:00, Kees Meijs <kees@nefos.nl> wrote:
[...]
Hi there,
Just creating a bridge with no configuration and no ports attached is enough:
    ~# ovs-vsctl add-br foobar
Now host a BGP full feed and there's havoc.
Hopefully I'll be able to configure VRFs today and I'll see if that helps.
Cheers, Kees
On 07-05-19 07:04, Maria Matějka wrote:
[...]
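For a reproduction that does not depend on a live BGP session, one option might be to push synthetic routes into the kernel after creating the bridge. The sketch below only generates input for `ip -batch`. Whether static routes trigger the same load as a BIRD-fed full table is an assumption that has not been tested in this thread. The function name `batch_lines`, the 10.0.0.0/8 base network, the gateway 192.0.2.1 and the device `eth0` are all placeholders.

```python
import ipaddress
from itertools import islice


def batch_lines(base, prefixlen, gateway, dev, count, action="add"):
    """Build `ip -batch` lines for `count` synthetic routes.

    The routes are the first `count` subnets of `base` with the given
    prefix length. Each line is an `ip` command without the leading "ip",
    which is the format a batch file expects.
    """
    network = ipaddress.ip_network(base)
    subnets = islice(network.subnets(new_prefix=prefixlen), count)
    return [
        f"route {action} {subnet} via {gateway} dev {dev} proto static"
        for subnet in subnets
    ]


if __name__ == "__main__":
    # Placeholder values; adjust them to the test machine.
    for line in batch_lines("10.0.0.0/8", 24, "192.0.2.1", "eth0", 5):
        print(line)
```

Writing a larger set to a file and running `ip -batch <file>` as root would install the routes, and the same call with `action="del"` produces the matching cleanup file. The gateway has to be reachable on the chosen device, otherwise the kernel rejects the routes.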
Hi again,
Placing the routes in another table works fine:
    # ip r s ta 10 | wc -l
    744892
Meanwhile in the default table:
    # ip r s | wc -l
    3
However, it seems the Open vSwitch daemon is triggered again and polls to synchronise the routes. It is still eating its way through a CPU thread:
    top - 08:47:02 up 6 min,  1 user,  load average: 1,11, 0,83, 0,40
    Tasks: 123 total,   2 running, 121 sleeping,   0 stopped,   0 zombie
    %Cpu(s): 15,5 us, 11,3 sy,  0,0 ni, 73,2 id,  0,0 wa,  0,0 hi,  0,0 si,  0,0 st
    KiB Mem : 32929556 total, 31573996 free,  1259204 used,    96356 buff/cache
    KiB Swap:        0 total,        0 free,        0 used. 31347300 avail Mem

      PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
      772 root      10 -10 1234200 884504   8440 R 105,0  2,7   6:08.68 ovs-vswitchd
I believe it's a good thing to fix Open vSwitch (not BIRD), but meanwhile I'll try to figure out another approach, maybe using virtualisation to separate the physical world from the routing process.
If there's any future testing or debugging to do, I'm glad to help and set up a test lab.
Regards, Kees
On 07-05-19 07:22, Kees Meijs wrote:
[...]
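To compare setups (default table, separate table, no bridge) without reading `top` by eye, the CPU share of `ovs-vswitchd` can be sampled from `/proc`. This is a rough sketch assuming Linux; `cpu_ticks`, `cpu_percent` and `sample` are illustrative names, and the PID has to be supplied by the caller (772 in the output above).

```python
import os
import time


def cpu_ticks(stat_line):
    """Return utime + stime (in clock ticks) from a /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split after the last ')' rather than on whitespace.
    rest = stat_line.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state); utime and stime are fields 14 and 15.
    return int(rest[11]) + int(rest[12])


def cpu_percent(ticks_before, ticks_after, elapsed, hz):
    """CPU usage as a percentage of one core over `elapsed` seconds."""
    return 100.0 * (ticks_after - ticks_before) / (hz * elapsed)


def sample(pid, interval=5.0):
    """Measure the CPU usage of process `pid` over `interval` seconds."""
    path = f"/proc/{pid}/stat"
    hz = os.sysconf("SC_CLK_TCK")  # clock ticks per second
    with open(path) as f:
        before = cpu_ticks(f.read())
    start = time.monotonic()
    time.sleep(interval)
    with open(path) as f:
        after = cpu_ticks(f.read())
    return cpu_percent(before, after, time.monotonic() - start, hz)
```

For example, `sample(772)` would return a figure comparable to the %CPU column above. Because the counters cover all threads of the process, values above 100 are possible.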
Hi,
I've been struggling with a similar issue while trying to set up BIRD on the EdgeRouter platform. Unfortunately, the VRF approach worked only until I added some ip rules.
Best regards, Łukasz Jarosz
On Tue, 7 May 2019, 09:06, Kees Meijs <kees@nefos.nl> wrote:
[...]
Hi and thanks again.
We're in need of VRF support, and maybe it works without overloading when the full feed is placed in a VRF other than the default one (which is good practice anyway). Hopefully OVS only synchronises the default system tables.
I'll post my findings.
Regards, Kees
On 06-05-19 21:33, Saso Tavcar wrote:
[...]
participants (5)
- Kees Meijs
- Maria Matejka
- Maria Matějka
- Saso Tavcar
- Łukasz Jarosz