Latency spikes on FreeBSD 8.2 with kernel scan function.
Hi list, Is it just me, or is something totally broken in scan function of kernel protocol? # ping -i 0.01 -c 2500 127.0.0.1 PING 127.0.0.1 (127.0.0.1): 56 data bytes 64 bytes from 127.0.0.1: icmp_seq=0 ttl=64 time=0.019 ms (...) 64 bytes from 127.0.0.1: icmp_seq=1149 ttl=64 time=0.018 ms 64 bytes from 127.0.0.1: icmp_seq=1150 ttl=64 time=0.018 ms 64 bytes from 127.0.0.1: icmp_seq=1151 ttl=64 time=176.896 ms 64 bytes from 127.0.0.1: icmp_seq=1152 ttl=64 time=204.953 ms 64 bytes from 127.0.0.1: icmp_seq=1153 ttl=64 time=0.027 ms 64 bytes from 127.0.0.1: icmp_seq=1154 ttl=64 time=0.026 ms (...) 64 bytes from 127.0.0.1: icmp_seq=2499 ttl=64 time=0.019 ms --- 127.0.0.1 ping statistics --- 2500 packets transmitted, 2500 packets received, 0.0% packet loss round-trip min/avg/max/stddev = 0.007/0.187/204.953/5.495 ms This generates quite a lot of funny things, among them being SIP breakage, freezes in games, hair-pulling in my person - whole network freezes for the scan-time. Does this happen only on my system? Bird 1.3.4+FIB, 8-STABLE current as of 2 hours ago. Pawel.
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Pawel Tyll wrote:
Hi list,
Is it just me, or is something totally broken in scan function of kernel protocol?
bird has nothing to do with kernel packet forwarding. It is control plane software.
# ping -i 0.01 -c 2500 127.0.0.1 PING 127.0.0.1 (127.0.0.1): 56 data bytes 64 bytes from 127.0.0.1: icmp_seq=0 ttl=64 time=0.019 ms (...) 64 bytes from 127.0.0.1: icmp_seq=1149 ttl=64 time=0.018 ms 64 bytes from 127.0.0.1: icmp_seq=1150 ttl=64 time=0.018 ms 64 bytes from 127.0.0.1: icmp_seq=1151 ttl=64 time=176.896 ms 64 bytes from 127.0.0.1: icmp_seq=1152 ttl=64 time=204.953 ms
Do you have some kind of shaping (dummynet, altq) configured?
64 bytes from 127.0.0.1: icmp_seq=1153 ttl=64 time=0.027 ms 64 bytes from 127.0.0.1: icmp_seq=1154 ttl=64 time=0.026 ms (...) 64 bytes from 127.0.0.1: icmp_seq=2499 ttl=64 time=0.019 ms --- 127.0.0.1 ping statistics --- 2500 packets transmitted, 2500 packets received, 0.0% packet loss round-trip min/avg/max/stddev = 0.007/0.187/204.953/5.495 ms
This generates quite a lot of funny things, among them being SIP breakage, freezes in games, hair-pulling in my person - whole network freezes for the scan-time.
Does this happen only on my system?
Bird 1.3.4+FIB, 8-STABLE current as of 2 hours ago.
Pawel.
-----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.14 (FreeBSD) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAk7LO+8ACgkQwcJ4iSZ1q2nGagCfZoTBJR3DWa9y77Er+WJqLoLT ZE0AnAmV+nFaLU19VqtH1h9tPm0txvQH =csa1 -----END PGP SIGNATURE-----
Hi,
bird has nothing to do with kernel packet forwarding. It is control plane software. Do you have some kind of shaping (dummynet, altq) configured? I do have dummynet configured, but like I said - every 20 seconds network freezes for about 0.5s. Guess what happens every 20 seconds... You got it! Protocol "kernel" scans the kernel routes. Also, after /usr/local/etc/rc.d/bird stop, problem is gone. What could be causing this? I can't believe scanning kernel routes is supposed to be this resource intensive?
Thanks, Pawel.
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Pawel Tyll wrote:
Hi,
bird has nothing to do with kernel packet forwarding. It is control plane software. Do you have some kind of shaping (dummynet, altq) configured? I do have dummynet configured, but like I said - every 20 seconds network freezes for about 0.5s. Guess what happens every 20 seconds... You got it! Protocol "kernel" scans the kernel routes. Also, after /usr/local/etc/rc.d/bird stop, problem is gone. What could be causing bird is stopped, traffic forwarding is stopped (since dynamic routes disappers) ? Perhaps your configuration shapes ALL (e.g. network AND local) traffic withing single queue? If so, when queue is full with network traffic even local traffic is delayed ?
this? I can't believe scanning kernel routes is supposed to be this resource intensive? No. It is not any kind of 'intensive' :)
Thanks,
Pawel.
-----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.14 (FreeBSD) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAk7LXQwACgkQwcJ4iSZ1q2kIswCePVanjefokdKfgJvJoXlMVhuP TAYAn3ow7zufRfyMr60uYythJ2bda0KU =lrig -----END PGP SIGNATURE-----
I do have dummynet configured, but like I said - every 20 seconds network freezes for about 0.5s. Guess what happens every 20 seconds... You got it! Protocol "kernel" scans the kernel routes. Also, after /usr/local/etc/rc.d/bird stop, problem is gone. What could be causing bird is stopped, traffic forwarding is stopped (since dynamic routes disappers) ? Perhaps your configuration shapes ALL (e.g. network AND local) traffic withing single queue? If so, when queue is full with network traffic even local traffic is delayed ? No no. And there's almost no traffic. There's one IBGP peer, this machine isn't in production yet. It was for a moment, but had to be pulled because of this. I'm 100% certain this has something to do with 'scan' facility of 'kernel' protocol.
ping -M time -i 0.01 127.0.0.1: 76 bytes from 127.0.0.1: icmp_seq=3207 ttl=64 time=0.015 ms tso=08:40:08 tsr=08:40:08 tst=08:40:08 76 bytes from 127.0.0.1: icmp_seq=3208 ttl=64 time=0.016 ms tso=08:40:08 tsr=08:40:08 tst=08:40:08 76 bytes from 127.0.0.1: icmp_seq=3209 ttl=64 time=149.246 ms tso=08:40:08 tsr=08:40:08 tst=08:40:08 76 bytes from 127.0.0.1: icmp_seq=3210 ttl=64 time=169.547 ms tso=08:40:08 tsr=08:40:08 tst=08:40:08 76 bytes from 127.0.0.1: icmp_seq=3211 ttl=64 time=0.028 ms tso=08:40:08 tsr=08:40:08 tst=08:40:08 76 bytes from 127.0.0.1: icmp_seq=3212 ttl=64 time=0.022 ms tso=08:40:08 tsr=08:40:08 tst=08:40:08 76 bytes from 127.0.0.1: icmp_seq=4948 ttl=64 time=0.017 ms tso=08:40:27 tsr=08:40:27 tst=08:40:27 76 bytes from 127.0.0.1: icmp_seq=4949 ttl=64 time=0.016 ms tso=08:40:27 tsr=08:40:27 tst=08:40:27 76 bytes from 127.0.0.1: icmp_seq=4950 ttl=64 time=150.495 ms tso=08:40:27 tsr=08:40:28 tst=08:40:28 76 bytes from 127.0.0.1: icmp_seq=4951 ttl=64 time=0.026 ms tso=08:40:28 tsr=08:40:28 tst=08:40:28 76 bytes from 127.0.0.1: icmp_seq=4952 ttl=64 time=0.026 ms tso=08:40:28 tsr=08:40:28 tst=08:40:28 76 bytes from 127.0.0.1: icmp_seq=8586 ttl=64 time=0.016 ms tso=08:41:08 tsr=08:41:08 tst=08:41:08 76 bytes from 127.0.0.1: icmp_seq=8587 ttl=64 time=0.016 ms tso=08:41:08 tsr=08:41:08 tst=08:41:08 76 bytes from 127.0.0.1: icmp_seq=8588 ttl=64 time=149.361 ms tso=08:41:08 tsr=08:41:08 tst=08:41:08 76 bytes from 127.0.0.1: icmp_seq=8589 ttl=64 time=168.944 ms tso=08:41:08 tsr=08:41:08 tst=08:41:08 76 bytes from 127.0.0.1: icmp_seq=8590 ttl=64 time=0.026 ms tso=08:41:08 tsr=08:41:08 tst=08:41:08 76 bytes from 127.0.0.1: icmp_seq=8591 ttl=64 time=0.021 ms tso=08:41:08 tsr=08:41:08 tst=08:41:08 Nov 22 09:40:08 storm bird: kernel1: Scanning routing table Nov 22 09:40:08 storm bird: kernel1: Pruning table master Nov 22 09:40:27 storm bird: kernel1: Scanning routing table Nov 22 09:40:28 storm bird: kernel1: Pruning table master Nov 22 09:40:47 storm bird: kernel1: Scanning routing table Nov 22 09:40:48 storm bird: kernel1: Pruning table master Nov 22 09:41:08 storm bird: kernel1: Scanning routing table Nov 22 09:41:09 storm bird: kernel1: Pruning table master bird> show protocols name proto table state since info kernel1 Kernel master up 02:52 device1 Device master up 02:52 NN_Gate BGP master up 02:52 Established static1 Static master down 09:36
On 22.11.2011 12:43, Pawel Tyll wrote:
I do have dummynet configured, but like I said - every 20 seconds network freezes for about 0.5s. Guess what happens every 20 seconds... You got it! Protocol "kernel" scans the kernel routes. Also, after /usr/local/etc/rc.d/bird stop, problem is gone. What could be causing bird is stopped, traffic forwarding is stopped (since dynamic routes disappers) ? Perhaps your configuration shapes ALL (e.g. network AND local) traffic withing single queue? If so, when queue is full with network traffic even local traffic is delayed ? No no. And there's almost no traffic. There's one IBGP peer, this Okay. Can you shutdown your iBGP peer and check again? machine isn't in production yet. It was for a moment, but had to be pulled because of this. I'm 100% certain this has something to do with 'scan' facility of 'kernel' protocol. From kernel point of view scan is exactly the same as 'netstat -rn' execution (and, maybe, 'ifconfig')
ping -M time -i 0.01 127.0.0.1: You can try to use larger packets (-s 1400, for example) to make sure that large packets triggers delay much more often.
You can disable shaper and see if the problem disappears. You can send me your shaper configuration off-list.
76 bytes from 127.0.0.1: icmp_seq=3207 ttl=64 time=0.015 ms tso=08:40:08 tsr=08:40:08 tst=08:40:08 76 bytes from 127.0.0.1: icmp_seq=3208 ttl=64 time=0.016 ms tso=08:40:08 tsr=08:40:08 tst=08:40:08 76 bytes from 127.0.0.1: icmp_seq=3209 ttl=64 time=149.246 ms tso=08:40:08 tsr=08:40:08 tst=08:40:08 76 bytes from 127.0.0.1: icmp_seq=3210 ttl=64 time=169.547 ms tso=08:40:08 tsr=08:40:08 tst=08:40:08 76 bytes from 127.0.0.1: icmp_seq=3211 ttl=64 time=0.028 ms tso=08:40:08 tsr=08:40:08 tst=08:40:08 76 bytes from 127.0.0.1: icmp_seq=3212 ttl=64 time=0.022 ms tso=08:40:08 tsr=08:40:08 tst=08:40:08
76 bytes from 127.0.0.1: icmp_seq=4948 ttl=64 time=0.017 ms tso=08:40:27 tsr=08:40:27 tst=08:40:27 76 bytes from 127.0.0.1: icmp_seq=4949 ttl=64 time=0.016 ms tso=08:40:27 tsr=08:40:27 tst=08:40:27 76 bytes from 127.0.0.1: icmp_seq=4950 ttl=64 time=150.495 ms tso=08:40:27 tsr=08:40:28 tst=08:40:28 76 bytes from 127.0.0.1: icmp_seq=4951 ttl=64 time=0.026 ms tso=08:40:28 tsr=08:40:28 tst=08:40:28 76 bytes from 127.0.0.1: icmp_seq=4952 ttl=64 time=0.026 ms tso=08:40:28 tsr=08:40:28 tst=08:40:28
76 bytes from 127.0.0.1: icmp_seq=8586 ttl=64 time=0.016 ms tso=08:41:08 tsr=08:41:08 tst=08:41:08 76 bytes from 127.0.0.1: icmp_seq=8587 ttl=64 time=0.016 ms tso=08:41:08 tsr=08:41:08 tst=08:41:08 76 bytes from 127.0.0.1: icmp_seq=8588 ttl=64 time=149.361 ms tso=08:41:08 tsr=08:41:08 tst=08:41:08 76 bytes from 127.0.0.1: icmp_seq=8589 ttl=64 time=168.944 ms tso=08:41:08 tsr=08:41:08 tst=08:41:08 76 bytes from 127.0.0.1: icmp_seq=8590 ttl=64 time=0.026 ms tso=08:41:08 tsr=08:41:08 tst=08:41:08 76 bytes from 127.0.0.1: icmp_seq=8591 ttl=64 time=0.021 ms tso=08:41:08 tsr=08:41:08 tst=08:41:08
Nov 22 09:40:08 storm bird: kernel1: Scanning routing table Nov 22 09:40:08 storm bird: kernel1: Pruning table master Nov 22 09:40:27 storm bird: kernel1: Scanning routing table Nov 22 09:40:28 storm bird: kernel1: Pruning table master Nov 22 09:40:47 storm bird: kernel1: Scanning routing table Nov 22 09:40:48 storm bird: kernel1: Pruning table master Nov 22 09:41:08 storm bird: kernel1: Scanning routing table Nov 22 09:41:09 storm bird: kernel1: Pruning table master
bird> show protocols name proto table state since info kernel1 Kernel master up 02:52 device1 Device master up 02:52 NN_Gate BGP master up 02:52 Established static1 Static master down 09:36
-- WBR, Alexander
On Tue, Nov 22, 2011 at 09:43:52AM +0100, Pawel Tyll wrote:
I do have dummynet configured, but like I said - every 20 seconds network freezes for about 0.5s. Guess what happens every 20 seconds... You got it! Protocol "kernel" scans the kernel routes. Also, after /usr/local/etc/rc.d/bird stop, problem is gone. What could be causing bird is stopped, traffic forwarding is stopped (since dynamic routes disappers) ? Perhaps your configuration shapes ALL (e.g. network AND local) traffic withing single queue? If so, when queue is full with network traffic even local traffic is delayed ? No no. And there's almost no traffic. There's one IBGP peer, this machine isn't in production yet. It was for a moment, but had to be pulled because of this. I'm 100% certain this has something to do with 'scan' facility of 'kernel' protocol.
You can verify that by disabling 'kernel' protocol. On BSD, kernel scan is implemented by requesting a big buffer and copying several tens of MB of route data from kernel to BIRD. I am not an expert on BSD, but perhaps there may be some kernel problem on some BSD version? Another possible explanation is that kernel scan for some reason generates a lot of log messages that strains the disk/systen, but i would doubt that. -- Elen sila lumenn' omentielvo Ondrej 'SanTiago' Zajicek (email: santiago@crfreenet.org) OpenPGP encrypted e-mails preferred (KeyID 0x11DEADC3, wwwkeys.pgp.net) "To err is human -- to blame it on a computer is even more so."
On 23.11.2011 19:27, Ondrej Zajicek wrote:
On Tue, Nov 22, 2011 at 09:43:52AM +0100, Pawel Tyll wrote:
I do have dummynet configured, but like I said - every 20 seconds network freezes for about 0.5s. Guess what happens every 20 seconds... You got it! Protocol "kernel" scans the kernel routes. Also, after /usr/local/etc/rc.d/bird stop, problem is gone. What could be causing bird is stopped, traffic forwarding is stopped (since dynamic routes disappers) ? Perhaps your configuration shapes ALL (e.g. network AND local) traffic withing single queue? If so, when queue is full with network traffic even local traffic is delayed ? No no. And there's almost no traffic. There's one IBGP peer, this machine isn't in production yet. It was for a moment, but had to be pulled because of this. I'm 100% certain this has something to do with 'scan' facility of 'kernel' protocol.
You can verify that by disabling 'kernel' protocol.
On BSD, kernel scan is implemented by requesting a big buffer and copying several tens of MB of route data from kernel to BIRD. I am not an expert on BSD, but perhaps there may be some kernel problem on some BSD version? This looks much more like some shaper configuration problem. There are no global locks in kernel for entire routing table. And no consumer can hold lock that can be acquired by ISR for such a long time.
I've sent my previous reply in this topic yesterday but, unfortunately, mailman blocks my e-mail due to unknown From address. (So I send it now as separate letter without any CC:)
Another possible explanation is that kernel scan for some reason generates a lot of log messages that strains the disk/systen, but i would doubt that.
-- WBR, Alexander
On Wed, Nov 23, 2011 at 07:25:18PM +0400, Alexander V. Chernikov wrote:
On BSD, kernel scan is implemented by requesting a big buffer and copying several tens of MB of route data from kernel to BIRD. I am not an expert on BSD, but perhaps there may be some kernel problem on some BSD version? This looks much more like some shaper configuration problem. There are no global locks in kernel for entire routing table. And no consumer can hold lock that can be acquired by ISR for such a long time.
You think that BSD kernel applies shaping on sysctl() calls? (which is how the table scan is implemented in krt_sysctl_scan() ). -- Elen sila lumenn' omentielvo Ondrej 'SanTiago' Zajicek (email: santiago@crfreenet.org) OpenPGP encrypted e-mails preferred (KeyID 0x11DEADC3, wwwkeys.pgp.net) "To err is human -- to blame it on a computer is even more so."
On 23.11.2011 19:47, Ondrej Zajicek wrote:
On Wed, Nov 23, 2011 at 07:25:18PM +0400, Alexander V. Chernikov wrote:
On BSD, kernel scan is implemented by requesting a big buffer and copying several tens of MB of route data from kernel to BIRD. I am not an expert on BSD, but perhaps there may be some kernel problem on some BSD version? This looks much more like some shaper configuration problem. There are no global locks in kernel for entire routing table. And no consumer can hold lock that can be acquired by ISR for such a long time.
You think that BSD kernel applies shaping on sysctl() calls? (which is how the table scan is implemented in krt_sysctl_scan() ).
No. I think that network traffic shaper (which is configured) shapes ALL network traffic including local traffic traveling through loopback interface. (I've done similar funny things by mistake in the past) -- WBR, Alexander
There are no global locks in kernel for entire routing table.
Wait, what? This must be the problem that I had earlier when I got strange error messages from Bird with the routing prefixes. Since I disabled the SMP in all of the four machines that runs bird, I have not have this problem anymore. As Pawel sad about this spikes, I see this as well when I run traceroute for example. Now I know where they are coming from. -- //fredan
On 2011/11/23 16:27, Ondrej Zajicek wrote:
On BSD, kernel scan is implemented by requesting a big buffer and copying several tens of MB of route data from kernel to BIRD. I am not an expert on BSD, but perhaps there may be some kernel problem on some BSD version?
I don't know about FreeBSD but dumping the kernel route table can certainly be quite expensive on OpenBSD. It doesn't take all that long but locking means that interactive response can be quite poor while it happens.
participants (6)
-
Alexander V. Chernikov -
Alexander V. Chernikov -
fredrik danerklint -
Ondrej Zajicek -
Pawel Tyll -
Stuart Henderson