bird becoming unresponsive after a few hours
Hi.
I run bird 1.3.7 on a Debian VPS which peers with three IPv4 peers and three IPv6 peers. I have no data plane traffic, as I use these peers to run these two Twitter accounts: https://twitter.com/bgp4_table https://twitter.com/bgp6_table
I needed more peers, so I moved to a new VPS with 2GB RAM. The previous one only had 1GB.
I have 7 global IPv4 peers and 7 IPv6 peers. I'm running bird 1.4.0 on Ubuntu 14.04. I've noticed that if I leave it for about an hour, and then log into birdc and type 'show route count', bird will stall for about a minute, giving no response. At that time, if I check netstat, the Recv-Q rapidly increases on all my BGP sessions.
Now and then my BGP peers will also go down, stating hold time expired. When I check at random times, the Recv-Q is high on these sessions.
My older server with three peers has never had this issue.
Unfortunately bird's log is not showing anything. It only shows when a peer goes down due to the hold time expiring.
Any ideas on what I can check?
Thanks
Darren
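The Recv-Q check described above can be automated so the queue depth is sampled regularly rather than at random times. What follows is only a minimal sketch: it assumes the usual Linux `netstat -tn` column layout (Proto, Recv-Q, Send-Q, Local Address, Foreign Address, State), and the function name `bgp_recv_queues` and the sample addresses are made up for illustration.

```python
def bgp_recv_queues(netstat_output, bgp_port=179):
    """Return {(local, remote): recv_q} for TCP sockets that use the BGP port."""
    queues = {}
    for line in netstat_output.splitlines():
        fields = line.split()
        # Data rows look like: tcp <Recv-Q> <Send-Q> <local> <remote> <state>
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        recv_q, local, remote = fields[1], fields[3], fields[4]
        # The port is whatever follows the last colon (works for IPv4 and IPv6).
        ports = {local.rsplit(":", 1)[-1], remote.rsplit(":", 1)[-1]}
        if str(bgp_port) in ports and recv_q.isdigit():
            queues[(local, remote)] = int(recv_q)
    return queues


# Illustrative sample only; addresses come from documentation ranges.
SAMPLE = """\
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 192.0.2.10:22           198.51.100.7:50122      ESTABLISHED
tcp   183520      0 192.0.2.10:179          203.0.113.1:41234       ESTABLISHED
tcp6       0      0 2001:db8::10:51000      2001:db8:1::1:179       ESTABLISHED
"""

if __name__ == "__main__":
    for (local, remote), depth in bgp_recv_queues(SAMPLE).items():
        print(local, remote, depth)
```

A Recv-Q that keeps growing means the kernel has received data that the bird process has not yet read, which is consistent with the daemon being stalled rather than the peers being silent.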
Hi Darren,
Do you see any high CPU or RAM usage? How many prefixes do you receive? Can you please provide the config?
Do you see anything going on, e.g. via tcpdump filtered on port 179? Is there any conspicuous bump in, e.g., BGP updates? Filter on bgp.type or try to graph it.
Overall you could try the latest code from the website, but I don't think that you are hitting a bug, at least at the moment ;-)
Rgds, Stefan
Hi
Well, try version 1.5.0 (or at least 1.4.5); version 1.3.7 is just too old.
I would also suggest checking free memory and swapping (and whether the process is running, sleeping or in IO-wait state). 'show route count' reads the whole routing table, so if the VPS is somewhat overcommitted and is swapping, reading the whole table could be slow.
You could also try 'show route' instead of 'show route count' to see if the behavior is the same and what is reported during it.
-- Elen sila lumenn' omentielvo
Ondrej 'Santiago' Zajicek (email: santiago@crfreenet.org) OpenPGP encrypted e-mails preferred (KeyID 0x11DEADC3, wwwkeys.pgp.net) "To err is human -- to blame it on a computer is even more so."
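One way to act on the suggestion above (process state and swapping) is to read /proc/<pid>/status for the bird process on Linux. This is a rough sketch under assumptions: the helper names are hypothetical, the sample text is invented, and it relies on the `State`, `VmRSS` and `VmSwap` fields that Linux exposes in that file.

```python
def parse_proc_status(text):
    """Parse the 'Key: value' lines of /proc/<pid>/status into a dict."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def swap_summary(status):
    """Return (state letter, resident kB, swapped kB) from a parsed status dict."""
    def kb(field):
        # Values look like '683212 kB'; a missing field is treated as 0.
        return int(status.get(field, "0 kB").split()[0])
    return status.get("State", "?")[:1], kb("VmRSS"), kb("VmSwap")


def read_status(pid):
    """Read the live status file for a given pid (Linux only)."""
    with open(f"/proc/{pid}/status") as fh:
        return fh.read()


# Invented sample: a process in uninterruptible (disk) sleep with pages swapped out.
SAMPLE = "Name:\tbird\nState:\tD (disk sleep)\nVmRSS:\t  512000 kB\nVmSwap:\t  173056 kB\n"

if __name__ == "__main__":
    print(swap_summary(parse_proc_status(SAMPLE)))
```

On a real system one would pass the output of `read_status(pid)` instead of `SAMPLE`. A state of `D` together with a non-zero `VmSwap` while the command stalls would point towards swapping inside the guest; it cannot show swapping done by the hypervisor.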
Hi.
1.3.7 is running on the older server and has no issues. It's the 1.4.0 install that's having the issue.
I've just rebuilt my VPS, as it's still bare and the provider now has Debian 8 available. I've now installed that. It comes with bird 1.4.5, so I'll see how it goes today.
To answer the questions above:
CPU is hardly used at all. I have access to four cores and most of the time they are < 1%.
The server has 2GB RAM and I'm currently using 746MB of it.
I have 7 full IPv4 peers and 7 full IPv6 peers, so the route count is as follows:
    root@bird:/etc/bird# birdc 'show route count'
    BIRD 1.4.5 ready.
    4290064 of 4290064 routes for 554557 networks
    root@bird:/etc/bird# birdc6 'show route count'
    BIRD 1.4.5 ready.
    179722 of 179722 routes for 22814 networks
ulimit is set to unlimited.
More info on mem usage:
    root@bird:/etc/bird# birdc 'show mem'
    BIRD 1.4.5 ready.
    BIRD memory usage
    Routing tables:    395 MB
    Route attributes:  275 MB
    ROA tables:        192 B
    Protocols:         50 kB
    Total:             669 MB
    root@bird:/etc/bird# birdc6 'show mem'
    BIRD 1.4.5 ready.
    BIRD memory usage
    Routing tables:    17 MB
    Route attributes:  43 MB
    ROA tables:        192 B
    Protocols:         51 kB
    Total:             60 MB
Config-wise this is what it looks like; I've just changed IPs and passwords. All my peers are set up exactly the same:
    log syslog all;
    router id x.x.x.x;
    protocol device { }
    protocol kernel {
        export none;
        import none;
    }
    function unwanted_bgp_routes()
    prefix set unwanted;
    {
        unwanted = [ 169.254.0.0/16+, 172.16.0.0/12+, 192.168.0.0/16+,
                     10.0.0.0/8+, 224.0.0.0/4+, 240.0.0.0/4+ ];
        if net.ip = 0.0.0.0 then return true;
        if net.len <8 then return true;
        if net.len >24 then return true;
        if net ~ unwanted then return true;
        if ( bgp_path.len > 45 ) then return true;
        return false;
    }
    filter bgp_in {
        if unwanted_bgp_routes() then {
            reject;
        } else {
            accept;
        }
    }
    protocol bgp PEER1 {
        local as 64533;
        neighbor x.x.x.x as x;
        multihop;
        password "xxxx";
        import keep filtered;
        import filter bgp_in;
    }

    log syslog all;
    router id x.x.x.x;
    protocol device { }
    function wanted_prefixes()
    prefix set wanted;
    {
        wanted = [ 2001:0::/32,
                   2001:200::/23{23,48}, 2001:400::/23{23,48},
                   2001:600::/23{23,48}, 2001:800::/23{23,48},
                   2001:A00::/23{23,48}, 2001:C00::/23{23,48},
                   2001:E00::/23{23,48}, 2001:1200::/23{23,48},
                   2001:1400::/23{23,48}, 2001:1600::/23{23,48},
                   2001:1800::/23{23,48}, 2001:1A00::/23{23,48},
                   2001:1C00::/22{22,48}, 2001:2000::/20{20,48},
                   2001:3000::/21{21,48}, 2001:3800::/22{22,48},
                   2001:4000::/23{23,48}, 2001:4200::/23{23,48},
                   2001:4400::/23{23,48}, 2001:4600::/23{23,48},
                   2001:4800::/23{23,48}, 2001:4A00::/23{23,48},
                   2001:4C00::/23{23,48}, 2001:5000::/20{20,48},
                   2001:8000::/19{19,48}, 2001:A000::/20{20,48},
                   2001:B000::/20{20,48}, 2002:0000::/16{16,48},
                   2003:0000::/18{18,48}, 2400:0000::/12{12,48},
                   2600:0000::/12{12,48}, 2610:0000::/23{23,48},
                   2620:0000::/23{23,48}, 2800:0000::/12{12,48},
                   2A00:0000::/12{12,48}, 2C00:0000::/12{12,48} ];
        if net ~ wanted then return true;
        return false;
    }
    function unwanted_prefixes()
    prefix set unwanted;
    {
        unwanted = [ 2001:db8::/32+, ::/0{49,128} ];
        if net ~ unwanted then return true;
        if ( bgp_path.len > 45 ) then return true;
        return false;
    }
    filter bgp_in {
        if unwanted_prefixes() then reject;
        if wanted_prefixes() then accept;
        else reject;
    }
    protocol kernel {
        export none;
        import none;
    }
    protocol bgp PEER1 {
        local as 64533;
        neighbor xxx:xxxx::x as x;
        source address xxx:xxx::x;
        password "xxx";
        multihop;
        import keep filtered;
        import filter bgp_in;
    }
As for the behaviour: I noticed that when it stalls, even doing a 'show mem' will sit for about 10 seconds before giving me a result. But at that time I don't see high CPU or memory usage.
Thanks
Darren
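For readers less familiar with BIRD's filter language, the logic of the IPv4 function unwanted_bgp_routes() in the configuration above can be mirrored in Python. This is an illustrative re-implementation, not something BIRD runs: the name `unwanted_bgp_route` and its arguments are assumptions, and a `+` prefix pattern is taken to mean "this prefix and anything more specific".

```python
import ipaddress

# Same blocks as the 'unwanted' prefix set in the IPv4 config.
UNWANTED = [ipaddress.ip_network(p) for p in (
    "169.254.0.0/16", "172.16.0.0/12", "192.168.0.0/16",
    "10.0.0.0/8", "224.0.0.0/4", "240.0.0.0/4")]


def unwanted_bgp_route(prefix, as_path_len):
    """Return True if the IPv4 filter in the thread would reject this route."""
    net = ipaddress.ip_network(prefix)
    if net.network_address == ipaddress.ip_address("0.0.0.0"):
        return True                      # net.ip = 0.0.0.0
    if net.prefixlen < 8 or net.prefixlen > 24:
        return True                      # net.len < 8 or net.len > 24
    if any(net.subnet_of(block) for block in UNWANTED):
        return True                      # net ~ unwanted
    return as_path_len > 45              # bgp_path.len > 45


if __name__ == "__main__":
    for prefix, path_len in [("8.8.8.0/24", 3), ("10.1.0.0/16", 3),
                             ("8.8.8.0/25", 3), ("8.8.8.0/24", 46)]:
        print(prefix, path_len, unwanted_bgp_route(prefix, path_len))
```

Because the config also sets `import keep filtered`, routes rejected by this filter are still kept in memory by BIRD, which is worth bearing in mind when reading the 'show mem' figures.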
When BIRD is unresponsive, it is not unusual for even simple commands to take about 10 seconds to be processed. When such behavior is experienced, the CPU could be busy (user or system), idle, or in IO-wait state; see top:
    Cpu(s): 7.7%us, 4.3%sy, 0.0%ni, 88.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
You say it is not busy. On a common server, you would see high wait time and low free memory if the system were swapping. But since it is a VPS, I would guess that the server could be overcommitted and swapping on the provider side, which from inside the VPS would look like an idle system with plenty of memory, but with the same performance problems.
-- Ondrej 'Santiago' Zajicek (email: santiago@crfreenet.org)
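The busy / idle / IO-wait distinction above can be read off the top summary line programmatically. The sketch below parses a line in the format quoted in the thread; the function names and the thresholds in `classify` are arbitrary assumptions chosen for illustration, not values recommended anywhere in this discussion.

```python
import re


def parse_top_cpu(line):
    """Turn a top 'Cpu(s):' summary line into {field: percent}."""
    pairs = re.findall(r"(\d+(?:\.\d+)?)\s*%?\s*([a-z]{2})", line)
    return {name: float(value) for value, name in pairs}


def classify(cpu, wait_threshold=10.0, busy_threshold=50.0):
    """Rough label for the CPU state; the thresholds are illustrative assumptions."""
    if cpu.get("wa", 0.0) >= wait_threshold:
        return "io-wait"
    if cpu.get("st", 0.0) >= wait_threshold:
        return "steal"
    if cpu.get("us", 0.0) + cpu.get("sy", 0.0) >= busy_threshold:
        return "busy"
    return "idle"


# The exact line quoted in the message above.
LINE = "Cpu(s): 7.7%us, 4.3%sy, 0.0%ni, 88.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st"

if __name__ == "__main__":
    cpu = parse_top_cpu(LINE)
    print(cpu, classify(cpu))
```

Note the limitation: `wa` only reflects IO wait seen by the guest kernel, and `st` (steal) reflects CPU time taken by the hypervisor. Neither would necessarily reveal memory swapping on the provider's host, so an "idle" result here does not rule out the overcommit scenario.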
Hi all.
Last week I upgraded to Debian 8 running bird 1.4.5. My sessions have been up non-stop for the last 5 days, both v4 and v6. There is still a slight delay on the first call to show route count, but it is not as bad as with 1.4.0. I'll continue to monitor, but for now it seems okay.
Thanks
Darren
Participants (3): Darren O'Connor, Ondrej Zajicek, Stefan Jakob