Bird6 freeze under high load

Baptiste Jonglez baptiste at bitsofnetworks.org
Sat Jan 31 14:47:51 CET 2015


Hi,

Thanks for this strace trick, I didn't know one could attach to a running
process like this.

The information returned by strace is not really helpful, since it
consists of a massive amount of netlink messages (and strace doesn't
decode netlink messages).  However, it indicates that Bird spends a lot of
time communicating with the kernel, which is valuable information in
itself :)
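
To put a number on "a lot of time communicating with the kernel", the
strace output file can be reduced to a per-syscall count.  This is only a
rough sketch: the function name syscall_histogram is made up for
illustration, and it assumes the plain one-call-per-line format that
"strace -p <pid> -o <file>" writes when attached to a single process.

```shell
# Count how often each system call appears in a strace output file.
# Assumes each line starts with "name(args...) = result", as written by
# "strace -p PID -o FILE" for a single traced process.
# syscall_histogram is a hypothetical helper name, not part of strace.
syscall_histogram() {
    # Keep only the syscall name in front of the opening parenthesis,
    # then count identical names and list the most frequent first.
    sed -n 's/^\([a-z_0-9]\{1,\}\)(.*/\1/p' "$1" |
        sort | uniq -c | sort -rn |
        awk '{ print $1, $2 }'
}

# Usage, with the file name suggested earlier in this thread:
#   syscall_histogram /tmp/bird6.strace.out
```

If the netlink reads dominate the count, that is consistent with the
freeze happening while the kernel routing table is being scanned.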

After activating the kernel logs in bird (debug { packets }), here is a
sample of the logs during a freeze:

Jan 31 12:06:26 myrouter bird6: kernel1: 2001::/32: seen
Jan 31 12:06:26 myrouter bird6: kernel1: 2001::/32: already seen
Jan 31 12:06:32 myrouter bird6: kernel1: 2001::/32: already seen
Jan 31 12:06:32 myrouter bird6: kernel1: 2001::/32: already seen
Jan 31 12:06:38 myrouter bird6: kernel1: 2001::/32: already seen
Jan 31 12:06:50 myrouter bird6: kernel1: 2001::/32: already seen
Jan 31 12:06:50 myrouter bird6: kernel1: 2001::/32: already seen
Jan 31 12:06:50 myrouter bird6: kernel1: 2001::/32: already seen
Jan 31 12:06:56 myrouter bird6: kernel1: 2001::/32: already seen

So, it looks like bird is receiving the same routes many times from the
kernel when it scans the kernel routing table.  It seems related to either
netlink or the kernel, since "ip" also exhibits the same behaviour:

$ ip -6 r | wc -l
7690163

This took several minutes to complete, and there certainly aren't that
many IPv6 routes in the kernel: routes appear several times in the output
of "ip -6 r".  Running this command multiple times yields very different
results each time.
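
To see how much of that output is duplication, the dump can be piped
through sort and uniq.  A minimal sketch, assuming that identical routes
are printed as byte-identical lines; the function name route_dup_summary
is made up for illustration:

```shell
# Summarise duplication in a routing table dump read from stdin.
# Prints the total number of lines, the number of distinct lines, and how
# many distinct lines occur more than once.
# route_dup_summary is a hypothetical helper name, not part of iproute2.
route_dup_summary() {
    sort | uniq -c | awk '
        { total += $1; distinct++; if ($1 > 1) dup++ }
        END { printf "total=%d distinct=%d duplicated=%d\n", total, distinct, dup }'
}

# Usage (requires the iproute2 "ip" tool):
#   ip -6 route show | route_dup_summary
```

A "total" far above "distinct" would confirm that the same routes are
being dumped repeatedly rather than the table really being that large.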

Thus, I don't think the bug is in Bird.  Could it be some kind of race
condition with netlink?  I haven't been able to find any reference to this
bug, either in the kernel or in iproute2.  For reference, this is on a
Debian wheezy system, but I can reproduce the duplicate routes in "ip -6 r"
on Debian jessie as well.

Thanks,
Baptiste

PS: as a temporary solution, I increased the scan time of the kernel
protocol from one minute to several hours, so that the issue arises less
often.

On Sat, Jan 31, 2015 at 12:37:09AM +0000, Pendzik, Edward wrote:
> if the pid of bird6 is 123, run (as root)
> 
> strace -p 123 -o /tmp/bird6.strace.out
> 
> and then send in the bird6.strace.out.
> 
> strace uses the same hooks gdb uses but doesn't stop the process.
> The outfile will contain all the system calls and their args from the process in real time as it is running.
> control-C will kill strace, which should cause it to detach and let bird6 keep going unfettered.
> 
> Ed
> 
> -----Original Message-----
> From: bird-users-bounces at network.cz [mailto:bird-users-bounces at network.cz] On Behalf Of Chris Caputo
> Sent: Friday, January 30, 2015 7:28 PM
> To: Baptiste Jonglez
> Cc: bird-users at network.cz
> Subject: Re: Bird6 freeze under high load
> 
> If built with symbols, after it has gotten into the CPU busy-loop, use gdb 
> to attach to it, ala:
> 
>   gdb <path to bird6> <process id of bird6>
> 
> Ex:
> 
>   gdb /usr/local/sbin/bird6 `ps -C bird6 -o pid=`
> 
> then "bt" for a stack trace, possibly showing where it is stuck.
> 
> "cont" to continue and then another control-c to check again.
> 
> Do this a few times.  Hopefully there will be a pattern.
> 
> Copy & paste the results to this list.
> 
> "quit" to exit gdb, allowing bird6 to continue.
> 
> Chris
> 
> On Sat, 31 Jan 2015, Baptiste Jonglez wrote:
> > I just tried downgrading from 1.4.5 to 1.4.4, using the 1.4.4-1~bpo70+1 
> > Debian package from http://bird.network.cz/?download&tdir=debian/
> > 
> > The result is the same, bird6 also freezes periodically with version 1.4.4.
> > 
> > By the way, I think I ruled out the possibility that a particular BGP peer
> > is sending garbage: the issue still arises when leaving only one BGP
> > session active, whichever it is.
> > 
> > Is there anything else I can do to help troubleshoot the root cause of
> > this issue?
> > 
> > On Thu, Jan 29, 2015 at 08:03:07PM +0100, Baptiste Jonglez wrote:
> > > Hi,
> > > 
> > > We are experiencing regular "freezes" of bird6 on a BGP router.  When this
> > > happens, bird6 maxes out a CPU for several minutes.  If a command is run
> > > in birdc6 during such a freeze, the command hangs, and the result is only
> > > returned when bird6 has stopped using the CPU.  Note that this also
> > > applies to "cheap" commands like "show protocols", which usually complete
> > > instantly (both with bird, and with bird6 in non-freeze conditions).
> > > 
> > > Sometimes (but not always), the non-responsiveness of bird6 causes all BGP
> > > sessions to drop, which is really annoying on a full-view BGP router.
> > > 
> > > The freezes happen at random, but seem to happen more frequently when the
> > > router is under load (typically, at peak time, each CPU spends ~20%
> > > forwarding packets, on a 4-core box).
> > > 
> > > The BGP setup is made of multiple transit and peerings, on multiple VLANs
> > > (some BGP neighbours share the same VLAN).  The setup is pretty similar on
> > > bird and bird6, but only bird6 exhibits these freezes, bird works just fine.
> > > 
> > > The box is running Debian wheezy on amd64, with bird from backports: 1.4.5-1~bpo70+1
> > > 
> > > Attached is the configuration, and two extracts of the logs when all BGP
> > > sessions dropped (with debug { states, interfaces, events }).  All files
> > > are anonymised, but should be consistent.
> > > 
> > > What do you think?  It looks like bird6 gets stuck on some very expensive
> > > operation, which prevents it from doing anything else (including keeping
> > > BGP sessions alive).
> > > 
> > > Thanks,
> > > Baptiste