Just had another crash, 7 days after my previous email. Exact same symptoms, this time with the latest version from CZ repository: 1.6.2-3~bpo8+1.

bird6 stuck on recvmsg using 100% CPU, getting EAGAIN in an infinite loop:

# strace -p 23020
recvmsg(7, 0x7ffc45ae0ab0, 0) = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(7, 0x7ffc45ae0ab0, 0) = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(7, 0x7ffc45ae0ab0, 0) = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(7, 0x7ffc45ae0ab0, 0) = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(7, 0x7ffc45ae0ab0, 0) = -1 EAGAIN (Resource temporarily unavailable)
[...]

None of this happened in 1.5.0.

What can I do to help troubleshoot this? This is a major regression and it's making me seriously concerned about both my edge routers using the same version of Bird.

On 12/02/2016 06:46 PM, Israel G. Lugo wrote:
Hello,
I am getting some random crashes in bird6, running on Debian, version 1.6.2-1~bpo8+1 from your http://bird.network.cz/debian/ repository.
I've got a single OSPF instance with 74 routes, one eBGP session receiving a default route, and one iBGP session with another Bird router, which sends me its own default.
What happens is that, from time to time, bird6 becomes stuck in an infinite loop doing recvmsg() on a netlink socket, and IPv6 routes are lost. The interval seems random; it's been 3 days, and it's also been 2 weeks.
gk1 # strace -p 11465
recvmsg(7, 0x7ffe8cfecb70, 0) = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(7, 0x7ffe8cfecb70, 0) = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(7, 0x7ffe8cfecb70, 0) = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(7, 0x7ffe8cfecb70, 0) = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(7, 0x7ffe8cfecb70, 0) = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(7, 0x7ffe8cfecb70, 0) = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(7, 0x7ffe8cfecb70, 0) = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(7, 0x7ffe8cfecb70, 0) = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(7, 0x7ffe8cfecb70, 0) = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(7, 0x7ffe8cfecb70, 0) = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(7, 0x7ffe8cfecb70, 0) = -1 EAGAIN (Resource temporarily unavailable)
[...]
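For context on what the trace implies: on a non-blocking socket, EAGAIN only means that nothing is available to read right now, so a receive loop would normally stop and go back to waiting in poll()/select() instead of retrying immediately. Below is a minimal Python sketch of that pattern. It is not Bird's code (Bird is written in C), the name drain_socket is hypothetical, and a local UNIX datagram socket pair is assumed as a stand-in for the netlink socket.

```python
import socket


def drain_socket(sock, bufsize=8192):
    """Read every datagram currently pending on a non-blocking socket.

    EAGAIN/EWOULDBLOCK (raised by Python as BlockingIOError) ends the
    loop instead of being retried, so the caller can return to its
    event loop and wait for the socket to become readable again.
    """
    messages = []
    while True:
        try:
            data = sock.recv(bufsize)
        except BlockingIOError:
            # Nothing left to read right now: stop, do not spin.
            break
        if not data:
            # Zero-length read: treated as end of input in this sketch.
            break
        messages.append(data)
    return messages


if __name__ == "__main__":
    # Hypothetical stand-in for a netlink socket (assumption).
    reader, writer = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)
    reader.setblocking(False)
    writer.send(b"first")
    writer.send(b"second")
    print(drain_socket(reader))  # the two pending datagrams
    print(drain_socket(reader))  # nothing pending: returns immediately
```

A loop that calls recv again right after EAGAIN, without waiting for readiness, would produce a trace like the one above; whether that is what happens inside bird6 here is not established by the strace output alone.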
File descriptor 7 is a netlink socket:
gk1 # lsof -p 11465
COMMAND   PID USER   FD    TYPE             DEVICE SIZE/OFF      NODE NAME
bird6   11465 bird  cwd     DIR              253,0     4096         2 /
bird6   11465 bird  rtd     DIR              253,0     4096         2 /
bird6   11465 bird  txt     REG              253,0   540648    787381 /usr/sbin/bird6
bird6   11465 bird  mem     REG              253,0    47712    659204 /lib/x86_64-linux-gnu/libnss_files-2.19.so
bird6   11465 bird  mem     REG              253,0    43592    659208 /lib/x86_64-linux-gnu/libnss_nis-2.19.so
bird6   11465 bird  mem     REG              253,0    89104    659199 /lib/x86_64-linux-gnu/libnsl-2.19.so
bird6   11465 bird  mem     REG              253,0    31632    659200 /lib/x86_64-linux-gnu/libnss_compat-2.19.so
bird6   11465 bird  mem     REG              253,0  1738176    659160 /lib/x86_64-linux-gnu/libc-2.19.so
bird6   11465 bird  mem     REG              253,0   137440    655379 /lib/x86_64-linux-gnu/libpthread-2.19.so
bird6   11465 bird  mem     REG              253,0   140928    655799 /lib/x86_64-linux-gnu/ld-2.19.so
bird6   11465 bird   0u     CHR                1,3      0t0      1028 /dev/null
bird6   11465 bird   1u     CHR                1,3      0t0      1028 /dev/null
bird6   11465 bird   2u     CHR                1,3      0t0      1028 /dev/null
bird6   11465 bird   3u    unix 0xffff8803269f7c00      0t0 127941139 socket
bird6   11465 bird   4u    unix 0xffff8803269f7480      0t0 127941145 /run/bird/bird6.ctl
bird6   11465 bird   5u netlink                         0t0 127906248 ROUTE
bird6   11465 bird   6u netlink                         0t0 127906249 ROUTE
bird6   11465 bird   7u netlink                         0t0 127906250 ROUTE
bird6   11465 bird   8u    IPv6          127906251      0t0       TCP *:bgp (LISTEN)
bird6   11465 bird   9u    raw6                         0t0 127906252 00000000000000000000000000000000:0059->00000000000000000000000000000000:0000 st=07
bird6   11465 bird  10u    IPv6          127994711      0t0       TCP e0.gk1:bgp->e0.gk2:39074 (CLOSE_WAIT)
bird6   11465 bird  11u    IPv6          127965176      0t0       TCP [2001:w:y:x::133]:58268->[2001:w:y:x::1]:bgp (CLOSE_WAIT)
Unfortunately I didn't find any debug symbols for this package, so all I could get from gdb was the following:
(gdb) bt
#0  0x00007f5ad1705e80 in __recvmsg_nocancel () at ../sysdeps/unix/syscall-template.S:81
#1  0x00007f5ad1b90428 in ?? ()
#2  0x00007f5ad1b8956b in ?? ()
#3  0x00007f5ad1b8a06b in ?? ()
#4  0x00007f5ad1b3f0c7 in ?? ()
#5  0x00007f5ad136db45 in __libc_start_main (main=0x7f5ad1b3eb10, argc=5, argv=0x7ffe8cfece28, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffe8cfece18) at libc-start.c:287
#6  0x00007f5ad1b3f3ec in ?? ()
(gdb) info r
rax            0xfffffffffffffff5  -11
rbx            0x7f5ad32aefe0      140028066590688
rcx            0xffffffffffffffff  -1
rdx            0x0                 0
rsi            0x7ffe8cfecb70      140731263929200
rdi            0x7                 7
rbp            0x7f5ad1dba270      0x7f5ad1dba270
rsp            0x7ffe8cfecb18      0x7ffe8cfecb18
r8             0x7f5ad32aefe0      140028066590688
r9             0x0                 0
r10            0x1                 1
r11            0x246               582
r12            0x0                 0
r13            0x7f5ad32c7f60      140028066692960
r14            0x100               256
r15            0x0                 0
rip            0x7f5ad1705e80      0x7f5ad1705e80 <__recvmsg_nocancel+7>
eflags         0x246               [ PF ZF IF ]
cs             0x33                51
ss             0x2b                43
ds             0x0                 0
es             0x0                 0
fs             0x0                 0
gs             0x0                 0
Unfortunately, I did not have debug on when this crashed. I had it on for several days, but either I was "lucky" or the debug prevented the crash somehow. I was having several MB worth of debug logs every day, so I ended up disabling debug.
I'm not 100% sure that this was installed from your CZ repository; it may have been from Debian backports. But I'm 95% sure it came from CZ. In any case, the MD5 is as follows:
56e48e8e5a1380b384f1758df2077e53 bird_1.6.2-1~bpo8+1_amd64.deb
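To compare a locally downloaded package against the checksum above, something like the following Python sketch could be used. The function name md5_of_file is hypothetical, and the path to the .deb is assumed to be passed on the command line; only hashlib from the standard library is involved.

```python
import hashlib
import sys


def md5_of_file(path, chunk_size=65536):
    """Return the hex MD5 digest of the file at `path`, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__" and len(sys.argv) > 1:
    # Checksum quoted in this email for bird_1.6.2-1~bpo8+1_amd64.deb.
    expected = "56e48e8e5a1380b384f1758df2077e53"
    actual = md5_of_file(sys.argv[1])
    print(actual, "match" if actual == expected else "MISMATCH")
```

A match would only show that the two files are the same package build; it would not by itself say which repository the file came from, unless the repository's own published checksum is compared as well.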
I have now upgraded to 1.6.2-3~bpo8+1, from your CZ repository.
I can provide the configuration file off-list, if that helps.
Regards,
Israel