Hi, I've been playing with the BIRD source code and implementing a new extension to Babel but that is not what this mail is about. While doing that I noticed passing --enable-debug to configure causes BIRD to crash regardless of configuration:
$ bt #0 0x00007ffff7d9741a in raise () from /nix/store/2wrfwfdpklhaqhjxgq6yd257cagdxgph-glibc-2.32/lib/libc.so.6 #1 0x00007ffff7d81523 in abort () from /nix/store/2wrfwfdpklhaqhjxgq6yd257cagdxgph-glibc-2.32/lib/libc.so.6 #2 0x00000000004912c7 in bug (msg=msg@entry=0x4a06d8 "Assertion '%s' failed at %s:%d") at sysdep/unix/log.c:281 #3 0x000000000049175f in add_tail (n=0x4e4880 <lc_syslog>, l=0x4e8060 <log_list>) at ./lib/lists.c:79 #4 default_log_list (syslog_name=<synthetic pointer>, initial=0x0) at sysdep/unix/log.c:345 #5 log_switch (initial=initial@entry=0x0, logs=logs@entry=0x4eecb0, new_syslog_name=0x0) at sysdep/unix/log.c:373 #6 0x000000000049216b in sysdep_commit (new=new@entry=0x4eec70, old=<optimized out>) at sysdep/unix/main.c:188 #7 0x0000000000420a1a in config_do_commit (c=c@entry=0x4eec70, type=type@entry=0x1) at conf/conf.c:265 #8 0x0000000000420c29 in config_commit (c=c@entry=0x4eec70, type=type@entry=0x1, timeout=timeout@entry=0x0) at conf/conf.c:361 #9 0x0000000000492f3a in main (argc=<optimized out>, argv=<optimized out>) at sysdep/unix/main.c:912 #10 0x00007ffff7d82dbd in __libc_start_main () from /nix/store/2wrfwfdpklhaqhjxgq6yd257cagdxgph-glibc-2.32/lib/libc.so.6 #11 0x000000000041087a in _start () at ../sysdeps/x86_64/start.S:120
The statement triggering this is the check for n->prev == NULL in lib/lists.c:79: 75 │ LIST_INLINE void 76 │ add_tail(list *l, node *n) 77 │ { 78 │ EXPENSIVE_CHECK(check_list(l, NULL)); -> 79 │ ASSUME(n->prev == NULL); 80 │ ASSUME(n->next == NULL); 81 │ 82 │ node *z = l->tail; 83 │ 84 │ n->next = &l->tail_node; 85 │ n->prev = z; 86 │ z->next = n; 87 │ l->tail = n; 88 │ } Note: Expensive checks are disabled. To reproduce build BIRD on the current master (3347aaf as of this writing) pass --enable-debug to the configure script. Create a simple configuration file that contains at least one protocol. I used just "protocol device {}" as the whole of the configuration file. Then execute "./bird -c minimal.conf -s test.ctl -f" and observe it crash. Bisecting the issue from v2.0.7 to the latest master branch showed that this seems to have been broken / uncovered in baac700:
commit baac7009063d94799821422ecc63ea2af41561ea (refs/bisect/bad) Author: Maria Matejka <mq@ucw.cz> Date: Wed Aug 14 16:23:58 2019 +0200
List expensive check. (https://gitlab.nic.cz/labs/bird/-/commit/baac7009063d94799821422ecc63ea2af41...)
Looking at the backtrace we can see that `default_log_list` is being called just before the check is being made: sysdep/unix/log.c 341 │ #ifdef HAVE_SYSLOG_H 342 │ if (!dbgf) 343 │ { 344 │ static struct log_config lc_syslog = { .mask = ~0 }; -> 345 │ add_tail(&log_list, &lc_syslog.n); 346 │ *syslog_name = bird_name; 347 │ } 348 │ #endif This code path is being called twice during the startup. One of the call chains is the one visible in the backtrace (which happens after/during configuration commit) and the other happens very early in the process startup: sysdep/unix/main.c 843 │ int 844 │ main(int argc, char **argv) 845 │ { … 851 │ parse_args(argc, argv); -> 852 │ log_switch(1, NULL, NULL); … 912 │ config_commit(conf, RECONFIG_HARD, 0); // 2nd call, see bt Since the log_config struct is static (sysdep/unix/log.c:344) the same entry is being added twice to the log_list and thus at the 2nd attempt of adding it the prev field is no longer the expected NULL value and thus triggering the shown abort. The log_list itself is also static (in default_log_list) but initialized on every execution of the function. Probably purposely. This still leaves for a brief moment where structures are inconsistent. I am not yet familiar enough with the BIRD code base to reason about where threading is use and if this is an actual issue or not. Now the question is where the fix should happen? We could add code to the log_close logic that resets the node pointers of each of the entries? Another option could be in the default_log_list function by just reassigning the value of each fo the log_config structs on every single call to the function. As the log_list is also initialized on every call this is perhaps the simplest but also not cleanest way to deal with this? The initially described issue also occurs when not setting --enable-debug it just goes silently unnoticed. I went ahead and applied the below patch to make all ASSUME invocations log (in a normal build) to see how many more of those messages are going unnoticed. ------< snip >------ diff --git a/lib/birdlib.h b/lib/birdlib.h index 23036c1b..76848e2c 100644 --- a/lib/birdlib.h +++ b/lib/birdlib.h @@ -176,7 +176,7 @@ void debug(const char *msg, ...); /* Printf to debug output */ #endif #else #define ASSERT(x) do { if (!(x)) log(L_BUG "Assertion '%s' failed at %s:%d", #x, __FILE__, __LINE__); } while(0) -#define ASSUME(x) /* intentionally left blank */ +#define ASSUME(x) do { if (!(x)) log(L_BUG "Assertion '%s' failed at %s:%d\n", #x, __FILE__, __LINE__); } while(0) #endif ------< snip >------ On one of my machines with a few BGP sessions and Babel interfaces I instructed GDB to print a backtrace whenever the n->prev == NULL condition fails. Here the those few that I saw and could distinguish after a very brief look.
Breakpoint 1, add_tail (n=0x110f298, l=0x7fff7cb58b60) at ./lib/lists.c:80 #0 add_tail (n=0x110f298, l=0x7fff7cb58b60) at ./lib/lists.c:80 #1 babel_process_packet (pkt=<optimized out>, len=<optimized out>, saddr=..., ifa=<optimized out>) at proto/babel/packets.c:1393 #2 0x000000000043e2aa in babel_rx_hook (sk=<optimized out>, len=<optimized out>) at proto/babel/packets.c:1472 #3 0x000000000040a9d3 in sk_read (s=s@entry=0x110dc80, revents=<optimized out>) at sysdep/unix/io.c:1910 #4 0x000000000040b394 in io_loop () at sysdep/unix/io.c:2342 #5 0x000000000040b760 in main (argc=<optimized out>, argv=<optimized out>) at sysdep/unix/main.c:923
Breakpoint 1, add_tail (n=0x110f6d0, l=0x1108848) at ./lib/lists.c:80 #0 add_tail (n=0x110f6d0, l=0x1108848) at ./lib/lists.c:80 #1 babel_enqueue (msg=<optimized out>, ifa=0x11087b0) at proto/babel/packets.c:1309 #2 0x0000000000445143 in babel_send_seqno_request (p=p@entry=0x109bc20, e=e@entry=0x11050c8, sr=sr@entry=0x1271718) at proto/babel/babel.c:911 #3 0x00000000004459c5 in babel_add_seqno_request (p=p@entry=0x109bc20, e=e@entry=0x11050c8, router_id=<optimized out>, seqno=<optimized out>, hop_count=hop_count@entry=0 '\000', nbr=nbr@entry=0x0) at proto/babel/babel.c:337 #4 0x0000000000445cf3 in babel_select_route (p=p@entry=0x109bc20, e=0x11050c8, mod=<optimized out>) at proto/babel/babel.c:775 #5 0x0000000000445f13 in babel_update_cost (nbr=nbr@entry=0x1111710) at proto/babel/babel.c:611 #6 0x000000000044626b in babel_expire_hello (p=p@entry=0x109bc20, nbr=nbr@entry=0x1111710, now_=now_@entry=242732001004) at proto/babel/babel.c:486 #7 0x00000000004462ed in babel_expire_neighbors (p=p@entry=0x109bc20) at proto/babel/babel.c:506 #8 0x00000000004473ae in babel_timer (t=<optimized out>) at proto/babel/babel.c:2072 #9 0x000000000045da19 in timers_fire (loop=0x4db740 <main_timeloop>) at lib/timer.c:235 #10 0x000000000040afe9 in io_loop () at sysdep/unix/io.c:2192 #11 0x000000000040b760 in main (argc=<optimized out>, argv=<optimized out>) at sysdep/unix/main.c:923
Breakpoint 1, add_tail (n=0x22c61a8, l=0x7ffe4c97c250) at ./lib/lists.c:80 #0 add_tail (n=0x22c61a8, l=0x7ffe4c97c250) at ./lib/lists.c:80 #1 babel_send_unicast (msg=0x7ffe4c97c2a0, ifa=0x22bfa20, dest=...) at proto/babel/packets.c:1283 #2 0x00000000004451d9 in babel_send_route_request (p=<optimized out>, e=<optimized out>, n=0x22c8800) at proto/babel/babel.c:869 #3 0x0000000000445233 in babel_refresh_route (p=p@entry=0x2252c20, r=r@entry=0x2423108) at proto/babel/babel.c:220 #4 0x00000000004471c3 in babel_expire_routes_ (p=p@entry=0x2252c20, rtable=rtable@entry=0x2252d58) at proto/babel/babel.c:243 #5 0x0000000000447388 in babel_expire_routes (p=p@entry=0x2252c20) at proto/babel/babel.c:293 #6 0x00000000004473a6 in babel_timer (t=<optimized out>) at proto/babel/babel.c:2071 #7 0x000000000045da19 in timers_fire (loop=0x4db740 <main_timeloop>) at lib/timer.c:235 #8 0x000000000040afe9 in io_loop () at sysdep/unix/io.c:2192 #9 0x000000000040b760 in main (argc=<optimized out>, argv=<optimized out>) at sysdep/unix/main.c:923
Breakpoint 1, add_tail (n=0x234ac30, l=0x241f620) at ./lib/lists.c:82 #0 add_tail (n=0x234ac30, l=0x241f620) at ./lib/lists.c:82 #1 sl_alloc (s=0x21bd7d0) at lib/slab.c:264 #2 0x000000000044f3e2 in rte_get_temp (a=0x22ffe18) at nest/rt-table.c:282 #3 0x0000000000434220 in bgp_rte_update (s=s@entry=0x7ffe4c97c110, n=n@entry=0x7ffe4c97bf80, path_id=path_id@entry=0, a0=a0@entry=0x7ffe4c97bfd0) at proto/bgp/packets.c:1326 #4 0x00000000004359db in bgp_decode_nlri_ip4 (s=0x7ffe4c97c110, pos=<optimized out>, len=<optimized out>, a=0x7ffe4c97bfd0) at proto/bgp/packets.c:1479 #5 0x0000000000432f2b in bgp_decode_nlri (s=s@entry=0x7ffe4c97c110, afi=afi@entry=65537, nlri=<optimized out>, len=<optimized out>, ea=ea@entry=0x22fa4c0, nh=<optimized out>, nh_len=<optimized out>) at proto/bgp/packets.c:2410 #6 0x00000000004383a9 in bgp_rx_update (conn=conn@entry=0x2250728, pkt=pkt@entry=0x22c8a20 '\377' <repeats 16 times>, len=len@entry=227) at proto/bgp/packets.c:2505 #7 0x0000000000438ef9 in bgp_rx_packet (conn=conn@entry=0x2250728, pkt=pkt@entry=0x22c8a20 '\377' <repeats 16 times>, len=227) at proto/bgp/packets.c:3097 #8 0x0000000000438fb5 in bgp_rx (sk=0x22c88d0, size=<optimized out>) at proto/bgp/packets.c:3142 #9 0x00000000004039a0 in call_rx_hook (s=s@entry=0x22c88d0, size=<optimized out>) at sysdep/unix/io.c:1796 #10 0x000000000040a8dd in sk_read (s=s@entry=0x22c88d0, revents=1) at sysdep/unix/io.c:1884 #11 0x000000000040b394 in io_loop () at sysdep/unix/io.c:2342 #12 0x000000000040b760 in main (argc=<optimized out>, argv=<optimized out>) at sysdep/unix/main.c:923
Are these all false positives or should there be more cleanup code on these list entries? Sorry for the noise, Andi