On Fri, Feb 17, 2017 at 02:22:11PM +0100, Ondrej Zajicek wrote:
I work with João and I managed to cut our (huge) config down to a minimal reproducer. Our setup does something like this:
Hi
Thanks for trying to track the problem.
It was fun to do. :-) Thanks for the reply!
There seems to be a problem related to the memory management of strings used in per static route implicit filter expressions when a reconfiguration happens.
Does the problem disappear if you replace strings with numbers in your test case?
I had another look at this, and it's a lot easier to make it crash with strings for some reason, but if you look purely at valgrind warnings, it seems that you can trigger those without using strings. Sorry for the initial misinformation. I reproduced the valgrind warnings with this config file: === router id 1.2.3.4; log stderr all; debug protocols all; protocol device { scan time 3600; } function function1() { print 1234; } protocol static { route 1234::/32 via "lo" { function1(); }; } === When I run this with the following script: === #!/bin/sh valgrind ./bird -c b.conf -d & sleep 0.75 kill -HUP %1 sleep 0.1 kill -HUP %1 sleep 0.1 kill %1 === I get this (again, warnings on the second reconfigure): === # ./test.sh ==15175== Memcheck, a memory error detector ==15175== Copyright (C) 2002-2012, and GNU GPL'd, by Julian Seward et al. ==15175== Using Valgrind-3.8.1 and LibVEX; rerun with -h for copyright info ==15175== Command: ./bird -c b.conf -d ==15175== 2017-02-17 21:39:52 <TRACE> device1: Initializing 2017-02-17 21:39:52 <TRACE> static1: Initializing 2017-02-17 21:39:52 <TRACE> device1: Starting 2017-02-17 21:39:52 <TRACE> device1: Scanning interfaces 2017-02-17 21:39:52 <TRACE> device1: Connected to table master 2017-02-17 21:39:52 <TRACE> device1: State changed to feed 2017-02-17 21:39:52 <TRACE> static1: Starting 2017-02-17 21:39:52 <TRACE> static1: Connected to table master 2017-02-17 21:39:52 <TRACE> static1: State changed to feed 2017-02-17 21:39:52 <INFO> Started 2017-02-17 21:39:52 <TRACE> device1: State changed to up 2017-02-17 21:39:52 <TRACE> static1 < interface lo goes up 2017-02-17 21:39:52 <INFO> 1234 2017-02-17 21:39:52 <TRACE> static1 > added [best] 1234::/32 dev lo 2017-02-17 21:39:52 <TRACE> static1 < interface eth0 goes up 2017-02-17 21:39:52 <TRACE> static1: State changed to up 2017-02-17 21:39:52 <INFO> Reconfiguration requested by SIGHUP 2017-02-17 21:39:52 <INFO> Reconfiguring 2017-02-17 21:39:52 <TRACE> device1: Reconfigured 2017-02-17 21:39:52 <TRACE> static1: Reconfigured 2017-02-17 21:39:52 <INFO> Reconfigured 2017-02-17 21:39:52 <INFO> Reconfiguration requested by SIGHUP 2017-02-17 21:39:52 <INFO> Reconfiguring 2017-02-17 21:39:52 <TRACE> device1: Reconfigured ==15175== Invalid read of size 2 ==15175== at 0x805A499: i_same (filter.c:1539) ==15175== by 0x805A893: i_same (filter.c:1632) ==15175== by 0x8085D4C: static_match.isra.13 (static.c:501) ==15175== by 0x8085F95: static_reconfigure (static.c:557) ==15175== by 0x8053FB4: protos_commit (proto.c:420) ==15175== by 0x8095F42: config_do_commit (conf.c:271) ==15175== by 0x8096172: config_commit (conf.c:361) ==15175== by 0x809D5C3: async_config (main.c:238) ==15175== by 0x8099B37: io_loop (io.c:2135) ==15175== by 0x804A733: main (main.c:885) ==15175== Address 0x4217a46 is 3,326 bytes inside a block of size 4,088 free'd ==15175== at 0x4028F4F: free (vg_replace_malloc.c:446) ==15175== by 0x809E3E9: lp_free (mempool.c:211) ==15175== by 0x80A2020: pool_free (resource.c:81) ==15175== by 0x80A20AF: rfree (resource.c:165) ==15175== by 0x8095E30: config_do_commit (conf.c:250) ==15175== by 0x8096172: config_commit (conf.c:361) ==15175== by 0x809D5C3: async_config (main.c:238) ==15175== by 0x8099B37: io_loop (io.c:2135) ==15175== by 0x804A733: main (main.c:885) ==15175== ==15175== Invalid read of size 2 ==15175== at 0x805A4A7: i_same (filter.c:1541) ==15175== by 0x805A893: i_same (filter.c:1632) ==15175== by 0x8085D4C: static_match.isra.13 (static.c:501) ==15175== by 0x8085F95: static_reconfigure (static.c:557) ==15175== by 0x8053FB4: protos_commit (proto.c:420) ==15175== by 0x8095F42: config_do_commit (conf.c:271) ==15175== by 0x8096172: config_commit (conf.c:361) ==15175== by 0x809D5C3: async_config (main.c:238) ==15175== by 0x8099B37: io_loop (io.c:2135) ==15175== by 0x804A733: main (main.c:885) ==15175== Address 0x4217a44 is 3,324 bytes inside a block of size 4,088 free'd ==15175== at 0x4028F4F: free (vg_replace_malloc.c:446) ==15175== by 0x809E3E9: lp_free (mempool.c:211) ==15175== by 0x80A2020: pool_free (resource.c:81) ==15175== by 0x80A20AF: rfree (resource.c:165) ==15175== by 0x8095E30: config_do_commit (conf.c:250) ==15175== by 0x8096172: config_commit (conf.c:361) ==15175== by 0x809D5C3: async_config (main.c:238) ==15175== by 0x8099B37: io_loop (io.c:2135) ==15175== by 0x804A733: main (main.c:885) ==15175== ==15175== Invalid read of size 4 ==15175== at 0x805A677: i_same (filter.c:1626) ==15175== by 0x805A893: i_same (filter.c:1632) ==15175== by 0x8085D4C: static_match.isra.13 (static.c:501) ==15175== by 0x8085F95: static_reconfigure (static.c:557) ==15175== by 0x8053FB4: protos_commit (proto.c:420) ==15175== by 0x8095F42: config_do_commit (conf.c:271) ==15175== by 0x8096172: config_commit (conf.c:361) ==15175== by 0x809D5C3: async_config (main.c:238) ==15175== by 0x8099B37: io_loop (io.c:2135) ==15175== by 0x804A733: main (main.c:885) ==15175== Address 0x4217a48 is 3,328 bytes inside a block of size 4,088 free'd ==15175== at 0x4028F4F: free (vg_replace_malloc.c:446) ==15175== by 0x809E3E9: lp_free (mempool.c:211) ==15175== by 0x80A2020: pool_free (resource.c:81) ==15175== by 0x80A20AF: rfree (resource.c:165) ==15175== by 0x8095E30: config_do_commit (conf.c:250) ==15175== by 0x8096172: config_commit (conf.c:361) ==15175== by 0x809D5C3: async_config (main.c:238) ==15175== by 0x8099B37: io_loop (io.c:2135) ==15175== by 0x804A733: main (main.c:885) [...] === This is with a build for an older distro for i686 (32 bit), but I can also reproduce the issue on the newest stock distro bird package (1.6.3) that ships with Fedora 25 for x86_64: === [...] 2017-02-17 21:51:32 <INFO> Reconfigured 2017-02-17 21:51:32 <INFO> Reconfiguration requested by SIGHUP 2017-02-17 21:51:32 <INFO> Reconfiguring 2017-02-17 21:51:32 <TRACE> device1: Reconfigured ==15956== Invalid read of size 2 ==15956== at 0x1201FA: ??? (in /usr/sbin/bird) ==15956== by 0x120669: ??? (in /usr/sbin/bird) ==15956== by 0x142FCC: ??? (in /usr/sbin/bird) ==15956== by 0x143194: ??? (in /usr/sbin/bird) ==15956== by 0x119A2C: ??? (in /usr/sbin/bird) ==15956== by 0x14E934: ??? (in /usr/sbin/bird) ==15956== by 0x14ED5A: ??? (in /usr/sbin/bird) ==15956== by 0x15625F: ??? (in /usr/sbin/bird) ==15956== by 0x15263F: ??? (in /usr/sbin/bird) ==15956== by 0x1105FD: ??? (in /usr/sbin/bird) ==15956== by 0x5078400: (below main) (in /usr/lib64/libc-2.24.so) ==15956== Address 0x542846a is 1,338 bytes inside a block of size 4,096 free'd ==15956== at 0x4C2ED4A: free (vg_replace_malloc.c:530) ==15956== by 0x15719F: ??? (in /usr/sbin/bird) ==15956== by 0x15AF21: ??? (in /usr/sbin/bird) ==15956== by 0x15AFB1: ??? (in /usr/sbin/bird) ==15956== by 0x14E84C: ??? (in /usr/sbin/bird) ==15956== by 0x14ED5A: ??? (in /usr/sbin/bird) ==15956== by 0x15625F: ??? (in /usr/sbin/bird) ==15956== by 0x15263F: ??? (in /usr/sbin/bird) ==15956== by 0x1105FD: ??? (in /usr/sbin/bird) ==15956== by 0x5078400: (below main) (in /usr/lib64/libc-2.24.so) ==15956== Block was alloc'd at ==15956== at 0x4C2DB9D: malloc (vg_replace_malloc.c:299) ==15956== by 0x15DD3A: ??? (in /usr/sbin/bird) ==15956== by 0x15730A: ??? (in /usr/sbin/bird) ==15956== by 0x15735C: ??? (in /usr/sbin/bird) ==15956== by 0x15607E: ??? (in /usr/sbin/bird) ==15956== by 0x156198: ??? (in /usr/sbin/bird) ==15956== by 0x14F206: ??? (in /usr/sbin/bird) ==15956== by 0x155AE6: ??? (in /usr/sbin/bird) ==15956== by 0x1103C6: ??? (in /usr/sbin/bird) ==15956== by 0x5078400: (below main) (in /usr/lib64/libc-2.24.so) [...] ===
The crash goes away if I apply this patch (not the right fix, I'm sure):
=== diff --git a/proto/static/static.c b/proto/static/static.c index 0c088cd..a56771f 100644 --- a/proto/static/static.c +++ b/proto/static/static.c @@ -498,7 +498,8 @@ static_same_dest(struct static_route *x, struct static_route *y) static inline int static_same_rte(struct static_route *x, struct static_route *y) { - return static_same_dest(x, y) && i_same(x->cmds, y->cmds); +// return static_same_dest(x, y) && i_same(x->cmds, y->cmds); + return 0; }
===
I think that the reconfiguration code finds the old and the new implicit filter expression to be the same code, and therefore hangs onto the old version, but then the underlying memory for some of the old objects gets freed. Or something like that.
Well, the comparison in static_same_rte() is just for decision whether the route should be propagated again to the (main) routing table.
Internal structures of static protocols (like struct static_route) are replaced anyways.
I see. Well, I was hoping to avoid going too deeply into bird internals (I've already been labeled our resident ornithologist), but I can dig deeper if the problem doesn't reproduce on your end for some reason. Thanks again! Cheers, Lennert