Hello! On Fri, Nov 11, 2016 at 12:54:28PM +0100, Ondrej Zajicek wrote:
I've run into a bug which manifests itself during reconfiguration. Unfortunately, the bird config in question is extremely large and I can't easily cut it down to something replicable. I'm running into this particular issue in bird6, but have no reason to believe IPv4 would fair any better. All results are using bird 1.6.2. ... in this brief moment, a window of opportunity arises for our thus far silent transit to make an unwanted appearance:
sudo birdc6 show route for 2001:db8::/48 BIRD 1.6.2 ready. 2001:db8::/48 dev lo [static_protocol 05:44:07] * (200) via xxx on et3 [transit_3 05:44:07] (100) via xxx on et1 [transit_1 03:44:07] (100) via xxx on et2 [transit_2 03:44:07] (100)
I don't understand this. Even if the route from transit_3 appears during the window of opportunity, it should disappear as the static route is propagated again to transit_3 and transit_3 selected it (assuming that it is the reason why it was not propagated before).
This, it turns out, is tragic, because somehow the extended communities we apply to the route learnt from transit_3 clobber the ones we apply in static_protocol.
How does this ext community clobbering is expressed? Could you document it by appropriate 'show route all'?
I'm pretty sure something screws up the rta associated to the static route, but I'm also confused as to why the static protocol seems to get reloaded at all given there is no configuration change in the first place.
That is true, it should not.
I work with João and I managed to cut our (huge) config down to a minimal reproducer. Our setup does something like this: === router id 1.2.3.4; log stderr all; debug protocols all; protocol device { scan time 3600; } protocol direct { } function function1(string value) { print "function1 called with value ", value; bgp_ext_community.add((ro, 1, 1)); if "abcd" = value then { bgp_ext_community.add((ro, 2, 2)); } } function function2(string value) { print "function2 called with value ", value; if "efgh" = value then { bgp_ext_community.add((ro, 3, 3)); } } protocol static { route 1234::/32 via "lo" { function1("abcd"); function2("efgh"); }; } === There seems to be a problem related to the memory management of strings used in per static route implicit filter expressions when a reconfiguration happens. If I load bird 1.6.3 (built for IPv6) up with this config file and then send it a couple of SIGHUPs, the process crashes: === # ./bird -c b.conf -d 2017-02-17 08:49:46 <TRACE> device1: Initializing 2017-02-17 08:49:46 <TRACE> direct1: Initializing 2017-02-17 08:49:46 <TRACE> static1: Initializing 2017-02-17 08:49:46 <TRACE> device1: Starting 2017-02-17 08:49:46 <TRACE> device1: Scanning interfaces 2017-02-17 08:49:46 <TRACE> device1: Connected to table master 2017-02-17 08:49:46 <TRACE> device1: State changed to feed 2017-02-17 08:49:46 <TRACE> direct1: Starting 2017-02-17 08:49:46 <TRACE> direct1: Connected to table master 2017-02-17 08:49:46 <TRACE> direct1: State changed to feed 2017-02-17 08:49:46 <TRACE> static1: Starting 2017-02-17 08:49:46 <TRACE> static1: Connected to table master 2017-02-17 08:49:46 <TRACE> static1: State changed to feed 2017-02-17 08:49:46 <INFO> Started 2017-02-17 08:49:46 <TRACE> device1: State changed to up 2017-02-17 08:49:46 <TRACE> direct1 < interface lo goes up 2017-02-17 08:49:46 <TRACE> direct1 < primary address ::1/128 on interface lo added 2017-02-17 08:49:46 <TRACE> direct1 < interface eth0 goes up 2017-02-17 08:49:46 <TRACE> direct1 < primary address fe80::/64 on interface eth0 added 2017-02-17 08:49:46 <TRACE> direct1: State changed to up 2017-02-17 08:49:46 <TRACE> static1 < interface lo goes up 2017-02-17 08:49:46 <INFO> function1 called with value abcd 2017-02-17 08:49:46 <INFO> function2 called with value efgh 2017-02-17 08:49:46 <TRACE> static1 > added [best] 1234::/32 dev lo 2017-02-17 08:49:46 <TRACE> static1 < interface eth0 goes up 2017-02-17 08:49:46 <TRACE> static1: State changed to up 2017-02-17 08:49:51 <INFO> Reconfiguration requested by SIGHUP 2017-02-17 08:49:51 <INFO> Reconfiguring 2017-02-17 08:49:51 <TRACE> device1: Reconfigured 2017-02-17 08:49:51 <TRACE> direct1: Reconfigured 2017-02-17 08:49:51 <TRACE> static1: Reconfigured 2017-02-17 08:49:51 <INFO> Reconfigured 2017-02-17 08:49:52 <INFO> Reconfiguration requested by SIGHUP 2017-02-17 08:49:52 <INFO> Reconfiguring 2017-02-17 08:49:52 <TRACE> device1: Reconfigured 2017-02-17 08:49:52 <TRACE> direct1: Reconfigured 2017-02-17 08:49:52 <TRACE> static1: Reconfigured 2017-02-17 08:49:52 <INFO> Reconfigured 2017-02-17 08:49:52 <INFO> Reconfiguration requested by SIGHUP 2017-02-17 08:49:52 <INFO> Reconfiguring 2017-02-17 08:49:52 <TRACE> device1: Reconfigured 2017-02-17 08:49:52 <TRACE> direct1: Reconfigured Segmentation fault === The crash goes away if I apply this patch (not the right fix, I'm sure): === diff --git a/proto/static/static.c b/proto/static/static.c index 0c088cd..a56771f 100644 --- a/proto/static/static.c +++ b/proto/static/static.c @@ -498,7 +498,8 @@ static_same_dest(struct static_route *x, struct static_route *y) static inline int static_same_rte(struct static_route *x, struct static_route *y) { - return static_same_dest(x, y) && i_same(x->cmds, y->cmds); +// return static_same_dest(x, y) && i_same(x->cmds, y->cmds); + return 0; } === I think that the reconfiguration code finds the old and the new implicit filter expression to be the same code, and therefore hangs onto the old version, but then the underlying memory for some of the old objects gets freed. Or something like that. Running unmodified bird 1.6.3 under valgrind with the config above and then sending it a bunch of SIGHUPs produces plenty of crap on the second reconfigure (suggesting that the first reconfigure screws things up): === [...] 2017-02-17 08:59:25 <TRACE> direct1: Reconfigured ==3404== Invalid read of size 2 ==3404== at 0x805A499: i_same (filter.c:1539) ==3404== by 0x805A893: i_same (filter.c:1632) ==3404== by 0x8085D4C: static_match.isra.13 (static.c:501) ==3404== by 0x8085F95: static_reconfigure (static.c:557) ==3404== by 0x8053FB4: protos_commit (proto.c:420) ==3404== by 0x8095F42: config_do_commit (conf.c:271) ==3404== by 0x8096172: config_commit (conf.c:361) ==3404== by 0x809D5C3: async_config (main.c:238) ==3404== by 0x8099B37: io_loop (io.c:2135) ==3404== by 0x804A733: main (main.c:885) ==3404== Address 0x4217b5e is 3,606 bytes inside a block of size 4,088 free'd ==3404== at 0x4028F4F: free (vg_replace_malloc.c:446) ==3404== by 0x809E3E9: lp_free (mempool.c:211) ==3404== by 0x80A2020: pool_free (resource.c:81) ==3404== by 0x80A20AF: rfree (resource.c:165) ==3404== by 0x8095E30: config_do_commit (conf.c:250) ==3404== by 0x8096172: config_commit (conf.c:361) ==3404== by 0x809D5C3: async_config (main.c:238) ==3404== by 0x8099B37: io_loop (io.c:2135) ==3404== by 0x804A733: main (main.c:885) ==3404== ==3404== Invalid read of size 2 ==3404== at 0x805A4A7: i_same (filter.c:1541) ==3404== by 0x805A893: i_same (filter.c:1632) ==3404== by 0x8085D4C: static_match.isra.13 (static.c:501) ==3404== by 0x8085F95: static_reconfigure (static.c:557) ==3404== by 0x8053FB4: protos_commit (proto.c:420) ==3404== by 0x8095F42: config_do_commit (conf.c:271) ==3404== by 0x8096172: config_commit (conf.c:361) ==3404== by 0x809D5C3: async_config (main.c:238) ==3404== by 0x8099B37: io_loop (io.c:2135) ==3404== by 0x804A733: main (main.c:885) ==3404== Address 0x4217b5c is 3,604 bytes inside a block of size 4,088 free'd ==3404== at 0x4028F4F: free (vg_replace_malloc.c:446) ==3404== by 0x809E3E9: lp_free (mempool.c:211) ==3404== by 0x80A2020: pool_free (resource.c:81) ==3404== by 0x80A20AF: rfree (resource.c:165) ==3404== by 0x8095E30: config_do_commit (conf.c:250) ==3404== by 0x8096172: config_commit (conf.c:361) ==3404== by 0x809D5C3: async_config (main.c:238) ==3404== by 0x8099B37: io_loop (io.c:2135) ==3404== by 0x804A733: main (main.c:885) [...] === Does this make any sense? I can dig further but I was hoping that this would be enough information for you to figure out what's going on and how to best fix it. :-) Thank you very much in advance! Cheers, Lennert