Hi, I've run into a bug which manifests itself during reconfiguration. Unfortunately, the bird config in question is extremely large and I can't easily cut it down to something replicable. I'm running into this particular issue in bird6, but have no reason to believe IPv4 would fair any better. All results are using bird 1.6.2. My setup is as follows: - a static protocol, in which the import stanza adds extended communities: route 2001:db8::/48 via "lo" { bgp_ext_community.add((ro, 2, 0)); bgp_ext_community.add((ro, 3, 0)); ... } - based on these communities, the static route makes its way through a pipe from the master table, to an external table Typically all is fine - we export the route to our transits. Since we have multiple egress devices, a "show route" will show our local, static route, as well as the same route learnt through our providers: sudo birdc6 show route for 2001:db8::/48 BIRD 1.6.2 ready. 2001:db8::/48 dev lo [static_protocol 04:42:26] * (200) via xxx on et1 [transit_1 03:44:07] (100) via xxx on et2 [transit_2 03:44:07] (100) We actually have three transits, but one of our transits does not export any routes we happen to be exporting to them. Such a case happens on reconfiguring BIRD. During a reconfiguration, the static protocol gets reloaded: <TRACE> static_protocol > 2001:db8::/48 added [best] dev lo in this brief moment, a window of opportunity arises for our thus far silent transit to make an unwanted appearance: sudo birdc6 show route for 2001:db8::/48 BIRD 1.6.2 ready. 2001:db8::/48 dev lo [static_protocol 05:44:07] * (200) via xxx on et3 [transit_3 05:44:07] (100) via xxx on et1 [transit_1 03:44:07] (100) via xxx on et2 [transit_2 03:44:07] (100) This, it turns out, is tragic, because somehow the extended communities we apply to the route learnt from transit_3 clobber the ones we apply in static_protocol. I'm pretty sure something screws up the rta associated to the static route, but I'm also confused as to why the static protocol seems to get reloaded at all given there is no configuration change in the first place. We definitely want to allow local as: since the routes are potentially anycast, it is useful to have an alternate route in case we shut down the static announcement. Thanks, - j
On Mon, Nov 07, 2016 at 09:18:46PM -0800, João Taveira Araújo wrote:
Hi,
I've run into a bug which manifests itself during reconfiguration. Unfortunately, the bird config in question is extremely large and I can't easily cut it down to something replicable. I'm running into this particular issue in bird6, but have no reason to believe IPv4 would fair any better. All results are using bird 1.6.2. ... in this brief moment, a window of opportunity arises for our thus far silent transit to make an unwanted appearance:
sudo birdc6 show route for 2001:db8::/48 BIRD 1.6.2 ready. 2001:db8::/48 dev lo [static_protocol 05:44:07] * (200) via xxx on et3 [transit_3 05:44:07] (100) via xxx on et1 [transit_1 03:44:07] (100) via xxx on et2 [transit_2 03:44:07] (100)
I don't understand this. Even if the route from transit_3 appears during the window of opportunity, it should disappear as the static route is propagated again to transit_3 and transit_3 selected it (assuming that it is the reason why it was not propagated before).
This, it turns out, is tragic, because somehow the extended communities we apply to the route learnt from transit_3 clobber the ones we apply in static_protocol.
How does this ext community clobbering is expressed? Could you document it by appropriate 'show route all'?
I'm pretty sure something screws up the rta associated to the static route, but I'm also confused as to why the static protocol seems to get reloaded at all given there is no configuration change in the first place.
That is true, it should not. -- Elen sila lumenn' omentielvo Ondrej 'Santiago' Zajicek (email: santiago@crfreenet.org) OpenPGP encrypted e-mails preferred (KeyID 0x11DEADC3, wwwkeys.pgp.net) "To err is human -- to blame it on a computer is even more so."
Hello! On Fri, Nov 11, 2016 at 12:54:28PM +0100, Ondrej Zajicek wrote:
I've run into a bug which manifests itself during reconfiguration. Unfortunately, the bird config in question is extremely large and I can't easily cut it down to something replicable. I'm running into this particular issue in bird6, but have no reason to believe IPv4 would fair any better. All results are using bird 1.6.2. ... in this brief moment, a window of opportunity arises for our thus far silent transit to make an unwanted appearance:
sudo birdc6 show route for 2001:db8::/48 BIRD 1.6.2 ready. 2001:db8::/48 dev lo [static_protocol 05:44:07] * (200) via xxx on et3 [transit_3 05:44:07] (100) via xxx on et1 [transit_1 03:44:07] (100) via xxx on et2 [transit_2 03:44:07] (100)
I don't understand this. Even if the route from transit_3 appears during the window of opportunity, it should disappear as the static route is propagated again to transit_3 and transit_3 selected it (assuming that it is the reason why it was not propagated before).
This, it turns out, is tragic, because somehow the extended communities we apply to the route learnt from transit_3 clobber the ones we apply in static_protocol.
How does this ext community clobbering is expressed? Could you document it by appropriate 'show route all'?
I'm pretty sure something screws up the rta associated to the static route, but I'm also confused as to why the static protocol seems to get reloaded at all given there is no configuration change in the first place.
That is true, it should not.
I work with João and I managed to cut our (huge) config down to a minimal reproducer. Our setup does something like this: === router id 1.2.3.4; log stderr all; debug protocols all; protocol device { scan time 3600; } protocol direct { } function function1(string value) { print "function1 called with value ", value; bgp_ext_community.add((ro, 1, 1)); if "abcd" = value then { bgp_ext_community.add((ro, 2, 2)); } } function function2(string value) { print "function2 called with value ", value; if "efgh" = value then { bgp_ext_community.add((ro, 3, 3)); } } protocol static { route 1234::/32 via "lo" { function1("abcd"); function2("efgh"); }; } === There seems to be a problem related to the memory management of strings used in per static route implicit filter expressions when a reconfiguration happens. If I load bird 1.6.3 (built for IPv6) up with this config file and then send it a couple of SIGHUPs, the process crashes: === # ./bird -c b.conf -d 2017-02-17 08:49:46 <TRACE> device1: Initializing 2017-02-17 08:49:46 <TRACE> direct1: Initializing 2017-02-17 08:49:46 <TRACE> static1: Initializing 2017-02-17 08:49:46 <TRACE> device1: Starting 2017-02-17 08:49:46 <TRACE> device1: Scanning interfaces 2017-02-17 08:49:46 <TRACE> device1: Connected to table master 2017-02-17 08:49:46 <TRACE> device1: State changed to feed 2017-02-17 08:49:46 <TRACE> direct1: Starting 2017-02-17 08:49:46 <TRACE> direct1: Connected to table master 2017-02-17 08:49:46 <TRACE> direct1: State changed to feed 2017-02-17 08:49:46 <TRACE> static1: Starting 2017-02-17 08:49:46 <TRACE> static1: Connected to table master 2017-02-17 08:49:46 <TRACE> static1: State changed to feed 2017-02-17 08:49:46 <INFO> Started 2017-02-17 08:49:46 <TRACE> device1: State changed to up 2017-02-17 08:49:46 <TRACE> direct1 < interface lo goes up 2017-02-17 08:49:46 <TRACE> direct1 < primary address ::1/128 on interface lo added 2017-02-17 08:49:46 <TRACE> direct1 < interface eth0 goes up 2017-02-17 08:49:46 <TRACE> direct1 < primary address fe80::/64 on interface eth0 added 2017-02-17 08:49:46 <TRACE> direct1: State changed to up 2017-02-17 08:49:46 <TRACE> static1 < interface lo goes up 2017-02-17 08:49:46 <INFO> function1 called with value abcd 2017-02-17 08:49:46 <INFO> function2 called with value efgh 2017-02-17 08:49:46 <TRACE> static1 > added [best] 1234::/32 dev lo 2017-02-17 08:49:46 <TRACE> static1 < interface eth0 goes up 2017-02-17 08:49:46 <TRACE> static1: State changed to up 2017-02-17 08:49:51 <INFO> Reconfiguration requested by SIGHUP 2017-02-17 08:49:51 <INFO> Reconfiguring 2017-02-17 08:49:51 <TRACE> device1: Reconfigured 2017-02-17 08:49:51 <TRACE> direct1: Reconfigured 2017-02-17 08:49:51 <TRACE> static1: Reconfigured 2017-02-17 08:49:51 <INFO> Reconfigured 2017-02-17 08:49:52 <INFO> Reconfiguration requested by SIGHUP 2017-02-17 08:49:52 <INFO> Reconfiguring 2017-02-17 08:49:52 <TRACE> device1: Reconfigured 2017-02-17 08:49:52 <TRACE> direct1: Reconfigured 2017-02-17 08:49:52 <TRACE> static1: Reconfigured 2017-02-17 08:49:52 <INFO> Reconfigured 2017-02-17 08:49:52 <INFO> Reconfiguration requested by SIGHUP 2017-02-17 08:49:52 <INFO> Reconfiguring 2017-02-17 08:49:52 <TRACE> device1: Reconfigured 2017-02-17 08:49:52 <TRACE> direct1: Reconfigured Segmentation fault === The crash goes away if I apply this patch (not the right fix, I'm sure): === diff --git a/proto/static/static.c b/proto/static/static.c index 0c088cd..a56771f 100644 --- a/proto/static/static.c +++ b/proto/static/static.c @@ -498,7 +498,8 @@ static_same_dest(struct static_route *x, struct static_route *y) static inline int static_same_rte(struct static_route *x, struct static_route *y) { - return static_same_dest(x, y) && i_same(x->cmds, y->cmds); +// return static_same_dest(x, y) && i_same(x->cmds, y->cmds); + return 0; } === I think that the reconfiguration code finds the old and the new implicit filter expression to be the same code, and therefore hangs onto the old version, but then the underlying memory for some of the old objects gets freed. Or something like that. Running unmodified bird 1.6.3 under valgrind with the config above and then sending it a bunch of SIGHUPs produces plenty of crap on the second reconfigure (suggesting that the first reconfigure screws things up): === [...] 2017-02-17 08:59:25 <TRACE> direct1: Reconfigured ==3404== Invalid read of size 2 ==3404== at 0x805A499: i_same (filter.c:1539) ==3404== by 0x805A893: i_same (filter.c:1632) ==3404== by 0x8085D4C: static_match.isra.13 (static.c:501) ==3404== by 0x8085F95: static_reconfigure (static.c:557) ==3404== by 0x8053FB4: protos_commit (proto.c:420) ==3404== by 0x8095F42: config_do_commit (conf.c:271) ==3404== by 0x8096172: config_commit (conf.c:361) ==3404== by 0x809D5C3: async_config (main.c:238) ==3404== by 0x8099B37: io_loop (io.c:2135) ==3404== by 0x804A733: main (main.c:885) ==3404== Address 0x4217b5e is 3,606 bytes inside a block of size 4,088 free'd ==3404== at 0x4028F4F: free (vg_replace_malloc.c:446) ==3404== by 0x809E3E9: lp_free (mempool.c:211) ==3404== by 0x80A2020: pool_free (resource.c:81) ==3404== by 0x80A20AF: rfree (resource.c:165) ==3404== by 0x8095E30: config_do_commit (conf.c:250) ==3404== by 0x8096172: config_commit (conf.c:361) ==3404== by 0x809D5C3: async_config (main.c:238) ==3404== by 0x8099B37: io_loop (io.c:2135) ==3404== by 0x804A733: main (main.c:885) ==3404== ==3404== Invalid read of size 2 ==3404== at 0x805A4A7: i_same (filter.c:1541) ==3404== by 0x805A893: i_same (filter.c:1632) ==3404== by 0x8085D4C: static_match.isra.13 (static.c:501) ==3404== by 0x8085F95: static_reconfigure (static.c:557) ==3404== by 0x8053FB4: protos_commit (proto.c:420) ==3404== by 0x8095F42: config_do_commit (conf.c:271) ==3404== by 0x8096172: config_commit (conf.c:361) ==3404== by 0x809D5C3: async_config (main.c:238) ==3404== by 0x8099B37: io_loop (io.c:2135) ==3404== by 0x804A733: main (main.c:885) ==3404== Address 0x4217b5c is 3,604 bytes inside a block of size 4,088 free'd ==3404== at 0x4028F4F: free (vg_replace_malloc.c:446) ==3404== by 0x809E3E9: lp_free (mempool.c:211) ==3404== by 0x80A2020: pool_free (resource.c:81) ==3404== by 0x80A20AF: rfree (resource.c:165) ==3404== by 0x8095E30: config_do_commit (conf.c:250) ==3404== by 0x8096172: config_commit (conf.c:361) ==3404== by 0x809D5C3: async_config (main.c:238) ==3404== by 0x8099B37: io_loop (io.c:2135) ==3404== by 0x804A733: main (main.c:885) [...] === Does this make any sense? I can dig further but I was hoping that this would be enough information for you to figure out what's going on and how to best fix it. :-) Thank you very much in advance! Cheers, Lennert
On Fri, Feb 17, 2017 at 09:06:55AM +0200, Lennert Buytenhek wrote:
Hello!
I work with João and I managed to cut our (huge) config down to a minimal reproducer. Our setup does something like this:
Hi Thanks for trying to track the problem.
There seems to be a problem related to the memory management of strings used in per static route implicit filter expressions when a reconfiguration happens.
Does the problem disappear if you replace strings with numbers in your test case?
The crash goes away if I apply this patch (not the right fix, I'm sure):
=== diff --git a/proto/static/static.c b/proto/static/static.c index 0c088cd..a56771f 100644 --- a/proto/static/static.c +++ b/proto/static/static.c @@ -498,7 +498,8 @@ static_same_dest(struct static_route *x, struct static_route *y) static inline int static_same_rte(struct static_route *x, struct static_route *y) { - return static_same_dest(x, y) && i_same(x->cmds, y->cmds); +// return static_same_dest(x, y) && i_same(x->cmds, y->cmds); + return 0; }
===
I think that the reconfiguration code finds the old and the new implicit filter expression to be the same code, and therefore hangs onto the old version, but then the underlying memory for some of the old objects gets freed. Or something like that.
Well, the comparison in static_same_rte() is just for decision whether the route should be propagated again to the (main) routing table. Internal structures of static protocols (like struct static_route) are replaced anyways. -- Elen sila lumenn' omentielvo Ondrej 'Santiago' Zajicek (email: santiago@crfreenet.org) OpenPGP encrypted e-mails preferred (KeyID 0x11DEADC3, wwwkeys.pgp.net) "To err is human -- to blame it on a computer is even more so."
On Fri, Feb 17, 2017 at 02:22:11PM +0100, Ondrej Zajicek wrote:
I work with João and I managed to cut our (huge) config down to a minimal reproducer. Our setup does something like this:
Hi
Thanks for trying to track the problem.
It was fun to do. :-) Thanks for the reply!
There seems to be a problem related to the memory management of strings used in per static route implicit filter expressions when a reconfiguration happens.
Does the problem disappear if you replace strings with numbers in your test case?
I had another look at this, and it's a lot easier to make it crash with strings for some reason, but if you look purely at valgrind warnings, it seems that you can trigger those without using strings. Sorry for the initial misinformation. I reproduced the valgrind warnings with this config file: === router id 1.2.3.4; log stderr all; debug protocols all; protocol device { scan time 3600; } function function1() { print 1234; } protocol static { route 1234::/32 via "lo" { function1(); }; } === When I run this with the following script: === #!/bin/sh valgrind ./bird -c b.conf -d & sleep 0.75 kill -HUP %1 sleep 0.1 kill -HUP %1 sleep 0.1 kill %1 === I get this (again, warnings on the second reconfigure): === # ./test.sh ==15175== Memcheck, a memory error detector ==15175== Copyright (C) 2002-2012, and GNU GPL'd, by Julian Seward et al. ==15175== Using Valgrind-3.8.1 and LibVEX; rerun with -h for copyright info ==15175== Command: ./bird -c b.conf -d ==15175== 2017-02-17 21:39:52 <TRACE> device1: Initializing 2017-02-17 21:39:52 <TRACE> static1: Initializing 2017-02-17 21:39:52 <TRACE> device1: Starting 2017-02-17 21:39:52 <TRACE> device1: Scanning interfaces 2017-02-17 21:39:52 <TRACE> device1: Connected to table master 2017-02-17 21:39:52 <TRACE> device1: State changed to feed 2017-02-17 21:39:52 <TRACE> static1: Starting 2017-02-17 21:39:52 <TRACE> static1: Connected to table master 2017-02-17 21:39:52 <TRACE> static1: State changed to feed 2017-02-17 21:39:52 <INFO> Started 2017-02-17 21:39:52 <TRACE> device1: State changed to up 2017-02-17 21:39:52 <TRACE> static1 < interface lo goes up 2017-02-17 21:39:52 <INFO> 1234 2017-02-17 21:39:52 <TRACE> static1 > added [best] 1234::/32 dev lo 2017-02-17 21:39:52 <TRACE> static1 < interface eth0 goes up 2017-02-17 21:39:52 <TRACE> static1: State changed to up 2017-02-17 21:39:52 <INFO> Reconfiguration requested by SIGHUP 2017-02-17 21:39:52 <INFO> Reconfiguring 2017-02-17 21:39:52 <TRACE> device1: Reconfigured 2017-02-17 21:39:52 <TRACE> static1: Reconfigured 2017-02-17 21:39:52 <INFO> Reconfigured 2017-02-17 21:39:52 <INFO> Reconfiguration requested by SIGHUP 2017-02-17 21:39:52 <INFO> Reconfiguring 2017-02-17 21:39:52 <TRACE> device1: Reconfigured ==15175== Invalid read of size 2 ==15175== at 0x805A499: i_same (filter.c:1539) ==15175== by 0x805A893: i_same (filter.c:1632) ==15175== by 0x8085D4C: static_match.isra.13 (static.c:501) ==15175== by 0x8085F95: static_reconfigure (static.c:557) ==15175== by 0x8053FB4: protos_commit (proto.c:420) ==15175== by 0x8095F42: config_do_commit (conf.c:271) ==15175== by 0x8096172: config_commit (conf.c:361) ==15175== by 0x809D5C3: async_config (main.c:238) ==15175== by 0x8099B37: io_loop (io.c:2135) ==15175== by 0x804A733: main (main.c:885) ==15175== Address 0x4217a46 is 3,326 bytes inside a block of size 4,088 free'd ==15175== at 0x4028F4F: free (vg_replace_malloc.c:446) ==15175== by 0x809E3E9: lp_free (mempool.c:211) ==15175== by 0x80A2020: pool_free (resource.c:81) ==15175== by 0x80A20AF: rfree (resource.c:165) ==15175== by 0x8095E30: config_do_commit (conf.c:250) ==15175== by 0x8096172: config_commit (conf.c:361) ==15175== by 0x809D5C3: async_config (main.c:238) ==15175== by 0x8099B37: io_loop (io.c:2135) ==15175== by 0x804A733: main (main.c:885) ==15175== ==15175== Invalid read of size 2 ==15175== at 0x805A4A7: i_same (filter.c:1541) ==15175== by 0x805A893: i_same (filter.c:1632) ==15175== by 0x8085D4C: static_match.isra.13 (static.c:501) ==15175== by 0x8085F95: static_reconfigure (static.c:557) ==15175== by 0x8053FB4: protos_commit (proto.c:420) ==15175== by 0x8095F42: config_do_commit (conf.c:271) ==15175== by 0x8096172: config_commit (conf.c:361) ==15175== by 0x809D5C3: async_config (main.c:238) ==15175== by 0x8099B37: io_loop (io.c:2135) ==15175== by 0x804A733: main (main.c:885) ==15175== Address 0x4217a44 is 3,324 bytes inside a block of size 4,088 free'd ==15175== at 0x4028F4F: free (vg_replace_malloc.c:446) ==15175== by 0x809E3E9: lp_free (mempool.c:211) ==15175== by 0x80A2020: pool_free (resource.c:81) ==15175== by 0x80A20AF: rfree (resource.c:165) ==15175== by 0x8095E30: config_do_commit (conf.c:250) ==15175== by 0x8096172: config_commit (conf.c:361) ==15175== by 0x809D5C3: async_config (main.c:238) ==15175== by 0x8099B37: io_loop (io.c:2135) ==15175== by 0x804A733: main (main.c:885) ==15175== ==15175== Invalid read of size 4 ==15175== at 0x805A677: i_same (filter.c:1626) ==15175== by 0x805A893: i_same (filter.c:1632) ==15175== by 0x8085D4C: static_match.isra.13 (static.c:501) ==15175== by 0x8085F95: static_reconfigure (static.c:557) ==15175== by 0x8053FB4: protos_commit (proto.c:420) ==15175== by 0x8095F42: config_do_commit (conf.c:271) ==15175== by 0x8096172: config_commit (conf.c:361) ==15175== by 0x809D5C3: async_config (main.c:238) ==15175== by 0x8099B37: io_loop (io.c:2135) ==15175== by 0x804A733: main (main.c:885) ==15175== Address 0x4217a48 is 3,328 bytes inside a block of size 4,088 free'd ==15175== at 0x4028F4F: free (vg_replace_malloc.c:446) ==15175== by 0x809E3E9: lp_free (mempool.c:211) ==15175== by 0x80A2020: pool_free (resource.c:81) ==15175== by 0x80A20AF: rfree (resource.c:165) ==15175== by 0x8095E30: config_do_commit (conf.c:250) ==15175== by 0x8096172: config_commit (conf.c:361) ==15175== by 0x809D5C3: async_config (main.c:238) ==15175== by 0x8099B37: io_loop (io.c:2135) ==15175== by 0x804A733: main (main.c:885) [...] === This is with a build for an older distro for i686 (32 bit), but I can also reproduce the issue on the newest stock distro bird package (1.6.3) that ships with Fedora 25 for x86_64: === [...] 2017-02-17 21:51:32 <INFO> Reconfigured 2017-02-17 21:51:32 <INFO> Reconfiguration requested by SIGHUP 2017-02-17 21:51:32 <INFO> Reconfiguring 2017-02-17 21:51:32 <TRACE> device1: Reconfigured ==15956== Invalid read of size 2 ==15956== at 0x1201FA: ??? (in /usr/sbin/bird) ==15956== by 0x120669: ??? (in /usr/sbin/bird) ==15956== by 0x142FCC: ??? (in /usr/sbin/bird) ==15956== by 0x143194: ??? (in /usr/sbin/bird) ==15956== by 0x119A2C: ??? (in /usr/sbin/bird) ==15956== by 0x14E934: ??? (in /usr/sbin/bird) ==15956== by 0x14ED5A: ??? (in /usr/sbin/bird) ==15956== by 0x15625F: ??? (in /usr/sbin/bird) ==15956== by 0x15263F: ??? (in /usr/sbin/bird) ==15956== by 0x1105FD: ??? (in /usr/sbin/bird) ==15956== by 0x5078400: (below main) (in /usr/lib64/libc-2.24.so) ==15956== Address 0x542846a is 1,338 bytes inside a block of size 4,096 free'd ==15956== at 0x4C2ED4A: free (vg_replace_malloc.c:530) ==15956== by 0x15719F: ??? (in /usr/sbin/bird) ==15956== by 0x15AF21: ??? (in /usr/sbin/bird) ==15956== by 0x15AFB1: ??? (in /usr/sbin/bird) ==15956== by 0x14E84C: ??? (in /usr/sbin/bird) ==15956== by 0x14ED5A: ??? (in /usr/sbin/bird) ==15956== by 0x15625F: ??? (in /usr/sbin/bird) ==15956== by 0x15263F: ??? (in /usr/sbin/bird) ==15956== by 0x1105FD: ??? (in /usr/sbin/bird) ==15956== by 0x5078400: (below main) (in /usr/lib64/libc-2.24.so) ==15956== Block was alloc'd at ==15956== at 0x4C2DB9D: malloc (vg_replace_malloc.c:299) ==15956== by 0x15DD3A: ??? (in /usr/sbin/bird) ==15956== by 0x15730A: ??? (in /usr/sbin/bird) ==15956== by 0x15735C: ??? (in /usr/sbin/bird) ==15956== by 0x15607E: ??? (in /usr/sbin/bird) ==15956== by 0x156198: ??? (in /usr/sbin/bird) ==15956== by 0x14F206: ??? (in /usr/sbin/bird) ==15956== by 0x155AE6: ??? (in /usr/sbin/bird) ==15956== by 0x1103C6: ??? (in /usr/sbin/bird) ==15956== by 0x5078400: (below main) (in /usr/lib64/libc-2.24.so) [...] ===
The crash goes away if I apply this patch (not the right fix, I'm sure):
=== diff --git a/proto/static/static.c b/proto/static/static.c index 0c088cd..a56771f 100644 --- a/proto/static/static.c +++ b/proto/static/static.c @@ -498,7 +498,8 @@ static_same_dest(struct static_route *x, struct static_route *y) static inline int static_same_rte(struct static_route *x, struct static_route *y) { - return static_same_dest(x, y) && i_same(x->cmds, y->cmds); +// return static_same_dest(x, y) && i_same(x->cmds, y->cmds); + return 0; }
===
I think that the reconfiguration code finds the old and the new implicit filter expression to be the same code, and therefore hangs onto the old version, but then the underlying memory for some of the old objects gets freed. Or something like that.
Well, the comparison in static_same_rte() is just for decision whether the route should be propagated again to the (main) routing table.
Internal structures of static protocols (like struct static_route) are replaced anyways.
I see. Well, I was hoping to avoid going too deeply into bird internals (I've already been labeled our resident ornithologist), but I can dig deeper if the problem doesn't reproduce on your end for some reason. Thanks again! Cheers, Lennert
On Fri, Feb 17, 2017 at 09:57:41PM +0200, Lennert Buytenhek wrote:
On Fri, Feb 17, 2017 at 02:22:11PM +0100, Ondrej Zajicek wrote:
I think that the reconfiguration code finds the old and the new implicit filter expression to be the same code, and therefore hangs onto the old version, but then the underlying memory for some of the old objects gets freed. Or something like that.
Well, the comparison in static_same_rte() is just for decision whether the route should be propagated again to the (main) routing table.
Internal structures of static protocols (like struct static_route) are replaced anyways.
I see. Well, I was hoping to avoid going too deeply into bird internals (I've already been labeled our resident ornithologist), but I can dig deeper if the problem doesn't reproduce on your end for some reason.
I could not reproduce the crash, but i can reproduce the valgrind warnings. I will check it. -- Elen sila lumenn' omentielvo Ondrej 'Santiago' Zajicek (email: santiago@crfreenet.org) OpenPGP encrypted e-mails preferred (KeyID 0x11DEADC3, wwwkeys.pgp.net) "To err is human -- to blame it on a computer is even more so."
On Fri, Feb 17, 2017 at 09:20:45PM +0100, Ondrej Zajicek wrote:
On Fri, Feb 17, 2017 at 09:57:41PM +0200, Lennert Buytenhek wrote:
On Fri, Feb 17, 2017 at 02:22:11PM +0100, Ondrej Zajicek wrote:
I think that the reconfiguration code finds the old and the new implicit filter expression to be the same code, and therefore hangs onto the old version, but then the underlying memory for some of the old objects gets freed. Or something like that.
Well, the comparison in static_same_rte() is just for decision whether the route should be propagated again to the (main) routing table.
Internal structures of static protocols (like struct static_route) are replaced anyways.
I see. Well, I was hoping to avoid going too deeply into bird internals (I've already been labeled our resident ornithologist), but I can dig deeper if the problem doesn't reproduce on your end for some reason.
I could not reproduce the crash, but i can reproduce the valgrind warnings. I will check it.
Hi Here is the bugfix. The bug was caused by some arcane rewriting behavior in i_same() that works when called with argument order (new, old), but static protocol called it as (old, new). Note that the bug is specific to using functions in static routes. -- Elen sila lumenn' omentielvo Ondrej 'Santiago' Zajicek (email: santiago@crfreenet.org) OpenPGP encrypted e-mails preferred (KeyID 0x11DEADC3, wwwkeys.pgp.net) "To err is human -- to blame it on a computer is even more so."
On Fri, Feb 17, 2017 at 10:00:05PM +0100, Ondrej Zajicek wrote:
I think that the reconfiguration code finds the old and the new implicit filter expression to be the same code, and therefore hangs onto the old version, but then the underlying memory for some of the old objects gets freed. Or something like that.
Well, the comparison in static_same_rte() is just for decision whether the route should be propagated again to the (main) routing table.
Internal structures of static protocols (like struct static_route) are replaced anyways.
I see. Well, I was hoping to avoid going too deeply into bird internals (I've already been labeled our resident ornithologist), but I can dig deeper if the problem doesn't reproduce on your end for some reason.
I could not reproduce the crash, but i can reproduce the valgrind warnings. I will check it.
Hi
Hello!
Here is the bugfix. The bug was caused by some arcane rewriting behavior in i_same() that works when called with argument order (new, old), but static protocol called it as (old, new).
Note that the bug is specific to using functions in static routes.
-- Elen sila lumenn' omentielvo
Ondrej 'Santiago' Zajicek (email: santiago@crfreenet.org) OpenPGP encrypted e-mails preferred (KeyID 0x11DEADC3, wwwkeys.pgp.net) "To err is human -- to blame it on a computer is even more so."
diff --git a/proto/static/static.c b/proto/static/static.c index 0c088cd..849067b 100644 --- a/proto/static/static.c +++ b/proto/static/static.c @@ -498,7 +498,8 @@ static_same_dest(struct static_route *x, struct static_route *y) static inline int static_same_rte(struct static_route *x, struct static_route *y) { - return static_same_dest(x, y) && i_same(x->cmds, y->cmds); + /* Note that i_same() requires arguments in (new, old) order */ + return static_same_dest(x, y) && i_same(y->cmds, x->cmds); }
I can confirm that this fixes the issue for me. Many thanks!
participants (3)
-
João Taveira Araújo -
Lennert Buytenhek -
Ondrej Zajicek