bird-1.5.0 ipv4 segfaults on configure - is it safe to change bgp protocol names runtime?
Hi bird users,

General question: is it safe to change a BGP protocol name at runtime?

We're running two instances of bird 1.5.0 on PLD Linux boxes. Several days ago we upgraded our two routers to 1.5.0 and made some changes in our configuration. There were no problems for about a week. Then we decided to tidy up our BGP protocol names (a few of them), and the problem occurred on both instances.

First instance, just after configure:

    bird[30925]: segfault at 31 ip 000000000040e5d9 sp 00007fffac0862f8 error 6 in bird[400000+64000]

Then, on the other instance, ~15 hours after configure:

    bird[27708]: segfault at 41 ip 000000000040e5c3 sp 00007fff39e55a38 error 6 in bird[400000+64000]

I can reproduce the error that occurs at configure time in my local environment; however, there are no established sessions at all in this environment, so I'm not sure the errors are the same. Here's the backtrace:

    bird: F_1_0901_NEW_NAME: Initializing
    bird: F_1_0901_NEW_NAME: Starting
    bird: F_1_0901_NEW_NAME: State changed to start
    bird: F_1_0901_OLD_NAME: State changed to down

    Program received signal SIGSEGV, Segmentation fault.
    0x0000000000455bfb in ?? ()
    (gdb) bt
    #0  0x0000000000455bfb in ?? ()
    #1  0x000000000040e4ee in olock_run_event (unused=<optimized out>) at ../../nest/locks.c:177
    #2  0x000000000043b76e in ev_run (e=0x66c010) at event.c:85
    #3  ev_run_list (l=0x66b2e0 <global_event_list>) at event.c:142
    #4  0x000000000043de3c in io_loop () at io.c:2061
    #5  0x00000000004031d3 in main (argc=<optimized out>, argv=<optimized out>) at main.c:833

Then I used a much simpler config (just one BGP session) in my local environment. After the third protocol name change and reconfiguration, another error occurred:

    bird: Removing protocol one_SOME_LONGER
    bird: one_SOME_LONGER: Shutting down
    bird: one_SOME_LONGER: Shutdown requested
    bird: one_SOME_LONGER: State changed to stop
    bird: Adding protocol one_SOME_LONGER_NAME
    bird: one_SOME_LONGER_NAME: Initializing
    bird: one_SOME_LONGER_NAME: Starting
    bird: one_SOME_LONGER_NAME: State changed to start
    bird: one_SOME_LONGER: Down

    Program received signal SIGSEGV, Segmentation fault.
    olock_free (r=0x6751b0) at ../../nest/locks.c:72
    72          rem_node(n);
    (gdb) bt
    #0  olock_free (r=0x6751b0) at ../../nest/locks.c:72
    #1  0x0000000000445752 in pool_free (P=<optimized out>) at resource.c:81
    #2  0x00000000004457c3 in rfree (res=0x674830) at resource.c:165
    #3  0x000000000040ae8f in proto_notify_state (p=0x674da0, ps=<optimized out>) at ../../nest/proto.c:1387
    #4  0x000000000043b76e in ev_run (e=0x675120) at event.c:85
    #5  ev_run_list (l=0x66b2e0 <global_event_list>) at event.c:142
    #6  0x000000000043de3c in io_loop () at io.c:2061
    #7  0x00000000004031d3 in main (argc=<optimized out>, argv=<optimized out>) at main.c:833

Another protocol name change triggered another error:

    bird: Removing protocol one_SOME_LONGER_NAME_2
    bird: one_SOME_LONGER_NAME_2: Shutting down
    bird: one_SOME_LONGER_NAME_2: Shutdown requested
    bird: one_SOME_LONGER_NAME_2: State changed to stop
    bird: Adding protocol one_SOME_LONGER_NAME_2_3
    bird: one_SOME_LONGER_NAME_2_3: Initializing
    bird: one_SOME_LONGER_NAME_2_3: Starting
    bird: one_SOME_LONGER_NAME_2_3: State changed to start
    bird: one_SOME_LONGER_NAME_2: Down
    bird: one_SOME_LONGER_NAME_2: State changed to down
    bird: Reconfigured

    Program received signal SIGSEGV, Segmentation fault.
    0x000000000044574f in pool_free (P=<optimized out>) at resource.c:81
    81          r->class->free(r);
    (gdb) bt
    #0  0x000000000044574f in pool_free (P=<optimized out>) at resource.c:81
    #1  0x00000000004457c3 in rfree (res=0x6807c0) at resource.c:165
    #2  0x000000000043d94b in sk_read (s=s@entry=0x680660) at io.c:1786
    #3  0x000000000043e23c in io_loop () at io.c:2158
    #4  0x00000000004031d3 in main (argc=<optimized out>, argv=<optimized out>) at main.c:833

Further details, including core files, can be provided if needed.

--
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?

Best regards,
Bartosz Radwan
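Note for readers decoding the two kernel "segfault at ..." lines in the message above: the following is a minimal Python sketch, not part of the original post. It assumes the usual Linux x86 format, where every number on the line is hexadecimal and the value after "error" is the page-fault error code (bit 0: protection violation, bit 1: write access, bit 2: user mode). The names `decode_segfault`, `SEGFAULT_RE`, and the `PF_*` constants are made up for this illustration.

```python
import re

# Bits of the x86 page-fault error code printed by the kernel after "error".
PF_PROT = 0x1   # set: protection violation; clear: page not present
PF_WRITE = 0x2  # set: write access; clear: read access
PF_USER = 0x4   # set: fault raised while running in user mode

# Hypothetical helper pattern for lines such as:
#   bird[30925]: segfault at 31 ip 000000000040e5d9 sp ... error 6 in bird[400000+64000]
SEGFAULT_RE = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) "
    r"sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+) "
    r"in (?P<obj>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def decode_segfault(line):
    """Decode one kernel segfault line into a small dictionary."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        raise ValueError("not a kernel segfault line: %r" % line)
    err = int(m.group("err"), 16)
    ip = int(m.group("ip"), 16)
    base = int(m.group("base"), 16)
    return {
        "fault_address": int(m.group("addr"), 16),
        # Offset of the faulting instruction inside the mapped object.
        "ip_offset": ip - base,
        "access": "write" if err & PF_WRITE else "read",
        "mode": "user" if err & PF_USER else "kernel",
        "cause": "protection violation" if err & PF_PROT else "page not present",
    }


first = decode_segfault(
    "bird[30925]: segfault at 31 ip 000000000040e5d9 "
    "sp 00007fffac0862f8 error 6 in bird[400000+64000]"
)
print(first)
```

Read this way, both reported crashes are user-mode writes to an unmapped page at a very small address (0x31 and 0x41).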
On Fri, Mar 18, 2016 at 08:36:53AM +0100, Bartosz Radwan wrote:
> Hi bird users,
> General question: is it safe to change a BGP protocol name at runtime?

It is expected to be safe. However, because protocol names are used as keys to identify and match protocols, you cannot really rename a protocol: with a different name, a new protocol is created and the old one is removed.
--
Elen sila lumenn' omentielvo

Ondrej 'Santiago' Zajicek (email: santiago@crfreenet.org)
OpenPGP encrypted e-mails preferred (KeyID 0x11DEADC3, wwwkeys.pgp.net)
"To err is human -- to blame it on a computer is even more so."
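Note on the reply above: the point that protocol names act as keys can be pictured with a small Python model. This is only an illustrative sketch, not BIRD's actual reconfiguration code; `plan_reconfigure` and the dictionary layout are assumptions made for the example. It shows why a "rename" shows up in the logs as "Removing protocol ..." followed by "Adding protocol ...".

```python
def plan_reconfigure(old_protocols, new_protocols):
    """Match protocols between two configs purely by name.

    Both arguments map a protocol name to its settings. A name that exists
    only in the old config is removed, a name that exists only in the new
    config is added, and a name present in both is reconfigured in place.
    """
    removed = [name for name in old_protocols if name not in new_protocols]
    added = [name for name in new_protocols if name not in old_protocols]
    kept = [name for name in old_protocols if name in new_protocols]
    return removed, added, kept


# Same BGP settings, only the name differs between the two configs.
bgp_settings = {"local": "10.13.0.1 as 65532", "neighbor": "10.13.0.2 as 65531"}
old = {"direct1": {}, "one_SOME_LONGER": bgp_settings}
new = {"direct1": {}, "one_SOME_LONGER_NAME": bgp_settings}

removed, added, kept = plan_reconfigure(old, new)
print("removed:", removed)
print("added:", added)
print("kept:", kept)
```

In this model the identical settings do not matter: because matching is by name only, the old instance is shut down and a fresh one is started.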
On Fri, Mar 18, 2016 at 09:55:11AM +0100, Ondrej Zajicek wrote:
> On Fri, Mar 18, 2016 at 08:36:53AM +0100, Bartosz Radwan wrote:
>> Hi bird users,
>> General question: is it safe to change a BGP protocol name at runtime?
>
> It is expected to be safe. However, because protocol names are used as keys to identify and match protocols, you cannot really rename a protocol: with a different name, a new protocol is created and the old one is removed.
>
>> I can reproduce the error that occurs at configure time in my local environment; however, there are no established sessions at all in this environment, so I'm not sure the errors are the same.
>> ...
>> Further details, including core files, can be provided if needed.

I just noticed that the second half of my previous post is missing. Could you send me the core dumps and the binary?

--
Ondrej 'Santiago' Zajicek
On 18.03.2016 10:46, Ondrej Zajicek wrote:
> I just noticed that the second half of my previous post is missing. Could you send me the core dumps and the binary?

I've just sent them to santiago@crfreenet.org.

--
Best regards,
Bartosz Radwan
Hi!

On 03/18/2016 08:36 AM, Bartosz Radwan wrote:
> Hi bird users,
> General question: is it safe to change a BGP protocol name at runtime?

It generally should be. I just assume that Bird doesn't know about renames; therefore, after configure, it simply kills the old BGP instance and starts a new instance of the protocol.
> I can reproduce the error that occurs at configure time in my local environment; however, there are no established sessions at all in this environment, so I'm not sure the errors are the same.

Can you please send in a reduced config set that still causes this bug? I'd like to reproduce it locally and then fix it. Thanks a lot.
On 18.03.2016 10:07, Jan Matejka wrote:
> Can you please send in a reduced config set that still causes this bug? I'd like to reproduce it locally and then fix it.
> Thanks a lot.

Minimal config causing the errors:
    protocol direct {
        interface "br0.*", "br1", "br1.*";
        debug all;
    }

    protocol kernel {
        learn;
        persist;
        scan time 1;
        export all;
    }

    protocol device {
        scan time 1;
    }

    protocol bgp one {
        router id 10.13.0.31;
        debug all;
        next hop self;
        direct;
        import none;
        export filter {
            if net = 10.10.10.0/24 then { accept; accept; }
            reject;
        };
        local 10.13.0.1 as 65532;
        neighbor 10.13.0.2 as 65531;
        check link 1;
    }

Testing this one I got another segfault; it seems quite random:

    bird: Reconfiguring
    bird: direct1: Reconfigured
    bird: Removing protocol one_SOME_LONGER_NAME_2_3
    bird: one_SOME_LONGER_NAME_2_3: Shutting down
    bird: one_SOME_LONGER_NAME_2_3: Shutdown requested
    bird: one_SOME_LONGER_NAME_2_3: State changed to stop
    bird: Adding protocol one
    bird: one: Initializing
    bird: one: Starting
    bird: one: State changed to start
    bird: one_SOME_LONGER_NAME_2_3: Down
    bird: one_SOME_LONGER_NAME_2_3: State changed to down
    bird: Reconfigured

    Program received signal SIGSEGV, Segmentation fault.
    0x000000000044574f in pool_free (P=<optimized out>) at resource.c:81
    81          r->class->free(r);
    (gdb) bt
    #0  0x000000000044574f in pool_free (P=<optimized out>) at resource.c:81
    #1  0x00000000004457c3 in rfree (res=0x6807c0) at resource.c:165
    #2  0x000000000043a89d in config_free (c=<optimized out>) at conf.c:174
    #3  0x000000000043a8d3 in config_do_commit (c=c@entry=0x68db10, type=type@entry=1) at conf.c:230
    #4  0x000000000043ac54 in config_commit (c=0x68db10, type=type@entry=1, timeout=timeout@entry=0) at conf.c:348
    #5  0x0000000000441997 in cmd_reconfig (name=<optimized out>, type=type@entry=1, timeout=0) at main.c:312
    #6  0x0000000000437245 in cf_parse () at cf-parse.y:917
    #7  0x000000000043a87f in cli_parse (c=c@entry=0x7fffffffdb20) at conf.c:159
    #8  0x000000000040de34 in cli_command (c=c@entry=0x66eb30) at ../../nest/cli.c:270
    #9  0x000000000040e00c in cli_event (data=0x66eb30) at ../../nest/cli.c:297
    #10 0x000000000043b76e in ev_run (e=0x66db10) at event.c:85
    #11 ev_run_list (l=0x66b2e0 <global_event_list>) at event.c:142
    #12 0x000000000043de3c in io_loop () at io.c:2061
    #13 0x00000000004031d3 in main (argc=<optimized out>, argv=<optimized out>) at main.c:833

I'm not sure it's a bird error at all. If you cannot reproduce these errors, that will be a hint for me, and I'll start investigating the linked libraries etc. Thanks a lot.

--
Best regards,
Bartosz Radwan
On 03/18/2016 10:28 AM, Bartosz Radwan wrote:
> On 18.03.2016 10:07, Jan Matejka wrote:
>> Can you please send in a reduced config set that still causes this bug? I'd like to reproduce it locally and then fix it.
>> Thanks a lot.
>
> Minimal config causing the errors:
Thanks!

> Testing this one I got another segfault; it seems quite random:

There may be some broken code that stomps over random memory like a bull in a china shop, which causes a segfault ... sometimes.

JM
On 03/18/2016 10:44 AM, Jan Matejka wrote:
> Thanks!

I'm unable to reproduce it with a freshly built bird-1.5.0 from Git. But I don't have any traffic on the BGP session, and the segfault may come from that (we know about another issue which may be related). Maybe … does it occur even if you just restart the BGP protocol from birdc?

JM
On 18.03.2016 11:05, Jan Matejka wrote:
> I'm unable to reproduce it with a freshly built bird-1.5.0 from Git. But I don't have any traffic on the BGP session, and the segfault may come from that (we know about another issue which may be related). Maybe … does it occur even if you just restart the BGP protocol from birdc?

This happens even without any established BGP session at all; a configured session is enough. It seems that removing or adding a protocol, possibly with the same settings and just a different name, is the problem here. I'll keep investigating and let the list know when I find out anything new.

--
Best regards,
Bartosz Radwan
On 18.03.2016 11:05, Jan Matejka wrote:
> Thanks!
> I'm unable to reproduce it with a freshly built bird-1.5.0 from Git.

This issue is PLD-specific: the PLD CFLAGS lack -fno-strict-aliasing, which was added in https://gitlab.labs.nic.cz/labs/bird/commit/efd6d12b975441c7e1875a59dd9e0f3d...

--
Best regards,
Bartosz Radwan
On 18.03.2016 10:07, Jan Matejka wrote:
> Can you please send in a reduced config set that still causes this bug? I'd like to reproduce it locally and then fix it.

Some additional info: I used this one-liner:

    while :; do birdc configure ; sleep 1 ; done

and was playing with the protocol name; a crash happens every 1…3 name changes. I also played with other parameters of the existing configured protocols: no crash at all in ~20 changes.

--
Best regards,
Bartosz Radwan
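Note on the reproduction procedure described above: the manual "change the name, then `birdc configure`" cycle could be scripted roughly as follows. This is a hedged Python sketch, not something posted in the thread; `rename_bgp_protocol`, `reproduce`, the config path, and the list of names are all assumptions for illustration. Only `birdc configure` comes from the original message.

```python
import re
import subprocess
import time


def rename_bgp_protocol(config_text, old_name, new_name):
    """Return config_text with 'protocol bgp <old_name> {' renamed to new_name."""
    pattern = re.compile(
        r"(protocol\s+bgp\s+)" + re.escape(old_name) + r"(\s*\{)"
    )
    new_text, count = pattern.subn(
        lambda m: m.group(1) + new_name + m.group(2), config_text
    )
    if count != 1:
        raise ValueError(
            "expected exactly one 'protocol bgp %s' block, found %d"
            % (old_name, count)
        )
    return new_text


def reproduce(config_path, names, delay=1.0):
    """Rename the BGP protocol step by step, reloading after each change.

    names[0] must be the name currently used in the file at config_path
    (a hypothetical path chosen by the caller).
    """
    current = names[0]
    for new_name in names[1:]:
        with open(config_path) as f:
            text = f.read()
        with open(config_path, "w") as f:
            f.write(rename_bgp_protocol(text, current, new_name))
        # Ask the running daemon to reload its configuration.
        subprocess.run(["birdc", "configure"], check=False)
        current = new_name
        time.sleep(delay)


sample = "protocol device { scan time 1; }\nprotocol bgp one {\n    import none;\n}\n"
print(rename_bgp_protocol(sample, "one", "one_SOME_LONGER"))
```

A call such as `reproduce("bird.conf", ["one", "one_SOME_LONGER", "one_SOME_LONGER_NAME"])` would mirror the name sequence seen in the logs; the path here is a placeholder.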
participants (4)
- Bartosz Radwan
- Christian Tacke
- Jan Matejka
- Ondrej Zajicek