Bug? / Patch for BGP next hop issue with frr peers
Hi, let me preface this that I very much do not know what I am doing here, and have been somewhat unsuccessful in trying to understand what's going on by searching online. I would love an explanation though! In a bird 2.0.7 setup, I was unable to import routes from one of my peers. It is the only one using frr (version 6.02-2 on debian), most other peers use bird1 or bird2. I noticed immediately after adding the peering that I received "Invalid NEXT_HOP attribute" errors in the log. Unfortunately, no more information than that was provided in the log, so I went on a little hunt. Since the exact error is raised from many places, I added some logs to identify the GW_DIRECT case in bgp_apply_next_hop() as the culprit. Here I realized that both gw and ll are set, which means a neighbor is tried to be found which doesn't involve the ll case. I then noticed that in bgp_decode_next_hop_ip(), this can only happen if len == 32. This is where I am absolutely clueless what it means for the nh len to be 32, and thus I don't know if the patch I came up with is correct, even though it works for my testing. Only this one peer using frr causes nh len to be set to 32, so I suppose it might be a rare configuration. I added this simple patch: --- bird2-2.0.7.orig/proto/bgp/packets.c +++ bird2-2.0.7/proto/bgp/packets.c @@ -1174,6 +1174,9 @@ bgp_decode_next_hop_ip(struct bgp_parse_ if (ipa_is_ip4(nh[0]) || !ip6_is_link_local(nh[1])) nh[1] = IPA_NONE; + + if (ip6_is_link_local(nh[1])) + nh[0] = IPA_NONE; } else bgp_parse_error(s, 9); ~ which worked for me to resolve the problem. Thanks for any consideration! Cheers Sebastian
Sebastian - I cannot speak towards bird's behavior here but I can say that FRR has fixed a couple of nexthop related issues with what we send to our peers since the 6.0 release. I would please consider upgrading to a much later version if you can, 7.2 or 7.3 should have the fixes. thanks! donald On Sun, Apr 19, 2020 at 6:48 PM Sebastian Hahn <bird_users@sebastianhahn.net> wrote:
Hi,
let me preface this that I very much do not know what I am doing here, and have been somewhat unsuccessful in trying to understand what's going on by searching online. I would love an explanation though!
In a bird 2.0.7 setup, I was unable to import routes from one of my peers. It is the only one using frr (version 6.02-2 on debian), most other peers use bird1 or bird2. I noticed immediately after adding the peering that I received "Invalid NEXT_HOP attribute" errors in the log. Unfortunately, no more information than that was provided in the log, so I went on a little hunt. Since the exact error is raised from many places, I added some logs to identify the GW_DIRECT case in bgp_apply_next_hop() as the culprit. Here I realized that both gw and ll are set, which means a neighbor is tried to be found which doesn't involve the ll case. I then noticed that in bgp_decode_next_hop_ip(), this can only happen if len == 32. This is where I am absolutely clueless what it means for the nh len to be 32, and thus I don't know if the patch I came up with is correct, even though it works for my testing. Only this one peer using frr causes nh len to be set to 32, so I suppose it might be a ! rare configuration.
I added this simple patch:
--- bird2-2.0.7.orig/proto/bgp/packets.c +++ bird2-2.0.7/proto/bgp/packets.c @@ -1174,6 +1174,9 @@ bgp_decode_next_hop_ip(struct bgp_parse_
if (ipa_is_ip4(nh[0]) || !ip6_is_link_local(nh[1])) nh[1] = IPA_NONE; + + if (ip6_is_link_local(nh[1])) + nh[0] = IPA_NONE; } else bgp_parse_error(s, 9); ~
which worked for me to resolve the problem.
Thanks for any consideration!
Cheers Sebastian
Is this an ipv6 route? nh length 32 means both a global and link-local address is being set. RFC2545 section 3 On Sun, 19 Apr 2020 at 21:39, Donald Sharp <sharpd@cumulusnetworks.com> wrote:
Sebastian -
I cannot speak towards bird's behavior here but I can say that FRR has fixed a couple of nexthop related issues with what we send to our peers since the 6.0 release. I would please consider upgrading to a much later version if you can, 7.2 or 7.3 should have the fixes.
thanks!
donald
On Sun, Apr 19, 2020 at 6:48 PM Sebastian Hahn <bird_users@sebastianhahn.net> wrote:
Hi,
let me preface this that I very much do not know what I am doing here,
and have been somewhat unsuccessful in trying to understand what's going on by searching online. I would love an explanation though!
In a bird 2.0.7 setup, I was unable to import routes from one of my
peers. It is the only one using frr (version 6.02-2 on debian), most other peers use bird1 or bird2. I noticed immediately after adding the peering that I received "Invalid NEXT_HOP attribute" errors in the log. Unfortunately, no more information than that was provided in the log, so I went on a little hunt. Since the exact error is raised from many places, I added some logs to identify the GW_DIRECT case in bgp_apply_next_hop() as the culprit. Here I realized that both gw and ll are set, which means a neighbor is tried to be found which doesn't involve the ll case. I then noticed that in bgp_decode_next_hop_ip(), this can only happen if len == 32. This is where I am absolutely clueless what it means for the nh len to be 32, and thus I don't know if the patch I came up with is correct, even though it works for my testing. Only this one peer using frr causes nh len to be set to 32, so I suppose it might be ! a !
rare configuration.
I added this simple patch:
--- bird2-2.0.7.orig/proto/bgp/packets.c +++ bird2-2.0.7/proto/bgp/packets.c @@ -1174,6 +1174,9 @@ bgp_decode_next_hop_ip(struct bgp_parse_
if (ipa_is_ip4(nh[0]) || !ip6_is_link_local(nh[1])) nh[1] = IPA_NONE; + + if (ip6_is_link_local(nh[1])) + nh[0] = IPA_NONE; } else bgp_parse_error(s, 9); ~
which worked for me to resolve the problem.
Thanks for any consideration!
Cheers Sebastian
On 20. Apr 2020, at 04:02, Darren O'Connor <mellow.drifter@gmail.com> wrote:
Is this an ipv6 route? nh length 32 means both a global and link-local address is being set. RFC2545 section 3
Yes, this is indeed an ipv6 route. I am using many IPv6 routes with other peers, they do not hit the same code path. Thank you for the RFC pointer! Cheers Sebastian
On Sun, 19 Apr 2020 at 21:39, Donald Sharp <sharpd@cumulusnetworks.com> wrote: Sebastian -
I cannot speak towards bird's behavior here but I can say that FRR has fixed a couple of nexthop related issues with what we send to our peers since the 6.0 release. I would please consider upgrading to a much later version if you can, 7.2 or 7.3 should have the fixes.
thanks!
donald
On Sun, Apr 19, 2020 at 6:48 PM Sebastian Hahn <bird_users@sebastianhahn.net> wrote:
Hi,
let me preface this that I very much do not know what I am doing here, and have been somewhat unsuccessful in trying to understand what's going on by searching online. I would love an explanation though!
In a bird 2.0.7 setup, I was unable to import routes from one of my peers. It is the only one using frr (version 6.02-2 on debian), most other peers use bird1 or bird2. I noticed immediately after adding the peering that I received "Invalid NEXT_HOP attribute" errors in the log. Unfortunately, no more information than that was provided in the log, so I went on a little hunt. Since the exact error is raised from many places, I added some logs to identify the GW_DIRECT case in bgp_apply_next_hop() as the culprit. Here I realized that both gw and ll are set, which means a neighbor is tried to be found which doesn't involve the ll case. I then noticed that in bgp_decode_next_hop_ip(), this can only happen if len == 32. This is where I am absolutely clueless what it means for the nh len to be 32, and thus I don't know if the patch I came up with is correct, even though it works for my testing. Only this one peer using frr causes nh len to be set to 32, so I suppose it might be !
a !
rare configuration.
I added this simple patch:
--- bird2-2.0.7.orig/proto/bgp/packets.c +++ bird2-2.0.7/proto/bgp/packets.c @@ -1174,6 +1174,9 @@ bgp_decode_next_hop_ip(struct bgp_parse_
if (ipa_is_ip4(nh[0]) || !ip6_is_link_local(nh[1])) nh[1] = IPA_NONE; + + if (ip6_is_link_local(nh[1])) + nh[0] = IPA_NONE; } else bgp_parse_error(s, 9); ~
which worked for me to resolve the problem.
Thanks for any consideration!
Cheers Sebastian
On 20. Apr 2020, at 10:06, Sebastian Hahn <bird_users@sebastianhahn.net> wrote:
On 20. Apr 2020, at 04:02, Darren O'Connor <mellow.drifter@gmail.com> wrote:
Is this an ipv6 route? nh length 32 means both a global and link-local address is being set. RFC2545 section 3
Yes, this is indeed an ipv6 route. I am using many IPv6 routes with other peers, they do not hit the same code path. Thank you for the RFC pointer!
On Sun, 19 Apr 2020 at 21:39, Donald Sharp <sharpd@cumulusnetworks.com> wrote: I cannot speak towards bird's behavior here but I can say that FRR has fixed a couple of nexthop related issues with what we send to our peers since the 6.0 release. I would please consider upgrading to a much later version if you can, 7.2 or 7.3 should have the fixes.
On Sun, Apr 19, 2020 at 6:48 PM Sebastian Hahn
In a bird 2.0.7 setup, I was unable to import routes from one of my peers. It is the only one using frr (version 6.02-2 on debian), most other peers use bird1 or bird2. I noticed immediately after adding the peering that I received "Invalid NEXT_HOP attribute" errors in the log. Unfortunately, no more information than that was provided in the log, so I went on a little hunt. Since the exact error is raised from many places, I added some logs to identify the GW_DIRECT case in bgp_apply_next_hop() as the culprit. Here I realized that both gw and ll are set, which means a neighbor is tried to be found which doesn't involve the ll case. I then noticed that in bgp_decode_next_hop_ip(), this can only happen if len == 32. This is where I am absolutely clueless what it means for the nh len to be 32, and thus I don't know if the patch I came up with is correct, even though it works for my testing. Only this one peer using frr causes nh len to be set to 32, so I suppose it might be !
a !
rare configuration.
I added this simple patch:
--- bird2-2.0.7.orig/proto/bgp/packets.c +++ bird2-2.0.7/proto/bgp/packets.c @@ -1174,6 +1174,9 @@ bgp_decode_next_hop_ip(struct bgp_parse_
if (ipa_is_ip4(nh[0]) || !ip6_is_link_local(nh[1])) nh[1] = IPA_NONE; + + if (ip6_is_link_local(nh[1])) + nh[0] = IPA_NONE; } else bgp_parse_error(s, 9); ~
which worked for me to resolve the problem.
Hi, so I have an update here. My peer is now using an updated FRR, version 7.3-1~deb10u1 (installed from https://deb.frrouting.org), yet it still isn't working (in a normal bird 2.0.7, not using above patch). It seems to be FRR is advertising its link-local address twice as next-hop: /* GW_DIRECT -> single_hop -> p->neigh != NULL */ - if (ipa_nonzero(gw)) + if (ipa_nonzero(gw)) { + log(L_ERR "looking for neigh %I ll %I", gw, ll); nbr = neigh_find(&p->p, gw, NULL, 0); + } else if (ipa_nonzero(ll)) This small patch causes fe80::1 (my peer's link local address) to be logged for both gw and ll address. This seems like a problem with FRR/the peer's configuration to me, yes? Nevertheless, I wonder if bird's behaviour is correct here. It seems to me from reading the pointed out RFC that in the face of two nexthop addresses, bird should attempt the connection using the specified link-local address, not the global address (basicall, replacing the order of the if ... else if above), while passing on the global address to its peers. My first attempt at a patch seems pretty wrong to me from a look at the RFC. Thanks for any ideas :) Cheers Sebastian
On Wed, Apr 22, 2020 at 10:53:12PM +0200, Sebastian Hahn wrote:
On 20. Apr 2020, at 10:06, Sebastian Hahn <bird_users@sebastianhahn.net> wrote:
On 20. Apr 2020, at 04:02, Darren O'Connor <mellow.drifter@gmail.com> wrote:
Is this an ipv6 route? nh length 32 means both a global and link-local address is being set. RFC2545 section 3
Yes, this is indeed an ipv6 route. I am using many IPv6 routes with other peers, they do not hit the same code path. Thank you for the RFC pointer! Hi,
so I have an update here. My peer is now using an updated FRR, version 7.3-1~deb10u1 (installed from https://deb.frrouting.org), yet it still isn't working (in a normal bird 2.0.7, not using above patch). It seems to be FRR is advertising its link-local address twice as next-hop:
/* GW_DIRECT -> single_hop -> p->neigh != NULL */ - if (ipa_nonzero(gw)) + if (ipa_nonzero(gw)) { + log(L_ERR "looking for neigh %I ll %I", gw, ll); nbr = neigh_find(&p->p, gw, NULL, 0); + } else if (ipa_nonzero(ll))
This small patch causes fe80::1 (my peer's link local address) to be logged for both gw and ll address. This seems like a problem with FRR/the peer's configuration to me, yes?
Nevertheless, I wonder if bird's behaviour is correct here. It seems to me from reading the pointed out RFC that in the face of two nexthop addresses, bird should attempt the connection using the specified link-local address, not the global address (basicall, replacing the order of the if ... else if above), while passing on the global address to its peers. My first attempt at a patch seems pretty wrong to me from a look at the RFC.
Hi Well, the RFC 2545 is a bit vague and AFAIK nobody standardized link-local only sessions. Our position is that the first address is always global (as that is necessary for next hop resolving) and the second (optional) is link-local, therefore in cases where no global address is available the proper format of next hop should be (:: ll). Unfortunately, there are other implementations that use in such cases (ll) or (ll ll), we should handle that in bgp_decode_next_hop_ip(), but the second case is not handled there. Will send you a patch. -- Elen sila lumenn' omentielvo Ondrej 'Santiago' Zajicek (email: santiago@crfreenet.org) OpenPGP encrypted e-mails preferred (KeyID 0x11DEADC3, wwwkeys.pgp.net) "To err is human -- to blame it on a computer is even more so."
On 23. Apr 2020, at 04:35, Ondrej Zajicek <santiago@crfreenet.org> wrote: On Wed, Apr 22, 2020 at 10:53:12PM +0200, Sebastian Hahn wrote:
so I have an update here. My peer is now using an updated FRR, version 7.3-1~deb10u1 (installed from https://deb.frrouting.org), yet it still isn't working (in a normal bird 2.0.7, not using above patch). It seems to be FRR is advertising its link-local address twice as next-hop:
/* GW_DIRECT -> single_hop -> p->neigh != NULL */ - if (ipa_nonzero(gw)) + if (ipa_nonzero(gw)) { + log(L_ERR "looking for neigh %I ll %I", gw, ll); nbr = neigh_find(&p->p, gw, NULL, 0); + } else if (ipa_nonzero(ll))
This small patch causes fe80::1 (my peer's link local address) to be logged for both gw and ll address. This seems like a problem with FRR/the peer's configuration to me, yes?
Nevertheless, I wonder if bird's behaviour is correct here. It seems to me from reading the pointed out RFC that in the face of two nexthop addresses, bird should attempt the connection using the specified link-local address, not the global address (basicall, replacing the order of the if ... else if above), while passing on the global address to its peers. My first attempt at a patch seems pretty wrong to me from a look at the RFC.
Well, the RFC 2545 is a bit vague and AFAIK nobody standardized link-local only sessions. Our position is that the first address is always global (as that is necessary for next hop resolving) and the second (optional) is link-local, therefore in cases where no global address is available the proper format of next hop should be (:: ll).
Unfortunately, there are other implementations that use in such cases (ll) or (ll ll), we should handle that in bgp_decode_next_hop_ip(), but the second case is not handled there. Will send you a patch.
Hi Ondrej, thanks for the explanations! In the meantime I have found this bugreport from the FRR project, https://github.com/FRRouting/frr/issues/6259 - it appears they also want to change the behaviour on their side. Thanks Sebastian
On Thu, Apr 23, 2020 at 12:15:33PM +0200, Sebastian Hahn wrote:
Well, the RFC 2545 is a bit vague and AFAIK nobody standardized link-local only sessions. Our position is that the first address is always global (as that is necessary for next hop resolving) and the second (optional) is link-local, therefore in cases where no global address is available the proper format of next hop should be (:: ll).
Unfortunately, there are other implementations that use in such cases (ll) or (ll ll), we should handle that in bgp_decode_next_hop_ip(), but the second case is not handled there. Will send you a patch.
Hi Ondrej,
thanks for the explanations! In the meantime I have found this bugreport from the FRR project, https://github.com/FRRouting/frr/issues/6259 - it appears they also want to change the behaviour on their side.
Hi Here is the promised patch. Could you try it? -- Elen sila lumenn' omentielvo Ondrej 'Santiago' Zajicek (email: santiago@crfreenet.org) OpenPGP encrypted e-mails preferred (KeyID 0x11DEADC3, wwwkeys.pgp.net) "To err is human -- to blame it on a computer is even more so."
On 29. Apr 2020, at 03:01, Ondrej Zajicek <santiago@crfreenet.org> wrote: On Thu, Apr 23, 2020 at 12:15:33PM +0200, Sebastian Hahn wrote:
Well, the RFC 2545 is a bit vague and AFAIK nobody standardized link-local only sessions. Our position is that the first address is always global (as that is necessary for next hop resolving) and the second (optional) is link-local, therefore in cases where no global address is available the proper format of next hop should be (:: ll).
Unfortunately, there are other implementations that use in such cases (ll) or (ll ll), we should handle that in bgp_decode_next_hop_ip(), but the second case is not handled there. Will send you a patch.
Hi Ondrej,
thanks for the explanations! In the meantime I have found this bugreport from the FRR project, https://github.com/FRRouting/frr/issues/6259 - it appears they also want to change the behaviour on their side.
Hi
Here is the promised patch. Could you try it?
Hi Ondrej, the patch seems to work fine in my quick testing. Thanks Sebastian
On 20. Apr 2020, at 03:36, Donald Sharp <sharpd@cumulusnetworks.com> wrote:
Sebastian -
I cannot speak towards bird's behavior here but I can say that FRR has fixed a couple of nexthop related issues with what we send to our peers since the 6.0 release. I would please consider upgrading to a much later version if you can, 7.2 or 7.3 should have the fixes.
Hi Donald, I do not operate the FRR peer, but I will ask the colleague to upgrade if possible. Thank you for the advice! Cheers Sebastian
On Mon, Apr 20, 2020 at 12:46:17AM +0200, Sebastian Hahn wrote:
Hi,
let me preface this that I very much do not know what I am doing here, and have been somewhat unsuccessful in trying to understand what's going on by searching online. I would love an explanation though!
In a bird 2.0.7 setup, I was unable to import routes from one of my peers. It is the only one using frr (version 6.02-2 on debian), most other peers use bird1 or bird2. I noticed immediately after adding the peering that I received "Invalid NEXT_HOP attribute" errors in the log. Unfortunately, no more information than that was provided in the log, so I went on a little hunt. Since the exact error is raised from many places, I added some logs to identify the GW_DIRECT case in bgp_apply_next_hop() as the culprit. Here I realized that both gw and ll are set, which means a neighbor is tried to be found which doesn't involve the ll case. I then noticed that in bgp_decode_next_hop_ip(), this can only happen if len == 32. This is where I am absolutely clueless what it means for the nh len to be 32, and thus I don't know if the patch I came up with is correct, even though it works for my testing. Only this one peer using frr causes nh len to be set to 32, so I suppose it might be a ! rare configuration.
Hi What is your BIRD config? It seems that most likely BIRD cannot resolve the global next hop IP address (the on in gw). Because it is GW_DIRECT, it is resolved only in local addresses.
I added this simple patch:
--- bird2-2.0.7.orig/proto/bgp/packets.c +++ bird2-2.0.7/proto/bgp/packets.c @@ -1174,6 +1174,9 @@ bgp_decode_next_hop_ip(struct bgp_parse_
if (ipa_is_ip4(nh[0]) || !ip6_is_link_local(nh[1])) nh[1] = IPA_NONE; + + if (ip6_is_link_local(nh[1])) + nh[0] = IPA_NONE; } else bgp_parse_error(s, 9); ~
which worked for me to resolve the problem.
Thanks for any consideration!
Cheers Sebastian
-- Elen sila lumenn' omentielvo Ondrej 'Santiago' Zajicek (email: santiago@crfreenet.org) OpenPGP encrypted e-mails preferred (KeyID 0x11DEADC3, wwwkeys.pgp.net) "To err is human -- to blame it on a computer is even more so."
On 20. Apr 2020, at 05:05, Ondrej Zajicek <santiago@crfreenet.org> wrote: On Mon, Apr 20, 2020 at 12:46:17AM +0200, Sebastian Hahn wrote:
In a bird 2.0.7 setup, I was unable to import routes from one of my peers. It is the only one using frr (version 6.02-2 on debian), most other peers use bird1 or bird2. I noticed immediately after adding the peering that I received "Invalid NEXT_HOP attribute" errors in the log. Unfortunately, no more information than that was provided in the log, so I went on a little hunt. Since the exact error is raised from many places, I added some logs to identify the GW_DIRECT case in bgp_apply_next_hop() as the culprit. Here I realized that both gw and ll are set, which means a neighbor is tried to be found which doesn't involve the ll case. I then noticed that in bgp_decode_next_hop_ip(), this can only happen if len == 32. This is where I am absolutely clueless what it means for the nh len to be 32, and thus I don't know if the patch I came up with is correct, even though it works for my testing. Only this one peer using frr causes nh len to be set to 32, so I suppose it might be a ! rare configuration.
What is your BIRD config? It seems that most likely BIRD cannot resolve the global next hop IP address (the on in gw). Because it is GW_DIRECT, it is resolved only in local addresses.
Hi, and thanks for the responses. Here is the config: protocol bgp v6_frr_peer { local as <myas>; direct; neighbor fe80::1 as <theiras>; interface "frr_peer"; ipv6 { import keep filtered; import filter import_filter; export filter export_filter; }; } I use fe80::42, the peer uses fe80::1 on this link. Please let me know if I can provide more information in any way. Thanks Sebastian
participants (4)
-
Darren O'Connor -
Donald Sharp -
Ondrej Zajicek -
Sebastian Hahn