IPv6 BGP & kernel 4.19

Robert Sander

19 Jun 2019

7:10 a.m.

Hi,

our routers run on Debian stretch with bird 1.6.4 from bird.network.cz/debian.

Yesterday I tried kernel 4.19 from backports.debian.org and ran into a weird issue with IPv6 BGP sessions:

All peerings reported "Error: Hold timer expired" ca. every 40 minutes.

IPv6 forwarding was flapping all the time.

After rebooting into kernel 4.9 everything worked again.

IPv4 BGP was not affected, and neither was OSPF (v4 and v6). I could disable all IPv6 BGP peerings on this router, and it then forwarded IPv6 traffic without issues to another router learned via OSPF.

Has anyone seen such behaviour?

Regards
--
Robert Sander
Heinlein Support GmbH
Schwedter Str. 8/9b, 10119 Berlin
https://www.heinlein-support.de
Tel: 030 / 405051-43
Fax: 030 / 405051-19
Registered at Amtsgericht Berlin-Charlottenburg - HRB 93818 B
Managing director: Peer Heinlein - Registered office: Berlin

Alarig Le Lay

19 Jun 2019

6:09 p.m.

Hi,

On Wed, 19 Jun 2019 09:10:53, Robert Sander wrote:

...

Hi,

our routers run on Debian stretch with bird 1.6.4 from bird.network.cz/debian.

Yesterday I tried kernel 4.19 from backports.debian.org and ran into a weird issue with IPv6 BGP sessions:

All Peerings reported "Error: Hold timer expired" ca. every 40 minutes.

IPv6 forwarding was flapping all the time.

After rebooting into kernel 4.9 everything worked again.

IPv4 BGP was not affected and also OSPF (v4 and v6). I could disable all IPv6 BGP peerings on this router and then it forwarded to another router learned via OSPF for IPv6 without issues.

Has anyone seen such a behaviour?

I’ve seen this with 4.19 on Gentoo. For now I’m still running 4.14.

https://archives.gentoo.org/gentoo-user/message/fab628cc53e4a55589410f9dff6a...

-- Alarig

Benedikt Neuffer

20 Jun 2019

4:13 p.m.

Hi, On 19.06.19 20:09, Alarig Le Lay wrote:

I’ve seen this with 4.19 on gentoo. For now I’m still running 4.14. https://archives.gentoo.org/gentoo-user/message/fab628cc53e4a55589410f9dff6a...

Same here. Gentoo, Linux 4.19.52, Bird 2.0.4. I am running a full table using a separate VRF, with the default table as management VRF.

Without traffic through the box (all IPv6 prefixes filtered) the BGP session is stable. With traffic the BGP session dies after some time and SSH connections in the default table freeze.

I did some packet captures and saw TCP retransmissions before the hold timer expires.

Kernel 4.14.127 is stable here, too. Sadly I have no time for a kernel bisect until September. (And no clue where to start or how to trigger the bug faster.)

Regards
Bene
--
Karlsruhe Institute of Technology (KIT)
Steinbuch Centre for Computing (SCC)
Benedikt Neuffer
Netze und Telekommunikation (NET)
Hermann-von-Helmholtz-Platz 1
Building 442, Room 185
76344 Eggenstein-Leopoldshafen
Phone: +49 721 608-24502
Fax: +49 721 608-47763
E-Mail: benedikt.neuffer@kit.edu
Web: https://www.scc.kit.edu
Registered office: Kaiserstraße 12, 76131 Karlsruhe
KIT - The Research University in the Helmholtz Association
Signature version: 19.1.0 beta

Andrew Hearn

21 Nov 2019

4:09 p.m.

On 20/06/2019 17:13, Benedikt Neuffer wrote:

Same here. Gentoo, Linux 4.19.52, Bird 2.0.4. I am running a full table using a separate VRF and the default table as management VRF.

Without traffic through the box (all IPv6 prefixes filtered) the BGP session is stable. With traffic the BGP session dies after some time and SSH connections in the default table freeze.

I did some packet captures and saw TCP retransmissions before the hold timer expires.

Kernel 4.14.127 is stable here, too. Sadly I have no time for a kernel bisect until September. (And no clue where to start or how to trigger the bug faster.)

Sorry to bring up a fairly old thread...

We believe we are seeing this problem too, since a Stretch->Buster upgrade - was there a solution to this?

Thanks

Andrew.

Benedikt Neuffer

4:46 p.m.

Hi Andrew, On 21.11.19 17:09, Andrew Hearn wrote:

...

Sorry to bring up a fairly old thread...

We believe we are seeing this problem too, since a Stretch->Buster upgrade - was there a solution to this?

Thanks

The problem still exists. We are still running on kernel 4.14.x. I had no time to do any further debugging.

Regards,
Benedikt

Alarig Le Lay

5:04 p.m.

Hi, On 21/11/2019 17:46, Benedikt Neuffer wrote:

...

Hi Andrew,

On 21.11.19 17:09, Andrew Hearn wrote:

...
Sorry to bring up a fairly old thread...

We believe we are seeing this problem too, since a Stretch->Buster upgrade - was there a solution to this?

Thanks

The problem still exists. We are still running on kernel 4.14.x. I had no time to do any further debugging.

Regards, Benedikt

I also had the problem with 5.x on Proxmox 6. But I haven’t started debugging either, E_NOTIME…

-- Alarig

Ondrej Zajicek

5:12 p.m.

On Thu, Nov 21, 2019 at 04:09:24PM +0000, Andrew Hearn wrote:

...

Sorry to bring up a fairly old thread...

We believe we are seeing this problem too, since a Stretch->Buster upgrade - was there a solution to this?

Perhaps try kernel 5.2.x or 5.3.x from buster-backports?

--
Elen sila lumenn' omentielvo

Ondrej 'Santiago' Zajicek (email: santiago@crfreenet.org)
OpenPGP encrypted e-mails preferred (KeyID 0x11DEADC3, wwwkeys.pgp.net)
"To err is human -- to blame it on a computer is even more so."

Alarig Le Lay

23 Nov 2019

5:46 p.m.

On Thu, 21 Nov 2019 18:12:17, Ondrej Zajicek wrote:

...

Perhaps try kernel 5.2.x or 5.3.x from buster-backports?

I’m very interested in test results from kernels newer than 5.0.x.

-- Alarig

Stefan Jakob

30 Nov 2019

10:43 a.m.

Can anyone provide test configs?

Is it testable inside two or three VMs?

I could offer 5.3.x tests here.

On Sat, Nov 23, 2019 at 6:48 PM Alarig Le Lay <alarig@swordarmor.fr> wrote:

...

On Thu, 21 Nov 2019 18:12:17, Ondrej Zajicek wrote:

...
Perhaps try kernel 5.2.x or 5.3.x from buster-backports?

I’m very interested by test results from newer kernels than 5.0.x

-- Alarig

Benedikt Neuffer

11:19 a.m.

Hi all, On 30.11.19 11:43, Stefan Jakob wrote:

...

Can anyone provide test configs?

Is it testable inside two or three VMs?

Could offer 5.3.X tests here.

As far as I can see, one needs some traffic to reproduce the issue. Without traffic I haven't seen the issue.

Regards,
Benedikt

Frederik Kriewitz

10:31 p.m.

On Sat, Nov 30, 2019 at 12:26 PM Benedikt Neuffer <benedikt.neuffer@kit.edu> wrote:

...

as far as I see one need some traffic to reproduce the issue. Without traffic I haven't seen the issue.

Yes, we saw this behaviour too using the buster kernel. It seems to be traffic and/or neighbour related.

Forwarding itself seems to work, but neighbour discovery stops working (that's why the multicast-based OSPF sessions are not affected). In this state the kernel doesn't generate any neighbour solicitation packets (none are visible using tcpdump). Once the neighbour cache times out, IPv6 connectivity is broken.

We don't know yet whether this might be NIC related. We're seeing it happen with Intel X710 NICs (with all offloading features disabled). Which NICs are you using?

Resetting the NIC using ethtool -r $INTERFACE seems to have fixed it once for us. The problem also fixes itself after ~90 to 110 minutes, until it appears again.

Alarig Le Lay

10:50 p.m.

On Sat, 30 Nov 2019 23:31:39, Frederik Kriewitz wrote:

...

We don't know if this might be NIC related yet. We're seeing it happen with Intel X710 NICs (With all offloading features disabled). Which NICs are you using?

We are using “Intel Corporation 82576 Gigabit Network Connection” NICs. -- Alarig

Alarig Le Lay

11:02 p.m.

On Sat, 30 Nov 2019 23:50:48, Alarig Le Lay wrote:

...

We are using “Intel Corporation 82576 Gigabit Network Connection” NICs.

And “Broadcom Limited NetXtreme II BCM5709 Gigabit Ethernet”, sorry I forgot this box. -- Alarig

Benedikt Neuffer

1 Dec 2019

10:43 a.m.

Hi Frederik, On 30.11.19 23:31, Frederik Kriewitz wrote:

...

On Sat, Nov 30, 2019 at 12:26 PM Benedikt Neuffer <benedikt.neuffer@kit.edu> wrote:

Which NICs are you using?

We are using Intel X520.

Regards,
Benedikt

Alarig Le Lay

30 Nov 2019

11:32 a.m.

I saw it in production with ~20 VMs, but I don’t know how many are needed to trigger it.

On Sat, 30 Nov 2019 11:43:29, Stefan Jakob wrote:

...

Can anyone provide test configs?

Is it testable inside two or three VMs?

Could offer 5.3.X tests here.

Daniel Suchy

1 Dec 2019

11:51 a.m.

Hello,

I'm running the bird 1.6.x branch (packages from Debian Buster; currently 1.6.6) on recent custom-built 4.19 kernels without any issues (on armhf hardware). My BGP sessions carry only a few routes (default plus some more-specifics).

One idea that comes to mind is the default kernel limit for IPv6 routes in memory (sysctl net.ipv6.route.max_size). The default is quite low for full-BGP/DFZ IPv6 deployments, and it's still set to 4096 on Debian Buster with stock kernels. Can people having issues with 4.19 kernels check the sysctl mentioned above?

- Daniel

On 11/21/19 6:12 PM, Ondrej Zajicek wrote:

...

Perhaps try kernel 5.2.x or 5.3.x from buster-backports?

Frederik Kriewitz

12:43 p.m.

On Sun, Dec 1, 2019 at 12:57 PM Daniel Suchy <danny@danysek.cz> wrote:

...

One idea that comes in my mind is default kernel limit for IPv6 routes in memory (sysctl net.ipv6.route.max_size); and such default is quite low for fullbgp/DFZ IPv6 deployments and it's still set to 4096 on Debian/Buster with stock kernels. Can people having issues with 4.19 kernels check sysctl mentioned above?

This is our current suspicion too. neighbours and routes are well below 4096 in our case. We also had to adjust net.ipv6.neigh.default.gc_thresh1/2/3. Since the adjustment it's been working fine.

Clément Guivy

6:20 p.m.

On 01/12/2019 13:43, Frederik Kriewitz wrote:

...

This is our current suspicion too. neighbours and routes are well below 4096 in our case. We also had to adjust net.ipv6.neigh.default.gc_thresh1/2/3. Since the adjustment it's been working fine.

Hi, that's good news. One thing that still confuses me, though, is that the default values for these settings are the same in Debian 9 (4.9 kernel) and Debian 10 (4.19 kernel), so I would expect the behaviour to be the same between both versions in that regard. Also, I'm not sure I understand what this max_size parameter actually does, since I have it at the default value (4096), and yet the IPv6 route table is currently >70k entries large without the kernel complaining.

Andrew Hearn

2 Dec 2019

3:56 p.m.

On 01/12/2019 18:20, Clément Guivy wrote:

...

On 01/12/2019 13:43, Frederik Kriewitz wrote:

...
This is our current suspicion too. neighbours and routes are well below 4096 in our case. We also had to adjust net.ipv6.neigh.default.gc_thresh1/2/3. Since the adjustment it's been working fine.

Hi, that's good news. One thing that still confuses me though is that the default values for these settings are the same in Debian 9 (4.9 kernel) and Debian 10 (4.19 kernel), so I would expect the behaviour to be the same between both versions in that regard. Also I'm not sure to understand what this max_size parameter actually does since I have it to default value (4096), and yet ipv6 route table at the moment is >70k entries large without the kernel complaining.

To add our info: we're using Intel 82599ES NICs. We have a full table on v4 and v6, and about 20 neighbors on each.

Our route/max_size values for v4 and v6 are the defaults (2M and 4096 respectively), and as noted, these values are the same on our Stretch and Buster boxes.

Andrew

Vincent Bernat

8:38 p.m.

❦ 1 December 2019 19:20 +01, Clément Guivy <clement@guivy.fr>:

...

Hi, that's good news. One thing that still confuses me though is that the default values for these settings are the same in Debian 9 (4.9 kernel) and Debian 10 (4.19 kernel), so I would expect the behaviour to be the same between both versions in that regard. Also I'm not sure to understand what this max_size parameter actually does since I have it to default value (4096), and yet ipv6 route table at the moment is >70k entries large without the kernel complaining.

For IPv4, the parameter has been ignored since Linux 3.6. For IPv6, this is the size of the routing cache. If you have more than 4096 active hosts, Linux will aggressively try to run garbage collection, eating CPU. In this case, increase both net.ipv6.route.max_size and net.ipv6.route.gc_thresh.

That's a pity, but this value is not easily observable, so it's hard to know when you hit it. Also, while IPv4 recently got back the ability to enumerate the cache, this is not the case for IPv6.

This setting is a bit confusing, as it is not documented and, in the past (before Linux 3.0), it limited the whole IPv6 route table.

--
Write clearly - don't sacrifice clarity for "efficiency".
- The Elements of Programming Style (Kernighan & Plauger)

Alarig Le Lay

8:58 p.m.

Hi Vincent,

On Mon, 2 Dec 2019 21:38:21, Vincent Bernat wrote:

...

For IPv6, this is the size of the routing cache. If you have more than 4096 active hosts, Linux will aggressively try to run garbage collection, eating CPU. In this case, increase both net.ipv6.route.max_size and net.ipv6.route.gc_thresh.

Do you know what the risks are when we raise those parameters? A bit more RAM consumption?

Regards,
-- Alarig

Vincent Bernat

9:48 p.m.

❦ 2 December 2019 21:58 +01, Alarig Le Lay <alarig@swordarmor.fr>:

...

...
For IPv6, this is the size of the routing cache. If you have more than 4096 active hosts, Linux will aggressively try to run garbage collection, eating CPU. In this case, increase both net.ipv6.route.max_size and net.ipv6.route.gc_thresh.

Do you know what are the risks when we raise those parameters? A bit more RAM consumption?

You are mostly safe with RAM. Increasing the value to 512k would eat 256MB of RAM. However, if an attacker is still able to overflow the cache, it is costly in terms of CPU. This is a bit similar to the route cache for IPv4, so you need to play with threshold, interval and timeout to try to keep CPU usage down, but ultimately, a fast enough attacker can do a lot of damage here. I don't have real-life experience with this aspect.

Also, from 4.2, the cache entries are only created for exceptions (PMTU notably). So, in fact, the initial value should be mostly safe. You can monitor it with `/proc/net/rt6_stats`: it is the second-to-last value. If you can share what you have, I would be curious to know how low it is (compared to the 4th entry notably).

--
Writing is turning one's worst moments into money.
-- J.P. Donleavy
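The memory figure quoted in this message can be turned into a rough per-entry cost. This is only a minimal sketch: it assumes that "512k" means 512 * 1024 entries and that "256MB" means 256 MiB (the message does not say which units are meant), and the helper name `route_cache_mib` is made up for illustration.

```shell
# Per-entry cost implied by "512k entries would eat 256MB".
# Assumption: binary units (512 * 1024 entries, 256 MiB).
BYTES_PER_ENTRY=$(( 256 * 1024 * 1024 / (512 * 1024) ))

# Hypothetical helper: worst-case MiB used if the cache fills up to
# the max_size value given as $1.
route_cache_mib() {
    echo $(( $1 * BYTES_PER_ENTRY / 1024 / 1024 ))
}

echo "bytes per entry: $BYTES_PER_ENTRY"
echo "max_size=4096   -> $(route_cache_mib 4096) MiB"
echo "max_size=524288 -> $(route_cache_mib 524288) MiB"
```

Under those assumptions the default of 4096 entries corresponds to roughly 2 MiB at most, so the RAM side of the question is small compared to the CPU cost of garbage collection mentioned in the message.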

Vincent Bernat

10:04 p.m.

❦ 2 December 2019 22:48 +01, Vincent Bernat <bernat@luffy.cx>:

...

Also, from 4.2, the cache entries are only created for exceptions (PMTU notably). So, in fact, the initial value should be mostly safe. You can monitor it with `/proc/net/rt6_stats`. This is the before last value. If you can share what you have, I would be curious to know how low it is (compared to the 4th entry notably).

Just to be clear: I had forgotten this fact, and therefore my initial recommendation to increase max_size with more than 4096 active hosts does not apply anymore (as long as you have a 4.2+ kernel). Keep the default value and watch `/proc/net/rt6_stats`.

--
Program defensively.
- The Elements of Programming Style (Kernighan & Plauger)

Alarig Le Lay

3 Dec 2019

7:56 a.m.

On 02/12/2019 23:04, Vincent Bernat wrote:

...

Just to be clear: I did forget this fact and therefore my initial recommendation to increase max_size with more than 4096 active hosts does not apply anymore (as long as you have a 4.2+ kernel). Keep the default value and watch `/proc/net/rt6_stats`.

core01-arendal ~ # cat /proc/net/rt6_stats
0048 002c 5e56 0050 0000 0056 0020

Is it supposed to be understandable? :D

-- Alarig

Vincent Bernat

8:40 a.m.

❦ 3 December 2019 08:56 +01, Alarig Le Lay <alarig@swordarmor.fr>:

...

...
Just to be clear: I did forget this fact and therefore my initial recommendation to increase max_size with more than 4096 active hosts does not apply anymore (as long as you have a 4.2+ kernel). Keep the default value and watch `/proc/net/rt6_stats`.

core01-arendal ~ # cat /proc/net/rt6_stats
0048 002c 5e56 0050 0000 0056 0020

It is supposed to be understandable? :D

So, there are 0x56 entries in the cache. Isn't that clear? :)

https://elixir.bootlin.com/linux/latest/source/net/ipv6/route.c#L6006

--
Modularise. Use subroutines.
- The Elements of Programming Style (Kernighan & Plauger)
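For anyone else squinting at that output, here is a minimal POSIX shell sketch that converts the sixth field to decimal. It assumes, as this thread does, that the sixth (second-to-last) field is the cache entry count; the function names are invented for illustration.

```shell
# Print the sixth field of one rt6_stats line as a decimal number.
# $1: a line in the format of /proc/net/rt6_stats (seven hex fields).
rt6_cache_entries() {
    set -- $1                 # split the line into positional parameters
    printf '%d\n' "0x$6"      # printf accepts C-style hex constants
}

# Print how full the cache is, as an integer percentage of max_size ($2).
rt6_cache_fill_percent() {
    echo $(( $(rt6_cache_entries "$1") * 100 / $2 ))
}

# The line posted in the message quoted above:
rt6_cache_entries "0048 002c 5e56 0050 0000 0056 0020"
```

On a live router the same helpers could be fed with `rt6_cache_entries "$(cat /proc/net/rt6_stats)"`, and the percentage with `"$(sysctl -n net.ipv6.route.max_size)"` as the second argument.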

Alarig Le Lay

10:46 a.m.

On Tue, 3 Dec 2019 09:40:31, Vincent Bernat wrote:

...

So, there is 0x56 entries in the cache. Isn't that clear? :)

https://elixir.bootlin.com/linux/latest/source/net/ipv6/route.c#L6006

I did a quick test on some routers:

core01-arendal, no fullview, on my own ASN, not so much traffic, using tunnels
https://pix.milkywan.fr/apWaD84h.png
core01-arendal ~ # while :; do awk --non-decimal-data '{ print ("0x"$6)+0, "(" $6 ")" }' /proc/net/rt6_stats; sleep 120; done
86 (0056)
86 (0056)
86 (0056)
core01-arendal ~ # ip -6 r | wc -l
64
core01-arendal ~ # uname -a
Linux core01-arendal.no.swordarmor.fr 4.19.86-gentoo #1 SMP Mon Dec 2 19:02:33 CET 2019 x86_64 AMD GX-412TC SOC AuthenticAMD GNU/Linux

core02-arendal, no fullview, on my own ASN, not so much traffic, using tunnels
https://pix.milkywan.fr/NF3jNY9K.png
core02-arendal ~ # while :; do awk --non-decimal-data '{ print ("0x"$6)+0, "(" $6 ")" }' /proc/net/rt6_stats; sleep 120; done
28 (001c)
30 (001e)
30 (001e)
core02-arendal ~ # ip -6 r | wc -l
39
core02-arendal ~ # uname -a
Linux core02-arendal.no.swordarmor.fr 4.19.86-gentoo #1 SMP Mon Dec 2 22:08:21 CET 2019 x86_64 AMD G-T40E Processor AuthenticAMD GNU/Linux

edge01-terrahost, fullview, on my own ASN, not so much traffic, using one tunnel
https://pix.milkywan.fr/6AVwYkY8.png
edge01-terrahost ~ # while :; do awk --non-decimal-data '{ print ("0x"$6)+0, "(" $6 ")" }' /proc/net/rt6_stats; sleep 120; done
96 (0060)
101 (0065)
101 (0065)
edge01-terrahost ~ # ip -6 r | wc -l
77439
edge01-terrahost ~ # uname -a
Linux edge01-terrahost.no.swordarmor.fr 4.19.82-gentoo #2 SMP Tue Nov 12 22:08:28 CET 2019 x86_64 Intel(R) Core(TM) i5-3470 CPU @ 3.20GHz GenuineIntel GNU/Linux

edge02-fjordane, fullview, on my own ASN, not so much traffic, using one tunnel
https://pix.milkywan.fr/J4rOuylq.png
edge02-fjordane ~ # while :; do awk --non-decimal-data '{ print ("0x"$6)+0, "(" $6 ")" }' /proc/net/rt6_stats; sleep 120; done
110 (006e)
110 (006e)
110 (006e)
edge02-fjordane ~ # ip -6 r | wc -l
77433
edge02-fjordane ~ # uname -a
Linux edge02-fjordane.no.swordarmor.fr 4.19.86-gentoo #1 SMP Thu Nov 28 16:47:53 CET 2019 x86_64 Common KVM processor GenuineIntel GNU/Linux

regis, fullview, on my own ASN, a bit more traffic, using one tunnel
https://pix.milkywan.fr/5XeaK2du.png
regis ~ # while :; do awk --non-decimal-data '{ print ("0x"$6)+0, "(" $6 ")" }' /proc/net/rt6_stats; sleep 120; done
0 (0000)
1 (0001)
1 (0001)
regis ~ # ip -6 r | wc -l
77538
regis ~ # uname -a
Linux regis.swordarmor.fr 4.14.83-gentoo #2 SMP Sat Feb 2 16:50:41 CET 2019 x86_64 Intel(R) Xeon(R) CPU X3450 @ 2.67GHz GenuineIntel GNU/Linux

asbr02, fullview, on a not-for-profit ASN providing services for others, 100M of traffic, using one tunnel
https://pix.milkywan.fr/l1hfAAIn.png
alarig@asbr02 ~ $ while :; do awk --non-decimal-data '{ print ("0x"$6)+0, "(" $6 ")" }' /proc/net/rt6_stats; sleep 120; done
4 (0004)
3 (0003)
0 (0000)
alarig@asbr02 ~ $ ip -6 r | wc -l
77525
alarig@asbr02 ~ $ uname -a
Linux asbr02.cogent-rns.grifon.fr 4.14.156-gentoo #1 SMP Tue Dec 3 09:53:23 CET 2019 x86_64 Intel(R) Xeon(R) CPU X3450 @ 2.67GHz GenuineIntel GNU/Linux

So, I have more routes in cache than in FIB on my two core routers, I’m pretty sure there is a bug there :p

I have less routes in cache on 4.14 kernels but more traffic.

I don’t know which function is feeding the cache, but I think that it’s doing too much.

-- Alarig

Vincent Bernat

10:58 a.m.

❦ 3 December 2019 11:46 +01, Alarig Le Lay <alarig@swordarmor.fr>:

...

So, I have more routes in cache than in FIB on my two core routers, I’m pretty sure there is a bug there :p

It's not unexpected. A cache entry is for a /128.

...

I have less routes in cache on 4.14 kernels but more traffic.

I don’t know which function is feeding the cache, but I think that it’s doing too much.

The function is ip6_rt_cache_alloc(). It is called on PMTU exceptions, on redirects, and in this last case, which I currently fail to understand:

...

ipv6: Create RTF_CACHE clone when FLOWI_FLAG_KNOWN_NH is set

This patch always creates RTF_CACHE clone with DST_NOCACHE when FLOWI_FLAG_KNOWN_NH is set so that the rt6i_dst is set to the fl6->daddr.

-- It is a wise father that knows his own child. -- William Shakespeare, "The Merchant of Venice"

Alarig Le Lay

11:48 a.m.

On 03/12/2019 11:58, Vincent Bernat wrote:

...

It's not unexpected. A cache entry is for a /128.

When I’m routing 80k prefixes I don’t want to have n /128 routes because someone doesn’t have an MTU of 1500. Is there a way to disable this behaviour?

-- Alarig

Vincent Bernat

1:16 p.m.

❦ 3 December 2019 12:48 +01, Alarig Le Lay <alarig@swordarmor.fr>:

...

...
It's not unexpected. A cache entry is for a /128.

When I’m routing 80k prefixes I don’t want to have n /128 routes because someone doesn’t have an MTU of 1500. Is there a way to disable this behaviour?

I don't think there is. The information needs to be stored somewhere. With IPv6, they are materialized as regular route entries tagged as "cached routes". With IPv4, they are stored inside a route entry.

--
Don't stop with your first draft.
- The Elements of Programming Style (Kernighan & Plauger)

Alarig Le Lay

6:05 p.m.

On 03/12/2019 14:16, Vincent Bernat wrote:

...

The information needs to be stored somewhere.

Why does it have to be stored? It’s not really my problem if someone else has a non-standard MTU and can’t do TCP-MSS or PMTUd.

-- Alarig

Oliver

24 Sep 2020

12:37 p.m.

Hello,

after upgrading to Debian buster with kernel 4.19 we also had problems.

By adjusting net.ipv6.route.max_size we have fixed the following messages:

watchdog: BUG: soft lockup - CPU#X stuck for 22s!

and

ixgbe 0000:02:00.0 ens2fX: initiating reset due to tx timeout

But we still had a lot of jitter on the line. Downgrading to 4.9.0 fixed the problem, but this is not a permanent solution.

What else we tried:

* Increasing gc_threshX
  net.ipv6.neigh.default.gc_thresh1 = 2048
  net.ipv6.neigh.default.gc_thresh2 = 4096
  net.ipv6.neigh.default.gc_thresh3 = 8192
  => Did not help

* Going to a backports kernel (5.7.0)
  => Did not help

@Frederik Kriewitz: What did you do to fix that problem?

Oliver

Frederik Kriewitz

1:19 p.m.

On Thu, Sep 24, 2020 at 2:47 PM Oliver <bird-o@sernet.de> wrote:

...

@Frederik Kriewitz: What did you do fix that problem?

We're not having jitter issues with Debian's 4.19 kernel:

100 packets transmitted, 100 received, 0% packet loss, time 99143ms
rtt min/avg/max/mdev = 0.138/0.174/0.236/0.015 ms

With these settings:

net.ipv6.neigh.default.gc_thresh1 = 1024
net.ipv6.neigh.default.gc_thresh2 = 20000
net.ipv6.neigh.default.gc_thresh3 = 65535
net.ipv6.route.max_size=500000

Oliver

2:53 p.m.

On Thu, 24 Sep 2020, Frederik Kriewitz wrote:

...

On Thu, Sep 24, 2020 at 2:47 PM Oliver <bird-o@sernet.de> wrote:

...
@Frederik Kriewitz: What did you do fix that problem?

We're not having jitter issues with debians 4.19 kernel:

100 packets transmitted, 100 received, 0% packet loss, time 99143ms rtt min/avg/max/mdev = 0.138/0.174/0.236/0.015 ms

With these settings:
net.ipv6.neigh.default.gc_thresh1 = 1024
net.ipv6.neigh.default.gc_thresh2 = 20000
net.ipv6.neigh.default.gc_thresh3 = 65535
net.ipv6.route.max_size=500000

Thank you for sharing your values. This did not help in our situation:

Sometimes ping times go up from under 1ms to over 100ms. With kernel 4.9.0 we see no problems.

Oliver

micah anderson

3:03 p.m.

Oliver <bird-o@sernet.de> writes:

...

Hello,

after upgrading to debian buster with kernel 4.19 we also had problems.

By adjusting net.ipv6.route.max_size we have fixed the following messages: watchdog: BUG: soft lockup - CPU#X stuck for 22s! and ixgbe 0000:02:00.0 ens2fX: initiating reset due to tx timeout

But we still had a lot of jitter on the line. Downgrading to 4.9.0 fixed the problem, but this is not a permanent solution.

What else we tried:

* Increasing gc_threshX
  net.ipv6.neigh.default.gc_thresh1 = 2048
  net.ipv6.neigh.default.gc_thresh2 = 4096
  net.ipv6.neigh.default.gc_thresh3 = 8192
  => Did not help

The Linux kernel is getting rid of IPv6 caching, like it did with IPv4, but it will take some time to get there. It seems that in this kernel they have set a small value for net.ipv6.route.max_size (4096!), and when this parameter was increased (e.g. to 1048576), the problem went away for us.

I'm not 100% clear on what units this value is in. I had around 89k IPv6 routes, so this value is definitely higher. I'm sure that setting it too high could result in some memory issues.

Additionally, you also want to raise net.ipv6.route.gc_thresh to avoid running the garbage collector too often. I found that the rule of thumb here is 1/4 the size of ipv6.route.max_size.

I did find that in Linux kernel 5.2 there is a message output to the kernel ring buffer when the ipv6.route.max_size is hit, so you at least have a *clue* what is going on. In 4.19, which is what Debian Buster ships, you don't get that clue.

-- micah
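The rule of thumb in this message is simple arithmetic. Here is a minimal sketch that prints the two sysctl lines for a chosen max_size, assuming the 1/4 ratio described here (a rule of thumb from this thread, not a documented kernel requirement). The function name is hypothetical.

```shell
# Print sysctl settings with gc_thresh set to 1/4 of max_size ($1).
ipv6_route_sysctls() {
    max_size=$1
    gc_thresh=$(( max_size / 4 ))
    printf 'net.ipv6.route.max_size = %d\n' "$max_size"
    printf 'net.ipv6.route.gc_thresh = %d\n' "$gc_thresh"
}

# Example with the value mentioned in the message:
ipv6_route_sysctls 1048576
```

The output could be redirected to a file under /etc/sysctl.d/ (for example 90-ipv6-route.conf, a name chosen here purely for illustration) and loaded with `sysctl --system`.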

Benedikt Neuffer

3:30 p.m.

Hi all, On 24.09.20 17:03, micah anderson wrote:

...

The linux kernel is getting rid of ipv6 caching, like it did with ipv4, but it will take some time to get there. It seems that in this kernel they have set a small value for net.ipv6.route.max_size (4096!), and when this parameter is increased (e.g. 1048576).... the problem went away for us.

I'm not 100% clear on what units this value is in. I had around 89k IPv6 routes, so this value is definitely higher. I'm sure that setting it too high could result in some memory issues.

Additionally, you also want to raise net.ipv6.route.gc_thresh to avoid running the garbage collector too often. I found that the rule of thumb here is 1/4 the size of ipv6.route.max_size.

I did find that in Linux kernel 5.2 there is a message output to the kernel ring buffer when the ipv6.route.max_size is hit, so you at least have a *clue* what is going on. In 4.19, which is what Debian Buster is, you don't get that clue.

In kernel 4.14 we haven't seen the issue. In 5.7 the issue still exists. I can confirm that increasing net.ipv6.route.max_size is a workaround.

Regards,
Bene

Clément Guivy

6:25 p.m.

On 24/09/2020 14:37, Oliver wrote:

...

Hello,

after upgrading to debian buster with kernel 4.19 we also had problems.

How full is your route cache compared to the sysctl threshold? See the (hex) value with: cut -d' ' -f 6 /proc/net/rt6_stats

Do you get a "Network is unreachable" error at some point if you do an "ip route get" on each prefix of your (IPv6) routing table? (While you are running this test you should see the cache filling up in the rt6_stats file mentioned above.)

How full is your neighbor table compared to the sysctl threshold? You can read it with: ip -6 neigh sh | wc -l

Do you notice random drops on Bird sessions?
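Since the rt6_stats values are hexadecimal, a small conversion step makes them easier to compare with the decimal sysctl thresholds. The following is only a sketch: the function name is made up, and the sample line is invented (zero placeholders plus one non-zero field) so that it runs anywhere.

```shell
#!/bin/sh
# Hypothetical helper: print one space-separated hex field of an
# rt6_stats-style line as a decimal number.
rt6_field_dec() {
    line=$1
    field=$2
    # Pick the requested field, then let printf parse it as a hex constant.
    hex=$(printf '%s\n' "$line" | awk -v n="$field" '{ print $n }')
    printf '%d\n' "0x$hex"
}

# Invented sample line; on a router one would pass the real file instead:
#   rt6_field_dec "$(cat /proc/net/rt6_stats)" 6
rt6_field_dec "0000 0000 0000 0000 0000 06ca 0000" 6
```

The decimal result can then be compared directly with the output of sysctl -n net.ipv6.route.gc_thresh.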

Oliver

7:52 p.m.

On Thu, 24 Sep 2020, Clément Guivy wrote:

...

On 24/09/2020 14:37, Oliver wrote:

...
Hello,

after upgrading to debian buster with kernel 4.19 we also had problems.

How filled is your route cache compared to the sysctl treshold? See the (hex) value with : cut -d'' -f 6 /proc/net/rt6_stats

awk '{ print ("0x"$6)+0, "(" $6 ")" }' /proc/net/rt6_stats

It is between 1315 (0523) and 1738 (06ca).

...

Do you get a "Network is unreachable" error at some point if you do an "ip route get" on each prefix of your (ipv6) routing table? (while you are doing this test you should see the cache being filled according to the rt6_stats file as said before)

After we set "net.ipv6.route.max_size = 400000" we do not get any "Network is unreachable" anymore. This is how we tested it:

ip -6 route |egrep "^[0-9a-f]{1,4}:"|awk '{ print $1; }'|sed "s#/.*##"|xargs -L 1 ip -6 route get 1> /dev/null

...

How filled is your neighbor table compared to the sysctl treshold? You can read it with : ip -6 neigh sh | wc -l

18 (so very low)

...

Do you notice random drops on Bird sessions?

After we set "net.ipv6.route.max_size = 400000" we do not have any drops anymore.

We have many ipv6_routes:

cat /proc/net/ipv6_route | wc -l
207281

(so more than the normal full IPv6 BGP table, which is around 90000)

At the moment only the "jitter" is still a problem for us. I just increased net.ipv6.route.gc_thresh to 102400 as suggested by micah (1/4 of ipv6.route.max_size). With the increased value, the count in /proc/net/rt6_stats goes up to 2186 (088a) and stays in that region.

So for the past few minutes everything has run smoothly with this config:

net.ipv6.route.max_size = 400000
net.ipv6.route.gc_thresh = 102400

I did not change the net.ipv6.neigh.default.gc_thresh* values. I will monitor the values and write again after some time.

Oliver
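The "ip route get" sweep used for testing in this message can also be written with a single awk step for the prefix extraction. This is only a sketch of the same idea: the function name is made up, the sample input uses documentation prefixes rather than real routes, and the pattern is slightly looser than the original egrep (any run of hex digits before the first colon).

```shell
#!/bin/sh
# Hypothetical helper: read `ip -6 route` output on stdin and print the
# address part of every line that starts with an IPv6 prefix.
ipv6_route_addrs() {
    awk '$1 ~ /^[0-9a-f]+:/ { sub(/\/.*/, "", $1); print $1 }'
}

# Sample input (documentation prefixes, not real routes); the "default"
# line is skipped because it does not start with a prefix.
printf '%s\n' \
    '2001:db8:1::/48 via fe80::1 dev eth0 proto bird metric 1024' \
    'default via fe80::1 dev eth0 metric 1024' \
    '2001:db8:2::/64 dev eth1 proto kernel metric 256' |
    ipv6_route_addrs
```

On a router the helper would slot into the same pipeline as before, for example: ip -6 route | ipv6_route_addrs | xargs -L 1 ip -6 route get > /dev/null, so that only error messages remain visible.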

Oliver

25 Sep

11:22 a.m.

Hello,

today I monitored the numbers and did more testing. The jitter starts when $6 (total number of routes alloced) in /proc/net/rt6_stats is higher than the value of net.ipv6.route.gc_thresh. I provoked this by running:

ip -6 route |egrep "^[0-9a-f]{1,4}:"|awk '{ print $1; }'|sed "s#/.*##"|xargs -L 1 ip -6 route get 1> /dev/null

I thought the garbage collector would remove old entries, but this does not seem to be the case. The number now stays above 120,000, and I had to increase net.ipv6.route.gc_thresh further to get things stable again. When I run the "route get" command again after this, the numbers still go up. Shouldn't everything already be in the cache by now?

I looked into the kernel code (include/net/ip6_fib.h) to understand more of the /proc/net/rt6_stats numbers, but I still have questions:

fib_nodes; /* all fib6 nodes */
=> What does this number mean? What are fib6 nodes?
Current value: 405615 (6306f)

fib_route_nodes; /* intermediate nodes */
=> Is around the size of: ip -6 r l table all | wc -l
Current value: 207305 (329c9)

fib_rt_entries; /* rt entries in fib table */
=> How can I see all of these rt entries?
Current value: 32730956 (1f36f4c)

fib_rt_cache; /* cached rt entries in exception table */
=> This number is higher than fib_route_nodes
Current value: 207330 (329e2)

fib_discarded_routes; /* total number of routes delete */
=> Are discarded routes removed by the gc or removed by bird?
Current value: 1 (0001)

fib_rt_alloc; /* total number of routes alloced */
Current value: 124170 (1e50a)

fib_rt_uncache; /* rt entries in uncached list */
=> This number is higher than fib_nodes
Current value: 439679 (6b57f)

Maybe someone can shed some light on this.

Oliver
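The observation above (jitter starts once field 6 of rt6_stats exceeds gc_thresh) could be turned into a simple check for monitoring. This is only a sketch: the function name is made up, and the sample line is invented apart from field 6, which uses the 1e50a value quoted in this message.

```shell
#!/bin/sh
# Hypothetical helper: compare field 6 of an rt6_stats-style line (hex)
# with a gc_thresh value (decimal) and report which side of it we are on.
rt6_gc_status() {
    line=$1
    thresh=$2
    hex=$(printf '%s\n' "$line" | awk '{ print $6 }')
    count=$((0x$hex))   # shell arithmetic accepts 0x-prefixed constants
    if [ "$count" -gt "$thresh" ]; then
        echo "over: $count > $thresh"
    else
        echo "ok: $count <= $thresh"
    fi
}

# Invented sample line except for field 6; on a router one would run:
#   rt6_gc_status "$(cat /proc/net/rt6_stats)" \
#                 "$(sysctl -n net.ipv6.route.gc_thresh)"
rt6_gc_status "0000 0000 0000 0000 0000 1e50a 0000" 102400
```

Run periodically, such a check would at least flag the moment the counter crosses the threshold, which kernel 4.19 itself does not report.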

Oliver

25 Aug

1:46 p.m.

New subject: IPv6 BGP & kernel 4.19 (and upto 5.10.46)

Hello,

back again on this topic. This problem is still not completely fixed with Debian Bullseye and kernel 5.10.46. The workaround is still:

net.ipv6.route.max_size = 400000
net.ipv6.route.gc_thresh = 102400

https://bird.network.cz/pipermail/bird-users/2020-March/014406.html mentions that you can also set:

net.ipv6.route.gc_thresh = -1

But is this value safe to use? It is also the default for IPv4:

net.ipv4.route.gc_thresh = -1

With the default value of net.ipv6.route.gc_thresh = 1024 we still have a lot of jitter on the line.

Why is the default value of net.ipv6.route.max_size still 4096, compared to the IPv4 value?

net.ipv4.route.max_size = 2147483647

Has anyone done more research on this topic?

Best regards, Oliver

Nico Schottelius

2:36 p.m.

New subject: IPv6 BGP & kernel 4.19 (and upto 5.10.46)

Hey Oliver, Oliver <bird-o@sernet.de> writes:

...

[...] Why is still the default value of net.ipv6.route.max_size still 4096? Compared to IPv4 value: net.ipv4.route.max_size = 2147483647

Has someone done more research on this topic?

I believe this is a question that should be asked on the LKML or linux-net mailing list. It's very valid and I'd be in favor of aligning it with the IPv4 value.

Cheers, Nico

-- Sustainable and modern Infrastructures by ungleich.ch

39 comments

participants (13)

Alarig Le Lay
Andrew Hearn
Benedikt Neuffer
Clément Guivy
Daniel Suchy
Frederik Kriewitz
micah anderson
Nico Schottelius
Oliver
Ondrej Zajicek
Robert Sander
Stefan Jakob
Vincent Bernat