System requirements for BIRD

Sergey Popovich popovich_sergei at mail.ru
Mon Dec 23 16:03:33 CET 2013


In a message dated 23 December 2013 14:08:26, Ruben Laban wrote:
> Hi Sergey,
> 
> On 23-12-2013 13:53, Sergey Popovich wrote:
> > Also having threads (Hyper-Threading) has no good effect on traffic
> > forwarding workloads, as CPU cache is shared between physical cores,
> > causing more cache misses and thus lower performance.
> 
> This one I should keep in mind myself, hadn't really thought of that
> one. While we're at this topic, do you happen to have some more
> (effective) changes/configurations to improve your max throughput?
> 
> For instance, you mentioned the disabling of connection tracking. Which
> method do you use for that? Blacklist the modules or smart use of the
> NOTRACK netfilter target?

We do not use NAT/PAT or a stateful firewall on forwarded packets (for
input we have one, to control access to management), so using conntrack
is useless and moreover harmful, as it stores/updates flow state in internal
hash tables and examines every packet arriving on the machine.

Routing information (routes) in modern kernels is stored in a trie structure
(the same one used in BIRD), which is faster than hash tables.

Having something like

iptables -A PREROUTING -t raw -m addrtype ! --dst-type LOCAL -j CT --notrack

to disable tracking of forwarded packets is good enough.

The NOTRACK target is obsolete by now (though it might still be supported); its
functionality has been superseded by the CT target.
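
On older kernels without the CT target, the same rule was written with NOTRACK, e.g.:

iptables -A PREROUTING -t raw -m addrtype ! --dst-type LOCAL -j NOTRACK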

Tuning the conntrack (nf_conntrack_*) variables is useless, as we are going to track
only traffic destined for the machine itself (the INPUT chain in iptables).

For example, in my common deployments as an access server there are only 3 rules in
the FORWARD chain of the filter table (2 to establish two-way filtering on customer
request using ipset, hash:net,port,net or hash:ip,port,net, and 1 as a billing hook)
and 2 in the PREROUTING chain of the raw table (1 - the rpfilter match, uRPF, 2 -
notrack). How to filter traffic going to the router itself in the INPUT chain is left
up to you, as it is not performance critical.
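
A rough sketch of such a ruleset follows; the set names, the match directions and the
billing hook (NFLOG here) are just placeholders for illustration:

# raw table: uRPF check, then no conntrack for forwarded packets
iptables -t raw -A PREROUTING -m rpfilter --invert -j DROP
iptables -t raw -A PREROUTING -m addrtype ! --dst-type LOCAL -j CT --notrack

# filter table, FORWARD: two-way customer filtering via ipset plus a billing hook
ipset create cust-out hash:ip,port,net
ipset create cust-in hash:net,port,net
iptables -A FORWARD -m set --match-set cust-out src,dst,dst -j DROP
iptables -A FORWARD -m set --match-set cust-in src,dst,dst -j DROP
iptables -A FORWARD -j NFLOG --nflog-group 1   # accounting/billing hook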

1. You may get more benefit from tuning the Linux routing cache (which was removed
starting from the 3.6 tree); there should be information about the routing cache
on the web. In general: the routing cache is bad, avoid it by using kernels
starting from 3.8 (earlier ones might have performance regressions due to the cache
removal), as it performs poorly under DoS.

Here is an example of how I tuned the routing cache for the 3.2 series on a system with
2GB of RAM:

# RT cache
net/ipv4/route/gc_thresh = 786432
net/ipv4/route/max_size = 1572864
net/ipv4/route/gc_min_interval = 0
net/ipv4/route/gc_min_interval_ms = 500
net/ipv4/route/gc_timeout = 600
net/ipv4/route/gc_interval = 300
net/ipv4/route/gc_elasticity = 4
#net/ipv4/route/mtu_expires = 600
#net/ipv4/route/min_pmtu = 552
#net/ipv4/route/min_adv_mss = 256

Please note that in IPv6 the routing cache does not exist and some parameters (if
they exist at all) have a different meaning!

The gc_* variables control the garbage collector parameters.
max_size is the maximum number of entries in the hash table.

And one more interesting parameter:

rt_cache_rebuild_count - by default it equals 4, but the value -1
turns the routing cache off. Turning off the routing cache might have a positive effect
on performance (though this depends on the traffic patterns in your network).
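
If you are still on a kernel that has the routing cache, this can go into the same
sysctl file as above (I believe the variable lives directly under net/ipv4, check
ip-sysctl.txt for your kernel):

# turn the routing cache off completely
net/ipv4/rt_cache_rebuild_count = -1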

Details on these and other parameters can be found in
Documentation/networking/ip-sysctl.txt in the Linux kernel tree.

See lnstat(8) to obtain various network statistics (including routing cache
information under the rt_cache file).
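
For example, to dump the routing cache counters once (field names vary between
kernel versions):

lnstat -f rt_cache -c 1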

Tuning other sysctl variables (e.g. those not related to the routing cache, or for the L3
stack - ARP, Proxy ARP, turning forwarding on/off on a per-interface basis, etc.)
seems useless (especially the tcp_* variables, related to the TCP implementation :-)).

2. Another important point: do not use too many iptables rules in FORWARD and
PREROUTING - iptables in the Linux kernel is implemented as a list of chains, traversed
sequentially. There is no cache in the firewall implementation, no shortcuts,
just a linear lookup of each entry in the chain.

Use ipset(8), which stores data in bitmaps and hashes for performance, to hold large
numbers of IPs and match them with a single iptables rule.
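
For instance, thousands of prefixes collapse into a single rule (names and prefixes
below are made up):

ipset create blackhole hash:net
ipset add blackhole 192.0.2.0/24
ipset add blackhole 198.51.100.0/24
iptables -A FORWARD -m set --match-set blackhole dst -j DROP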

Implementing QoS (shaping, policing, etc.) requires packet classification.
Typically classification is done using the src/dst addresses from the packet. Use
hashed mode with the cls_u32 or cls_flow classifiers; if possible, with cls_flow
use classification based on realms (which are set during packet forwarding in the
IP stack and thus require no additional lookup into the skb in cls_flow).

Use pfifo as the qdisc on leaf classes.
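
A minimal sketch of realm-based classification with cls_flow (rates, class ids and the
realm number are made up; the flow key for realms is rt-classid):

# set a realm during route lookup, so forwarding already tags the packet
ip route add 203.0.113.0/24 via 192.0.2.1 realm 10

# HTB tree with a pfifo leaf
tc qdisc add dev eth0 root handle 1: htb default 10
tc class add dev eth0 parent 1: classid 1:1 htb rate 1gbit
tc class add dev eth0 parent 1:1 classid 1:10 htb rate 100mbit ceil 1gbit
tc qdisc add dev eth0 parent 1:10 handle 10: pfifo limit 1000

# map the routing realm straight to a class, no extra header lookup
tc filter add dev eth0 parent 1: protocol ip prio 1 \
    flow map key rt-classid baseclass 1:10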

In general: using shaping instead of policing at an ISP is a bad idea, as it
requires a lot more resources than simply policing traffic and thus
increases the load on your routers (each packet has to be enqueued/dequeued on the router).
Shaping also increases latency (though not by much for real workloads, in my experience).

However, many people like shaping, as it causes less packet loss for end users
due to its queuing nature. Also, the commonly deployed scheduling module HTB is
proven, mature and well maintained in the kernel.

Regarding HTB: use sch_htb from the latest kernels, at least with upstream
commit 64153ce0a7b6 (net_sched: htb: do not setup default rate estimators).
This commit disables the setup of rate estimators by default (no real-time
statistics about the packet rate in a class), but greatly reduces the amount of
locking in the queuing code and prevents cache starvation. Really good for performance.
Optionally, rate estimators can be enabled/disabled at runtime
(htb_rate_est). Also, newer sch_htb implementations have more precise traffic
accounting (in particular, TSO accounting is fixed).
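
If you do want the per-class rate statistics back, the commit above added (as far as
I can tell) an htb_rate_est module parameter that can be flipped, e.g.:

# re-enable HTB rate estimators (module parameter introduced by 64153ce0a7b6)
modprobe sch_htb htb_rate_est=1
# or at runtime, if the parameter is writable on your kernel:
echo 1 > /sys/module/sch_htb/parameters/htb_rate_est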

So the most important part of QoS is classification.

3. Interrupt balancing

Using a NIC with hardware multiqueue support and proper interrupt load
balancing across all cores in the system is essential.

Be aware that when using 802.1ad (QinQ) your NIC seems to lose the ability to
distribute received packets to multiple queues, along with other
hardware offloading capabilities (IP packet checksum offloading, etc.).
However, in my practice, on servers with 8 NICs (4 up, 4 down, aggregated
with LACP via the bonding functionality in the kernel) and a 4-core CPU the impact is
smaller, as RX-0 of each NIC is bound to a separate core.

Interrupts can be balanced manually (by a custom-written script) or by using
irqbalance. Of course, many places recommend balancing IRQs manually,
but I see no reason why this should be done or why it would be best practice.

In my own experience, the latest versions of irqbalance do their job very well.
However, kernel support for the MSI-to-IRQ assignment mapping in sysfs is a
MUST, otherwise your interrupts get balanced with much less success (see the
output of irqbalance -d on the console; it warns about missing support from kernel
sysfs).
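
For reference, manual balancing usually boils down to writing CPU masks into
/proc/irq/<N>/smp_affinity; the IRQ numbers and interface name below are placeholders:

# find the per-queue interrupts of eth0
grep eth0 /proc/interrupts

# pin eth0's RX-0 (say IRQ 45) to CPU0 and eth1's RX-0 (IRQ 52) to CPU1
echo 1 > /proc/irq/45/smp_affinity
echo 2 > /proc/irq/52/smp_affinity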

4. Use 1G NICs in bonding instead of 10G where appropriate.

Yes, bonding, like any other part of a software router, is implemented in software.
But it seems relatively lightweight in terms of CPU.

To utilize 10G with shaping and a firewall you need a much more powerful CPU with more
cores. Even having one would not help much, as locking in the QoS schedulers may
become a bottleneck. There is work in progress in the kernel to introduce HTB on top of
the multiq scheduler.

Using the layer2+3 hash policy is sufficient (and better than layer3+4, which is not
fully 802.3ad compliant in terms of load distribution) and provides good load
sharing across the slaves.
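
A sketch of such a bond using the in-kernel bonding module (interface names are
assumptions):

# 802.3ad (LACP) bond hashing on layer2+3
modprobe bonding mode=802.3ad miimon=100 xmit_hash_policy=layer2+3
ip link set bond0 up
echo +eth0 > /sys/class/net/bond0/bonding/slaves
echo +eth1 > /sys/class/net/bond0/bonding/slaves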

5. Use NAT/PAT in separate network namespace.

NAT/PAT requires conntrack, and enabling conntrack for forwarded traffic is not
good in performance terms.

Linux has had support for network namespaces for years. This allows you to
set up a new copy of the network stack, attach network interfaces to it, and
configure IP addresses (even overlapping ones) and iptables/ipset rules (the latter
only on very recent kernels) separately from the core system.

See ip-netns(8) for more information on how to use this.
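
A rough sketch of pushing NAT into its own namespace (interface names and addresses
are made up for illustration):

ip netns add nat
# veth pair to reach the NAT namespace from the core system
ip link add to-nat type veth peer name from-core
ip link set from-core netns nat
ip addr add 192.0.2.1/30 dev to-nat
ip link set to-nat up
ip netns exec nat ip addr add 192.0.2.2/30 dev from-core
ip netns exec nat ip link set from-core up
# move the NAT uplink NIC into the namespace and do conntrack/NAT only there
ip link set eth2 netns nat
ip netns exec nat sysctl -w net.ipv4.ip_forward=1
ip netns exec nat iptables -t nat -A POSTROUTING -o eth2 -j SNAT --to-source 198.51.100.10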

6. Full view (FV) vs default + local networks.

This is more a question of RAM, as modern Linux kernels use a trie to store routing
table entries. I see very little performance impact of a full view vs local networks
only (less than 1% of CPU). A full view consumes nearly 60MB in kernel trie structures.
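
The trie memory usage can be checked directly on a running router (output format
differs between kernel versions):

cat /proc/net/fib_triestat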

Using default + local networks seems more than enough to perform routing and
policy on traffic.

7. Tuning igb(7) driver parameters.

The original Intel igb driver has a number of options which are not present in
the mainline kernel driver.

However, changing some of them could increase packet latencies, which is not
acceptable for modern realtime applications, so I would not recommend
touching these parameters unless you have a good reason.

They have reasonable default values.
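
You can list the available parameters and their defaults before touching anything
(parameter names differ between the out-of-tree and the in-kernel driver):

modinfo igb | grep parm
ls /sys/module/igb/parameters/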

----

There are lots of other things to consider, but these are the essential ones.


> 
> I'm currently about to replace some of my software routers, and have
> been looking around for optimizations like these.
> 
> Regards,
> Ruben

-- 
SP5474-RIPE
Sergey Popovich


