System requirements for BIRD

Ruben Laban r.laban+lists at ism.nl
Mon Dec 23 16:26:58 CET 2013


Wow, thanks a lot for the extensive response, Sergey! This gives me 
plenty of stuff to research further.

I like the iptables addrtype "trick". I was focusing on how to maintain 
a list of local IPs to exclude from conntrack. But this is way cleaner, 
great!

This might even be a good start for an optimizations page on the Bird wiki?

Regards,
Ruben

On 23-12-2013 16:03, Sergey Popovich wrote:
> On 23 December 2013 14:08:26, Ruben Laban wrote:
>> Hi Sergey,
>>
>> On 23-12-2013 13:53, Sergey Popovich wrote:
>>> Also having threads (Hyper-Threading) has no positive effect on traffic
>>> forwarding workloads, as the CPU cache is shared between the logical
>>> threads of a physical core, causing more cache misses and thus lower
>>> performance.
>>
>> This one I should keep in mind myself, I hadn't really thought of that
>> one. While we're on this topic, do you happen to have some more
>> (effective) changes/configurations to improve maximum throughput?
>>
>> For instance, you mentioned disabling connection tracking. Which
>> method do you use for that? Blacklisting the modules or smart use of the
>> NOTRACK netfilter target?
>
> We do not use NAT/PAT or a stateful firewall on forwarded packets (for
> input we have one, to control access to management), so using conntrack
> is useless and even harmful: it stores and updates per-flow state in internal
> hash tables and examines every packet arriving on the machine.
>
> Routing information (routes) in modern kernels is stored in a trie structure
> (the same as used in BIRD), which is faster than hash tables.
>
> Having something like
>
> iptables -A PREROUTING -t raw -m addrtype ! --dst-type LOCAL -j CT --notrack
>
> to disable tracking of forwarded packets is good enough.
>
> The NOTRACK target is obsolete now (though it might still be supported); its
> functionality has been superseded by the CT target.
>
> Tuning the conntrack (nf_conntrack_*) variables is then unnecessary, as we are
> only going to track traffic destined for the machine itself (the INPUT chain
> in iptables).
>
> For example, in my common deployments as an access server there are only
> 3 rules in the FORWARD chain of the filter table (2 to establish two-way
> filtering on customer request using ipset, hash:net,port,net or
> hash:ip,port,net, and 1 as a billing hook) and 2 rules in the PREROUTING
> chain of the raw table (1 - rpfilter match for uRPF, 2 - notrack). How to
> filter traffic going to the router itself in the INPUT chain is left to you,
> as it is not performance critical.
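>
> A rough sketch of that layout (the set name, match directions, action and the
> billing chain are just examples and depend on local policy):
>
> # raw table: uRPF check plus the notrack rule shown above
> iptables -t raw -A PREROUTING -m rpfilter --invert -j DROP
> iptables -t raw -A PREROUTING -m addrtype ! --dst-type LOCAL -j CT --notrack
> # filter table: two-way per-customer filtering via one ipset, plus a billing hook
> ipset create cust-filter hash:net,port,net
> iptables -A FORWARD -m set --match-set cust-filter src,dst,dst -j DROP
> iptables -A FORWARD -m set --match-set cust-filter dst,src,src -j DROP
> iptables -N BILLING
> iptables -A FORWARD -j BILLING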
>
> 1. You may get more benefit from tuning the Linux routing cache (which was
> removed starting from the 3.6 tree); there should be information about the
> routing cache on the web. In general: the routing cache is bad, as it performs
> poorly under DoS, so avoid it by using kernels from 3.8 onwards (earlier
> kernels might have performance regressions due to the cache removal).
>
> Here is an example of how I tuned the routing cache for the 3.2 series on a
> system with 2 GB of RAM:
>
> # RT cache
> net/ipv4/route/gc_thresh = 786432
> net/ipv4/route/max_size = 1572864
> net/ipv4/route/gc_min_interval = 0
> net/ipv4/route/gc_min_interval_ms = 500
> net/ipv4/route/gc_timeout = 600
> net/ipv4/route/gc_interval = 300
> net/ipv4/route/gc_elasticity = 4
> #net/ipv4/route/mtu_expires = 600
> #net/ipv4/route/min_pmtu = 552
> #net/ipv4/route/min_adv_mss = 256
>
> Please note that IPv6 has no routing cache and some of these parameters (if
> they exist at all) have a different meaning!
>
> The gc_* variables control the garbage collector;
> max_size is the maximum number of entries in the hash table.
>
> And one more interesting parameter:
>
> rt_cache_rebuild_count - by default it equals 4, but setting it to -1 turns
> the routing cache off. Turning off the routing cache might have a positive
> effect on performance (this depends on the traffic patterns in your network).
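>
> On kernels that still have the routing cache (pre-3.6) this can be changed at
> runtime, for example:
>
> sysctl -w net.ipv4.rt_cache_rebuild_count=-1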
>
> Details on these and other parameters can be found in
> Documentation/networking/ip-sysctl.txt in the Linux kernel tree.
>
> See lnstat(8) to obtain various network statistics (including routing cache
> information from the rt_cache file).
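>
> For example, to watch the routing cache counters refreshed every 5 seconds
> (the interval is arbitrary):
>
> lnstat -f rt_cache -i 5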
>
> Tuning other sysctl variables (those not related to the routing cache, or L3
> stack settings - ARP, proxy ARP, turning forwarding on/off on a per-interface
> basis, etc.) seems useless for forwarding performance (especially the tcp_*
> variables, which relate to the TCP implementation :-)).
>
> 2. Another important point: do not use too many iptables rules in FORWARD and
> PREROUTING - iptables in the Linux kernel is implemented as a list of chains
> that are traversed sequentially. There is no cache in the firewall
> implementation and no shortcuts, just a linear walk over every entry in the
> chain.
>
> Use ipset(8), which stores its data in bitmaps and hashes for performance, to
> hold large numbers of IPs and match them with a single iptables rule.
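>
> For example (the set name and prefixes are illustrative), one set and one rule
> replace thousands of per-address rules:
>
> ipset create blocked hash:net
> ipset add blocked 198.51.100.0/24
> ipset add blocked 203.0.113.5
> iptables -A FORWARD -m set --match-set blocked src -j DROP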
>
> Implementing QoS (shaping, policing, etc.) requires packet classification.
> Typically classification is done using the src/dst addresses of the packet.
> Use hashed mode with the cls_u32 or cls_flow classifiers; if possible, with
> cls_flow use classification based on realms (the realm is set during packet
> forwarding in the IP stack, so cls_flow needs no additional lookup on the skb).
>
> Use pfifo as qdisc on leaf classes.
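>
> A minimal sketch of realm-based classification with cls_flow (the device,
> rates and class layout are just examples; the realms themselves are assumed
> to be assigned by the routing table, e.g. via ip route ... realm N or BIRD's
> krt_realm attribute):
>
> tc qdisc add dev eth1 root handle 1: htb default 10
> tc class add dev eth1 parent 1: classid 1:10 htb rate 100mbit
> tc qdisc add dev eth1 parent 1:10 handle 10: pfifo limit 1000
> # map the route realm (rt-classid key) onto class ids starting at the baseclass
> tc filter add dev eth1 parent 1: protocol ip prio 1 handle 1 flow map key rt-classid baseclass 1:10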
>
> In general: using shaping instead of policing as an ISP is a bad idea, as it
> requires a lot more resources than simply policing traffic and thus increases
> the load on your routers (every packet has to be enqueued/dequeued on the
> router). Shaping also increases latency (though not by much for real
> workloads, in my experience).
>
> However, many people like shaping, as its queuing nature causes less packet
> loss for end users. Also, the commonly deployed HTB scheduling module is
> proven, mature and well maintained in the kernel.
>
> Regarding HTB: use sch_htb from the latest kernels, at least with upstream
> commit 64153ce0a7b6 (net_sched: htb: do not setup default rate estimators).
> This commit disables the setup of rate estimators by default (so there are no
> real-time statistics about the packet rate of a class), but it greatly reduces
> locking in the queuing code and prevents cache starvation. Really good for
> performance. Rate estimators can optionally be enabled/disabled at runtime
> (htb_rate_est). Also, newer sch_htb implementations have more precise traffic
> accounting (in particular, TSO accounting is fixed).
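>
> If per-class rate statistics are still needed, the estimators can be turned
> back on via the module parameter introduced by that commit (assuming sch_htb
> is built as a module):
>
> echo 1 > /sys/module/sch_htb/parameters/htb_rate_est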
>
> So the most important part of QoS is classification.
>
> 3. Interrupt balancing
>
> Using NICs with hardware multiqueue support and proper interrupt load
> balancing across all cores in the system is essential.
>
> Be aware that when using 802.1ad (QinQ) your NIC seems to lose the ability to
> distribute received packets across multiple queues, along with other hardware
> offloading capabilities (IP checksum offloading, etc.).
> However, in my practice on servers with 8 NICs (4 up, 4 down, aggregated with
> LACP using the kernel's bonding functionality) and a 4-core CPU this has
> little impact, as the RX-0 queue of each NIC is bound to a separate core.
>
> Interrupts can be balanced manually (by a custom script) or by using
> irqbalance. Many places recommend balancing IRQs manually, but I see no
> reason why this should be done or why it would be best practice.
>
> In my own experience, recent versions of irqbalance do their job very well.
> However, kernel support for the MSI-to-IRQ assignment mapping in sysfs is a
> MUST; otherwise your interrupts get balanced with much less success (see the
> output of irqbalance -d on the console, it warns about missing support in the
> kernel's sysfs).
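>
> For reference, manual balancing boils down to writing CPU bitmasks to
> smp_affinity (the IRQ numbers and interface name below are examples, look
> them up in /proc/interrupts):
>
> grep eth0 /proc/interrupts
> echo 1 > /proc/irq/40/smp_affinity   # eth0-rx-0 -> CPU0
> echo 2 > /proc/irq/41/smp_affinity   # eth0-rx-1 -> CPU1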
>
> 4. Use 1G NICs in a bond instead of 10G where appropriate.
>
> Yes, bonding, like any other part of a software router, is implemented in
> software. But it seems relatively lightweight in terms of CPU.
>
> To utilize 10G with shaping and a firewall you need a much more powerful CPU
> with more cores. Even having one may not help, as locking in the QoS
> schedulers can become the bottleneck. There is work in progress in the kernel
> to introduce HTB on top of the multiqueue scheduler.
>
> Using the layer2+3 hash policy is sufficient (and better than layer3+4, which
> is not fully 802.3ad compliant in terms of load distribution) and provides
> good load sharing across the slaves.
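>
> A sketch of such a bond using the module options and sysfs (interface names
> are examples):
>
> modprobe bonding mode=802.3ad miimon=100 xmit_hash_policy=layer2+3
> ip link set eth0 down; echo +eth0 > /sys/class/net/bond0/bonding/slaves
> ip link set eth1 down; echo +eth1 > /sys/class/net/bond0/bonding/slaves
> ip link set bond0 up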
>
> 5. Use NAT/PAT in a separate network namespace.
>
> NAT/PAT requires conntrack, and enabling conntrack for forwarded traffic is
> not good in performance terms.
>
> Linux has had support for network namespaces for years. It allows you to
> create a new copy of the network stack, attach network interfaces to it, and
> configure IP addresses (even overlapping ones) and iptables/ipset rules (the
> latter only on very recent kernels) separately from the core system.
>
> See ip-netns(8) for more information on how to use this.
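>
> A minimal sketch (the namespace name, interface and address are examples):
>
> ip netns add nat
> ip link set eth3 netns nat
> ip netns exec nat ip addr add 192.0.2.1/24 dev eth3
> ip netns exec nat ip link set eth3 up
> ip netns exec nat sysctl -w net.ipv4.ip_forward=1
> ip netns exec nat iptables -t nat -A POSTROUTING -o eth3 -j MASQUERADE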
>
> 6. Full view (FV) vs. default + local networks.
>
> This is more a RAM question, as modern Linux kernels use a trie to store
> routing table entries. I see very little performance difference between a
> full view and local networks only (less than 1% of CPU). A full view consumes
> nearly 60 MB in the kernel's trie structures.
>
> Using a default route plus local networks seems more than enough to perform
> routing and apply policy on traffic.
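>
> The FIB trie size and memory usage can be checked on the router itself, e.g.:
>
> cat /proc/net/fib_triestat
> ip route show | wc -l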
>
> 7. Tuning igb(7) driver parameters.
>
> The original Intel igb driver has a number of options which are not present
> in the mainline kernel driver.
>
> However, changing some of them can increase packet latency, which is not
> good for modern real-time applications, so I would not recommend touching
> these parameters unless you have a specific reason.
>
> They have reasonable default values.
>
> ----
>
> There are lots of other things to consider, but these are the essential ones.
>
>
>>
>> I'm currently about to replace some of my software routers, and have
>> been looking around for optimizations like these.
>>
>> Regards,
>> Ruben
>



