Identifying BGP convergence bottleneck
Hi everyone!

I am currently looking into the performance of BIRD (BGP) as a route server with ~700 peers announcing 10k prefixes each. Convergence time increases as I increase these numbers, which is not a surprise. Currently, I force convergence by flapping the link, which causes all peers to start sending updates at the same time.

My particular interest is to find out which action specifically accounts for this convergence time. By convergence time I mean the time it takes BIRD to process all updates and drop from 100% CPU to a more "idle" level. To identify the cause, I want to mark the start and end of each phase described in RFC 4271, section 9.1 (Decision Process). This way I may be able to get an idea of where the CPU time is spent during all this (best path calculation, sending out updates, etc.).

Unfortunately, I do not have a deep enough understanding of the code and have not managed to identify these points. Is anyone here able to give some pointers as to where in the code we could place these markers to measure this? Other comments and insights are also welcome.

Kind regards,
Patrick
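To make the kind of marker being asked about concrete, here is a minimal sketch of per-phase CPU-time accounting in C. It is not BIRD code: the phase names and the helpers phase_begin(), phase_end() and phase_report() are hypothetical, the mapping of phases to RFC 4271 section 9.1 is only approximate, and where the calls would go in the BIRD source is exactly the open question. It assumes a POSIX system and a single-threaded daemon, so process CPU time is a reasonable proxy.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical phase labels, loosely following RFC 4271 section 9.1. */
enum phase { PHASE_RX_IMPORT, PHASE_BEST_PATH, PHASE_EXPORT_TX, PHASE_COUNT };

static const char *const phase_name[PHASE_COUNT] = {
  "rx/import", "best-path", "export/tx"
};

static double phase_total[PHASE_COUNT];          /* accumulated CPU seconds */
static struct timespec phase_start[PHASE_COUNT]; /* last begin timestamp */

static double seconds_between(const struct timespec *a, const struct timespec *b)
{
  return (double)(b->tv_sec - a->tv_sec) + (double)(b->tv_nsec - a->tv_nsec) / 1e9;
}

/* Record the CPU clock at the start of a phase. */
void phase_begin(enum phase p)
{
  assert(p < PHASE_COUNT);
  clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &phase_start[p]);
}

/* Add the CPU time consumed since the matching phase_begin(). */
void phase_end(enum phase p)
{
  struct timespec now;
  assert(p < PHASE_COUNT);
  clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &now);
  phase_total[p] += seconds_between(&phase_start[p], &now);
}

/* Print one line per phase with its accumulated CPU time. */
void phase_report(FILE *out)
{
  for (int p = 0; p < PHASE_COUNT; p++)
    fprintf(out, "%-10s %.6f s CPU\n", phase_name[p], phase_total[p]);
}
```

Wrapping a code path in phase_begin(PHASE_BEST_PATH) and phase_end(PHASE_BEST_PATH), then calling phase_report(stderr) once the CPU goes idle, would give a rough split between phases. Calling clock_gettime() for every update adds overhead of its own, so the absolute numbers should be read with care.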
On Thu, Jun 22, 2017 at 10:26:03AM +0200, Patrick.deNiet@os3.nl wrote:
Hi everyone!
I am currently looking into the performance of BIRD (bgp) as a route-server with ~700 peers with 10k prefixes each. I'm noticing an increase in convergence time as I increase these numbers (which is not a surprise).
Unfortunately, I do not have a deep enough understanding of the code and have not managed to identify these points. Is anyone here able to give some pointers as to where in the code we could place these markings to measure this?
Hi

It is hard to tell; it depends on your setup and configuration and would need some benchmarking. Is it a single-table or a multi-table route server [1][2]?

Personally, I would guess that it is in TX, because you receive a route and then send it ~700 times (but that depends on whether prefixes from different peers are shared or unique). You could benchmark it with 'export none' to see the difference.

Also note that the hash table used for the routing table has a maximum size of 64k buckets. You could try the attached patches to fix that. That may help significantly.

I would be interested in your performance results.

[1] https://gitlab.labs.nic.cz/labs/bird/wikis/Route_server_with_community_based...
[2] https://gitlab.labs.nic.cz/labs/bird/wikis/Route_server_with_community_based...

--
Elen sila lumenn' omentielvo

Ondrej 'Santiago' Zajicek (email: santiago@crfreenet.org)
OpenPGP encrypted e-mails preferred (KeyID 0x11DEADC3, wwwkeys.pgp.net)
"To err is human -- to blame it on a computer is even more so."
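For a rough sense of scale, the two effects mentioned in the reply (TX fan-out and the 64k bucket cap) can be put into numbers. The sketch below is back-of-the-envelope arithmetic only, not a measurement. It assumes the worst case where all prefixes are unique across peers, that the table is keyed by distinct prefix, that "64k" means 65536, and that every route is exported to every other peer. The function names are hypothetical.

```c
#include <stdint.h>

/* Average entries per bucket when n_entries are spread evenly over a
 * hash table whose size is capped at max_buckets (assumed even spread). */
double avg_chain_length(uint64_t n_entries, uint64_t max_buckets)
{
  return (double)n_entries / (double)max_buckets;
}

/* Upper bound on route advertisements sent by the route server when
 * every peer's prefixes are unique and go to all other peers. */
uint64_t tx_fanout(uint64_t peers, uint64_t prefixes_per_peer)
{
  return peers * prefixes_per_peer * (peers - 1);
}
```

With 700 peers and 10k prefixes each, avg_chain_length(7000000, 65536) comes out at about 107 entries per bucket, and tx_fanout(700, 10000) at 4,893,000,000 advertisements. If the peers instead announce the same 10k prefixes, the table holds only 10k distinct prefixes and stays well below the bucket cap, which is why the shared-versus-unique question matters.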