MultiBird on L2 - A crazy idea for Fail Over and Load Balancing

Sun Jan 24 11:48:58 CET 2021

On Fri, Jan 22, 2021 at 3:24 PM Douglas Fischer
<fischerdouglas at gmail.com> wrote:
>
> The big difference is the single point of failure.
>
> With a host doing that redirection, we will have load balancing and physically different boxes running multiple instances.
> But we will still suffer from a single point of failure.
>
> To solve that, we will need another layer of redundancy for that redirector layer.
> And at least two other machines will be necessary, or even more, depending on how strong the need for redundancy is.
> More things to deal with...

I still don't quite get it. Yes, if you place one balancer, you have a
single point of failure. But you can make backup balancers.
So you have your pool of bird daemons on balancers or somewhere else,
then you have several balancers that balance connections to birds and
those are redundant, so that when one fails, the balancing function is
moved to another.
Balancers can be made redundant with VRRP for example and then route
connections to the pool of birds. What is missing here for your aim?
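
A minimal sketch of that setup in Python, only to illustrate the idea. All
names here (lb1, bird-a, the priority values) are made up for the example,
and the election is a simplification of VRRP: it only compares priorities
and ignores timers, preemption and address tie-breaks.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Balancer:
    name: str          # hypothetical balancer name
    priority: int      # higher value wins, as with VRRP priorities
    alive: bool = True


def elect_master(balancers):
    """Return the live balancer with the highest priority, or None."""
    live = [b for b in balancers if b.alive]
    return max(live, key=lambda b: b.priority, default=None)


def pick_bird(peer_ip, birds):
    """Map a peer to one bird instance using a stable hash of its address."""
    digest = hashlib.sha256(peer_ip.encode()).digest()
    return birds[int.from_bytes(digest[:4], "big") % len(birds)]


balancers = [Balancer("lb1", 200), Balancer("lb2", 150), Balancer("lb3", 100)]
birds = ["bird-a", "bird-b", "bird-c"]

master_before = elect_master(balancers).name   # "lb1" while everything is up
balancers[0].alive = False                     # lb1 fails
master_after = elect_master(balancers).name    # "lb2" takes over
```

The peer-to-bird mapping depends only on the peer address and the pool, so
whichever balancer is master sends a given peer to the same bird.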

>
>
>
> My idea is to combine redundancy and load balancing in a single-layer solution.
>
> An example of the concept:
> A scenario of an IXP with 2000 participants and 15 facilities, 5 of which are designed for computing resources.
> (To keep the example simple, let's consider just one Route-Server. To have two Route-Servers, just double the recipe.)
> - Slice the 2000 peers into 5 resource pools, according to the facilities where those peers are connected.
> - 7 Route-Servers, 5 of them being the primary of each resource pool, and the other two being the secondary and tertiary failover.
> - All the Route-Servers with the same route policies and peer configuration, provisioned by a central CI/CD.
> - Adjust the heartbeat to deal with those resource pools.
>
> In this scenario, in an event where a facility (in Brazil the common name is PIX) becomes isolated from the rest of the IXP LAN mesh, the participants in that facility will still be exchanging routes with each other.
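
If I read the example correctly, the failover order per resource pool could
be sketched like this in Python. The pool names (pix1..pix5) and server
names (rs1..rs7) are hypothetical, and I assume the two extra Route-Servers
are shared as secondary and tertiary by all pools:

```python
# Hypothetical names: 5 resource pools, one per computing facility.
POOLS = ["pix1", "pix2", "pix3", "pix4", "pix5"]

# rs1..rs5 are the primaries, one per pool.
PRIMARY = {pool: f"rs{i}" for i, pool in enumerate(POOLS, start=1)}

# rs6 and rs7 are assumed to be the shared secondary and tertiary.
BACKUPS = ["rs6", "rs7"]


def preference(pool):
    """Ordered list of Route-Servers that may serve a pool."""
    return [PRIMARY[pool]] + BACKUPS


def active_server(pool, reachable):
    """First Route-Server in the pool's preference list that is reachable."""
    for rs in preference(pool):
        if rs in reachable:
            return rs
    return None  # nothing left to serve this pool


all_up = {"rs1", "rs2", "rs3", "rs4", "rs5", "rs6", "rs7"}
normal = active_server("pix3", all_up)               # "rs3"
failover = active_server("pix3", all_up - {"rs3"})   # "rs6"
```

In the isolation case, `reachable` would be whatever Route-Servers sit inside
the isolated facility, so the pool keeps working only if one of its
preferred servers is physically there.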

You have several facilities in one L2 network segment? With VRRP, when
parts of it become isolated, I think, you will get a master in each
segment (split-brain) and participants in the isolated segment will
have their own master balancer, and if there are also some birds from
the pool in this facility, they will be able to exchange routes.
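
A small Python sketch of that split-brain behaviour. The balancer names and
priorities are made up, and the election again only compares priorities,
which is a simplification of VRRP:

```python
def masters_after_partition(priorities, segments):
    """Elect one master per L2 segment.

    priorities: dict mapping balancer name -> priority (higher wins)
    segments:   list of lists, each holding the balancers that can
                still hear each other's advertisements
    """
    return [max(seg, key=priorities.__getitem__) for seg in segments if seg]


priorities = {"lb1": 200, "lb2": 150, "lb3": 100}

# Healthy network: everyone hears everyone, so there is a single master.
healthy = masters_after_partition(priorities, [["lb1", "lb2", "lb3"]])

# The facility hosting lb3 is isolated: lb3 stops hearing lb1 and
# promotes itself, so each side has its own master.
split = masters_after_partition(priorities, [["lb1", "lb2"], ["lb3"]])
```

Here `healthy` is `["lb1"]` and `split` is `["lb1", "lb3"]`, so the isolated
facility still has a working balancer for its local participants.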

>
>
>
> On Wed, Jan 20, 2021 at 14:25, Alexander Zubkov <green at qrator.net> wrote:
>>
>> Hi,
>>
>> Thank you for the link, I looked at the presentation. It looks like
>> what I thought it was (and you explained it the same way) - balancing
>> connections to multiple Bird instances. So I am still a little
>> confused about the features you want Bird to support in relation to
>> that. Because most of the "magic" you want - like HA on L2/L3, running
>> on different machines etc, can still be done by other networking tools
>> and configuration. And if you have some real case where you have
>> problems with that, it looks quite interesting to me and I could try to help
>> you with the configuration if you wish and maybe we could even make
>> some useful case of it for other bird users.
>>
>> On Tue, Jan 19, 2021 at 10:01 PM Douglas Fischer
>> <fischerdouglas at gmail.com> wrote:
>> >
>> > Vertical scalability of Route-Servers on very large IXPs is a challenge!
>> > We are talking about 400-2200 peers...
>> > https://ixpdb.euro-ix.net/en/ixpdb/ixps/?sort=participants&reverse=1&
>> >
>> > As already mentioned, Bird still does not deal very well with multi-threading (even on version 2).
>> > So, for that, fewer threads with more gigahertz are better than many threads.
>> >
>> > In Bird's world, the solution for that is the MultiBird.
>> > That solution is explained here:
>> >  -> https://www.euro-ix.net/media/filer_public/40/8b/408bd0bb-6835-4807-8677-0a1961bd3fba/flock-of-birds_ljtmypd.pdf
>> >  -> https://www.youtube.com/watch?v=dwRwF7Bu8as
>> >     In pt_BR, but I believe that activating the automatic subtitles and automatic translation to your language will be enough to understand it.
>> >
>> >
>> > What I'm proposing here is just a different method of running multiple instances of Bird, with the possibility of those being on different boxes, or even different sites.
>> >
>> >
>> >
>> > On Tue, Jan 19, 2021 at 12:22, Alexander Zubkov <green at qrator.net> wrote:
>> >>
>> >> But you wrote that for scaling there are load balancers to balance
>> >> sessions among different bird instances. So VRRP + Load Balancer will
>> >> give you what you want. You can also try to bind several birds to a
>> >> single address in Linux (probably a little patching is required to set
>> >> socket options) and Linux will balance sessions between them. You may
>> >> also want to exchange routing information somehow between your bird
>> >> instances, but I think it also can be solved somehow, a couple of
>> >> route reflectors for example.
>> >> I still do not understand what you want to see in Bird itself. I
>> >> haven't run large IXPs, so I may not be aware of something and would
>> >> be glad if you explained it in more detail.
>> >>
>> >> On Tue, Jan 19, 2021 at 3:22 PM Douglas Fischer
>> >> <fischerdouglas at gmail.com> wrote:
>> >> >
>> >> > As I mentioned initially, my focus was on "large environments of IXPs".
>> >> > Considering that, L3 anycast does not apply very well to that scenario.
>> >> > (I don't know any IXPs that use Route-Servers outside of the MLPA-LAN of the IXP.)
>> >> >
>> >> > Using VRRP is an excellent method to provide fail-over on L2.
>> >> > (I used it a lot on several application scenarios).
>> >> > But it does not provide load-balancing, just fail-over.
>> >> >
>> >> > Considering "large environments of IXPs", and the fact that even on Bird 2 the multi-thread limitation is not completely solved,
>> >> > the solution for that is load balancing. MultiBird does it VERY WELL.
>> >> > But until now we (at least I) have seen only "single-host" based solutions, using NAT/forwarding of connections.
>> >> >
>> >> > With this suggestion, using L2 load balancing based on MAC-IP mapping manipulations, it is possible to remove the "single-host" point of failure.
>> >> >
>> >> > On Tue, Jan 19, 2021 at 10:48, Alexander Zubkov <green at qrator.net> wrote:
>> >> >>
>> >> >> Hi,
>> >> >>
>> >> >> You can use VRRP or a similar protocol on L2 or dynamic routing with
>> >> >> anycast on L3 for reliability. I do not see what you want in Bird.
>> >> >> Could you explain more?
>> >> >>
>> >> >> On Tue, Jan 19, 2021 at 1:26 PM Douglas Fischer
>> >> >> <fischerdouglas at gmail.com> wrote:
>> >> >> >
>> >> >> > I was studying the concepts of multi-bird for large environments of IXPs.
>> >> >> >
>> >> >> > And, beyond the extra complexity that it brings to the environment, one of the weak points I saw was the fact that all the Bird instances are on the same box (VM, container, etc.).
>> >> >> >
>> >> >> > A friend mentioned that some tests were made with a load balancer redirecting the post-NAT connections to other boxes.
>> >> >> > But even in that scenario, that load balancer would be a single-point-of-failure/bottleneck.
>> >> >> >
>> >> >> > So I remembered Cisco GLBP and the Heart-Beat protocol.
>> >> >> > Those protocols answer with different MAC addresses for the same IPv4/IPv6 address, based on the source of the ARP/ND query,
>> >> >> > making load balancing/fail-over based on the glue between layer 2 and layer 3.
>> >> >> > P.S.: Several scenarios use that concept: Corosync, Windows Cluster, Oracle RAC, etc.
>> >> >> >
>> >> >> > Considering that concept, and joining it with multibird:
>> >> >> >  It would be possible to create groups of sources and assign different priorities to those groups on each instance of Bird.
>> >> >> >  In this case, each Bird instance could run on a different box, or even on a different site.
>> >> >> >
>> >> >> > Beyond that, on IXPs with a large number of participants, it would be possible to define some affinity for those priority groups based, for example, on the facility where those participants are connected.
>> >> >> >
>> >> >> > I have a feeling that this would be especially useful for remote peering scenarios.
>> >> >> >
>> >> >> >
>> >> >> > Just a crazy idea to share with colleagues.
>> >> >> > Maybe something good could arise from here.
>> >> >> >
>> >> >> >
>> >> >> > --
>> >> >> > Douglas Fernando Fischer
>> >> >> > Engº de Controle e Automação