BIRD patches for IP-in-IP

Tue Sep 27 18:15:24 CEST 2016

Hi, and thanks for your answer!

Yes, we can certainly use that approach instead.  In some of our testing we
use L2TP to create tunnels as you suggest, and then run BIRD through those
tunnels.  This approach doesn't require any BIRD modification.

However, the big advantage of the IP-in-IP approach is that it doesn't
require us to allocate (and manage) an extra IP address on every compute
host (or to do the L2TP tunnel setup, of course).  That makes many
deployments a lot simpler - and so possibly justifies the kind of BIRD
enhancement that I've described?

Regards,
    Neil

On Tue, Sep 27, 2016 at 4:51 PM Baptiste Jonglez <
baptiste at bitsofnetworks.org> wrote:

> Hi,
>
> On Tue, Sep 27, 2016 at 03:09:52PM +0000, Neil Jerram wrote:
> > Attached are 3 patches that my team has been using for routing through
> > IP-in-IP tunnels, rebased on 1.6.1.  I'd like to explain why we find them
> > useful, and start a conversation about whether they or something like them
> > could be upstreamed (or perhaps if there's some better way of achieving our
> > aims).
> >
> > Calico [1] uses BIRD for BGP routing between the hosts in various cloud
> > orchestration systems (Kubernetes, OpenStack etc.), to distribute routes to
> > the pods/VMs/containers in those systems, each of which has its own IP.  If
> > all the hosts are directly connected to each other, this is
> > straightforward, but sometimes they are not.  For example GCE instances are
> > not directly connected to each other: there is at least one router between
> > them, that knows about routing GCE addresses, and to/from the Internet, and
> > we cannot peer with it or otherwise tell it how to route pod/VM/container
> > IPs.  So if we use GCE to create e.g. OpenStack compute hosts, with Calico
> > networking, we need to do something extra to allow VM-addressed data to
> > pass between the compute hosts.
> >
> > One of our solutions is to use IP-in-IP; it works as shown by this diagram:
> >
> >        10.65.0.3 via 10.240.0.5 dev tunl0 onlink
> >        default via 10.240.0.1
> >                |
> >              +-|----------+                             +------------+
> >              | o          |                             |            |
> >              |   Host A   |         +--------+          |   Host B   |
> >              |            |---------| Router |----------|            |
> >              | 10.240.0.4 |         +--------+          | 10.240.0.5 |
> >              |            |---.                         |            |
> >              +------------+    |                        +------------+
> >                ^       ^   +---v---+                                |
> >  src 10.65.0.2 |       |   | tunl0 |                                |
> >  dst 10.65.0.3 |       |   +-------+                                |
> >                |        \      |                                    v
> >          +-----------+   '----'                               +-----------+
> >          |   Pod A   |      src 10.240.0.4                    |   Pod B   |
> >          | 10.65.0.2 |      dst 10.240.0.5                    | 10.65.0.3 |
> >          +-----------+          ------                        +-----------+
> >                              src 10.65.0.2
> >                              dst 10.65.0.3
>
> Can't you just use a tunnel between Host A and Host B and run BGP on top
> of this tunnel?  It would seem to be cleaner than hacking multi-hop BGP to
> obtain appropriate next-hop values, unless I am missing something.
>
> It would look something like this:
>
>              +-|----------+                             +------------+
>              | o Host A   |                             |   Host B   |
>              |            |         +--------+          |            |
>              |  10.240.0.4|---------| Router |----------|10.240.0.5  |
>              |            |         +--------+          |            |
>              |   10.65.0.4|--.  +-------+   +-------+ .->10.65.0.5   |
>              +------------+   `>| tunlA |-->| tunlB |-  +------------+
>                                 +-------+   +-------+
>
>
> The BGP session would be established between 10.65.0.4 (IP of host A on
> tunlA) and 10.65.0.5 (IP of host B on tunlB), so that the routes learnt
> via BGP would be immediately correct.
>
> Basically, it's a simple overlay network.
>
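
For concreteness, the BGP session described above might be configured
roughly as follows in BIRD 1.6 on Host A.  This is only a sketch: the AS
number 64512 and the protocol name are assumptions chosen for illustration,
and the tunnel itself (tunlA, carrying 10.65.0.4) is assumed to be set up
outside BIRD.

```
# Host A; Host B would mirror this with the two tunnel addresses swapped.
protocol bgp host_b {
    local as 64512;               # assumed private AS, same on both hosts (iBGP)
    source address 10.65.0.4;     # Host A's address on tunlA
    neighbor 10.65.0.5 as 64512;  # Host B's address on tunlB
    import all;
    export all;                   # a real deployment would filter here
}
```

Because the session endpoints are the tunnel addresses, the next hops
learnt this way already point into the tunnel, which is why this variant
needs no BIRD changes.
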
> > The diagram shows Pod A sending a packet to Pod B, using IP addresses that
> > are unknown to the 'Router' between the two hosts.  Host A has an IP-in-IP
> > device, tunl0, and a route that says to use that device for data to Pod B's
> > address (10.65.0.3).  When the packet has passed through that device, it
> > has a new outer IP header, with src 10.240.0.4 and dst 10.240.0.5, and is
> > routed again according to the routing table - so now it can successfully
> > reach Host B.
> >
> > So how is BIRD involved?  We statically program the local Pod route on each
> > host:
> >
> > On Host A: 10.65.0.2 dev <interface to Pod A>
> > On Host B: 10.65.0.3 dev <interface to Pod B>
> >
> > then run a BIRD BGP session between Host A and Host B to propagate those
> > routes to the other host - which would normally give us:
> >
> > On Host A: 10.65.0.3 via 10.240.0.5
> > On Host B: 10.65.0.2 via 10.240.0.4
> >
> > But we don't want those normal routes, because then the data would get lost
> > at 'Router'.  So we enhance and configure BIRD as follows.
> >
> > - In the export filter for protocol kernel, for the relevant routes, we set
> > an attribute 'krt_tunnel = tunl0'.
> >
> > - We modify BIRD, as in the attached patches, to understand that that means
> > that those routes should have 'dev tunl0'.
> >
> > Then instead, we get:
> >
> > On Host A: 10.65.0.3 via 10.240.0.5 dev tunl0 onlink
> > On Host B: 10.65.0.2 via 10.240.0.4 dev tunl0 onlink
> >
> > which allows successful routing of data between the Pods.
> >
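
The export-filter step above could look roughly like this in the BIRD
configuration.  This is a sketch only: 'krt_tunnel' is available only with
the attached patches applied, and both the quoted-string form of its value
and the 10.65.0.0/16 pod prefix are assumptions made for illustration.

```
protocol kernel {
    export filter {
        # Assumed pod range; routes inside it are installed with 'dev tunl0'.
        if net ~ 10.65.0.0/16 then krt_tunnel = "tunl0";
        accept;
    };
}
```

Routes outside the assumed pod range are exported to the kernel unchanged,
so ordinary host routes are not sent through the tunnel.
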
> >
> > Thanks for reading this far!  I now have three questions:
> >
> > 1. Does the routing approach above make sense?  (Or is there some better or
> > simpler or already supported way that we could achieve the same thing?)
> >
> > 2. If (1), would the BIRD team accept patches broadly on the lines of those
> > that are attached?
> >
> > 3. If (2), please let me know if the attached patches are already
> > acceptable, or otherwise what further work is needed for them.
> >
> > Many thanks,
> >     Neil
>