Re: Merging bird and bird6

12 Jul 2011

      Ondrej Zajicek wrote:
...
On Thu, Jul 07, 2011 at 03:12:15PM +0400, Alexander V. Chernikov wrote:
...
It depends on FIB/tricks implementation :)
Actually, I'm trying to discuss those implementation details.
At the moment we have at least the following non-standard things in the
world of routing:
* VPNv4 address family (RFC 4364) (8-byte route distinguisher, 4-byte
IPv4 address)
* VPNv6 address family (RFC 4659) (8-byte route distinguisher, 16-byte
IPv6 address)
* MPLS address family (RFC 3032) (size varies; it may be 16 bytes or
more, since label stack depth is implementation-specific. For example,
sizeof(sockaddr_mpls) is ~80 bytes in my implementation)
So you are doing some MPLS development for BIRD [*]? This is
interesting. We could definitely merge it when it is ready. Before
talking about implementation details I would like to know your
overall idea of how MPLS could be integrated into BIRD and how it will
interact with other BIRD parts. It seems that MPLS differs from
traditional routing in several ways, so it is not straightforward
(for example, how will MPLS 'routes' be represented in BIRD core: in
struct rte like common routes? In separate tables from IPv4/v6 routes?)
What are the interactions (in BIRD) between MPLS and IPv4/v6 routing?
What new concepts have to be introduced?
[*] Rhetorical question, I found http://freebsd.mpls.in/
To give the "overall" view we first have to describe what we will add
and what will be required from BIRD.

First of all, MPLS operates on labels (20-bit numbers). Labels can be
assigned to different entities (an IPv4/IPv6 prefix, for example). A
label is associated with an action to perform on the packet and has
_local_ significance. The following actions are defined:
* POP (pop one label from label stack)
* PUSH (add one or more labels to an existing packet)
* SWAP (replace top label)
Small picture illustrating:
http://upload.wikimedia.org/wikipedia/commons/e/eb/MPLS-swapping-071218.JPG

All packets entering an MPLS network are prepended with one or more
MPLS labels (1). The router doing this is called the Ingress LSR (Label
Switch Router) and is a PE (Provider Edge) router. After that, the
packet travels through the MPLS network via P (Provider) routers,
routed by its label (which gets rewritten on every router) (2). In the
end, the packet leaves the MPLS network on the Egress LSR, which is a
PE router, too (3). Appropriate signaling is needed to make all this
happen.
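
To make the three actions and steps (1)-(3) concrete, here is a minimal
sketch that treats a label stack as a plain list with the top label
first. The function names and the label values are made up for
illustration; this is not BIRD or kernel code.

```python
# Hypothetical helpers; a label stack is a list with the top label first.

def push(stack, *labels):
    """PUSH: add one or more labels; the first argument ends up on top."""
    return list(labels) + list(stack)

def pop(stack):
    """POP: remove the top label."""
    if not stack:
        raise ValueError("cannot pop from an empty label stack")
    return list(stack[1:])

def swap(stack, new_label):
    """SWAP: replace the top label and leave the rest untouched."""
    if not stack:
        raise ValueError("cannot swap on an empty label stack")
    return [new_label] + list(stack[1:])

# (1) the ingress LSR pushes, (2) a P router swaps, (3) the egress LSR pops.
packet = push([], 31)       # [31]
packet = swap(packet, 47)   # [47]
packet = pop(packet)        # []
```

Each helper returns a new list, so a router's rewrite never modifies the
stack it was handed.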

Label exchange can be done in different ways, for example:
* LDP (RFC 5036), an easy p2mp protocol, like OSPF
* RSVP-TE (RFC 3209), an extension to RSVP focused on provider features
(QoS, fast rerouting, tunnels for explicit traffic flow, ..)
* MP-BGP (RFC 4364), a BGP extension carrying labeled prefixes in the
VPNv4/VPNv6 address families (route targets travel in extended
communities)

I've got a more or less (actually less) working LDP implementation at
the moment.

MPLS labels can be (and will be!) stacked together. This is used to
provide services on an MPLS network: the top label is mostly used to
reach the destination PE router in the MPLS cloud, and the inner
label(s) are used for service identification.
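
As an illustration of stacking, the sketch below encodes a two-label
stack in the RFC 3032 wire format: each entry is 32 bits (20-bit label,
3-bit Exp, 1-bit bottom-of-stack flag, 8-bit TTL), and only the last
entry has the bottom-of-stack bit set. The function names and the
labels 31/47 are only examples.

```python
import struct

def encode_stack(labels, ttl=64, exp=0):
    """Encode labels (top first) as RFC 3032 label stack entries."""
    if not 0 <= ttl <= 255 or not 0 <= exp <= 7:
        raise ValueError("ttl must fit in 8 bits and exp in 3 bits")
    out = b""
    for i, label in enumerate(labels):
        if not 0 <= label < (1 << 20):
            raise ValueError("label must fit in 20 bits")
        bottom = 1 if i == len(labels) - 1 else 0   # S bit on the last entry
        word = (label << 12) | (exp << 9) | (bottom << 8) | ttl
        out += struct.pack("!I", word)
    return out

def decode_stack(data):
    """Return the labels (top first) up to the bottom-of-stack entry."""
    labels = []
    for off in range(0, len(data), 4):
        (word,) = struct.unpack_from("!I", data, off)
        labels.append(word >> 12)
        if (word >> 8) & 1:     # bottom of stack reached
            break
    return labels

# Top label 31 (reach the remote PE), inner label 47 (identify the service).
wire = encode_stack([31, 47])
```

Here `wire` is 8 bytes long and `decode_stack(wire)` gives back
`[31, 47]`.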

I will describe an L3 VPN setup from BIRD's point of view.
A very good and easy VPN explanation (using RSVP, but this doesn't
matter): http://www.ist-nobel.org/Nobel2/imatges/L3VPN_Training_course.pdf

* ABSTRACT VIEW

Imagine a provider network with P and PE routers. The IGP is OSPF and
LDP is enabled on all appropriate interfaces.

OSPF is running in the GRT (Global Routing Table), and LDP connects to
this table, too. LDP establishes relationships with all routers
(exactly like OSPF does) and begins exchanging LMAPs (label mappings),
each mapping a FEC (forwarding equivalence class; an IPv4/IPv6 prefix
for simplicity) to some number (label). Every router generates LMAPs
for every prefix in its GRT.
After an LMAP for some prefix is received and verified, we need to
notify the kernel that the route to the given prefix is MPLS-enabled
(case 1).
Additionally, we assign a local label to that prefix and install the
MPLS label with the IGP nexthop for this prefix into the kernel
(AF_MPLS "route") (case 2).
There are some special labels with pre-defined meaning.
Label 0, for example, is called "IPv4NULL" (IPv4 Explicit NULL in
RFC 3032). A router receiving a packet with this label pops it and, if
it was the last label on the stack, assumes the packet data to be an
IPv4 packet. Usual IP routing is then used to send the packet.
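
A rough sketch of the bookkeeping behind case 2 is given below. The
class and its methods are hypothetical, not an existing BIRD structure.
It assigns one local label per prefix from a configured range, records
the LMAPs received from neighbours, and produces an AF_MPLS-style entry
only once the IGP nexthop has advertised a label for that prefix.

```python
class Lib:
    """Toy label table for one router (all names are hypothetical)."""

    def __init__(self, first=20, last=4000):
        # Labels 0-15 are reserved by RFC 3032, so the range starts above them.
        self.next_label = first
        self.last = last
        self.local = {}    # prefix -> label we advertise to neighbours
        self.remote = {}   # (prefix, neighbour) -> label the neighbour sent

    def local_label(self, prefix):
        """Assign (once) a local label for a prefix taken from the GRT."""
        if prefix not in self.local:
            if self.next_label > self.last:
                raise RuntimeError("label range exhausted")
            self.local[prefix] = self.next_label
            self.next_label += 1
        return self.local[prefix]

    def learn(self, prefix, neighbour, label):
        """Record an LMAP received from a neighbour."""
        self.remote[(prefix, neighbour)] = label

    def mpls_route(self, prefix, igp_nexthop):
        """Return (in_label, "SWAP", out_label, nexthop), or None while the
        IGP nexthop has not sent an LMAP for this prefix."""
        out_label = self.remote.get((prefix, igp_nexthop))
        if out_label is None:
            return None
        return (self.local_label(prefix), "SWAP", out_label, igp_nexthop)
```

For example, after `lib.learn("10.0.0.0/24", "P1", 300)` a call to
`lib.mpls_route("10.0.0.0/24", "P1")` yields a SWAP entry from the
first local label to 300.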

Imagine now we have a customer asking for an L3 VPN for its 3 sites
connected to our ISP routers PE1, PE2 and PE3.

We now configure a separate routing table on those 3 PE routers.
A globally unique RD (Route Distinguisher) has to be assigned to this
VPN instance (assigned by the user). We then have to convert routes
received from the new routing table to VPNv4 routes with some custom
attributes (stored in ext communities) containing the RD and the label
assigned to this route, and vice versa.
We also have to notify the kernel (update the IP route and add an
AF_MPLS "route") for every prefix that needs it.
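
The address part of that conversion can be sketched as follows: an
8-byte RD followed by the 4-byte IPv4 address gives the 12-byte VPNv4
address. The sketch assumes a type 0 RD (2-byte type, 2-byte AS number,
4-byte assigned number, as in RFC 4364); the function names are made up
and the label/attribute handling is left out.

```python
import ipaddress
import struct

def pack_rd(asn, number):
    """Type 0 RD: 2-byte type (0), 2-byte AS number, 4-byte number."""
    return struct.pack("!HHI", 0, asn, number)

def to_vpnv4(rd, prefix):
    """Convert an IPv4 prefix to (12-byte VPNv4 address, prefix length)."""
    net = ipaddress.IPv4Network(prefix)
    return rd + net.network_address.packed, net.prefixlen

def from_vpnv4(addr, prefixlen):
    """Split a VPNv4 address back into (RD, IPv4 prefix string)."""
    rd, ip = addr[:8], addr[8:]
    return rd, f"{ipaddress.IPv4Address(ip)}/{prefixlen}"

rd = pack_rd(31337, 1)                        # RD 31337:1
addr, plen = to_vpnv4(rd, "192.168.2.0/24")   # 8 + 4 = 12 bytes
```

`from_vpnv4(addr, plen)` returns the original RD and prefix, which is
the "vice versa" direction.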

As a result, we will get the following picture (prefixes/labels are
random) on every PE router:

route table "new"

Prefix             RD         Router   LABEL
192.168.1.0/24     31337:1    PE1      --
192.168.2.0/24     31337:1    PE2      PUSH {31, 47}
192.168.3.0/24     31337:1    PE3      PUSH {44, 35}
192.168.13.0/24    31337:1    PE3      PUSH {44, 36}

* BIRD USER VIEW

table new;

protocol ospf ospf0 {
# some OSPF configuration in GRT
}

protocol ldp {
	export all;
	label range 20 4000;

	interface "em*" {};
	interface "vlan*" {};
}

protocol bgp bgp0 {
	description "Link to RR";
	mpls vpn;
	# Some usual configuration
}

protocol l3vpn {
	table new;
	rd 31337:1;
	# Some import/export filters; by default, import
	# all routes with RT (route target) equal to RD
	# and export all routes with RT equal to RD
}

protocol direct {
	table new;
	interface "vlan136";
}

* UNDER THE HOOD
*** KERNEL INTERACTION ***
Case 1:
A route update can happen in different ways. We can install the updated
route iff
* an LDP label exists
* the IGP nexthop is one of the advertised LDP neighbour nexthops.

An LDP LMAP can arrive before or after the IGP announcement, so there
are 2 different cases:
1) The prefix already exists from some IGP and an LMAP arrives.
Here we can find the appropriate kernel instance and feed it exactly the
same route with a new attribute (EAP_ADDITIONAL) containing the MPLS
sockaddr.

We can, of course, call the rt_update stuff, but:
* At the moment, a route is not considered better if only some extended
attributes have changed
* There is no need to call all other protocols, since they should not
be interested in such an update

Direct updating seems much more appropriate.

2) The LMAP already exists and rt_notify is called.
At the moment it is not possible for a protocol to alter an
announcement: the rt_notify calling order cannot be predicted, and
import_control has only local significance.
Some pre-announce hook should be added, permitting all interested
protocols to at least add their attributes (we will insert the
EAP_ADDITIONAL attribute here).
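
The two arrival orders of case 1 can be handled by one piece of logic
that re-checks both conditions after every event. Below is a minimal
sketch with hypothetical names; the "installed" dictionary stands in
for whatever is actually pushed to the kernel, and none of this is BIRD
API.

```python
class MplsRouteTracker:
    """Joins IGP routes and LDP label mappings, whichever arrives first."""

    def __init__(self):
        self.igp = {}        # prefix -> IGP nexthop
        self.lmaps = {}      # prefix -> {neighbour: label}
        self.installed = {}  # prefix -> (nexthop, label) given to the kernel

    def _sync(self, prefix):
        # Install iff a label exists AND it came from the current IGP nexthop.
        nexthop = self.igp.get(prefix)
        label = self.lmaps.get(prefix, {}).get(nexthop)
        if nexthop is not None and label is not None:
            self.installed[prefix] = (nexthop, label)
        else:
            self.installed.pop(prefix, None)

    def on_igp_route(self, prefix, nexthop):
        """The IGP announced (or changed) a route."""
        self.igp[prefix] = nexthop
        self._sync(prefix)

    def on_lmap(self, prefix, neighbour, label):
        """LDP received a label mapping from a neighbour."""
        self.lmaps.setdefault(prefix, {})[neighbour] = label
        self._sync(prefix)
```

Whether the IGP route or the LMAP comes first, the prefix ends up in
"installed" only after both are known, and an LMAP from a neighbour
that is not the IGP nexthop changes nothing.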

Case 2:
This is more tricky.
We can handle the LIB (Label Information Base) as an internal hash
table; the main problem is kernel interaction.

We can handle this either
* by adding some private hook to kernel, since there is no need to
notify other protocols even in the case of LDP + RSVP-TE (a separate
label space should be used). However, dumping the AF_MPLS table (for
the purpose of cleaning) requires another hook

* or by upgrading FIB / rtable:
if, from the user's point of view, config tables are not AF-bound (e.g.
IPv4+IPv6), we will have to enhance the FIB API.

My vision is the following:
* make fib AF-bound, specifying AF and sizeof(object) at fib_init (or
fib2_init)
* pass pointers to all fib_*-related functions instead of addresses
* compare by memcmp() when searching (and use an AF-dependent hash
based on the value passed to _init)
* pass AF in the appropriate protocol hooks
* change struct network (move rte *routes up) to permit adding some
dynamic-sized address after struct fib_node

* each rtable contains fib pointers for supported AFs

Using this approach we can send a route update by simply announcing the
label in the GRT with AF_MPLS.
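
To show the shape of this idea, here is a toy model (the real thing
would of course be C inside BIRD). Each Fib is bound to one address
family with a fixed address size, keys are raw address bytes compared
byte-wise (standing in for the memcmp() comparison), and an rtable
holds one fib per AF. Class names, method names and the AF strings are
all assumptions made for this sketch.

```python
class Fib:
    """Toy FIB bound to one address family with a fixed address size."""

    def __init__(self, af, addr_len):
        self.af = af
        self.addr_len = addr_len
        self.nodes = {}   # (raw address bytes, prefix length) -> node

    def _key(self, addr, pxlen):
        if len(addr) != self.addr_len:
            raise ValueError("address size does not match this fib")
        return (bytes(addr), pxlen)

    def get(self, addr, pxlen):
        """Find a node, creating it if it does not exist yet."""
        return self.nodes.setdefault(self._key(addr, pxlen), {"routes": []})

    def find(self, addr, pxlen):
        """Find a node; return None if it does not exist."""
        return self.nodes.get(self._key(addr, pxlen))


class Rtable:
    """Toy rtable holding one fib per supported address family."""

    def __init__(self, name):
        self.name = name
        self.fibs = {}    # AF name -> Fib

    def fib(self, af, addr_len):
        if af not in self.fibs:
            self.fibs[af] = Fib(af, addr_len)
        return self.fibs[af]


grt = Rtable("master")
v4 = grt.fib("AF_INET", 4)
mpls = grt.fib("AF_MPLS", 4)          # assumes a label stored in 4 bytes
v4.get(bytes([192, 168, 1, 0]), 24)["routes"].append("ospf route")
mpls.get((20).to_bytes(4, "big"), 32)["routes"].append("SWAP 300 via P1")
```

The same table then carries both the IPv4 route and the label "route",
each in the fib of its own address family.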

*** PROTOCOL INTERACTION ***
We have to change the paradigm "All protocols are equal" to
"All protocols are equal, but some protocols are more equal than others"

We need some sort of API which permits calling a protocol-specific
hook for a given protocol type in a given rtable.
This is needed for
* LDP -> kernel protocol invocation
* L3vpn label requests from LDP (see below)

Alternatively, a post-configure protocol hook should be added (the
current postconfigure is actually a
post-successful-parse-and-config-structure-filled hook) to permit
updating protocol pointers after a config change.

Assuming multi-protocol rtables, each L3vpn instance subscribes to the
VPNv4/VPNv6 GRT fib instances and listens for routes announced by bgp
sessions with the "mpls vpn" keyword.

(Some filter work has to be done to permit filtering by RT)

After a route has passed filtering, the l3vpn instance retrieves the
actual IPv4/IPv6 prefix and remote label, does a recursive route
lookup, requests a local label from LDP and installs/updates kernel
routes. Very much like the pipe protocol.

When a new route appears in the l3vpn base table ("new"), it passes the
export filter, is packed into the VPNv4/VPNv6 family, another local
label is requested from LDP, and finally the route is announced into
the appropriate VPN family in the GRT.

* PORTABILITY

Actually, all [known] MPLS label, sockaddr and other definitions are
not portable across platforms:

Linux one (linux-mpls project):
http://mpls-linux.git.sourceforge.net/git/gitweb.cgi?p=mpls-linux/mpls-linux...

OpenBSD one:
http://fxr.watson.org/fxr/source/netmpls/mpls.h?v=OPENBSD

NetBSD one:
http://fxr.watson.org/fxr/source/netmpls/mpls.h?v=NETBSD

FreeBSD (from my implementation):
http://freebsd.mpls.in/svn/filedetails.php?repname=FREEBSD+MPLS&path=%2Frele...

rtsock is more or less the same for *BSD
netlink is different, as usual

Some additional (system-dependent) run-time kernel configuration is
required, at least on Linux and FreeBSD.

I'm implementing the FreeBSD part only (however, I'll try to keep it as
portable as I can)

Least common denominator we can get:
* More flexible API permitting RR/MPLS
* BGP VPN extensions
* LDP implementation
* L3VPN implementation
* Protocols interaction