On 02/06/2014 04:14 PM, Ondrej Zajicek wrote:
> On Thu, Feb 06, 2014 at 02:17:12PM +0100, Peter Christensen wrote:
>> Hi,
>
> Hello
>> I noticed that the multipath support in OSPF seems to be fairly limited. Essentially, I was only able to make it do multipath if I had two interfaces connecting to the same router. At my company, we need true multipath between multiple routers using a single interface. (If I needed the other, I could use LACP.)
>
> Not if such multipath spans multiple routers (e.g. a network consisting of several routers connected by ptp links in a circle).

True. I was just considering the simple case of two routers with two interfaces each, connected through a switch in between. Here, LACP would work just fine.

> Also note that even if you have just one interface, you still get ECMP if there are several paths (through different neighbor routers) to one router a few hops away.

Apparently I didn't. I am essentially trying to make my routers balance traffic across multiple load-balancers running OSPF with BIRD. Their setup looks something like this (simplified):

    protocol ospf {
        import none;
        export none;
        area 1 {
            interface "eth0";
            interface "lo";
        };
    }

The loopback interface holds a number of anycast addresses, which appear as stubnets in OSPF. The routers see the stubnet on both load-balancers, but pick only one (seemingly random) load-balancer when inserting into the routing table. If both the router and one of the load-balancers participated in the area on two interfaces, I got a multipath route entry. I traced the flow and found that stubnets never visited the current multipath code.

>> I am aware of the implications of the default multipath implementation in Linux, which operates on a per-packet basis, which is why we've patched our kernels to do it per-flow instead.
>
> Really? AFAIK the default Linux implementation is per-flow, not per-packet, unless this was changed recently.

The IPv4 multipath code in the kernel actually picks a pseudo-random route in a round-robin fashion. The route cache would, however, ensure that a flow stayed on a particular path for a while if the route was used continuously. In Linux 3.6 the route cache was removed from the kernel (apparently the route cache behaved badly under heavy load), effectively turning the multipath code from per-flow to per-packet. The IPv6 multipath code has always used a hash-based modulo-N algorithm, which ensures consistent flow-based multipath.

So we basically added an option in the kernel allowing hash-based modulo-N multipath in IPv4 (as an added bonus, the round-robin code required a spinlock, while the hash-based code is lock-free). Unfortunately our implementation disregards multipath weights, so I haven't bothered sending it to any kernel mailing list. Following the recommendation of RFC 2992 (Analysis of an Equal-Cost Multi-Path Algorithm), I'll probably change our hash modulo-N algorithm to a hash-threshold algorithm, which behaves better when gateways are added to or removed from the multipath.

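To illustrate the difference between the two selection schemes mentioned above, here is a minimal sketch in Python. It is not our kernel patch: the function names, the SHA-256 based flow key and the `lb1`..`lb4` next-hop names are all made up for the example. It only shows how each scheme maps a 32-bit flow key to a next hop, and counts how many flows change next hop when one gateway is withdrawn.

```python
import hashlib

KEY_SPACE = 2 ** 32  # flow keys are treated as 32-bit unsigned integers


def flow_key(src, dst, sport, dport):
    """Hash the flow identifiers into a 32-bit key.

    SHA-256 is an arbitrary choice for this sketch, not what a kernel uses.
    """
    data = f"{src}|{dst}|{sport}|{dport}".encode()
    return int.from_bytes(hashlib.sha256(data).digest()[:4], "big")


def pick_modulo_n(key, next_hops):
    """Modulo-N: the next hop index is the key modulo the number of next hops."""
    return next_hops[key % len(next_hops)]


def pick_hash_threshold(key, next_hops):
    """Hash-threshold: split the key space into N equal, contiguous regions
    and pick the next hop whose region contains the key."""
    return next_hops[key * len(next_hops) // KEY_SPACE]


def moved_fraction(pick, before, after, keys):
    """Fraction of flows whose next hop differs between the two next-hop sets."""
    moved = sum(1 for k in keys if pick(k, before) != pick(k, after))
    return moved / len(keys)


if __name__ == "__main__":
    keys = [flow_key("198.51.100.7", "192.0.2.1", 1024 + i, 80)
            for i in range(10000)]
    before = ["lb1", "lb2", "lb3", "lb4"]
    after = ["lb1", "lb3", "lb4"]  # lb2 withdrawn
    print("modulo-N moved:      ", moved_fraction(pick_modulo_n, before, after, keys))
    print("hash-threshold moved:", moved_fraction(pick_hash_threshold, before, after, keys))
```

RFC 2992 puts the disruption at (N-1)/N of all flows for modulo-N and between 1/4 and 1/2 for hash-threshold when a next hop is added or removed, so the second number printed should be clearly smaller than the first. Neither function honours multipath weights.
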
>> Anyway, I seem to have managed to make multipath work as expected - at least in our setup. (Patch attached)
>
> Well, what is expected is the question. BIRD currently does multipath on the idea that multiple paths through the OSPF network topology to one destination in one area are merged, but two identical routes originated by two different routers are considered different destinations (which makes perfect sense for propagated default gateways or anycast destinations).

The way I interpret the OSPFv2 spec, a destination is simply an IP address prefix. There may be several routes to a particular destination through a lot of routers, but if multiple routes to that destination exist which seem identical in quality (cost etc.), those routes are eligible for multipath - even if those destinations are default gateways or anycast destinations (anycast destinations are after all indistinguishable from ordinary destinations). So at least what I expect is that /any/ seemingly equal route to a given network should be merged into a multipath route if ecmp is enabled.

RFC 4786 (Operation of Anycast Services) talks about using ECMP with anycast services, obviously mentioning that per-packet load-balancing can be problematic with anycast, and that hash-based ECMP is preferred. In other words, combining hash-based multipath with anycast may often be preferable, and the OSPF algorithm ought to ensure that all active routes to the anycast destination are of equal best cost.

> Your patch merges such routes from different routers, but still keeps routes from different areas separate. A few months ago, Volodymyr Samodid commented that ECMP in OSPF should merge paths from multiple areas.

Really? From RFC 2328 (OSPF Version 2) section 16.8 (page 178): "Each one of the multiple routes will be of the same type (intra-area, inter-area, type 1 external or type 2 external), cost, and will have the same associated area. However, each route may specify a separate next hop and Advertising router." Aren't they saying that each route in the multipath entry must share the same associated area?

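Read literally, the quoted sentence amounts to a simple eligibility test. The sketch below is only an illustration of that reading, in Python rather than BIRD's C; the `Route` record and its field names are made up for the example and do not correspond to BIRD's internal structures.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Route:
    """Hypothetical route record; field names are illustrative only."""
    prefix: str       # destination, e.g. "192.0.2.1/32"
    rtype: str        # "intra-area", "inter-area", "ext1" or "ext2"
    cost: int
    area: int         # associated area
    next_hop: str
    adv_router: str   # advertising router


def ecmp_eligible(a, b):
    """True if two routes may share one multipath entry under a literal
    reading of the quoted sentence: same destination, type, cost and area.
    Next hop and advertising router are deliberately not compared."""
    return (a.prefix == b.prefix
            and a.rtype == b.rtype
            and a.cost == b.cost
            and a.area == b.area)


if __name__ == "__main__":
    lb1 = Route("192.0.2.1/32", "intra-area", 20, 1, "10.0.0.11", "10.0.0.11")
    lb2 = Route("192.0.2.1/32", "intra-area", 20, 1, "10.0.0.12", "10.0.0.12")
    print(ecmp_eligible(lb1, lb2))                   # same area
    print(ecmp_eligible(lb1, replace(lb2, area=2)))  # different area
```

Under this reading the two load-balancer stubnets in my setup qualify, since they differ only in next hop and advertising router, while the same prefix learned through another area would not.
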
> So it seems that this should be at least configurable (like 'ecmp merge internal <bool>', 'ecmp merge external <bool>', 'ecmp merge areas <bool>'). The question is how detailed such configuration should be. For example, it may be useful to merge external routes with the same route tag, but not merge external routes with different ones. And what about merging internal and external routes together, is this useful?
>
> Any thoughts on this issue?

At least from the RFC 2328 point of view, it apparently doesn't make sense to merge the routes across different types of routes. But I guess that boils down to the fact that they usually have different costs.
>> Essentially, I've hooked my multipath code into ri_install_ext() and ri_install_net(), where I add the equal routes if the routes share the same type, metrics and OSPF area. I realize that my add_nexthops() is /very/ similar to merge_nexthops() in functionality, but it seemed that the top_hash_entry() could be null, so I wrote a new method which did not rely on that - at the cost of more calls to copy_nexthop(), I guess.
>>
>> Any thoughts?
>
> The implementation looks clear and simple, I will look at it thoroughly in a few days. At first look I see that the patch forgot to zero orta->rid and perhaps orta->tag if the merged routes have different ones.

Yeah, I guess clearing rid makes sense, since the route really comes from different routers. As for the tag, I'm not sure what the expected behavior is, since it is outside the scope of the OSPFv2 spec. Maybe that is cause enough to make it tunable whether routes with different tags can be merged. Another thing I've noticed myself is that I should probably also check ORTA_NSSA, ORTA_PROP and ORTA_PREF when verifying equal-cost routes in ri_install_ext. ri_better_ext is after all taking them into consideration.
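
For what it's worth, the merge behavior being discussed could look roughly like the sketch below. It is written in Python for brevity and is not the patch itself: `Orta` here is a made-up stand-in holding only rid, tag and a next-hop list, `merge_routes` is a hypothetical helper, and using 0 to mean "no single originator / no common tag" is an assumption based on the suggestion to zero the fields.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Orta:
    """Made-up stand-in for the route attributes under discussion."""
    rid: int                          # originating router ID
    tag: int                          # external route tag
    next_hops: List[Tuple[str, str]]  # (gateway, interface) pairs


def merge_routes(a, b):
    """Merge two equal-cost routes into one multipath entry.

    Next hops are unioned without duplicates, keeping a's order first.
    rid and tag are kept only when both routes agree, otherwise set to 0
    (assumption: 0 stands for 'no single value applies').
    """
    hops = list(a.next_hops)
    for nh in b.next_hops:
        if nh not in hops:
            hops.append(nh)
    rid = a.rid if a.rid == b.rid else 0
    tag = a.tag if a.tag == b.tag else 0
    return Orta(rid, tag, hops)


if __name__ == "__main__":
    lb1 = Orta(rid=11, tag=0, next_hops=[("10.0.0.11", "eth0")])
    lb2 = Orta(rid=12, tag=0, next_hops=[("10.0.0.12", "eth0")])
    print(merge_routes(lb1, lb2))
```

Whether routes with different tags should be merged at all is exactly the open question above; this sketch merges them unconditionally and just drops the tag.
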