"gw" attribute assignment in filter invalidates routes learned via BGP, static, and possibly others?
Hello!

Another issue I spotted recently: assigning a value to the "gw" attribute in a protocol import filter invalidates the route and prevents it from being installed in the KRT.

Simple config to test this issue:
----------------------------------------
### SYSTEM
# ip link add dev lo255 type dummy
# ip link set up dev lo255
# ip -4 addr add 192.0.2.254/24 dev lo255
# ip -4 addr add 172.16.2.254/24 dev lo255

# ip link set up dev eth0
# ip -4 addr add 192.168.1.2/24 dev eth0

### BIRD
router id 10.10.10.10;

protocol device devices {
    scan time 120;
}

# Main routing table (master), mapped to ipt_main kernel table.
# ipt_main == 254 (see /etc/iproute2/rt_tables)
protocol kernel kernel254 {
    persist no;
    scan time 120;
    learn no;
    device routes no;
    kernel table ipt_main;
    import none;
    export all;
}

protocol direct direct254 {
    interface "lo255", "eth0";
}

protocol static static254_test {
    # If "via" is one from the "lo255" subnets (172.16.2.5 for example),
    # everything is correct and the route is installed by "kernel254".
    route 10.0.0.0/24 via 192.168.1.1;

    import filter {
        # This causes route invalidation, "gw" points to 192.0.2.5,
        # but interface is still "eth0" (see birdc output below).
        gw = 192.0.2.5;
        accept;
    };
    export none;
};

protocol static static254 {
    # rs1, rs2
    route 192.168.254.1/32 via 192.168.1.1;
    route 192.168.254.2/32 via 192.168.1.1;
}

# AS65001, peering with Route Server(s), RS
template bgp tl_bgp254_as65001 {
    start delay time 10;
    connect retry time 60;
    startup hold time 30;
    keepalive time 10;
    hold time 30;

    capabilities yes;
    advertise ipv4 yes;
    enable route refresh yes;
    enable as4 yes;

    gateway recursive;
    multihop 4;

    local 192.168.1.2 as 65010;

    import filter {
        # This causes route invalidation, "gw" points to 192.0.2.5, but
        # interface is still "eth0" (see birdc output below).
        gw = 192.0.2.5;
        accept;
    };
    export none;
}

protocol bgp bgp254_as65001_rs1 from tl_bgp254_as65001 {
    neighbor 192.168.254.1 as 65001;
}

protocol bgp bgp254_as65001_rs2 from tl_bgp254_as65001 {
    neighbor 192.168.254.2 as 65001;
}

Here is output from birdc and ip-route(8):
----------------------------------------------------
### common
## 192.0.2.0/24, lo255
$ birdc 'show route for 192.0.2.5 all'
BIRD 1.3.11 ready.
192.0.2.0/24 dev lo255 [direct254 12:45] * (240)
    Type: device unicast univ
$ ip -4 route show match 192.0.2.5/32
192.0.2.0/24 dev lo255 proto kernel scope link src 192.0.2.254

## 172.16.2.0/24, lo255
$ birdc 'show route for 172.16.2.5 all'
BIRD 1.3.11 ready.
172.16.2.0/24 dev lo255 [direct254 13:00] * (240)
    Type: device unicast univ
$ ip -4 route show match 172.16.2.5/32
172.16.2.0/24 dev lo255 proto kernel scope link src 172.16.2.254

## 192.168.1.0/24, eth0
$ birdc 'show route for 192.168.1.0/24 all'
BIRD 1.3.11 ready.
192.168.1.0/24 dev eth0 [direct254 12:45] * (240)
    Type: device unicast univ
$ ip -4 route show match 192.168.1.2/32
192.168.1.0/24 dev eth0 proto kernel scope link src 192.168.1.254

### static
## 10.0.0.0/24
$ birdc 'show route for 10.0.0.0/24 all'
BIRD 1.3.11 ready.
10.0.0.0/24 via 192.0.2.5 on eth0 [static254_test 13:19] ! (200)
    Type: static unicast univ
$ ip -4 route show exact 10.0.0.0/24

### BGP
$ birdc 'show route for 10.0.0.0/8 all'
BIRD 1.3.11 ready.
10.0.0.0/8 via 192.0.2.5 on eth0 [bgp254_as65001_rs1 13:00 from 192.168.254.1] ! (100/0) [AS1011i]
    Type: BGP unicast univ
    BGP.origin: IGP
    BGP.as_path: 65001 1011
    BGP.next_hop: 192.168.1.1
    BGP.local_pref: 100
    BGP.community: (1001,65010) (1001,1001)
           via 192.0.2.5 on eth0 [bgp254_as65001_rs2 13:00 from 192.168.254.2] (100/0) [AS1011i]
    Type: BGP unicast univ
    BGP.origin: IGP
    BGP.as_path: 65001 1011
    BGP.next_hop: 192.168.1.1
    BGP.local_pref: 100
    BGP.community: (1001,65010) (1001,1001)
$ ip -4 route show exact 10.0.0.0/8
---------------------------------------------------------------------------------------

This is probably due to not updating iface, nexthops (multihop config) and other fields of the "rta" struct.

The attached patch tries to address this issue by calling rta_set_recursive_next_hop() in filter/filter.c to properly assign to the "gw" attribute. Special cases for the bgp and static protocols were added to use the "igp table" configuration parameter if present (tested and found working with the static protocol, probably the same with bgp).

The patch was tested in my configuration and works very well for both the static and bgp protocols.

---------------------------------------------------------------------------------------
Brief explanation of why the "gw" attribute might be very important (at least in my case).

There is a common technique to stop DDoS in large ISP networks: blackholing.

However, implementations of this might vary from vendor to vendor. In BIRD the simplest way to implement it is to set the "dest" attribute to something like RTD_BLACKHOLE; all other route attributes (gw, iface, ...) are then deleted, and the route is installed in the KRT as a blackhole.

Everything is OK with this setup, but tracking down a blackhole route can give some surprises: you trace the path to a blackholed destination and get nothing (???) from your gateway, as it drops packets destined to the blackhole route without any notification (in the case of a path trace, without sending ICMP Time To Live Exceeded). And what if you trace from some core router?
And here another blackholing technique comes in: use a looped-back interface to drop packets.

In this case we set up a system looped-back interface (on Linux this is the dummy interface type), configure some network on it (say 192.0.2.0/24), and instead of setting the blackholed route type to RTD_BLACKHOLE we change its "gw" to any address within the subnet assigned to the looped-back interface.

That serves the same purpose as a route with blackhole type, but behaves differently on path traces: the router sends ICMP Time To Live Exceeded messages from its incoming interface address, indicating the last hop before the blackholed traffic is dropped. This has no impact on DDoS traffic, except that it is transmitted to the looped-back interface instead of being dropped immediately after matching the route, which wastes a few CPU cycles.

--
SP5474-RIPE Sergey Popovich
On Tue, Aug 13, 2013 at 01:57:48PM +0300, Sergey Popovich wrote:
Hello!
Another issue I spotted recently: assigning a value to the "gw" attribute in a protocol import filter invalidates the route and prevents it from being installed in the KRT.
Yes, this is a known issue. It works only when setting gw on the same iface.
--------------------------------------------------------------------------------------- This is probably due to not updating iface, nexthops (multihop config) and other fields of the "rta" struct.
The attached patch tries to address this issue by calling rta_set_recursive_next_hop() in filter/filter.c to properly assign to the "gw" attribute. Special cases for the bgp and static protocols were added to use the "igp table" configuration parameter if present (tested and found working with the static protocol, probably the same with bgp).
The patch does not make sense to me - if the user sets the 'gw' attribute, BIRD should set the immediate nexthop of the route, not set up a route with a recursive nexthop - that would be inconsistent, because reading the 'gw' attribute returns the immediate nexthop and not the recursive nexthop of a route. The attached patch should do that (essentially just look up the iface, fix it and force the route to RTD_ROUTER when 'gw' is set). Is this OK for you?
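The behaviour described above (look up the iface for the new gateway, fix it, force RTD_ROUTER) can be sketched as follows. This is a minimal toy model, not the attached patch and not BIRD's real code: struct toy_iface, struct toy_rta, the IP4() macro and toy_set_gw() are hypothetical names made up for illustration, and only the RTD_* names are borrowed from the thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-ins for illustration only; these are NOT BIRD's definitions. */
enum toy_dest { RTD_ROUTER, RTD_DEVICE, RTD_BLACKHOLE };

struct toy_iface {
    const char *name;
    uint32_t prefix;   /* network address, host byte order */
    uint32_t mask;     /* netmask, host byte order */
};

struct toy_rta {
    enum toy_dest dest;
    uint32_t gw;
    const struct toy_iface *iface;
};

/* Build an IPv4 address in host byte order from four octets. */
#define IP4(a, b, c, d) \
    (((uint32_t)(a) << 24) | ((uint32_t)(b) << 16) | ((uint32_t)(c) << 8) | (uint32_t)(d))

/* Return the interface whose connected subnet contains addr, or NULL. */
static const struct toy_iface *
toy_find_iface(const struct toy_iface *ifs, size_t n, uint32_t addr)
{
    for (size_t i = 0; i < n; i++)
        if ((addr & ifs[i].mask) == ifs[i].prefix)
            return &ifs[i];
    return NULL;
}

/*
 * Assign an immediate next hop: the gateway must sit on a directly
 * connected subnet. On success gw, iface and dest are updated together,
 * so the route never ends up with a gateway on one subnet and an
 * interface on another. Returns 0 on success, -1 (route untouched)
 * if no interface matches.
 */
static int
toy_set_gw(struct toy_rta *a, const struct toy_iface *ifs, size_t n, uint32_t gw)
{
    assert(a != NULL && ifs != NULL);

    const struct toy_iface *ifa = toy_find_iface(ifs, n, gw);
    if (ifa == NULL)
        return -1;

    a->gw = gw;
    a->iface = ifa;
    a->dest = RTD_ROUTER;
    return 0;
}
```

With interfaces lo255 (192.0.2.0/24) and eth0 (192.168.1.0/24), calling toy_set_gw() with 192.0.2.5 on a route that currently points at eth0 moves it to lo255, which is the consistency that was missing in the "via 192.0.2.5 on eth0" output earlier in the thread.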
--------------------------------------------------------------------------------------- Brief explanation of why the "gw" attribute might be very important (at least in my case).
There is a common technique to stop DDoS in large ISP networks: blackholing.
However, implementations of this might vary from vendor to vendor. In BIRD the simplest way to implement it is to set the "dest" attribute to something like RTD_BLACKHOLE; all other route attributes (gw, iface, ...) are then deleted, and the route is installed in the KRT as a blackhole.
...
In this case we set up a system looped-back interface (on Linux this is the dummy interface type), configure some network on it (say 192.0.2.0/24), and instead of setting the blackholed route type to RTD_BLACKHOLE we change its "gw" to any address within the subnet assigned to the looped-back interface.
That serves the same purpose as a route with blackhole type, but behaves differently on path traces: the router sends ICMP Time To Live Exceeded messages from its incoming interface address, indicating the last hop before the blackholed traffic is dropped. This has no impact on DDoS traffic, except that it is transmitted to the looped-back interface instead of being dropped immediately after matching the route, which wastes a few CPU cycles.
Thanks for the thorough explanation. I am surprised that a route to a Linux dummy interface works like that. I always thought that a dummy interface would behave more like an ethernet with nothing connected to it than like a loopback (therefore you would get ICMP Destination Unreachable instead of TTL Exceeded), but I didn't test that. And why not just use RTD_UNREACHABLE or RTD_PROHIBIT? Both would return some ICMP message.

--
Elen sila lumenn' omentielvo

Ondrej 'SanTiago' Zajicek (email: santiago@crfreenet.org)
OpenPGP encrypted e-mails preferred (KeyID 0x11DEADC3, wwwkeys.pgp.net)
"To err is human -- to blame it on a computer is even more so."
In a message of 13 August 2013 16:25:14 you wrote:
The patch does not make sense to me - if the user sets the 'gw' attribute, BIRD should set the immediate nexthop of the route, not set up a route with a recursive nexthop - that would be inconsistent, because reading the 'gw' attribute returns the immediate nexthop and not the recursive nexthop of a route.
Thanks, now I understand why. At least I tried to fix the problem myself.
The attached patch should do that (essentially just look up the iface, fix it and force the route to RTD_ROUTER when 'gw' is set). Is this OK for you?
Yes, thanks. The patch works as expected.
Thanks for the thorough explanation. I am surprised that a route to a Linux dummy interface works like that. I always thought that a dummy interface would behave more like an ethernet with nothing connected to it than like a loopback (therefore you would get ICMP Destination Unreachable instead of TTL Exceeded), but I didn't test that.
Sorry, I did not mean to confuse you. A dummy interface really is more like an ethernet interface with nothing attached to it: nothing is looped back from it (nothing is received, actually). Anything sent to a dummy interface is simply discarded, as with a blackhole route, but no neighbor resolution (ARP, NDP) is done on it, and the general routing rules are applied to it as to any other network interface; that is what makes it different from a blackhole route. But the network stack generates ICMP TTL Exceeded when it receives a datagram destined for a subnet configured on the dummy interface and cannot forward it to the dummy interface because the TTL is 1. Using a dummy interface for blackholing seems a simple and elegant solution :-).
And why not just use RTD_UNREACHABLE or RTD_PROHIBIT? Both would return some ICMP message.
Well, this could be a solution for just terminating address space, where packets for all currently unused addresses are directed to a common (summary) route that generates ICMP. However, due to ICMP rate limiting in the kernel for certain ICMP types, that might introduce unwanted losses when sending to these routes. But not under DDoS, where we prefer not to answer in any way (even though the kernel network stack rate-limits ICMP Dest Unreach and ICMP Admin Prohibited messages).

--
SP5474-RIPE Sergey Popovich
On Tue, Aug 13, 2013 at 05:31:33PM +0300, Sergey Popovich wrote:
In a message of 13 August 2013 16:25:14 you wrote:
The patch does not make sense to me - if the user sets the 'gw' attribute, BIRD should set the immediate nexthop of the route, not set up a route with a recursive nexthop - that would be inconsistent, because reading the 'gw' attribute returns the immediate nexthop and not the recursive nexthop of a route.
Thanks, now I understand why. At least I tried to fix the problem myself.
That always counts.
The attached patch should do that (essentially just look up the iface, fix it and force the route to RTD_ROUTER when 'gw' is set). Is this OK for you?
Yes, thanks. The patch works as expected.
Well, you should also use this patch, otherwise your BGP sessions will be restarted if you shut down the dummy iface. This bug could also be triggered by other means, but I noticed it in connection with the gw-setting patch.
Thanks for the thorough explanation. I am surprised that a route to a Linux dummy interface works like that. I always thought that a dummy interface would behave more like an ethernet with nothing connected to it than like a loopback (therefore you would get ICMP Destination Unreachable instead of TTL Exceeded), but I didn't test that.
Sorry, I did not mean to confuse you. A dummy interface really is more like an ethernet interface with nothing attached to it: nothing is looped back from it (nothing is received, actually). Anything sent to a dummy interface is simply discarded, as with a blackhole route, but no neighbor resolution (ARP, NDP) is done on it, and the general routing rules are applied to it as to any other network interface; that is what makes it different from a blackhole route.
But the network stack generates ICMP TTL Exceeded when it receives a datagram destined for a subnet configured on the dummy interface and cannot forward it to the dummy interface because the TTL is 1.
OK, now I understand. The TTL ICMP message is related just to traceroute packets, not to the normal traffic (which has a large enough TTL). So in essence a route to a dummy iface first checks the TTL and then blackholes traffic, while RTD_BLACKHOLE just blackholes traffic.
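The difference between the route types discussed so far can be summarised in a small decision function. This is a sketch of the behaviour as described in this thread, not kernel code; the toy_* names are hypothetical, and the model ignores ICMP rate limiting.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical route kinds mirroring the options discussed in the thread. */
enum toy_route_kind {
    TOY_BLACKHOLE,    /* dest = RTD_BLACKHOLE */
    TOY_UNREACHABLE,  /* dest = RTD_UNREACHABLE */
    TOY_PROHIBIT,     /* dest = RTD_PROHIBIT */
    TOY_VIA_DUMMY     /* ordinary route whose gw is on a dummy iface subnet */
};

enum toy_verdict {
    TOY_DROP_SILENT,        /* dropped, nothing sent back */
    TOY_ICMP_UNREACHABLE,   /* dropped, ICMP Destination Unreachable sent */
    TOY_ICMP_PROHIBITED,    /* dropped, ICMP Admin Prohibited sent */
    TOY_ICMP_TTL_EXCEEDED,  /* dropped, ICMP Time To Live Exceeded sent */
    TOY_SENT_TO_DUMMY       /* forwarded to the dummy iface, which discards it */
};

/*
 * What happens to a transit packet arriving with the given TTL.
 * Only the dummy-interface route goes through the normal forwarding
 * path, so only there does the TTL matter.
 */
static enum toy_verdict
toy_forward(enum toy_route_kind kind, uint8_t ttl)
{
    switch (kind) {
    case TOY_BLACKHOLE:
        return TOY_DROP_SILENT;
    case TOY_UNREACHABLE:
        return TOY_ICMP_UNREACHABLE;
    case TOY_PROHIBIT:
        return TOY_ICMP_PROHIBITED;
    case TOY_VIA_DUMMY:
        /* A traceroute probe arrives with TTL 1 and cannot be forwarded. */
        if (ttl <= 1)
            return TOY_ICMP_TTL_EXCEEDED;
        return TOY_SENT_TO_DUMMY;
    }
    assert(0 && "unknown route kind");
    return TOY_DROP_SILENT;
}
```

A traceroute probe (TTL 1) towards a destination routed via the dummy interface yields TOY_ICMP_TTL_EXCEEDED, while normal traffic with a larger TTL ends up in TOY_SENT_TO_DUMMY and produces no reply, so the hop shows up in path traces without the router answering DDoS traffic.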
Using a dummy interface for blackholing seems a simple and elegant solution :-).
Well, I wouldn't call this elegant. RTD_BLACKHOLE seems to be what is expected to be used in such cases, so if it is insufficient for that purpose it is most likely a bug in the kernel, and using a dummy iface is merely a workaround.

--
Elen sila lumenn' omentielvo

Ondrej 'SanTiago' Zajicek (email: santiago@crfreenet.org)
OpenPGP encrypted e-mails preferred (KeyID 0x11DEADC3, wwwkeys.pgp.net)
"To err is human -- to blame it on a computer is even more so."
In a message of 13 August 2013 21:06:44, Ondrej Zajicek wrote:
The attached patch should do that (essentially just look up the iface, fix it and force the route to RTD_ROUTER when 'gw' is set). Is this OK for you?
Yes, thanks. The patch works as expected.
Well, you should also use this patch, otherwise your BGP sessions will be restarted if you shut down the dummy iface. This bug could also be triggered by other means, but I noticed it in connection with the gw-setting patch.
Wow, really. I had not run into that, as I use an IPv6 connection to troubleshoot IPv6 and do not bring the lo255 interface down/up. Huge thanks.
Thanks for the thorough explanation. I am surprised that a route to a Linux dummy interface works like that. I always thought that a dummy interface would behave more like an ethernet with nothing connected to it than like a loopback (therefore you would get ICMP Destination Unreachable instead of TTL Exceeded), but I didn't test that.
Sorry, I did not mean to confuse you. A dummy interface really is more like an ethernet interface with nothing attached to it: nothing is looped back from it (nothing is received, actually). Anything sent to a dummy interface is simply discarded, as with a blackhole route, but no neighbor resolution (ARP, NDP) is done on it, and the general routing rules are applied to it as to any other network interface; that is what makes it different from a blackhole route.
But the network stack generates ICMP TTL Exceeded when it receives a datagram destined for a subnet configured on the dummy interface and cannot forward it to the dummy interface because the TTL is 1.
OK, now I understand. The TTL ICMP message is related just to traceroute packets, not to the normal traffic (which has a large enough TTL).
Yes.
So in essence a route to a dummy iface first checks the TTL and then blackholes traffic, while RTD_BLACKHOLE just blackholes traffic.
Yes, really. Moreover, the kernel routing machinery checks the TTL and, if it has not expired, really transmits the packet to the dummy interface as it would to any other, and only then does the dummy interface blackhole it (this can be seen with tcpdump(8) on the dummy interface). Thanks for the advice.

--
SP5474-RIPE Sergey Popovich
In a message of 13 August 2013 16:25:14 you wrote:
On Tue, Aug 13, 2013 at 01:57:48PM +0300, Sergey Popovich wrote:
Hello!
Another issue I spotted recently: assigning a value to the "gw" attribute in a protocol import filter invalidates the route and prevents it from being installed in the KRT.
Yes, this is a known issue. It works only when setting gw on the same iface.
The attached patch should do that (essentially just look up the iface, fix it and force the route to RTD_ROUTER when 'gw' is set). Is this OK for you?
A new issue comes with this patch (currently in the upstream git tree): deleting the IP address from the interface in whose subnet the "gw" attribute was set does not change the route's reachability status.

BIRD configuration. I use a simple static protocol, but other protocols are affected as well when the "gw" attribute is changed in filters. BIRD is running on the Linux 3.2.x stable tree.
------------------------------------------------------
# Configure logging
log stderr all;
log syslog all;

router id 172.16.1.1;

protocol device devices {
    scan time 120;
}

table rt_10;

protocol static static10 {
    table rt_10;
    debug all;

    # This route represents a protocol-specific route.
    #
    # For simplicity we use a "blackhole" route without
    # a neighbor attached to it.
    #
    route 192.168.0.0/16 blackhole;

    # This route has a neighbor attached via neigh_find2()
    # with %NEF_STICKY to track its nexthop changes,
    # and thus the neigh_notify() callback in the static protocol
    # takes care of nexthop state changes.
    #
    route 10.0.0.0/8 via 192.0.2.5;

    import filter {
        # Overwrite "gw" attribute on
        # "blackhole" route
        if dest = RTD_BLACKHOLE then
            gw = 192.0.2.5;
        accept;
    };
    export none;
}

protocol kernel kernel10 {
    table rt_10;
    debug all;
    persist no;
    scan time 120;
    learn no;
    device routes no;
    kernel table 10;
    import none;
    export all;
}

System network stack configuration
----------------------------------------------
# ip -4 addr show dev lo255
6: lo255: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN
    inet 192.0.2.1/24 scope global lo255
# ip -4 route show table 10

Starting bird, compiled with -DGLOBAL_DEBUG, with the "-d" option
-------------------------------------------------------------------------------
# bird -d
...

After bird starts, everything works as expected
------------------------------------------------------------
# birdc 'show route table rt_10 all'
BIRD 1.3.11 ready.
10.0.0.0/8 via 192.0.2.5 on lo255 [static10 12:49] * (200)
    Type: static unicast univ
192.168.0.0/16 via 192.0.2.5 on lo255 [static10 12:49] * (200)
    Type: static unicast univ
# ip -4 route show table 10
10.0.0.0/8 via 192.0.2.5 dev lo255 proto bird
192.168.0.0/16 via 192.0.2.5 dev lo255 proto bird

Now, delete IP address 192.0.2.1/24 from the "lo255" interface
---------------------------------------------------------------------------
# ip -4 addr del 192.0.2.1/24 dev lo255
# ip -4 route show table 10
192.168.0.0/16 via 192.0.2.5 dev lo255 proto bird
# birdc 'show route table rt_10 all'
BIRD 1.3.11 ready.
192.168.0.0/16 via 192.0.2.5 on lo255 [static10 12:49] * (200)
    Type: static unicast univ

And debugging output from bird
-----------------------------------------
KRT: Received async address notification (21)
KIF: IF6(lo255): removed IPA 192.0.2.1, flg 0, net 192.0.2.0/24, brd 192.0.2.255, opp 0.0.0.0
Flushing neighbor 192.0.2.5 on lo255
Static: neighbor notify for 192.0.2.5: iface 0000000000000000
Removing static route 10.0.0.0/8
27-09-2013 12:50:10 <TRACE> static10 > removed [sole] 10.0.0.0/8 via 192.0.2.5 on lo255
27-09-2013 12:50:10 <TRACE> kernel10 < removed 10.0.0.0/8 via 192.0.2.5 on lo255
nl_send_route(10.0.0.0/8,new=0)
IFA change notification (2) for lo255:192.0.2.1
KRT: Received async route notification (25)
KRT: Got 192.0.2.0/24, type=1, oif=6, table=254, prid=2, proto=(none)
KRT: Ignoring route - unknown table 254
KRT: Received async route notification (25)
KRT: Got 192.0.2.255/32, type=3, oif=6, table=255, prid=2, proto=(none)
KRT: Ignoring route - unknown table 255
KRT: Received async route notification (25)
KRT: Got 192.0.2.0/32, type=3, oif=6, table=255, prid=2, proto=(none)
KRT: Ignoring route - unknown table 255
KRT: Received async route notification (25)
KRT: Got 192.0.2.1/32, type=2, oif=6, table=255, prid=2, proto=(none)
KRT: Ignoring route - unknown table 255
KRT: Received async route notification (25)
KRT: Got 10.0.0.0/8, type=1, oif=6, table=10, prid=12, proto=kernel10
KRT: Ignoring route - echo
----------------------------------------------------------------------------------------

There is a problem with the route to 192.168.0.0/16: after its nexthop 192.0.2.5 is deleted, the route's reachability information changes (the nexthop is no longer available), but the route is still valid and installed in the KRT.

First, this is an issue with the 3.2.x Linux kernel, which does not flush its FIB on nexthop change (fixed in recent kernels, at least in 3.10).

Second, BIRD does not renew its information about a route when nexthop reachability changes for routes whose "gw" attribute was changed.

This happens because the neighbor status change notification is ignored in the static_neigh_notify() callback at proto/static/static.c for the route 192.168.0.0/16, which was created with "dest" RTD_BLACKHOLE and thus does not depend on any external information for determining route reachability.

Contrary to this, the route 10.0.0.0/8 has a neighbor attached to its nexthop with the %NEF_STICKY flag via neigh_find() to track state changes, so when the state of neighbor 192.0.2.5 changes, the static protocol receives a notification from neigh_up()/neigh_down() at nest/neighbor.c via the neigh_notify() (static_neigh_notify()) callback.

So when the state of neighbor 192.0.2.5 changes, static_neigh_notify() receives a pointer to the neighbor structure with the list of static routes that depend on the neighbor in the n->data field. However, n->data contains only one route, 10.0.0.0/8, as 192.168.0.0/16 was created by the static protocol as a blackhole.

Also, n->data might be empty if the nexthop of route 10.0.0.0/8 is set to something other than 192.0.2.5 and a notification is received for neighbor 192.0.2.5, which was set in filter code as the "gw" address.

This is only one case, which illustrates the core problem: how to handle notifications for routes with a changed "gw" attribute?

===========================================
So I need some advice from BIRD developers on how to address this issue correctly!
===========================================

Currently I have only a basic idea of how to do this:

1. Create a (possibly sticky) neighbor for the "gw" attribute in filter code with neigh_find(..., NEF_STICKY), and really change the gw attribute (and friends, as currently done in the patch) in the rta if n->scope > SCOPE_HOST (the neighbor currently exists). Otherwise do not touch rta->gw.

2. Each protocol must have a neigh_notify() callback defined; when a state change for a neighbor is received, we should find the route by prefix and propagate it to the table, as is done when a new route comes in or a route changes some of its attributes (nexthop again, for example?). The route should pass the "import" filter before entering the table in this way.

3. In the import filter, where the "gw" attribute is present, we check neighbor reachability with n->scope > SCOPE_HOST and do the real rta->gw attribute update.

--
SP5474-RIPE Sergey Popovich
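The core of the problem described above can be shown with a toy model of neighbor tracking. This is only a sketch under assumptions: struct toy_neighbor, struct toy_route, toy_track() and toy_neigh_down() are hypothetical names and do not reproduce BIRD's data structures; the deps list merely plays the role that n->data plays in the description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical route record, for illustration only. */
struct toy_route {
    const char *prefix;
    bool installed;
    struct toy_route *next_dep;  /* next route hanging off the same neighbor */
};

/* Hypothetical neighbor record; deps stands in for n->data. */
struct toy_neighbor {
    const char *addr;
    struct toy_route *deps;
};

/* Register a route as depending on the neighbor (a "via" route does this). */
static void
toy_track(struct toy_neighbor *n, struct toy_route *r)
{
    assert(n != NULL && r != NULL);
    r->next_dep = n->deps;
    n->deps = r;
}

/*
 * Neighbor went down: withdraw every route registered on it and return
 * how many were withdrawn. Routes that were never registered, such as a
 * blackhole route whose gw was rewritten later in a filter, are not seen.
 */
static int
toy_neigh_down(struct toy_neighbor *n)
{
    int withdrawn = 0;

    assert(n != NULL);
    for (struct toy_route *r = n->deps; r != NULL; r = r->next_dep) {
        if (r->installed) {
            r->installed = false;
            withdrawn++;
        }
    }
    return withdrawn;
}
```

If only the route 10.0.0.0/8 is registered with toy_track() on neighbor 192.0.2.5, toy_neigh_down() withdraws that one route and leaves 192.168.0.0/16 installed, matching the birdc output after the address deletion. Step 1 of the idea above amounts to registering the filter-modified route on the neighbor as well.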