Manish Kathuria
2006-Jan-26 13:29 UTC
Problems in Dead Gateway Detection / Failover - Multiple ISP Links
Hello,

I have configured a load-balancing router using Julian's patches, as described in "nano.txt", for two ISP links as shown below.

    ISP 1                   ISP 2
      .                       .
      |                       |
      |                       |
      |                       |
      | WAN               WAN |
    +-\-+                   +-\-+
    |   |                   |   |
    |R1 | GW1           GW2 |R2 |
    |   |------.    --------|   |
    |   |      |    |       |   |
    +---+      |    |       +---+
          EXT1 |    | EXT2
              +\----\-+
              |       |
              | LINUX |
              | ROUTER|
              |       |
              |       |
              |       |
              +---/---+
                  | INT IF
                  |
                  |
                  |
          /----------------\
          |      LAN       |
          |                |
          \----------------/

LAN NETWORK = 192.168.100.0/24
INT IF = 192.168.100.1

ISP1 NETWORK = 10.20.30.128/29
R1 - ROUTER1
GW1 = 10.20.30.129
EXT1 = 10.20.30.130

ISP2 NETWORK = 172.16.32.128/29
R2 - ROUTER2
GW2 = 172.16.32.129
EXT2 = 172.16.32.130

Both ISPs have provided /29 subnets of public IPs. The addresses mentioned above are just examples.

The gateways for both ISPs are routers placed at the same location, which are further connected through a radio link and a leased line.

Things work fine as long as both ISP links are alive. While testing the dead gateway detection and failover functionality, we observed that if we make the first-hop gateway (i.e. router R1 or R2) of one of the ISPs dead, either by disconnecting the Ethernet cable between the Linux router and R1/R2 or by switching off the gateway (R1/R2) itself, dead gateway detection takes place and traffic fails over to the other ISP. However, if there is a problem in the ISP connectivity at any of the subsequent hops, there is no dead gateway detection and failover does not take place either. I have tested this on various Linux kernels from the 2.4 as well as the 2.6 series.

Somehow I have never faced a similar problem before and things have been working perfectly. In the real-life situation here, the first-hop gateway is rarely going to be down, so dead gateway detection and failover will be required whenever there is a connectivity problem at any of the later hops. That is where dead gateway detection needs to work.

What could be the reason? How can this be resolved? I would appreciate any pointers or suggestions.

Thanks, Manish Kathuria
gypsy
2006-Jan-29 19:50 UTC
Re: Problems in Dead Gateway Detection / Failover - Multiple ISP Links
Manish Kathuria wrote:

--== snip ==--

> However, if there is a problem in the ISP connectivity at any of the
> subsequent hops, there is no dead gateway detection and failover also
> does not take place. I have tested this on various linux kernels from
> 2.4 as well as 2.6 series.
>
> Somehow I have never faced a similar problem before and things have been
> working perfectly. In real life situation here, the first hop gateway is
> rarely going to be down so dead gateway detection and failover is going
> to be required whenever there is some connectivity problem at any of the
> later hops. So that's where dead gateway detection needs to work.
>
> What could be the reason ? How can this be resolved ? I would appreciate
> any pointers or suggestions.
>
> Thanks,
>
> Manish Kathuria

Manish,

Same here (a long time ago. I no longer have multiple ISPs).

I don't have any answers for you, but here are a few pointers:

Use arping in a script, pinging the farthest hop of interest that arping can reach. Whenever arping returns a bad status, run 'ip route flush cache'. Put a nice long sleep in the script and run it all the time.

Perhaps in that same script, 'ping -c1 -I' each WAN interface in turn to some destination that must always be up but is reachable only by/on that interface. Run 'ip route flush cache' whenever that ping fails.

You are just trying to detect the up or down status of the link, so don't flood the connection with arping and ping packets. Using sleep, space those pings apart to something sensible.

Although Julian has never confirmed (or denied) this, it was my experience that only the **__FIRST__** nexthop affected the up or down status of the connection. If that succeeded, nothing would flag the connection as dead. If you know C, perhaps you can examine Julian's kernel patch to see if there is any useful information there. In my opinion, Julian should document exactly how DGD works. Perhaps he has and I just can't find it on his web site, but (when I cared) I was not able to find anything useful there.

Have you tried to engage Julian in a conversation to resolve this? He posts here occasionally, but I do not know if he answers questions about DGD off this list.

--
gypsy
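The probing loop suggested above could be sketched roughly as follows. This is only an illustration under assumptions, not a tested recipe: the interface names (eth1, eth2) and the probe targets (documentation-range addresses) are placeholders, and the ping options are those of the Linux iputils ping. Whether flushing the route cache is by itself enough to make the patched kernel choose another path is exactly the open question in this thread.

```python
import subprocess
import time

# Placeholder values: one probe target per WAN interface. Each target
# should be a host that is always up and reachable only through that link.
PROBES = {
    "eth1": "192.0.2.1",      # hypothetical target behind ISP 1
    "eth2": "198.51.100.1",   # hypothetical target behind ISP 2
}


def ping_command(interface, target):
    """Build one single-packet ping bound to a given interface."""
    return ["ping", "-c", "1", "-W", "2", "-I", interface, target]


def check_links(probes, run=subprocess.call):
    """Probe every link once; flush the route cache if any probe fails.

    Returns the list of interfaces whose probe failed. `run` is
    injectable so the logic can be exercised without touching the network.
    """
    failed = [ifc for ifc, target in probes.items()
              if run(ping_command(ifc, target)) != 0]
    if failed:
        run(["ip", "route", "flush", "cache"])
    return failed


def main(interval=30):
    """Run forever, spacing the probes apart so the links are not flooded."""
    while True:
        check_links(PROBES)
        time.sleep(interval)
```

Calling main() as root starts the loop; the 30-second interval is an arbitrary choice standing in for the "nice long sleep".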
Manish Kathuria
2006-Jan-30 03:38 UTC
Re: Problems in Dead Gateway Detection / Failover - Multiple ISP Links
gypsy wrote:
> Manish Kathuria wrote:
> --== snip ==--
>
>> However, if there is a problem in the ISP connectivity at any of the
>> subsequent hops, there is no dead gateway detection and failover also
>> does not take place. I have tested this on various linux kernels from
>> 2.4 as well as 2.6 series.
>>
>> Somehow I have never faced a similar problem before and things have been
>> working perfectly. In real life situation here, the first hop gateway is
>> rarely going to be down so dead gateway detection and failover is going
>> to be required whenever there is some connectivity problem at any of the
>> later hops. So that's where dead gateway detection needs to work.
>>
>> What could be the reason ? How can this be resolved ? I would appreciate
>> any pointers or suggestions.
>>
>> Thanks,
>>
>> Manish Kathuria
>
> Manish,
>
> Same here (a long time ago. I no longer have multiple ISPs).
>
> I don't have any answers for you, but here are a few pointers:

Thanks for your mail. I will try out the suggestions given by you.

> Use arping in a script, pinging the farthest hop that arping can reach
> that is of interest. Whenever arping returns a bad status, run 'ip
> route flush cache'. Put a nice long sleep in the script and run it all
> the time.
>
> Perhaps in that same script, 'ping -c1 -I' each WAN interface in turn to
> some destination that must always be up but reachable only by/on that
> interface. Run 'ip route flush cache' whenever that ping fails.

The only question is whether, by doing this, the kernel would be able to mark the gateway with the bad status as down. If it does not need any other intervention, then it's really superb.

> You are just trying to detect the up or down status of the link, so
> don't flood the connection with arping and ping packets. Using sleep,
> space those pings apart to something sensible.

I was thinking of writing a daemon which will ping a remote host through each of the WAN interfaces every 5 seconds. If one of them gives a bad status response 8-10 times in a row, the default route will be changed to the other ISP's gateway, and if the status changes again, it will be restored to the load-balanced multipath state. I will have to actually try and see which method fits in better here and is more elegant. If your suggestion works, it's perhaps the best way out.

> Although Julian has never confirmed (or denied) this, it was my
> experience that only the **__FIRST__** nexthop affected the up or down
> status of the connection. If that succeeded, nothing would flag the
> connection as dead. If you know C, perhaps you can examine Julian's
> kernel patch to see if there is any useful information there. In my
> opinion, Julian should document exactly how DGD works. Perhaps he has
> and I just can't find it on his web site, but (when I cared), I was not
> able to find anything useful there.

There are excellent documents at http://www.ssi.bg/~ja/dgd-usage.txt and http://www.ssi.bg/~ja/nano.txt which explain it very well. Quoting from the dgd-usage.txt document here ...

---Begin Quote---

* the alternative routes check the neighbour state not only for gateways but for hosts, i.e. for any kind of neighbours. Note that in some cases the neighbour can remain in reachable state while its nexthops are failed. For example, it is even possible the gateway to be a proxy ARP server and the gateway IP to remain always in reachable state. In such case we can not notice the real state of the gateway's IP.

* the alternative routes can be a list from unipath or multipath routes, using NOARP and ARP devices. As result, the first alive or first suspected (but not dead) route is selected by inspecting the state of the gateways in each path or the neighbours through the used device from the path.

* as result we take care of the state of each path in a multipath route and we try to use only the alive paths considering their relative weights

---End Quote---

In the situation I am currently dealing with, the first-hop gateway is always reachable. It is only the subsequent hops which can go down. When that happens, dead gateway detection doesn't work and the outgoing traffic keeps going out through the dead ISP's WAN interface. What confuses me is that DGD does work for one of the ISPs, which is connected identically. Could running routed / gated play a role in resolving this problem?

> Have you tried to engage Julian in a conversation to resolve this? He
> posts here occasionally but I do not know if he answers questions about
> DGD off this list.

I have not done it so far.

> --
> gypsy

Thanks once again for your suggestions.

--
Manish Kathuria
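The daemon idea described in this message (probe every 5 seconds, act only after 8-10 consecutive failures, restore when the link answers again) comes down to a small per-link counter. The following is a minimal sketch of that logic only; the class name, the default threshold of 8 and the rule that a single successful probe marks the link up again are assumptions made for illustration, not something stated in the thread.

```python
class LinkMonitor:
    """Tracks one uplink and declares it down after `threshold`
    consecutive failed probes; one successful probe marks it up again."""

    def __init__(self, name, threshold=8):
        self.name = name
        self.threshold = threshold
        self.failures = 0   # consecutive failed probes so far
        self.up = True

    def record(self, probe_ok):
        """Feed one probe result; return True if the up/down state changed."""
        was_up = self.up
        if probe_ok:
            self.failures = 0
            self.up = True
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.up = False
        return self.up != was_up
```

A daemon would keep one LinkMonitor per WAN interface, call record() with the result of each 5-second ping, and rewrite the default route only when record() returns True. Requiring several consecutive failures keeps a single lost packet from triggering a route change.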
Eduardo Fernández
2006-Apr-15 13:58 UTC
Re: Problems in Dead Gateway Detection / Failover - Multiple ISP Links
Hi!

Did you finally write a script for dead gateway detection beyond the first hop? Did you find any other solution to this problem? I'm quite interested, and I bet other multipath users here are interested too.

My Linux router has 10 DSL links (with 15 more to be added shortly). When one of the DSL routers goes down, the kernel does not always notice; I don't know why. Also, if a DSL route is up but the internet link is down, dead gateway detection doesn't work either.

Thanks!

Edu

On 1/26/06, Manish Kathuria <manish@tuxspace.com> wrote:
> Hello,
>
> I have configured a load balancing router using Julian's patches and as
> described in "nano.txt" for two ISP links as shown below.
>
>     ISP 1                   ISP 2
>       .                       .
>       |                       |
>       |                       |
>       |                       |
>       | WAN               WAN |
>     +-\-+                   +-\-+
>     |   |                   |   |
>     |R1 | GW1           GW2 |R2 |
>     |   |------.    --------|   |
>     |   |      |    |       |   |
>     +---+      |    |       +---+
>           EXT1 |    | EXT2
>               +\----\-+
>               |       |
>               | LINUX |
>               | ROUTER|
>               |       |
>               |       |
>               |       |
>               +---/---+
>                   | INT IF
>                   |
>                   |
>                   |
>           /----------------\
>           |      LAN       |
>           |                |
>           \----------------/
>
> LAN NETWORK = 192.168.100.0/24
> INT IF = 192.168.100.1
>
> ISP1 NETWORK = 10.20.30.128/29
> R1 - ROUTER1
> GW1 = 10.20.30.129
> EXT1 = 10.20.30.130
>
> ISP2 NETWORK = 172.16.32.128/29
> R2 - ROUTER2
> GW2 = 172.16.32.129
> EXT2 = 172.16.32.130
>
> Both the ISPs have provided /29 subnets of Public IPs. The above
> mentioned addresses are just for example.
>
> The gateways for both the ISPs are routers placed at the same location
> which are further connected through Radio Link and Leased Line.
>
> Things work fine as long as both the ISP links are alive. While testing
> the dead gateway detection and failover functionality we observed that
> if we make the first hop gateway (i.e Router R1 or R2) of one of the
> ISPs dead by either disconnecting the ethernet cable between Linux
> Router and R1/R2 or by switching off the gateway (R1/R2) itself, dead
> gateway detection takes place and failover to the other ISP takes place.
> However, if there is a problem in the ISP connectivity at any of the
> subsequent hops, there is no dead gateway detection and failover also
> does not take place. I have tested this on various linux kernels from
> 2.4 as well as 2.6 series.
>
> Somehow I have never faced a similar problem before and things have been
> working perfectly. In real life situation here, the first hop gateway is
> rarely going to be down so dead gateway detection and failover is going
> to be required whenever there is some connectivity problem at any of the
> later hops. So that's where dead gateway detection needs to work.
>
> What could be the reason ? How can this be resolved ? I would appreciate
> any pointers or suggestions.
>
> Thanks,
>
> Manish Kathuria
> _______________________________________________
> LARTC mailing list
> LARTC@mailman.ds9a.nl
> http://mailman.ds9a.nl/cgi-bin/mailman/listinfo/lartc
Manish Kathuria
2006-Apr-21 01:48 UTC
Re: Problems in Dead Gateway Detection / Failover - Multiple ISP Links
Eduardo Fernández wrote:
> Hi!
>
> Did you finally write a script for dead gateway detection beyond first
> hop? Did you find any other solution to this problem? I'm quite
> interested and I bet other multipath users here are interested too.
>
> My linux router has 10 dsl links (adding 15 more in short), when one
> of the dsl routers goes down the kernel does not always notice. Don't
> know why. Also, if a dsl route is up but the internet link is down
> dead gateway detection doesn't work either.
>
> Thanks!
>
> Edu

If you follow the nano.txt procedure and apply the patches, failover works perfectly when it is the first hop that is dead. To ensure failover when connectivity goes down at any of the later hops, you can use nano.txt for configuring the interfaces and multipath routes (call it the default configuration) and also run a script in the background to modify the routes as described below.

1. Periodically check whether a remote host is reachable through each of the gateways by pinging it every n seconds.

2. If the remote host is not reachable through a particular gateway after a number of tries (which you can decide according to your own specific situation), remove that route. If you have just two internet links, there will be only one gateway left. But if more than two links are still alive, you can again define multipath routes with appropriate weights for the active gateways. The number of possible combinations increases exponentially with the number of internet links, so you will have to factor all the cases into the script.

3. Restore the default configuration when the remote host is reachable through all the gateways.

I am not too sure how it is going to behave with 10 links, because if the links are not very stable it will result in very frequent route changes.

--
Manish Kathuria
http://www.tuxspace.com/
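One possible way to handle step 2 without listing every combination by hand is to derive the route from whichever gateways are currently alive. The sketch below only builds the argument list for an "ip route replace default" command. The device names (eth1, eth2) and weights are assumptions, the addresses are the example ones from the first post, and a nano.txt-style setup normally keeps its multipath route in a separate routing table, so the command would need a "table" argument to match it.

```python
def default_route_command(gateways, alive):
    """Return the argv for `ip route replace default ...` using only the
    gateways that are currently alive, or None if none are alive.

    gateways: dict mapping a link name to (gateway_ip, device, weight)
    alive:    collection of link names whose probes currently succeed
    """
    hops = [gateways[name] for name in gateways if name in alive]
    if not hops:
        return None  # nothing usable; leave the routing table untouched
    cmd = ["ip", "route", "replace", "default"]
    if len(hops) == 1:
        via, dev, _weight = hops[0]
        return cmd + ["via", via, "dev", dev]
    for via, dev, weight in hops:
        cmd += ["nexthop", "via", via, "dev", dev, "weight", str(weight)]
    return cmd


# Example with the two links from the first post (device names assumed).
GATEWAYS = {
    "isp1": ("10.20.30.129", "eth1", 1),
    "isp2": ("172.16.32.129", "eth2", 1),
}
```

With both links alive this yields a two-nexthop multipath route, which also covers step 3; with only isp2 alive it yields a plain default route via 172.16.32.129. The resulting list can be passed to subprocess.call by the monitoring script.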