thr3ads.net - Shorewall users - Failover with two ISP - trying to summarize (LONG) [Dec 2005]

Elio Tondo

2005-Dec-10 23:48 UTC

Failover with two ISP - trying to summarize (LONG)

Yesterday on shorewall-devel:

SUNIL <sunilks@rediff.co.in> wrote:
> Hi
> We have two Adsl broadband line which has 512 kbps speed we need to add
> both the line in to one firewall for load balance and switch over when one
> of the line get down and the gateway should be single for local network
Rune Kock <rune.kock@gmail.com> replied:
> Please look in the archives for the shorewall-users mailing list.
> This situation has been discussed a lot.  It is not a simple thing.

I asked a similar question some time ago on this list. I am trying to summarize
the answers and other related useful information, in the hope that this can be
useful to set up a test configuration.

On  November 22 I wrote:

| Hi,
|
| I just subscribed to this list, so forgive me if the question has already
| been asked and answered. I did not look at the list archives, but
| before posting this question I read the relevant documentation, the
| FAQ, the "providers" file format, followed the advice about setting
| MARK_IN_FORWARD_CHAIN=Yes in shorewall.conf. I also upgraded
| iptables to the latest development release available (I'm using FC4)
| and verified that the required patch is included. My kernel is:
| 2.6.14-1.1637_FC4smp .
|
| The machine acts as a firewall with multiple interfaces; they are assigned
| this way:
|
| eth0    main interface to Internet (ADSL, router, multiple IP addresses)
| eth1    local network
| eth2    dmz
| eth3    crossover cable to another server, for backup
| eth4    secondary interface to Internet (ADSL, router, multiple IP addresses)
|
| The secondary interface should act as a backup when the primary fails,
| and can also be useful for load balancing for outgoing connections.
| On the primary there are two additional IP addresses with static NAT
| active to two loc and dmz machines; they are active only on eth0.
| The firewall acts also as the MX for some domains, and I plan to
| publish its IP address on eth4 as a secondary MX.
|
| The providers file contains these lines:
|
| ISP1    1       1       main            eth0            85.xxx.xxx.xxx   track,balance   eth1,eth2
| ISP2    2       2       main            eth4            80.xxx.xxx.xxx   track,balance   eth1,eth2
|
| I verified that the firewall is reachable from outside through the addresses on
| both eth0 and eth4, therefore the above lines seem to work as expected.
|
| The problem arises when I test the failover. I try to disconnect eth0 or eth4, and
| I expect to have any new connection routed to the interface remaining active.
| Sometimes (rarely) this works, and a traceroute correctly shows that a route
| through the active router is selected. Most of the time, however, it looks like
| the previously used route is maintained, and the remote host is unreachable
| until the disconnected cable is reconnected.
|
| My question is: did I misunderstand when thinking that this setup can also be
| used for failover in case of failure on one of the external interfaces? Actually
| I did not find any document stating that this was possible, but I assumed that
| it was obviously possible.
|
| Thanks a lot for any help

Jerry Vonau <jvonau@shaw.ca> replied:

| You misunderstood, failover is not automatic out of the box. There are ways
| to monitor and change the config or speed up the trying of the alternate gateway.
| Check the mail archives, Friday, September 30, 2005
| Re: [Shorewall-users] shorewall + Squid + Two ISP setup
| Where I talk about some proc settings that you can play with, and John posted
| a monitoring script.

Rune Kock <rune.kock@gmail.com> replied:

| Julian Anastasov's patches http://www.ssi.bg/~ja/#routes seem to have
| something to do with this.  He calls it dead gateway detection.
|
| But his descriptions are hard to follow.  I'm afraid I don't really know what
| the patches actually do.
|
| Anyway, it's not easy for the kernel to know whether an internet link is dead
| -- in most cases, the link to the ADSL-modem will still be up.  So maybe
| you need some kind of daemon to ping a very stable server somewhere
| on the net, and then take the interface down if no replies arrive?

Then I looked at the list archives and I found:

From: John Hill <jhill@no...> 2005-09-30 07:15

| I have tested and found that on a ping fail to an isp gateway I can make a
| change in the tcrules file to mark all packets to the working ISP. Then
| restart Shorewall. Then test later and change it back to both. This keeps
| the outbound working.
| I question if this is a good idea. I have yet to put it in production.

From: Tom Eastep <teastep@sh...> 2005-09-30 07:24

| To just reload the tcrules file, all that is needed is "shorewall refresh"
| -- you probably also want to "ip route flush cache" to purge all of the
| cached routes through the down interface.

From: John Hill <jhill@no...> 2005-10-02 14:01

| Here is the script.
| It is not pretty. I am open to any suggestions.
| A line to send an email could be added.
|
| You need to create a tcfiles that has the proper packet markings per
| the 2 isp shorewall instructions.
|
| Copy it to tcrules.both. Then edit out isp2 and save as tcrules.isp1
| and another edit out isp1 and save as tcrules.isp2.
|
| This works here. We have not had many real world problems to test it on.
|
| PLEASE TEST THIS BEFORE USING!!!!!
|
| --John
|
| #!/bin/sh
|
| SWDIR=/etc/shorewall        # shorewall directory
| WKDIR=/root/cronscripts     # working directory for semaphores
| G_ISP1=xxx.xxx.xxx.xxx      # what to ping for isp1
| G_ISP2=xxx.xxx.xxx.xxx      # what to ping for isp2
| PINGCT=2                    # ping count
|
| # PingGateway <address to ping> <this isp> <other isp>
| PingGateway() {
|     if ! ping -c "$PINGCT" "$1"; then                   # Failed
|         cp "$SWDIR/tcrules.$3" "$SWDIR/tcrules"         # swap gateway
|         touch "$WKDIR/failed.$2"                        # semaphore for down gateway
|         shorewall refresh                               # read configs
|         ip route flush cache                            # flush routes
|     else                                                # Passed
|         if [ -f "$WKDIR/failed.$2" ]; then              # check for previous failure
|             rm "$WKDIR/failed.$2"                       # delete semaphore
|             cp "$SWDIR/tcrules.both" "$SWDIR/tcrules"   # return to both
|             shorewall refresh
|             ip route flush cache
|         fi
|     fi
| }
|
| # script starts here
|
| PingGateway "$G_ISP1" isp1 isp2
| PingGateway "$G_ISP2" isp2 isp1
| exit
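
A note on the script above: while a gateway stays down, every run takes the failed branch again and repeats the copy, refresh and flush. The following is only a sketch of one way to act on state changes alone. The function name next_action is hypothetical; WKDIR and the "failed.<isp>" semaphore files follow John's script.

```shell
#!/bin/sh
# Sketch only: derive the action from the probe result and the semaphore file.
# WKDIR and the failed.<isp> naming follow the script above; next_action is
# a hypothetical helper name.
WKDIR=${WKDIR:-/root/cronscripts}

# next_action <isp name> <probe exit status>
# Prints one of: failover, restore, none
next_action() {
    if [ "$2" -ne 0 ]; then
        if [ -f "$WKDIR/failed.$1" ]; then
            echo none        # already failed over on an earlier run
        else
            echo failover    # first failed probe: switch to the other ISP
        fi
    elif [ -f "$WKDIR/failed.$1" ]; then
        echo restore         # gateway answers again: return to tcrules.both
    else
        echo none            # healthy, nothing to undo
    fi
}
```

The caller would then copy the tcrules file, touch or remove the semaphore, and run "shorewall refresh" and "ip route flush cache" only when the result is failover or restore.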

From: Jerry Vonau <jvonau@sh...> 2005-10-02 10:25

| John:
|
| I *think* all you would need to do is delete, then re-add the fwmark to the
| working provider's lookup table, then flush the cache. I'd be interested in
| working with you off list to see what we could come up with. Email me
| off list if you're interested.
|
| For the failover issue, there are some proc settings that you can play with.
| http://mailman.ds9a.nl/pipermail/lartc/2002q4/005274.html and the reply
| is about the best info I could find regarding these settings. If anybody knows
| of some better documentation of these settings, I'd love to hear from you.
|
| FWIW, I tried changing some of the settings, in /proc/sys/net/ipv4/route
| echo 1 > gc_interval
| echo 1 > gc_timeout
| echo 1 > gc_elasticity
| echo 2 > max_delay
| echo 1 > min_delay
|
| Just before the test below, I unplugged the nic that had the higher weighted
| value for the gateway.
|
| This appears to speed up the trying of the alternate gateway.
|
| PING mail.gt.ca (216.18.99.22) 56(84) bytes of data.
| From 10.50.0.1 icmp_seq=1 Destination Host Unreachable
| From 10.50.0.1 icmp_seq=2 Destination Host Unreachable
| From 10.50.0.1 icmp_seq=3 Destination Host Unreachable
| From 10.50.0.1 icmp_seq=5 Destination Host Unreachable
| From 10.50.0.1 icmp_seq=6 Destination Host Unreachable
| From 10.50.0.1 icmp_seq=7 Destination Host Unreachable
| 64 bytes from mail.gt.ca (216.18.99.22): icmp_seq=8 ttl=56 time=57.7 ms
| 64 bytes from mail.gt.ca (216.18.99.22): icmp_seq=9 ttl=56 time=58.4 ms
| 64 bytes from mail.gt.ca (216.18.99.22): icmp_seq=10 ttl=56 time=59.3 ms
| 64 bytes from mail.gt.ca (216.18.99.22): icmp_seq=11 ttl=56 time=59.6 ms
| 64 bytes from mail.gt.ca (216.18.99.22): icmp_seq=12 ttl=56 time=54.9 ms
| 64 bytes from mail.gt.ca (216.18.99.22): icmp_seq=13 ttl=56 time=56.0 ms
|
| Before this, it seemed to take 'forever' to try the alternate gateway. This is not
| by any means conclusive, just me playing around and my observations. If you
| find that changing these settings works for you, I'd like to hear, off list, about
| what you tried. Use at your own risk, you've been warned.
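
Since the settings above are offered as an experiment, it may help to keep the previous values so they can be put back. This is a minimal sketch, not a tested recipe: the helper names set_route_param and restore_route_param and the SAVE_DIR location are assumptions. Only the /proc/sys/net/ipv4/route directory and the parameter names come from Jerry's message. A parameter that the running kernel does not expose is skipped.

```shell
#!/bin/sh
# Sketch only. ROUTE_DIR is the directory named in the message above;
# SAVE_DIR and both function names are hypothetical.
ROUTE_DIR=${ROUTE_DIR:-/proc/sys/net/ipv4/route}
SAVE_DIR=${SAVE_DIR:-/var/tmp/route-tuning}

# set_route_param <name> <value>
# Saves the current value once, then writes the new one.
# Returns 1 without changing anything if <name> does not exist.
set_route_param() {
    [ -e "$ROUTE_DIR/$1" ] || return 1
    mkdir -p "$SAVE_DIR" || return 1
    [ -e "$SAVE_DIR/$1.orig" ] || cat "$ROUTE_DIR/$1" > "$SAVE_DIR/$1.orig"
    echo "$2" > "$ROUTE_DIR/$1"
}

# restore_route_param <name>
# Writes the saved value back and forgets it.
restore_route_param() {
    [ -e "$SAVE_DIR/$1.orig" ] || return 1
    cat "$SAVE_DIR/$1.orig" > "$ROUTE_DIR/$1" && rm "$SAVE_DIR/$1.orig"
}
```

With these helpers, the values from the message would be applied as "set_route_param gc_interval 1", "set_route_param max_delay 2" and so on, run as root, and undone with restore_route_param.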

I also had a look at the suggested post at
http://mailman.ds9a.nl/pipermail/lartc/2002q4/005274.html
and the followup at
http://mailman.ds9a.nl/pipermail/lartc/2002q4/005296.html

That's all, so far, about the documentation / opinions / suggestions I found
about the matter.

Now, a question and a notice.

Question: in the last message quoted above and in the related links, it looks
like modifying routing parameters in /proc helps to speed up the restart on
the new route after the failure of one ISP link. Is this enough for failover
without any continuous monitoring by means of a script like John's, or is
this just fine tuning to help fast switching?

Notice: when looking at John's script, I assumed that it's ready to be called
frequently, i.e. once a minute from /etc/cron.d or similar means. It modifies a
"tcrules" file, but I would like to start from a load-balancing two-ISP setup using
the "providers" file in Shorewall 3.0.x. Therefore I assumed that a similar scheme
can be used to modify the "providers" file to achieve the same results. Assuming
this is also correct, I had a look at the script logic and I found something that
might not work as expected. The PingGateway function tries to ping the other
end of the PtP link to the ISP, or any other "near" gateway. Maybe this works
as expected when the gateway is on the same subnet as the network interface,
but this is not usually true when using PPPoE links through ADSL modems, like
in my situation; in that case the next hop has a totally different IP address from
the one assigned to my machine by the ISP. In this case, the ping probe uses
the routing tables and randomly reaches the ISP1 remote gateway through the
ISP1 link, or through the ISP2 link; the same happens on the second link.
This obviously cannot be used to test if a link is up or not.
My suggestion then is to force the use of the correct interface using the -I ping
option; this way the ping probe is guaranteed to go through the correct link.
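
The -I suggestion could look roughly like the sketch below. The function name link_is_up and the overridable PING variable are hypothetical. The interface names eth0 and eth4 come from the setup described earlier, and with PPPoE the interface to pass would more likely be a ppp device, which is an assumption here. Whether the probe actually gets a reply still depends on the interface having a usable route to the target.

```shell
#!/bin/sh
# Sketch only: probe a target through one specific interface.
# PING can be overridden (for example in tests); it defaults to the system ping.
PING=${PING:-ping}

# link_is_up <interface> <address to probe> [count]
# Returns 0 if ping through <interface> succeeds, non-zero otherwise.
link_is_up() {
    $PING -c "${3:-2}" -I "$1" "$2" >/dev/null 2>&1
}
```

In John's script, the plain "ping -c $PINGCT $1" in PingGateway would become something like "link_is_up eth0 $G_ISP1" for the first provider and "link_is_up eth4 $G_ISP2" for the second.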

Thanks to everybody with enough time and patience to get to this point ;)
and please, if  anyone has opinions about the correctness of the above
notice, and ideas about the question, any help would be very appreciated.

Regards
Elio


Jerry Vonau

2005-Dec-11 02:47 UTC

Re: Failover with two ISP - trying to summarize (LONG)

Hoping that I don't mess up the mail archive, again.
----- Original Message ----- 
Subject: [Shorewall-users] Failover with two ISP - trying to summarize (LONG)

> Now, a question and a notice.
> 
> 
> Question: in the last message quoted above and in the related links, it looks
> like modifying routing parameters in /proc helps to speed up the restart on
> the new route after the failure of one ISP link. Is this enough for failover without
> any continuous monitoring by means of a script like John's one, or is this just
> a fine tuning to help a fast switching?
>
That is just fine tuning, testing from the firewall itself. When "balance" is used in the
providers file, and the tcrules file is empty, "new" outbound connections *should*
fail over as above. You're free to test this on your own, from a client in the local LAN,
and report back your results; nobody else has reported anything, and I'd like to know
the results.

When you introduce entries into tcrules, those are used as the basis for the routing
choice, and when that choice is no longer present, they need to be undone. You might
have issues with your network up/down scripts; those tend to mess with your
multi-path gateway when an interface goes up or down. That is the reason Shorewall
needs to be restarted or refreshed.
> 
> Notice: when looking at John's script, I assumed that it's ready to be called
> frequently. i.e. once a minute from /etc/cron.d or similar means. It modifies a
> "tcrules" file, but I would like to start from a load-balancing two-ISP setup using
> the "providers" file in Shorewall 3.0.x. Therefore I assumed that a similar schema
> can be used to modify the "providers" file to achieve the same results. Assuming
> also this as correct, I had a look at the script logic and I found something that
I don't think you need to mess with the providers file, just mark the traffic for the
remaining ISP in tcrules. It's "cheaper" to run "shorewall refresh"; a change in the
providers file would entail a "restart". Once those marked packets start flowing,
the other gateway is ignored. If your ISP uses DHCP or ADSL, then the client scripts
that are run upon (dis)connection might wipe out your advanced routing, requiring
the restart, or you could write your own function that restores the routing tables and
the multi-path gateway. If you have static IPs, then just a refresh should do it; with
a dynamic IP from an ISP, a restart would be a safer bet as your address could
have changed.
> could not work as expected. The PingGateway function tries to ping the other
> end of the PtP link to the ISP, or any other "near" gateway. Maybe this works
> as expected when the gateway is on the same subnet as the network interface,
> but this is not usually true when using PPPoE links through ADSL modems, like
> in my situation; in that case the next hop has a totally different IP address from
> the one assigned to my machine by the ISP. In this case, the ping probe uses
> the routing tables and randomly reaches the ISP1 remote gateway through the
> ISP1 link, or through the ISP2 link; the same happens on the second link.
> This obviously cannot be used to test if a link is up or not.
> My suggestion then is to force the use of the correct interface using the -I ping
> option; this way the ping probe is guaranteed to go through the correct link.
Valid point that, thanks for the contribution.
> 
> 
> Thanks to everybody with enough time and patience to get to this point ;)
> and please, if  anyone has opinions about the correctness of the above
> notice, and ideas about the question, any help would be very appreciated.
> 
> Regards
> Elio
Thanks for taking the time to fix the thread that I broke. Hope you find it useful.

Jerry




