Hi there,
I'm currently facing some weird issues with multipath routing, and I'm
getting desperate to solve them. :-(
Overview:
---------
We have two distinct datacenters, linked to our office network over
VTUND VPNs. In our office, one Linux server has two VTUN tunnels
connected to our DCs (one tunnel per DC). The DCs are also connected to
each other through a VTUN tunnel.
So, basically, it looks something like:
		Office
		  |
		Firewall
		  |
	     	VTUN_box 
		|	|
    ----------------------------------INTERNET		
		|	|
		DC1-----DC2
In this situation, everything works just fine. However, for
redundancy and load-balancing reasons, we want to build the following
setup:
		Office
		  |
		Firewall
		  |
	    ---------------
	    |		   |
	Vtun_Box1-------Vtun_Box2
	 |	|	 |	|
    ----------------------------------INTERNET		
	 |	|	 |	|
	DC1----DC2	DC1----DC2
In such a setup, if Vtun_Box1 crashes, all traffic going to our DCs would
be redirected by the firewall through Vtun_Box2, and vice versa.
On top of this, if one or both tunnels on one Vtun box stop working
while the Vtun box itself is still alive, it will automatically redirect
all traffic through the other Vtun box.
Note that both Vtun boxes are on the same network segment; that is, they
have the same network address, broadcast address and netmask. Only their
IP addresses differ. Thus, the firewall reaches both Vtun boxes through
the same device (eth1, here).
Also, the firewall doesn't NAT traffic going to the DCs, since each Vtun
box already NATs everything going out through the tunnels.
Now, regarding the server settings:
Firewall:
--------- 
System: Linux, stock kernel 2.2.22 with Julian's patches applied
routing policy:
0:      from all lookup local
50:     from all lookup main
101:    from all lookup prod-vpn	# Traffic going to both DCs
200:    from all lookup uunet		# Default route 
32766:  from all lookup main
32767:  from all lookup default
Where:
ip route list table prod-vpn:
DC1_NET/24 proto static
        nexthop via Vtun_Box1  dev eth1 weight 1
        nexthop via Vtun_Box2  dev eth1 weight 1
DC2_NET/24 proto static
        nexthop via Vtun_Box1  dev eth1 weight 1
        nexthop via Vtun_Box2  dev eth1 weight 1
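For reference, the rule and routes listed above could be recreated with commands along these lines. This is only a sketch: DC1_NET, DC2_NET, VTUN_BOX1 and VTUN_BOX2 are the placeholders from the listing turned into shell variables, the IP variable and the setup_prod_vpn function name are made up for illustration, and the prod-vpn table is assumed to be declared in /etc/iproute2/rt_tables already.

```shell
# Sketch only. Set DC1_NET, DC2_NET, VTUN_BOX1 and VTUN_BOX2 to the real
# prefixes and gateway addresses before calling setup_prod_vpn.
IP=${IP:-ip}   # set IP=echo for a dry run that only prints the arguments

setup_prod_vpn() {
    # Rule 101: consult the prod-vpn table for all traffic.
    $IP rule add from all lookup prod-vpn priority 101
    for net in "$DC1_NET" "$DC2_NET"; do
        # One equal-weight nexthop per Vtun box, both reached via eth1.
        $IP route add "$net" proto static table prod-vpn \
            nexthop via "$VTUN_BOX1" dev eth1 weight 1 \
            nexthop via "$VTUN_BOX2" dev eth1 weight 1
    done
}

# setup_prod_vpn   # run as root once the variables are set
```

With IP=echo the function only prints the three commands, which makes it easy to compare them against the "ip rule" and "ip route list table prod-vpn" output before touching the live tables.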
Vtun_Box1: 
----------
System: Linux, stock kernel 2.2.19
NAT:
MASQ       all  ------  anywhere             anywhere              n/a
On this box, we have 
172.1.1.1 as the local ip of the tunnel to DC1
172.1.2.1 as the local ip of the tunnel to DC2
 
Vtun_Box2:
---------- 
System: Linux, stock kernel 2.4.19
NAT:
SNAT  all  --  any  tun2  anywhere  anywhere  to:172.1.1.3
SNAT  all  --  any  tun3  anywhere  anywhere  to:172.1.2.3
Where 172.1.1.3 is the local ip of the tunnel to DC1
Where 172.1.2.3 is the local ip of the tunnel to DC2 
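The NAT listings for both boxes would correspond to rules roughly like the following. This is a sketch reconstructed from the listings, not a copy of the actual scripts: the function names and the IPCHAINS/IPTABLES variables are made up for illustration, and the real rule sets may contain more than this.

```shell
# Sketch only. IPCHAINS and IPTABLES default to the real tools; set them
# to "echo" for a dry run that only prints the arguments.
IPCHAINS=${IPCHAINS:-ipchains}
IPTABLES=${IPTABLES:-iptables}

nat_vtun_box1() {
    # 2.2 kernel (ipchains): masquerade everything that is forwarded.
    $IPCHAINS -A forward -j MASQ
}

nat_vtun_box2() {
    # 2.4 kernel (iptables): rewrite the source address to the local
    # address of the tunnel the packet leaves through.
    $IPTABLES -t nat -A POSTROUTING -o tun2 -j SNAT --to-source 172.1.1.3
    $IPTABLES -t nat -A POSTROUTING -o tun3 -j SNAT --to-source 172.1.2.3
}
```

Either way, traffic that reaches a DC through Vtun_Box1 and traffic that reaches it through Vtun_Box2 arrives with different source addresses.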
Now, the problem :-)
We mostly use SSH to reach our DCs. In the simple setup, where we don't
do multipath routing (i.e., with only one Vtun box), everything works
fine. We can ssh into any machine in any DC without problems. SSH
sessions are stable, and stop working only when the NAT TTL has expired.
However, when we activate multipath routing, everything goes wrong. 
For instance:
[root@leonard /root]# ssh -l root lime.hosting.kelkoo.net
root@lime.hosting.kelkoo.net's password:
Read from remote host lime.hosting.kelkoo.net: Connection reset by peer
Connection to lime.hosting.kelkoo.net closed.
SSH simply doesn't work anymore, and it's neither a netfilter issue nor
a TCP wrapper ACL problem. We've checked every firewall rule and every
TCP wrapper ACL. Everything is OK.
What is weird is that much simpler protocols seem to work fine; e.g.,
a telnet to the same host, instead of SSH, will work.
Same thing if we telnet to the SMTP port, for instance, and start
simulating an SMTP dialog; it'll work just fine.
I also noticed that ping and traceroute ICMP packets work just fine,
whichever path they use to reach a DC.
However, I think that we are likely to have the same problems with
simple protocols as well, if we look a bit deeper, and start heavy
testing.
Right now, we can't use SSH or FTP with our DCs; all sessions crash
just after authentication. Rarely, we manage to SSH into the machines
successfully, but the session crashes a few minutes later.
I'm a bit worried about this situation, because it seems that packets
don't use the proper reverse path to come back, even though we are
NAT'ing everything going out through the tunnels!
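To make the suspicion concrete, here is a small illustration of what a DC1 host would see if packets of one TCP session were spread over both Vtun boxes. It is not a diagnosis: the addresses are the tunnel-local IPs from the NAT settings above, while the function names and the packet-to-box sequences are hypothetical.

```shell
# Illustration only: no real traffic is involved.

nat_source() {
    # Source address a packet carries once the given box has NAT'ed it
    # onto its DC1 tunnel.
    case "$1" in
        Vtun_Box1) echo 172.1.1.1 ;;
        Vtun_Box2) echo 172.1.1.3 ;;
        *) return 1 ;;
    esac
}

sources_seen() {
    # Print the distinct source addresses the DC1 host sees for one
    # session whose packets went through the given sequence of boxes.
    for box in "$@"; do
        nat_source "$box"
    done | sort -u
}

sources_seen Vtun_Box1 Vtun_Box1 Vtun_Box1   # same box throughout
sources_seen Vtun_Box1 Vtun_Box2 Vtun_Box1   # boxes mixed within a session
```

In the first case the remote host can match every segment to the session it knows. In the second, segments arriving from the other address belong to no connection it has seen, which is one way a reset like the one shown earlier could arise.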
Maybe the problem comes from the fact that both Vtun box gateways are
reachable through the same firewall device, but in that case I'd like
to be sure before throwing everything out. :)
I don't get what's going on here; any help would be greatly
appreciated.
Thanks in advance.
Best regards,
Vincent Jaussaud
-- 
Vincent Jaussaud
Kelkoo.com Security Manager 
email: tatooin@kelkoo.com
"The UNIX philosophy is to design small tools that do one thing, and do
it well."
_______________________________________________
LARTC mailing list / LARTC@mailman.ds9a.nl
http://mailman.ds9a.nl/mailman/listinfo/lartc HOWTO: http://lartc.org/