thr3ads.net - Ocfs2 users - [Ocfs2-users] re: how should ocfs2 react to nic hardware issue [Nov 2006]

Peter Santos

2006-Nov-30 12:26 UTC

[Ocfs2-users] re: how should ocfs2 react to nic hardware issue

guys,
I'm trying to test how my 10gR2 Oracle cluster (3 nodes) on SUSE reacts to
a network card hardware failure.
I have eth0 and eth1 as my network cards. I took down eth0 (ifdown eth0) to
see what would happen, and I didn't get any reaction from the o2cb service.
This is probably the correct behavior, since my /etc/ocfs2/cluster.conf uses
eth1 as the connection channel?

If I take down eth1, I suspect o2cb will eventually reboot the machine, right?
I'm not using any bonding.

My concern is that when I took down eth0, I had a user logged into the instance
and everything just "hung" for that user until I manually took down the
instance with srvctl... then the user connection failed over to a working
instance.

Anyway, I'm just trying to get some general knowledge of the behavior of o2cb
in order to understand my testing.

-peter



Adam Kenger

2006-Nov-30 13:10 UTC

[Ocfs2-users] re: how should ocfs2 react to nic hardware issue

Peter - depending on how you have your RAC cluster set up, the hang on
the front end is not that unexpected. It depends on how the user was
connected and how the TAF policy was set up. Eth0 is your public
interface, I assume. Was the user connecting to the IP on that
interface or to the VIP set up by RAC? When you down eth0, the VIP on
that interface should get pushed over onto one of the other 2 nodes
in the cluster. If you're connecting to a "service" rather than an actual
"instance", there should be no hang on the front end. If you're
actually connected directly to the instance on the node, then you'll
be out of luck if you disconnect that instance. As an example, this
is what the corresponding tnsnames.ora file looks like:

MYDBSERVICE =
  (DESCRIPTION =
    (ADDRESS_LIST =
      (ADDRESS = (PROTOCOL = TCP)(HOST = node1-vip)(PORT = 1521))
      (ADDRESS = (PROTOCOL = TCP)(HOST = node2-vip)(PORT = 1521))
      (ADDRESS = (PROTOCOL = TCP)(HOST = node3-vip)(PORT = 1521))
    )
    (CONNECT_DATA =
      (SERVICE_NAME = mydbservice.db.mydomain.com)
    )
  )

MYDB1 =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = node1-vip)(PORT = 1521))
    (CONNECT_DATA =
      (SERVER = DEDICATED)
      (SERVICE_NAME = mydb.db.mydomain.com)
      (INSTANCE_NAME = mydb1)
    )
  )

If you connected to the service "MYDBSERVICE" you could survive the  
failure of any given node.  You'd seamlessly fail-over onto one of  
the other nodes.  If you connect directly to the "MYDB1" instance,  
you'll be out of luck if you drop the connection to it.
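
A rough way to picture the difference between the two aliases: with an address
list, the client can move on to the next listener when one does not answer,
whereas a single-address alias has nowhere else to go. The Python sketch below
only illustrates that idea with plain TCP connects. It is not how Oracle Net is
implemented, and the helper names (first_reachable, tcp_probe) are made up for
this example. The host names are the placeholders from the tnsnames.ora entries
above.

```python
import socket
from typing import Callable, Iterable, Optional, Tuple

Address = Tuple[str, int]


def tcp_probe(addr: Address, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection(addr, timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


def first_reachable(
    addresses: Iterable[Address],
    probe: Callable[[Address], bool] = tcp_probe,
) -> Optional[Address]:
    """Try each address in order and return the first one that answers."""
    for addr in addresses:
        if probe(addr):
            return addr
    return None


# Placeholder hosts taken from the example aliases above.
MYDBSERVICE = [("node1-vip", 1521), ("node2-vip", 1521), ("node3-vip", 1521)]
MYDB1 = [("node1-vip", 1521)]

if __name__ == "__main__":
    # Simulate node1 being unreachable without touching the network.
    def node1_down(addr: Address) -> bool:
        return addr[0] != "node1-vip"

    print(first_reachable(MYDBSERVICE, node1_down))  # ('node2-vip', 1521)
    print(first_reachable(MYDB1, node1_down))        # None
```

Passing the probe as a parameter keeps the ordering logic testable without a
live listener; dropping the second argument makes it use real TCP connects.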

As far as o2cb goes, you are right, I believe. Eventually the cluster will
determine that the node is no longer heartbeating, and the node will either
panic or reboot.
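
Since /etc/ocfs2/cluster.conf lists an IP address per node rather than an
interface name, one way to check which NIC o2cb depends on is to pull out the
ip_address values and compare them with what is configured on eth0 and eth1.
A minimal Python sketch, assuming the usual stanza layout of cluster.conf (a
"node:" or "cluster:" header followed by indented "key = value" lines). The
sample addresses and node names are invented for illustration.

```python
from typing import Dict, List

# Invented sample in the usual cluster.conf layout (tab-indented keys).
SAMPLE = """\
node:
\tip_port = 7777
\tip_address = 192.168.10.1
\tnumber = 0
\tname = node1
\tcluster = ocfs2

node:
\tip_port = 7777
\tip_address = 192.168.10.2
\tnumber = 1
\tname = node2
\tcluster = ocfs2

cluster:
\tnode_count = 2
\tname = ocfs2
"""


def parse_stanzas(text: str) -> List[Dict[str, str]]:
    """Split cluster.conf-style text into one dict per stanza."""
    stanzas: List[Dict[str, str]] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.endswith(":") and "=" not in line:
            # Stanza header such as "node:" or "cluster:".
            stanzas.append({"_type": line[:-1]})
        elif "=" in line and stanzas:
            key, value = line.split("=", 1)
            stanzas[-1][key.strip()] = value.strip()
    return stanzas


def node_addresses(text: str) -> Dict[str, str]:
    """Map node name -> IP address used for the o2cb interconnect."""
    return {
        s["name"]: s["ip_address"]
        for s in parse_stanzas(text)
        if s["_type"] == "node" and "name" in s and "ip_address" in s
    }


if __name__ == "__main__":
    print(node_addresses(SAMPLE))
    # {'node1': '192.168.10.1', 'node2': '192.168.10.2'}
```

On a real node the text would come from reading /etc/ocfs2/cluster.conf, and
the addresses can then be matched against the output of "ip addr show eth1".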

For your testing, just be careful you're not confusing the OCFS2  
layer with the Oracle CRS/RAC layers.

Comments welcome....

Hope that helps

Adam

Alexei_Roudnev

2006-Nov-30 14:22 UTC

[Ocfs2-users] re: how should ocfs2 react to nic hardware issue

One possible idea (which I am thinking of implementing):

- use the loopback interface for service access and for o2cb
- use the OSPF protocol (ospfd) to route it inside the network, so it will be
  available through either eth0 or eth1. OSPF has a very low convergence time,
  so it will change routing quickly if an interface fails.
- because it is IP-layer routing and not switch-layer bonding, it is a very
  reliable method.

But I have never had the time to try it.


Peter Santos

2006-Dec-01 20:03 UTC

[Ocfs2-users] re: how should ocfs2 react to nic hardware issue

Adam,
thanks for the feedback.
I wanted to quickly show you my setup, because I believe that when I took
down eth0 my connection hung even though my TAF policy is set up properly.


eth0   - public IP of the machine
eth0:1 - VIP managed by Oracle. This is also what clients connect to.
         DNS has entries called dbinsto# that point to the 3 VIPs on my cluster.

eth1   - private IP, used for the cluster interconnect and the o2cb service
         (cluster.conf)

my client tnsnames.ora
======================================================================
ORACTAH =
  (DESCRIPTION =
    (LOAD_BALANCE = ON)
    (FAILOVER = ON)
    (ADDRESS = (PROTOCOL = TCP)(HOST = dbinsto1)(PORT = 1521))  <-- vip
    (ADDRESS = (PROTOCOL = TCP)(HOST = dbinsto2)(PORT = 1521))  <-- vip
    (ADDRESS = (PROTOCOL = TCP)(HOST = dbinsto3)(PORT = 1521))  <-- vip
    (CONNECT_DATA =
      (SERVER = SHARED)
      (SERVICE_NAME = ORACTAH)
      (FAILOVER_MODE =
        (TYPE = SELECT)
        (METHOD = BASIC)
        (RETRIES = 20)
        (DELAY = 5)
      )
    )
  )
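
With RETRIES = 20 and DELAY = 5, a session that loses its connection waits
about 5 seconds between reconnect attempts and gives up after 20 of them. A
back-of-the-envelope sketch in Python of the resulting wait budget, assuming
DELAY is in seconds and ignoring the time each connect attempt itself takes
(which can be long if an attempt has to run into a TCP timeout on a dead
address). The function name is just for illustration.

```python
def taf_wait_budget(retries: int, delay_seconds: int) -> int:
    """Approximate total time spent waiting between TAF reconnect attempts."""
    if retries < 0 or delay_seconds < 0:
        raise ValueError("retries and delay_seconds must be non-negative")
    return retries * delay_seconds


if __name__ == "__main__":
    budget = taf_wait_budget(retries=20, delay_seconds=5)
    print(f"about {budget} seconds of waiting between attempts")
    # about 100 seconds of waiting between attempts
```

So a client using this entry could appear hung for well over a minute before
TAF either reconnects or reports an error, which is worth keeping in mind when
timing a failover test.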

Since the o2cb service operates via eth1 and there is no lost connectivity
to the shared device, o2cb should continue to work just fine, but
I'm sure that the VIP on this node did not get moved to another node.

When I repeated this same process on eth1, the o2cb service evicted this node
from the cluster and it was eventually rebooted. I'm not sure whether it was
the o2cb service or Oracle's CRS daemons that caused the reboot.

I'll keep testing further.

thanks
-peter




