Jorge Adrian Salaices
2012-Jan-27 01:51 UTC
[Ocfs2-users] Help ! OCFS2 unstable on Disparate Hardware
I have been working on trying to convince management at work that we should move from NFS to OCFS2 for sharing the application layer of our Oracle EBS (Enterprise Business Suite), and for a general "backup share", but general instability in my setup has dissuaded me from recommending it.

I have a mixture of 1.4.7 (EL 5.3) and 1.6.3 (EL 5.7 + UEK), and something as simple as an umount has triggered random node reboots, even on nodes whose OCFS2 mounts are not shared with the rebooting nodes. The problem I have is that my hardware is disparate, and some of these servers are even VMs. Several documents state that nodes have to be roughly equal in power and specs, and in my case that will never be true.

Unfortunately, I have also had several other random fencing events that common checks could not explain. For example, my network has never been the problem, yet one server may see another go away while all of the other services on that node are running perfectly fine. I can only surmise that an elevated load on the server starved the heartbeat process, preventing it from sending network packets to the other nodes.

My config has about 40 nodes in it. I have 4 or 5 different shared LUNs out of our SAN, and not all servers share all mounts: only 10 or 12 share one LUN, 8 or 9 share another, and 2 or 3 share a third. Unfortunately the complexity is such that a server may intersect with some of the servers but not all. Perhaps splitting my config into separate clusters is the solution, but only if a node can be part of multiple clusters:

node:
        ip_port = 7777
        ip_address = 172.20.16.151
        number = 1
        name = txri-oprdracdb-1.tomkinsbp.com
        cluster = ocfs2-back

node:
        ip_port = 7777
        ip_address = 172.20.16.152
        number = 2
        name = txri-oprdracdb-2.tomkinsbp.com
        cluster = ocfs2-back

node:
        ip_port = 7777
        ip_address = 10.30.12.172
        number = 4
        name = txri-util01.tomkinsbp.com
        cluster = ocfs2-util, ocfs2-back

node:
        ip_port = 7777
        ip_address = 10.30.12.94
        number = 5
        name = txri-util02.tomkinsbp.com
        cluster = ocfs2-util, ocfs2-back

cluster:
        node_count = 2
        name = ocfs2-back

cluster:
        node_count = 2
        name = ocfs2-util

Is this even legal, or can it be done some other way? Or is this handled by the different domains that are created once a mount is done?

How can I make the cluster more stable? And why does a node fence itself even if it does not hold any locks on the shared LUN? It seems that a node may be "fenceable" simply by having the OCFS2 services turned on, without a mount. Is this correct?

Another question I have: can the fencing method be something other than panic or restart? Can a third-party or userland event be triggered to recover from what the "heartbeat" or "network tests" construe as a downed node?

Thanks for any help you can give me.

--
Jorge Adrian Salaices
Sr. Linux Engineer
Tomkins Building Products
Sérgio Surkamp
2012-Jan-27 16:54 UTC
[Ocfs2-users] Help ! OCFS2 unstable on Disparate Hardware
Hello Jorge,

> I have a mixture of 1.4.7 (EL 5.3) and 1.6.3 (EL 5.7 + UEK), and
> something as simple as an umount has triggered random node reboots,
> even on nodes whose OCFS2 mounts are not shared with the rebooting
> nodes. The problem I have is that my hardware is disparate, and some
> of these servers are even VMs.

This is probably the source of your instability. You should not mix different versions of the filesystem in the same cluster stack, as there *may* be network protocol incompatibilities between the versions. You should also not mount the same filesystem with different driver versions.

The nodes that fence while not mounting OCFS2 probably do so because of an oops inside the o2cb (cluster stack) driver, which *could be* triggered by the mix of versions and a protocol incompatibility between them.

> Several documents state that nodes have to be roughly equal in power
> and specs, and in my case that will never be true. Unfortunately, I
> have also had several other random fencing events that common checks
> could not explain. For example, my network has never been the
> problem, yet one server may see another go away while all of the
> other services on that node are running perfectly fine. I can only
> surmise that an elevated load on the server starved the heartbeat
> process, preventing it from sending network packets to the other
> nodes.
>
> My config has about 40 nodes in it. I have 4 or 5 different shared
> LUNs out of our SAN, and not all servers share all mounts: only 10 or
> 12 share one LUN, 8 or 9 share another, and 2 or 3 share a third.
> Unfortunately the complexity is such that a server may intersect with
> some of the servers but not all. Perhaps splitting my config into
> separate clusters is the solution, but only if a node can be part of
> multiple clusters:
>
> node:
>         ip_port = 7777
>         ip_address = 172.20.16.151
>         number = 1
>         name = txri-oprdracdb-1.tomkinsbp.com
>         cluster = ocfs2-back
>
> node:
>         ip_port = 7777
>         ip_address = 172.20.16.152
>         number = 2
>         name = txri-oprdracdb-2.tomkinsbp.com
>         cluster = ocfs2-back
>
> node:
>         ip_port = 7777
>         ip_address = 10.30.12.172
>         number = 4
>         name = txri-util01.tomkinsbp.com
>         cluster = ocfs2-util, ocfs2-back
>
> node:
>         ip_port = 7777
>         ip_address = 10.30.12.94
>         number = 5
>         name = txri-util02.tomkinsbp.com
>         cluster = ocfs2-util, ocfs2-back
>
> cluster:
>         node_count = 2
>         name = ocfs2-back
>
> cluster:
>         node_count = 2
>         name = ocfs2-util
>
> Is this even legal, or can it be done some other way? Or is this
> handled by the different domains that are created once a mount is
> done?

That is not possible. The cluster configuration does not support the definition of more than one cluster. Take a look at the list archives if you are interested in why there cannot be more than one definition.
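For reference, a valid /etc/ocfs2/cluster.conf carries exactly one cluster stanza, and every node stanza references it. A minimal sketch, with placeholder names and addresses:

        cluster:
                node_count = 2
                name = ocfs2

        node:
                ip_port = 7777
                ip_address = 192.168.0.1
                number = 0
                name = node1.example.com
                cluster = ocfs2

        node:
                ip_port = 7777
                ip_address = 192.168.0.2
                number = 1
                name = node2.example.com
                cluster = ocfs2

On the version-mixing point above, it is also worth confirming exactly which driver and tools each node is actually running; on EL5 something like the following works, though the exact fields reported vary by build:

        # modinfo ocfs2 | grep -i version
        # rpm -q ocfs2-tools
        # uname -r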
> How can I make the cluster more stable? And why does a node fence
> itself even if it does not hold any locks on the shared LUN? It seems
> that a node may be "fenceable" simply by having the OCFS2 services
> turned on, without a mount. Is this correct?
>
> Another question I have: can the fencing method be something other
> than panic or restart? Can a third-party or userland event be
> triggered to recover from what the "heartbeat" or "network tests"
> construe as a downed node?
>
> Thanks for any help you can give me.

The server fences because a driver issued a kernel oops on unexpected behaviour, so there is no guarantee that the kernel or the driver is still stable when that happens. That is why it is recommended that the server restart in this case.

You can disable the automatic fence by setting the sysctl parameter kernel.panic_on_oops:

        # echo 0 > /proc/sys/kernel/panic_on_oops

To disable it permanently, add (or modify) the following line in your /etc/sysctl.conf:

        kernel.panic_on_oops=0

With this setting the server will not fence when a kernel driver issues an oops; instead, the driver will probably crash and may leave the server unstable, or take it down with a kernel panic anyway. In any case, you should configure netconsole, so that if any node oopses or panics you still get the error messages and stack traces.

Regards,

--
Sérgio Surkamp | Network Administrator
sergio at gruposinternet.com.br

Grupos Internet S.A.
R. Lauro Linhares, 2123 Torre B - Sala 201
Trindade - Florianópolis - SC
+55 48 3234-4109
http://www.gruposinternet.com.br
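A minimal netconsole sketch for the suggestion above. The log-host address, interface name and MAC address below are placeholders; the parameter format is <src-port>@<src-ip>/<dev>,<tgt-port>@<tgt-ip>/<tgt-mac>.

On the node being watched (kernel messages are sent over UDP):

        # modprobe netconsole netconsole=6666@10.30.12.94/eth0,514@10.30.12.10/00:11:22:33:44:55

On the log host (capture the stream; netcat flags differ between flavours, and some need -p before the port):

        # nc -u -l 514 | tee netconsole.log

With this in place, oops and panic traces should reach the log host even if the node fences itself immediately afterwards.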
Sunil Mushran
2012-Jan-27 18:41 UTC
[Ocfs2-users] Help ! OCFS2 unstable on Disparate Hardware
Symmetric clustering works best when the nodes are comparable, because all nodes have to work in sync. NFS may be more suitable for your needs.

On 01/26/2012 05:51 PM, Jorge Adrian Salaices wrote:
> I have been working on trying to convince management at work that we
> should move from NFS to OCFS2 for sharing the application layer of
> our Oracle EBS (Enterprise Business Suite), and for a general "backup
> share", but general instability in my setup has dissuaded me from
> recommending it.
>
> I have a mixture of 1.4.7 (EL 5.3) and 1.6.3 (EL 5.7 + UEK), and
> something as simple as an umount has triggered random node reboots,
> even on nodes whose OCFS2 mounts are not shared with the rebooting
> nodes. The problem I have is that my hardware is disparate, and some
> of these servers are even VMs.
>
> Several documents state that nodes have to be roughly equal in power
> and specs, and in my case that will never be true. Unfortunately, I
> have also had several other random fencing events that common checks
> could not explain. For example, my network has never been the
> problem, yet one server may see another go away while all of the
> other services on that node are running perfectly fine. I can only
> surmise that an elevated load on the server starved the heartbeat
> process, preventing it from sending network packets to the other
> nodes.
>
> My config has about 40 nodes in it. I have 4 or 5 different shared
> LUNs out of our SAN, and not all servers share all mounts: only 10 or
> 12 share one LUN, 8 or 9 share another, and 2 or 3 share a third.
> Unfortunately the complexity is such that a server may intersect with
> some of the servers but not all. Perhaps splitting my config into
> separate clusters is the solution, but only if a node can be part of
> multiple clusters:
>
> node:
>         ip_port = 7777
>         ip_address = 172.20.16.151
>         number = 1
>         name = txri-oprdracdb-1.tomkinsbp.com
>         cluster = ocfs2-back
>
> node:
>         ip_port = 7777
>         ip_address = 172.20.16.152
>         number = 2
>         name = txri-oprdracdb-2.tomkinsbp.com
>         cluster = ocfs2-back
>
> node:
>         ip_port = 7777
>         ip_address = 10.30.12.172
>         number = 4
>         name = txri-util01.tomkinsbp.com
>         cluster = ocfs2-util, ocfs2-back
>
> node:
>         ip_port = 7777
>         ip_address = 10.30.12.94
>         number = 5
>         name = txri-util02.tomkinsbp.com
>         cluster = ocfs2-util, ocfs2-back
>
> cluster:
>         node_count = 2
>         name = ocfs2-back
>
> cluster:
>         node_count = 2
>         name = ocfs2-util
>
> Is this even legal, or can it be done some other way? Or is this
> handled by the different domains that are created once a mount is
> done?
>
> How can I make the cluster more stable? And why does a node fence
> itself even if it does not hold any locks on the shared LUN? It seems
> that a node may be "fenceable" simply by having the OCFS2 services
> turned on, without a mount. Is this correct?
>
> Another question I have: can the fencing method be something other
> than panic or restart? Can a third-party or userland event be
> triggered to recover from what the "heartbeat" or "network tests"
> construe as a downed node?
>
> Thanks for any help you can give me.
>
> --
> Jorge Adrian Salaices
> Sr. Linux Engineer
> Tomkins Building Products