thr3ads.net - Ocfs users - [Ocfs-users] Major RAC slowdown [Jun 2004]

Derek Suzuki

2004-Jun-08 03:22 UTC

[Ocfs-users] Major RAC slowdown

Hello again.  Our production cluster has begun experiencing some vicious
slowdowns that may (or may not) be related to the filesystems.  When the problem
occurs, the load average on the servers jumps up to 30 or higher.  Usually one
node will climb while the other drops, then they will switch places a few
minutes later.  At one point, we had one node's load average up over 300. 
Our site activity has been on the rise, and the problems usually occur during
peak mid-day hours.
 
Under normal conditions, "top" shows the CPUs spending most of their
time waiting on the very busy fibre channel.  During the slowdowns, the
processors are mostly busy with system calls.  Traffic over both the fibre
channel and gigabit interconnect seems to drop off considerably at the same
time.
 
I've got a TAR open, but the support people are still in the very
preliminary stages (for example, we just installed a switch between the two
nodes because a crossover cable is apparently not supported).  There doesn't
seem to be any good indication of what's going on.  We suspected the
interconnect, but the private interfaces seem to behave normally while Oracle is
grinding to a halt.
 
After 10-30 minutes, the problem will fade away on its own.  I'm inclined to
blame something in the RAC inter-node communications code, but I was wondering
if this situation resembled any kind of OCFS problem anyone has seen.  These
servers are still on 1.0.9-12, with plans to go to 1.0.12 soon after this issue
is resolved.
 
Derek

Brian M. Diehl

2004-Jun-08 10:06 UTC

[Ocfs-users] Major RAC slowdown

Hi all.  I had a very similar problem: loads of 70 during our "peak"
times of the day.  In the end it had nothing to do with OCFS; the cause
was a lack of proper indexes.  Running a Top SQL report and then getting
the execution plan for the "heavy" SQL statements will show whether you
are doing full table scans and the like, which will busy out even the
fastest of arrays.  Just a thought, as it was what fixed my problems.
My 2-node cluster now runs with a combined load of just over 1.2 :-)
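
To illustrate the execution-plan check, here is a minimal Python sketch that scans plan text for full table scans.  It assumes the plan has been saved as pipe-delimited text with Id, Operation and Name as the first three columns; the sample plan, its table and index names, and the helper name full_scan_tables are all made up for illustration and are not from any real system.

```python
# Hypothetical plan text; the object names are invented for this example.
SAMPLE_PLAN = '''\
| Id  | Operation                    | Name        |
|   0 | SELECT STATEMENT             |             |
|   1 |  NESTED LOOPS                |             |
|   2 |   TABLE ACCESS FULL          | LISTINGS    |
|   3 |   TABLE ACCESS BY INDEX ROWID| AGENTS      |
|   4 |    INDEX UNIQUE SCAN         | AGENTS_PK   |
'''


def full_scan_tables(plan_text):
    """Return the object name of every TABLE ACCESS FULL step, in plan order."""
    tables = []
    for line in plan_text.splitlines():
        # Drop the outer pipes, then split the row into trimmed cells.
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        # Assumed column order: Id, Operation, Name (further columns ignored).
        if len(cells) >= 3 and cells[1] == "TABLE ACCESS FULL":
            tables.append(cells[2])
    return tables


if __name__ == "__main__":
    print(full_scan_tables(SAMPLE_PLAN))
```

Any table this reports for a frequently run statement is a candidate for an index, though whether a full scan is actually the wrong choice depends on the table size and the query.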

HTH,

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Brian M. Diehl              Sr. Network Admin
A-1 Limousine Inc.               609-919-2019
 "Our greatest glory is not in never falling,
       but in rising every time we fall"
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-



Wim Coekaerts

2004-Jun-08 11:30 UTC

[Ocfs-users] Major RAC slowdown

What sort of syscalls?  I guess that means you see a lot of %sys rather
than %user ... hmm
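
One way to put numbers on the %sys versus %user question without relying on "top" is to diff two samples of the aggregate "cpu" line from /proc/stat.  The sketch below is only an illustration: it assumes the Linux field order user, nice, system, idle, iowait, ignores any later fields, and the two sample lines and the name cpu_shares are hypothetical.

```python
def cpu_shares(before, after):
    """Percent of CPU time per state between two aggregate 'cpu' lines.

    Only the first five counters are used (assumed order: user, nice,
    system, idle, iowait), so the percentages are relative to those five.
    """
    names = ["user", "nice", "system", "idle", "iowait"]
    a = [int(x) for x in before.split()[1:6]]
    b = [int(x) for x in after.split()[1:6]]
    deltas = [y - x for x, y in zip(a, b)]
    total = sum(deltas)
    if total <= 0:
        raise ValueError("samples are identical or out of order")
    return {n: 100.0 * d / total for n, d in zip(names, deltas)}


if __name__ == "__main__":
    # Hypothetical samples, standing in for two reads taken seconds apart.
    first = "cpu 1000 0 500 8000 500"
    second = "cpu 1100 0 1300 8050 550"
    for state, pct in cpu_shares(first, second).items():
        print(f"{state:7s} {pct:5.1f}%")
```

On a live Linux host the two strings would be the first line of /proc/stat read a few seconds apart; a system share that dwarfs iowait during a slowdown would match the pattern described in this thread.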


Derek Suzuki

2004-Jun-08 11:43 UTC

[Ocfs-users] Major RAC slowdown

Correct - normally %iowait dominates, but in this circumstance it's all
%sys.  Most of the Oracle processes spend a lot of time spinning through
gettimeofday() calls, but in between there are the usual reads and writes to
disk or the interconnect socket (depending on which process it is).
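
For anyone wanting to quantify which calls dominate, here is a rough Python sketch that tallies syscall names from saved strace output.  The trace lines below are invented for illustration (they are not from the affected servers), and the helper name tally_syscalls is hypothetical.

```python
import re
from collections import Counter

# A syscall line starts with the call name followed by "(".  An optional
# leading PID, in either the "[pid 1234] " or the "1234 " form, is skipped.
SYSCALL_RE = re.compile(r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?([a-z_][a-z0-9_]*)\(")

# Hypothetical trace excerpt, made up for this example.
SAMPLE_TRACE = '''\
gettimeofday({1086694964, 120001}, NULL) = 0
gettimeofday({1086694964, 120034}, NULL) = 0
read(17, "..."..., 8192) = 8192
gettimeofday({1086694964, 121500}, NULL) = 0
--- SIGALRM (Alarm clock) ---
'''


def tally_syscalls(trace_text):
    """Count how many times each syscall name appears in strace output."""
    counts = Counter()
    for line in trace_text.splitlines():
        match = SYSCALL_RE.match(line)
        if match:  # signal notices and other non-call lines are skipped
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    for name, count in tally_syscalls(SAMPLE_TRACE).most_common():
        print(f"{count:6d}  {name}")
```

strace can also print a per-call summary itself with its -c option; the script is only useful when all you have is a raw trace file.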





