thr3ads.net - Ocfs users - [Ocfs-users] OCFS Hang [Apr 2004]

Doering, Randy

2004-Apr-19 14:02 UTC

[Ocfs-users] OCFS Hang

Greetings,

Having read about the previous OCFS hangs, I think the one we are seeing
is different, but I'm not sure whether it is caused by OCFS or by the
Linux OS.

We are running OCFS Version 1.09 with Linux AS 3.0/9i RAC.

We have a 2-node Intel cluster (Node 1 and Node 2). This morning the DBA
tried to do an "ls" command on /u06/oradata/database and his process
hung. I tried to kill his "ls" process and it is unkillable. On Node 2,
the "ls" on /u06/oradata/database worked fine. All of the other file
systems (on both nodes) are fine.

Also, there is this process that we can't get rid of:

oracle   23593     1 95 10:00 ?        04:45:11 oracleXYZ2 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))

It has been accumulating CPU time since the hang. I'm unsure whether this
process is a victim or the cause of the hangs.

I hope I have provided enough information about the situation. If not,
let me know and I'll get more.

Regards,

Randy


Sunil Mushran

2004-Apr-19 14:37 UTC

[Ocfs-users] OCFS Hang

Alt-SysRq-T will be helpful. It dumps the state of every process to the
kernel log, which syslog normally writes to /var/log/messages.

For a quicker turnaround, contact Oracle Support.
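
Alt-SysRq-T needs a keyboard on the console. On a host reachable only over the network, the same task dump can usually be requested from a root shell. The sketch below assumes the running kernel exposes /proc/sysrq-trigger; the function name dump_tasks and the SYSRQ_TRIGGER variable are illustrative and are not part of OCFS.

```shell
#!/bin/sh
# Sketch: request a SysRq-T task dump without console access.
# Assumption: the running kernel provides /proc/sysrq-trigger.
SYSRQ_TRIGGER="${SYSRQ_TRIGGER:-/proc/sysrq-trigger}"

dump_tasks() {
    if [ ! -w "$SYSRQ_TRIGGER" ]; then
        echo "cannot write to $SYSRQ_TRIGGER (not root, or not supported?)" >&2
        return 1
    fi
    # 't' is the SysRq key that dumps the state of every task.
    printf 't\n' > "$SYSRQ_TRIGGER"
}

# Usage (as root):
#   dump_tasks && tail -n 200 /var/log/messages
```

In the resulting dump, the entries for the hung "ls" and for PID 23593 are the ones worth comparing, since their stack traces should show where each is waiting or spinning.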


Kurt Hackel

2004-Apr-19 14:54 UTC

[Ocfs-users] OCFS Hang

Hi Randy,

It looks like you have some process stuck that had previously done a
down() on a semaphore in the /u06/oradata/database directory.  Pretty
much every operation inside that directory from that node will hang once
the first hang occurs.

The best place to go is to Oracle Support at this point.  But in any
case, the information they will want is a 
"debugocfs -f /oradata/database/ /dev/raw/raw##" and a 
"debugocfs -d /oradata/database/ /dev/raw/raw##" and a 
"fsck.ocfs -v /dev/raw/raw##".

My guess is either that the fsck.ocfs output will show an ERROR that
says you have a system file locked by another node, or that you have
some process actively spinning in the ocfs code.  If it turns out to be
the latter, you would also want to get the output of /var/log/messages
after running this:
"echo -1 > /proc/sys/kernel/ocfs/debug_level"
"echo -1 > /proc/sys/kernel/ocfs/debug_context"
making sure to set both of these values back to 0 after a couple of
minutes.  Also, make sure to get "ps -ef" or "ps awux" output too,
in order to match up the process IDs.
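
Leaving both values at -1 will keep flooding /var/log/messages, so the reset to 0 is the step most worth automating. The following is a minimal sketch of the toggle described above. The two /proc paths come from this post; the function names and the OCFS_PROC_DIR variable are hypothetical helpers.

```shell
#!/bin/sh
# Sketch: turn OCFS tracing on for a bounded time, then set it back to 0.
OCFS_PROC_DIR="${OCFS_PROC_DIR:-/proc/sys/kernel/ocfs}"

set_ocfs_debug() {
    # Write the same value to both knobs named in the post.
    printf '%s\n' "$1" > "$OCFS_PROC_DIR/debug_level" &&
    printf '%s\n' "$1" > "$OCFS_PROC_DIR/debug_context"
}

trace_ocfs() {
    seconds=$1   # how long to leave tracing enabled
    # Reset to 0 even if the script is interrupted while tracing.
    trap 'set_ocfs_debug 0' INT TERM
    if ! set_ocfs_debug -1; then
        set_ocfs_debug 0 2>/dev/null
        trap - INT TERM
        return 1
    fi
    sleep "$seconds"
    set_ocfs_debug 0
    status=$?
    trap - INT TERM
    return "$status"
}

# Usage (as root): trace_ocfs 120
```

The trace itself lands in /var/log/messages, so copy that file somewhere safe once the function returns.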

The solution to any of the bugs I have mentioned will likely involve
taking down one node, depending upon which bug you have hit.  Since in
your case it unfortunately looks like the trouble partition contains
your datafiles, I would prepare to shut down the database on this node in
anticipation of a reboot.  The other RAC node can likely remain up and
running.  (If this were a partition containing only archives, for
instance, you could possibly keep the database up by just switching
the archive destination temporarily.)
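
A reboot clears the hung state, so the diagnostics listed earlier in this message are worth saving first. The sketch below runs the same debugocfs, fsck.ocfs and ps commands quoted above and keeps each output in its own file. The function name and output file names are illustrative, and the raw device must be supplied by the caller because the post leaves it as raw##.

```shell
#!/bin/sh
# Sketch: save the diagnostics named in this post before rebooting the node.
collect_ocfs_diag() {
    fs_path=$1   # directory as passed to debugocfs, e.g. /oradata/database/
    device=$2    # raw device backing the OCFS volume (not given in the post)
    out_dir=$3   # existing directory that receives the output files

    debugocfs -f "$fs_path" "$device" > "$out_dir/debugocfs-f.txt" 2>&1
    debugocfs -d "$fs_path" "$device" > "$out_dir/debugocfs-d.txt" 2>&1
    fsck.ocfs -v "$device"            > "$out_dir/fsck-ocfs.txt"   2>&1
    # Process list, used to match PIDs against the kernel log.
    ps -ef                            > "$out_dir/ps-ef.txt"       2>&1
}

# Usage (as root), with the real raw device substituted for <device>:
#   mkdir -p /tmp/ocfs-diag
#   collect_ocfs_diag /oradata/database/ <device> /tmp/ocfs-diag
```

Write the output to a local file system rather than to the hung OCFS volume, otherwise the collection itself may block.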

Thanks!
-kurt


Doering, Randy

2004-Apr-19 17:23 UTC

[Ocfs-users] OCFS Hang

Kurt, thanks for the info. We ended up stopping and restarting the DB. That
was successful, although trying to get to /u06/oradata/database still hung.
We then rebooted the node, and everything has been fine since. I'll look
more into this using your suggestions, and if/when it happens again I hope
to have more information for you all.

BTW, using ocfstool, I was able to "browse" over and see the contents
of that directory fine.

Thanks again,
Randy

PS: We had also logged a case with Oracle Support.

Ramesh_Rajagopalan@Dell.com

2004-Apr-20 00:49 UTC

[Ocfs-users] OCFS Hang

You might want to try 'strace -p 23593' and observe what the process is
doing, or which system call it is hung in.

Ramesh
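
An interactive strace on a stuck process can sit silent indefinitely, so it may be more practical to attach for a fixed time and keep the output in a file. This is a minimal sketch using only the -p and -o options of strace; the function name trace_pid and its arguments are illustrative.

```shell
#!/bin/sh
# Sketch: attach strace to a process for a bounded time and keep the output.
trace_pid() {
    pid=$1       # process to inspect, e.g. 23593 from the original post
    seconds=$2   # how long to stay attached
    out_file=$3  # file that receives the system call trace

    strace -p "$pid" -o "$out_file" &
    tracer=$!
    sleep "$seconds"
    # Ask strace to detach; it may already have exited on its own.
    kill "$tracer" 2>/dev/null || true
    wait "$tracer" 2>/dev/null || true
    return 0
}

# Usage (as root or the process owner):
#   trace_pid 23593 30 /tmp/strace-23593.txt
```

If the process is spinning in kernel code or sleeping uninterruptibly, the trace may stay empty or the attach may not complete. That outcome is itself a hint that the process is stuck inside a single call rather than looping in user space.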


Jeremy Schneider

2004-Apr-21 09:14 UTC

[Ocfs-users] OCFS Hang

Just a thought, but you might be having the same problem I was having. 
Symptoms sound *very* similar.  The patch has supposedly been merged
into the source tree but I don't think they've released a new version of
OCFS since the merge.  (Sunil or Wim - do you know if this bugfix was
included in 1.0.11-1?)

Check
http://oss.oracle.com/pipermail/ocfs-users/2004-March/000192.html

For the geek [technical] description, check
http://oss.oracle.com/pipermail/ocfs-users/2004-March/000185.html or
http://www.asugroup.com/ocfsbugfix.txt

Jeremy