Greetings,

Having read about the previous OCFS hangs, I think the one we are seeing is different, but I'm not sure whether it is caused by OCFS or by the Linux OS.

We are running OCFS Version 1.09 with Linux AS 3.0/9i RAC.

We have a 2-node Intel cluster (Node 1 and Node 2). This morning the DBA tried to do an "ls" command on /u06/oradata/database and his process hung. I tried to kill his "ls" process and it is unkillable. On Node 2, the "ls" on /u06/oradata/database worked fine. All of the other file systems (on both nodes) are fine.

Also, what we can't get rid of is this process:

oracle 23593 1 95 10:00 ? 04:45:11 oracleXYZ2 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))

and it's been accumulating CPU time since the hang. I'm unsure if this process is a victim or the cause of the hangs.

I hope that I have provided enough information about the situation. If not, let me know and I'll get more.

Regards,
Randy
Alt-SysRq-T will be helpful. It dumps the state of all processes to the kernel log, which normally ends up in /var/log/messages.

For a quicker turnaround, contact Oracle Support.
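On a machine without a console keyboard, the same task dump can usually be requested from a root shell. The following is a minimal sketch, assuming the kernel has magic SysRq support and exposes /proc/sysrq-trigger (neither is confirmed for the setup in this thread). The names `run`, `dump_tasks` and `DRY_RUN` are hypothetical and exist only for illustration.

```shell
#!/bin/sh
# Sketch: request the same task dump as Alt-SysRq-T from a shell (needs root).
# DRY_RUN, run and dump_tasks are illustrative names, not part of any OCFS tool.
DRY_RUN="${DRY_RUN:-1}"   # default: only print what would be done

run() {
    # Print the command in dry-run mode, otherwise execute it via sh -c.
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        sh -c "$*"
    fi
}

dump_tasks() {
    run "echo 1 > /proc/sys/kernel/sysrq"   # enable magic SysRq
    run "echo t > /proc/sysrq-trigger"      # 't' asks the kernel to log all task states
    run "tail -n 200 /var/log/messages"     # assumed syslog destination for kernel messages
}
```

Calling `dump_tasks` prints the three commands; running it with `DRY_RUN=0` as root executes them.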
Hi Randy,

It looks like you have some process stuck that had previously done a down() on a semaphore in the /u06/oradata/database directory. Pretty much every operation inside that directory from that node will hang once the first hang occurs.

The best place to go at this point is Oracle Support. In any case, the information they will want is the output of:

    debugocfs -f /oradata/database/ /dev/raw/raw##
    debugocfs -d /oradata/database/ /dev/raw/raw##
    fsck.ocfs -v /dev/raw/raw##

My guess is either that the fsck.ocfs output will show an ERROR saying that a system file is locked by another node, or that some process is actively spinning in the ocfs code. If it turns out to be the latter, you would also want to get the output of /var/log/messages after running this:

    echo -1 > /proc/sys/kernel/ocfs/debug_level
    echo -1 > /proc/sys/kernel/ocfs/debug_context

making sure to set both of these values back to 0 after a couple of minutes. Also, make sure to get "ps -ef" or "ps awux" output too, in order to match up the process ids.

The solution to any of the bugs I have mentioned will likely involve taking down one node; which node depends on which bug you have hit. Since in your case it unfortunately looks like the trouble partition contains your datafiles, I would prepare to shut down the database on this node in anticipation of a reboot. The other RAC node can likely remain up and running. (If this were a partition containing only archives, for instance, you could possibly keep the database up by just switching the archive destination temporarily.)

Thanks!
-kurt

On Mon, Apr 19, 2004 at 03:02:23PM -0400, Doering, Randy wrote:
> Greetings,
>
> Having read about the previous OSFS hangs, I think this one
> that we are seeing is different, but I'm not sure if this is caused by
> OCFS or the Linux OS.
> [...]
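The collection steps Kurt describes could be scripted roughly as follows. This is only a sketch under stated assumptions. The thread gives the device only as the placeholder /dev/raw/raw##, so the script refuses to run until `RAW_DEV` is set. The 120-second wait stands in for "a couple minutes". The output directory, file names, and the names `run`, `collect_ocfs_diag`, `RAW_DEV`, `OUT_DIR` and `DRY_RUN` are all hypothetical.

```shell
#!/bin/sh
# Sketch: gather the diagnostics listed in the message above into one directory.
# RAW_DEV, OUT_DIR, DRY_RUN, run and collect_ocfs_diag are illustrative names.
RAW_DEV="${RAW_DEV:-}"                 # raw device of the affected volume; must be set
OUT_DIR="${OUT_DIR:-/tmp/ocfs-diag}"   # assumed location for the collected output
DRY_RUN="${DRY_RUN:-1}"                # default: only print what would be done

run() {
    # Print the command in dry-run mode, otherwise execute it via sh -c.
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        sh -c "$*"
    fi
}

collect_ocfs_diag() {
    if [ -z "$RAW_DEV" ]; then
        echo "set RAW_DEV to the raw device of the affected volume" >&2
        return 1
    fi
    run "mkdir -p $OUT_DIR"
    run "debugocfs -f /oradata/database/ $RAW_DEV > $OUT_DIR/debugocfs-f.txt 2>&1"
    run "debugocfs -d /oradata/database/ $RAW_DEV > $OUT_DIR/debugocfs-d.txt 2>&1"
    run "fsck.ocfs -v $RAW_DEV > $OUT_DIR/fsck.txt 2>&1"
    run "ps -ef > $OUT_DIR/ps-ef.txt"
    # Verbose OCFS tracing; per the message, only useful if a process is spinning.
    run "echo -1 > /proc/sys/kernel/ocfs/debug_level"
    run "echo -1 > /proc/sys/kernel/ocfs/debug_context"
    run "sleep 120"                                      # assumed value for 'a couple minutes'
    run "echo 0 > /proc/sys/kernel/ocfs/debug_level"     # restore both values to 0
    run "echo 0 > /proc/sys/kernel/ocfs/debug_context"
    run "cp /var/log/messages $OUT_DIR/messages"
}
```

With the default `DRY_RUN=1` the function only prints the commands, which makes it easy to review them before running as root with `DRY_RUN=0`.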
Kurt,

Thanks for the info. We ended up stopping and restarting the DB. That was successful, although access to /u06/oradata/database still hung. We then rebooted the node, and everything has been fine since.

I'll look into this further using your suggestions, and if/when it happens again I hope to have more information for you all.

BTW, using ocfstool, I was able to "browse" over and see the contents of that directory fine.

Thanks again,
Randy

PS: We had also logged a case with Oracle Support.

-----Original Message-----
From: Kurt Hackel [mailto:Kurt.Hackel@oracle.com]
Sent: Mon 4/19/2004 3:54 PM
To: Doering, Randy
Cc: ocfs-users@oss.oracle.com
Subject: Re: [Ocfs-users] OCFS Hang

Hi Randy,

It looks like you have some process stuck that had previously done a down() on a semaphore in the /u06/oradata/database directory. [...]
You might want to try 'strace -p 23593' and observe what the process is doing, or which system call it is hung in.

Ramesh

-----Original Message-----
From: ocfs-users-bounces@oss.oracle.com [mailto:ocfs-users-bounces@oss.oracle.com] On Behalf Of Doering, Randy
Sent: Tuesday, April 20, 2004 12:32 AM
To: ocfs-users@oss.oracle.com
Subject: [Ocfs-users] OCFS Hang

Greetings,

Having read about the previous OSFS hangs, I think this one that we are seeing is different, but I'm not sure if this is caused by OCFS or the Linux OS. [...]
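Before attaching strace, it can help to look at the process state, since that hints at whether the process is blocked in the kernel or burning CPU. A small sketch, assuming a procps-style `ps` that accepts `-o` and `-p`; the function names `describe_state` and `inspect_pid` are hypothetical.

```shell
#!/bin/sh
# Sketch: interpret the first letter of the state reported by `ps -o stat=`.
# describe_state and inspect_pid are illustrative names.
describe_state() {
    case "$1" in
        D*) echo "uninterruptible sleep" ;;   # blocked in the kernel; signals are not delivered
        R*) echo "running or runnable" ;;     # on CPU or ready to run, possibly spinning
        S*) echo "interruptible sleep" ;;
        T*) echo "stopped" ;;
        Z*) echo "zombie" ;;
        *)  echo "other" ;;
    esac
}

inspect_pid() {
    pid="$1"
    # `stat=` prints the state column without a header line.
    state="$(ps -o stat= -p "$pid")" || { echo "no such process: $pid" >&2; return 1; }
    echo "pid $pid: $state ($(describe_state "$state"))"
    # wchan shows the kernel function the task is sleeping in, when it is known.
    ps -o pid,stat,wchan,args -p "$pid"
}
```

For the process in this thread the calls would be `inspect_pid 23593` followed by `strace -p 23593`. Note that strace reports system call entries and exits, so a process that stays inside one kernel call may produce no further output.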
Just a thought, but you might be having the same problem I was having. Symptoms sound *very* similar. The patch has supposedly been merged into the source tree, but I don't think they've released a new version of OCFS since the merge. (Sunil or Wim - do you know if this bugfix was included in 1.0.11-1?)

Check http://oss.oracle.com/pipermail/ocfs-users/2004-March/000192.html

For the geek [technical] description, check http://oss.oracle.com/pipermail/ocfs-users/2004-March/000185.html or http://www.asugroup.com/ocfsbugfix.txt

Jeremy

>>> "Doering, Randy" <Randy.Doering@ventersciencejtc.org> 04/19/2004 6:23:52 PM >>>
Kurt,

Thanks for the info. We ended up stopping/restarting the DB. That was successful, although trying to get to /u06/oradata/database was still hanging. We then rebooted the node, and after that everything is fine now. [...]