We had a situation over the weekend with our production database that we can't figure out, and we're hoping someone can shed some light.

Specifics:

Oracle 9.2.0.4
OS is Red Hat AS 2.1
ocfs-2.4.9-e-summit-1.0.12-1
ocfs-tools-1.0.10-1
ocfs-support-1.0.10-1
ocfs-2.4.9-e-enterprise-1.0.12-1

All database, redo, undo, and control files are on ocfs; archived logs are on ext3.

We shut down the database for SAN maintenance, but didn't shut down cluster manager. The SAN was disconnected from the server, a tray was added, and then the SAN was reconnected. The server and cluster manager remained up during the maintenance.

When we tried to restart the database, we got an ORA-01207, saying the control file was older than the datafiles. Per Oracle support, we recreated the control file and attempted to bring the db up with the new one. At this point we received the following:

Errors in file /opt/oracle/product/9.2.0/admin/ENTPRD/udump/entprd2_ora_22596.trc:
ORA-00600: internal error code, arguments: [kcoapl_blkchk], [5], [393], [6101], [], [], [], []

There's a RAC bug entry for [kcoapl_blkchk], but it was for a 4-node RAC and ours is only 2 nodes, so Oracle internals support said they didn't think it applied to our case.

We ended up doing a point-in-time recovery to before the SAN maintenance, and moved the datafiles to an ext3 partition for now.

Has anyone seen this before, or have any input as to what happened? We're trying to determine if this is a bug, and if we should move back to RAC/ocfs.

Thanks very much,
Matt Daniels
Apps DBA, Priority Healthcare Corp
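When searching Metalink or comparing trace files, it can help to pull the ORA-00600 arguments out of the error line. The following is a minimal sketch only; the function name `parse_ora600_args` is made up for illustration, and it assumes the line has the same layout as the one quoted in the post above.

```python
import re

# Captures everything after "arguments:" on an ORA-00600 line laid out like
# the one in the post. The layout is an assumption based on that one sample.
ORA600_RE = re.compile(r"ORA-00600: internal error code, arguments: (.*)")


def parse_ora600_args(line):
    """Return the non-empty bracketed arguments of an ORA-00600 line, or None."""
    match = ORA600_RE.search(line)
    if match is None:
        return None
    # Each argument is wrapped in [...]; the empty brackets are only padding.
    args = re.findall(r"\[([^\]]*)\]", match.group(1))
    return [a for a in args if a]


line = ("ORA-00600: internal error code, arguments: "
        "[kcoapl_blkchk], [5], [393], [6101], [], [], [], []")
print(parse_ora600_args(line))
```

The first argument (here kcoapl_blkchk) is usually the most useful search term; the numeric arguments are kept as strings because their meaning is not stated in the thread.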
wow, i'm surprised that cluster manager (oracm) stayed up. i don't know all the technical internals of exactly how it works, but i know that it uses a quorum on shared storage... i guess it might only use the quorum for split-brain situations (where the interconnect goes down), but personally i'd still never yank the shared disk quorum out from under it without shutting it down! but i also have to admit that your error message doesn't sound like it would be related to this.

did you shut down GSD or did you leave that running too?

the first error (control file older than datafiles) doesn't make much sense at all... i think that just means the SCN in the datafile headers was newer than the SCN recorded in the control file? did the DB shut down cleanly according to the alert log?

(FYI, we're running 9.2.0.5 on a 2-node RHEL3 cluster using ocfs -- it's a backend for 11.5.9 -- and we've been in production for almost 3 months without any problems so far... oh - and we have [separate] ocfs partitions for archive logs too)

jeremy, dba

>>> "Matt Daniels" <Matt.Daniels@priorityhealthcare.com> 11/24/2004 9:30:10 AM >>>
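One way to check Jeremy's reading of ORA-01207 is to compare the checkpoint SCN the control file holds for each datafile with the SCN in the datafile header itself. What follows is a rough sketch, not a procedure from Oracle support: the query uses the standard V$DATAFILE and V$DATAFILE_HEADER views and is meant to be run with the database mounted, while the helper `files_ahead_of_controlfile` and the sample rows are hypothetical and only illustrate the comparison.

```python
# Query to run from SQL*Plus with the database mounted. V$DATAFILE reports the
# checkpoint SCN recorded in the control file; V$DATAFILE_HEADER reports the
# one read from each datafile header.
SCN_QUERY = """
SELECT d.file#, d.name,
       d.checkpoint_change# AS controlfile_scn,
       h.checkpoint_change# AS header_scn
  FROM v$datafile d, v$datafile_header h
 WHERE d.file# = h.file#
"""


def files_ahead_of_controlfile(rows):
    """Return rows whose datafile-header SCN is newer than the control file's.

    rows: iterable of (file_no, name, controlfile_scn, header_scn) tuples,
    i.e. the columns selected by SCN_QUERY.
    """
    return [r for r in rows if r[3] > r[2]]


# Hypothetical sample rows, not taken from the thread.
sample = [
    (1, "/u02/oradata/system01.dbf", 1000, 1000),
    (2, "/u02/oradata/undo01.dbf", 1000, 1042),
]
for file_no, name, cf_scn, hdr_scn in files_ahead_of_controlfile(sample):
    print(f"file {file_no} ({name}): header SCN {hdr_scn} > control file SCN {cf_scn}")
```

Any file this reports would fit the "control file older than datafiles" symptom; an empty result would suggest looking elsewhere.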
Hi Matt...

Have you seen Note:76434.1 in Metalink?

Rgds/Jeram

_____
From: Matt Daniels [mailto:Matt.Daniels@priorityhealthcare.com]
Sent: Wednesday, November 24, 2004 9:30 PM
To: ocfs-users@oss.oracle.com
Subject: [Ocfs-users] ORA-01207 after SAN maintenance
Hi Jeram,

Thanks for the reply. Yes, we read the note, especially this part:

Bug# 3281882
See [NOTE:3281882.8] <http://metalink.oracle.com/metalink/plsql/ml2_documents.showDocument?p_id=3281882.8&p_database_id=NOT>
Block corruption / OERI[kcoapl_blkchk] in multinode RAC after multiple reconfigurations
Fixed: 9.2.0.5, 10.1.0.2

Oracle Internals support, however, determined that this bug didn't apply to our case since it occurred specifically with a 4-node RAC instance, and ours is only 2 nodes. We're still trying to determine the cause for this, as we lost our production instance and had to do point-in-time recovery.

Interestingly, one of our development instances experienced the exact same problem. Its datafiles were stored on the SAN in a different ocfs partition, but in the same storage group. Our other development instance using the same SAN storage group, but with datafiles on an ext3 partition, wasn't affected and came up fine.

Thanks again for the response!

Matt

-----Original Message-----
From: Jeram [mailto:jeram@JISEDU.OR.ID]
Sent: Thursday, November 25, 2004 8:11 PM
To: Matt Daniels; ocfs-users@oss.oracle.com
Subject: RE: [Ocfs-users] ORA-01207 after SAN maintenance
Hi Jeremy,

Thanks for the response. I believe GSD was still running when the SAN maintenance was done. The database shut down cleanly; all the issues arose when we tried to start it back up.

We've since found out that a development instance with datafiles on another ocfs partition on the SAN suffered the exact same problem as production, while a development instance with its datafiles on an ext3 partition on the SAN had no issues at all and came up cleanly.

This one still has us stumped. We're working with support to try to determine root cause... any other thoughts or suggestions are welcome...

-Matt

-----Original Message-----
From: ocfs-users-bounces@oss.oracle.com [mailto:ocfs-users-bounces@oss.oracle.com] On Behalf Of Jeremy Schneider
Sent: Wednesday, November 24, 2004 10:33 AM
To: ocfs-users@oss.oracle.com; Matt Daniels
Subject: Re: [Ocfs-users] ORA-01207 after SAN maintenance

_______________________________________________
Ocfs-users mailing list
Ocfs-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs-users