Pei Ku
2005-Feb-11 13:32 UTC
[Ocfs-users] OCFS file system used as archived redo destination is corrupted
We started using an OCFS file system about 4 months ago as the shared archived redo destination for our 4-node RAC instances (HP DL380, MSA1000, RH AS 2.1). Last night we started seeing some weird behavior, and my guess is that the inode directory in the file system is getting corrupted. I've always had a bad feeling about OCFS not being very robust at handling constant file creation and deletion (which is what happens when you use it for archived redo logs).

ocfs-2.4.9-e-smp-1.0.12-1 is what we are using in production.

For now, we set up an archived redo destination on a local ext3 FS on each node and made that destination the mandatory one; we changed the OCFS destination to an optional one. The reason we made the OCFS archived redo destination the primary destination a few months ago was that we are planning to migrate to RMAN-based backup (as opposed to the current hot backup scheme); it's easier (required?) to manage RAC archived redo logs with RMAN if the archived redos reside in a shared file system.

Below are some diagnostics:

$ ls -l rdo_1_21810.arc*
-rw-r-----    1 oracle   dba    397312 Feb 10 22:30 rdo_1_21810.arc
-rw-r-----    1 oracle   dba    397312 Feb 10 22:30 rdo_1_21810.arc

(They have the same inode, btw -- I had done an 'ls -li' earlier but the output had rolled off the screen.)

After a while, one of the DBA scripts gzipped the file(s). Now they look like this:

$ ls -liL /export/u10/oraarch/AUCP/rdo_1_21810.arc*
1457510912 -rw-r-----    1 oracle   dba    36 Feb 10 23:00 /export/u10/oraarch/AUCP/rdo_1_21810.arc.gz
1457510912 -rw-r-----    1 oracle   dba    36 Feb 10 23:00 /export/u10/oraarch/AUCP/rdo_1_21810.arc.gz

These two files also have the same inode, but the size is way too small.

Yeah, /export/u10 is pretty hosed...

Pei

-----Original Message-----
From: Pei Ku
Sent: Thu 2/10/2005 11:16 PM
To: IT
Cc: ADS
Subject: possible OCFS /export/u10/ corruption on dbprd*

Ulf,

AUCP had problems creating archive file "/export/u10/oraarch/AUCP/rdo_1_21810.arc". After a few tries, it appeared that it was able to -- except that there are *two* rdo_1_21810.arc files in the directory (by the time you look at it, it/they will probably have been gzipped). We also have a couple of zero-length gzipped redo log files in there (which is not normal).

At least the problem has not brought any of the AUCP instances down. Manoj and I turned on archiving to an ext3 file system on each node for now; archiving to /export/u10/ is still active but has been made optional.

My guess is /export/u10/ is corrupted in some way. I still say OCFS can't take constant file creation/removal.

We are one rev behind (1.0.12 vs 1.0.13 on ocfs.org). No guarantee that 1.0.13 contains the cure...
Pei

-----Original Message-----
From: Oracle [mailto:oracle@dbprd01.autc.com]
Sent: Thu 2/10/2005 10:26 PM
To: DBA; Page DBA; Unix Admin
Cc:
Subject: SL1:dbprd01.autc.com:050210_222600:oalert_mon> Alert Log Errors

SEVER_LVL=1 PROG=oalert_mon
**** oalert_mon.pl: DB=AUCP SID=AUCP1
[Thu Feb 10 22:25:21] ORA-19504: failed to create file "/export/u10/oraarch/AUCP/rdo_1_21810.arc"
[Thu Feb 10 22:25:21] ORA-19504: failed to create file "/export/u10/oraarch/AUCP/rdo_1_21810.arc"
[Thu Feb 10 22:25:21] ORA-27040: skgfrcre: create error, unable to create file
[Thu Feb 10 22:25:28] ORA-16038: log 12 sequence# 21810 cannot be archived
[Thu Feb 10 22:25:28] ORA-19504: failed to create file ""
[Thu Feb 10 22:25:28] ORA-00312: online log 12 thread 1: '/export/u01/oradata/AUCP/redo12m1.log'
[Thu Feb 10 22:25:28] ORA-00312: online log 12 thread 1: '/export/u01/oradata/AUCP/redo12m2.log'
[Thu Feb 10 22:25:28] ORA-16038: log 12 sequence# 21810 cannot be archived
[Thu Feb 10 22:25:28] ORA-19504: failed to create file ""
[Thu Feb 10 22:25:28] ORA-00312: online log 12 thread 1: '/export/u01/oradata/AUCP/redo12m1.log'
[Thu Feb 10 22:25:28] ORA-00312: online log 12 thread 1: '/export/u01/oradata/AUCP/redo12m2.log'
[Thu Feb 10 22:25:28] ORA-16038: log 12 sequence# 21810 cannot be archived
[Thu Feb 10 22:25:28] ORA-19504: failed to create file ""
[Thu Feb 10 22:25:28] ORA-00312: online log 12 thread 1: '/export/u01/oradata/AUCP/redo12m1.log'
[Thu Feb 10 22:25:28] ORA-00312: online log 12 thread 1: '/export/u01/oradata/AUCP/redo12m2.log'
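A quick way to check for the symptom reported above (the same name and inode listed twice in one directory) is to scan 'ls -li' output for repeated inode numbers. This is only a rough sketch against the archive destination mentioned in this thread; note that ordinary hard links would be flagged the same way.

#!/bin/bash
# Rough sketch: report inode numbers that appear more than once in a
# single directory listing (duplicate directory entries, or legitimate
# hard links).  The directory is the archive destination from this thread.
DIR=/export/u10/oraarch/AUCP

# First field of 'ls -liA' is the inode number; the regex skips the
# "total" header line.
dups=$(ls -liA "$DIR" | awk '$1 ~ /^[0-9]+$/ {print $1}' | sort -n | uniq -d)

for ino in $dups; do
    echo "inode $ino is listed more than once:"
    ls -liA "$DIR" | awk -v ino="$ino" '$1 == ino'
done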
Sunil Mushran
2005-Feb-11 13:52 UTC
[Ocfs-users] OCFS file system used as archived redo destination is corrupted
Looks like the dirnode index is screwed up. The file is showing up twice, but there is only one copy of the file. We had detected a race which could cause this; it was fixed.

Did you start on 1.0.12, or did you run an older version of the module with this device? You may want to look into upgrading to at least 1.0.13. We made some memory allocation changes which were sorely required.

As part of our tests, we simulate the archiver: we run a script on multiple nodes which constantly creates files.
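Sunil's description of the archiver simulation suggests a simple way to put the same kind of load on a shared OCFS mount. Below is a minimal sketch of such a create/delete loop, not the actual Oracle test harness; the directory, file size, and gzip step are assumptions modeled on this thread.

#!/bin/bash
# Minimal create/delete stress loop in the spirit of the archiver
# simulation described above.  Run it concurrently on every node
# against the same shared OCFS mount.
DIR=/export/u10/stress/$(hostname)
mkdir -p "$DIR"

i=0
while true; do
    f="$DIR/fake_arch_${i}.arc"
    # write ~400 KB, roughly the size of the archived redo shown earlier
    dd if=/dev/zero of="$f" bs=1024 count=400 2>/dev/null
    gzip -f "$f"          # mimic the DBA gzip job
    rm -f "${f}.gz"
    i=$(( (i + 1) % 1000 ))
done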
Pei Ku
2005-Feb-11 14:45 UTC
[Ocfs-users] OCFS file system used as archived redo destination is corrupted
This file system was created under 1.0.12.

Does an upgrade from 1.0.12 to 1.0.13 require reformatting the file systems? I don't care about the file system we are using for archived redos (it's pretty screwed up anyway -- it's gonna need a clean wipe). But should I do a full FS pre-upgrade dump and post-upgrade restore for the file system used for storing Oracle datafiles? Of course I'll do a full db backup in any case.

Are you saying the problem I described was a known problem in 1.0.12 and has been fixed in 1.0.13?

Before this problem, our production db had archive_lag_target set to 15 minutes (so that the standby db does not lag too far behind the production db). Since this is a four-node RAC, that means there are at least (60/15)*4 = 16 archived redos generated per hour (16*24 = 384 per day). The fact that this problem only appeared after several months tells me that the OCFS QA process needs to be more thorough and needs to run for a long time in order to catch bugs like this.

Another weird thing: when I do an 'ls /export/u10/oraarch/AUCP', it takes about 15 seconds and CPU usage is around 25+% for that duration. If I run the same command on multiple nodes, the elapsed time might be 30 seconds on each node. It concerns me that a simple command like 'ls' can be that resource intensive and slow. Maybe it's related to the FS corruption...

thanks

Pei
Pei Ku
2005-Feb-11 16:05 UTC
[Ocfs-users] OCFS file system used as archived redo destination is corrupted
Oooh, I forgot to look in /var/log/messages:

Feb 10 22:20:55 dbprd01 kernel: (24943) ERROR: status = -2, Linux/ocfsmain.c, 2195
Feb 10 22:25:21 dbprd01 kernel: (27737) ERROR: status = -2, Linux/ocfsmain.c, 2195

The timestamps correspond to the timestamps of the errors in alert.log. Do these errors look familiar?

Regarding 'colorls' -- we are not using it, but I neglected to mention that the monitoring scripts use 'ls -ltr' in order to get a sorted list of files. So I need more than just the list of filenames -- the stat() calls can't be avoided.

thanks again

Pei

> -----Original Message-----
> From: Sunil Mushran [mailto:Sunil.Mushran@oracle.com]
> Sent: Friday, February 11, 2005 1:47 PM
> To: Pei Ku
> Cc: ocfs-users@oss.oracle.com
> Subject: Re: [Ocfs-users] OCFS file system used as archived redo
> destination is corrupted
>
> No... the on-disk format is the same. Going from 1.0.12 to 1.0.13 does
> not require reformatting. However, cleaning up the messed-up volume is
> always a good idea.
>
> The problem you described could be caused by a race in the dlm (fixed
> in 1.0.12) or by a memory allocation failure (fixed in 1.0.13). If you
> see any status=-12 (ENOMEM) errors in /var/log/messages, do upgrade
> to 1.0.13.
>
> Do you have colorls enabled? ls does two things: getdents() to get the
> names of the files, followed by stat() on each file. As ocfs v1 does
> sync reads, each stat() involves disk access (strace will show what I
> am talking about). This becomes an issue when there are a lot of files.
>
> So, if you need to quickly get a list of files, disable colorls.
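One way to see the getdents()-versus-stat() cost Sunil describes is to time a names-only listing against a sorted one, and let strace count the syscalls. This is only a sketch against the archive directory from this thread; the '--color=never' flag matters only if colorls is actually enabled.

# Names only: essentially a single getdents() pass, no per-file stat().
time ls -1 -U --color=never /export/u10/oraarch/AUCP > /dev/null

# Sorted by mtime (what the monitoring scripts need): ls must stat()
# every entry, and with OCFS v1's synchronous reads each stat() goes
# to disk.
time ls -ltr /export/u10/oraarch/AUCP > /dev/null

# strace -c prints a syscall summary on stderr; compare the number of
# lstat/stat calls between the two forms of ls.
strace -c ls -ltr /export/u10/oraarch/AUCP > /dev/null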