Silviu Marin-Caea
2008-Sep-22  12:02 UTC
[Ocfs2-users] Lost write in archive logs: has it ever happened?
We have 2 nodes with OCFS2 1.2.3 (SLES9). The archive logs are generated on an OCFS2 volume (mounted with nointr,datavolume). Three times in one year, an archivelog has had a lost write. We detected this when applying the archivelogs on the standby database (with dataguard). We had to copy some datafiles from the production database to the standby and let it resume the recovery process.

Has a data loss of this kind (lost write) ever occurred on an OCFS2 volume, version 1.2.3 x86_64?

We previously had 32-bit servers with an OCFS2 version even older than 1.2.3, and those servers never had such a problem with archivelogs.

The storage is Dell/EMC Clariion CX3-40. The storage on the old servers was CX300.

We are worried that these lost writes could occur not only in archivelogs but in the datafiles as well...

I am not saying that OCFS2 is the cause; the problem might be with something else, but we must investigate everything.

Thank you
Luis Freitas
2008-Sep-22  15:54 UTC
[Ocfs2-users] Lost write in archive logs: has it ever happened?
Silviu,

When I have had this kind of issue, it was usually caused by a bad HBA or a power failure. I am assuming it is not the latter, as you would be aware of it.

It is a difficult situation: since the controller only malfunctions sporadically, it is hard to prove that it is the cause or to get it replaced under warranty. Meanwhile your database slowly gets corrupted, until someday it crashes and won't start up. If this is the cause, it is surely happening on the datafiles also.

To be safe, you should run "ANALYZE TABLE ... VALIDATE STRUCTURE CASCADE;" on all your database tables, and look for fractured or bad blocks in the datafiles using dbv or rman. A fractured block is one that has a different timestamp at its beginning and its end, meaning it was only partially written to disk.

You could also try swapping the HBA with one from another server to see if the problem disappears.

Regards,
Luis
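To make the idea of a fractured block concrete, here is a minimal Python sketch. It is a toy model only: the block layout (a 4-byte stamp at the head, a payload, and the same stamp repeated at the tail), the 512-byte size, and the function names are all assumptions made for illustration, not Oracle's actual on-disk format. It only shows why a write that stops part-way leaves the head and tail of a block disagreeing.

```python
import struct

BLOCK_SIZE = 512  # toy size chosen for the example, not a real database block size


def make_block(version: int, payload: bytes) -> bytes:
    """Build a toy block: 4-byte stamp, padded payload, then the same stamp again."""
    body_len = BLOCK_SIZE - 8
    body = payload[:body_len].ljust(body_len, b"\0")
    stamp = struct.pack(">I", version)
    return stamp + body + stamp


def is_fractured(block: bytes) -> bool:
    """In this toy model a block is fractured when head and tail stamps differ."""
    return block[:4] != block[-4:]


def torn_write(old: bytes, new: bytes, bytes_written: int) -> bytes:
    """Simulate a write that stopped early: new bytes first, old bytes after."""
    return new[:bytes_written] + old[bytes_written:]


old = make_block(1, b"old row data")
new = make_block(2, b"new row data")

print(is_fractured(new))                        # complete write: stamps agree
print(is_fractured(torn_write(old, new, 256)))  # half-written block: stamps differ
```

For real datafiles, the check should be left to dbv or rman as suggested above; this sketch is only meant to explain what those tools are looking for.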
Luis Freitas
2008-Sep-22  17:10 UTC
[Ocfs2-users] Lost write in archive logs: has it ever happened?
Silviu,
    Just be warned that the "ANALYZE TABLE..." command locks the
tables during its execution.
Regards,
Luis
Sunil Mushran
2008-Sep-22  18:22 UTC
[Ocfs2-users] Lost write in archive logs: has it ever happened?
No, I've never heard of an end user running into it, nor of any bug we've fixed that addresses this. Are you multiplexing the archivelogs?

A lost write could be due to any layer from userspace to the disk array. By multiplexing archivelogs and mirroring redologs, you will reduce the chances of getting bitten by it. Reduce, but not eliminate.

A lost write in a db file will get detected. Not immediately, but when the db reads that block. And that's why we take backups.

BTW, 1.2.3 is 2+ years old. You should look into upgrading to SLES9 SP3, which ships 1.2.5, or better yet SP4, which ships 1.2.9.

Sunil
Silviu Marin-Caea
2008-Dec-03  15:17 UTC
[Ocfs2-users] Lost write in archive logs: has it ever happened?
OCFS2 is not the cause. The error just occurred again, and this time we had the archivelogs multiplexed on both OCFS2 and local storage (reiserfs). Both copies of the archives have identical MD5 sums. There was no lost write, just some bullshit that Oracle support tries to feed us.

There is still an unknown, but it's not related to OCFS2. The unknown is that the archives on the standby database have different MD5 sums than the ones on production. All the archives, not just the corrupt one.

Does dataguard intervene in some way in the archives during transmission? I thought it was just supposed to transfer them unchanged, then apply them on the standby.
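For anyone wanting to repeat the checksum comparison described above, the following is a minimal Python sketch of how two archive destinations could be compared file by file. The function names and the directory paths in the usage comment are hypothetical; it assumes both destinations hold archives under the same file names and that both directories are readable from the machine running the script.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_archive_dirs(dir_a, dir_b):
    """Compare files that exist under the same name in both directories.

    Returns a dict mapping file name -> (md5_in_a, md5_in_b) for every
    file whose checksums differ. Files present on only one side are ignored.
    """
    files_a = {p.name: p for p in Path(dir_a).iterdir() if p.is_file()}
    files_b = {p.name: p for p in Path(dir_b).iterdir() if p.is_file()}
    mismatches = {}
    for name in sorted(files_a.keys() & files_b.keys()):
        sum_a = md5_of_file(files_a[name])
        sum_b = md5_of_file(files_b[name])
        if sum_a != sum_b:
            mismatches[name] = (sum_a, sum_b)
    return mismatches


# Example call (hypothetical paths, adjust to the real archive destinations):
# for name, sums in compare_archive_dirs("/ocfs2/arch", "/local/arch").items():
#     print(name, sums)
```

An empty result means every archive present in both locations is byte-identical, which is the outcome reported above for the OCFS2 and reiserfs copies.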