thr3ads.net - Ocfs2 users - [Ocfs2-users] Lost write in archive logs: has it ever happened? [Sep 2008]

Silviu Marin-Caea

2008-Sep-22 12:02 UTC

[Ocfs2-users] Lost write in archive logs: has it ever happened?

We have 2 nodes with OCFS2 1.2.3 (SLES9).  The archive logs are generated on 
an OCFS2 volume (mounted with nointr,datavolume).  Three times in one year, 
an archivelog has had a lost write.  We detected this when applying the 
archivelogs on the standby database (with Data Guard).  Each time we had to 
copy some datafiles from the production database to the standby and let it 
resume the recovery process.

Has a data loss of this kind (lost write) ever occurred on an OCFS2 volume, 
version 1.2.3 x86_64?

We had 32 bit servers before with OCFS2 that was even older than 1.2.3 and 
those servers never had such a problem with archivelogs.

The storage is Dell/EMC Clariion CX3-40.  The storage on the old servers was 
CX300.

We are worried that these lost writes could occur not only in archivelogs but 
in the datafiles as well...

Not saying that OCFS2 is the cause, the problem might be with something else, 
but we must investigate everything.

Thank you

Luis Freitas

2008-Sep-22 15:54 UTC

Silviu,

   When I had this kind of issue, it was usually caused by a bad HBA or a power
failure. I am assuming it is not the latter, as you would be aware of it.

   It is a difficult situation: since the controller only malfunctions
sporadically, it is hard to prove that it is the cause or to get it replaced
under warranty. Meanwhile your database slowly gets corrupted, until someday it
crashes and won't start up. If this is the cause, it is surely happening on the
datafiles as well.

   To be safe you should run an "ANALYZE TABLE ... VALIDATE STRUCTURE
CASCADE;" on all your database tables, and look for fractured or bad blocks
in the datafiles using dbv or rman. A fractured block is one that has a
different timestamp at the beginning and at the end, meaning it was only
partially written to disk.
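
To illustrate the fractured-block idea, here is a minimal Python sketch. It is a
toy model only: the 8192-byte block size, the 4-byte stamp and the helper names
(make_block, is_fractured, torn_write) are assumptions made for illustration and
do not reflect Oracle's actual on-disk block layout. For real datafiles, use dbv
or rman as described above.

```python
import struct

BLOCK_SIZE = 8192             # assumed block size for this toy model
STAMP = struct.Struct("<I")   # hypothetical 4-byte version stamp


def make_block(version: int, payload: bytes = b"") -> bytes:
    """Build a block carrying the same version stamp at its head and tail."""
    body_len = BLOCK_SIZE - 2 * STAMP.size
    body = payload.ljust(body_len, b"\0")[:body_len]
    stamp = STAMP.pack(version)
    return stamp + body + stamp


def is_fractured(block: bytes) -> bool:
    """Return True when the head and tail stamps disagree (a torn write)."""
    head = STAMP.unpack_from(block, 0)[0]
    tail = STAMP.unpack_from(block, len(block) - STAMP.size)[0]
    return head != tail


def torn_write(old: bytes, new: bytes, bytes_written: int) -> bytes:
    """Simulate a write of `new` over `old` that stopped part-way through."""
    return new[:bytes_written] + old[bytes_written:]


old = make_block(41, b"old row data")
new = make_block(42, b"new row data")
torn = torn_write(old, new, 4096)   # only the first half reached the disk

print(is_fractured(new), is_fractured(torn))
```

A completely written block has matching stamps; the half-written one carries the
new stamp at the head and the old stamp at the tail, which is what the check
flags. Note that a lost write (the whole block never reaching disk) leaves a
self-consistent old block, so a check of this kind would not catch it.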

   You could also try swapping the HBA with one from another server to see if the
problem disappears.

Regards,
Luis

Luis Freitas

2008-Sep-22 17:10 UTC

Silviu,

    Just so you are warned, the "ANALYZE TABLE..." command locks the
tables during its execution.

Regards,
Luis


Sunil Mushran

2008-Sep-22 18:22 UTC

No, I've never heard of an end user running into it, nor of any bug we've
fixed that addresses this.

Are you multiplexing the archivelogs?

A lost write could be due to any layer from userspace to the disk array.
By multiplexing archivelogs and mirroring redologs, you will reduce
the chances of getting bitten by it. Reduce, but not eliminate.

A lost write in a db file will get detected. Not immediately, but when
the db reads that block. And that's why we take backups.

BTW, 1.2.3 is 2+ years old.  You should look into upgrading to SLES9
SP3, which ships 1.2.5, or better yet SP4, which ships 1.2.9.

Sunil

Silviu Marin-Caea

2008-Dec-03 15:17 UTC

On Monday 22 September 2008 15:02:36 Silviu Marin-Caea wrote:
> Not saying that OCFS2 is the cause, the problem might be with something
> else, but we must investigate everything.
OCFS2 is not the cause.  The error just occurred again and this time we had 
the archivelogs multiplexed on both OCFS2 and local storage (reiserfs).  Both 
archives have identical MD5 sums.
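
A comparison of this kind can be scripted. The following is a minimal Python
sketch, not the procedure actually used here: the function names (md5sum,
compare_archive_dirs) and the assumption that both copies use the same file
names are illustrative only.

```python
import hashlib
from pathlib import Path


def md5sum(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_archive_dirs(dir_a: Path, dir_b: Path) -> dict[str, str]:
    """Compare same-named files in two directories and report a status per name."""
    names_a = {p.name for p in dir_a.iterdir() if p.is_file()}
    names_b = {p.name for p in dir_b.iterdir() if p.is_file()}
    report = {}
    for name in sorted(names_a | names_b):
        if name not in names_b:
            report[name] = "missing in B"
        elif name not in names_a:
            report[name] = "missing in A"
        elif md5sum(dir_a / name) == md5sum(dir_b / name):
            report[name] = "identical"
        else:
            report[name] = "different"
    return report
```

Calling compare_archive_dirs with the OCFS2 archive destination and the local
reiserfs destination (paths depend on the installation) returns a dictionary
mapping each archivelog name to "identical", "different", or a missing-file
status.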

There was no lost write, just some bullshit that Oracle support tries to feed 
us.

There is still an unknown, but it's not related to OCFS2.  The unknown is that
the archives on the standby database have different MD5 sums than the ones on 
production.  All the archives, not just the corrupt one.  Does Data Guard 
intervene in some way in the archives during transmission?  I thought it was 
just supposed to transfer them unchanged, then apply them on the standby.
