thr3ads.net - zfs discuss - [zfs-discuss] zfs corruptions in pool [Jun 2010]

devsk

2010-Jun-06 06:06 UTC

[zfs-discuss] zfs corruptions in pool

I had an unclean shutdown because of a hang and suddenly my pool is degraded (I
realized something is wrong when python dumped core a couple of times).

This is before I ran scrub:

  pool: mypool
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scan: scrub repaired 0 in 0h7m with 0 errors on Mon May 31 09:00:27 2010
config:

        NAME        STATE     READ WRITE CKSUM
        mypool      DEGRADED     0     0     0
          c6t0d0s0  DEGRADED     0     0     0  too many errors

errors: Permanent errors have been detected in the following files:

        mypool/ROOT/May25-2010-Image-Update:<0x3041e>
        mypool/ROOT/May25-2010-Image-Update:<0x31524>
        mypool/ROOT/May25-2010-Image-Update:<0x26d24>
        mypool/ROOT/May25-2010-Image-Update:<0x37234>
        //var/pkg/download/d6/d6be0ef348e3c81f18eca38085721f6d6503af7a
        mypool/ROOT/May25-2010-Image-Update:<0x25db3>
        //var/pkg/download/cb/cbb0ff02bcdc6649da3763900363de7cff78ec72
        mypool/ROOT/May25-2010-Image-Update:<0x26cf6>


I ran scrub and this is what it has to say afterwards.

  pool: mypool
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scan: scrub repaired 0 in 0h11m with 0 errors on Sat Jun  5 22:43:54 2010
config:

        NAME        STATE     READ WRITE CKSUM
        mypool      DEGRADED     0     0     0
          c6t0d0s0  DEGRADED     0     0     0  too many errors

errors: No known data errors

A few questions:

1. Have the errors really gone away? Can I just clear and be content that errors
are really gone?

2. Why did the errors occur anyway if ZFS guarantees on-disk consistency? I
wasn't writing anything. Those files were definitely not being touched
when the hang and unclean shutdown happened.

I mean, I don't mind if I create or modify a file and it
doesn't land on disk because an unclean shutdown happened, but a bunch
of unrelated files getting corrupted is sort of painful to digest.

3. The action says "Determine if the device needs to be replaced". How
the heck do I do that?
-- 
This message posted from opensolaris.org

Roy Sigurd Karlsbakk

2010-Jun-06 08:13 UTC

> A few questions:
>
> 1. Have the errors really gone away? Can I just clear and be content
> that errors are really gone?

Looks like they're fixed now, yes.

> 2. Why did the errors occur anyway if ZFS guarantees on-disk
> consistency? I wasn't writing anything. Those files were definitely
> not being touched when the hang and unclean shutdown happened.
>
> I mean, I don't mind if I create or modify a file and it doesn't land
> on disk because an unclean shutdown happened, but a bunch of unrelated
> files getting corrupted is sort of painful to digest.

ZFS guarantees consistency in a redundant setup, but it looks like your pool
only consists of one drive, meaning zero redundancy.

> 3. The action says "Determine if the device needs to be replaced". How
> the heck do I do that?

Attach another drive as a mirror, then detach the bad drive.
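
Roy's three steps map onto zpool subcommands. The minimal sketch below only builds and prints the command lines as a dry run; it assumes the replacement disk shows up as c7t0d0s0 (a hypothetical name) and leaves it to the reader to confirm in the zpool status output that the resilver has finished before running the detach. Since this pool holds boot environments (mypool/ROOT), the new disk would likely also need boot blocks installed before the old one is removed; that step is not shown.

```python
"""Dry-run sketch of the attach / resilver / detach sequence (nothing is executed)."""

POOL = "mypool"
OLD_DISK = "c6t0d0s0"   # the disk reported as DEGRADED in the thread
NEW_DISK = "c7t0d0s0"   # hypothetical name for the added disk


def mirror_swap_plan(pool: str, old: str, new: str) -> list:
    """Return the zpool invocations, in order, as argv lists."""
    return [
        # Attach the new disk to the existing one, turning it into a two-way mirror.
        ["zpool", "attach", pool, old, new],
        # Check progress; the detach below should only be run once the
        # resilver has completed without errors.
        ["zpool", "status", pool],
        # Remove the suspect disk, leaving the pool on the new disk alone.
        ["zpool", "detach", pool, old],
    ]


if __name__ == "__main__":
    # Print each command line for review instead of running it.
    for argv in mirror_swap_plan(POOL, OLD_DISK, NEW_DISK):
        print(" ".join(argv))
```

The zpool man page describes zpool replace, the command the status message points to, as equivalent to attaching the new device, waiting for it to resilver, and then detaching the old one, so zpool replace mypool c6t0d0s0 c7t0d0s0 would be the one-command form of the same plan.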

Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
roy at karlsbakk.net
http://blogg.karlsbakk.net/
--
In all pedagogy it is essential that the curriculum be presented intelligibly. It
is an elementary imperative for all educators to avoid excessive use of idioms
of foreign origin. In most cases, adequate and relevant synonyms exist in
Norwegian.

Thomas Maier-Komor

2010-Jun-06 11:11 UTC

On 06.06.2010 08:06, devsk wrote:
> I had an unclean shutdown because of a hang and suddenly my pool is
> degraded (I realized something is wrong when python dumped core a couple
> of times).
> 
> A few questions:
>
> 1. Have the errors really gone away? Can I just clear and be content that
> errors are really gone?
>
> 2. Why did the errors occur anyway if ZFS guarantees on-disk consistency? I
> wasn't writing anything. Those files were definitely not being touched
> when the hang and unclean shutdown happened.
>
> I mean, I don't mind if I create or modify a file and it
> doesn't land on disk because an unclean shutdown happened, but a bunch
> of unrelated files getting corrupted is sort of painful to digest.
>
> 3. The action says "Determine if the device needs to be replaced". How
> the heck do I do that?

Is it possible that this system runs on VirtualBox? At least, I've
seen such a thing happen on VirtualBox but never on a real machine.

The reason why the errors have gone away might be that metadata has
three copies IIRC. So if your disk only had corruption in the metadata,
these errors could be repaired by scrubbing the pool.
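
The repair Thomas describes depends on ZFS storing a checksum of each block in the block pointer that references it, and on metadata being written in more than one copy (ditto blocks) even on a single-disk pool. The toy model below, which is not ZFS code, shows what a scrub can and cannot do with such a block; SHA-256 stands in for whatever checksum the pool actually uses, and scrub_block is a made-up name.

```python
from __future__ import annotations

import hashlib


def checksum(data: bytes) -> str:
    # Stand-in for the checksum recorded in the parent block pointer.
    return hashlib.sha256(data).hexdigest()


def scrub_block(copies: list[bytes], expected: str) -> tuple[list[bytes], int]:
    """Verify every copy of one block and rewrite the bad ones from a good one.

    Returns the repaired copies and how many copies had to be rewritten.
    Raises IOError when no copy matches the expected checksum.
    """
    good = next((c for c in copies if checksum(c) == expected), None)
    if good is None:
        raise IOError("permanent error: no copy matches its checksum")
    bad = sum(1 for c in copies if checksum(c) != expected)
    return [good] * len(copies), bad


if __name__ == "__main__":
    block = b"directory entries for /var/pkg"
    expected = checksum(block)

    # Metadata kept in three copies, one of them damaged: the scrub repairs it.
    repaired, fixed = scrub_block([b"garbage", block, block], expected)
    print(fixed, all(c == block for c in repaired))   # 1 True

    # A block with a single damaged copy: nothing good to repair from.
    try:
        scrub_block([b"garbage"], expected)
    except IOError as err:
        print(err)
```

The second case shows why a single-copy data block on a non-redundant pool can only be reported as a permanent error, not repaired.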

smartmontools might help you figure out whether the disk is broken. But
if you only had an unexpected shutdown and now everything is clean after
a scrub, I wouldn't expect the disk to be broken. You can get
smartmontools from opencsw.org.

If your system is really running on VirtualBox, I'd recommend that you
turn off disk write caching in VirtualBox. Search the OpenSolaris forum
on the VirtualBox site; there is an article somewhere on how to do this.
IIRC the subject is something like 'zfs pool corruption'. It is also
covered somewhere in the docs.

HTH,
Thomas

Bob Friesenhahn

2010-Jun-06 16:22 UTC

On Sun, 6 Jun 2010, Roy Sigurd Karlsbakk wrote:
>>
>> I mean, I don't mind if I create or modify a file and it doesn't land
>> on disk because an unclean shutdown happened, but a bunch of unrelated
>> files getting corrupted is sort of painful to digest.
>
> ZFS guarantees consistency in a redundant setup, but it looks like
> your pool only consists of one drive, meaning zero redundancy.

This is not a true statement.  Redundancy is not required for
consistency.  Consistency is assured by zfs writing transaction groups
in order and committing the data to disk prior to transitioning to the
next transaction group.  If the disk fails to sync its cache and
writes data out of order (data from multiple transaction groups), then
zfs loses consistency.
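
Bob's ordering argument can be checked with a toy model (hypothetical code, loosely imitating ZFS's copy-on-write commit rather than reproducing it): each transaction group writes its new block, flushes, then writes a root pointer called uberblock here and flushes again. The model tries every crash point and every subset of not-yet-durable writes that might reach the platter. The disk that ignores flushes is an assumed worst case in which any cached write may or may not survive, in any order.

```python
import itertools


class Disk:
    """Toy disk with a volatile write cache (a hypothetical model, not ZFS code)."""

    def __init__(self, honors_flush: bool):
        self.honors_flush = honors_flush
        self.platter = {}   # durable contents: block name -> value
        self.cache = []     # writes accepted but not yet durable

    def write(self, addr, value):
        self.cache.append((addr, value))

    def flush(self):
        if self.honors_flush:
            for addr, value in self.cache:
                self.platter[addr] = value
            self.cache.clear()
        # A disk that ignores flushes returns at once; nothing becomes durable.

    def crash(self, survivors):
        """Power loss: only cached writes whose positions are in survivors land."""
        for i, (addr, value) in enumerate(self.cache):
            if i in survivors:
                self.platter[addr] = value
        self.cache.clear()


def commit_ops(txgs):
    """Operations that commit each transaction group in order."""
    ops = []
    for txg in txgs:
        ops.append(("write", f"data{txg}", f"payload of txg {txg}"))  # new block, never an overwrite
        ops.append(("flush",))                                        # data must be durable first
        ops.append(("write", "uberblock", txg))                       # then repoint the root
        ops.append(("flush",))
    return ops


def replay(honors_flush, ops):
    disk = Disk(honors_flush)
    for op in ops:
        if op[0] == "write":
            disk.write(op[1], op[2])
        else:
            disk.flush()
    return disk


def consistent(disk):
    """The on-disk root may only point at a transaction group whose data landed."""
    txg = disk.platter.get("uberblock")
    return txg is None or f"data{txg}" in disk.platter


def survives_every_crash(honors_flush, txgs=(1, 2)):
    ops = commit_ops(txgs)
    for k in range(len(ops) + 1):                      # crash after k operations
        pending = len(replay(honors_flush, ops[:k]).cache)
        for r in range(pending + 1):
            for survivors in itertools.combinations(range(pending), r):
                disk = replay(honors_flush, ops[:k])
                disk.crash(set(survivors))
                if not consistent(disk):
                    return False
    return True


if __name__ == "__main__":
    print(survives_every_crash(honors_flush=True))    # True
    print(survives_every_crash(honors_flush=False))   # False
```

The flush-honoring disk passes because at most one write is ever pending when a crash can hit, and the root pointer is only written after its data is durable. The flush-ignoring disk fails as soon as the uberblock write can land without the data it points to, which fits Toby's remark later in the thread that caching is harmless as long as flushes are honored.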

Bob
-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/

devsk

2010-Jun-06 20:05 UTC

I think both Bob and Thomas have it right. I am using VirtualBox and just
checked: the host I/O is cached on the SATA controller, although I thought I had
it enabled (this is VB-3.2.0).

Let me run in this mode for a while and see if this happens again.

Toby Thain

2010-Jun-09 01:05 UTC

On 6-Jun-10, at 7:11 AM, Thomas Maier-Komor wrote:
> On 06.06.2010 08:06, devsk wrote:
>> I had an unclean shutdown because of a hang and suddenly my pool is  
>> degraded (I realized something is wrong when python dumped core a  
>> couple of times).
>
> Is it possible that this system runs on VirtualBox? At least, I've
> seen such a thing happen on VirtualBox but never on a real machine.

As I postulated in the relevant forum thread there:
http://forums.virtualbox.org/viewtopic.php?t=13661
(can't check URL, the site seems down for me atm)
>
> The reason why the errors have gone away might be that metadata has
> three copies IIRC. So if your disk only had corruption in the metadata,
> these errors could be repaired by scrubbing the pool.
>
> smartmontools might help you figure out whether the disk is broken. But
> if you only had an unexpected shutdown and now everything is clean after
> a scrub, I wouldn't expect the disk to be broken. You can get
> smartmontools from opencsw.org.
>
> If your system is really running on VirtualBox, I'd recommend that you
> turn off disk write caching in VirtualBox.

Specifically, stop it from ignoring cache flushes. Caching is irrelevant
if flushes are being correctly handled.

ZFS isn't the only software system that will suffer inconsistencies or
corruption in the guest if flushes are ignored, of course.
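
What Toby describes is controlled per virtual disk in VirtualBox by the IgnoreFlush extradata key, documented in the VirtualBox manual's section on responding to guest IDE/SATA flush requests, where such requests are described as ignored by default for performance. The sketch below only builds the VBoxManage command line; the VM name is hypothetical, the controller and port are assumptions to adapt (ahci for SATA, piix3ide for IDE, per the manual), and the key is worth double-checking against the manual for the VirtualBox version in use.

```python
import shlex

VM_NAME = "OpenSolaris"   # hypothetical VM name; see the output of: VBoxManage list vms


def ignore_flush_off(vm: str, controller: str = "ahci", port: int = 0) -> list:
    """argv asking VirtualBox to honor guest flush requests on one virtual disk.

    controller is "ahci" for a SATA disk or "piix3ide" for an IDE disk (assumed
    from the VirtualBox manual); port selects the disk on that controller.
    """
    if controller not in ("ahci", "piix3ide"):
        raise ValueError("expected 'ahci' (SATA) or 'piix3ide' (IDE)")
    key = f"VBoxInternal/Devices/{controller}/0/LUN#{port}/Config/IgnoreFlush"
    return ["VBoxManage", "setextradata", vm, key, "0"]   # 0 = do not ignore flushes


if __name__ == "__main__":
    # Print a shell-quoted command line for review instead of running it.
    print(shlex.join(ignore_flush_off(VM_NAME)))
```

Setting the value to 0 asks VirtualBox to act on the guest's flush requests instead of dropping them; the command is printed rather than executed so it can be checked first.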

--Toby

> Search the OpenSolaris forum on the VirtualBox site; there is an article
> somewhere on how to do this. IIRC the subject is something like 'zfs
> pool corruption'. It is also covered somewhere in the docs.
>
> HTH,
> Thomas
> _______________________________________________
> zfs-discuss mailing list
> zfs-discuss at opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
