Hello all,
After playing around a bit with the disks (powering down, pulling one
disk out, powering down, putting the disk back in and pulling out
another one, repeat), zpool status reports permanent data corruption:
# uname -a
SunOS bhelliom 5.11 snv_55b i86pc i386 i86pc
# zpool status -v
pool: famine
state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: http://www.sun.com/msg/ZFS-8000-8A
scrub: none requested
config:
        NAME        STATE     READ WRITE CKSUM
        famine      ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            c2d0    ONLINE       0     0     0
            c2d1    ONLINE       0     0     0
            c3d0    ONLINE       0     0     0
            c4d0    ONLINE       0     0     0
            c4d1    ONLINE       0     0     0
            c5d0    ONLINE       0     0     0
errors: The following persistent errors have been detected:

          DATASET  OBJECT  RANGE
          6d       0       lvl=4 blkid=0
          73       0       lvl=0 blkid=0
          10b1     0       lvl=6 blkid=0
The corruption is somewhat understandable. It's my home fileserver and
I do the most horrible things to it now and then just to find out what
happens. The point of this exercise was to go through the disks, label
them, and locate c2d1, since it had been experiencing lockups that
required a cold reset to get the disk online again, and I was too lazy
to do it without fully starting the OS and thus mounting the raidz
each time. During one of the restarts, both the disk I had pulled out
and c2d1 went missing while the filesystem was starting.
According to the zdb dump, object 0 seems to be the DMU node on each
file system. My understanding of this part of ZFS is very shallow, but
why does it allow the filesystems to be mounted rw with damaged DMU
nodes? Doesn't that result in a risk of more permanent damage to the
structure of those filesystems? Or are there redundant DMU nodes it's
now using, and in that case, why doesn't it automatically fix the
damaged ones?
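(For anyone who wants to poke at this themselves: the zdb dump referred
to above was along these lines; "somefs" is just a placeholder for one
of the filesystems in the pool, not its real name.)

# zdb -d famine
  (lists the datasets in the pool with their objset IDs; the hex
   DATASET numbers in the error report above appear to be these IDs)
# zdb -dddd famine/somefs 0
  (dumps object 0 of that filesystem in detail)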
I'm currently doing a complete scrub, but according to zpool status'
latest estimate it will be 63h before I know how that went...
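(For reference, the scrub itself is nothing exotic; it was kicked off
and is being watched with the standard commands, along the lines of:)

# zpool scrub famine
# zpool status -v famine
  (the scrub line in the status output shows how far along it is and
   the estimated time remaining)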
--
Peter Bortas
Peter Bortas wrote:
> According to the zdb dump, object 0 seems to be the DMU node on each
> file system. My understanding of this part of ZFS is very shallow, but
> why does it allow the filesystems to be mounted rw with damaged DMU
> nodes? Doesn't that result in a risk of more permanent damage to the
> structure of those filesystems? Or are there redundant DMU nodes it's
> now using, and in that case, why doesn't it automatically fix the
> damaged ones?

Object 0 is basically the object that describes the other objects. So the
end result will be that some range of (up to 32) objects in each of those
filesystems will be inaccessible. There is no risk of additional damage by
running in read/write mode, because ZFS is always able to detect what data
is good and what is bad by using checksums.

That said, blkid 0 of object 0 always happens to contain some critical
objects (the ZPL "master node" and root directory). So if you are able to
mount these filesystems at all, then it probably means that ZFS was able to
find another redundant copy, or the failure was actually transient. (Eg,
because one disk was temporarily offline, and some pieces of another disk
are damaged, so raidz1 couldn't reconstruct.)

FYI, in a later build, 'zpool status -v' actually tells you the names of the
damaged filesystem & files, so you don't have to muck around with zdb.

--matt
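(As a back-of-the-envelope check of the "up to 32" figure, assuming the
usual on-disk sizes: object 0 is stored in 16 KB blocks and each dnode
entry is 512 bytes, so a single damaged level-0 block covers
16384 / 512 = 32 dnodes, i.e. up to 32 objects.)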
On 6/30/07, Matthew Ahrens <Matthew.Ahrens at sun.com> wrote:
> Peter Bortas wrote:
> > According to the zdb dump, object 0 seems to be the DMU node on each
> > file system. My understanding of this part of ZFS is very shallow, but
> > why does it allow the filesystems to be mounted rw with damaged DMU
> > nodes? Doesn't that result in a risk of more permanent damage to the
> > structure of those filesystems? Or are there redundant DMU nodes it's
> > now using, and in that case, why doesn't it automatically fix the
> > damaged ones?
>
> Object 0 is basically the object that describes the other objects. So the
> end result will be that some range of (up to 32) objects in each of those
> filesystems will be inaccessible. There is no risk of additional damage by
> running in read/write mode, because ZFS is always able to detect what data
> is good and what is bad by using checksums.
>
> That said, blkid 0 of object 0 always happens to contain some critical
> objects (the ZPL "master node" and root directory). So if you are able to
> mount these filesystems at all, then it probably means that ZFS was able to
> find another redundant copy, or the failure was actually transient. (Eg,
> because one disk was temporarily offline, and some pieces of another disk
> are damaged, so raidz1 couldn't reconstruct.)

The question is why it didn't clear those errors when resilvering, if it
found redundant copies. Before the resilvering there were actually four of
those errors. This one:

          37       0       lvl=2 blkid=0

was removed by resilvering.

> FYI, in a later build, 'zpool status -v' actually tells you the names of the
> damaged filesystem & files, so you don't have to muck around with zdb.

Yes, that is a feature that has been tempting me to upgrade for a while.
Unfortunately I won't have time to do it this weekend.

--
Peter Bortas
On 6/30/07, Peter Bortas <bortas at gmail.com> wrote:
> I'm currently doing a complete scrub, but according to zpool status'
> latest estimate it will be 63h before I know how that went...

The scrub has now completed with 0 errors, and there are no longer any
corruption errors reported.

--
Peter Bortas