Eugene Gladchenko
2008-Oct-19 07:02 UTC
[zfs-discuss] My 500-gig ZFS is gone: insufficient replicas, corrupted data
Hi,

I'm running FreeBSD 7.1-PRERELEASE with a 500-gig ZFS drive. Recently I've encountered a FreeBSD problem (PR kern/128083) and decided to update the motherboard BIOS. It looked like the update went right, but after that I was shocked to see my ZFS pool destroyed! Rolling the BIOS back did not help.

Now it looks like this:

# zpool status
  pool: tank
 state: UNAVAIL
status: One or more devices could not be used because the label is missing
        or invalid.  There are insufficient replicas for the pool to continue
        functioning.
action: Destroy and re-create the pool from a backup source.
   see: http://www.sun.com/msg/ZFS-8000-5E
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank        UNAVAIL      0     0     0  insufficient replicas
          ad4       UNAVAIL      0     0     0  corrupted data

# zdb -l /dev/ad4
--------------------------------------------
LABEL 0
--------------------------------------------
    version=6
    name='tank'
    state=0
    txg=4
    pool_guid=12069359268725642778
    hostid=2719189110
    hostname='home.gladchenko.ru'
    top_guid=5515037892630596686
    guid=5515037892630596686
    vdev_tree
        type='disk'
        id=0
        guid=5515037892630596686
        path='/dev/ad4'
        devid='ad:5QM0WF9G'
        whole_disk=0
        metaslab_array=14
        metaslab_shift=32
        ashift=9
        asize=500103118848
--------------------------------------------
LABEL 1
--------------------------------------------
    version=6
    name='tank'
    state=0
    txg=4
    pool_guid=12069359268725642778
    hostid=2719189110
    hostname='home.gladchenko.ru'
    top_guid=5515037892630596686
    guid=5515037892630596686
    vdev_tree
        type='disk'
        id=0
        guid=5515037892630596686
        path='/dev/ad4'
        devid='ad:5QM0WF9G'
        whole_disk=0
        metaslab_array=14
        metaslab_shift=32
        ashift=9
        asize=500103118848
--------------------------------------------
LABEL 2
--------------------------------------------
failed to unpack label 2
--------------------------------------------
LABEL 3
--------------------------------------------
failed to unpack label 3
#

I've tried to import the problem pool into OpenSolaris 2008.05 with no success:

# zpool import
  pool: tank
    id: 12069359268725642778
 state: UNAVAIL
status: The pool was last accessed by another system.
action: The pool cannot be imported due to damaged devices or data.
   see: http://www.sun.com/msg/ZFS-8000-EY
config:

        tank        UNAVAIL   0   0   0  insufficient replicas
          c3d0s2    UNAVAIL   0   0   0  corrupted data
#

Is there a way to recover my files from this broken pool? Maybe at least some of them? The drive was 4/5 full. :(

I would appreciate any help.

p.s. I already bought another drive of the same size yesterday. My next ZFS experience will definitely be a mirrored one.
Nigel Smith
2008-Oct-19 12:29 UTC
[zfs-discuss] My 500-gig ZFS is gone: insufficient replicas, corrupted data
With ZFS, there are 4 identical labels on each physical vdev, in this case a single hard drive: L0/L1 at the start of the vdev, and L2/L3 at the end of the vdev. As I understand it, part of the reason for having four identical labels is to make it difficult to completely lose the information in the labels.

In this case, labels L0 & L1 look ok, but labels L2 & L3 'failed to unpack', and the status says "..the label is missing or invalid."

Ok, my theory is that some setting in the BIOS has got confused about the size of the hard drive, and thinks it's smaller than it was originally. Maybe it thinks the geometry has changed. If it thinks the size of the hard drive has shrunk, then maybe that is why it cannot read the labels at the end of the vdev. And maybe it thinks the two readable labels are invalid because now the 'asize' does not match what the BIOS is currently reporting.

I would switch back to your original BIOS and try looking at the settings for the hard drive geometry.

(BTW, this is the sort of situation where it would have been good to have noted the reported size of the hard drive BEFORE the update.)

(And if the above theory is right, having a mirrored pair of identical hard drives would not help, as the BIOS update may cause an identical problem with each drive.)

Good Luck
Nigel Smith
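
P.S. A quick check of this theory might look something like the following. Only a sketch: it assumes the FreeBSD box and the /dev/ad4 path from your post, and it only reads from the disk.

    # What size does the OS report for the disk right now? (FreeBSD)
    # 'mediasize in bytes' is the number to write down and compare.
    diskinfo -v /dev/ad4

    # What did ZFS record when the pool was created?
    # asize=500103118848 in the labels quoted above.
    zdb -l /dev/ad4 | grep asize

    # ZFS keeps two 256 KiB labels at each end of the vdev (plus some reserved
    # space at the front), so the raw disk has to be a little larger than asize.
    # If 'mediasize' has shrunk since the pool was created, the two labels at
    # the end of the disk are now beyond the reported end of the device.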
Orvar Korvar
2008-Oct-19 13:52 UTC
[zfs-discuss] My 500-gig ZFS is gone: insufficient replicas, corrupted data
Would it help to move the drive into another computer and try to import the pool there?
Richard Elling
2008-Oct-20 17:38 UTC
[zfs-discuss] My 500-gig ZFS is gone: insufficient replicas, corrupted data
Eugene Gladchenko wrote:
> Hi,
>
> I'm running FreeBSD 7.1-PRERELEASE with a 500-gig ZFS drive. Recently I've
> encountered a FreeBSD problem (PR kern/128083) and decided to update the
> motherboard BIOS. It looked like the update went right, but after that I was
> shocked to see my ZFS pool destroyed! Rolling the BIOS back did not help.
[...]
> --------------------------------------------
> LABEL 2
> --------------------------------------------
> failed to unpack label 2
> --------------------------------------------
> LABEL 3
> --------------------------------------------
> failed to unpack label 3

This would occur if the beginning of the partition was intact, but the
end is not.  Causes for the latter include:
    1. partition table changes (or vtoc for SMI labels)
    2. something overwrote data at the end

If the cause is #1, then restoring the partition should work.
If the cause is #2, then the data may be gone.

Note: ZFS can import a pool with one working label, but if more of the
data is actually unavailable or overwritten, then it may not be able to
get to a consistent state.
 -- richard
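
P.S. For reference, the labels live at fixed offsets: L0 at byte 0, L1 at 256 KiB, and L2/L3 in (roughly) the last 512 KiB of the vdev. A read-only sketch of looking for label remnants near the current end of the device; it assumes a 512-byte sector size and the /dev/ad4 path used earlier in this thread, and it does not modify anything:

    # Size, in bytes, that the OS currently reports for the whole device
    # (third field of FreeBSD's diskinfo output).
    SIZE=$(diskinfo /dev/ad4 | awk '{print $3}')
    SECTORS=$(( SIZE / 512 ))

    # L2 and L3 together occupy roughly the last 512 KiB (1024 sectors) of
    # the vdev.  If the device still ends where ZFS expects, the pool name
    # shows up in the nvlist data of those labels.
    dd if=/dev/ad4 bs=512 skip=$(( SECTORS - 1024 )) count=1024 2>/dev/null \
        | strings | grep tank

    # No hits here, while 'zdb -l' still reads L0/L1, means either the end of
    # the device was overwritten (cause #2) or the device now reports a
    # smaller size than when the pool was created (a variant of cause #1).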
Eugene Gladchenko
2008-Oct-27 15:00 UTC
[zfs-discuss] My 500-gig ZFS is gone: insufficient replicas, corrupted data
Hi everyone, sorry for the late reply.

First of all, I've got all my files back. Cheers! :-)

Next, I'd like to thank Nigel Smith, you are the best!

And, if anyone is interested, here is the end of the story, and I'll try not to make it too long.

As one can see at http://www.freebsd.org/cgi/query-pr.cgi?pr=128083 the size of /dev/ad4 was 476940MB before the trouble arose. And, as Nigel said, I really didn't notice the size had changed to 476938MB. 2MB got stolen.

The one to blame was HPA, or Host Protected Area, see http://en.wikipedia.org/wiki/Host_Protected_Area for details. A damned new motherboard BIOS silently cut 2 megabytes down from the drive so that ZFS went insane.

All I had to do was reset the HPA back to the native capacity and pray that the BIOS had not overwritten any vital ZFS data. With the new 7200.11 Seagates the resetting procedure was a bit complicated because I had to power-cycle the drive while feeding it the new HPA setting.

After getting my data back I made a copy of it, let the BIOS cut the drive again and re-created the pool with the new HDD size.

That's all. Thank you all again. Take care.

Eugene Gladchenko
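
p.s. For anyone who hits the same thing: on a Linux live CD, hdparm can show whether an HPA is hiding part of a drive. Just a sketch, with /dev/sda standing in for the affected disk, and changing the HPA entirely at your own risk:

    # Show the current vs. native maximum sector count.
    # Differing numbers plus "HPA is enabled" mean part of the disk is hidden.
    hdparm -N /dev/sda

    # Set the visible maximum back to the native value reported above.
    # 976773168 is only an example for a typical 500 GB drive; use the native
    # max from the first command.  The 'p' prefix makes the change permanent.
    # My 7200.11 Seagates would only take the new setting around a power cycle.
    hdparm -N p976773168 /dev/sda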
Casper.Dik at Sun.COM
2008-Oct-27 15:09 UTC
[zfs-discuss] My 500-gig ZFS is gone: insufficient replicas, corrupted data
>A damned new motherboard BIOS silently cut 2 megabytes down from
>the drive so that ZFS went insane.

Can you tell us which BIOS/motherboard we should avoid?

Casper
Nigel Smith
2008-Oct-27 17:24 UTC
[zfs-discuss] My 500-gig ZFS is gone: insufficient replicas, corrupted data
Hi Eugene

I'm delighted to hear you got your files back!

I've seen a few posts to this forum where people have made some change to the hardware, and then found that the ZFS pool has gone. And often you never hear any more from them, so you assume they could not recover it. Thanks for reporting back your interesting story.

I wonder how many other people have been caught out by this 'Host Protected Area' (HPA) and never worked out that this was the cause...

Maybe one moral of this story is to make a note of your hard drive and partition sizes now, while you have a working system. If you're using Solaris, maybe try 'prtvtoc'.
http://docs.sun.com/app/docs/doc/819-2240/prtvtoc-1m?a=view
(Unless someone knows a better way?)

Thanks
Nigel Smith

# prtvtoc /dev/rdsk/c1t1d0
* /dev/rdsk/c1t1d0 partition map
*
* Dimensions:
*        512 bytes/sector
* 1465149168 sectors
* 1465149101 accessible sectors
*
* Flags:
*   1: unmountable
*  10: read-only
*
* Unallocated space:
*       First     Sector    Last
*       Sector     Count    Sector
*          34       222       255
*
*                          First     Sector    Last
* Partition  Tag  Flags    Sector     Count    Sector  Mount Directory
       0      4    00        256 1465132495 1465132750
       8     11    00 1465132751      16384 1465149134
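
And a rough sketch of capturing that for every disk in one go on Solaris. The /var/tmp destination is arbitrary, and 'format' is only being used here to print the disk list:

    # Keep a dated record of disk sizes and partition maps.
    DEST=/var/tmp/disk-labels-$(date +%Y%m%d)
    mkdir -p "$DEST"

    # 'format' with its input closed just prints the list of disks and exits.
    format < /dev/null > "$DEST/disks.txt" 2>&1

    # Save the partition map of each disk (s2 is the traditional whole-disk
    # slice; prtvtoc also works on EFI-labelled disks).
    for d in /dev/rdsk/*s2; do
        prtvtoc "$d" > "$DEST/$(basename "$d").vtoc" 2>/dev/null
    done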
Nigel Smith
2008-Oct-27 17:49 UTC
[zfs-discuss] My 500-gig ZFS is gone: insufficient replicas, corrupted data
...check out the link that Eugene provided. It was a GigaByte GA-G31M-S2L motherboard:
http://www.gigabyte.com.tw/Products/Motherboard/Products_Spec.aspx?ProductID=2693

Some more info on the 'Host Protected Area' (HPA), relating to OpenSolaris, here:
http://opensolaris.org/os/community/arc/caselog/2007/660/onepager/
http://bugs.opensolaris.org/view_bug.do?bug_id=5044205

Regards
Nigel Smith
Miles Nordin
2008-Oct-27 19:13 UTC
[zfs-discuss] My 500-gig ZFS is gone: insufficient replicas, corrupted data
>>>>> "ns" == Nigel Smith <nwsmith at wilusa.freeserve.co.uk> writes:

    ns> make a note of your hard drive and partition sizes now, while
    ns> you have a working system.

Keeping a human-readable backup of all your disklabels somewhere safe has helped me a few times. For me it was mostly moving disks among architectures (i386, sparc, alpha), but even if it weren't for this new and broken idea of boot firmware feeling entitled to ``write'' to disks, there are already many label formats and many label readers and writers just on a single x86 system, which motivated that one-pager: the Linux label writer becomes incompatible with the Solaris reader because Linux disables the HPA, and the Linux reader is more forgiving than the Solaris reader.

The other thing I don't like is that it's hard to tell under Solaris the difference between a physically defective disk and a disk where Solaris ``doesn't like'' the label. There's no uniform Solaris equivalent to Linux's unpartitioned device interface. And the Solaris label tools like fmthard, rmformat, format, prtvtoc have a variety of quiet and ambiguously reported reasons for refusing to operate on a disk, and a bunch of undocumented command line flags and secret environment variables.

Lastly, shouldn't these bugs or ARCs or whatever:
http://opensolaris.org/os/community/arc/caselog/2007/660/onepager/
http://bugs.opensolaris.org/view_bug.do?bug_id=5044205
cover SATA framework drives and drives appearing with SCSI emulation through mpt or mega_sas? Covering PATA only isn't much help. A uniform ATA layer would be good for everyone---for keeping this HPA work well-factored as well as making tools like hdparm, smartctl, cdrecord work consistently.
Nigel Smith
2008-Oct-28 00:00 UTC
[zfs-discuss] My 500-gig ZFS is gone: insufficient replicas, corrupted data
Hi Miles

I think you make some very good points in your comments. It would be nice to get some positive feedback on these from Sun.

My thought too, on (quickly) looking at that bug & ARC case, was: doesn't this also need to be factored into the SATA framework?

I really miss having 'smartctl' (fully) working with PATA and SATA drives on x86 Solaris.

I've done a quick search on PSARC 2007/660: it was "closed approved fast-track 11/28/2007", but I could not find any code committed to 'onnv-gate' that references this case.

Regards
Nigel Smith