(I tried to post this yesterday, but I haven't seen it come through the list yet. I apologize if this is a duplicate posting. I added some updated information regarding a Sun bug ID below.)

We're in the process of setting up a Sun Cluster on two M5000s attached to a DMX1000 array. The M5000s are running Solaris 10u4 with the current patch cluster. For some of our initial testing, our pools consist of single unmirrored LUNs (though these are RAID 5 on the DMX side). All ZFS pools are delegated to non-global zones through dataset properties in the zone configurations.

We had a processor blow chunks on one of the M5000s while the pools were active on that server (thank heavens, NOT in production!). The server panicked and the resource group attempted to fail over to the other server. That server also immediately panicked, probably because the zpools were faulted. (Note that the SERVER panicked, not the non-global zone. It's bad enough that a zpool fault causes a zone to panic; does it have to take down the ENTIRE server, along with whatever other well-behaved pools may be involved?)

Long story short, we rebooted to milestone=none, removed the zpool.cache, proceeded to a full boot, and began trying to import the pools. All but one of the pools imported OK. The remaining pool provokes a panic whenever we attempt to import it. We have set aok=1 and zfs:zfs_recover=1 in /etc/system (are these supported in S10u4, or just OpenSolaris?), with no change in behavior. (I've included a rough sketch of the commands below, after the console output.)

The results look something like:

zpool import -f zpool_STORE

panic[cpu48]/thread=2a101e1fcc0: assertion failed: 0 == dmu_buf_hold_array(os, object, offset, size, FALSE, FTAG, &numbufs, &dbp), file: ../../common/fs/zfs/dmu.c, line: 413

000002a101e1f560 genunix:assfail+74 (7b29d8b0, 7b29d900, 19d, 18a9000, 1204400, 0)
  %l0-3: 0000000000000000 000003000013a1d8 0000000000000000 0000000000000000
  %l4-7: 0000000001204400 0000000000000000 00000000018f5000 0000000000000000
000002a101e1f610 zfs:zfsctl_ops_root+ac3c2e8 (600105efc70, 63, 668, 10, 60016232e00, 6001079e500)
  %l0-3: 0000000000000001 000000000000000f 0000000000000007 0000000000002264
  %l4-7: 00000300000da800 0000000000000000 0000000000002263 0000000000000000
000002a101e1f6e0 zfs:space_map_sync+278 (60015f0ab58, 4, 60015f0a960, 10, 2, 48)
  %l0-3: 0000000000000010 0000060016232e00 0000060016232e10 0000060016232e48
  %l4-7: 00007fffffffffff 0000000000007fff 00000000000000e0 0000000000000010
000002a101e1f7d0 zfs:metaslab_sync+200 (60015f0a940, ff0e49, 4, 6001079e500, 6000a144ac0, 60016237700)
  %l0-3: 00000600105efc70 0000060015f0a978 0000060015f0a960 0000060015f0a9d8
  %l4-7: 0000060015f0ab58 0000060015f0aaf8 0000060015f0ac78 0000000000000003
000002a101e1f890 zfs:vdev_sync+90 (6000a144ac0, ff0e49, ff0e48, 60015f0a940, 6000a144d08, e)
  %l0-3: 00000600105efc40 0000000000000007 00000600162377e8 0000000000000001
  %l4-7: 000006000a144ac0 0000060016237700 0000000000000000 0000000000000000
000002a101e1f940 zfs:spa_sync+1d0 (60016237700, ff0e49, 1, 0, 2a101e1fcc4, 1)
  %l0-3: 0000000000000009 0000000000000000 00000600162377e8 00000600105e6080
  %l4-7: 0000000000000000 00000600058d05c0 0000060012515340 0000060016237880
000002a101e1fa00 zfs:txg_sync_thread+134 (60012515340, ff0e49, 0, 2a101e1fab0, 60012515450, 60012515452)
  %l0-3: 0000060012515460 0000060012515410 0000000000000000 0000060012515418
  %l4-7: 0000060012515456 0000060012515454 0000060012515408 0000000000ff0e4a

syncing file systems... 1 done
dumping to /dev/dsk/c0t0d0s1, offset 65536, content: kernel
100% done: 175836 pages dumped, compression ratio 6.77, dump succeeded
rebooting...
Resetting...
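For reference, the recovery sequence we went through looked roughly like the following. This is reconstructed from memory rather than cut-and-pasted, so treat it as a sketch; the pool name is just the example from above, and I still don't know whether the two /etc/system tunables actually do anything on S10u4:

    # 1. Tunables added to /etc/system (take effect on the next boot;
    #    unclear whether they are honored on S10u4 -- see question above)
    set aok=1
    set zfs:zfs_recover=1

    # 2. From the OBP "ok" prompt, boot without starting services
    #    (and therefore without importing any pools)
    boot -m milestone=none

    # 3. Remove the cached pool list so nothing is imported automatically
    rm /etc/zfs/zpool.cache

    # 4. Continue to a full boot, then re-import each pool by hand
    svcadm milestone all
    zpool import -f zpool_STORE    # <-- this is the import that panics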
We notice that three of the labels fail to unpack when we try to run zdb -lv:

/ > zdb -lv /dev/dsk/c3t6006048000038743014953594D333444d0
--------------------------------------------
LABEL 0
--------------------------------------------
--------------------------------------------
LABEL 1
--------------------------------------------
failed to unpack label 1
--------------------------------------------
LABEL 2
--------------------------------------------
failed to unpack label 2
--------------------------------------------
LABEL 3
--------------------------------------------
failed to unpack label 3

Any suggestions on how to recover this zpool? There was no particularly important data on this pool (we were just doing testing), but we want to make sure we know how to recover a pool if the same thing happens to one of the (currently non-production) systems running ZFS. This is certainly a good chance to practice recovery techniques.

We have since heard back from Sun Support that this is bug 6565042, which is scheduled to be rolled into S10u6 (not u5). We still do not have an answer from them as to whether the fix will be available as a patch (or a test patch), or whether it allows recovery of the file system or merely prevents the corruption in the first place.

Honestly, the section in the ZFS FAQ about why there is no fsck (or equivalent recovery tool) rings awfully hollow about now. I don't see how we can consider ZFS a production file system when a CPU failure can cause pool corruption so complete that no reasonably documented recovery of any data from that pool is possible.

We may do another round of suitability testing when u6 comes out. Maybe Sun will have rolled in the setting to prevent the system from panicking on an I/O problem.

Thanks in advance,

--Scott
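P.S. In case it helps anyone searching the archives later: the "setting" I mean in that last paragraph is, as I understand it, the per-pool failmode property that is supposed to show up in roughly the same timeframe as u6, e.g. something like

    zpool set failmode=continue zpool_STORE   # not available on our u4 boxes

so that a catastrophic I/O failure on one pool returns errors on that pool instead of panicking the whole box. I haven't been able to test it, so treat that as hearsay until it actually ships.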