(I tried to post this yesterday, but I haven't seen it come through the list yet. I apologize if this is a duplicate posting. I added some updated information regarding a Sun bug ID below.)

We're in the process of setting up a Sun Cluster on two M5000s attached to a DMX1000 array. The M5000s are running Solaris 10u4 with the current patch cluster. For some of our initial testing, our pools consist of single unmirrored LUNs (though these are RAID 5 on the DMX side). All ZFS pools are delegated to non-global zones through dataset properties in the zone configurations.

We had a processor blow chunks on one of the M5000s while the pools were active on that server (thank heavens, NOT in production!). The server panicked and the resource group attempted to fail over to the other server. That server also immediately panicked, probably because the zpools were faulted. (Note that the SERVER panicked, not the non-global zone. It's bad enough that a zpool fault causes a zone to panic; does it have to take down the ENTIRE server, along with whatever other well-behaved pools may be involved?)

Long story short, we rebooted to milestone=none, removed the zpool.cache, proceeded to a full boot, and began trying to import the pools. All but one of the pools imported OK. The remaining pool provokes a panic whenever we attempt to import it. We have set aok=1 and zfs:zfs_recover=1 in /etc/system (are these supported in S10u4, or just OpenSolaris?), with no change in behavior. (I've included a rough sketch of the commands below, after the console output.)

The results look something like:

zpool import -f zpool_STORE

panic[cpu48]/thread=2a101e1fcc0: assertion failed: 0 == dmu_buf_hold_array(os, object, offset, size, FALSE, FTAG, &numbufs, &dbp), file: ../../common/fs/zfs/dmu.c, line: 413

000002a101e1f560 genunix:assfail+74 (7b29d8b0, 7b29d900, 19d, 18a9000, 1204400, 0)
  %l0-3: 0000000000000000 000003000013a1d8 0000000000000000 0000000000000000
  %l4-7: 0000000001204400 0000000000000000 00000000018f5000 0000000000000000
000002a101e1f610 zfs:zfsctl_ops_root+ac3c2e8 (600105efc70, 63, 668, 10, 60016232e00, 6001079e500)
  %l0-3: 0000000000000001 000000000000000f 0000000000000007 0000000000002264
  %l4-7: 00000300000da800 0000000000000000 0000000000002263 0000000000000000
000002a101e1f6e0 zfs:space_map_sync+278 (60015f0ab58, 4, 60015f0a960, 10, 2, 48)
  %l0-3: 0000000000000010 0000060016232e00 0000060016232e10 0000060016232e48
  %l4-7: 00007fffffffffff 0000000000007fff 00000000000000e0 0000000000000010
000002a101e1f7d0 zfs:metaslab_sync+200 (60015f0a940, ff0e49, 4, 6001079e500, 6000a144ac0, 60016237700)
  %l0-3: 00000600105efc70 0000060015f0a978 0000060015f0a960 0000060015f0a9d8
  %l4-7: 0000060015f0ab58 0000060015f0aaf8 0000060015f0ac78 0000000000000003
000002a101e1f890 zfs:vdev_sync+90 (6000a144ac0, ff0e49, ff0e48, 60015f0a940, 6000a144d08, e)
  %l0-3: 00000600105efc40 0000000000000007 00000600162377e8 0000000000000001
  %l4-7: 000006000a144ac0 0000060016237700 0000000000000000 0000000000000000
000002a101e1f940 zfs:spa_sync+1d0 (60016237700, ff0e49, 1, 0, 2a101e1fcc4, 1)
  %l0-3: 0000000000000009 0000000000000000 00000600162377e8 00000600105e6080
  %l4-7: 0000000000000000 00000600058d05c0 0000060012515340 0000060016237880
000002a101e1fa00 zfs:txg_sync_thread+134 (60012515340, ff0e49, 0, 2a101e1fab0, 60012515450, 60012515452)
  %l0-3: 0000060012515460 0000060012515410 0000000000000000 0000060012515418
  %l4-7: 0000060012515456 0000060012515454 0000060012515408 0000000000ff0e4a

syncing file systems... 1 done
dumping to /dev/dsk/c0t0d0s1, offset 65536, content: kernel
100% done: 175836 pages dumped, compression ratio 6.77, dump succeeded
rebooting...
Resetting...
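For reference, the recovery sequence we went through looked roughly like the following. This is reconstructed from memory rather than cut-and-pasted, so treat it as a sketch; the pool name is just the example from above, and I still don't know whether the two /etc/system tunables actually do anything on S10u4:

    # 1. Tunables added to /etc/system (take effect on the next boot;
    #    unclear whether they are honored on S10u4 -- see question above)
    set aok=1
    set zfs:zfs_recover=1

    # 2. From the OBP "ok" prompt, boot without starting services
    #    (and therefore without importing any pools)
    boot -m milestone=none

    # 3. Remove the cached pool list so nothing is imported automatically
    rm /etc/zfs/zpool.cache

    # 4. Continue to a full boot, then re-import each pool by hand
    svcadm milestone all
    zpool import -f zpool_STORE    # <-- this is the import that panics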
We notice that three of the labels fail to unpack when we try to run zdb -lv:

/ > zdb -lv /dev/dsk/c3t6006048000038743014953594D333444d0
--------------------------------------------
LABEL 0
--------------------------------------------
--------------------------------------------
LABEL 1
--------------------------------------------
failed to unpack label 1
--------------------------------------------
LABEL 2
--------------------------------------------
failed to unpack label 2
--------------------------------------------
LABEL 3
--------------------------------------------
failed to unpack label 3

Any suggestions on how to recover this zpool? There was no particularly important data on this pool (we were just doing testing), but we want to make sure we know how to recover a pool if the same thing happens to one of the (currently non-production) systems running ZFS. This is certainly a good chance to practice recovery techniques.

We have since heard back from Sun Support that this is bug 6565042, which is scheduled to be rolled into S10u6 (not u5). We still do not have an answer from them as to whether the fix will be available as a patch (or a test patch), or whether it allows recovery of the file system or merely prevents the corruption in the first place.

Honestly, the section in the ZFS FAQ about why there is no fsck (or equivalent recovery tool) rings awfully hollow about now. I don't see how we can consider ZFS a production file system when a CPU failure can cause pool corruption so complete that no reasonably documented recovery of any data from that pool is possible.

We may do another round of suitability testing when u6 comes out. Maybe Sun will have rolled in the setting to prevent the system from panicking on an I/O problem.

Thanks in advance,

--Scott
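P.S. In case it helps anyone searching the archives later: the "setting" I mean in that last paragraph is, as I understand it, the per-pool failmode property that is supposed to show up in roughly the same timeframe as u6, e.g. something like

    zpool set failmode=continue zpool_STORE   # not available on our u4 boxes

so that a catastrophic I/O failure on one pool returns errors on that pool instead of panicking the whole box. I haven't been able to test it, so treat that as hearsay until it actually ships.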