thr3ads.net - zfs discuss - [zfs-discuss] Corrupted pool repair attempt [May 2008]

Matthew Erickson

2008-May-21 20:47 UTC

[zfs-discuss] Corrupted pool repair attempt

OK, so this is another "my pool got eaten" problem.  Our setup:

Nevada 77 when it happened, now running 87.
9 iSCSI vdevs exported from Linux boxes off of hardware RAID (running Linux for
drivers on the RAID controllers).  The pool itself is simply striped.

Our problem:
Power got yanked to 8 of the 9 vdevs.  At the time, we had the ZIL disabled and
write-back caching enabled on the vdevs for performance reasons.  The ZIL *was*
going to be re-enabled, but Murphy's Law says things crash beforehand.

On attempting to bring the system back up after a reboot, all the vdevs and the
pool itself are marked FAULTED with corrupted data.

What we've attempted:
Since last Thursday (today is the following Wednesday), we've tried
zpool import -F with this weekend's nightly build, to no avail.

In addition, I've been putting dtrace probes into the kernel to see where
it's dying and how, to work out whether this is a "turn off sanity checks
and mount r/o" issue or whether our data is hopelessly munged.  So far this
has been a bit of a goose chase, with possibilities popping up and failure
modes branching quicker than I can take a close look at them.

My partner here is working on the possibility of an offline file-grabbing
program, which shows some progress, but not much yet.

Our biggest problem is that neither of us is experienced in kernel-land
debugging or filesystems, and I at least am rather inexperienced with the
debugging power tools available on Solaris, such as mdb, and with uses of
dtrace beyond looking at function return values and entry arguments.

Is there someone who has a bit more experience with this who can help us?

-- Matt
 
 
This message posted from opensolaris.org

Matthew Erickson

2008-May-21 20:49 UTC


[zfs-discuss] Corrupted pool repair attempt

And another thing we noticed: on test striped pools we've created, all
the vdev labels hold the same txg number, even as vdevs are added later, while
the labels on our primary pool (the dead one) are all different.
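
For anyone wanting to repeat that comparison, here is a rough sketch of how the label txg values could be tallied from saved `zdb -l` output, one dump per vdev. It is only an illustration: the function names are made up for this example, and the two line layouts the parser accepts ("txg=N" and "txg: N") are assumptions about how zdb prints a label on a given build, so check them against real output first.

```python
import re
from collections import defaultdict

# Assumed layouts of the label dump: a "LABEL <n>" heading followed by
# name/value lines, with the txg printed as "txg=N" or "txg: N".
LABEL_RE = re.compile(r"^\s*LABEL\s+(\d+)\s*$")
TXG_RE = re.compile(r"^\s*txg\s*[=:]\s*(\d+)\s*$")


def label_txgs(zdb_output):
    """Return {label_number: txg} parsed from the text of one label dump."""
    txgs = {}
    current = None
    for line in zdb_output.splitlines():
        m = LABEL_RE.match(line)
        if m:
            current = int(m.group(1))
            continue
        m = TXG_RE.match(line)
        # Keep only the first txg seen under each label heading.
        if m and current is not None and current not in txgs:
            txgs[current] = int(m.group(1))
    return txgs


def distinct_txgs(outputs_by_device):
    """Map each txg value to the sorted list of devices whose labels carry it."""
    seen = defaultdict(set)
    for device, text in outputs_by_device.items():
        for txg in label_txgs(text).values():
            seen[txg].add(device)
    return {txg: sorted(devices) for txg, devices in seen.items()}
```

Feed `distinct_txgs` a dict of device name to dump text. A single key in the result means every label agrees, which is what our test pools show; several keys is what we see on the dead pool.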
 
 
