Chris Forgeron
2011-Feb-07 00:56 UTC
[zfs-discuss] Repairing Faulted ZFS pool when zdb doesn't recognize the pool as existing
Hello all,

Long time reader, first time poster.

I'm on day two of a rather long struggle with ZFS and my data. It seems we have a difference of opinion - ZFS doesn't think I have any, and I'm pretty sure I saw a crapload of it just the other day.

I've been researching and following various bits of information that I've found from so many helpful people on this list, but I'm running into a slightly different problem than the rest of you: my zdb doesn't seem to recognize the pool for any command other than zdb -e <pool>.

I think my problem is a corrupt set of uberblocks, and if I could go back in time a bit, everything would be rosy. But how do you do that when zdb doesn't give you the output that you need?

Let's start at the beginning, as this will be a rather long post. Hopefully it will be of use to others in similar situations.

I was running Solaris 11 Express, keeping my pool at v28 so I could occasionally switch back into FreeBSD-9-Current for tests, comparisons, etc.

I've built a rather large pool of 25 1.5 TB drives, organized as five 5-drive raidz1 vdevs striped together.

Friday night, one of the 1.5 TB drives faulted, and resilvering to the spare 1.5 TB drive started. All was normal.

In the morning, the resilver was around 86% complete when I started working on the CIFS side of Solaris - I wanted to take its authentication from Workgroup to Domain mode, so I was following the procedure for this, setting up krb5.conf, etc. I also changed the hostname at this point, to better label the system.

I had rebooted once during this, and everything came back up fine. The drive was still resilvering. I then went for a second reboot, and when the system came back up, I was shocked to see my pool was in a faulted state.

Here's a zpool status output from that fateful moment:

-=-=-=-=-
  pool: tank
 state: FAULTED
status: The pool metadata is corrupted and the pool cannot be opened.
action: Destroy and re-create the pool from a backup source.
   see: http://www.sun.com/msg/ZFS-8000-72
  scan: none requested
config:

        NAME             STATE     READ WRITE CKSUM
        tank             FAULTED       0     0     1  corrupted data
          raidz1-0       ONLINE        0     0     2
            c9t0d0       ONLINE        0     0     0
            c9t0d1       ONLINE        0     0     0
            c9t0d2       ONLINE        0     0     0
            c9t0d3       ONLINE        0     0     0
            c9t0d4       ONLINE        0     0     0
          raidz1-1       ONLINE        0     0     0
            c9t1d0       ONLINE        0     0     0
            c9t1d1       ONLINE        0     0     0
            c9t1d2       ONLINE        0     0     0
            c9t1d3       ONLINE        0     0     0
            c9t1d4       ONLINE        0     0     0
          raidz1-2       ONLINE        0     0     0
            c9t2d0       ONLINE        0     0     0
            c9t2d1       ONLINE        0     0     0
            c9t2d2       ONLINE        0     0     0
            c9t2d3       ONLINE        0     0     0
            c9t2d4       ONLINE        0     0     0
          raidz1-3       ONLINE        0     0     2
            c9t3d0       ONLINE        0     0     0
            c9t3d1       ONLINE        0     0     0
            c9t3d2       ONLINE        0     0     0
            c9t3d3       ONLINE        0     0     0
            c9t3d4       ONLINE        0     0     0
          raidz1-6       ONLINE        0     0     2
            c9t4d0       ONLINE        0     0     0
            c9t4d1       ONLINE        0     0     0
            c9t4d2       ONLINE        0     0     0
            c9t4d3       ONLINE        0     0     0
            replacing-4  ONLINE        0     0     0
              c9t4d4     ONLINE        0     0     0
              c9t15d1    ONLINE        0     0     0
        logs
          c9t14d0p0      ONLINE        0     0     0
          c9t14d1p0      ONLINE        0     0     0
-=-=-=-=-=-

After a "holy crap", and a check of /tank to see if it really was gone, I executed a zpool export and then a zpool import. (Notice the 2 under the raidz1-3 vdev, as well as under raidz1-6.)

The export worked fine, but the import didn't - I received an I/O error.

At this stage, I thought it was something stupid with the resilver being jammed, and since I had 4 out of 5 drives functional in my raidz1-6 vdev, I figured I'd just remove those two drives and try my import again. Still no go.

At this stage I started keeping a log, so I could more accurately record what steps I was taking and what results I was seeing. The log becomes more accurate as the gravity of the situation sank in.
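For completeness, the very first round of commands (before the log starts) went roughly like this. This is a sketch from memory rather than a paste from the log, so treat the exact error wording as approximate:

    zpool status -v tank    # showed the FAULTED / "metadata is corrupted" state above
    zpool export tank       # the export worked fine
    zpool import            # the pool is still listed, by name and by GUID
    zpool import tank       # the plain import fails with an I/O error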
There are copious reboots in here at times, to make sure I have a solid system.

About the pool: it's big, I'd say 10-12 TB in size, about 12 zfs filesystems, some compression, some dedup, etc. It really was a nice thing when it was working.

I have backups of my most important stuff, but I don't have backups of my more recent work from the last week to three months, and I also have a few TB of media that is not backed up - so I'd really like to get this back. My secondary backup server was still in the process of being built, so backing up was painful, and thus not done often enough. I know.. I know.. and the funny thing is, from day 1 I decided I needed two SANs so the data would exist on two completely different bits of hardware, to protect against big whoopsies like this.

Here's what I started doing:

Tried: time zpool import -fFX -o ro -o failmode=continue -R /mnt 13666181038508963033
Result (20 min later): cannot import 'tank': one or more devices is currently unavailable

I found out about the -V option for zpool by reading the source, and decided to try this:

Tried: zpool import -f -V 13666181038508963033
Result: worked! Right away, but I was back to where I started - a faulted pool that didn't mount.

Tried: zpool clear -F tank
Result: cannot clear errors for tank: I/O error

There was more playing around with the same command, always with the same results. I exported again, as I felt the key was in the import.

I read about the importance of labels on the drives, and used zdb -l to check all of my drives. What I found here was interesting - since I had been doing so many tests with ZFS over the last few months, I had multiple labels on the drives, some at, say, c9t4d0 and others at c9t4d0s0. You will notice that my zpool status output shows the drives as c9tXdX, without the s0 slice.

BUT - both c9t4d4 and c9t15d1 (the bad drive and the replacement drive) lacked any labels at that level - the correct label was at c9t4d4s0 and c9t15d1s0! Checking all the drives, I also found that an OLD zpool called "tank" was on the base of the drive, and my new zpool called tank was on the s0 part of the drive. The guids were different, so Solaris shouldn't be confusing them.

Now, I've done exports and imports like this before, and Solaris / FreeBSD always figured things out just fine. Reading about how this could cause problems ( http://opensolaris.org/jive/thread.jspa?threadID=104654 ), I decided to make sure it wasn't causing an issue here. I followed the instructions, particularly for:

    mkdir /mytempdev
    cd /mytempdev
    for i in /dev/rdsk/c[67]d*s* ; do
        ln -s $i
    done
    zpool import -d /mytempdev

which creates a new directory, which I made sure was populated with only the drives that I knew were in this pool, and only with the cXtXdXs0 designation.

Tried: time zpool import -d /mytempdev -fFX -o ro -o failmode=continue -R /mnt 13666181038508963033
Result (1 min): The devices below are missing, use '-m' to import the pool anyway:
            c9t14d0p0 [log]
            c9t14d1p0 [log]

Okay.. so that's a bit of progress. I did what it said.
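For anyone repeating the label check, the zdb -l pass over the drives was essentially a loop like the one below. It's a sketch - the glob is whatever matches your own controllers and device nodes (mine were all on c9), and the grep just pulls out the fields worth comparing:

    # Compare label contents across the candidate device nodes (whole-disk vs s0)
    # to see which node carries the current pool's name, guid and txg.
    for d in /dev/rdsk/c9t*d*s0 /dev/rdsk/c9t*d*p0 ; do
        echo "===== $d"
        zdb -l $d | egrep 'name:|pool_guid|txg:|path:'
    done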
Tried: time zpool import -m -d /mytempdev -fFX -o ro -o failmode=continue -R /mnt 13666181038508963033
Result (1 min): cannot import 'tank': one or more devices is currently unavailable

Then I downloaded Victor's dtrace script (http://markmail.org/search/?q=more+ZFS+recovery#query:more%20ZFS%20recovery+page:1+mid:6l2zrn36qis6rydv+state:results) and executed these commands:

Tried: dtrace -s ./zpool.d -c "zpool import -m -d /mytempdev -fFX -o ro -o failmode=continue -R /mnt 13666181038508963033" > dtrace.dump
Result (6 min): huge dump to dtrace.dump, around 12 GB - couldn't open it in vim, or compress it with bzip2. I decided to abandon that, as it's so large I doubt I could do much with it.

Tried: zdb -e -dddd 13666181038508963033
Result: zdb: can't open 'tank': I/O error

Tried: zdb -e 13666181038508963033
Result: starts listing labels, then it tanks with an I/O error. It lists all the vdevs.

This is about the only zdb command that works for me. The other one that at least tries something is:

Tried: zdb -U -uuuv 13666181038508963033
Result: Assertion failed: thr_create(0, 0, (void *(*)(void *))func, arg, THR_DETACHED, &tid) == 0, file ../common/kernel.c, line 73, function zk_thread_create

So I then typed "zpool export tank", deleted the /mytempdev/c9t4d4s0 and c9t15d1s0 devs, and tried my import again. The thinking here is that the drives involved in the resilvering may somehow be bad.

Tried: time zpool import -m -d /mytempdev -fFX -o ro -o failmode=continue -R /mnt 13666181038508963033
Result: cannot import 'tank': one or more devices is currently unavailable

So it doesn't like not having enough vdevs - which is fair. I gave it two blank drives to act as my two resilvering drives, put the links back in /mytempdev, and tried my import again. Same problem, nothing different.

Here's what my pool looks like at this stage:

  pool: tank
 state: FAULTED
status: The pool metadata is corrupted and the pool cannot be opened.
action: Destroy and re-create the pool from a backup source.
   see: http://www.sun.com/msg/ZFS-8000-72
  scan: none requested
config:

        NAME                       STATE     READ WRITE CKSUM
        tank                       FAULTED       0     0     1  corrupted data
          raidz1-0                 ONLINE        0     0     2
            /mytempdev/c9t0d0s0    ONLINE        0     0     0
            /mytempdev/c9t0d1s0    ONLINE        0     0     0
            /mytempdev/c9t0d2s0    ONLINE        0     0     0
            /mytempdev/c9t0d3s0    ONLINE        0     0     0
            /mytempdev/c9t0d4s0    ONLINE        0     0     0
          raidz1-1                 ONLINE        0     0     0
            /mytempdev/c9t1d0s0    ONLINE        0     0     0
            /mytempdev/c9t1d1s0    ONLINE        0     0     0
            /mytempdev/c9t1d2s0    ONLINE        0     0     0
            /mytempdev/c9t1d3s0    ONLINE        0     0     0
            /mytempdev/c9t1d4s0    ONLINE        0     0     0
          raidz1-2                 ONLINE        0     0     0
            /mytempdev/c9t2d0s0    ONLINE        0     0     0
            /mytempdev/c9t2d1s0    ONLINE        0     0     0
            /mytempdev/c9t2d2s0    ONLINE        0     0     0
            /mytempdev/c9t2d3s0    ONLINE        0     0     0
            /mytempdev/c9t2d4s0    ONLINE        0     0     0
          raidz1-3                 ONLINE        0     0     2
            /mytempdev/c9t3d0s0    ONLINE        0     0     0
            /mytempdev/c9t3d1s0    ONLINE        0     0     0
            /mytempdev/c9t3d2s0    ONLINE        0     0     0
            /mytempdev/c9t3d3s0    ONLINE        0     0     0
            /mytempdev/c9t3d4s0    ONLINE        0     0     0
          missing-4                ONLINE        0     0     0
          missing-5                ONLINE        0     0     0
          raidz1-6                 ONLINE        0     0     2
            /mytempdev/c9t4d0s0    ONLINE        0     0     0
            /mytempdev/c9t4d1s0    ONLINE        0     0     0
            /mytempdev/c9t4d2s0    ONLINE        0     0     0
            /mytempdev/c9t4d3s0    ONLINE        0     0     0
            replacing-4            ONLINE        0     0     0
              /mytempdev/c9t4d4s0  ONLINE        0     0     0
              /mytempdev/c9t15d1s0 ONLINE        0     0     0

(Hey! Look, I just noticed: it says missing-4 and missing-5. These never existed. The reason the numbering jumps to raidz1-6, as far as I know, is because raidz1-6 is the start of a new backplane - the other raidz1-[0-3] vdevs are on a different backplane. I wonder if ZFS suddenly thinks we need those?)
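A side note on those missing-4 / missing-5 entries: every device label records that device's top-level vdev id (and, for slogs, an is_log flag), so it should be possible to see which devices ids 4 and 5 really belong to. My guess is they are the two log devices the earlier import reported missing, rather than anything to do with the other backplane, but that is only an assumption. A rough check, assuming the slog devices are still readable:

    # The label carries the device's top-level vdev "id", an is_log flag for slogs,
    # and (on v28) the pool-wide vdev_children count.
    zdb -l /dev/rdsk/c9t14d0p0 | egrep '^[ ]*id:|is_log|vdev_children'
    zdb -l /dev/rdsk/c9t14d1p0 | egrep '^[ ]*id:|is_log|vdev_children'
    zdb -l /mytempdev/c9t0d0s0 | egrep '^[ ]*id:|vdev_children'    # a data drive, for comparison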
Tried: zdb -e -dddd 13666181038508963033
Result: (same as before) can't open 'tank': I/O error

Tried: zpool history tank
Result: no output

Tried: zdb -U -lv 13666181038508963033
Result: zdb: can't open '13666181038508963033': No such file or directory

Tried: zdb -e 13666181038508963033
Result: lists all the vdevs, gets an I/O error at the end of c9t15d1s0. It thinks we have 7 vdev children, and it lists 7 (0-6).

Tried: time zpool import -V -m -d /mytempdev -fFX -o ro -o failmode=continue -R /mnt 13666181038508963033
Result: works, took 25 min, and all the vdevs are the proper /mytempdev devices, not other ones.

Tried: zpool clear -F tank
Result: cannot clear errors for tank: I/O error

zdb itself does work - the same commands run against "rpool" come back properly. Look:

solaris:/# zdb -R tank 0:11600:200
zdb: can't open 'tank': No such file or directory
solaris:/# zdb -R rpool 0:11600:200
Found vdev: /dev/dsk/c8d0s0
DVA[0]=<0:11600:200:STD:1> [L0 unallocated] off uncompressed LE contiguous unique unencrypted 1-copy size=200L/200P birth=4L/4P fill=0 cksum=0:0:0:0
          0 1 2 3 4 5 6 7   8 9 a b c d e f  0123456789abcdef
000000:  070c89010c000254  1310050680101c00  T...............
000010:  58030001001f0728  060d201528830a07  (......X...(. ..
000020:  3bdf081041020c10  0f00cc0c00588256  ...A...;V.X.....
000030:  000800003d2fe64b  130016580df44f09  K./=.....O..X...
000040:  48b49e8ec3ac74c0  42fc03fcff2f0064  .t.....Hd./....B
000050:  42fc42fc42fc42fc  fc42fcff42fc42fc  .B.B.B.B.B.B..B.
[..snip..]

I've tried booting into FreeBSD, and I get pretty much the same results as I do under Solaris Express. The only difference with FreeBSD seems to be that after I import with -V and try a zdb command, the zpool doesn't exist anymore (crashes out?). My advantage in FreeBSD is that I have the source code, which I can browse through and possibly edit if I need to.

I'm thinking that if I could try some of the uberblock invalidation tricks, I could do something here - but how do I do that without zdb giving me the information that I need?

Hopefully some kind soul will take pity and dive into this mess with me. :)
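P.S. On the uberblock idea, my rough fallback plan is to read the uberblock ring straight off the labels, since zdb won't dump it for me. This is based on my understanding of the v28 on-disk format rather than anything zdb has confirmed on this pool, so treat it as a sketch only:

    # Each of the four labels is 256 KB; the second half of a label (offset 128 KB
    # within it) is the uberblock ring, in 1 KB slots on an ashift=9 pool.
    # Label 0 sits at the very start of the device, so its ring starts at 128 KB.
    # A valid uberblock begins with the magic 0x00bab10c, followed by version,
    # txg, guid_sum, timestamp and the root blkptr.

    dd if=/dev/rdsk/c9t0d0s0 of=/tmp/ub-ring.bin bs=1k skip=128 count=128

    # Walk the slots and note the txg/timestamp of each candidate:
    od -A d -t x8 /tmp/ub-ring.bin | grep bab10c

    # The "go back in time" trick would then be to pick an older, consistent txg
    # and invalidate the newer uberblocks on every label so the import falls back
    # to it - not something to try without taking images of the disks first.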
George Wilson
2011-Feb-07 02:55 UTC
[zfs-discuss] Repairing Faulted ZFS pool when zdb doesn't recognize the pool as existing
Chris,

I might be able to help you recover the pool but will need access to your system. If you think this is possible, just ping me off-list and let me know.

Thanks,
George

On Sun, Feb 6, 2011 at 4:56 PM, Chris Forgeron <cforgeron at acsi.ca> wrote:
> [Chris's original post quoted in full - snipped, see above]
--
George Wilson <http://www.delphix.com>
M: +1.770.853.8523  F: +1.650.494.1676
275 Middlefield Road, Suite 50
Menlo Park, CA 94025
http://www.delphix.com
Chris Forgeron
2011-Feb-08 18:13 UTC
[zfs-discuss] Repairing Faulted ZFS pool when zdb doesn't recognize the pool as existing
Quick update: George has been very helpful, and there is progress with my zpool. I've got partial read ability at this point, and some data is being copied off. It was _way_ beyond my skill set to fix on my own. Once we have things resolved to a better level, I'll post more details (with a lot of help from George, I'd say).
Chris Forgeron
2011-Feb-09 00:06 UTC
[zfs-discuss] Repairing Faulted ZFS pool when zdb doesn't recognize the pool as existing
Yes, a full disclosure will be made once it's back to normal (hopefully that event will happen).

The pool is mounted RO right now, and I can give some better stats: I had 10.3 TB of data in that pool, all a mix of dedup and compression. Interestingly enough, anything that wasn't being touched (i.e. any VM that wasn't mounted, or file that wasn't being written to) is perfectly fine so far - so that would mean over 95% of that 10.3 TB should be clean. I've only copied 10% of it so far, but it's all moved without issue. The files that were in motion are showing great destruction - I/O errors on any attempt to read them.

At this stage, I'm hoping that going further back in time with the txgs will give me solid files that I can work with. George has some ideas there, and hopefully will have something to try around the time I'm finished copying the data.

This should stand as a warning that even a ZFS pool can disappear if it takes corruption in the right area. Hopefully there will be time and enough evidence for a "how did this happen" type of look.

I'm starting to beef up my second SAN device more as I wait for my primary pool to recover. I wanted to stress-test this SAN design; I guess I have covered all the bases - right up to pool destruction and eventual partial recovery. Once this is all in production, I have to keep enough business processes running regardless of whether one of the pools goes down.

Many thanks to George and his continued efforts.

From: haaksen at gmail.com On Behalf Of Mark Alkema
Sent: Tuesday, February 08, 2011 4:38 PM
To: Chris Forgeron
Subject: Re: [zfs-discuss] Repairing Faulted ZFS pool when zdb doesn't recognize the pool as existing

good for you chris! i`m very interested in the details.

Rgds, Mark.

2011/2/8 Chris Forgeron <cforgeron at acsi.ca>
> [earlier update quoted - snipped, see above]