David Champion
2008-Oct-30 20:04 UTC
[zfs-discuss] after controller crash, 'one or more devices is currently unavailable'
There are a lot of hits for this error in Google, but I've had trouble identifying any that resemble my situation. I apologize if you've answered it before. If it's better for me to open a case with Sun Support, I can do that, but I'm hoping to cheat my way around the system so that I don't have to send somebody Explorer output before they escalate it. Seems more efficient in the long run. :)

Most of my tale of woe is background: I have a pool running under Solaris 10 5/08. It's an 8-member raidz2 whose volumes are on a 2540 array with two controllers. Volumes are mapped 1:1 with physical disks. I didn't really want a 2540, but I couldn't get anyone to swear to me that any other fiber-channel product would work with Solaris. I'm using fiber multipathing.

I've had two disk failures in the past two weeks. Last week I replaced the first. No problems with ZFS initially; a 'zpool replace' did the right thing. Yesterday I replaced the second. But while investigating the problem I noticed that two of my paths had gone down, so that 6 disks had both paths attached and 2 disks had only one path. At this time, 'zpool status' showed:

  pool: z
 state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas exist
        for the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://www.sun.com/msg/ZFS-8000-D3
 scrub: resilver completed with 0 errors on Fri Oct 24 20:04:51 2008
config:

        NAME                                         STATE     READ WRITE CKSUM
        z                                            DEGRADED     0     0     0
          raidz2                                     DEGRADED     0     0     0
            c6t600A0B800049F9E10000030548B3DF1Ed0s0  ONLINE       0     0     0
            c6t600A0B800049F9E10000030848B3DF52d0s0  ONLINE       0     0     0
            c6t600A0B800049F9E10000030B48B3DF7Ed0s0  ONLINE       0     0     0
            c6t600A0B800049F9E10000030E48B3DFA6d0s0  ONLINE       0     0     0
            c6t600A0B800049F9E10000031148B3DFD2d0s0  ONLINE       0     0     0
            c6t600A0B800049F9E10000031448B3DFFAd0s0  ONLINE       0     0     0
            c6t600A0B800049F9E10000031748B3E020d0s0  UNAVAIL      0     0     0  cannot open
            c6t600A0B800049F9E10000031A48B3E04Cd0s0  ONLINE       0     0     0

(At the time I hadn't figured it out, but I believe now that the one disk was UNAVAIL because the disk had not been properly partitioned yet, so s0 was undefined.)

Solaris 10's mpath support seems so far to be fairly intolerant of reconfiguration without a reboot, and I wasn't ready to reboot yet, but I thought I'd try resetting the controller that wasn't attached to all of the disks. But it appears that for some reason the CAM software reset both controllers simultaneously. The whole pool went into an error state, and all disks became unavailable. Very annoying, but not a problem for zfs-discuss. At this time, 'zpool status' showed:

  pool: z
 state: FAULTED
status: One or more devices could not be opened.  There are insufficient
        replicas for the pool to continue functioning.
action: Attach the missing device and online it using 'zpool online'.
   see: http://www.sun.com/msg/ZFS-8000-D3
 scrub: none requested
config:

        NAME                                         STATE     READ WRITE CKSUM
        z                                            FAULTED      0     0     0  corrupted data
          raidz2                                     DEGRADED     0     0     0
            c6t600A0B800049F9E10000030548B3DF1Ed0s0  UNAVAIL      0     0     0  corrupted data
            c6t600A0B800049F9E10000030848B3DF52d0s0  UNAVAIL      0     0     0  corrupted data
            c6t600A0B800049F9E10000030B48B3DF7Ed0s0  UNAVAIL      0     0     0  corrupted data
            c6t600A0B800049F9E10000030E48B3DFA6d0s0  UNAVAIL      0     0     0  cannot open
            c6t600A0B800049F9E10000031148B3DFD2d0s0  UNAVAIL      0     0     0  corrupted data
            c6t600A0B800049F9E10000031448B3DFFAd0s0  UNAVAIL      0     0     0  corrupted data
            c6t600A0B800049F9E10000031748B3E020d0s0  UNAVAIL      0     0     0  cannot open
            c6t600A0B800049F9E10000031A48B3E04Cd0s0  UNAVAIL      0     0     0  corrupted data

I don't know whether there's any chance of recovering this, but I wanted to try. I reset the 2540 again, but still no communication with Solaris. I rebooted the server, and communications resumed. I had to do some further repair/reconfig on the 2540 for the two disks marked 'cannot open', but it was a minor issue and worked fine. Solaris was then able to see all my disks.

Now we come to the main point. I still hadn't figured out the partitioning problem on ...E020d0s0 yet. It didn't occur to me because I believed that to be a spare disk which I had already partitioned, and let's face it, "I/O error" can be anything. I was wrong, though: I had already used the spare the previous week, and this one was unformatted. But lacking any other ideas, I tried to export and then import the pool. The export went fine, without complaint. I corrected the partitioning on the replacement disk at this point. But now when I try to import:

# zpool import
  pool: z
    id: 1372922273220982501
 state: FAULTED
status: One or more devices contains corrupted data.
action: The pool cannot be imported due to damaged devices or data.
        The pool may be active on another system, but can be imported using
        the '-f' flag.
   see: http://www.sun.com/msg/ZFS-8000-5E
config:

        z                                            FAULTED   corrupted data
          raidz2                                     ONLINE
            c6t600A0B800049F9E10000030548B3DF1Ed0s0  UNAVAIL   corrupted data
            c6t600A0B800049F9E10000030848B3DF52d0s0  UNAVAIL   corrupted data
            c6t600A0B800049F9E10000030B48B3DF7Ed0s0  UNAVAIL   corrupted data
            c6t600A0B800049F9E10000030E48B3DFA6d0s0  UNAVAIL   corrupted data
            c6t600A0B800049F9E10000031148B3DFD2d0s0  UNAVAIL   corrupted data
            c6t600A0B800049F9E10000031448B3DFFAd0s0  UNAVAIL   corrupted data
            c6t600A0B800049F9E10000031748B3E020d0s0  UNAVAIL   corrupted data
            c6t600A0B800049F9E10000031A48B3E04Cd0s0  UNAVAIL   corrupted data

# zpool import z
cannot import 'z': pool may be in use from other system
use '-f' to import anyway
# zpool import -f z
cannot import 'z': one or more devices is currently unavailable

'zdb -l' shows four valid labels for each of these disks except for the new one. Is this what "unavailable" means, in this case?

Growing adventurous, or perhaps just desperate, I used 'dd' to copy the first label from one of the good disks to the new one, which lacked any labels. Then I binary-edited the label to patch in the correct guid for that disk. (I got the correct guid from the zdb -l output.) I still get the same results from zpool import, though. Is this because I need to patch in three more copies of the label? I'm not sure how (or more correctly, where) to do that.

Is this a lost cause? Anyone have any suggestions? Is there a nice tool for writing zdb -l output as a new label to a new disk? Why did zpool export a pool that it can't import?

This is an experimental development system, but I'd still like to recover the data if possible.
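(For reference on the "where": each leaf vdev keeps four 256 KiB copies of its label, two at the front of the device and two in the last 512 KiB, aligned down to a 256 KiB boundary. A rough sketch of copying all four with dd follows; the device names are placeholders, it assumes ksh/bash on Solaris 10, both slices the same size with 512-byte sectors, and the copied labels would still carry the donor disk's guid and embedded checksums, so they would need the same kind of patching described above before ZFS would accept them.)

  SRC=/dev/rdsk/c6t...GOODDISKd0s0   # a healthy raidz2 member (placeholder)
  DST=/dev/rdsk/c6t...E020d0s0       # the re-partitioned disk (placeholder)

  # Slice size in 512-byte sectors, taken from slice 0's line in the VTOC.
  SECTORS=$(prtvtoc "$DST" | awk '$1 == "0" { print $5 }')

  # L0 and L1: the first two 256 KiB labels.
  dd if="$SRC" of="$DST" bs=256k count=2

  # L2 and L3: the last 512 KiB of the slice (assumes the slice size is a
  # multiple of 256 KiB, since ZFS aligns the tail labels down to 256 KiB).
  dd if="$SRC" of="$DST" bs=512 count=1024 \
     iseek=$((SECTORS - 1024)) oseek=$((SECTORS - 1024))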
It may or may not be relevant that I have another exported pool, a relic of old experiments, which believes it uses the same device (...E020d0s0), the one that has had no zfs label. But I had this pool before all of yesterday's trouble, too. If there's a way to destroy an exported pool, I'm fine with doing so, but it doesn't seem like this is part of today's problem.

Thanks for any ideas.

--
 -D.
 dgc at uchicago.edu
 NSIT University of Chicago
I had a problem with one of the labels on a disk being unavailable. I was able to recover the label (according to zdb -l) by doing an export/import, but the disk was still unavailable. Only a scrub removed the UNAVAIL status from the disk.

--
This message posted from opensolaris.org
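(For anyone retracing that sequence, the commands involved would look roughly like this; the pool name 'z' is taken from this thread and the device path is truncated as a placeholder, so treat it as a sketch rather than a transcript.)

  zpool export z
  zpool import z
  zdb -l /dev/rdsk/c6t600A0B800049F9E1...d0s0   # check that all four labels read back
  zpool scrub z
  zpool status -v z   # the UNAVAIL state reportedly cleared only after the scrub finished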
David Champion
2008-Nov-06 17:02 UTC
[zfs-discuss] after controller crash, 'one or more devices is currently unavailable'
I have a feeling I pushed people away with a long message. Let me reduce my problem to one question.

> # zpool import -f z
> cannot import 'z': one or more devices is currently unavailable
>
> 'zdb -l' shows four valid labels for each of these disks except for the
> new one. Is this what "unavailable" means, in this case?

I have now faked up a label for the disk that didn't have one and applied it with dd. Can anyone say what "unavailable" means, given that all eight disks are registered devices at the correct paths, are readable, and have labels?

--
 -D.
 dgc at uchicago.edu
 NSIT University of Chicago
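(One way to sanity-check that claim across all eight members at once is to loop zdb -l over the device nodes; a sketch, assuming the WWN-style glob below matches the pool's LUNs and accepting that it may also pick up other LUNs on the same array.)

  for dev in /dev/rdsk/c6t600A0B800049F9E1*d0s0; do
      echo "===== $dev"
      # Each readable copy prints a "LABEL n" heading; a blank or
      # unreadable disk reports failures instead.
      zdb -l "$dev" | grep -i label
  done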
Victor Latushkin
2008-Nov-07 01:32 UTC
[zfs-discuss] after controller crash, 'one or more devices is currently unavailable'
David Champion wrote:
> I have a feeling I pushed people away with a long message. Let me
> reduce my problem to one question.
>
>> # zpool import -f z
>> cannot import 'z': one or more devices is currently unavailable
>>
>> 'zdb -l' shows four valid labels for each of these disks except for the
>> new one. Is this what "unavailable" means, in this case?
>
> I have now faked up a label for the disk that didn't have one and
> applied it with dd.
>
> Can anyone say what "unavailable" means, given that all eight disks are
> registered devices at the correct paths, are readable, and have labels?

For that label to be valid, you also need to make sure that the checksum for the label is valid as well. You may be able to get a better idea of what is going on during import with the help of DTrace. See the topic "more ZFS recovery" for an example script.

regards,
victor
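(Not the script from that thread, but as an illustration of the idea: a one-liner along these lines, run while retrying the import, would show whether any vdev fails to open and with what error. It assumes the in-kernel zfs module exposes vdev_open() to fbt on this build, so treat it as a sketch.)

  # Terminal 1: report any non-zero return from vdev_open() in the zfs module.
  dtrace -n 'fbt:zfs:vdev_open:return /arg1 != 0/ { printf("vdev_open failed: error %d", arg1); }'

  # Terminal 2: retry the import while the probe is enabled.
  zpool import -f z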