David Champion
2008-Oct-30 20:04 UTC
[zfs-discuss] after controller crash, 'one or more devices is currently unavailable'
There are a lot of hits for this error in Google, but I've had trouble identifying any that resemble my situation. I apologize if you've answered it before. If it's better for me to open a case with Sun Support, I can do that, but I'm hoping to cheat my way around the system so that I don't have to send somebody Explorer output before they escalate it. Seems more efficient in the long run. :)

Most of my tale of woe is background: I have a pool running under Solaris 10 5/08. It's an 8-member raidz2 whose volumes are on a 2540 array with two controllers. Volumes are mapped 1:1 with physical disks. I didn't really want a 2540, but I couldn't get anyone to swear to me that any other fiber-channel product would work with Solaris. I'm using fiber multipathing.

I've had two disk failures in the past two weeks. Last week I replaced the first. No problems with ZFS initially; a 'zpool replace' did the right thing. Yesterday I replaced the second. But while investigating the problem I noticed that two of my paths had gone down, so that 6 disks had both paths attached and 2 disks had only one path. At this time, 'zpool status' showed:

  pool: z
 state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas exist
        for the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://www.sun.com/msg/ZFS-8000-D3
 scrub: resilver completed with 0 errors on Fri Oct 24 20:04:51 2008
config:

        NAME                                         STATE     READ WRITE CKSUM
        z                                            DEGRADED     0     0     0
          raidz2                                     DEGRADED     0     0     0
            c6t600A0B800049F9E10000030548B3DF1Ed0s0  ONLINE       0     0     0
            c6t600A0B800049F9E10000030848B3DF52d0s0  ONLINE       0     0     0
            c6t600A0B800049F9E10000030B48B3DF7Ed0s0  ONLINE       0     0     0
            c6t600A0B800049F9E10000030E48B3DFA6d0s0  ONLINE       0     0     0
            c6t600A0B800049F9E10000031148B3DFD2d0s0  ONLINE       0     0     0
            c6t600A0B800049F9E10000031448B3DFFAd0s0  ONLINE       0     0     0
            c6t600A0B800049F9E10000031748B3E020d0s0  UNAVAIL      0     0     0  cannot open
            c6t600A0B800049F9E10000031A48B3E04Cd0s0  ONLINE       0     0     0

(At the time I hadn't figured it out, but I believe now that the one disk was UNAVAIL because the disk had not been properly partitioned yet, so s0 was undefined.)

Solaris 10's mpath support seems so far to be fairly intolerant of reconfiguration without a reboot, and I wasn't ready to reboot yet, but I thought I'd try resetting the controller that wasn't attached to all of the disks. But it appears that for some reason the CAM software reset both controllers simultaneously. The whole pool went into an error state, and all disks became unavailable. Very annoying, but not a problem for zfs-discuss. At this time, 'zpool status' showed:

  pool: z
 state: FAULTED
status: One or more devices could not be opened.  There are insufficient
        replicas for the pool to continue functioning.
action: Attach the missing device and online it using 'zpool online'.
   see: http://www.sun.com/msg/ZFS-8000-D3
 scrub: none requested
config:

        NAME                                         STATE     READ WRITE CKSUM
        z                                            FAULTED      0     0     0  corrupted data
          raidz2                                     DEGRADED     0     0     0
            c6t600A0B800049F9E10000030548B3DF1Ed0s0  UNAVAIL      0     0     0  corrupted data
            c6t600A0B800049F9E10000030848B3DF52d0s0  UNAVAIL      0     0     0  corrupted data
            c6t600A0B800049F9E10000030B48B3DF7Ed0s0  UNAVAIL      0     0     0  corrupted data
            c6t600A0B800049F9E10000030E48B3DFA6d0s0  UNAVAIL      0     0     0  cannot open
            c6t600A0B800049F9E10000031148B3DFD2d0s0  UNAVAIL      0     0     0  corrupted data
            c6t600A0B800049F9E10000031448B3DFFAd0s0  UNAVAIL      0     0     0  corrupted data
            c6t600A0B800049F9E10000031748B3E020d0s0  UNAVAIL      0     0     0  cannot open
            c6t600A0B800049F9E10000031A48B3E04Cd0s0  UNAVAIL      0     0     0  corrupted data

I don't know whether there's any chance of recovering this, but I wanted to try. I reset the 2540 again, but still no communication with Solaris. I rebooted the server, and communications resumed. I had to do some further repair/reconfig on the 2540 for the two disks marked 'cannot open', but it was a minor issue and worked fine. Solaris was then able to see all my disks.

Now we come to the main point. I still hadn't figured out the partitioning problem on ...E020d0s0 yet. It didn't occur to me because I believed that to be a spare disk which I had already partitioned, and let's face it, "I/O error" can be anything. I was wrong, though: I had already used the spare the previous week, and this one was unformatted. But lacking any other ideas, I tried to export and then import the pool. The export went fine, without complaint. I corrected the partitioning on the replacement disk at this point. But now when I try to import:

# zpool import
  pool: z
    id: 1372922273220982501
 state: FAULTED
status: One or more devices contains corrupted data.
action: The pool cannot be imported due to damaged devices or data.
        The pool may be active on another system, but can be imported using
        the '-f' flag.
   see: http://www.sun.com/msg/ZFS-8000-5E
config:

        z                                            FAULTED   corrupted data
          raidz2                                     ONLINE
            c6t600A0B800049F9E10000030548B3DF1Ed0s0  UNAVAIL   corrupted data
            c6t600A0B800049F9E10000030848B3DF52d0s0  UNAVAIL   corrupted data
            c6t600A0B800049F9E10000030B48B3DF7Ed0s0  UNAVAIL   corrupted data
            c6t600A0B800049F9E10000030E48B3DFA6d0s0  UNAVAIL   corrupted data
            c6t600A0B800049F9E10000031148B3DFD2d0s0  UNAVAIL   corrupted data
            c6t600A0B800049F9E10000031448B3DFFAd0s0  UNAVAIL   corrupted data
            c6t600A0B800049F9E10000031748B3E020d0s0  UNAVAIL   corrupted data
            c6t600A0B800049F9E10000031A48B3E04Cd0s0  UNAVAIL   corrupted data

# zpool import z
cannot import 'z': pool may be in use from other system
use '-f' to import anyway
# zpool import -f z
cannot import 'z': one or more devices is currently unavailable

'zdb -l' shows four valid labels for each of these disks except for the new one. Is this what "unavailable" means, in this case?

Growing adventurous, or perhaps just desperate, I used 'dd' to copy the first label from one of the good disks to the new one, which lacked any labels. Then I binary-edited the label to patch in the correct guid for that disk. (I got the correct guid from the zdb -l output.) I still get the same results from zpool import, though. Is this because I need to patch in three more copies of the label? I'm not sure how (or more correctly, where) to do that.

Is this a lost cause? Anyone have any suggestions? Is there a nice tool for writing zdb -l output as a new label to a new disk? Why did zpool export a pool that it can't import?

This is an experimental development system, but I'd still like to recover the data if possible.
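(For reference on the "where": each leaf vdev keeps four 256 KiB copies of its label, two at the front of the device and two in the last 512 KiB, aligned down to a 256 KiB boundary. A rough sketch of copying all four with dd follows; the device names are placeholders, it assumes ksh/bash on Solaris 10, both slices the same size with 512-byte sectors, and the copied labels would still carry the donor disk's guid and embedded checksums, so they would need the same kind of patching described above before ZFS would accept them.)

  SRC=/dev/rdsk/c6t...GOODDISKd0s0   # a healthy raidz2 member (placeholder)
  DST=/dev/rdsk/c6t...E020d0s0       # the re-partitioned disk (placeholder)

  # Slice size in 512-byte sectors, taken from slice 0's line in the VTOC.
  SECTORS=$(prtvtoc "$DST" | awk '$1 == "0" { print $5 }')

  # L0 and L1: the first two 256 KiB labels.
  dd if="$SRC" of="$DST" bs=256k count=2

  # L2 and L3: the last 512 KiB of the slice (assumes the slice size is a
  # multiple of 256 KiB, since ZFS aligns the tail labels down to 256 KiB).
  dd if="$SRC" of="$DST" bs=512 count=1024 \
     iseek=$((SECTORS - 1024)) oseek=$((SECTORS - 1024))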
It may or may not be relevant that I have another exported pool, a relic of old experiments, which believes it uses the same device (...E020d0s0), the one that has had no zfs label. But I had this pool before all of yesterday's trouble, too. If there's a way to destroy an exported pool, I'm fine with doing so, but it doesn't seem like this is part of today's problem.

Thanks for any ideas.

--
 -D.
 dgc at uchicago.edu
 NSIT University of Chicago
I had a problem with one of the labels on a disk being unavailable. I was able to recover the label (according to zdb -l) by doing an export/import, but the disk was still unavailable. Only a scrub removed the UNAVAIL status from the disk.

--
This message posted from opensolaris.org
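(For anyone retracing that sequence, the commands involved would look roughly like this; the pool name 'z' is taken from this thread and the device path is truncated as a placeholder, so treat it as a sketch rather than a transcript.)

  zpool export z
  zpool import z
  zdb -l /dev/rdsk/c6t600A0B800049F9E1...d0s0   # check that all four labels read back
  zpool scrub z
  zpool status -v z   # the UNAVAIL state reportedly cleared only after the scrub finished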
David Champion
2008-Nov-06 17:02 UTC
[zfs-discuss] after controller crash, 'one or more devices is currently unavailable'
I have a feeling I pushed people away with a long message. Let me reduce my problem to one question.

> # zpool import -f z
> cannot import 'z': one or more devices is currently unavailable
>
> 'zdb -l' shows four valid labels for each of these disks except for the
> new one. Is this what "unavailable" means, in this case?

I have now faked up a label for the disk that didn't have one and applied it with dd. Can anyone say what "unavailable" means, given that all eight disks are registered devices at the correct paths, are readable, and have labels?

--
 -D.
 dgc at uchicago.edu
 NSIT University of Chicago
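(One way to sanity-check that claim across all eight members at once is to loop zdb -l over the device nodes; a sketch, assuming the WWN-style glob below matches the pool's LUNs and accepting that it may also pick up other LUNs on the same array.)

  for dev in /dev/rdsk/c6t600A0B800049F9E1*d0s0; do
      echo "===== $dev"
      # Each readable copy prints a "LABEL n" heading; a blank or
      # unreadable disk reports failures instead.
      zdb -l "$dev" | grep -i label
  done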
Victor Latushkin
2008-Nov-07 01:32 UTC
[zfs-discuss] after controller crash, 'one or more devices is currently unavailable'
David Champion wrote:
> I have a feeling I pushed people away with a long message. Let me
> reduce my problem to one question.
>
>> # zpool import -f z
>> cannot import 'z': one or more devices is currently unavailable
>>
>> 'zdb -l' shows four valid labels for each of these disks except for the
>> new one. Is this what "unavailable" means, in this case?
>
> I have now faked up a label for the disk that didn't have one and
> applied it with dd.
>
> Can anyone say what "unavailable" means, given that all eight disks are
> registered devices at the correct paths, are readable, and have labels?

For that label to be valid, you also need to make sure that the checksum for the label is valid as well. You may be able to get a better idea of what is going on during import with the help of DTrace. See the topic "more ZFS recovery" for an example script.

regards,
victor
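(Not the script from that thread, but as an illustration of the idea: a one-liner along these lines, run while retrying the import, would show whether any vdev fails to open and with what error. It assumes the in-kernel zfs module exposes vdev_open() to fbt on this build, so treat it as a sketch.)

  # Terminal 1: report any non-zero return from vdev_open() in the zfs module.
  dtrace -n 'fbt:zfs:vdev_open:return /arg1 != 0/ { printf("vdev_open failed: error %d", arg1); }'

  # Terminal 2: retry the import while the probe is enabled.
  zpool import -f z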