First off, I don't have the exact failure messages here, and I did not take
good notes of the failures, so I will do the best I can. Please try and give
me advice anyway.

I have a 7 drive raidz1 pool with 500GB drives, and I wanted to replace them
all with 2TB drives. Immediately I ran into trouble. If I tried:

    zpool offline brick <device>

I got a message like: insufficient replicas

I tried:

    zpool replace brick <old device> <new device>

and I got something like: <new device> must be a single disk

I finally got replace and offline to work by:

    zpool export brick
    [reboot]
    zpool import brick

and then:

    zpool offline brick <old device>
    zpool replace brick <old device> <new device>

This worked. zpool status showed the replace in progress, and after about 26
hours of resilvering everything looked fine: the <old device> was gone, and
there were no errors in the pool.

Then I tried to do it again with the next device, but this time I missed the
"zpool offline" step. Immediately I started getting disk errors on both the
drive I was replacing and the first drive I had replaced. At this point I was
starting to panic, so I shut down the machine to make sure the drive cables
were plugged in properly. When the machine came back up, the replace that had
just started was done: a resilver was in progress, but the replace no longer
showed up. Now I am getting disk errors on two drives, and I have more than
200,000 data errors (zpool status -v).

I still have the two original drives; they are in good shape and should still
have all the data on them. Can I somehow put my original zpool back? How?
Please help!

Also, what do you think went wrong here?
Mark J Musante
2010-Aug-10 19:26 UTC
[zfs-discuss] zfs replace problems please please help
On Tue, 10 Aug 2010, seth keith wrote:

> first off I don't have the exact failure messages here, and I did not
> take good notes of the failures, so I will do the best I can. Please
> try and give me advice anyway.
>
> I have a 7 drive raidz1 pool with 500GB drives, and I wanted to replace
> them all with 2TB drives. Immediately I ran into trouble. If I tried:
>
> zpool offline brick <device>

Were you doing an in-place replace? i.e. pulling out the old disk and
putting in the new one?

> I got a message like: insufficient replicas

This means that there was a problem with the pool already. When ZFS opens
a pool, it looks at the disks that are part of that pool. For raidz1, if
more than one disk is unopenable, then the pool will report that there are
"no valid replicas", which is probably the error message you saw.

If that's the case, then your pool already had one failed drive in it, and
you were attempting to disable a second drive. Do you have a copy of the
output from "zpool status brick" from before you tried your experiment?

> I tried to
>
> zpool replace brick <old device> <new device>
>
> and I got something like: <new device> must be a single disk

Unfortunately, this just means that we got back an EINVAL from the kernel,
which could mean any one of a number of things, but probably there was an
issue with calculating the drive size. I'd try plugging it in separately
and using 'format' to see how big Solaris thinks the drive is.

> I finally got replace and offline to work by:
>
> zpool export brick
> [reboot]
> zpool import brick

Probably didn't need to reboot there.

> now
>
> zpool offline brick <old device>
> zpool replace brick <old device> <new device>

If you use this form of the replace command, you don't need to offline the
old disk first. You only need to offline a disk if you're going to pull it
out, and then you can do an in-place replace just by issuing "zpool replace
brick <device-you-swapped>".

> This worked. zpool status showed replacing in progress, and then after
> about 26 hours of resilvering, everything looked fine. The <old device>
> was gone, and no errors in the pool. Now I tried to do it again with
> the next device. I missed the "zpool offline" part however.
> Immediately, I started getting disk errors on both the drive I was
> replacing and the first drive I replaced.

Read errors? Write errors? Checksum errors? Sounds like a full scrub would
have been a good idea prior to replacing the second disk.

> I have the two original drives, they are in good shape and should still
> have all the data on them, can I somehow put my original zpool back.
> How? Please help!

You can try exporting the pool, plugging in the original drives, and then
doing a recovery on it. See the zpool manpage under "zpool import" for the
recovery options and what the flags mean.
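For example, assuming your build supports the import recovery option (a
sketch only; double-check the flags against the zpool manpage on your
system):

    # zpool export brick
      (swap the original 500GB drives back in)
    # zpool import -F -n brick   # dry run: reports whether recovery would work
    # zpool import -F brick      # recovery import; may rewind to an older txg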
First off, double thanks for replying to my post. I tried your advice, but
something is way wrong. I have all the 2TB drives disconnected and the 7
500GB drives connected. All 7 show up in the BIOS and in format. Here they
are, the original 7 500GB drives:

# format
Searching for disks...done

AVAILABLE DISK SELECTIONS:
       0. c3d0 <DEFAULT cyl 4859 alt 2 hd 255 sec 63>
          /pci@0,0/pci8086,3a40@1c/pci-ide@0/ide@1/cmdk@0,0
       1. c4d0 <Maxtor 7-H81AYZ5-0001-465.76GB>
          /pci@0,0/pci-ide@1f,2/ide@0/cmdk@0,0
       2. c4d1 <WDC WD50- WD-WCAS8323204-0001-465.76GB>
          /pci@0,0/pci-ide@1f,2/ide@0/cmdk@1,0
       3. c6d0 <WDC WD50- WD-WCAS8510568-0001-465.76GB>
          /pci@0,0/pci-ide@1f,2/ide@1/cmdk@0,0
       4. c6d1 <WDC WD50- WD-WCAUF149175-0001-465.76GB>
          /pci@0,0/pci-ide@1f,2/ide@1/cmdk@1,0
       5. c7d0 <Maxtor 7-H81DM5X-0001-465.76GB>
          /pci@0,0/pci-ide@1f,5/ide@0/cmdk@0,0
       6. c12d0 <WDC WD50- WD-WCAUH024469-0001-465.76GB>
          /pci@0,0/pci8086,244e@1e/pci-ide@1/ide@1/cmdk@0,0
       7. c13d0 <WDC WD50- WD-WCAS8415731-0001-465.76GB>
          /pci@0,0/pci8086,244e@1e/pci-ide@1/ide@0/cmdk@0,0

Now clear out brick:

# zpool export brick
# zpool status
  pool: rpool
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     0
          c3d0s0    ONLINE       0     0     0

errors: No known data errors

Then an error on the import:

# zpool import -F brick
cannot open 'brick': I/O error

Now there is a pool, but the drives are wrong:

# zpool status
  pool: brick
 state: UNAVAIL
status: One or more devices could not be used because the label is missing
        or invalid.  There are insufficient replicas for the pool to
        continue functioning.
action: Destroy and re-create the pool from a backup source.
   see: http://www.sun.com/msg/ZFS-8000-5E
 scrub: none requested
config:

        NAME           STATE     READ WRITE CKSUM
        brick          UNAVAIL      0     0     0  insufficient replicas
          raidz1       UNAVAIL      0     0     0  insufficient replicas
            c13d0      ONLINE       0     0     0
            c4d0       ONLINE       0     0     0
            c7d0       ONLINE       0     0     0
            c4d1       ONLINE       0     0     0
            replacing  UNAVAIL      0     0     0  insufficient replicas
              c15t0d0  UNAVAIL      0     0     0  cannot open
              c11t0d0  UNAVAIL      0     0     0  cannot open
            c12d0      FAULTED      0     0     0  corrupted data
            c6d0       ONLINE       0     0     0

What I want is to remove c15t0d0 and c11t0d0 and replace them with the
original c6d1. Suggestions?
Mark J Musante
2010-Aug-11 12:03 UTC
[zfs-discuss] zfs replace problems please please help
On Tue, 10 Aug 2010, seth keith wrote:

> # zpool status
>   pool: brick
>  state: UNAVAIL
> status: One or more devices could not be used because the label is
>         missing or invalid.  There are insufficient replicas for the
>         pool to continue functioning.
> action: Destroy and re-create the pool from a backup source.
>    see: http://www.sun.com/msg/ZFS-8000-5E
>  scrub: none requested
> config:
>
>         NAME           STATE     READ WRITE CKSUM
>         brick          UNAVAIL      0     0     0  insufficient replicas
>           raidz1       UNAVAIL      0     0     0  insufficient replicas
>             c13d0      ONLINE       0     0     0
>             c4d0       ONLINE       0     0     0
>             c7d0       ONLINE       0     0     0
>             c4d1       ONLINE       0     0     0
>             replacing  UNAVAIL      0     0     0  insufficient replicas
>               c15t0d0  UNAVAIL      0     0     0  cannot open
>               c11t0d0  UNAVAIL      0     0     0  cannot open
>             c12d0      FAULTED      0     0     0  corrupted data
>             c6d0       ONLINE       0     0     0
>
> What I want is to remove c15t0d0 and c11t0d0 and replace them with the
> original c6d1. Suggestions?

Do the labels still exist on c6d1? e.g. what do you get from
"zdb -l /dev/rdsk/c6d1s0"?

If the label still exists, and the pool guid is the same as the labels on
the other disks, you could try doing a "zpool detach brick c15t0d0" (or
c11t0d0), then export & try re-importing. ZFS may find c6d1 at that point.
There's no way to guarantee that'll work.
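Spelled out, that sequence would look something like this (a sketch only;
as noted, there is no guarantee it will work):

    # zdb -l /dev/rdsk/c6d1s0 | grep pool_guid  # should match the other disks
    # zpool detach brick c15t0d0                # or c11t0d0
    # zpool export brick
    # zpool import brick                        # ZFS may pick up c6d1 here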
> -----Original Message-----
> From: Mark J Musante [mailto:Mark.Musante@oracle.com]
> Sent: Wednesday, August 11, 2010 5:03 AM
> To: Seth Keith
> Cc: zfs-discuss@opensolaris.org
> Subject: Re: [zfs-discuss] zfs replace problems please please help
>
> On Tue, 10 Aug 2010, seth keith wrote:
>
> > # zpool status
> >   pool: brick
> >  state: UNAVAIL
> > status: One or more devices could not be used because the label is
> >         missing or invalid.  There are insufficient replicas for the
> >         pool to continue functioning.
> > action: Destroy and re-create the pool from a backup source.
> >    see: http://www.sun.com/msg/ZFS-8000-5E
> >  scrub: none requested
> > config:
> >
> >         NAME           STATE     READ WRITE CKSUM
> >         brick          UNAVAIL      0     0     0  insufficient replicas
> >           raidz1       UNAVAIL      0     0     0  insufficient replicas
> >             c13d0      ONLINE       0     0     0
> >             c4d0       ONLINE       0     0     0
> >             c7d0       ONLINE       0     0     0
> >             c4d1       ONLINE       0     0     0
> >             replacing  UNAVAIL      0     0     0  insufficient replicas
> >               c15t0d0  UNAVAIL      0     0     0  cannot open
> >               c11t0d0  UNAVAIL      0     0     0  cannot open
> >             c12d0      FAULTED      0     0     0  corrupted data
> >             c6d0       ONLINE       0     0     0
> >
> > What I want is to remove c15t0d0 and c11t0d0 and replace them with
> > the original c6d1. Suggestions?
>
> Do the labels still exist on c6d1? e.g. what do you get from
> "zdb -l /dev/rdsk/c6d1s0"?
>
> If the label still exists, and the pool guid is the same as the labels
> on the other disks, you could try doing a "zpool detach brick c15t0d0"
> (or c11t0d0), then export & try re-importing. ZFS may find c6d1 at that
> point. There's no way to guarantee that'll work.

When I do a zdb -l /dev/rdsk/<any device>, I get the same output for all
my drives in the pool, but I don't think it looks right:

# zdb -l /dev/rdsk/c4d0
--------------------------------------------
LABEL 0
--------------------------------------------
failed to unpack label 0
--------------------------------------------
LABEL 1
--------------------------------------------
failed to unpack label 1
--------------------------------------------
LABEL 2
--------------------------------------------
failed to unpack label 2
--------------------------------------------
LABEL 3
--------------------------------------------
failed to unpack label 3

If I try this zpool detach action, can it be reversed if there is a
problem?
Mark J Musante
2010-Aug-11 18:44 UTC
[zfs-discuss] zfs replace problems please please help
On Wed, 11 Aug 2010, Seth Keith wrote:

> When I do a zdb -l /dev/rdsk/<any device>, I get the same output for
> all my drives in the pool, but I don't think it looks right:
>
> # zdb -l /dev/rdsk/c4d0

What about /dev/rdsk/c4d0s0?
This is for newbies like myself: I was using 'zdb -l' wrong. Just giving it
the drive name from 'zpool status' or format, like c6d1, didn't work. I
needed to add s0 to the end:

    zdb -l /dev/dsk/c6d1s0

gives me a good looking label (I think). The pool_guid values are the same
for all the drives. I see the first 500GB drive I replaced has "children"
that are all 500GB drives. The second 500GB drive I replaced has one 2TB
child. All the other drives have two 2TB children.

I managed to detach one of the drives being replaced, but I could not
detach the other two 2TB drives. I exported and imported, and now my pool
looks like this:

  pool: brick
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: none requested
config:

        NAME                      STATE     READ WRITE CKSUM
        brick                     DEGRADED     0     0     0
          raidz1                  DEGRADED     0     0     0
            c13d0                 ONLINE       0     0     0
            c4d0                  ONLINE       0     0     0
            c7d0                  ONLINE       0     0     0
            c4d1                  ONLINE       0     0     0
            14607330800900413650  UNAVAIL      0     0     0  was /dev/dsk/c15t0d0s0
            c11t1d0               ONLINE       0     0     0
            c6d0                  ONLINE       0     0     0

errors: 352808 data errors, use '-v' for a list

Is there some way I can take the original zpool label from the first 500GB
drive I replaced and use it to fix up the other drives in the pool? What
are my options here...
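Also for the newbies: one way to check that the pool_guid matches on every
drive is a little loop like this (adjust the device names to your own pool):

    # for d in c13d0 c4d0 c7d0 c4d1 c6d0 c6d1; do
    >   echo $d; zdb -l /dev/dsk/${d}s0 | grep pool_guid
    > done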
Mark J Musante
2010-Aug-11 19:45 UTC
[zfs-discuss] zfs replace problems please please help
On Wed, 11 Aug 2010, seth keith wrote:

>         NAME                      STATE     READ WRITE CKSUM
>         brick                     DEGRADED     0     0     0
>           raidz1                  DEGRADED     0     0     0
>             c13d0                 ONLINE       0     0     0
>             c4d0                  ONLINE       0     0     0
>             c7d0                  ONLINE       0     0     0
>             c4d1                  ONLINE       0     0     0
>             14607330800900413650  UNAVAIL      0     0     0  was /dev/dsk/c15t0d0s0
>             c11t1d0               ONLINE       0     0     0
>             c6d0                  ONLINE       0     0     0

OK, that's good - your missing disk can be replaced with a brand new disk
using "zpool replace brick 14607330800900413650 <disk name>". Then wait for
the resilver to complete, and do a full scrub to be on the safe side.

> errors: 352808 data errors, use '-v' for a list
>
> Is there some way I can take the original zpool label from the first
> 500GB drive I replaced and use it to fix up the other drives in the
> pool?

No. The files with errors can only be restored from any backups you made.

If there is an original disk that's not part of your pool, you might want
to try making a backup of it, plugging it in, and seeing if a zpool
export/zpool import will find it. But it will only find it if zdb -l shows
four valid labels.
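For example, assuming the new 2TB disk shows up as c15t0d0 (substitute
whatever name 'format' reports for your disk):

    # zpool replace brick 14607330800900413650 c15t0d0
    # zpool status brick    # watch until the resilver completes
    # zpool scrub brick     # then run the full scrub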