Hi,

I hope someone can help, because at the moment ZFS's logic seems a little askew.

I just swapped a failing 200 GB drive that was one half of a 400 GB gstripe device, which I was using as one of the devices in a three-device raidz1. When the OS came back up after the drive had been changed, the necessary metadata was of course not on the new drive, so the stripe didn't exist. ZFS understandably complained that it couldn't open the stripe, however it did not show the array as degraded. I didn't save the output, but it was just like the behaviour described in this thread:

http://www.nabble.com/Shooting-yourself-in-the-foot-with-ZFS:-is-quite-easy-t4512790.html

I recreated the gstripe device under the same name, stripe/str1, and assumed I could just:

 # zpool replace pool stripe/str1
 invalid vdev specification
 stripe/str1 is in use (r1w1e1)

It also told me to try -f, which I did, but I was greeted with the same error. Why can I not replace a device with itself? As the man page describes exactly this procedure, I'm a little confused.

Try as I might (online, offline, scrub), I could not get the array to rebuild, just like the guy in the thread above described. I eventually resorted to recreating the stripe under a different name, stripe/str2. I could then perform a:

 # zpool replace pool stripe/str1 stripe/str2

Is there a reason I have to jump through these seemingly pointless hoops to replace a device with itself?

Many thanks.
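For reference, the workaround that finally worked is sketched below: rebuild the stripe under a new GEOM name and hand that name to zpool replace. The underlying disk names (da1, da2) are illustrative assumptions, not the actual devices:

 # gstripe label -v str2 da1 da2               (disk names assumed; str2 appears as /dev/stripe/str2)
 # zpool replace pool stripe/str1 stripe/str2
 # zpool status pool                           (watch the resilver run against the new stripe)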
Richard Elling
2007-Oct-03 17:36 UTC
[zfs-discuss] replacing a device with itself doesn't work
MP wrote:
> Hi,
> I hope someone can help, because at the moment ZFS's logic seems a little askew.
> [...]
> I recreated the gstripe device under the same name, stripe/str1, and assumed I could just:
>
> # zpool replace pool stripe/str1
> invalid vdev specification
> stripe/str1 is in use (r1w1e1)
>
> It also told me to try -f, which I did, but I was greeted with the same error.
> Why can I not replace a device with itself?
> As the man page describes exactly this procedure, I'm a little confused.
> [...]
> Is there a reason I have to jump through these seemingly pointless hoops to replace a device with itself?
> Many thanks.

Yes. From the fine manual on zpool:

     zpool replace [-f] pool old_device [new_device]

         Replaces old_device with new_device. This is equivalent
         to attaching new_device, waiting for it to resilver, and
         then detaching old_device.
         ...
         If new_device is not specified, it defaults to
         old_device. This form of replacement is useful after an
         existing disk has failed and has been physically
         replaced. In this case, the new disk may have the same
         /dev/dsk path as the old device, even though it is
         actually a different disk. ZFS recognizes this.

For a stripe, you don't have redundancy, so you cannot replace the
disk with itself. You would have to specify the [new_device].

I've submitted CR 6612596 for a better error message and CR 6612605
to mention this in the man page.
 -- richard
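To make the two forms of the command concrete, here is a minimal sketch of both invocations; the pool and device names (tank, c1t2d0, c3t1d0) are illustrative only:

 # zpool replace tank c1t2d0             (no new_device: the disk at the same path was physically swapped)
 # zpool replace tank c1t2d0 c3t1d0      (explicit new_device: resilver onto a different disk, then detach the old one)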
Richard Elling
2007-Oct-03 19:10 UTC
[zfs-discuss] replacing a device with itself doesn't work
more below...

MP wrote:
> On 03/10/2007, Richard Elling <Richard.Elling at sun.com> wrote:
> > Yes. From the fine manual on zpool:
> > [...]
> > For a stripe, you don't have redundancy, so you cannot replace the
> > disk with itself.
>
> I don't see how a stripe makes a difference. It's just two drives joined
> together logically to make a new device. It can be used by the system just
> like a normal hard drive. Just like a normal hard drive, it too has no
> redundancy?

Correct. It would be redundant if it were a mirror, raidz, or raidz2.
In the case of stripes of mirrors, raidz, or raidz2 vdevs, they are
redundant.

> > You would have to specify the [new_device].
> > I've submitted CR 6612596 for a better error message and CR 6612605
> > to mention this in the man page.
>
> Perhaps I was a little unclear. ZFS did a few things during this whole
> escapade which seemed wrong.
>
> # mdconfig -a -tswap -s64m
> md0
> # mdconfig -a -tswap -s64m
> md1
> # mdconfig -a -tswap -s64m
> md2

I presume you're not running Solaris, so please excuse me if I take a
Solaris view of this problem.

> # zpool create tank raidz md0 md1 md2
> # zpool status -v tank
>   pool: tank
>  state: ONLINE
>  scrub: none requested
> config:
>
>         NAME        STATE     READ WRITE CKSUM
>         tank        ONLINE       0     0     0
>           raidz1    ONLINE       0     0     0
>             md0     ONLINE       0     0     0
>             md1     ONLINE       0     0     0
>             md2     ONLINE       0     0     0
>
> errors: No known data errors
> # zpool offline tank md0
> Bringing device md0 offline
> # dd if=/dev/zero of=/dev/md0 bs=1m
> dd: /dev/md0: end of device
> 65+0 records in
> 64+0 records out
> 67108864 bytes transferred in 0.044925 secs (1493798602 bytes/sec)
> # zpool status -v tank
>   pool: tank
>  state: DEGRADED
> status: One or more devices has been taken offline by the administrator.
>         Sufficient replicas exist for the pool to continue functioning in a
>         degraded state.
> action: Online the device using 'zpool online' or replace the device with
>         'zpool replace'.
>  scrub: none requested
> config:
>
>         NAME        STATE     READ WRITE CKSUM
>         tank        DEGRADED     0     0     0
>           raidz1    DEGRADED     0     0     0
>             md0     OFFLINE      0     0     0
>             md1     ONLINE       0     0     0
>             md2     ONLINE       0     0     0
>
> errors: No known data errors
>
> --------------------
> At this point, where the drive is offline, a 'zpool replace tank md0' will
> fix the array.

Correct. The pool is redundant.

> However, if instead the other advice given, 'zpool online tank md0', is
> used, then problems start to occur:
> --------------------
>
> # zpool online tank md0
> # zpool status -v tank
>   pool: tank
>  state: ONLINE
> status: One or more devices could not be used because the label is missing or
>         invalid. Sufficient replicas exist for the pool to continue
>         functioning in a degraded state.
> action: Replace the device using 'zpool replace'.
>    see: http://www.sun.com/msg/ZFS-8000-4J
>  scrub: resilver completed with 0 errors on Wed Oct  3 18:44:22 2007
> config:
>
>         NAME        STATE     READ WRITE CKSUM
>         tank        ONLINE       0     0     0
>           raidz1    ONLINE       0     0     0
>             md0     UNAVAIL      0     0     0  corrupted data
>             md1     ONLINE       0     0     0
>             md2     ONLINE       0     0     0
>
> errors: No known data errors
>
> -------------
> Surely this is wrong? zpool shows the pool as 'ONLINE' and not DEGRADED,
> whereas the status explanation says that it is degraded and 'zpool replace'
> is required. That's just confusing.

I agree, I would expect the STATE to be DEGRADED.

> -------------
>
> # zpool scrub tank
> # zpool status -v tank
>   pool: tank
>  state: ONLINE
> status: One or more devices could not be used because the label is missing or
>         invalid. Sufficient replicas exist for the pool to continue
>         functioning in a degraded state.
> action: Replace the device using 'zpool replace'.
>    see: http://www.sun.com/msg/ZFS-8000-4J
>  scrub: resilver completed with 0 errors on Wed Oct  3 18:45:06 2007
> config:
>
>         NAME        STATE     READ WRITE CKSUM
>         tank        ONLINE       0     0     0
>           raidz1    ONLINE       0     0     0
>             md0     UNAVAIL      0     0     0  corrupted data
>             md1     ONLINE       0     0     0
>             md2     ONLINE       0     0     0
>
> errors: No known data errors
> # zpool replace tank md0
> invalid vdev specification
> use '-f' to override the following errors:
> md0 is in use (r1w1e1)
> # zpool replace -f tank md0
> invalid vdev specification
> the following errors must be manually repaired:
> md0 is in use (r1w1e1)
>
> -----------------
> Well, the advice of 'zpool replace' doesn't work. At this point the user is
> now stuck. There seems to be just no way to now use the existing device md0.

In Solaris NV b72, this works as you expect.

 # zpool replace zwimming /dev/ramdisk/rd1
 # zpool status -v zwimming
   pool: zwimming
  state: DEGRADED
  scrub: resilver completed with 0 errors on Wed Oct  3 11:55:36 2007
 config:

         NAME                        STATE     READ WRITE CKSUM
         zwimming                    DEGRADED     0     0     0
           raidz1                    DEGRADED     0     0     0
             replacing               DEGRADED     0     0     0
               /dev/ramdisk/rd1/old  FAULTED      0     0     0  corrupted data
               /dev/ramdisk/rd1      ONLINE       0     0     0
             /dev/ramdisk/rd2        ONLINE       0     0     0
             /dev/ramdisk/rd3        ONLINE       0     0     0

 errors: No known data errors
 # zpool status -v zwimming
   pool: zwimming
  state: ONLINE
  scrub: resilver completed with 0 errors on Wed Oct  3 11:55:36 2007
 config:

         NAME                  STATE     READ WRITE CKSUM
         zwimming              ONLINE       0     0     0
           raidz1              ONLINE       0     0     0
             /dev/ramdisk/rd1  ONLINE       0     0     0
             /dev/ramdisk/rd2  ONLINE       0     0     0
             /dev/ramdisk/rd3  ONLINE       0     0     0

 errors: No known data errors

> -----------------
> # mdconfig -a -tswap -s64m
> md3
> # zpool replace -f tank md0 md3
> # zpool status -v tank
>   pool: tank
>  state: ONLINE
>  scrub: resilver completed with 0 errors on Wed Oct  3 18:45:52 2007
> config:
>
>         NAME           STATE     READ WRITE CKSUM
>         tank           ONLINE       0     0     0
>           raidz1       ONLINE       0     0     0
>             replacing  ONLINE       0     0     0
>               md0      UNAVAIL      0     0     0  corrupted data
>               md3      ONLINE       0     0     0
>             md1        ONLINE       0     0     0
>             md2        ONLINE       0     0     0
>
> errors: No known data errors
> # zpool status -v tank
>   pool: tank
>  state: ONLINE
>  scrub: resilver completed with 0 errors on Wed Oct  3 18:45:52 2007
> config:
>
>         NAME        STATE     READ WRITE CKSUM
>         tank        ONLINE       0     0     0
>           raidz1    ONLINE       0     0     0
>             md3     ONLINE       0     0     0
>             md1     ONLINE       0     0     0
>             md2     ONLINE       0     0     0
>
> errors: No known data errors
>
> --------------------
>
> Only changing the device name of the failed component can get ZFS to
> rebuild the array. That seems wrong to me.
>
> 1. Why does zpool status say 'ONLINE' when the pool is obviously degraded?

IMHO, bug.
> 2. Why is the first advice given, 'zpool online', which does not work?

In Solaris I see:

 # zpool online zwimming /dev/ramdisk/rd1
 warning: device '/dev/ramdisk/rd1' onlined, but remains in faulted state
 use 'zpool replace' to replace devices that are no longer present

> 3. Why is the second advice given, 'zpool replace', when that doesn't work
> after the first advice has been performed?

Works in Solaris. Hopefully it is in the pipeline for *BSD.

> 4. Why do I have to use a device with a different name to get this to work?
> Surely what I did above mimics exactly what happens when a drive fails, and
> the manual says that 'zpool replace <pool> <failed-device>' will fix it?

In such cases, I would not try this while online; I would have offlined the
device before attempting the replace. But I see your point, it is confusing.
Given that Solaris seems to handle this differently, I think it is just a
matter of your release catching up.

> 5. If ZFS can access all the necessary devices in the pool, then why
> doesn't scrub fix the array?

You destroyed all of the data on the device, including the uberblocks.
AFAIK, scrub does not attempt to recreate uberblocks, which is why the
replace command exists.

I think you've identified a user interface problem that can be corrected
more automatically. What do others think? Should a scrub perform a replace
if the uberblocks are nonexistent?
 -- richard
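For anyone following along, the path that avoids the "in use" trap is the offline-first sequence shown earlier in the thread. A minimal sketch, reusing the md-based test pool from above (device numbering is illustrative):

 # zpool offline tank md0                (take the vdev out of service first)
   ... swap or recreate the underlying device ...
 # zpool replace tank md0                (resilver onto the replacement at the same path)
 # zpool status -v tank                  (the pool should return to ONLINE once the resilver completes)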
Pawel Jakub Dawidek
2007-Oct-03 20:02 UTC
[zfs-discuss] replacing a device with itself doesn't work
On Wed, Oct 03, 2007 at 12:10:19PM -0700, Richard Elling wrote:
> > # zpool replace tank md0
> > invalid vdev specification
> > use '-f' to override the following errors:
> > md0 is in use (r1w1e1)
> > # zpool replace -f tank md0
> > invalid vdev specification
> > the following errors must be manually repaired:
> > md0 is in use (r1w1e1)
> >
> > -----------------
> > Well, the advice of 'zpool replace' doesn't work. At this point the user
> > is now stuck. There seems to be just no way to now use the existing
> > device md0.
>
> In Solaris NV b72, this works as you expect.
> [...]

Good to know, but I think it's still a bit of a ZFS fault. The error message
'md0 is in use (r1w1e1)' means that something (I'm quite sure it's ZFS) keeps
the device open. Why does it keep it open when it doesn't recognize it? Or
maybe it tries to open it twice for write (exclusively) when replacing, which
is not allowed in GEOM on FreeBSD.

I can take a look at whether this is the former or the latter, but it should
be fixed in ZFS itself, IMHO.

-- 
Pawel Jakub Dawidek                       http://www.wheel.pl
pjd at FreeBSD.org                          http://www.FreeBSD.org
FreeBSD committer                         Am I Evil? Yes, I Am!
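One way to see who is holding the provider open on the FreeBSD side is to look at the GEOM access counts directly. A rough sketch, with output abbreviated and the exact fields possibly differing between releases:

 # geom md list md0
 ...
 Providers:
 1. Name: md0
    Mediasize: 67108864 (64M)
    Mode: r1w1e1                         (one reader, one writer, one exclusive consumer still attached)
 ...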
I think I might have run into the same problem. At the time I assumed I was
doing something wrong, but...

I made a b72 raidz out of three new 1 GB virtual disks in VMware. I shut the
VM off and replaced one of the disks with a new 1.5 GB virtual disk. No matter
what command I tried, I couldn't get the new disk into the array. The docs
said that replacing the vdev with itself would work, but it didn't. Nor did
setting the 'automatic replace' feature on the pool and plugging a new device
in. I recall most of the errors being "device in use".

Maybe I wasn't the problem after all? 0_o
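The 'automatic replace' feature mentioned above is the pool-level autoreplace property; a quick sketch of checking and enabling it, with the pool name being illustrative:

 # zpool get autoreplace tank
 NAME  PROPERTY     VALUE    SOURCE
 tank  autoreplace  off      default
 # zpool set autoreplace=on tank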
Pawel,
Is this a problem with ZFS trying to open the device twice?

Richard,
Yes, a scrub should fix the device. One of ZFS's features is ease of
administration. It seems to defy logic that a scrub does not fix all devices
where possible. Why make it any harder for the admin?

Cheers.
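Whether a scrub could ever repair the device comes down to whether any of the four vdev labels (and the uberblocks they carry) survive. A rough way to check from the command line, assuming the md-based test setup from earlier in the thread:

 # zdb -l /dev/md0
 --------------------------------------------
 LABEL 0
 --------------------------------------------
 failed to unpack label 0
 ...                                     (if all four labels fail to unpack, there is nothing left for a scrub to work from)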
Pawel Jakub Dawidek
2007-Oct-08 13:07 UTC
[zfs-discuss] replacing a device with itself doesn't work
On Wed, Oct 03, 2007 at 10:02:03PM +0200, Pawel Jakub Dawidek wrote:
> On Wed, Oct 03, 2007 at 12:10:19PM -0700, Richard Elling wrote:
> > [...]
> > In Solaris NV b72, this works as you expect.
> > [...]
>
> Good to know, but I think it's still a bit of a ZFS fault. The error message
> 'md0 is in use (r1w1e1)' means that something (I'm quite sure it's ZFS) keeps
> the device open. Why does it keep it open when it doesn't recognize it? Or
> maybe it tries to open it twice for write (exclusively) when replacing, which
> is not allowed in GEOM on FreeBSD.
>
> I can take a look at whether this is the former or the latter, but it should
> be fixed in ZFS itself, IMHO.

Ok, it seems that it was fixed in ZFS itself already:

	/*
	 * If we are setting the vdev state to anything but an open state, then
	 * always close the underlying device. Otherwise, we keep accessible
	 * but invalid devices open forever. We don't call vdev_close() itself,
	 * because that implies some extra checks (offline, etc) that we don't
	 * want here. This is limited to leaf devices, because otherwise
	 * closing the device will affect other children.
	 */
	if (vdev_is_dead(vd) && vd->vdev_ops->vdev_op_leaf)
		vd->vdev_ops->vdev_op_close(vd);

The ZFS version from FreeBSD-CURRENT doesn't have this code yet; it's only in
my perforce branch for now. I'll verify later today whether it really fixes
the problem, and I'll report back if not.
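A minimal sketch of the verification I have in mind, reusing the md-based test pool from earlier in the thread (device numbering is illustrative, and the expected outcome is an assumption based on Richard's Solaris b72 transcript):

 # zpool offline tank md0
 # dd if=/dev/zero of=/dev/md0 bs=1m     (wipe the labels, as in the original test)
 # zpool online tank md0
 # zpool replace tank md0                (with the vdev-close fix, md0 should no longer be reported as in use)
 # zpool status -v tank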
-- 
Pawel Jakub Dawidek                       http://www.wheel.pl
pjd at FreeBSD.org                          http://www.FreeBSD.org
FreeBSD committer                         Am I Evil? Yes, I Am!