Apologies in advance as this is a Solaris 10 question and not an
OpenSolaris issue (well, OK, it *may* also be an OpenSolaris issue).
System is a T2000 running Solaris 10U9 with the latest ZFS patches (zpool
version 22). Storage is a pile of J4400s (5 of them).

I have run into what appears to be (Sun) Bug ID 6995143, and I opened
a case with Oracle requesting to be added to that bug. I am being told
that the bug has been abandoned and that ZFS is behaving "correctly".
Here is what I am seeing:

1) zpool with multiple vdevs and hot spares
2) multiple drive failures at once
3) multiple hot spares in use (so far, only one in each vdev, but they
   are raidz2 so I suppose it could be up to 2 in each vdev)
4) after repair of the failed drives and the resilver completes, the hot
   spares stay in use

I have NOT seen the issue with only a single drive failure.
I have NOT seen the problem if the failed drive(s) is (are) replaced
BEFORE the resilver of the hot spares completes.

In other words, I have only seen the issue if there is more than one
failed drive at once and the hot spares complete resilvering before
the bad drives are repaired.

This has all been seen in our test environment, where we simulate a
drive failure by either removing a drive or disabling it (via CAM;
these are J4400 drives). This came to light during testing to
determine the resolution of another bug (SATA over SAS multipathing
driver issues).

We do have a workaround. Once the resilver of the repaired drives
completes we can 'zpool detach' the hot spare device from the zpool
(vdev) and it goes back into the AVAIL state.

Is this EXPECTED behavior for multiple drive failures?

Here is some detailed information from my latest test.

Here is the pool in a failed state (2 drives failed)...

  pool: nyc-test-01
 state: DEGRADED
status: One or more devices has been removed by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
 scrub: scrub completed after 0h0m with 0 errors on Thu Mar  3 08:41:04 2011
config:

        NAME                         STATE     READ WRITE CKSUM
        nyc-test-01                  DEGRADED     0     0     0
          raidz2-0                   DEGRADED     0     0     0
            c5t5000CCA215C8A649d0    ONLINE       0     0     0
            c5t5000CCA215C84A65d0    ONLINE       0     0     0
            c5t5000CCA215C34786d0    ONLINE       0     0     0
            spare-3                  DEGRADED     0     0     0
              c5t5000CCA215C28142d0  REMOVED      0     0     0
              c5t5000CCA215C7FD6Ed0  ONLINE       0     0     0
          raidz2-1                   DEGRADED     0     0     0
            c5t5000CCA215C8A5B5d0    ONLINE       0     0     0
            spare-1                  DEGRADED     0     0     0
              c5t5000CCA215C280F8d0  REMOVED      0     0     0
              c5t5000CCA215C83160d0  ONLINE       0     0     0
            c5t5000CCA215C34753d0    ONLINE       0     0     0
            c5t5000CCA215C34823d0    ONLINE       0     0     0
        spares
          c5t5000CCA215C7FD6Ed0      INUSE     currently in use
          c5t5000CCA215C83160d0      INUSE     currently in use

errors: No known data errors

Here is the attempt to bring one of the failed drives back online
using 'zpool replace' (after the drive was enabled), which tosses a
warning (as expected)...

> sudo zpool replace nyc-test-01 c5t5000CCA215C28142d0
Password:
invalid vdev specification
use '-f' to override the following errors:
/dev/dsk/c5t5000CCA215C28142d0s0 is part of active ZFS pool
nyc-test-01. Please see zpool(1M).

Instead of forcing the replace, I did a 'zpool online'.

Here is the pool resilvering after one of the two failed drives is
brought back online via the 'zpool online' command (while the resilver
is still running)...

  pool: nyc-test-01
 state: DEGRADED
status: One or more devices has been removed by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
 scrub: resilver in progress for 0h0m, 6.94% done, 0h4m to go
config:

        NAME                         STATE     READ WRITE CKSUM
        nyc-test-01                  DEGRADED     0     0     0
          raidz2-0                   ONLINE       0     0     0
            c5t5000CCA215C8A649d0    ONLINE       0     0     0
            c5t5000CCA215C84A65d0    ONLINE       0     0     0
            c5t5000CCA215C34786d0    ONLINE       0     0     0
            spare-3                  ONLINE       0     0     0
              c5t5000CCA215C28142d0  ONLINE       0     0     0  104M resilvered
              c5t5000CCA215C7FD6Ed0  ONLINE       0     0     0
          raidz2-1                   DEGRADED     0     0     0
            c5t5000CCA215C8A5B5d0    ONLINE       0     0     0
            spare-1                  DEGRADED     0     0     0
              c5t5000CCA215C280F8d0  REMOVED      0     0     0
              c5t5000CCA215C83160d0  ONLINE       0     0     0
            c5t5000CCA215C34753d0    ONLINE       0     0     0
            c5t5000CCA215C34823d0    ONLINE       0     0     0
        spares
          c5t5000CCA215C7FD6Ed0      INUSE     currently in use
          c5t5000CCA215C83160d0      INUSE     currently in use

errors: No known data errors

Here is the pool after the resilver has completed, but the hot spare
is still in use...

  pool: nyc-test-01
 state: DEGRADED
status: One or more devices has been removed by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
 scrub: resilver completed after 0h0m with 0 errors on Thu Mar  3 08:49:03 2011
config:

        NAME                         STATE     READ WRITE CKSUM
        nyc-test-01                  DEGRADED     0     0     0
          raidz2-0                   ONLINE       0     0     0
            c5t5000CCA215C8A649d0    ONLINE       0     0     0
            c5t5000CCA215C84A65d0    ONLINE       0     0     0
            c5t5000CCA215C34786d0    ONLINE       0     0     0
            spare-3                  ONLINE       0     0     0
              c5t5000CCA215C28142d0  ONLINE       0     0     0  302M resilvered
              c5t5000CCA215C7FD6Ed0  ONLINE       0     0     0
          raidz2-1                   DEGRADED     0     0     0
            c5t5000CCA215C8A5B5d0    ONLINE       0     0     0
            spare-1                  DEGRADED     0     0     0
              c5t5000CCA215C280F8d0  REMOVED      0     0     0
              c5t5000CCA215C83160d0  ONLINE       0     0     0
            c5t5000CCA215C34753d0    ONLINE       0     0     0
            c5t5000CCA215C34823d0    ONLINE       0     0     0
        spares
          c5t5000CCA215C7FD6Ed0      INUSE     currently in use
          c5t5000CCA215C83160d0      INUSE     currently in use

errors: No known data errors

Here is the zpool after the second failed drive has been brought back
online via 'zpool online'...
  pool: nyc-test-01
 state: ONLINE
 scrub: resilver completed after 0h0m with 0 errors on Thu Mar  3 09:04:46 2011
config:

        NAME                         STATE     READ WRITE CKSUM
        nyc-test-01                  ONLINE       0     0     0
          raidz2-0                   ONLINE       0     0     0
            c5t5000CCA215C8A649d0    ONLINE       0     0     0
            c5t5000CCA215C84A65d0    ONLINE       0     0     0
            c5t5000CCA215C34786d0    ONLINE       0     0     0
            spare-3                  ONLINE       0     0     0
              c5t5000CCA215C28142d0  ONLINE       0     0     0
              c5t5000CCA215C7FD6Ed0  ONLINE       0     0     0
          raidz2-1                   ONLINE       0     0     0
            c5t5000CCA215C8A5B5d0    ONLINE       0     0     0
            spare-1                  ONLINE       0     0     0
              c5t5000CCA215C280F8d0  ONLINE       0     0     0  46K resilvered
              c5t5000CCA215C83160d0  ONLINE       0     0     0
            c5t5000CCA215C34753d0    ONLINE       0     0     0
            c5t5000CCA215C34823d0    ONLINE       0     0     0
        spares
          c5t5000CCA215C7FD6Ed0      INUSE     currently in use
          c5t5000CCA215C83160d0      INUSE     currently in use

errors: No known data errors

Note that BOTH hot spares are still in use even though the faults have
been cleared.

Now I detach one of the hot spares...

> sudo zpool detach nyc-test-01 c5t5000CCA215C7FD6Ed0

  pool: nyc-test-01
 state: ONLINE
 scrub: resilver completed after 0h0m with 0 errors on Thu Mar  3 09:04:46 2011
config:

        NAME                         STATE     READ WRITE CKSUM
        nyc-test-01                  ONLINE       0     0     0
          raidz2-0                   ONLINE       0     0     0
            c5t5000CCA215C8A649d0    ONLINE       0     0     0
            c5t5000CCA215C84A65d0    ONLINE       0     0     0
            c5t5000CCA215C34786d0    ONLINE       0     0     0
            c5t5000CCA215C28142d0    ONLINE       0     0     0
          raidz2-1                   ONLINE       0     0     0
            c5t5000CCA215C8A5B5d0    ONLINE       0     0     0
            spare-1                  ONLINE       0     0     0
              c5t5000CCA215C280F8d0  ONLINE       0     0     0  46K resilvered
              c5t5000CCA215C83160d0  ONLINE       0     0     0
            c5t5000CCA215C34753d0    ONLINE       0     0     0
            c5t5000CCA215C34823d0    ONLINE       0     0     0
        spares
          c5t5000CCA215C7FD6Ed0      AVAIL
          c5t5000CCA215C83160d0      INUSE     currently in use

errors: No known data errors

and now the other hot spare...

> sudo zpool detach nyc-test-01 c5t5000CCA215C83160d0

  pool: nyc-test-01
 state: ONLINE
 scrub: resilver completed after 0h0m with 0 errors on Thu Mar  3 09:04:46 2011
config:

        NAME                         STATE     READ WRITE CKSUM
        nyc-test-01                  ONLINE       0     0     0
          raidz2-0                   ONLINE       0     0     0
            c5t5000CCA215C8A649d0    ONLINE       0     0     0
            c5t5000CCA215C84A65d0    ONLINE       0     0     0
            c5t5000CCA215C34786d0    ONLINE       0     0     0
            c5t5000CCA215C28142d0    ONLINE       0     0     0
          raidz2-1                   ONLINE       0     0     0
            c5t5000CCA215C8A5B5d0    ONLINE       0     0     0
            c5t5000CCA215C280F8d0    ONLINE       0     0     0  46K resilvered
            c5t5000CCA215C34753d0    ONLINE       0     0     0
            c5t5000CCA215C34823d0    ONLINE       0     0     0
        spares
          c5t5000CCA215C7FD6Ed0      AVAIL
          c5t5000CCA215C83160d0      AVAIL

errors: No known data errors

--
{--------1---------2---------3---------4---------5---------6---------7---------}
Paul Kraus
-> Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ )
-> Sound Coordinator, Schenectady Light Opera Company
   ( http://www.sloctheater.org/ )
-> Technical Advisor, RPI Players
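The workaround described above can be reduced to a short command sequence. This is a sketch only — the pool and spare names are the ones from this test, and the `zpool detach` commands are collected and printed rather than executed, so the sequence is explicit without touching a live pool:

```shell
# Workaround sketch: once the repaired drives have finished
# resilvering, detach each hot spare that is still INUSE so it
# returns to the AVAIL state. Commands are printed, not executed.
POOL=nyc-test-01

DETACH_CMDS=$(
  for SPARE in c5t5000CCA215C7FD6Ed0 c5t5000CCA215C83160d0; do
    echo "zpool detach $POOL $SPARE"
  done
)
echo "$DETACH_CMDS"
```

Running each printed line by hand (after confirming the resilver has completed in `zpool status`) is exactly the manual procedure described in the post.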
On Mar 3, 2011, at 6:45 AM, Paul Kraus wrote:

> Apologies in advance as this is a Solaris 10 question and not an
> OpenSolaris issue (well, OK, it *may* also be an OpenSolaris issue).
> System is a T2000 running Solaris 10U9 with latest ZFS patches (zpool
> version 22). Storage is a pile of J4400 (5 of them).

No problem, this is the "zfs-discuss" forum. AFAIK, there is no
Solaris-10 specific alias.

> 1) zpool with multiple vdevs and hot spares
> 2) multiple drive failures at once

In my experience, hot spares do not help with the case where the
failures are not explicitly drive failures. In those cases where I see
multiple failures at once, the root cause has never been that all of
the implicated drives are bad.

<snip>

> Is this EXPECTED behavior for multiple drive failures ?

I believe it is the right thing to do.

<snip>

> Instead of forcing the replace, I did a 'zpool online'

In the case of SAS drives, it is rare that replacing a disk with itself
can work -- a replacement disk will have a different WWN. In this
case, your test plan is incorrect and the zpool online is correct.

<snip>

> Here is the pool after the resilver has completed, but the hot spare
> is still in use...

correct

<snip>

>> sudo zpool detach nyc-test-01 c5t5000CCA215C83160d0

sudo is so old skewl :-)

<snip>

This is the expected behaviour, and IMNSHO, the best solution to the
thorny problem of sparing. This procedure allows you to manage the
intent of the replacement. Note that you can run perfectly fine in the
ONLINE state indefinitely, so the only issue is to ensure that the
administrator's intent is implemented.
 -- richard
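The distinction Richard draws can be put in command form: if the same physical disk comes back (same WWN, e.g. re-enabled via CAM), `zpool online` is the right command; only a physically new disk warrants `zpool replace`. In this sketch, `SAME_DISK` is an assumed flag introduced for illustration, and the chosen command is printed rather than executed:

```shell
# Choose between 'zpool online' (same disk returned) and
# 'zpool replace' (physically new disk). Sketch only: SAME_DISK is
# a hypothetical flag, and the command is printed, not run.
POOL=nyc-test-01
DEV=c5t5000CCA215C28142d0
SAME_DISK=yes   # assumption for this sketch: the original disk came back

if [ "$SAME_DISK" = yes ]; then
  CMD="zpool online $POOL $DEV"    # original disk back in its slot
else
  CMD="zpool replace $POOL $DEV"   # a different physical disk went in
fi
echo "$CMD"
```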
Hi Paul,

I've seen some spare stickiness too, and it's generally when I'm trying
to simulate a drive failure (like you are below) without actually
physically replacing the device.

If I actually physically replace the failed drive, the spare is
detached automatically after the new device is resilvered. I haven't
seen a multiple drive failure in our systems here so I can't comment on
that part, but I don't think this part is related to spare behavior.

The issue here is that the drive was only "disabled," you didn't
actually "physically" replace it, so ZFS throws an error:

> sudo zpool replace nyc-test-01 c5t5000CCA215C28142d0
> Password:
> invalid vdev specification
> use '-f' to override the following errors:
> /dev/dsk/c5t5000CCA215C28142d0s0 is part of active ZFS pool
> nyc-test-01. Please see zpool(1M).

If you want to rerun your test, try these steps:

1. Remove a device from the pool
2. Watch for the spare to kick in
3. Replace the "failed" device with a new physical device
4. Run the zpool replace command
5. Observe spare behavior

I don't see the spare as hung, it just needs to be detached as
described here:

http://download.oracle.com/docs/cd/E19253-01/819-5461/6n7ht6qvv/index.html#gjfbs

Thanks,

Cindy

On 03/03/11 07:45, Paul Kraus wrote:
> Apologies in advance as this is a Solaris 10 question and not an
> OpenSolaris issue (well, OK, it *may* also be an OpenSolaris issue).
> System is a T2000 running Solaris 10U9 with latest ZFS patches (zpool
> version 22). Storage is a pile of J4400 (5 of them).
<snip>
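Cindy's five steps above can be sketched as a command sequence. `NEW_DEV` is a hypothetical replacement-disk name (not from the thread), and the zpool commands are printed rather than executed:

```shell
# Cindy's suggested re-test as a command sequence (sketch only).
# NEW_DEV is a hypothetical new physical device name; commands are
# printed, not executed.
POOL=nyc-test-01
OLD_DEV=c5t5000CCA215C28142d0
NEW_DEV=c5t5000CCA215CNEWd0   # hypothetical replacement disk

# Step 1: remove OLD_DEV from the pool (physically or via CAM)
# Step 2: watch 'zpool status' for the spare to kick in
# Step 3: insert the new physical device
REPLACE_CMD="zpool replace $POOL $OLD_DEV $NEW_DEV"   # step 4
STATUS_CMD="zpool status $POOL"                       # step 5: observe
echo "$REPLACE_CMD"
echo "$STATUS_CMD"
```

If the spare behavior matches Cindy's experience, the spare should detach itself automatically after step 4's resilver completes.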
On Thu, Mar 3, 2011 at 11:48 AM, Richard Elling
<richard.elling at gmail.com> wrote:

>> 1) zpool with multiple vdevs and hot spares
>> 2) multiple drive failures at once
>
> In my experience, hot spares do not help with the case where the failures
> are not explicitly drive failures. In those cases where I see multiple failures
> at once, the root cause has never been that all of the implicated drives are bad.

With one exception, my experience agrees with yours. In general,
multiple simultaneous drive failures are almost never real drive
failures, but failures of something else that makes the drives
unavailable (such as a bad cable or a flaky SIM). The one exception
involved two 750 GB SATA drives failing within hours of each other (out
of the 120 drives in five J4400s, all purchased at the same time). This
one exception, which happened in pre-production testing, led us to test
the multiple failure case.

<snip>

>> Is this EXPECTED behavior for multiple drive failures ?
>
> I believe it is the right thing to do.

OK, so it *may* be that ZFS reacts differently to single and multiple
drive failures on purpose. My concern is that this behavior is an
unintended consequence of something.

<snip>

>> Instead of forcing the replace, I did a 'zpool online'
>
> In the case of SAS drives, it is rare that replacing a disk with itself
> can work -- a replacement disk will have a different WWN. In this
> case, your test plan is incorrect and the zpool online is correct.

I have been told by Oracle Support that in the case of SATA drives in a
J4400, the WWN is associated with the slot and NOT the drive, so a
replacement drive will have the same WWN. Empirically, I have seen both
behaviors (and I don't really like that). In some cases replacing a
drive did not cause a WWN change and in others it did. I think it
depended on the manufacturer of the drive. In other words, if a Hitachi
was replaced with a Hitachi the WWN did not change, but when a Hitachi
was replaced with a Seagate then the WWN did change.

<snip>

> This is the expected behaviour, and IMNSHO, the best solution to the thorny
> problem of sparing. This procedure allows you to manage the intent of the
> replacement. Note that you can run perfectly fine in the ONLINE state
> indefinitely, so the only issue is to ensure that the administrator's
> intent is implemented.

The only issue I see is that it keeps a hot spare busy and unavailable
to cover for a different (real?) failure. But, one would hope that the
administrator would notice and take corrective action as necessary ...

--
{--------1---------2---------3---------4---------5---------6---------7---------}
Paul Kraus
-> Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ )
-> Sound Coordinator, Schenectady Light Opera Company
   ( http://www.sloctheater.org/ )
-> Technical Advisor, RPI Players
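The concern about a spare quietly staying busy is easy to watch for by parsing `zpool status` output for spares marked INUSE. A minimal sketch — here a captured sample of the spares section from this thread stands in for live output:

```shell
# Report hot spares still marked INUSE. In production you would feed
# 'zpool status <pool>' into the awk; here a captured sample from this
# thread is used instead so the sketch is self-contained.
sample='        spares
          c5t5000CCA215C7FD6Ed0    INUSE     currently in use
          c5t5000CCA215C83160d0    AVAIL'

# Second column of a spare line is its state (INUSE / AVAIL).
inuse=$(printf '%s\n' "$sample" | awk '$2 == "INUSE" { print $1 }')
echo "spares still INUSE: $inuse"
```

A cron job built around this kind of check would flag the stuck-spare condition so the administrator can run the `zpool detach` workaround.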
On Thu, Mar 3, 2011 at 2:08 PM, Cindy Swearingen
<cindy.swearingen at oracle.com> wrote:

> I've seen some spare stickiness too, and it's generally when I'm trying to
> simulate a drive failure (like you are below) without actually
> physically replacing the device.
>
> If I actually physically replace the failed drive, the spare is
> detached automatically after the new device is resilvered. I haven't
> seen a multiple drive failure in our systems here so I can't comment on
> that part, but I don't think this part is related to spare behavior.

I will have to see if I can scare up enough loose drives to do this
(really replace the 'failed' drive). Unfortunately, I am 200 miles away
from the drives in question ...

<snip>

> If you want to rerun your test, try these steps:
>
> 1. Remove a device from the pool
> 2. Watch for the spare to kick in
> 3. Replace the "failed" device with a new physical device
> 4. Run the zpool replace command
> 5. Observe spare behavior

I *may* have enough spare drives to try this.

> I don't see the spare as hung, it just needs to be detached as described
> here:
>
> http://download.oracle.com/docs/cd/E19253-01/819-5461/6n7ht6qvv/index.html#gjfbs

--
{--------1---------2---------3---------4---------5---------6---------7---------}
Paul Kraus
-> Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ )
-> Sound Coordinator, Schenectady Light Opera Company
   ( http://www.sloctheater.org/ )
-> Technical Advisor, RPI Players
Just a shot in the dark, but could this possibly be related to my issue as posted with the subject "Nasty zfs issue"? roy ----- Original Message -----> Apologies in advance as this is a Solaris 10 question and not an > OpenSolaris issue (well, OK, it *may* also be an OpenSolaris issue). > System is a T2000 running Solaris 10U9 with latest ZFS patches (zpool > version 22). Storage is a pile of J4400 (5 of them). > > I have run into what appears to be (Sun) Bug ID 6995143, and I opened > a case with Oracle requesting to be added to that bug. I am being told > that bug had been abandoned and that ZFS is behaving "correctly". Here > is what I am seeing: > > 1) zpool with multiple vdevs and hot spares > 2) multiple drive failures at once > 3) multiple hot spares in use (so far, only one in each vdev, but they > are raidz2 so I suppose it could be up to 2 in each vdev) > 4) after repair of the failed drives and resilver completes, the hot > spares stay in use > > I have NOT seen the issue with only a single drive failure. > I have NOT seen the problem if the failed drive(s) is(are) replaced > BEFORE the resilver of the hot spares completes > > In other words, I have only seen the issue if there are more than one > failed drive at once and if the hot spares complete resilvering before > the bad drives are repaired. > > This has all been seen in our test environment, and we simulate a > drive failure by either removing a drive or disabling it (via CAM, > these are J4400 drives). This came to light due to testing to > determine resolution of another bug (SATA over SAS multipathing driver > issues). > > We do have a work around. Once the resilver of the repaired drives > completes we can ''zpool detach'' the hot spare device from the zpool > (vdev) and it goes back into the AVAIL state. > > Is this EXPECTED behavior for multiple drive failures ? > > Here is some detailed information from my latest test. > > Here is the pool in failed state (2 drives failed)... 
> pool: nyc-test-01
> state: DEGRADED
> status: One or more devices has been removed by the administrator.
>         Sufficient replicas exist for the pool to continue functioning
>         in a degraded state.
> action: Online the device using 'zpool online' or replace the device
>         with 'zpool replace'.
> scrub: scrub completed after 0h0m with 0 errors on Thu Mar 3 08:41:04 2011
> config:
>
>         NAME                         STATE     READ WRITE CKSUM
>         nyc-test-01                  DEGRADED     0     0     0
>           raidz2-0                   DEGRADED     0     0     0
>             c5t5000CCA215C8A649d0    ONLINE       0     0     0
>             c5t5000CCA215C84A65d0    ONLINE       0     0     0
>             c5t5000CCA215C34786d0    ONLINE       0     0     0
>             spare-3                  DEGRADED     0     0     0
>               c5t5000CCA215C28142d0  REMOVED      0     0     0
>               c5t5000CCA215C7FD6Ed0  ONLINE       0     0     0
>           raidz2-1                   DEGRADED     0     0     0
>             c5t5000CCA215C8A5B5d0    ONLINE       0     0     0
>             spare-1                  DEGRADED     0     0     0
>               c5t5000CCA215C280F8d0  REMOVED      0     0     0
>               c5t5000CCA215C83160d0  ONLINE       0     0     0
>             c5t5000CCA215C34753d0    ONLINE       0     0     0
>             c5t5000CCA215C34823d0    ONLINE       0     0     0
>         spares
>           c5t5000CCA215C7FD6Ed0      INUSE     currently in use
>           c5t5000CCA215C83160d0      INUSE     currently in use
>
> errors: No known data errors
>
> Here is the attempt to bring one of the failed drives back online
> using 'zpool replace' (after the drive was enabled), which tosses a
> warning (as expected)...
>
> > sudo zpool replace nyc-test-01 c5t5000CCA215C28142d0
> Password:
> invalid vdev specification
> use '-f' to override the following errors:
> /dev/dsk/c5t5000CCA215C28142d0s0 is part of active ZFS pool
> nyc-test-01. Please see zpool(1M).
>
> Instead of forcing the replace, I did a 'zpool online'.
>
> Here is the pool resilvering after one of the two failed drives is
> brought back online via the 'zpool online' command (while the resilver
> is still running)...
>
> pool: nyc-test-01
> state: DEGRADED
> status: One or more devices has been removed by the administrator.
>         Sufficient replicas exist for the pool to continue functioning
>         in a degraded state.
> action: Online the device using 'zpool online' or replace the device
>         with 'zpool replace'.
> scrub: resilver in progress for 0h0m, 6.94% done, 0h4m to go
> config:
>
>         NAME                         STATE     READ WRITE CKSUM
>         nyc-test-01                  DEGRADED     0     0     0
>           raidz2-0                   ONLINE       0     0     0
>             c5t5000CCA215C8A649d0    ONLINE       0     0     0
>             c5t5000CCA215C84A65d0    ONLINE       0     0     0
>             c5t5000CCA215C34786d0    ONLINE       0     0     0
>             spare-3                  ONLINE       0     0     0
>               c5t5000CCA215C28142d0  ONLINE       0     0     0  104M resilvered
>               c5t5000CCA215C7FD6Ed0  ONLINE       0     0     0
>           raidz2-1                   DEGRADED     0     0     0
>             c5t5000CCA215C8A5B5d0    ONLINE       0     0     0
>             spare-1                  DEGRADED     0     0     0
>               c5t5000CCA215C280F8d0  REMOVED      0     0     0
>               c5t5000CCA215C83160d0  ONLINE       0     0     0
>             c5t5000CCA215C34753d0    ONLINE       0     0     0
>             c5t5000CCA215C34823d0    ONLINE       0     0     0
>         spares
>           c5t5000CCA215C7FD6Ed0      INUSE     currently in use
>           c5t5000CCA215C83160d0      INUSE     currently in use
>
> errors: No known data errors
>
> Here is the pool after the resilver has completed, but the hot spare
> is still in use...
>
> pool: nyc-test-01
> state: DEGRADED
> status: One or more devices has been removed by the administrator.
>         Sufficient replicas exist for the pool to continue functioning
>         in a degraded state.
> action: Online the device using 'zpool online' or replace the device
>         with 'zpool replace'.
> scrub: resilver completed after 0h0m with 0 errors on Thu Mar 3 08:49:03 2011
> config:
>
>         NAME                         STATE     READ WRITE CKSUM
>         nyc-test-01                  DEGRADED     0     0     0
>           raidz2-0                   ONLINE       0     0     0
>             c5t5000CCA215C8A649d0    ONLINE       0     0     0
>             c5t5000CCA215C84A65d0    ONLINE       0     0     0
>             c5t5000CCA215C34786d0    ONLINE       0     0     0
>             spare-3                  ONLINE       0     0     0
>               c5t5000CCA215C28142d0  ONLINE       0     0     0  302M resilvered
>               c5t5000CCA215C7FD6Ed0  ONLINE       0     0     0
>           raidz2-1                   DEGRADED     0     0     0
>             c5t5000CCA215C8A5B5d0    ONLINE       0     0     0
>             spare-1                  DEGRADED     0     0     0
>               c5t5000CCA215C280F8d0  REMOVED      0     0     0
>               c5t5000CCA215C83160d0  ONLINE       0     0     0
>             c5t5000CCA215C34753d0    ONLINE       0     0     0
>             c5t5000CCA215C34823d0    ONLINE       0     0     0
>         spares
>           c5t5000CCA215C7FD6Ed0      INUSE     currently in use
>           c5t5000CCA215C83160d0      INUSE     currently in use
>
> errors: No known data errors
>
> Here is the zpool after the second failed drive has been brought back
> online via 'zpool online'...
>
> pool: nyc-test-01
> state: ONLINE
> scrub: resilver completed after 0h0m with 0 errors on Thu Mar 3 09:04:46 2011
> config:
>
>         NAME                         STATE     READ WRITE CKSUM
>         nyc-test-01                  ONLINE       0     0     0
>           raidz2-0                   ONLINE       0     0     0
>             c5t5000CCA215C8A649d0    ONLINE       0     0     0
>             c5t5000CCA215C84A65d0    ONLINE       0     0     0
>             c5t5000CCA215C34786d0    ONLINE       0     0     0
>             spare-3                  ONLINE       0     0     0
>               c5t5000CCA215C28142d0  ONLINE       0     0     0
>               c5t5000CCA215C7FD6Ed0  ONLINE       0     0     0
>           raidz2-1                   ONLINE       0     0     0
>             c5t5000CCA215C8A5B5d0    ONLINE       0     0     0
>             spare-1                  ONLINE       0     0     0
>               c5t5000CCA215C280F8d0  ONLINE       0     0     0  46K resilvered
>               c5t5000CCA215C83160d0  ONLINE       0     0     0
>             c5t5000CCA215C34753d0    ONLINE       0     0     0
>             c5t5000CCA215C34823d0    ONLINE       0     0     0
>         spares
>           c5t5000CCA215C7FD6Ed0      INUSE     currently in use
>           c5t5000CCA215C83160d0      INUSE     currently in use
>
> errors: No known data errors
>
> Note that BOTH hot spares are still in use even though the faults have
> been cleared.
>
> Now I detach one of the hot spares...
> > sudo zpool detach nyc-test-01 c5t5000CCA215C7FD6Ed0
>
> pool: nyc-test-01
> state: ONLINE
> scrub: resilver completed after 0h0m with 0 errors on Thu Mar 3 09:04:46 2011
> config:
>
>         NAME                         STATE     READ WRITE CKSUM
>         nyc-test-01                  ONLINE       0     0     0
>           raidz2-0                   ONLINE       0     0     0
>             c5t5000CCA215C8A649d0    ONLINE       0     0     0
>             c5t5000CCA215C84A65d0    ONLINE       0     0     0
>             c5t5000CCA215C34786d0    ONLINE       0     0     0
>             c5t5000CCA215C28142d0    ONLINE       0     0     0
>           raidz2-1                   ONLINE       0     0     0
>             c5t5000CCA215C8A5B5d0    ONLINE       0     0     0
>             spare-1                  ONLINE       0     0     0
>               c5t5000CCA215C280F8d0  ONLINE       0     0     0  46K resilvered
>               c5t5000CCA215C83160d0  ONLINE       0     0     0
>             c5t5000CCA215C34753d0    ONLINE       0     0     0
>             c5t5000CCA215C34823d0    ONLINE       0     0     0
>         spares
>           c5t5000CCA215C7FD6Ed0      AVAIL
>           c5t5000CCA215C83160d0      INUSE     currently in use
>
> errors: No known data errors
>
> and now the other hot spare...
>
> > sudo zpool detach nyc-test-01 c5t5000CCA215C83160d0
>
> pool: nyc-test-01
> state: ONLINE
> scrub: resilver completed after 0h0m with 0 errors on Thu Mar 3 09:04:46 2011
> config:
>
>         NAME                         STATE     READ WRITE CKSUM
>         nyc-test-01                  ONLINE       0     0     0
>           raidz2-0                   ONLINE       0     0     0
>             c5t5000CCA215C8A649d0    ONLINE       0     0     0
>             c5t5000CCA215C84A65d0    ONLINE       0     0     0
>             c5t5000CCA215C34786d0    ONLINE       0     0     0
>             c5t5000CCA215C28142d0    ONLINE       0     0     0
>           raidz2-1                   ONLINE       0     0     0
>             c5t5000CCA215C8A5B5d0    ONLINE       0     0     0
>             c5t5000CCA215C280F8d0    ONLINE       0     0     0  46K resilvered
>             c5t5000CCA215C34753d0    ONLINE       0     0     0
>             c5t5000CCA215C34823d0    ONLINE       0     0     0
>         spares
>           c5t5000CCA215C7FD6Ed0      AVAIL
>           c5t5000CCA215C83160d0      AVAIL
>
> errors: No known data errors
>
> --
> {--------1---------2---------3---------4---------5---------6---------7---------}
> Paul Kraus
> -> Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ )
> -> Sound Coordinator, Schenectady Light Opera Company ( http://www.sloctheater.org/ )
> -> Technical Advisor, RPI Players
> _______________________________________________
> zfs-discuss mailing list
> zfs-discuss at opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

--
Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
roy at karlsbakk.net
http://blogg.karlsbakk.net/
--
In all pedagogy it is essential that the curriculum be presented
intelligibly. It is an elementary imperative for all pedagogues to avoid
excessive use of idioms of foreign origin. In most cases, adequate and
relevant synonyms exist in Norwegian.
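[Editor's sketch] The manual workaround described in the quoted message (detach every spare still INUSE once the repaired drives have resilvered) can be semi-automated by parsing `zpool status` output. The snippet below prints, rather than runs, the needed `zpool detach` commands; the embedded sample is the spares section from the thread, and in real use you would pipe in live `zpool status nyc-test-01` output instead.

```shell
#!/bin/sh
# Print the 'zpool detach' commands needed to free stuck hot spares.
POOL=nyc-test-01

# Sample spares section from the thread; stands in for live status output.
status_sample() {
cat <<'EOF'
        spares
          c5t5000CCA215C7FD6Ed0    INUSE     currently in use
          c5t5000CCA215C83160d0    INUSE     currently in use
EOF
}

# Field 2 is the state column; INUSE rows name spares still attached.
status_sample |
awk '$2 == "INUSE" {print $1}' |
while read dev; do
    echo "zpool detach $POOL $dev"
done
```

Review the printed commands before running them; only detach spares after confirming the resilver of the repaired drives has completed, as the thread notes.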
On Thu, Mar 3, 2011 at 4:28 PM, Roy Sigurd Karlsbakk <roy at karlsbakk.net> wrote:

> Just a shot in the dark, but could this possibly be related to my issue
> as posted with the subject "Nasty zfs issue"?

I do not think they are directly related. I have seen some odd behavior
when I replace a failed drive before the resilver completes, but nothing
as dramatic as what you saw. I have also seen at least as many cases of
completely normal behavior when replacing a failed drive before the hot
spare finishes resilvering (the scrub restarts and resilvers both the
hot spare and the replacement drive). My standard practice is to wait
for the hot spare to finish resilvering before replacing the failed
drive, just to be safe.

--
Paul Kraus
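[Editor's sketch] The "wait for the hot spare to finish resilvering" practice above can be expressed as a small polling helper. The function only inspects `zpool status` text; it assumes the Solaris 10 wording seen in this thread ("resilver in progress" vs. "resilver completed"), and the commented-out loop shows intended real-world use.

```shell
#!/bin/sh
# resilver_done STATUS_TEXT
# Returns success (0) once the status text no longer reports an active
# resilver. Pattern matching assumes the wording shown in this thread.
resilver_done() {
    case "$1" in
        *"resilver in progress"*) return 1 ;;
        *) return 0 ;;
    esac
}

# Intended use (commented out so this sketch has no side effects):
# while ! resilver_done "$(zpool status nyc-test-01)"; do
#     sleep 60    # poll once a minute until the spare finishes resilvering
# done
# zpool replace nyc-test-01 c5t5000CCA215C280F8d0
```

The polling interval and the follow-up `zpool replace` are illustrative; the device name is one of the failed drives from the thread.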
> > Just a shot in the dark, but could this possibly be related to my
> > issue as posted with the subject "Nasty zfs issue"?
>
> I do not think they are directly related. I have seen some odd
> behavior when I replace a failed drive before the resilver completes,
> but nothing as dramatic as what you saw. I have also seen at least as
> many cases of completely normal behavior when replacing a failed drive
> before the hot spare finishes resilvering (the scrub restarts and
> resilvers both the hot spare and the replacement drive). My standard
> practice is to wait for the hot spare to finish resilvering before
> replacing the failed drive, just to be safe.

That's my plan for tomorrow :)

I'll be doing some more tests on a test system to see if I can
reproduce this error of mine. If I can, I'll file a bug (for Illumos,
primarily).

Vennlige hilsener / Best regards

roy
Roy Sigurd Karlsbakk
2011-Mar-03 22:14 UTC
[zfs-discuss] Use of spares? (Was: Hung Hot Spare)
> > > Just a shot in the dark, but could this possibly be related to my
> > > issue as posted with the subject "Nasty zfs issue"?
> >
> > I do not think they are directly related. I have seen some odd
> > behavior when I replace a failed drive before the resilver
> > completes, but nothing as dramatic as what you saw. I have also seen
> > at least as many cases of completely normal behavior when replacing
> > a failed drive before the hot spare finishes resilvering (the scrub
> > restarts and resilvers both the hot spare and the replacement
> > drive). My standard practice is to wait for the hot spare to finish
> > resilvering before replacing the failed drive, just to be safe.
>
> That's my plan for tomorrow :)
>
> I'll be doing some more tests on a test system to see if I can
> reproduce this error of mine. If I can, I'll file a bug (for Illumos,
> primarily)

Another thing I've seen in the lab: if I have a raidz2 vdev and two
drives fail and are replaced by hot spares, the vdev fails if a third
drive fails, even though the spares are functional. After the failed
drive is resilvered, the pool operates normally. Wouldn't it be better
for a spare to work as a full member of the pool instead of just being a
legitimate spare?

Vennlige hilsener / Best regards

roy
Fred Liu
2011-Mar-04 00:34 UTC
[zfs-discuss] performance of whole pool suddenly degrades for a while and recovers when one file system tries to exceed the quota
Hi,

Has anyone seen this? I hit it every time; it feels just like somebody
in a car suddenly stepping on the brake pedal and then releasing it.

Thanks.

Fred