Tuomas Leikola
2010-Sep-27 10:13 UTC
[zfs-discuss] Resilver endlessly restarting at completion
Hi!

My home server had some disk outages due to flaky cabling and whatnot, and started resilvering to a spare disk. During this, another disk or two dropped and were reinserted into the array. So no devices were actually lost; they were just intermittently away for a while each.

The situation is currently as follows:

  pool: tank
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: resilver in progress for 5h33m, 22.47% done, 19h10m to go
config:

        NAME                       STATE     READ WRITE CKSUM
        tank                       ONLINE       0     0     0
          raidz1-0                 ONLINE       0     0     0
            c11t1d0p0              ONLINE       0     0     0
            c11t2d0                ONLINE       0     0     5
            c11t6d0p0              ONLINE       0     0     0
            spare-3                ONLINE       0     0     0
              c11t3d0p0            ONLINE       0     0     0  106M resilvered
              c9d1                 ONLINE       0     0     0  104G resilvered
            c11t4d0p0              ONLINE       0     0     0
            c11t0d0p0              ONLINE       0     0     0
            c11t5d0p0              ONLINE       0     0     0
            c11t7d0p0              ONLINE       0     0     0  93.6G resilvered
          raidz1-2                 ONLINE       0     0     0
            c6t2d0                 ONLINE       0     0     0
            c6t3d0                 ONLINE       0     0     0
            c6t4d0                 ONLINE       0     0     0  2.50K resilvered
            c6t5d0                 ONLINE       0     0     0
            c6t6d0                 ONLINE       0     0     0
            c6t7d0                 ONLINE       0     0     0
            c6t1d0                 ONLINE       0     0     1
        logs
          /dev/zvol/dsk/rpool/log  ONLINE       0     0     0
        cache
          c6t0d0p0                 ONLINE       0     0     0
        spares
          c9d1                     INUSE     currently in use

errors: No known data errors

And this has been going on for a week now, always restarting when it should complete.

The questions in my mind at the moment:

1. How can I determine the cause for each resilver? Is there a log?

2. Why does it resilver the same data over and over, and not just the changed bits?

3. Can I force-remove c9d1, as it is no longer needed and c11t3 can be resilvered instead?

I'm running OpenSolaris build 134, but the event originally happened on build 111b. I upgraded and tried quiescing snapshots and I/O, none of which helped.
I've already ordered some new hardware to recreate this entire array as raidz2, among other things, but there's about a week of time during which I can run debuggers and traces if instructed to.

- Tuomas
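On question 1: there is no cleartext resilver log, but one workaround is to poll `zpool status` and record the `scrub:` line, so a restart shows up as the percent-done figure dropping. The sketch below assumes the pool name `tank` and an arbitrary log path; the `pct_done` helper is an illustrative extraction, not a ZFS facility.

```shell
#!/bin/sh
# Sketch only: watch for resilver restarts by logging progress over time.
# Pool name "tank" and the log path are assumptions for this example.

# Extract the percent-done figure from a "scrub:" status line such as
# "scrub: resilver in progress for 5h33m, 22.47% done, 19h10m to go"
pct_done() {
    grep -o '[0-9][0-9.]*% done' | cut -d'%' -f1
}

# Polling loop (commented out; would run against the live pool):
# while true; do
#     printf '%s ' "$(date '+%F %T')"
#     zpool status tank | grep 'scrub:'
#     sleep 300
# done >> /var/tmp/resilver.log
```

Each restart then appears in the log as a timestamped drop in the percentage, which can be correlated with device events in /var/adm/messages.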
Tuomas Leikola
2010-Sep-29 17:13 UTC
[zfs-discuss] Resilver endlessly restarting at completion
The endless resilver problem still persists on OI b147. It restarts when it should complete.

I see no other solution than to copy the data to safety and recreate the array. Any hints would be appreciated, as that takes days unless I can stop or pause the resilvering.

On Mon, Sep 27, 2010 at 1:13 PM, Tuomas Leikola <tuomas.leikola at gmail.com> wrote:

> Hi!
>
> My home server had some disk outages due to flaky cabling and whatnot,
> and started resilvering to a spare disk. During this another disk or two
> dropped, and were reinserted into the array. So no devices were actually
> lost, they just were intermittently away for a while each.
>
> [zpool status output snipped; see the first message above]
>
> And this has been going on for a week now, always restarting when it
> should complete.
>
> The questions in my mind atm:
>
> 1. How can I determine the cause for each resilver? Is there a log?
>
> 2. Why does it resilver the same data over and over, and not just the
>    changed bits?
>
> 3. Can I force remove c9d1 as it is no longer needed but c11t3 can be
>    resilvered instead?
>
> I'm running opensolaris 134, but the event originally happened on 111b.
> I upgraded and tried quiescing snapshots and IO, none of which helped.
>
> I've already ordered some new hardware to recreate this entire array as
> raidz2 among other things, but there's about a week of time when I can
> run debuggers and traces if instructed to.
>
> - Tuomas
George Wilson
2010-Sep-29 18:01 UTC
[zfs-discuss] Resilver endlessly restarting at completion
Answers below...

Tuomas Leikola wrote:
> The endless resilver problem still persists on OI b147. Restarts when it
> should complete.
>
> I see no other solution than to copy the data to safety and recreate the
> array. Any hints would be appreciated as that takes days unless I can
> stop or pause the resilvering.
>
> [quoted original message and zpool status output snipped]
>
> The questions in my mind atm:
>
> 1. How can I determine the cause for each resilver? Is there a log?

If you're running OI b147 then you should be able to do the following:

    # echo "::zfs_dbgmsg" | mdb -k > /var/tmp/dbg.out

Send me the output.

> 2. Why does it resilver the same data over and over, and not just
>    the changed bits?

If you're having drives fail prior to the initial resilver finishing, then it will restart and do all the work over again. Are drives still failing randomly for you?

> 3. Can I force remove c9d1 as it is no longer needed but c11t3 can
>    be resilvered instead?

You can detach the spare and let the resilver work on only c11t3. Can you send me the output of 'zdb -dddd tank 0'?

Thanks,
George
Tuomas Leikola
2010-Sep-29 19:06 UTC
[zfs-discuss] Resilver endlessly restarting at completion
Thanks for taking an interest. Answers below.

On Wed, Sep 29, 2010 at 9:01 PM, George Wilson <george.r.wilson at oracle.com> wrote:

>> 1. How can I determine the cause for each resilver? Is there a log?
>
> If you're running OI b147 then you should be able to do the following:
>
>     # echo "::zfs_dbgmsg" | mdb -k > /var/tmp/dbg.out
>
> Send me the output.

Sending verbose output in a separate email. I'm not very familiar with this, but it does show some "restarting" lines.

>> 2. Why does it resilver the same data over and over, and not just the
>>    changed bits?
>
> If you're having drives fail prior to the initial resilver finishing then
> it will restart and do all the work over again. Are drives still failing
> randomly for you?

Drives haven't been dropping since the initial incidents. It has run to completion a few times now without (visible) issues with the drives. Then again, I think there is some magic that reinserts a device back into the array after an intermittent SATA disconnection.

>> 3. Can I force remove c9d1 as it is no longer needed but c11t3 can be
>>    resilvered instead?
>
> You can detach the spare and let the resilver work on only c11t3. Can you
> send me the output of 'zdb -dddd tank 0'?

The detach command complains that there are not enough replicas. Of course, I could physically remove the device, at which point a scrub would suffice (the disks must be relatively well up to date by now).

Sending zdb output in a separate mail as soon as it completes.
Tuomas Leikola
2010-Oct-05 09:10 UTC
[zfs-discuss] Resilver endlessly restarting at completion
This seems to have been a false alarm; sorry for that. As soon as I started paying attention (logging zpool status, peeking around with zdb and mdb), the resilver didn't restart unless provoked. A cleartext log would have been nice ("restarted due to c11t7 becoming online").

A slight problem I can see is that a resilver always restarts if a device is added to the array. In my case, devices were absent for a short period (some SATA failure that corrected itself by running cfgadm -c disconnect and connect), and it would have been beneficial to let the resilver run to completion and only restart it after that, to resilver the missing data on the added device. ZFS does have some intelligence for these cases, so that not all data is resilvered, only blocks born after the outage.

Also, as I had a spare in the array, it kicked in, which was probably not what I would have wanted, as it triggered a full resilver rather than a partial one. After the fact, I could not kick the spare out, and could not make the resilvering process forget about doing a full resilver. Plus, now I have to replace it back out and make it a cold spare again. But all's well that ends well... mostly. Devices still seem to be dropping from the SATA bus randomly. Maybe I'll cough together a report and post it to storage-discuss.

On Wed, Sep 29, 2010 at 8:13 PM, Tuomas Leikola <tuomas.leikola at gmail.com> wrote:

> The endless resilver problem still persists on OI b147. Restarts when it
> should complete.
>
> I see no other solution than to copy the data to safety and recreate the
> array. Any hints would be appreciated as that takes days unless I can
> stop or pause the resilvering.
>
> [earlier messages and zpool status output snipped]
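The final cleanup step described above (returning c9d1 to cold-spare duty once the resilver has finished) would look roughly like the sketch below. It uses the device and pool names from this thread and is not a verified procedure; check your own `zpool status` before running anything.

```shell
#!/bin/sh
# Sketch only: return the hot spare to cold standby after the resilver
# completes. Device names (tank, c9d1) are taken from this thread.

# True if the given 'zpool status' text still reports a resilver running.
resilver_running() {
    grep -q 'resilver in progress'
}

# Intended sequence (commented out; would run against the live pool):
# if ! zpool status tank | resilver_running; then
#     zpool detach tank c9d1        # drop the now-redundant spare copy
#     zpool add tank spare c9d1     # re-add it as a cold spare
# fi
```

This matches George's advice that the spare can be detached; attempting the detach while the resilver is still in progress is what produced the "not enough replicas" complaint earlier in the thread.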