Hello all,

SHORT VERSION:

What conditions can cause the resilvering process to reset? My lost-and-found disk can't get back into the pool because the resilvers keep restarting...

LONG VERSION:

I maintain a number of older systems, among them a Sun Thumper (X4500) running OpenSolaris SXCE with nowadays-smallish 250GB disks and a ZPOOL v14 (the OS supports ZPOOL v16). This year it has "lost contact" with two different drives, though a poweroff followed by poweron restored the contact both times. The first time, a couple of months ago, the system successfully used the hotspare disk with no issues worth mentioning or remembering.

This week, however, the hotspare took days to get used and ultimately didn't, with pool CKSUM errors growing into the millions on the lost disk and dmesg clobbered with messages about the lost drive. There was not even a message about resilvering in "zpool status", other than the hotspare being marked "IN USE". After a reboot the disk was re-found and the resilver went humming along, with the hotspare still marked as in use. The error count is non-zero but stable overnight, and no data loss was reported:

  pool: pond
 state: ONLINE
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scrub: resilver in progress for 0h49m, 9.52% done, 7h46m to go
config:

        NAME          STATE     READ WRITE CKSUM
        pond          ONLINE       0     0     0
...
          raidz1      ONLINE       0     0     0
            c0t1d0    ONLINE       0     0     0
            spare     ONLINE       0     0    46
              c1t2d0  ONLINE       0     0     0  17.3G resilvered
              c5t6d0  ONLINE       0     0     0
            c4t3d0    ONLINE       0     0     0
            c6t5d0    ONLINE       0     0     0
            c7t6d0    ONLINE       0     0     0
...
        spares
          c5t6d0      INUSE     currently in use

errors: No known data errors

The problem I come here with is that the resilver does not complete. It got reset after about 40, 84 and 80GB of resilvered progress and restarted from scratch each time. I hoped that it would at least progress further on each attempt, but that does not seem to be true as of the last two restarts.

The pool is indeed active, used by zfs-auto-snapshots in particular (I have disabled them for now while asking - they also perform very slowly, with some "zfs destroy" invocations overlapping each other and failing due to missing targets). There are no correlating errors in dmesg that would point me at a problem, and the CKSUM count stays the same. So far I guess that the stale disk may be found referencing some transactions that are gone from the live pool (e.g. via deletion of automatic snapshots), but I didn't think that should cause a restart of the whole resilvering process?..

The system must be quite fragmented, I guess, because the resilver of a 250GB disk is promised to complete in about 10 hours after each restart - even if 80GB are already behind it. "zpool iostat 60" shows low figures, about 1MB/s writes and about 5MB/s reads; "iostat -xnz 1" also shows low utilization on the drives in general (about 5-10% busy, 500-900KB/s reads), but with a bigger peak on the recovering raidz's drives and the hotspare disk.
But even these peak numbers seem to stay well within the drives' IOPS and bandwidth limits:

# iostat -xzn c0t1d0 c1t2d0 c5t6d0 c4t3d0 c6t5d0 c7t6d0 1
                    extended device statistics
    r/s    w/s     kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
  107.5   21.1   5200.2    93.1  1.3  0.9   10.0    6.9  22  37 c0t1d0
  107.7   21.2   5200.7    93.2  1.2  0.9    9.6    7.0  22  37 c4t3d0
  107.2   21.1   5199.6    93.2  1.3  0.9   10.0    7.1  22  38 c6t5d0
  107.6   21.1   5199.2    93.2  1.2  0.9    9.5    7.3  21  37 c7t6d0
    1.5  200.0     71.3  4574.1  0.5  0.4    2.7    1.9  10  16 c1t2d0
  106.9   21.2   5124.4    93.1  1.1  0.9    8.8    7.1  20  35 c5t6d0
                    extended device statistics
    r/s    w/s     kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
  177.9    0.0   4251.1     0.0  1.2  0.4    6.9    2.0  28  35 c0t1d0
  156.8    0.0   4074.2     0.0  1.1  0.4    6.8    2.3  25  35 c4t3d0
  156.8    0.0   4048.1     0.0  0.4  1.1    2.6    6.8  12  33 c6t5d0
  156.8    0.0   4188.3     0.0  1.2  0.4    7.6    2.6  30  41 c7t6d0
    0.0  343.8      0.0  2927.3  0.1  0.1    0.4    0.2   5   6 c1t2d0
  167.9    0.0   4017.4     0.0  1.3  0.4    7.7    2.3  29  37 c5t6d0
                    extended device statistics
    r/s    w/s     kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
  113.0  114.0   2588.9   134.5  1.7  0.3    7.4    1.5  29  33 c0t1d0
   79.0  122.0   2777.0   144.0  1.6  0.4    7.9    1.8  33  37 c4t3d0
   98.0  122.0   2775.0   144.5  0.6  1.1    2.8    4.9  17  35 c6t5d0
  125.0  125.0   2946.0   144.0  1.5  0.3    6.0    1.3  27  33 c7t6d0
    0.0  395.1      0.0  2363.4  0.3  0.1    0.9    0.2   7   9 c1t2d0
  131.0  114.0   3091.5   138.5  1.5  0.4    6.0    1.5  30  36 c5t6d0
                    extended device statistics
    r/s    w/s     kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
   82.0   80.0   1715.7    62.0  0.5  0.2    3.4    1.3   9  21 c0t1d0
   75.0   69.0   1673.8    57.5  0.7  0.2    4.8    1.4  11  20 c4t3d0
   68.0   69.0   1607.8    58.5  0.3  0.5    1.9    3.9   8  22 c6t5d0
   71.0   74.0   1539.8    57.0  0.5  0.2    3.1    1.2   8  17 c7t6d0
    0.0  205.0      0.0  1079.3  0.2  0.0    0.9    0.2   4   4 c1t2d0
   72.0   77.0   1679.8    60.0  0.6  0.2    3.9    1.3  12  19 c5t6d0

What conditions can cause the reset of the resilvering process? And how can it be sped up to completion, other than by disabling the zfs auto-snapshots? There are no other heavy users of the system, though several VMs and webapp servers run off the pool's data and cannot be shut down...

The knobs I know from OI did not work here:

# echo zfs_resilver_delay/W0t0 | mdb -kw
mdb: failed to dereference symbol: unknown symbol name
# echo zfs_resilver_min_time_ms/W0t20000 | mdb -kw
mdb: failed to dereference symbol: unknown symbol name

Thanks for any ideas,
//Jim Klimov
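Since the OI-era knob names do not exist on this build, here is a rough sketch of how one might first check what snv_117 actually exposes, and quiet the auto-snapshot services for the duration. The SMF instance names below are the usual ones but are an assumption here; verify with svcs before disabling anything:

# List scrub/resilver-related kernel symbols this build knows about
# (tunable names differ between snv_117 and later OpenIndiana kernels)
echo ::nm | mdb -k | egrep -i 'resilver|scrub'

# Temporarily (until reboot) disable the zfs-auto-snapshot SMF instances
# while the resilver runs; instance names assumed, check with:
#   svcs -a | grep auto-snapshot
for i in frequent hourly daily weekly monthly; do
    svcadm disable -t auto-snapshot:$i
done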
On Fri, 11 May 2012, Jim Klimov wrote:

> Hello all,
>
> SHORT VERSION:
>
> What conditions can cause the reset of the resilvering
> process? My lost-and-found disk can't get back into the
> pool because of resilvers restarting...

I recall that with sufficiently old vintage zfs, resilver would restart if a snapshot was taken. What sort of zfs is being used here?

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
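For reference, the version question can be answered directly on the box with something along these lines (pool name as used in this thread):

# Pool on-disk format version
zpool get version pond
# Filesystem version supported by the running kernel, plus any out-of-date datasets
zfs upgrade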
2012-05-11 17:18, Bob Friesenhahn wrote:
> On Fri, 11 May 2012, Jim Klimov wrote:
>
>> Hello all,
>>
>> SHORT VERSION:
>>
>> What conditions can cause the reset of the resilvering
>> process? My lost-and-found disk can't get back into the
>> pool because of resilvers restarting...
>
> I recall that with sufficiently old vintage zfs, resilver would restart
> if a snapshot was taken. What sort of zfs is being used here?

Maybe sufficiently old: SXCE snv_117, with "zpool upgrade" reporting the version as "14" for this pool (as detailed in the long version of my message).

It is also possible that auto-snapshots are being taken and deleted and thus interfering with the resilver; however, the resilver restarts only after a few hours, while the auto-snapshots run on a 5-minute schedule over about a thousand datasets. With the lags in place, though, I am not sure they all succeed at that frequency ;)

Thanks,
//Jim
2012-05-11 17:18, Bob Friesenhahn wrote:
> On Fri, 11 May 2012, Jim Klimov wrote:
>
>> Hello all,
>>
>> SHORT VERSION:
>>
>> What conditions can cause the reset of the resilvering
>> process? My lost-and-found disk can't get back into the
>> pool because of resilvers restarting...
>
> I recall that with sufficiently old vintage zfs, resilver would restart
> if a snapshot was taken. What sort of zfs is being used here?
>
> Bob

Well, for the night I rebooted the machine into single-user mode, to rule out zones, crontabs and networked abusers, but I still get resilvering resets every now and then, about once an hour.

I'm now trying a run with all zfs datasets unmounted, in the hope that helps somewhat... I'm growing puzzled now. I wonder if things would get better or worse if I kicked one of the drives (i.e. the hotspare c5t6d0) out of the equation:

          raidz1      ONLINE       0     0     0
            c0t1d0    ONLINE       0     0     0
            spare     ONLINE       0     0     0
              c1t2d0  ONLINE       0     0     0  6.72G resilvered
              c5t6d0  ONLINE       0     0     0
            c4t3d0    ONLINE       0     0     0
            c6t5d0    ONLINE       0     0     0
            c7t6d0    ONLINE       0     0     0

Open to ideas and explanations,
thanks,
//Jim
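For what it's worth, quiescing the data pool for a test like the one described above amounts to roughly the following sketch (the root pool stays mounted; dataset layout assumed):

# Unmount every ZFS filesystem that is not busy
zfs unmount -a
# Verify what is still mounted from the data pool
zfs list -o name,mounted -r pond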
> -----Original Message-----
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Jim Klimov
> Sent: 12 May 2012 01:27
> Cc: zfs-discuss at opensolaris.org
> Subject: Re: [zfs-discuss] Resilver restarting several times
>
> 2012-05-11 17:18, Bob Friesenhahn wrote:
> > On Fri, 11 May 2012, Jim Klimov wrote:
> >
> >> Hello all,
> >>
> >> SHORT VERSION:
> >>
> >> What conditions can cause the reset of the resilvering process? My
> >> lost-and-found disk can't get back into the pool because of resilvers
> >> restarting...
> >
> > I recall that with sufficiently old vintage zfs, resilver would
> > restart if a snapshot was taken. What sort of zfs is being used here?
> >
> > Bob
>
> Well, for the night I rebooted the machine into single-user mode, to rule out
> zones, crontabs and networked abusers, but I still get resilvering resets every
> now and then, about once an hour.
>
> I'm now trying a run with all zfs datasets unmounted, hope that helps
> somewhat... I'm growing puzzled now.

To double check that no snapshots, etc. are being created, run:

  zpool history -il pond

--
Best regards,
 Robert Milkowski
 http://milek.blogspot.com
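A possible way to narrow that history output down to the interesting events and line them up against the resilver restart times (the grep pattern is just a guess at what matters here):

# Snapshot/destroy/scrub events from the internal log, newest last
zpool history -il pond | egrep -i 'scrub|snapshot|destroy' | tail -40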
2012-05-12 5:50, Robert Milkowski wrote:
>>>> What conditions can cause the reset of the resilvering process? My
>>>> lost-and-found disk can't get back into the pool because of resilvers
>>>> restarting...
>
>> Well, for the night I rebooted the machine into single-user mode, to rule out
>> zones, crontabs and networked abusers, but I still get resilvering resets every
>> now and then, about once an hour.
>>
>> I'm now trying a run with all zfs datasets unmounted, hope that helps
>> somewhat... I'm growing puzzled now.
>
> To double check that no snapshots, etc. are being created run: zpool history -il pond

Hmmm, I should have thought of that - but didn't. Anyhow, the pool history reports many "internal scrub" events starting and ending, with times closely matching the resilver restarts I am seeing. Since the reboot into single-user mode there were no other snapshots, but while the system was fully up there were occasional "internal snapshots" as well. I'll post the last several lines, in hopes that someone can make more sense of them.

Overall the practical question is whether the disk will make it back into the live pool (ultimately with no continuous resilvering), and how fast that can be done - I don't want to risk the big pool with non-redundant arrays for too long. It has already taken 2 days of trying to resilver a 250GB disk into the pool, but it never made it past 100GB of progress. :( No errors are reported that I'd see, either... :)

2012-05-12.00:50:36 [internal pool scrub done txg:91070404] complete=0 [user root on thumper]
2012-05-12.00:50:38 [internal pool scrub txg:91070404] func=1 mintxg=41 maxtxg=91051854 [user root on thumper]
2012-05-12.01:19:00 [internal pool scrub done txg:91070477] complete=0 [user root on thumper]
2012-05-12.01:19:02 [internal pool scrub txg:91070477] func=1 mintxg=41 maxtxg=91051854 [user root on thumper]
2012-05-12.01:46:28 [internal pool scrub done txg:91070571] complete=0 [user root on thumper]
2012-05-12.01:46:30 [internal pool scrub txg:91070571] func=1 mintxg=41 maxtxg=91051854 [user root on thumper]
2012-05-12.02:45:52 [internal snapshot txg:91071279] dataset = 58650 [user daemon on thumper]
2012-05-12.02:45:56 [internal snapshot txg:91071280] dataset = 58652 [user jim on thumper]
2012-05-12.02:46:15 [internal snapshot txg:91071283] dataset = 58654 [user daemon on thumper]
2012-05-12.02:53:01 [internal pool scrub done txg:91071298] complete=0 [user root on thumper]
2012-05-12.02:53:03 [internal pool scrub txg:91071298] func=1 mintxg=41 maxtxg=91051854 [user root on thumper]

The system was rebooted into single-user mode around 3:03am; subsequent entries seem to be only about resilvers of different lengths - if that is what the "internal scrubs" are:

2012-05-12.03:06:09 [internal pool scrub done txg:91071322] complete=0 [user root on thumper]
2012-05-12.03:06:11 [internal pool scrub txg:91071322] func=1 mintxg=41 maxtxg=91051854 [user root on thumper]
2012-05-12.04:09:35 [internal pool scrub done txg:91071448] complete=0 [user root on thumper]
2012-05-12.04:09:37 [internal pool scrub txg:91071448] func=1 mintxg=41 maxtxg=91051854 [user root on thumper]
2012-05-12.05:42:43 [internal pool scrub done txg:91071631] complete=0 [user root on thumper]
2012-05-12.05:42:45 [internal pool scrub txg:91071631] func=1 mintxg=41 maxtxg=91051854 [user root on thumper]
2012-05-12 4:26, Jim Klimov wrote:
> Wonder if things would get better or worse if I kick one of the
> drives (i.e. hotspare c5t6d0) out of the equation:
>
>           raidz1      ONLINE       0     0     0
>             c0t1d0    ONLINE       0     0     0
>             spare     ONLINE       0     0     0
>               c1t2d0  ONLINE       0     0     0  6.72G resilvered
>               c5t6d0  ONLINE       0     0     0
>             c4t3d0    ONLINE       0     0     0
>             c6t5d0    ONLINE       0     0     0
>             c7t6d0    ONLINE       0     0     0

While googling around for similar reports, I found the rumour that in a situation like mine the original disk and the (partially resilvered) hot spare can conflict, causing the restarts I am seeing, and one poster on a FreeBSD list reported that kicking the spare out of the pool had helped.

Can some real zfs gurus please confirm or deny the rumour, before I do a possibly stupid thing by kicking the hotspare drive out of the equation (and the pool)?

Also, would booting from an OI LiveCD help resilver the pool in one go, as it is (without upgrading the pool on-disk), and would the snv_117 server be able to work with it afterwards? ;)

Thanks,
//Jim
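For the record, if the spare does have to be taken out of the picture, the commands involved would presumably be the standard ones sketched below (device names as in this pool; this is a guess at the procedure, not a recommendation):

# Option 1: drop the partially-resilvered spare and let c1t2d0 finish on its own
zpool detach pond c5t6d0

# Option 2: promote the spare to a full replacement for the flaky disk,
# then return the old disk (or a fresh one) as the new hotspare
zpool replace pond c1t2d0 c5t6d0
zpool detach pond c1t2d0
zpool add pond spare c1t2d0
zpool status -v pond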
2012-05-11 14:22, Jim Klimov wrote:
> What conditions can cause the reset of the resilvering
> process? My lost-and-found disk can't get back into the
> pool because of resilvers restarting...

FOLLOW-UP AND NEW QUESTIONS

Here is a new piece of evidence - I've finally got something out of fmdump: a series of several (5) retries ending with a fail, dated 75 seconds before the resilvers restart (more below). Not a squeak in zpool status, nor dmesg, nor /dev/console.

I guess I must assume that the disk is dying indeed, losing its connection or something like that after a random time (my resilvers restart after 15min-5hrs), and at least a run of the SMART long self-test is in order, while the pool should try to rebuild onto another disk (the hotspare) instead of trying to update this one which was in the pool.

Anyhow, the information on the ex-pool disk is likely unused anyway - from iostat I see that it is only written to, with few to zero reads for minutes at a time - so I'd lose nothing by replacing it with a blank drive (that's strange, though)...

I also guess that the disk gets re-found after something like an unlogged bus reset or whatever, and this event causes the resilvering to restart from scratch.

Q: Would this be the same in OI_151a, or would it continue resilvering from where it left off? I think I had the pool exported once during a resilver, and it restarted from the same percentage counter, so it is possible ;)

The best course of action would be to get those people to fully replace the untrustworthy disk... Or at least pull and push it a bit - maybe its contacts just got plain dirty/oxidized and the disk should be re-seated in the enclosure...

I'd like someone to please confirm or deny my hypotheses and guesses :)

DETAILS

According to format, the disk in the tailed fmdump reports below is indeed the one I'm trying to resilver back into the pool:

# format | gegrep -B1 '/pci@0,0/pci1022,7458@2/pci11ab,11ab@1/disk@2,0'
      10. c1t2d0 <ATA-SEAGATE ST32500N-3AZQ-232.88GB>
          /pci@0,0/pci1022,7458@2/pci11ab,11ab@1/disk@2,0

From the pool history we see the resilvering restart timestamps... they closely match the retry-fail cycles, following them by some 75 seconds:

# zpool history -il pond | tail; date
....
2012-05-12.10:43:35 [internal pool scrub done txg:91072311] complete=0 [user root on thumper]
2012-05-12.10:43:36 [internal pool scrub txg:91072311] func=1 mintxg=41 maxtxg=91051854 [user root on thumper]
2012-05-12.14:12:44 [internal pool scrub done txg:91072723] complete=0 [user root on thumper]
2012-05-12.14:12:45 [internal pool scrub txg:91072723] func=1 mintxg=41 maxtxg=91051854 [user root on thumper]
Sat May 12 15:45:50 MSK 2012

And last but not least - the FMDUMP messages...

# fmdump -eV | tail -150
...
May 12 2012 10:42:19.559305872 ereport.io.scsi.cmd.disk.tran
nvlist version: 0
        class = ereport.io.scsi.cmd.disk.tran
        ena = 0x928e32c9d1700401
        detector = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = dev
                device-path = /pci@0,0/pci1022,7458@2/pci11ab,11ab@1/disk@2,0
        (end detector)

        driver-assessment = fail
        op-code = 0x28
        cdb = 0x28 0x0 0x14 0xd5 0x53 0x0 0x0 0x0 0x80 0x0
        pkt-reason = 0x1
        pkt-state = 0x37
        pkt-stats = 0x0
        __ttl = 0x1
        __tod = 0x4fae064b 0x21565490

May 12 2012 14:11:27.754940954 ereport.io.scsi.cmd.disk.tran
nvlist version: 0
        class = ereport.io.scsi.cmd.disk.tran
        ena = 0x492896f3a8500401
        detector = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = dev
                device-path = /pci@0,0/pci1022,7458@2/pci11ab,11ab@1/disk@2,0
        (end detector)

        driver-assessment = retry
        op-code = 0x28
        cdb = 0x28 0x0 0x14 0xc8 0x77 0x80 0x0 0x0 0x80 0x0
        pkt-reason = 0x1
        pkt-state = 0x37
        pkt-stats = 0x0
        __ttl = 0x1
        __tod = 0x4fae374f 0x2cff7c1a

May 12 2012 14:11:27.754905021 ereport.io.scsi.cmd.disk.tran
nvlist version: 0
        class = ereport.io.scsi.cmd.disk.tran
        ena = 0x492896f3a8500401
        detector = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = dev
                device-path = /pci@0,0/pci1022,7458@2/pci11ab,11ab@1/disk@2,0
        (end detector)

        driver-assessment = retry
        op-code = 0x28
        cdb = 0x28 0x0 0x14 0xc8 0x77 0x80 0x0 0x0 0x80 0x0
        pkt-reason = 0x1
        pkt-state = 0x37
        pkt-stats = 0x0
        __ttl = 0x1
        __tod = 0x4fae374f 0x2cfeefbd

May 12 2012 14:11:27.754866050 ereport.io.scsi.cmd.disk.tran
nvlist version: 0
        class = ereport.io.scsi.cmd.disk.tran
        ena = 0x492896f3a8500401
        detector = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = dev
                device-path = /pci@0,0/pci1022,7458@2/pci11ab,11ab@1/disk@2,0
        (end detector)

        driver-assessment = retry
        op-code = 0x28
        cdb = 0x28 0x0 0x14 0xc8 0x77 0x80 0x0 0x0 0x80 0x0
        pkt-reason = 0x1
        pkt-state = 0x37
        pkt-stats = 0x0
        __ttl = 0x1
        __tod = 0x4fae374f 0x2cfe5782

May 12 2012 14:11:27.754793613 ereport.io.scsi.cmd.disk.tran
nvlist version: 0
        class = ereport.io.scsi.cmd.disk.tran
        ena = 0x492896f3a8500401
        detector = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = dev
                device-path = /pci@0,0/pci1022,7458@2/pci11ab,11ab@1/disk@2,0
        (end detector)

        driver-assessment = retry
        op-code = 0x28
        cdb = 0x28 0x0 0x14 0xc8 0x77 0x80 0x0 0x0 0x80 0x0
        pkt-reason = 0x1
        pkt-state = 0x37
        pkt-stats = 0x0
        __ttl = 0x1
        __tod = 0x4fae374f 0x2cfd3c8d

May 12 2012 14:11:27.754757103 ereport.io.scsi.cmd.disk.tran
nvlist version: 0
        class = ereport.io.scsi.cmd.disk.tran
        ena = 0x492896f3a8500401
        detector = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = dev
                device-path = /pci@0,0/pci1022,7458@2/pci11ab,11ab@1/disk@2,0
        (end detector)

        driver-assessment = retry
        op-code = 0x28
        cdb = 0x28 0x0 0x14 0xc8 0x77 0x80 0x0 0x0 0x80 0x0
        pkt-reason = 0x1
        pkt-state = 0x37
        pkt-stats = 0x0
        __ttl = 0x1
        __tod = 0x4fae374f 0x2cfcadef

May 12 2012 14:11:27.754721778 ereport.io.scsi.cmd.disk.tran
nvlist version: 0
        class = ereport.io.scsi.cmd.disk.tran
        ena = 0x492896f3a8500401
        detector = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = dev
                device-path = /pci@0,0/pci1022,7458@2/pci11ab,11ab@1/disk@2,0
        (end detector)

        driver-assessment = fail
        op-code = 0x28
        cdb = 0x28 0x0 0x14 0xc8 0x77 0x80 0x0 0x0 0x80 0x0
        pkt-reason = 0x1
        pkt-state = 0x37
        pkt-stats = 0x0
        __ttl = 0x1
        __tod = 0x4fae374f 0x2cfc23f2

//Jim
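To keep an eye on further transport errors without rereading the full verbose output, a couple of one-liners like these may help (plain fmdump options; the uniq trick is just a convenience):

# One line per logged error event, newest last
fmdump -e | tail -30
# Count error reports per device path to see which disk is complaining
fmdump -eV | grep device-path | sort | uniq -c | sort -rn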
2012-05-12 15:52, Jim Klimov wrote:
> 2012-05-11 14:22, Jim Klimov wrote:
>> What conditions can cause the reset of the resilvering
>> process? My lost-and-found disk can't get back into the
>> pool because of resilvers restarting...
>
> Guess I must assume that the disk is dying indeed, losing
> connection or something like that after a random time (my
> resilvers restart after 15min-5hrs), and at least a run of
> SMART long diags is in order, while the pool would try to
> rebuild onto another disk (the hotspare) instead of trying
> to update this one which was in the pool.

Well, I decided to bite the bullet and kick the original disk from the pool after replacing it with the spare, and to say the least, the speed prognosis looks a lot better now: about twice as good in zpool's resilver progress/estimate, and up to an order of magnitude better at iostat speeds:

Sat May 12 16:05:44 MSK 2012
                    extended device statistics
    r/s    w/s     kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
  571.0    5.4  52488.4     5.1  0.5  3.8    0.9    6.6  11  77 c0t1d0
  589.4    5.6  52630.5     5.3  1.6  2.2    2.6    3.7  32  75 c4t3d0
  562.0    6.2  52552.9     5.8  0.4  4.1    0.8    7.1   9  80 c6t5d0
  582.4    6.2  52554.5     5.8  1.5  2.4    2.5    4.0  29  76 c7t6d0
    0.0  591.0      0.0  52081.8  1.9  6.1    3.3   10.3  23  91 c5t6d0

Sat May 12 16:10:09 MSK 2012
                    extended device statistics
    r/s    w/s     kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
  248.4    0.1  14190.5     0.0  0.0  6.9    0.0   27.9   0  87 c0t1d0
  247.9    0.1  14188.3     0.0  0.0  6.9    0.0   27.7   0  87 c4t3d0
  248.3    0.1  14193.2     0.0  0.0  6.9    0.0   27.7   0  87 c6t5d0
  245.1    0.1  14195.4     0.0  0.0  7.1    0.0   28.9   0  88 c7t6d0
    0.1  214.4      0.6  13698.7  0.0  9.6    0.0   44.6   0  97 c5t6d0

The disk replacement was as follows: I kicked the hotspare from the resilvering pool and reattached it, then I kicked the original disk from the pair and attached it as a new hotspare. An excerpt from the zpool history follows:

2012-05-12.10:13:03 zpool detach pond c5t6d0 [user root on thumper:global]
2012-05-12.15:58:31 [internal vdev attach txg:91072933] spare in vdev=/dev/dsk/c5t6d0s0 for vdev=/dev/dsk/c1t2d0s0 [user root on thumper]
2012-05-12.15:59:08 zpool replace pond c1t2d0 c5t6d0 [user root on thumper:global]
2012-05-12.16:02:06 zpool detach pond c1t2d0 [user root on thumper:global]
2012-05-12.16:04:35 zpool add pond spare c1t2d0 [user root on thumper:global]

Hope this helps us, thanks for listening,
//Jim Klimov
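Once the spare finishes resilvering in as a full member, the usual follow-up would presumably be to clear the stale counters and verify the rebuilt data with a scrub - a sketch:

zpool status -v pond   # wait for the resilver-completed message here first
zpool clear pond       # reset the old CKSUM counters on the raidz
zpool scrub pond       # full read-back to confirm the reconstruction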
On May 12, 2012, at 4:52 AM, Jim Klimov wrote:

> 2012-05-11 14:22, Jim Klimov wrote:
>> What conditions can cause the reset of the resilvering
>> process? My lost-and-found disk can't get back into the
>> pool because of resilvers restarting...
>
> FOLLOW-UP AND NEW QUESTIONS
>
> Here is a new piece of evidence - I've finally got something
> out of fmdump - series of several (5) retries ending with a
> fail, dated 75 seconds before resilvers restart (more below).
> Not a squeak in zpool status nor dmesg nor /dev/console.
>
> Guess I must assume that the disk is dying indeed, losing
> connection or something like that after a random time (my
> resilvers restart after 15min-5hrs), and at least a run of
> SMART long diags is in order, while the pool would try to
> rebuild onto another disk (the hotspare) instead of trying
> to update this one which was in the pool.

Please share if SMART offers anything useful.

> Anyhow, information on the ex-pool disk is likely unused
> anyway - from iostat I see that it is only written to,
> with few to zero reads for minutes - so I'd lose nothing
> by replacing it with a blank drive (that's strange though)...
>
> I also guess that the disk gets found after something like
> an unlogged bus reset or whatever, and this event causes
> the resilvering to restart from scratch.

This makes sense.

> Q: Would this be the same in OI_151a, or would it continue
> resilvering from where it left off? I think I had the pool
> exported once during a resilver, and it restarted from the
> same percentage counter, so it is possible ;)
>
> Best course of action would be to get those people to fully
> replace the untrustworthy disk... Or at least pull and push
> it a bit - maybe it's contacts just got plain dirty/oxidized
> and the disk should be re-seated in the enclosure...

Not likely to help. SATA?
 -- richard

> I'd like someone to please confirm or deny my hypotheses
> and guesses :)
>
> DETAILS
>
> According to format, the disk in tailed fmdump reports below
> is indeed the one I'm trying to resilver into:
>
> # format | gegrep -B1 '/pci@0,0/pci1022,7458@2/pci11ab,11ab@1/disk@2,0'
>       10. c1t2d0 <ATA-SEAGATE ST32500N-3AZQ-232.88GB>
>           /pci@0,0/pci1022,7458@2/pci11ab,11ab@1/disk@2,0
>
> From the pool history we see the resilvering restart
> timestamps... they closely match the retry-fail cycles,
> following them by some 75 seconds:
>
> # zpool history -il pond | tail; date
> ....
> 2012-05-12.10:43:35 [internal pool scrub done txg:91072311] complete=0 [user root on thumper]
> 2012-05-12.10:43:36 [internal pool scrub txg:91072311] func=1 mintxg=41 maxtxg=91051854 [user root on thumper]
> 2012-05-12.14:12:44 [internal pool scrub done txg:91072723] complete=0 [user root on thumper]
> 2012-05-12.14:12:45 [internal pool scrub txg:91072723] func=1 mintxg=41 maxtxg=91051854 [user root on thumper]
> Sat May 12 15:45:50 MSK 2012
>
> And last but not least - the FMDUMP messages...
>
> # fmdump -eV | tail -150
> ...
> [The fmdump -eV transport-error records were quoted here verbatim; see the
> seven ereport.io.scsi.cmd.disk.tran entries in the previous message.]
>
> //Jim

-- 
ZFS Performance and Training
Richard.Elling at RichardElling.com
+1-760-896-4422
Thanks for staying tuned! ;)

2012-05-12 18:34, Richard Elling wrote:
> On May 12, 2012, at 4:52 AM, Jim Klimov wrote:
>
>> 2012-05-11 14:22, Jim Klimov wrote:
>>> What conditions can cause the reset of the resilvering
>>> process? My lost-and-found disk can't get back into the
>>> pool because of resilvers restarting...
>>
>> FOLLOW-UP AND NEW QUESTIONS
>>
>> Here is a new piece of evidence - I've finally got something
>> out of fmdump - series of several (5) retries ending with a
>> fail, dated 75 seconds before resilvers restart (more below).
>> Not a squeak in zpool status nor dmesg nor /dev/console.
>>
>> Guess I must assume that the disk is dying indeed, losing
>> connection or something like that after a random time (my
>> resilvers restart after 15min-5hrs), and at least a run of
>> SMART long diags is in order, while the pool would try to
>> rebuild onto another disk (the hotspare) instead of trying
>> to update this one which was in the pool.
>
> Please share if SMART offers anything useful.

I plan to run the tests after (if, when) the resilver completes, so as not to disturb the system. So far I have smartmontools-5.42 compiled and it sees the disks :) I am not sure it will actually run the self-tests and report back - that seems to be unsupported by the disk (?):

# /usr/local/sbin/smartctl -x /dev/rdsk/c1t2d0 -d scsi
smartctl 5.42 2011-10-20 r3458 [i386-pc-solaris2.11] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

User Capacity:        250,056,000,000 bytes [250 GB]
Logical block size:   512 bytes
Serial number:        5QE5ADXW
Device type:          disk
Local Time is:        Sat May 12 18:40:53 2012 MSK
Device supports SMART and is Enabled
Temperature Warning Disabled or Not Supported
SMART Health Status: OK

Current Drive Temperature:     26 C

Error Counter logging not supported
No self-tests have been logged
Device does not support Background scan results logging
scsiPrintSasPhy Log Sense Failed [unsupported field in scsi command]

>> I also guess that the disk gets found after something like
>> an unlogged bus reset or whatever, and this event causes
>> the resilvering to restart from scratch.
>
> This makes sense.

Good, thanks for the sanity-check ;)

>> Best course of action would be to get those people to fully
>> replace the untrustworthy disk... Or at least pull and push
>> it a bit - maybe it's contacts just got plain dirty/oxidized
>> and the disk should be re-seated in the enclosure...
>
> Not likely to help. SATA?

Yes, a 250GB Seagate SATA disk on a Thumper X4500, in use since about May 2008 (the last rpool installation date) or maybe before.

      10. c1t2d0 <ATA-SEAGATE ST32500N-3AZQ-232.88GB>
          /pci@0,0/pci1022,7458@2/pci11ab,11ab@1/disk@2,0

Thanks,
//Jim
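Since -d scsi only yields the bare health page here, it might be worth trying smartmontools' SAT pass-through mode for SATA disks sitting behind SCSI-looking HBAs - whether the Marvell controller on the Thumper passes these commands through is an open question:

# ATA SMART attributes and logs via SAT pass-through (may fail on this HBA)
smartctl -d sat,12 -a /dev/rdsk/c1t2d0
# Long self-test, best started only after the resilver is done
smartctl -d sat,12 -t long /dev/rdsk/c1t2d0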
2012-05-12 7:01, Jim Klimov wrote:
> Overall the applied question is whether the disk will
> make it back into the live pool (ultimately with no
> continuous resilvering), and how fast that can be done -
> I don't want to risk the big pool with nonredundant
> arrays for too long.

Here lies another "grumpy gripe", although maybe it pertains to the oldish snv_117 on that box: the system is not making its best possible effort to complete the resilver ASAP :)

According to "iostat 60", disk utilizations of this raidz set vary between 15 and 50% busy, queue lengths stay within 5 outstanding tasks, CPU kernel time is 2-7% with over 90% idle, and over 2GB of RAM remains free... Why won't it go and complete the quest faster? Can some tire be kicked? ;)

Sat May 12 19:06:09 MSK 2012
                    extended device statistics
    r/s    w/s     kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
  309.6    3.8  14863.0     5.0  0.0  4.7    0.0   15.0   0  65 c0t1d0
  312.5    3.9  14879.7     5.1  0.0  4.6    0.0   14.7   0  64 c4t3d0
  308.5    4.0  14855.0     5.2  0.0  4.7    0.0   15.1   0  66 c6t5d0
  310.7    3.9  14855.7     5.1  0.0  4.6    0.0   14.8   0  65 c7t6d0
    0.0  225.3      0.0  14484.2  0.0  8.1    0.0   36.0   0  83 c5t6d0
Sat May 12 19:07:09 MSK 2012
                    extended device statistics
    r/s    w/s     kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
  228.0    3.0   6859.7     4.0  0.0  6.9    0.0   29.9   0  81 c0t1d0
  227.7    3.3   6850.0     4.3  0.0  6.9    0.0   30.0   0  81 c4t3d0
  228.1    3.4   6857.9     4.4  0.0  7.0    0.0   30.0   0  81 c6t5d0
  227.6    3.1   6860.4     4.1  0.0  7.1    0.0   30.7   0  82 c7t6d0
    0.0  225.8      0.0   6379.1  0.0  8.1    0.0   35.8   0  85 c5t6d0
...

During some minutes the disks sit there doing almost nothing at all:

Sat May 12 19:01:09 MSK 2012
                    extended device statistics
    r/s    w/s     kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
   10.7    0.8    665.4     0.7  0.0  0.1    0.0   11.4   0  13 c0t1d0
   10.7    0.9    667.5     0.7  0.0  0.1    0.0   11.6   0  13 c4t3d0
   10.7    0.8    666.4     0.7  0.0  0.1    0.0   11.9   0  13 c6t5d0
   10.7    0.9    668.5     0.7  0.0  0.1    0.0   11.6   0  13 c7t6d0
    0.1   15.5      0.6     20.3  0.0  0.0    0.0    0.2   0   0 c5t6d0

last pid: 18121;  load avg: 0.16, 0.15, 0.12;  up 0+16:03:44  19:06:51
96 processes: 95 sleeping, 1 on cpu
CPU states: 96.6% idle, 0.2% user, 3.2% kernel, 0.0% iowait, 0.0% swap
Memory: 16G phys mem, 2476M free mem, 16G total swap, 16G free swap
...

> It has already taken 2 days to try and resilver a 250Gb
> disk into the pool, but never made it past 100Gb progress. :(
> Reports no errors that I'd see either... :)

Well, that part seems to have been explained in my other mails, and hopefully worked around by the hotspare.

//Jim
Jim,

This makes sense. fmdump -eV reported that your original drive was experiencing repeated read failures (SCSI command opcode 0x28, i.e. READ(10)).

Steve

----- Original Message -----
....
Well, I decided to bite the bullet and kick the original disk from the pool after replacing it with the spare, and to say the least, speed prognosis looks a lot better now, about two times better in zpool's resilver progress/estimate and by up to an order of magnitude better at iostat speeds:
....
comments below...

On May 12, 2012, at 8:10 AM, Jim Klimov wrote:

> 2012-05-12 7:01, Jim Klimov wrote:
>> Overall the applied question is whether the disk will
>> make it back into the live pool (ultimately with no
>> continuous resilvering), and how fast that can be done -
>> I don't want to risk the big pool with nonredundant
>> arrays for too long.
>
> Here lies another "grumpy gripe", although maybe pertaining
> to the oldish snv_117 on that box: the system is not making
> its best possible effort to complete the resilver ASAP :)
>
> According to "iostat 60", disk utilizations of this raidz
> set vary 15-50% busy, queue lengths stay within 5 outstanding
> tasks, the CPU kernel time is 2-7% with over 90% idling,
> over 2GB RAM remains free... Why won't it go and complete
> the quest faster? Can some tire be kicked? ;)
>
> Sat May 12 19:06:09 MSK 2012
>                     extended device statistics
>     r/s    w/s     kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
>   309.6    3.8  14863.0     5.0  0.0  4.7    0.0   15.0   0  65 c0t1d0
>   312.5    3.9  14879.7     5.1  0.0  4.6    0.0   14.7   0  64 c4t3d0
>   308.5    4.0  14855.0     5.2  0.0  4.7    0.0   15.1   0  66 c6t5d0
>   310.7    3.9  14855.7     5.1  0.0  4.6    0.0   14.8   0  65 c7t6d0
>     0.0  225.3      0.0  14484.2  0.0  8.1    0.0   36.0   0  83 c5t6d0
> Sat May 12 19:07:09 MSK 2012
>                     extended device statistics
>     r/s    w/s     kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
>   228.0    3.0   6859.7     4.0  0.0  6.9    0.0   29.9   0  81 c0t1d0
>   227.7    3.3   6850.0     4.3  0.0  6.9    0.0   30.0   0  81 c4t3d0
>   228.1    3.4   6857.9     4.4  0.0  7.0    0.0   30.0   0  81 c6t5d0
>   227.6    3.1   6860.4     4.1  0.0  7.1    0.0   30.7   0  82 c7t6d0
>     0.0  225.8      0.0   6379.1  0.0  8.1    0.0   35.8   0  85 c5t6d0

In general, asvc_t of this magnitude along with actv of this size means you might be better off lowering zfs_vdev_max_pending.

> ...
>
> During some minutes the disks sit there doing almost nothing at all:
>
> Sat May 12 19:01:09 MSK 2012
>                     extended device statistics
>     r/s    w/s     kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
>    10.7    0.8    665.4     0.7  0.0  0.1    0.0   11.4   0  13 c0t1d0
>    10.7    0.9    667.5     0.7  0.0  0.1    0.0   11.6   0  13 c4t3d0
>    10.7    0.8    666.4     0.7  0.0  0.1    0.0   11.9   0  13 c6t5d0
>    10.7    0.9    668.5     0.7  0.0  0.1    0.0   11.6   0  13 c7t6d0
>     0.1   15.5      0.6     20.3  0.0  0.0    0.0    0.2   0   0 c5t6d0

This behaviour cannot be debugged with iostat or any of the various CPU-monitoring stat utilities. There is blocking somewhere, and it is likely to be in the data path. You might try iosnoop and look for I/O completion times that are large (> 1 second).
 -- richard

> last pid: 18121;  load avg: 0.16, 0.15, 0.12;  up 0+16:03:44  19:06:51
> 96 processes: 95 sleeping, 1 on cpu
> CPU states: 96.6% idle, 0.2% user, 3.2% kernel, 0.0% iowait, 0.0% swap
> Memory: 16G phys mem, 2476M free mem, 16G total swap, 16G free swap
> ...
>
>> It has already taken 2 days to try and resilver a 250Gb
>> disk into the pool, but never made it past 100Gb progress. :(
>> Reports no errors that I'd see either... :)
>
> Well, that part seems to have been explained in my other
> mails, and hopefully worked around by the hotspare.
>
> //Jim

-- 
ZFS Performance and Training
Richard.Elling at RichardElling.com
+1-760-896-4422
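For completeness, lowering the queue depth Richard mentions is normally done live through mdb; whether the symbol is present under this exact name on snv_117 should be checked first - a sketch:

# Read the current value (32-bit decimal); symbol name assumed to exist on this build
echo zfs_vdev_max_pending/D | mdb -k
# Lower it from the old default (35, if memory serves) to something smaller
echo zfs_vdev_max_pending/W0t10 | mdb -kw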