How can I diagnose why a resilver appears to hang at a certain
percentage, seemingly doing nothing for quite a while, even though the
HDD LED is lit permanently (no apparent head seeking)?

The drives in the pool are WD RAID Edition disks, so they have TLER and
should time out on errors within seconds. Neither ZFS nor the syslog
reported any I/O errors, however, so it wasn't the disks.

Stopping the scrub didn't work; the zfs command didn't return. It took a
hard reset to make it stop.

The system in question is OpenSolaris updated to snv_98. The pool was at
version 11 when I tried; I've since upgraded it to 13 but haven't retried
scrubbing yet.

Any ideas?

Thanks,
-mg
Mario Goebbels wrote:
> How can I diagnose why a resilver appears to be hanging at a certain
> percentage, seemingly doing nothing for quite a while, even though the
> HDD LED is lit up permanently (no apparent head seeking)?
>
> The drives in the pool are WD Raid Editions, thus have TLER and should
> time out on errors in just seconds. ZFS nor the syslog however were
> reporting any IO errors, so it weren't the disks.

Check the FMA logs:
    fmadm faulty
    fmdump -e[vV]

> Stopping the scrub didn't work, the zfs command didn't return. It took a
> hard reset to make it stop.

scrub is not a zfs subcommand; perhaps you meant zpool?
Depending on the failure, zpool commands may hang; this is fixed in b100.
 -- richard
>> How can I diagnose why a resilver appears to be hanging at a certain
>> percentage, seemingly doing nothing for quite a while, even though the
>> HDD LED is lit up permanently (no apparent head seeking)?
>>
>> The drives in the pool are WD Raid Editions, thus have TLER and should
>> time out on errors in just seconds. ZFS nor the syslog however were
>> reporting any IO errors, so it weren't the disks.
>
> Check the FMA logs:
>     fmadm faulty
>     fmdump -e[vV]

Nothing noteworthy in there. fmadm shows nothing, and fmdump shows only
ereport.io.ddi.fm-capability, repeatedly, which comes from oss_cmi8788
(some OpenSound driver).

>> Stopping the scrub didn't work, the zfs command didn't return. It took a
>> hard reset to make it stop.
>
> scrub is not a zfs subcommand, perhaps you meant zpool?
> Depending on the failure, zpool commands may hang, fixed in b100.

Yeah, sorry, zpool doesn't return.

Regards,
-mg
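For anyone repeating the check above, here is a minimal sketch of tallying ereport classes from saved `fmdump -e` output, to confirm that only the sound driver's class shows up. It assumes one report per line with the class as the last whitespace-separated field; the sample text is made up, and the "storage-related" prefixes are my own guess at which classes would point at the pool or the disk path.

```python
from collections import Counter


def tally_ereport_classes(fmdump_text):
    """Count how often each ereport class appears in `fmdump -e` style text.

    Assumption: one report per line, class name in the last field.
    Lines whose last field is not an ereport.* token (e.g. the header)
    are skipped.
    """
    counts = Counter()
    for line in fmdump_text.splitlines():
        fields = line.split()
        if fields and fields[-1].startswith("ereport."):
            counts[fields[-1]] += 1
    return counts


def storage_related(counts):
    """Keep only classes that look storage-related (assumed prefixes)."""
    prefixes = ("ereport.fs.zfs.", "ereport.io.scsi.")
    return {cls: n for cls, n in counts.items() if cls.startswith(prefixes)}


# Made-up sample in the assumed layout; timestamps are illustrative only.
SAMPLE = """\
TIME                 CLASS
Oct 08 10:00:01.0000 ereport.io.ddi.fm-capability
Oct 08 10:00:02.0000 ereport.io.ddi.fm-capability
Oct 08 10:00:03.0000 ereport.io.ddi.fm-capability
"""
```

With the sample above, `storage_related(tally_ereport_classes(SAMPLE))` comes back empty, which matches the "nothing noteworthy" reading: the error log holds no evidence against the disks, but it does not rule them out either.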
Tom Servo wrote:
>> Check the FMA logs:
>>     fmadm faulty
>>     fmdump -e[vV]
>
> Nothing noteworthy in there. fmadm shows nothing, fmdump just
> ereport.io.ddi.fm-capability repeatedly, which comes from oss_cmi8788
> (some OpenSound driver).

Ouch. The OSS OpenSound code is riddled with improper use of the Solaris
DDI, memory leaks, interrupt storms, and other problems. It would be
interesting to remove that variable and see what issues remain.
Tom Servo wrote:
>> Check the FMA logs:
>>     fmadm faulty
>>     fmdump -e[vV]
>
> Nothing noteworthy in there. fmadm shows nothing, fmdump just
> ereport.io.ddi.fm-capability repeatedly, which comes from oss_cmi8788
> (some OpenSound driver).

argv! Can you try to dd from the misbehaving device and see if
that kicks off a diagnosis? It may take some time to time out, though;
by default it will be several minutes per iop. (For the geezers, format
has had a media scanner for decades.)
 -- richard
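The dd suggestion amounts to a plain sequential read of the device. If it helps to see which offsets are slow, a rough Python equivalent that times every block might look like the sketch below. The function name, block size, and threshold are my own choices, not anything from dd or ZFS.

```python
import time


def timed_read(path, block_size=1024 * 1024, max_blocks=None, slow_seconds=1.0):
    """Read `path` sequentially and note blocks whose read was slow.

    Returns (blocks_read, slow), where slow is a list of
    (byte_offset, seconds) for every read slower than slow_seconds.
    """
    slow = []
    blocks = 0
    offset = 0
    # buffering=0 so that each read() call is one request to the OS.
    with open(path, "rb", buffering=0) as dev:
        while max_blocks is None or blocks < max_blocks:
            start = time.monotonic()
            data = dev.read(block_size)
            elapsed = time.monotonic() - start
            if not data:
                break  # end of the device or file
            if elapsed > slow_seconds:
                slow.append((offset, elapsed))
            offset += len(data)
            blocks += 1
    return blocks, slow
```

It would be pointed at the raw device of the suspect disk, for example `timed_read("/dev/rdsk/c1t0d0s0")`, where the device name is hypothetical. Like dd, a read that is stuck in the driver will block the script until the driver gives up, so this only adds timing information; it does not work around a hang.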
> argv! Can you try to dd from the misbehaving device and see if
> that kicks off a diagnosis? It may take some time to timeout, though,
> by default it will be several minutes per iop. (for the geezers, format
> has had a media scanner for decades)

Well, I'm not even sure that a device is actually misbehaving. During
regular operation, there don't appear to be any issues. Both disks of
the mirror in the pool appeared to work correctly during the stuck
scrub, though, because according to zpool iostat, reads and writes went
through.

I will try to reproduce it this weekend (it's a desktop machine, and I
can't hard reset it via ssh :), hoping that the ZFS version upgrade
fixed this.

FYI, the disks have a TLER of 7 seconds. Should it really take several
minutes per IOP?

Regards,
-mg
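One possible way to reconcile "7 seconds" with "several minutes": TLER only bounds the drive's own error recovery, while the host driver has its own per-command timeout and retry count. The sketch below is only the arithmetic. The 60-second timeout and 5 retries are assumed values for illustration and are not confirmed anywhere in this thread; the 7 seconds is the TLER figure quoted above.

```python
def worst_case_seconds(per_attempt_timeout_s, retries):
    """Upper bound on the stall for one I/O.

    The first attempt and every retry may each run for the full timeout.
    """
    return per_attempt_timeout_s * (retries + 1)


ASSUMED_HOST_TIMEOUT_S = 60  # assumption: host-side per-command timeout
ASSUMED_RETRIES = 5          # assumption: host-side retry count
DRIVE_TLER_S = 7             # from the thread: the drive gives up after 7 s

# Drive never answers, so every attempt runs into the host timeout.
silent_drive = worst_case_seconds(ASSUMED_HOST_TIMEOUT_S, ASSUMED_RETRIES)

# Drive reports an error after its TLER limit on every attempt.
tler_drive = worst_case_seconds(DRIVE_TLER_S, ASSUMED_RETRIES)

print(f"drive silent:   up to {silent_drive} s per I/O")
print(f"drive with TLER: up to {tler_drive} s per I/O")
```

Under those assumed numbers, a drive that honours TLER costs well under a minute per bad I/O, and the multi-minute case only arises when the drive does not respond at all. Whether that is what happened here is exactly what the dd test would help show.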