I'm trying to work out the cause of and a remedy for a very sick iSCSI pool on a Solaris 11 host.

The volume is exported from an Oracle storage appliance and there are no errors reported there. The host has no entries in its logs relating to the network connections.

Any zfs or zpool commands that change the state of the pool (such as zfs mount or zpool export) hang and can't be killed.

fmadm faulty reports:

Jun 27 14:04:24 536fb2ad-1fca-c8b2-fc7d-f5a4a94c165d  ZFS-8000-FD    Major

Host        : taitaklsc01
Platform    : SUN-FIRE-X4170-M2-SERVER   Chassis_id : 1142FMM02N
Product_sn  : 1142FMM02N

Fault class : fault.fs.zfs.vdev.io
Affects     : zfs://pool=fileserver/vdev=68c1bdefa6f97db8
              faulted but still in service
Problem in  : zfs://pool=fileserver/vdev=68c1bdefa6f97db8
              faulted but still in service

Description : The number of I/O errors associated with a ZFS device exceeded
              acceptable levels.  Refer to http://sun.com/msg/ZFS-8000-FD
              for more information.

The zpool status paints a very gloomy picture:

  pool: fileserver
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Fri Jun 29 11:59:59 2012
    858K scanned out of 15.7T at 43/s, (scan is slow, no estimated time)
    567K resilvered, 0.00% done
config:

        NAME                                     STATE     READ WRITE CKSUM
        fileserver                               ONLINE       0 1.16M     0
          c0t600144F096C94AC700004ECD96F20001d0  ONLINE       0 1.16M     0  (resilvering)

errors: 1557164 data errors, use '-v' for a list

Any ideas how to determine the cause of the problem and remedy it?

--
Ian.
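[A quick initiator-side check that complements the fmadm and zpool output above: the Solaris initiator can report its own view of the session and the per-device error counters directly, which is usually faster than hunting through logs. This is only a generic sketch, not something suggested in the thread:

    # iscsiadm list target -v
    # iostat -En

The first command shows whether the session to the appliance is logged in and how many connections it has; the second shows hard/soft/transport error counts for each device, including the iSCSI LUN.]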
Hi Ian,

Chapter 7 of the DTrace book has some examples of how to look at iSCSI target
and initiator behaviour.
 -- richard

On Jun 28, 2012, at 10:47 PM, Ian Collins wrote:

> I'm trying to work out the cause of and a remedy for a very sick iSCSI pool
> on a Solaris 11 host.
> [...]
> Any ideas how to determine the cause of the problem and remedy it?

--
ZFS Performance and Training
Richard.Elling at RichardElling.com
+1-760-896-4422
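[Before reaching for the iSCSI-specific scripts in the book, a rough latency check with the stable io provider on the initiator can tell you whether I/O to the LUN is completing at all. This is only a sketch, not one of the Chapter 7 scripts; it times each block I/O from start to done and breaks the latency distribution down per device, so the LUN shown in the zpool status above should stand out if the transport is the problem:

    # dtrace -n '
    io:::start { start[arg0] = timestamp; }
    io:::done /start[arg0]/
    {
            @lat[args[1]->dev_statname] =
                quantize((timestamp - start[arg0]) / 1000);
            start[arg0] = 0;
    }'

Let it run for a minute or so while the pool is busy, then Ctrl-C to print a per-device latency histogram in microseconds.]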
On 06/30/12 03:01 AM, Richard Elling wrote:
> Hi Ian,
> Chapter 7 of the DTrace book has some examples of how to look at iSCSI
> target and initiator behaviour.

Thanks Richard, I'll have a look.

I'm assuming the pool is hosed?

--
Ian.
On Sun, Jul 1, 2012 at 4:18 AM, Ian Collins <ian at ianshome.com> wrote:
> Thanks Richard, I'll have a look.
>
> I'm assuming the pool is hosed?

Before making that assumption, I'd try something simple first:
- read from the imported iSCSI disk (e.g. with dd) to make sure it's not an
  iSCSI-related problem
- import the disk on another host and try to read it again, to make sure it's
  not a client-specific problem
- possibly restart the iSCSI server, just to make sure

I suspect the problem is with your Oracle storage appliance. But since you say
there are no errors there, these simple tests should show whether it's a
client, disk, or zfs problem.

--
Fajar
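[The dd read test in Fajar's first point might look like this on the initiator; a sketch only. The device is the one from the zpool status, and the s0 slice assumes the usual whole-disk EFI label ZFS puts on a pool disk:

    # dd if=/dev/rdsk/c0t600144F096C94AC700004ECD96F20001d0s0 \
         of=/dev/null bs=1024k count=1024

If this hangs or throws I/O errors, the problem is below ZFS (iSCSI transport or the appliance); if it streams cleanly, suspicion moves back up to the pool itself.]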
On 07/ 1/12 10:20 AM, Fajar A. Nugraha wrote:
> Before making that assumption, I'd try something simple first:
> - read from the imported iSCSI disk (e.g. with dd) to make sure it's not an
>   iSCSI-related problem
> - import the disk on another host and try to read it again, to make sure it's
>   not a client-specific problem
> - possibly restart the iSCSI server, just to make sure

Booting the initiator host from a live DVD image and attempting to import the
pool gives the same error report.

> I suspect the problem is with your Oracle storage appliance. But since you
> say there are no errors there, these simple tests should show whether it's a
> client, disk, or zfs problem.

So did I. I'll get the admin for that system to dig a little deeper and export
a new volume to see if I can create a new pool.

--
Ian.
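[If the admin does export a fresh volume, the smoke test for it is straightforward. A sketch; the device name below is a placeholder and should be whatever the new LUN appears as in format:

    # devfsadm -i iscsi
    # echo | format                          # find the new LUN's cXtYdZ name
    # zpool create testpool c0tXXXXXXXXXXXXXXXXd0
    # dd if=/dev/urandom of=/testpool/junk bs=1024k count=1024
    # zpool scrub testpool
    # zpool status -v testpool

If a brand-new pool on a brand-new LUN also racks up write/checksum errors, that points squarely at the appliance or the path to it rather than at the existing pool.]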
On 07/ 1/12 08:57 PM, Ian Collins wrote:
> Booting the initiator host from a live DVD image and attempting to import
> the pool gives the same error report.

The pool's data appears to be recoverable when I import it read only.

The storage appliance is so full they can't delete files from it! Now that
shouldn't have caused problems with a fixed-size volume, but who knows?

--
Ian.
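[For the archives, the read-only import that made the data readable again is typically done like this on Solaris 11 (pool name as in this thread):

    # zpool import -o readonly=on fileserver
    # zfs list -r fileserver

Note that a read-only pool can zfs send snapshots that already exist, but new snapshots can't be taken on it, so file-level copies (rsync, cpio) are the fallback for getting unsnapshotted data off.]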
On Tue, Jul 3, 2012 at 11:08 AM, Ian Collins <ian at ianshome.com> wrote:
> The pool's data appears to be recoverable when I import it read only.

That's good.

> The storage appliance is so full they can't delete files from it!

Hahaha :D

> Now that shouldn't have caused problems with a fixed-size volume, but who
> knows?

AFAIK you'll always need space, e.g. to replay/roll back transactions during
pool import.

The best way is, of course, to fix the appliance. Sometimes something simple
like deleting snapshots/datasets will do the trick.

--
Fajar
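[On a plain Solaris/ZFS box the usual way out of the "pool full, rm fails" corner is to reclaim space from snapshots rather than files, along the lines Fajar describes. The appliance's own BUI/CLI has its own equivalents; the commands below are the generic ZFS ones, and the pool/dataset names are placeholders:

    # zfs list -t snapshot -o name,used -s used
    # zfs destroy tank/somefs@old-snap

Destroying the snapshot that holds the most unique space is usually enough to get ordinary deletes working again.]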