Hi,
I have installed OpenSolaris snv_134 from the ISO at genunix.org
("Mon Mar 8 2010: New OpenSolaris preview, based on build 134").

I created a zpool:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          c7t4d0    ONLINE       0     0     0
          c7t5d0    ONLINE       0     0     0
          c7t6d0    ONLINE       0     0     0
          c7t8d0    ONLINE       0     0     0
          c7t9d0    ONLINE       0     0     0
        logs
          c5d1p1    ONLINE       0     0     0
        cache
          c5d1p2    ONLINE       0     0     0

The log device and cache are each one half of a 128GB OCZ VERTEX-TURBO flash card.

I am getting good NFS performance but have seen this error:

rich@brszfs02:~# zpool status tank
  pool: tank
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank        DEGRADED     0     0     0
          c7t4d0    ONLINE       0     0     0
          c7t5d0    ONLINE       0     0     0
          c7t6d0    ONLINE       0     0     0
          c7t8d0    ONLINE       0     0     0
          c7t9d0    ONLINE       0     0     0
        logs
          c5d1p1    FAULTED      0     4     0  too many errors
        cache
          c5d1p2    ONLINE       0     0     0

errors: No known data errors

rich@brszfs02:~# fmadm faulty
--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Mar 25 13:14:34 6c0bd163-56bf-ee92-e393-ce2063355b52  ZFS-8000-FD    Major

Host        : brszfs02
Platform    : HP-Compaq-dc7700-Convertible-Minitower   Chassis_id : CZC7264JN4
Product_sn  :

Fault class : fault.fs.zfs.vdev.io
Affects     : zfs://pool=tank/vdev=4ec464b5bf74a898
                  faulted but still in service
Problem in  : zfs://pool=tank/vdev=4ec464b5bf74a898
                  faulted but still in service

Description : The number of I/O errors associated with a ZFS device exceeded
              acceptable levels.  Refer to http://sun.com/msg/ZFS-8000-FD
              for more information.

Response    : The device has been offlined and marked as faulted. An attempt
              will be made to activate a hot spare if available.

Impact      : Fault tolerance of the pool may be compromised.

Action      : Run 'zpool status -x' and replace the bad device.

rich@brszfs02:~# iostat -En c5d1
c5d1             Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Model: OCZ VERTEX-TURB  Revision:  Serial No: 062F97G71C5T676  Size: 128.04GB <128035160064 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0

As there seem to be no hardware errors reported by iostat, I ran zpool clear tank and a scrub on Monday. Up to now I have seen no new errors, and I have set up a cron job to scrub at 01:30 each day.

Is the flash card faulty, or is this a ZFS problem?

Cheers
Richard
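For reference, a pool with this layout can be created in one step (the command below is reconstructed from the status output above, not taken from the thread), and the nightly scrub can be scheduled with a crontab entry along these lines:

    # zpool create tank c7t4d0 c7t5d0 c7t6d0 c7t8d0 c7t9d0 \
        log c5d1p1 cache c5d1p2

    # crontab -l | grep scrub
    30 1 * * * /usr/sbin/zpool scrub tank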
comment below...

On Apr 14, 2010, at 1:49 AM, Richard Skelton wrote:
> Hi,
> I have installed OpenSolaris snv_134 from the ISO at genunix.org.
> [...]
> As there seem to be no hardware errors reported by iostat, I ran
> zpool clear tank and a scrub on Monday.
> Up to now I have seen no new errors, and I have set up a cron job to
> scrub at 01:30 each day.
>
> Is the flash card faulty, or is this a ZFS problem?

In my testing of Flash-based SSDs, this is the most common error. Since the drive is not reporting media errors or hard errors, the only interim conclusion is that something in the data path corrupted the data. This can mean the drive doesn't report these errors, the errors are transient, or an error occurred which is not related to the data (e.g. phantom writes).

For example, my current bad-boy says:

$ iostat -En
...
c7t0d0   Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: USB2.0   Product: VAULT DRIVE   Revision: 1100 Serial No:
Size: 8.12GB <8120172544 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 103 Predictive Failure Analysis: 0
...
$ pfexec zpool status -v syspool
  pool: syspool
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: scrub completed after 0h1m with 325 errors on Wed Apr 14 11:06:58 2010
config:

        NAME        STATE     READ WRITE CKSUM
        syspool     ONLINE       0     0   330
          c7t0d0s0  ONLINE       0     0   690

errors: Permanent errors have been detected in the following files:

        syspool/rootfs-nmu-000@initial:/var/lib/dpkg/info/man-db.list
        syspool/rootfs-nmu-000@initial:/var/lib/dpkg/triggers/File
        ...

-- 
richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com
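Two follow-ups are worth noting here. First, the raw telemetry behind a ZFS-8000-FD fault can be read straight from the FMA error log, which shows whether the underlying error reports were I/O, checksum, or probe failures:

    # fmdump -e     # one line per error report, with timestamp and class
    # fmdump -eV    # full detail, including the affected pool and vdev

Second, because the faulted vdev in the original pool is a separate log device rather than a data disk, it can be detached for testing: log device removal is supported on this build (it arrived around build 125). A sketch, assuming the device names from the first message:

    # zpool clear tank c5d1p1     # reset the error counters and the FAULTED state
    # zpool remove tank c5d1p1    # or drop the slog entirely...
    # zpool add tank log c5d1p1   # ...and re-add it once the card checks out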
Hi,
After a little more digging I found this in /var/adm/messages:

Mar 25 13:13:08 brszfs02 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci-ide@1f,2/ide@1 (ata1):
Mar 25 13:13:08 brszfs02        timeout: early timeout, target=1 lun=0
Mar 25 13:13:08 brszfs02 gda: [ID 107833 kern.warning] WARNING: /pci@0,0/pci-ide@1f,2/ide@1/cmdk@1,0 (Disk1):
Mar 25 13:13:08 brszfs02        Error for command 'write sector'  Error Level: Informational
Mar 25 13:13:08 brszfs02 gda: [ID 107833 kern.notice]   Sense Key: aborted command
Mar 25 13:13:08 brszfs02 gda: [ID 107833 kern.notice]   Vendor 'Gen-ATA ' error code: 0x3
Mar 25 13:13:43 brszfs02 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci-ide@1f,2/ide@1 (ata1):
Mar 25 13:13:43 brszfs02        timeout: early timeout, target=1 lun=0
Mar 25 13:13:43 brszfs02 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci-ide@1f,2/ide@1 (ata1):
Mar 25 13:13:43 brszfs02        timeout: early timeout, target=1 lun=0
Mar 25 13:13:43 brszfs02 gda: [ID 107833 kern.warning] WARNING: /pci@0,0/pci-ide@1f,2/ide@1/cmdk@1,0 (Disk1):
Mar 25 13:13:43 brszfs02        Error for command 'read sector'  Error Level: Informational
Mar 25 13:13:43 brszfs02 gda: [ID 107833 kern.notice]   Sense Key: aborted command
Mar 25 13:13:43 brszfs02 gda: [ID 107833 kern.notice]   Vendor 'Gen-ATA ' error code: 0x3
Mar 25 13:13:43 brszfs02 gda: [ID 107833 kern.warning] WARNING: /pci@0,0/pci-ide@1f,2/ide@1/cmdk@1,0 (Disk1):
Mar 25 13:13:43 brszfs02        Error for command 'read sector'  Error Level: Informational
Mar 25 13:13:43 brszfs02 gda: [ID 107833 kern.notice]   Sense Key: aborted command
Mar 25 13:13:43 brszfs02 gda: [ID 107833 kern.notice]   Vendor 'Gen-ATA ' error code: 0x3
Mar 25 13:14:18 brszfs02 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci-ide@1f,2/ide@1 (ata1):
Mar 25 13:14:18 brszfs02        timeout: early timeout, target=1 lun=0
Mar 25 13:14:18 brszfs02 gda: [ID 107833 kern.warning] WARNING: /pci@0,0/pci-ide@1f,2/ide@1/cmdk@1,0 (Disk1):
Mar 25 13:14:18 brszfs02        Error for command 'read sector'  Error Level: Informational
Mar 25 13:14:18 brszfs02 gda: [ID 107833 kern.notice]   Sense Key: aborted command
Mar 25 13:14:18 brszfs02 gda: [ID 107833 kern.notice]   Vendor 'Gen-ATA ' error code: 0x3
Mar 25 13:14:33 brszfs02 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci-ide@1f,2/ide@1 (ata1):
Mar 25 13:14:33 brszfs02        timeout: abort request, target=0 lun=0
Mar 25 13:14:33 brszfs02 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci-ide@1f,2/ide@1 (ata1):
Mar 25 13:14:33 brszfs02        timeout: abort device, target=0 lun=0
Mar 25 13:14:33 brszfs02 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci-ide@1f,2/ide@1 (ata1):
Mar 25 13:14:33 brszfs02        timeout: reset target, target=0 lun=0
Mar 25 13:14:33 brszfs02 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci-ide@1f,2/ide@1 (ata1):
Mar 25 13:14:33 brszfs02        timeout: reset bus, target=0 lun=0
Mar 25 13:14:34 brszfs02 fmd: [ID 377184 daemon.error] SUNW-MSG-ID: ZFS-8000-FD, TYPE: Fault, VER: 1, SEVERITY: Major
Mar 25 13:14:34 brszfs02 EVENT-TIME: Thu Mar 25 13:14:34 GMT 2010
Mar 25 13:14:34 brszfs02 PLATFORM: HP-Compaq-dc7700-Convertible-Minitower, CSN: CZC7264JN4, HOSTNAME: brszfs02
Mar 25 13:14:34 brszfs02 SOURCE: zfs-diagnosis, REV: 1.0
Mar 25 13:14:34 brszfs02 EVENT-ID: 6c0bd163-56bf-ee92-e393-ce2063355b52
Mar 25 13:14:34 brszfs02 DESC: The number of I/O errors associated with a ZFS device exceeded
Mar 25 13:14:34 brszfs02        acceptable levels.
Mar 25 13:14:34 brszfs02        Refer to http://sun.com/msg/ZFS-8000-FD for more information.
Mar 25 13:14:34 brszfs02 AUTO-RESPONSE: The device has been offlined and marked as faulted. An attempt
Mar 25 13:14:34 brszfs02        will be made to activate a hot spare if available.
Mar 25 13:14:34 brszfs02 IMPACT: Fault tolerance of the pool may be compromised.
Mar 25 13:14:34 brszfs02 REC-ACTION: Run 'zpool status -x' and replace the bad device.

If I remember correctly, I was thrashing this pool with Bonnie++ at the time.

Cheers
Richard.
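Since the fault coincided with a Bonnie++ run, one way to check whether the ata1 timeouts are load-dependent is to thrash the pool again while watching the error counters. A sketch, assuming the pool is mounted at /tank (the mount point and working-set size are assumptions, not details from the thread):

    # bonnie++ -d /tank -s 8192 -u root &
    # iostat -xne 10                                       # watch the s/w, h/w, trn error columns while it runs
    # grep -c "timeout: early timeout" /var/adm/messages   # count new ata1 timeouts afterwards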