Hi all,

Today I noticed that one of the ZFS-based servers at my company is complaining about disk errors, and I would like to know whether this is a real physical error or something else, such as a transport error.

The server in question runs snv_134 and is attached to 2 J4400 JBODs. The head node has 2 HBAs, and I've enabled multipath support. The disks are 1TB enterprise-class SATA drives.

The messages seen on the system are:

Jul 15 12:30:48 storage01 fmd: [ID 377184 daemon.error] SUNW-MSG-ID: DISK-8000-4Q, TYPE: Fault, VER: 1, SEVERITY: Critical
Jul 15 12:30:48 storage01 EVENT-TIME: Thu Jul 15 12:30:48 CEST 2010
Jul 15 12:30:48 storage01 PLATFORM: PowerEdge-R710, CSN: HR9SG9J, HOSTNAME: storage01
Jul 15 12:30:48 storage01 SOURCE: eft, REV: 1.16
Jul 15 12:30:48 storage01 EVENT-ID: 859b9d9c-1214-4302-8089-b9447619a2a1
Jul 15 12:30:48 storage01 DESC: The command was terminated with a non-recovered error condition that may have been caused by a flaw in the media or an error in the recorded data.
Jul 15 12:30:48 storage01 Refer to http://sun.com/msg/DISK-8000-4Q for more information.
Jul 15 12:30:48 storage01 AUTO-RESPONSE: The device may be offlined or degraded.
Jul 15 12:30:48 storage01 IMPACT: It is likely that continued operation will result in data corruption, which may eventually cause the loss of service or the service degradation.
Jul 15 12:30:48 storage01 REC-ACTION: Schedule a repair procedure to replace the affected device. Use 'fmadm faulty' to find the affected disk.
Jul 15 12:30:48 storage01 genunix: [ID 846333 kern.warning] WARNING: constraints forbid retire: /scsi_vhci/disk@g5000c50019f03af6

/usr/local/sbin/smartctl -xa -d scsi /dev/rdsk/c0t5000C50019F03AF6d0
smartctl 5.39.1 2010-01-28 r3054 [i386-pc-solaris2.11] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

Serial number: 9QJ60QFG
Device type: disk
Local Time is: Sat Jul 17 11:13:00 2010 CEST
Device supports SMART and is Enabled
Temperature Warning Disabled or Not Supported
SMART Health Status: OK

Current Drive Temperature: 28 C

Error Counter logging not supported

[GLTSD (Global Logging Target Save Disable) set. Enable Save with '-S on']
No self-tests have been logged
Long (extended) Self Test duration: 13800 seconds [230.0 minutes]
Device does not support Background scan results logging
scsiPrintSasPhy Log Sense Failed [unsupported field in scsi command]

iostat -En | grep c0t5000C50019F03AF6
c0t5000C50019F03AF6d0 Soft Errors: 0 Hard Errors: 50 Transport Errors: 0

iostat -en | grep c0t5000C50019F03AF6
  ---- errors ---
  s/w h/w trn tot device
    0  50   0  50 c0t5000C50019F03AF6d0

So I'm confused: S.M.A.R.T. reports no errors, yet iostat reports 50 hard errors. Given this, should I already start replacing the disk in the pool in question and then request an RMA from the disk vendor?

Thanks for all your time,
Bruno

--
This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean.
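
For anyone who wants to pull these counters out of `iostat -En` programmatically, here is a minimal sketch in Python. It only handles the one-line counter summary in the form pasted above; the function name and the regular expression are illustrative assumptions, not part of any Solaris tool, and real `iostat -En` output carries additional per-device lines that this sketch simply skips.

```python
import re

# Sample line copied from the message above. Real `iostat -En` output has
# further lines per device, which this sketch ignores.
SAMPLE = "c0t5000C50019F03AF6d0 Soft Errors: 0 Hard Errors: 50 Transport Errors: 0"

# Hypothetical pattern for the counter summary line, written for this sketch.
COUNTER_RE = re.compile(
    r"^(?P<device>\S+)\s+Soft Errors:\s*(?P<soft>\d+)\s+"
    r"Hard Errors:\s*(?P<hard>\d+)\s+Transport Errors:\s*(?P<transport>\d+)"
)


def parse_error_counters(text):
    """Return {device: {"soft": n, "hard": n, "transport": n}} for matching lines."""
    counters = {}
    for line in text.splitlines():
        match = COUNTER_RE.match(line.strip())
        if match:
            counters[match.group("device")] = {
                key: int(match.group(key)) for key in ("soft", "hard", "transport")
            }
    return counters


result = parse_error_counters(SAMPLE)
print(result)
```

Keeping the three counters separate makes it easy to see at a glance that, for this disk, only the hard-error counter is non-zero.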
On Sat, 17 Jul 2010, Bruno Sousa wrote:

> Jul 15 12:30:48 storage01 SOURCE: eft, REV: 1.16
> Jul 15 12:30:48 storage01 EVENT-ID: 859b9d9c-1214-4302-8089-b9447619a2a1
> Jul 15 12:30:48 storage01 DESC: The command was terminated with a
> non-recovered error condition that may have been caused by a flaw in the
> media or an error in the recorded data.

This sounds like a hard error to me. I suggest using 'iostat -xe' to check the hard error counts and check the system log files. If your storage array was undergoing maintenance and had a cable temporarily disconnected or controller rebooted, then it is possible that hard errors could be counted. FMA usually waits until several errors have been reported over a period of time before reporting a fault.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
On Sat, Jul 17, 2010 at 10:49 AM, Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:

> On Sat, 17 Jul 2010, Bruno Sousa wrote:
>>
>> Jul 15 12:30:48 storage01 SOURCE: eft, REV: 1.16
>> Jul 15 12:30:48 storage01 EVENT-ID: 859b9d9c-1214-4302-8089-b9447619a2a1
>> Jul 15 12:30:48 storage01 DESC: The command was terminated with a
>> non-recovered error condition that may have been caused by a flaw in the
>> media or an error in the recorded data.
>
> This sounds like a hard error to me. I suggest using 'iostat -xe' to check
> the hard error counts and check the system log files. If your storage array
> was undergoing maintenance and had a cable temporarily disconnected or
> controller rebooted, then it is possible that hard errors could be counted.
> FMA usually waits until several errors have been reported over a period of
> time before reporting a fault.

Speaking of that, is there a place where one can see or change these thresholds?

--
Giovanni Tirloni
gtirloni at sysdroid.com
On 17-7-2010 15:49, Bob Friesenhahn wrote:

> On Sat, 17 Jul 2010, Bruno Sousa wrote:
>> Jul 15 12:30:48 storage01 SOURCE: eft, REV: 1.16
>> Jul 15 12:30:48 storage01 EVENT-ID: 859b9d9c-1214-4302-8089-b9447619a2a1
>> Jul 15 12:30:48 storage01 DESC: The command was terminated with a
>> non-recovered error condition that may have been caused by a flaw in the
>> media or an error in the recorded data.
>
> This sounds like a hard error to me. I suggest using 'iostat -xe' to
> check the hard error counts and check the system log files. If your
> storage array was undergoing maintenance and had a cable temporarily
> disconnected or controller rebooted, then it is possible that hard
> errors could be counted. FMA usually waits until several errors have
> been reported over a period of time before reporting a fault.
>
> Bob

Thanks for the tip. I have already set up a small cron job to run the iostat -xe command every hour. These messages occurred during a system backup, so maybe some components are failing under heavy load... Oh well, let's just wait and see if something changes.

Once again, thanks for your time.
Bruno
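
An hourly snapshot is most useful when consecutive readings are compared, since a counter that stays at 50 tells a different story from one that keeps climbing. Below is a rough sketch of that comparison in Python. It assumes each snapshot has already been reduced to a mapping of device name to hard-error count; the function name is hypothetical, and the second reading (57) is invented purely for illustration. Only the value 50 comes from the thread.

```python
def hard_error_deltas(previous, current):
    """Return {device: increase} for devices whose hard-error count grew.

    Counts that stayed flat or went down (for example after a reboot
    reset the counters) are not reported.
    """
    deltas = {}
    for device, count in current.items():
        increase = count - previous.get(device, 0)
        if increase > 0:
            deltas[device] = increase
    return deltas


# Hypothetical hourly snapshots: device name -> hard error count.
# 50 is the value reported in the thread; 57 is made up for this example.
earlier = {"c0t5000C50019F03AF6d0": 50}
later = {"c0t5000C50019F03AF6d0": 57}

print(hard_error_deltas(earlier, later))
```

A device that first appears in the later snapshot is compared against zero, so any errors it already carries are reported as an increase; whether that is desirable depends on how the snapshots are collected.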