thr3ads.net - zfs discuss - [zfs-discuss] fmadm warnings about media erros [Jul 2010]

Bruno Sousa

2010-Jul-17 09:25 UTC

[zfs-discuss] fmadm warnings about media erros

Hi all,

Today I noticed that one of the ZFS-based servers within my company is
complaining about disk errors, and I would like to know whether this is a
real physical error or something like a transport error.
The server in question runs snv_134 and is attached to 2 J4400 JBODs; the
head node has 2 HBAs and I've enabled multipath support.
The server uses 1TB enterprise-class SATA disks.

The messages seen in the system are :

Jul 15 12:30:48 storage01 fmd: [ID 377184 daemon.error] SUNW-MSG-ID: DISK-8000-4Q, TYPE: Fault, VER: 1, SEVERITY: Critical
Jul 15 12:30:48 storage01 EVENT-TIME: Thu Jul 15 12:30:48 CEST 2010
Jul 15 12:30:48 storage01 PLATFORM: PowerEdge-R710, CSN: HR9SG9J, HOSTNAME: storage01
Jul 15 12:30:48 storage01 SOURCE: eft, REV: 1.16
Jul 15 12:30:48 storage01 EVENT-ID: 859b9d9c-1214-4302-8089-b9447619a2a1
Jul 15 12:30:48 storage01 DESC: The command was terminated with a non-recovered error condition that may have been caused by a flaw in the media or an error in the recorded data.
Jul 15 12:30:48 storage01   Refer to http://sun.com/msg/DISK-8000-4Q for more information.
Jul 15 12:30:48 storage01 AUTO-RESPONSE: The device may be offlined or degraded.
Jul 15 12:30:48 storage01 IMPACT: It is likely that continued operation will result in data corruption, which may eventually cause the loss of service or the service degradation.
Jul 15 12:30:48 storage01 REC-ACTION: Schedule a repair procedure to replace the affected device. Use 'fmadm faulty' to find the affected disk.
Jul 15 12:30:48 storage01 genunix: [ID 846333 kern.warning] WARNING: constraints forbid retire: /scsi_vhci/disk@g5000c50019f03af6

/usr/local/sbin/smartctl -xa -d scsi /dev/rdsk/c0t5000C50019F03AF6d0
smartctl 5.39.1 2010-01-28 r3054 [i386-pc-solaris2.11] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

Serial number: 9QJ60QFG
Device type: disk
Local Time is: Sat Jul 17 11:13:00 2010 CEST
Device supports SMART and is Enabled
Temperature Warning Disabled or Not Supported
SMART Health Status: OK

Current Drive Temperature:     28 C

Error Counter logging not supported

[GLTSD (Global Logging Target Save Disable) set. Enable Save with '-S on']
No self-tests have been logged
Long (extended) Self Test duration: 13800 seconds [230.0 minutes]
Device does not support Background scan results logging
scsiPrintSasPhy Log Sense Failed [unsupported field in scsi command]

iostat -En | grep c0t5000C50019F03AF6
c0t5000C50019F03AF6d0 Soft Errors: 0 Hard Errors: 50 Transport Errors: 0


iostat -en | grep c0t5000C50019F03AF6
  ---- errors ---
  s/w h/w trn tot device
    0  50   0  50 c0t5000C50019F03AF6d0


So I'm confused, because S.M.A.R.T. reports no errors, yet iostat
reports 50 hard errors...
Given this, should I already start replacing the disk in the affected
pool and then request an RMA from the disk vendor?

Thanks for all your time,
Bruno
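
The `iostat -En` summary line quoted in this post can be checked mechanically across many disks. Below is a minimal sketch in Python, assuming the line keeps exactly the "Soft Errors / Hard Errors / Transport Errors" layout shown above. The function names are hypothetical and not part of any Solaris tool, and the sketch only reports counters; it does not decide whether a disk should be replaced.

```python
import re

# Matches the per-device summary line printed by `iostat -En`,
# in the layout quoted in this thread (an assumption about the format).
SUMMARY_RE = re.compile(
    r"^(?P<device>\S+)\s+Soft Errors:\s*(?P<soft>\d+)\s+"
    r"Hard Errors:\s*(?P<hard>\d+)\s+Transport Errors:\s*(?P<transport>\d+)"
)


def parse_error_summary(text):
    """Return {device: {"soft": n, "hard": n, "transport": n}} from iostat -En text."""
    counters = {}
    for line in text.splitlines():
        match = SUMMARY_RE.match(line.strip())
        if match:
            counters[match.group("device")] = {
                kind: int(match.group(kind))
                for kind in ("soft", "hard", "transport")
            }
    return counters


def devices_with_hard_errors(counters):
    """Return the sorted names of devices whose hard-error counter is non-zero."""
    return sorted(dev for dev, c in counters.items() if c["hard"] > 0)


if __name__ == "__main__":
    sample = (
        "c0t5000C50019F03AF6d0 Soft Errors: 0 Hard Errors: 50 "
        "Transport Errors: 0\n"
    )
    parsed = parse_error_summary(sample)
    for device in devices_with_hard_errors(parsed):
        print(device, parsed[device])
```

Feeding it the full `iostat -En` output works as well, since lines that do not match the summary layout are skipped.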

Bob Friesenhahn

2010-Jul-17 13:49 UTC

On Sat, 17 Jul 2010, Bruno Sousa wrote:
> Jul 15 12:30:48 storage01 SOURCE: eft, REV: 1.16
> Jul 15 12:30:48 storage01 EVENT-ID: 859b9d9c-1214-4302-8089-b9447619a2a1
> Jul 15 12:30:48 storage01 DESC: The command was terminated with a
> non-recovered error condition that may have been caused by a flaw in the
> media or an error in the recorded data.

This sounds like a hard error to me.  I suggest using 'iostat -xe' to
check the hard error counts and check the system log files.  If your
storage array was undergoing maintenance and had a cable temporarily 
disconnected or controller rebooted, then it is possible that hard 
errors could be counted.  FMA usually waits until several errors have 
been reported over a period of time before reporting a fault.

Bob
-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
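
The "several errors over a period of time" behaviour described in this reply matches the general idea of the SERD (Soft Error Rate Discrimination) engines used in FMA diagnosis rules, which are parameterised by an event count N and a time window T. The sketch below illustrates that idea in Python and is not FMA's implementation: the class name, the "at least N" comparison, and the values N=3 and T=60 seconds are assumptions chosen for illustration, not the thresholds used for disk faults.

```python
from collections import deque


class ErrorRateWindow:
    """Trip when at least `n` events fall within any window of `t` seconds."""

    def __init__(self, n, t):
        self.n = n
        self.t = t
        self.events = deque()  # timestamps of recent events, oldest first

    def record(self, timestamp):
        """Record one error at `timestamp` (seconds); return True if tripped."""
        self.events.append(timestamp)
        # Discard events that are older than the window ending now.
        while self.events and timestamp - self.events[0] > self.t:
            self.events.popleft()
        return len(self.events) >= self.n


if __name__ == "__main__":
    window = ErrorRateWindow(n=3, t=60)  # illustrative values only
    for ts in (0, 100, 150, 155):
        print(ts, window.record(ts))
```

With these illustrative values, isolated errors spread far apart never trip the window, while a burst of three within a minute does. That is why a single cable pull may or may not produce a fault report, depending on how many errors it generates.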

Giovanni Tirloni

2010-Jul-17 15:14 UTC

On Sat, Jul 17, 2010 at 10:49 AM, Bob Friesenhahn
<bfriesen at simple.dallas.tx.us> wrote:
> On Sat, 17 Jul 2010, Bruno Sousa wrote:
>>
>> Jul 15 12:30:48 storage01 SOURCE: eft, REV: 1.16
>> Jul 15 12:30:48 storage01 EVENT-ID: 859b9d9c-1214-4302-8089-b9447619a2a1
>> Jul 15 12:30:48 storage01 DESC: The command was terminated with a
>> non-recovered error condition that may have been caused by a flaw in the
>> media or an error in the recorded data.
>
> This sounds like a hard error to me. I suggest using 'iostat -xe' to check
> the hard error counts and check the system log files. If your storage array
> was undergoing maintenance and had a cable temporarily disconnected or
> controller rebooted, then it is possible that hard errors could be counted.
> FMA usually waits until several errors have been reported over a period of
> time before reporting a fault.

Speaking of that, is there a place where one can see/change these thresholds?

-- 
Giovanni Tirloni
gtirloni at sysdroid.com

Bruno Sousa

2010-Jul-17 15:17 UTC

On 17-7-2010 15:49, Bob Friesenhahn wrote:
> On Sat, 17 Jul 2010, Bruno Sousa wrote:
>> Jul 15 12:30:48 storage01 SOURCE: eft, REV: 1.16
>> Jul 15 12:30:48 storage01 EVENT-ID: 859b9d9c-1214-4302-8089-b9447619a2a1
>> Jul 15 12:30:48 storage01 DESC: The command was terminated with a
>> non-recovered error condition that may have been caused by a flaw in the
>> media or an error in the recorded data.
>
> This sounds like a hard error to me.  I suggest using 'iostat -xe' to
> check the hard error counts and check the system log files.  If your
> storage array was undergoing maintenance and had a cable temporarily
> disconnected or controller rebooted, then it is possible that hard
> errors could be counted.  FMA usually waits until several errors have
> been reported over a period of time before reporting a fault.
>
> Bob

Thanks for the tip. I have already set up a small cron job to run the
iostat -xe command every hour. These messages occurred during a system
backup, so maybe some components are failing under heavy load...
Oh well, let's just wait and see if something changes.

Once again, thanks for your time.

Bruno
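
An hourly snapshot of the error counters is mostly useful for spotting whether they keep growing. The Python sketch below compares two such snapshots, assuming each has already been reduced to a mapping of device name to hard-error count. The function name and the second device name in the example are hypothetical. A counter that goes down between snapshots (for example, if the counters start from zero again after a reboot) is ignored here rather than reported.

```python
def hard_error_increases(previous, current):
    """Return {device: increase} for devices whose hard-error count grew.

    `previous` and `current` map device names to hard-error counts taken
    from two successive snapshots. Devices missing from `previous` are
    treated as having started at zero.
    """
    increases = {}
    for device, count in current.items():
        delta = count - previous.get(device, 0)
        if delta > 0:
            increases[device] = delta
    return increases


if __name__ == "__main__":
    earlier = {"c0t5000C50019F03AF6d0": 50}
    # "c0tEXAMPLEd0" is a made-up device name used only for illustration.
    later = {"c0t5000C50019F03AF6d0": 53, "c0tEXAMPLEd0": 0}
    print(hard_error_increases(earlier, later))
```

A count that stays at 50 over many snapshots suggests a one-off event, whereas a steadily rising count points to an ongoing problem.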
