thr3ads.net - zfs discuss - [zfs-discuss] fmadm faulty not showing faulty/offline disks? [Feb 2011]

Krunal Desai

2011-Feb-01 16:55 UTC

[zfs-discuss] fmadm faulty not showing faulty/offline disks?

I recently discovered a drive failure (either that or a loose cable, I
need to investigate further) on my home fileserver. 'fmadm faulty'
returns no output, but I can clearly see a failure when I do zpool
status -v:

pool: tank
state: DEGRADED
status: One or more devices has been removed by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
scan: scrub canceled on Tue Feb  1 11:51:58 2011
config:

        NAME         STATE     READ WRITE CKSUM
        tank         DEGRADED     0     0     0
          raidz2-0   DEGRADED     0     0     0
            c10t0d0  ONLINE       0     0     0
            c10t1d0  ONLINE       0     0     0
            c10t2d0  ONLINE       0     0     0
            c10t3d0  REMOVED      0     0     0
            c10t4d0  ONLINE       0     0     0
            c10t5d0  ONLINE       0     0     0
            c10t6d0  ONLINE       0     0     0
            c10t7d0  ONLINE       0     0     0

In dmesg, I see:
Feb  1 11:14:33 megatron scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,2e21@1/pci15d9,a580@0/sd@3,0 (sd8):
Feb  1 11:14:33 megatron        Command failed to complete...Device is gone

I never had any problems with these drives + mpt in snv_134 (I'm on snv_151a
now); the only change was adding a second 1068E-IT that's currently
unpopulated with drives. But more importantly, I guess: why can't I see
this failure in fmadm, and how would I go about automatically
dispatching an e-mail to myself when stuff like this happens? Does a pool
going degraded not count as a failure?

-- 
--khd
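
One way to approach the e-mail question without relying on FMA is to inspect the `zpool status` text directly. The following is only a minimal sketch: the function name `unhealthy_vdevs` and the shortened `SAMPLE` text are illustrative assumptions, and the parser assumes the config table layout shown in the output above (a `NAME STATE ...` header row followed by one row per vdev).

```python
# Sketch: list every row of a `zpool status` config table whose state
# is not ONLINE. Names here are hypothetical, not part of any ZFS tool.

HEALTHY = {"ONLINE"}

def unhealthy_vdevs(status_text):
    """Return (name, state) for each config row whose state is not ONLINE."""
    rows = []
    in_config = False
    for line in status_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("NAME") and "STATE" in stripped:
            in_config = True          # header row of the config table
            continue
        if not in_config:
            continue
        if not stripped:
            break                     # a blank line ends the config table
        fields = stripped.split()
        if len(fields) >= 2 and fields[1] not in HEALTHY:
            rows.append((fields[0], fields[1]))
    return rows

# Shortened copy of the output quoted above.
SAMPLE = """\
config:

        NAME         STATE     READ WRITE CKSUM
        tank         DEGRADED     0     0     0
          raidz2-0   DEGRADED     0     0     0
            c10t2d0  ONLINE       0     0     0
            c10t3d0  REMOVED      0     0     0
            c10t4d0  ONLINE       0     0     0
"""

if __name__ == "__main__":
    for name, state in unhealthy_vdevs(SAMPLE):
        print(f"{name}: {state}")
```

Run periodically (for example from cron) against real `zpool status` output, a non-empty result could be handed to whatever mail command the system provides.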

Cindy Swearingen

2011-Feb-01 18:29 UTC

Hi Krunal,

It looks to me like FMA thinks that you removed the disk, so you'll need
to confirm whether the cable dropped or something else happened.

I agree that we need to get email updates for failing devices.

See if fmdump generated an error report using the commands below.

Thanks,

Cindy

# fmdump
TIME                 UUID                                 SUNW-MSG-ID EVENT
Jan 07 14:01:14.7839 04ee736a-b2cb-612f-ce5e-a0e43d666762 ZFS-8000-GH Diagnosed
Jan 13 10:34:32.2301 04ee736a-b2cb-612f-ce5e-a0e43d666762 FMD-8000-58 Updated

Then, review the contents:

# fmdump -u 04ee736a-b2cb-612f-ce5e-a0e43d666762 -v
TIME                 UUID                                 SUNW-MSG-ID EVENT
Jan 07 14:01:14.7839 04ee736a-b2cb-612f-ce5e-a0e43d666762 ZFS-8000-GH Diagnosed
   100%  fault.fs.zfs.vdev.checksum

         Problem in: zfs://pool=c4538d8607c1e030/vdev=7954b2ff7a8383
            Affects: zfs://pool=c4538d8607c1e030/vdev=7954b2ff7a8383
                FRU: -
           Location: -

Jan 13 10:34:32.2301 04ee736a-b2cb-612f-ce5e-a0e43d666762 FMD-8000-58 Updated
   100%  fault.fs.zfs.vdev.checksum

         Problem in: zfs://pool=c4538d8607c1e030/vdev=7954b2ff7a8383
            Affects: zfs://pool=c4538d8607c1e030/vdev=7954b2ff7a8383
                FRU: -
           Location: -

Thanks,

Cindy



Krunal Desai

2011-Feb-01 22:47 UTC

On Tue, Feb 1, 2011 at 1:29 PM, Cindy Swearingen
<cindy.swearingen at oracle.com> wrote:
> I agree that we need to get email updates for failing devices.

Definitely!

> See if fmdump generated an error report using the commands below.

Unfortunately not, see below:

movax@megatron:/root# fmdump
TIME                 UUID                                 SUNW-MSG-ID EVENT
fmdump: warning: /var/fm/fmd/fltlog is empty

--khd

Cindy Swearingen

2011-Feb-01 23:11 UTC

I misspoke and should clarify:

1. fmdump identifies fault reports that explain system issues

2. fmdump -eV identifies errors or problem symptoms

I'm unclear about your REMOVED status. I don't see it very often.

The ZFS Admin Guide says:

REMOVED

The device was physically removed while the system was running. Device 
removal detection is hardware-dependent and might not be supported on 
all platforms.

I need to check if FMA generally reports on devices that are REMOVED
by the administrator, as ZFS seems to think in this case.

Thanks,

Cindy

Krunal Desai

2011-Feb-02 01:52 UTC

On Tue, Feb 1, 2011 at 6:11 PM, Cindy Swearingen
<cindy.swearingen at oracle.com> wrote:
> I misspoke and should clarify:
>
> 1. fmdump identifies fault reports that explain system issues
>
> 2. fmdump -eV identifies errors or problem symptoms

Gotcha; fmdump -eV gives me the information I need. It appears to have
been a loose cable. I'm hitting the machine with some heavy I/O load,
the pool has resilvered itself, and the drive has not dropped out.

SMART status was reported healthy as well (got smartctl kind of
working), but I cannot read the full SMART data of my disks behind the
1068E, due to limitations of smartmontools I guess (e.g. 'smartctl -d
scsi -a /dev/rdsk/c10t0d0' gives me the serial number, model, and just a
generic 'SMART Ok'). I assume that SUNWhd is licensed only for use on
the X4500 Thumper and family? I'd like to see if it works with the
1068E.

It's getting kind of tempting for me to investigate doing a run of
boards that run Marvell 88SX6081s behind a PLX PCIe <-> PCI-X bridge.
They should have beyond excellent support, seeing as that is what the
X4500 uses to run its SATA ports.

Richard Elling

2011-Feb-02 02:45 UTC

On Feb 1, 2011, at 5:52 PM, Krunal Desai wrote:
> On Tue, Feb 1, 2011 at 6:11 PM, Cindy Swearingen
> <cindy.swearingen at oracle.com> wrote:
>> I misspoke and should clarify:
>>
>> 1. fmdump identifies fault reports that explain system issues
>>
>> 2. fmdump -eV identifies errors or problem symptoms
>
> Gotcha; fmdump -eV gives me the information I need. It appears to have
> been a loose cable, I'm hitting the machine with some heavy I/O load,
> and the pool resilvered itself, drive has not dropped out.

The output of fmdump is explicit. I am interested to know if you saw
aborts and timeouts or some other errors.

> SMART status was reported healthy as well (got smartctl kind of
> working), but I cannot read the SMART data of my disks behind the
> 1068E due to limitations of smartmontools I guess. (e.g. 'smartctl -d
> scsi -a /dev/rdsk/c10t0d0' gives me serial #, model, and just a
> generic 'SMART Ok'). I assume that SUNWhd is licensed only for use on
> the X4500 Thumper and family? I'd like to see if it works with the
> 1068E.

The open-source version of smartmontools seems to be slightly out
of date and somewhat finicky. Does anyone know of a better SMART
implementation?

> It's getting kind of tempting for me to investigate doing a run of
> boards that run Marvell 88SX6081s behind a PLX PCIe <-> PCI-X bridge.
> They should have beyond excellent support seeing as that is what the
> X4500 uses to run its SATA ports.

Nice idea, except that the X4500 was EOL years ago and the replacement,
X4540, uses LSI HBAs. I think you will find better Solaris support for the LSI
chipsets because Oracle's Sun products use them from the top (M9000) all
the way down the product line.
 -- richard

Krunal Desai

2011-Feb-02 02:49 UTC

> The output of fmdump is explicit. I am interested to know if you saw
> aborts and timeouts or some other errors.

I have the machine off at the moment while I install new disks (18x ST32000542AS), but
IIRC they appeared as transport errors (scsi.<something>.transport; I can
paste the exact errors in a little bit): a slew of transfer/soft errors followed
by the drive disappearing. I assume that my HBA took it offline, and the mpt driver
reported that to the OS as an admin disconnecting the drive, not as a "failure"
per se.

> The open-source version of smartmontools seems to be slightly out
> of date and somewhat finicky. Does anyone know of a better SMART
> implementation?

That SUNWhd I mentioned seemed interesting, but I assume licensing means I can
only get that if I purchase Sun hardware.

> Nice idea, except that the X4500 was EOL years ago and the replacement,
> X4540, uses LSI HBAs. I think you will find better Solaris support for the LSI
> chipsets because Oracle's Sun products use them from the top (M9000) all
> the way down the product line.

Oops, forgot that the X4500s are actually kind of "old". I'll
have to look up what LSI controllers the newer models are using (the LSI 2xx8
something IIRC? Will have to Google).

--khd

Richard Elling

2011-Feb-02 04:34 UTC

On Feb 1, 2011, at 6:49 PM, Krunal Desai wrote:
>> The output of fmdump is explicit. I am interested to know if you saw
>> aborts and timeouts or some other errors.
>
> I have the machine off atm while I install new disks (18x ST32000542AS),
> but IIRC they appeared as transport errors (scsi.<something>.transport, I
> can paste the exact errors in a little bit). A slew of transfer/soft errors
> followed by the drive disappearing. I assume that my HBA took it offline, and
> mpt driver reported that to the OS as an admin disconnecting, not as a
> "failure" per se.

There is a failure going on here. It could be a cable or it could be a bad
disk or firmware. The actual fault might not be in the disk reporting the errors (!)
It is not a media error.

>> The open-source version of smartmontools seems to be slightly out
>> of date and somewhat finicky. Does anyone know of a better SMART
>> implementation?
>
> That SUNWhd I mentioned seemed interesting, but I assume licensing means I
> can only get that if I purchase Sun hardware.
>
>> Nice idea, except that the X4500 was EOL years ago and the replacement,
>> X4540, uses LSI HBAs. I think you will find better Solaris support for the LSI
>> chipsets because Oracle's Sun products use them from the top (M9000) all
>> the way down the product line.
>
> Oops, forgot that the X4500s are actually kind of "old".
> I'll have to look up what LSI controllers the newer models are using
> (the LSI 2xx8 something IIRC? Will have to Google).

No, they aren't that new. The LSI 2008 are 6 Gbps HBAs and the older 1064/1068
series are 3 Gbps.
 -- richard

Krunal Desai

2011-Feb-02 04:54 UTC

On Tue, Feb 1, 2011 at 11:34 PM, Richard Elling
<richard.elling at gmail.com> wrote:
> There is a failure going on here. It could be a cable or it could be a bad
> disk or firmware. The actual fault might not be in the disk reporting the
> errors (!)
> It is not a media error.

Errors were as follows:
Feb 01 19:33:01.3665 ereport.io.scsi.cmd.disk.recovered    0x269213b01d700401
Feb 01 19:33:01.3665 ereport.io.scsi.cmd.disk.recovered    0x269213b01d700401
Feb 01 19:33:01.3665 ereport.io.scsi.cmd.disk.recovered    0x269213b01d700401
Feb 01 19:33:04.9969 ereport.io.scsi.cmd.disk.tran         0x269f99ef0b300401
Feb 01 19:33:04.9970 ereport.io.scsi.cmd.disk.tran         0x269f9a165a400401

Verbose of a message:
Feb 01 2011 19:33:04.996932283 ereport.io.scsi.cmd.disk.tran
nvlist version: 0
        class = ereport.io.scsi.cmd.disk.tran
        ena = 0x269f99ef0b300401
        detector = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = dev
                device-path = /pci@0,0/pci8086,2e21@1/pci15d9,a580@0/sd@3,0
        (end detector)

        devid = id1,sd@n5000c50010ed6a31
        driver-assessment = fail
        op-code = 0x0
        cdb = 0x0 0x0 0x0 0x0 0x0 0x0
        pkt-reason = 0x18
        pkt-state = 0x1
        pkt-stats = 0x0
        __ttl = 0x1
        __tod = 0x4d48a640 0x3b6bfabb

It was a cable error, but why didn't fault management tell me about
it? What do you mean by "The actual fault might not be in the disk
reporting the errors (!) It is not a media error."? Fault might be sourcing from my SATA
controller or something possibly?
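
When the error log holds more than a handful of entries, a tally per ereport class makes the pattern (recovered commands followed by transport failures) easier to see. This is a rough sketch only: `tally_ereports` is a hypothetical helper, and it assumes the one-line format shown above, where the class is the first field beginning with `ereport.`.

```python
from collections import Counter

def tally_ereports(fmdump_e_text):
    """Count ereport classes in one-line-per-event error log text."""
    counts = Counter()
    for line in fmdump_e_text.splitlines():
        for field in line.split():
            if field.startswith("ereport."):
                counts[field] += 1
                break                 # at most one class per line
    return counts

# The five events quoted in the message above.
SAMPLE = """\
Feb 01 19:33:01.3665 ereport.io.scsi.cmd.disk.recovered    0x269213b01d700401
Feb 01 19:33:01.3665 ereport.io.scsi.cmd.disk.recovered    0x269213b01d700401
Feb 01 19:33:01.3665 ereport.io.scsi.cmd.disk.recovered    0x269213b01d700401
Feb 01 19:33:04.9969 ereport.io.scsi.cmd.disk.tran         0x269f99ef0b300401
Feb 01 19:33:04.9970 ereport.io.scsi.cmd.disk.tran         0x269f9a165a400401
"""

if __name__ == "__main__":
    for cls, n in tally_ereports(SAMPLE).most_common():
        print(f"{n:4d}  {cls}")
```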

Richard Elling

2011-Feb-02 14:23 UTC

On Feb 1, 2011, at 8:54 PM, Krunal Desai wrote:
> On Tue, Feb 1, 2011 at 11:34 PM, Richard Elling
> <richard.elling at gmail.com> wrote:
>> There is a failure going on here.  It could be a cable or it could be a bad
>> disk or firmware. The actual fault might not be in the disk reporting the errors (!)
>> It is not a media error.
>
> Errors were as follows:
> Feb 01 19:33:01.3665 ereport.io.scsi.cmd.disk.recovered    0x269213b01d700401
> Feb 01 19:33:01.3665 ereport.io.scsi.cmd.disk.recovered    0x269213b01d700401
> Feb 01 19:33:01.3665 ereport.io.scsi.cmd.disk.recovered    0x269213b01d700401
> Feb 01 19:33:04.9969 ereport.io.scsi.cmd.disk.tran         0x269f99ef0b300401
> Feb 01 19:33:04.9970 ereport.io.scsi.cmd.disk.tran         0x269f9a165a400401
>
> Verbose of a message:
> Feb 01 2011 19:33:04.996932283 ereport.io.scsi.cmd.disk.tran
> nvlist version: 0
>        class = ereport.io.scsi.cmd.disk.tran
>        ena = 0x269f99ef0b300401
>        detector = (embedded nvlist)
>        nvlist version: 0
>                version = 0x0
>                scheme = dev
>                device-path = /pci@0,0/pci8086,2e21@1/pci15d9,a580@0/sd@3,0
>        (end detector)
>
>        devid = id1,sd@n5000c50010ed6a31
>        driver-assessment = fail
>        op-code = 0x0
>        cdb = 0x0 0x0 0x0 0x0 0x0 0x0
>        pkt-reason = 0x18

This error code means the device is gone.

>        pkt-state = 0x1

The command got the bus, but could not access the target.

>        pkt-stats = 0x0
>        __ttl = 0x1
>        __tod = 0x4d48a640 0x3b6bfabb
>
> It was a cable error, but why didn't fault management tell me about
> it? What do you mean by "The actual fault might not be in the disk
> reporting the errors (!)
> It is not a media error."? Fault might be sourcing from my SATA
> controller or something possibly?

Possibly.
 -- richard
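
The reading of `pkt-state` above can be reproduced by treating the value as a bit mask, and the two `__tod` words can be decoded as seconds and nanoseconds since the epoch. The sketch below is an illustration under assumptions: the flag table is written from memory of the scsi_pkt(9S) state flags and should be checked against the system's own headers, and both function names are hypothetical. No `pkt-reason` table is included, since only the meaning of 0x18 is given in this thread.

```python
from datetime import datetime, timezone

# Assumed pkt_state flag values (recalled from scsi_pkt(9S); verify locally).
PKT_STATE_FLAGS = [
    (0x01, "STATE_GOT_BUS"),
    (0x02, "STATE_GOT_TARGET"),
    (0x04, "STATE_SENT_CMD"),
    (0x08, "STATE_XFERRED_DATA"),
    (0x10, "STATE_GOT_STATUS"),
    (0x20, "STATE_ARQ_DONE"),
]

def decode_pkt_state(value):
    """Return the names of the flags set in a pkt-state value."""
    return [name for bit, name in PKT_STATE_FLAGS if value & bit]

def decode_tod(sec_hex, nsec_hex):
    """Decode the two __tod words into (UTC datetime, nanoseconds)."""
    seconds = int(sec_hex, 16)        # int() accepts the 0x prefix with base 16
    nanos = int(nsec_hex, 16)
    return datetime.fromtimestamp(seconds, tz=timezone.utc), nanos

if __name__ == "__main__":
    print(decode_pkt_state(0x1))      # only the bus was acquired
    when, nanos = decode_tod("0x4d48a640", "0x3b6bfabb")
    print(when.isoformat(), nanos)
```

For the ereport quoted above, the decoded time is 00:33:04 UTC on 2 February 2011 with 996932283 ns, which agrees with the 19:33:04.996932283 EST stamp in the verbose header.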

Oyvind Syljuasen

2011-Feb-02 16:59 UTC

> I agree that we need to get email updates for failing
> devices.
> 
If FMA discovers it, email can be sent, at least in Solaris Express 11;
http://blogs.sun.com/robj/entry/fma_and_email_notifications

br,
syljua
-- 
This message posted from opensolaris.org

Carson Gaspar

2011-Feb-03 01:38 UTC

On 2/1/11 5:52 PM, Krunal Desai wrote:
> SMART status was reported healthy as well (got smartctl kind of
> working), but I cannot read the SMART data of my disks behind the
> 1068E due to limitations of smartmontools I guess. (e.g. 'smartctl -d
> scsi -a /dev/rdsk/c10t0d0' gives me serial #, model, and just a
> generic 'SMART Ok'). I assume that SUNWhd is licensed only for use on
> the X4500 Thumper and family? I'd like to see if it works with the
> 1068E.

Works For Me (TM).

c7t0d0 is hanging off an LSI SAS3081E-R (SAS1068E chip) rev B3 MPT rev
105 Firmware rev 011d0000 (1.29.00.00) (IT FW)

This is a SATA disk - I don't have any SAS disks behind a LSI1068E to
test.

# uname -a
SunOS gandalf.taltos.org 5.11 snv_151a i86pc i386 i86pc

# /usr/local/sbin/smartctl -H -i -d sat /dev/rdsk/c7t0d0
smartctl 5.40 2010-10-16 r3189 [i386-pc-solaris2.11] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.11 family
Device Model:     ST31500341AS
Serial Number:    9VS4HDYH
Firmware Version: CC1H
User Capacity:    1,500,301,910,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Wed Feb  2 17:37:56 2011 PST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

Krunal Desai

2011-Feb-03 01:43 UTC

> This error code means the device is gone.
> The command got the bus, but could not access the target.
Thanks for that!

I updated firmware on both of my USAS-L8i (LSI1068E based), and while
controller numbering has shifted around in Solaris (went from c10/c11
to c11/c12, not a big deal I think), suddenly smartctl is able to
pull temperatures. Can't get a full SMART listing, but temperatures
are working now. Oddly enough, my second LSI controller has skipped
c12t0d0 and starts numbering from c12t1d0 onwards. It's a
good thing that ZFS can figure out what is what, but it will make
configuring power management tricky.

I'll post in pm-discuss about the kernel panics I was getting after
enabling drive power management.

-- 
--khd

Krunal Desai

2011-Feb-03 01:47 UTC

> # uname -a
> SunOS gandalf.taltos.org 5.11 snv_151a i86pc i386 i86pc

movax@megatron:~# uname -a
SunOS megatron 5.11 snv_151a i86pc i386 i86pc

> # /usr/local/sbin/smartctl -H -i -d sat /dev/rdsk/c7t0d0
> smartctl 5.40 2010-10-16 r3189 [i386-pc-solaris2.11] (local build)
> Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

Fails for me; my version does not recognize the 'sat' option. I've
been using -d scsi:

movax@megatron:~# smartctl -h
smartctl version 5.36 [i386-pc-solaris2.8] Copyright (C) 2002-6 Bruce Allen

but,

movax@megatron:~# smartctl -a -d scsi /dev/rdsk/c11t0d0
smartctl version 5.36 [i386-pc-solaris2.8] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

Device: ATA      ST31500341AS     Version: CC1H
Serial number:             9VS14DJD
Device type: disk
Local Time is: Wed Feb  2 20:45:00 2011 EST
Device supports SMART and is Enabled
Temperature Warning Disabled or Not Supported
SMART Health Status: OK

Current Drive Temperature:     49 C

Error Counter logging not supported
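
Even this limited `-d scsi` output carries the two fields most useful for monitoring. A small sketch of extracting them follows; `parse_scsi_smart` is a hypothetical helper, and the regular expressions assume exactly the field labels printed by smartctl 5.36 in the listing above (other versions or `-d sat` output use different labels).

```python
import re

def parse_scsi_smart(text):
    """Pull health status and temperature (deg C) out of `smartctl -a -d scsi` text."""
    health = re.search(r"^SMART Health Status:\s*(\S+)", text, re.MULTILINE)
    temp = re.search(r"^Current Drive Temperature:\s*(\d+)\s*C", text, re.MULTILINE)
    return {
        "health": health.group(1) if health else None,
        "temperature_c": int(temp.group(1)) if temp else None,
    }

# Excerpt of the listing above.
SAMPLE = """\
Device: ATA      ST31500341AS     Version: CC1H
Device supports SMART and is Enabled
Temperature Warning Disabled or Not Supported
SMART Health Status: OK

Current Drive Temperature:     49 C
"""

if __name__ == "__main__":
    print(parse_scsi_smart(SAMPLE))
```

Missing fields come back as `None` rather than raising, so a caller can tell "not reported" apart from a bad value.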

Carson Gaspar

2011-Feb-03 01:57 UTC

On 2/2/11 5:43 PM, Krunal Desai wrote:
> I updated firmware on both of my USAS-L8i (LSI1068E based), and while
> controller numbering has shifted around in Solaris (went from c10/c11
> to c11/c12, not a big deal I think), suddenly smartctl is able to
> pull temperatures. Can't get a full SMART listing, but temperatures
> are going now. Oddly enough, my second LSI controller has skipped
> c12t0d0 and jumped straight from number c12t1d0 and onwards. It's a
> good thing that ZFS can figure out what is what, but it will make
> configuring power management tricky.

Re: d1 vs d0, you probably want to disable persistent mappings on your
1068E. Using lsiutil, it's option 15 in "expert" mode, then option 12 in
the sub-menu. Or it could be something else ;-)

-- 
Carson

Carson Gaspar

2011-Feb-03 01:59 UTC

On 2/2/11 5:47 PM, Krunal Desai wrote:
> Fails for me, my version does not recognize the 'sat' option. I've
> been using -d scsi:
>
> movax@megatron:~# smartctl -h
> smartctl version 5.36 [i386-pc-solaris2.8] Copyright (C) 2002-6 Bruce Allen

So build the current version of smartmontools. As you should have seen
in my original response, I'm using 5.40. Bugs in 5.36 are unlikely to be
interesting to the maintainers of the package ;-)

-- 
Carson

Krunal Desai

2011-Feb-03 02:05 UTC

> So build the current version of smartmontools. As you should have seen in
> my original response, I'm using 5.40. Bugs in 5.36 are unlikely to be
> interesting to the maintainers of the package ;-)

Oops, missed that in your log. Will try compiling from source and see what
happens.

Also, recently it seems like all the links to tools I need are broken. Where can
I find a lsiutil binary for Solaris?

--khd

Eric D. Mudama

2011-Feb-03 02:17 UTC

On Wed, Feb  2 at 21:05, Krunal Desai wrote:
>> So build the current version of smartmontools. As you should have seen
>> in my original response, I'm using 5.40. Bugs in 5.36 are unlikely to
>> be interesting to the maintainers of the package ;-)
>
> Oops, missed that in your log. Will try compiling from source and see what
> happens.
>
> Also, recently it seems like all the links to tools I need are broken. Where
> can I find a lsiutil binary for Solaris?

If you search for 'lsiutil solaris' on lsi.com, it'll direct you to
a zipfile that includes a binary for x86 Solaris.

At home now so can't test it.

-- 
Eric D. Mudama
edmudama at bounceswoosh.org

Krunal Desai

2011-Feb-03 02:20 UTC

> If you search for 'lsiutil solaris' on lsi.com, it'll direct you to
> zipfile that includes a solaris binary for x86 solaris.

Yep, that worked, grabbed it off some other adapter's page. Thanks!

Richard Elling

2011-Feb-03 03:31 UTC

On Feb 2, 2011, at 8:59 AM, Oyvind Syljuasen wrote:
>> I agree that we need to get email updates for failing
>> devices.
>> 
> 
> If FMA discovers it, email can be sent, at least in Solaris Express 11;
> http://blogs.sun.com/robj/entry/fma_and_email_notifications
For NexentaStor we have a slightly different email delivery of system
fault notices. For those who are using the current version, please note that
there are improvements coming in configuration and reporting so that we
can help detect some specific pathologies often associated with transport
errors :-). There is always room for improvement in fault management...
 -- richard

Krunal Desai

2011-Feb-17 05:58 UTC

On Wed, Feb 2, 2011 at 8:38 PM, Carson Gaspar <carson at taltos.org> wrote:
> Works For Me (TM).
>
> c7t0d0 is hanging off an LSI SAS3081E-R (SAS1068E chip) rev B3 MPT rev 105
> Firmware rev 011d0000 (1.29.00.00) (IT FW)
>
> This is a SATA disk - I don't have any SAS disks behind a LSI1068E to test.

When I try to do a SMART status read (more than just a simple
identify), it looks like the 1068E drops the drive for a little bit. I
bought the Intel-branded LSI SAS3081E:
Current active firmware version is 01200000 (1.32.00)
Firmware image's version is MPTFW-01.32.00.00-IT
  LSI Logic
x86 BIOS image's version is MPTBIOS-6.34.00.00 (2010.12.07)

kernel log messages:
Feb 17 00:54:05 megatron scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,2e29@6/pci1000,3140@0 (mpt4):
Feb 17 00:54:05 megatron        Disconnected command timeout for Target 0
Feb 17 00:54:06 megatron scsi: [ID 365881 kern.info] /pci@0,0/pci8086,2e29@6/pci1000,3140@0 (mpt4):
Feb 17 00:54:06 megatron        Log info 0x31140000 received for target 0.
Feb 17 00:54:06 megatron        scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc
Feb 17 00:54:06 megatron scsi: [ID 365881 kern.info] /pci@0,0/pci8086,2e29@6/pci1000,3140@0 (mpt4):
Feb 17 00:54:06 megatron        Log info 0x31130000 received for target 0.
Feb 17 00:54:06 megatron        scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc
Feb 17 00:54:06 megatron scsi: [ID 365881 kern.info] /pci@0,0/pci8086,2e29@6/pci1000,3140@0 (mpt4):
Feb 17 00:54:06 megatron        Log info 0x31130000 received for target 0.
Feb 17 00:54:06 megatron        scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc
Feb 17 00:54:06 megatron scsi: [ID 365881 kern.info] /pci@0,0/pci8086,2e29@6/pci1000,3140@0 (mpt4):
Feb 17 00:54:06 megatron        Log info 0x31130000 received for target 0.
Feb 17 00:54:06 megatron        scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc
Feb 17 00:54:06 megatron scsi: [ID 365881 kern.info] /pci@0,0/pci8086,2e29@6/pci1000,3140@0 (mpt4):
Feb 17 00:54:06 megatron        Log info 0x31130000 received for target 0.
Feb 17 00:54:06 megatron        scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc
Feb 17 00:54:06 megatron scsi: [ID 107833 kern.notice] /pci@0,0/pci8086,2e29@6/pci1000,3140@0 (mpt4):
Feb 17 00:54:06 megatron        mpt_flush_target discovered non-NULL cmd in slot 33, tasktype 0x3
Feb 17 00:54:06 megatron scsi: [ID 365881 kern.info] /pci@0,0/pci8086,2e29@6/pci1000,3140@0 (mpt4):
Feb 17 00:54:06 megatron        Cmd (0xffffff02dea63a40) dump for Target 0 Lun 0:
Feb 17 00:54:06 megatron scsi: [ID 365881 kern.info] /pci@0,0/pci8086,2e29@6/pci1000,3140@0 (mpt4):
Feb 17 00:54:06 megatron                cdb=[ ]
Feb 17 00:54:06 megatron scsi: [ID 365881 kern.info] /pci@0,0/pci8086,2e29@6/pci1000,3140@0 (mpt4):
Feb 17 00:54:06 megatron        pkt_flags=0x8000 pkt_statistics=0x0 pkt_state=0x0
Feb 17 00:54:06 megatron scsi: [ID 365881 kern.info] /pci@0,0/pci8086,2e29@6/pci1000,3140@0 (mpt4):
Feb 17 00:54:06 megatron        pkt_scbp=0x0 cmd_flags=0x2800024
Feb 17 00:54:06 megatron scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,2e29@6/pci1000,3140@0 (mpt4):
Feb 17 00:54:06 megatron        ioc reset abort passthru

Fault management records some transport errors followed by recovery.
Any ideas? Disks are ST32000542AS.

Carson Gaspar

2011-Feb-17 15:52 UTC

On 2/16/11 9:58 PM, Krunal Desai wrote:
> When I try to do a SMART status read (more than just a simple
> identify), looks like the 1068E drops the drive for a little bit. I
> bought the Intel-branded LSI SAS3081E:
> Current active firmware version is 01200000 (1.32.00)
> Firmware image's version is MPTFW-01.32.00.00-IT
>    LSI Logic
> x86 BIOS image's version is MPTBIOS-6.34.00.00 (2010.12.07)
...
> Fault management records some transport errors followed by recovery.
> Any ideas? Disks are ST32000542AS.

Please give the _exact_ command you are running. I see the same thing,
but only if I try to retrieve some of the extended info (-x...). I
don't see it with -a.

-- 
Carson

Krunal Desai

2011-Feb-17 15:58 UTC

On Thu, Feb 17, 2011 at 10:52 AM, Carson Gaspar <carson at taltos.org> wrote:
> Please give the _exact_ command you are running. I see the same thing, but
> only if I try to retrieve some of the extended info (-x...). I don't see
> it with -a.

Sure, here it is (apologies in advance if GMail applies its forced wrapping):

movax@megatron:~/downloads# smartctl -a -d sat /dev/rdsk/c1t0d0
smartctl 5.40 2010-10-16 r3189 [i386-pc-solaris2.11] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda LP
Device Model:     ST32000542AS
Serial Number:    <redacted>
Firmware Version: CC34
User Capacity:    2,000,398,934,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Thu Feb 17 00:52:56 2011 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

<drive drops/resets here>
