Krunal Desai
2011-Feb-01 16:55 UTC
[zfs-discuss] fmadm faulty not showing faulty/offline disks?
I recently discovered a drive failure (either that or a loose cable; I need to investigate further) on my home fileserver. 'fmadm faulty' returns no output, but I can clearly see a failure when I do zpool status -v:

  pool: tank
 state: DEGRADED
status: One or more devices has been removed by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
  scan: scrub canceled on Tue Feb  1 11:51:58 2011
config:

        NAME         STATE     READ WRITE CKSUM
        tank         DEGRADED     0     0     0
          raidz2-0   DEGRADED     0     0     0
            c10t0d0  ONLINE       0     0     0
            c10t1d0  ONLINE       0     0     0
            c10t2d0  ONLINE       0     0     0
            c10t3d0  REMOVED      0     0     0
            c10t4d0  ONLINE       0     0     0
            c10t5d0  ONLINE       0     0     0
            c10t6d0  ONLINE       0     0     0
            c10t7d0  ONLINE       0     0     0

In dmesg, I see:

Feb  1 11:14:33 megatron scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,2e21@1/pci15d9,a580@0/sd@3,0 (sd8):
Feb  1 11:14:33 megatron        Command failed to complete...Device is gone

I never had any problems with these drives + mpt in snv_134 (I'm on snv_151a now); the only change was adding a second 1068E-IT that's currently unpopulated with drives. But more importantly, I guess: why can't I see this failure in fmadm, and how would I go about setting up automatic e-mail notification when stuff like this happens? Is a pool going degraded not the same as a failure?

--
--khd
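Editor's sketch: since 'fmadm faulty' stayed silent here while zpool status showed the problem, one stopgap is to scan the zpool status config table directly. The following is a minimal Python sketch under stated assumptions, not an FMA replacement: the function name unhealthy_devices and the abbreviated SAMPLE text are illustrative, and the parser only understands the column layout shown in this post.

```python
# Abbreviated copy of the 'zpool status -v' output from this post.
SAMPLE = """\
  pool: tank
 state: DEGRADED
config:

        NAME         STATE     READ WRITE CKSUM
        tank         DEGRADED     0     0     0
          raidz2-0   DEGRADED     0     0     0
            c10t0d0  ONLINE       0     0     0
            c10t3d0  REMOVED      0     0     0
            c10t4d0  ONLINE       0     0     0
"""


def unhealthy_devices(status_text):
    """Return (name, state) for every config row whose state is not ONLINE."""
    rows = []
    in_config = False
    for line in status_text.splitlines():
        fields = line.split()
        if fields[:2] == ["NAME", "STATE"]:
            in_config = True  # header row of the config table
            continue
        if not in_config or len(fields) < 2:
            continue
        if fields[0] == "errors:":
            break  # trailer that follows the config table
        name, state = fields[0], fields[1]
        # States are upper-case words (ONLINE, DEGRADED, REMOVED, ...).
        if state.isupper() and state != "ONLINE":
            rows.append((name, state))
    return rows


if __name__ == "__main__":
    for name, state in unhealthy_devices(SAMPLE):
        print(name, state)
```

In practice the text would come from running zpool status (for example via subprocess) from cron, with any non-empty result mailed out. Note that a pool with hot spares would also report their AVAIL rows, so a real script would need to whitelist such states.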
Cindy Swearingen
2011-Feb-01 18:29 UTC
[zfs-discuss] fmadm faulty not showing faulty/offline disks?
Hi Krunal,

It looks to me like FMA thinks that you removed the disk, so you'll need to confirm whether the cable dropped or something else.

I agree that we need to get email updates for failing devices.

See if fmdump generated an error report using the commands below.

Thanks,

Cindy

# fmdump
TIME                 UUID                                 SUNW-MSG-ID EVENT
Jan 07 14:01:14.7839 04ee736a-b2cb-612f-ce5e-a0e43d666762 ZFS-8000-GH Diagnosed
Jan 13 10:34:32.2301 04ee736a-b2cb-612f-ce5e-a0e43d666762 FMD-8000-58 Updated

Then, review the contents:

# fmdump -u 04ee736a-b2cb-612f-ce5e-a0e43d666762 -v
TIME                 UUID                                 SUNW-MSG-ID EVENT
Jan 07 14:01:14.7839 04ee736a-b2cb-612f-ce5e-a0e43d666762 ZFS-8000-GH Diagnosed
  100%  fault.fs.zfs.vdev.checksum

        Problem in: zfs://pool=c4538d8607c1e030/vdev=7954b2ff7a8383
           Affects: zfs://pool=c4538d8607c1e030/vdev=7954b2ff7a8383
               FRU: -
          Location: -

Jan 13 10:34:32.2301 04ee736a-b2cb-612f-ce5e-a0e43d666762 FMD-8000-58 Updated
  100%  fault.fs.zfs.vdev.checksum

        Problem in: zfs://pool=c4538d8607c1e030/vdev=7954b2ff7a8383
           Affects: zfs://pool=c4538d8607c1e030/vdev=7954b2ff7a8383
               FRU: -
          Location: -
Krunal Desai
2011-Feb-01 22:47 UTC
[zfs-discuss] fmadm faulty not showing faulty/offline disks?
On Tue, Feb 1, 2011 at 1:29 PM, Cindy Swearingen <cindy.swearingen at oracle.com> wrote:
> I agree that we need to get email updates for failing devices.

Definitely!

> See if fmdump generated an error report using the commands below.

Unfortunately not, see below:

movax@megatron:/root# fmdump
TIME                 UUID                                 SUNW-MSG-ID EVENT
fmdump: warning: /var/fm/fmd/fltlog is empty

--khd
Cindy Swearingen
2011-Feb-01 23:11 UTC
[zfs-discuss] fmadm faulty not showing faulty/offline disks?
I misspoke and should clarify:

1. fmdump identifies fault reports that explain system issues

2. fmdump -eV identifies errors or problem symptoms

I'm unclear about your REMOVED status. I don't see it very often.

The ZFS Admin Guide says:

REMOVED

  The device was physically removed while the system was running. Device removal detection is hardware-dependent and might not be supported on all platforms.

I need to check if FMA generally reports on devices that are REMOVED by the administrator, as ZFS seems to think in this case.

Thanks,

Cindy
Krunal Desai
2011-Feb-02 01:52 UTC
[zfs-discuss] fmadm faulty not showing faulty/offline disks?
On Tue, Feb 1, 2011 at 6:11 PM, Cindy Swearingen <cindy.swearingen at oracle.com> wrote:
> I misspoke and should clarify:
>
> 1. fmdump identifies fault reports that explain system issues
>
> 2. fmdump -eV identifies errors or problem symptoms

Gotcha; fmdump -eV gives me the information I need. It appears to have been a loose cable. I'm hitting the machine with some heavy I/O load, the pool resilvered itself, and the drive has not dropped out.

SMART status was reported healthy as well (got smartctl kind of working), but I cannot read the SMART data of my disks behind the 1068E, due to limitations of smartmontools I guess (e.g. 'smartctl -d scsi -a /dev/rdsk/c10t0d0' gives me serial #, model, and just a generic 'SMART Ok'). I assume that SUNWhd is licensed only for use on the X4500 Thumper and family? I'd like to see if it works with the 1068E.

It's getting kind of tempting for me to investigate doing a run of boards that run Marvell 88SX6081s behind a PLX PCIe <-> PCI-X bridge. They should have beyond excellent support, seeing as that is what the X4500 uses to run its SATA ports.
Richard Elling
2011-Feb-02 02:45 UTC
[zfs-discuss] fmadm faulty not showing faulty/offline disks?
On Feb 1, 2011, at 5:52 PM, Krunal Desai wrote:
> On Tue, Feb 1, 2011 at 6:11 PM, Cindy Swearingen
> <cindy.swearingen at oracle.com> wrote:
>> I misspoke and should clarify:
>>
>> 1. fmdump identifies fault reports that explain system issues
>>
>> 2. fmdump -eV identifies errors or problem symptoms
>
> Gotcha; fmdump -eV gives me the information I need. It appears to have
> been a loose cable, I'm hitting the machine with some heavy I/O load,
> and the pool resilvered itself, drive has not dropped out.

The output of fmdump is explicit. I am interested to know if you saw aborts and timeouts or some other errors.

> SMART status was reported healthy as well (got smartctl kind of
> working), but I cannot read the SMART data of my disks behind the
> 1068E due to limitations of smartmontools I guess. (e.g. 'smartctl -d
> scsi -a /dev/rdsk/c10t0d0' gives me serial #, model, and just a
> generic 'SMART Ok'). I assume that SUNWhd is licensed only for use on
> the X4500 Thumper and family? I'd like to see if it works with the
> 1068E.

The open-source version of smartmontools seems to be slightly out of date and somewhat finicky. Does anyone know of a better SMART implementation?

> It's getting kind of tempting for me to investigate doing a run of
> boards that run Marvell 88SX6081s behind a PLX PCIe <-> PCI-X bridge.
> They should have beyond excellent support seeing as that is what the
> X4500 uses to run its SATA ports.

Nice idea, except that the X4500 was EOL years ago and the replacement, X4540, uses LSI HBAs. I think you will find better Solaris support for the LSI chipsets because Oracle's Sun products use them from the top (M9000) all the way down the product line.

-- richard
Krunal Desai
2011-Feb-02 02:49 UTC
[zfs-discuss] fmadm faulty not showing faulty/offline disks?
> The output of fmdump is explicit. I am interested to know if you saw
> aborts and timeouts or some other errors.

I have the machine off atm while I install new disks (18x ST32000542AS), but IIRC they appeared as transport errors (scsi.<something>.transport; I can paste the exact errors in a little bit). A slew of transfer/soft errors followed by the drive disappearing. I assume that my HBA took it offline, and the mpt driver reported that to the OS as an admin disconnecting, not as a "failure" per se.

> The open-source version of smartmontools seems to be slightly out
> of date and somewhat finicky. Does anyone know of a better SMART
> implementation?

That SUNWhd I mentioned seemed interesting, but I assume licensing means I can only get that if I purchase Sun hardware.

> Nice idea, except that the X4500 was EOL years ago and the replacement,
> X4540, uses LSI HBAs. I think you will find better Solaris support for the LSI
> chipsets because Oracle's Sun products use them from the top (M9000) all
> the way down the product line.

Oops, forgot that the X4500s are actually kind of "old". I'll have to look up what LSI controllers the newer models are using (the LSI 2xx8 something IIRC? Will have to Google).

--khd
Richard Elling
2011-Feb-02 04:34 UTC
[zfs-discuss] fmadm faulty not showing faulty/offline disks?
On Feb 1, 2011, at 6:49 PM, Krunal Desai wrote:
>> The output of fmdump is explicit. I am interested to know if you saw
>> aborts and timeouts or some other errors.
>
> I have the machine off atm while I install new disks (18x ST32000542AS),
> but IIRC they appeared as transport errors (scsi.<something>.transport,
> I can paste the exact errors in a little bit). A slew of transfer/soft
> errors followed by the drive disappearing. I assume that my HBA took it
> offline, and mpt driver reported that to the OS as an admin
> disconnecting, not as a "failure" per se.

There is a failure going on here. It could be a cable or it could be a bad disk or firmware. The actual fault might not be in the disk reporting the errors (!) It is not a media error.

>> The open-source version of smartmontools seems to be slightly out
>> of date and somewhat finicky. Does anyone know of a better SMART
>> implementation?
>
> That SUNWhd I mentioned seemed interesting, but I assume licensing means
> I can only get that if I purchase Sun hardware.
>
>> Nice idea, except that the X4500 was EOL years ago and the replacement,
>> X4540, uses LSI HBAs. I think you will find better Solaris support for the LSI
>> chipsets because Oracle's Sun products use them from the top (M9000) all
>> the way down the product line.
>
> Oops, forgot that the X4500s are actually kind of "old". I'll have to
> look up what LSI controllers the newer models are using (the LSI 2xx8
> something IIRC? Will have to Google).

No, they aren't that new. The LSI 2008 are 6 Gbps HBAs and the older 1064/1068 series are 3 Gbps.

-- richard
Krunal Desai
2011-Feb-02 04:54 UTC
[zfs-discuss] fmadm faulty not showing faulty/offline disks?
On Tue, Feb 1, 2011 at 11:34 PM, Richard Elling <richard.elling at gmail.com> wrote:
> There is a failure going on here. It could be a cable or it could be a bad
> disk or firmware. The actual fault might not be in the disk reporting the errors (!)
> It is not a media error.

Errors were as follows:

Feb 01 19:33:01.3665 ereport.io.scsi.cmd.disk.recovered 0x269213b01d700401
Feb 01 19:33:01.3665 ereport.io.scsi.cmd.disk.recovered 0x269213b01d700401
Feb 01 19:33:01.3665 ereport.io.scsi.cmd.disk.recovered 0x269213b01d700401
Feb 01 19:33:04.9969 ereport.io.scsi.cmd.disk.tran 0x269f99ef0b300401
Feb 01 19:33:04.9970 ereport.io.scsi.cmd.disk.tran 0x269f9a165a400401

Verbose output for one message:

Feb 01 2011 19:33:04.996932283 ereport.io.scsi.cmd.disk.tran
nvlist version: 0
        class = ereport.io.scsi.cmd.disk.tran
        ena = 0x269f99ef0b300401
        detector = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = dev
                device-path = /pci@0,0/pci8086,2e21@1/pci15d9,a580@0/sd@3,0
        (end detector)

        devid = id1,sd@n5000c50010ed6a31
        driver-assessment = fail
        op-code = 0x0
        cdb = 0x0 0x0 0x0 0x0 0x0 0x0
        pkt-reason = 0x18
        pkt-state = 0x1
        pkt-stats = 0x0
        __ttl = 0x1
        __tod = 0x4d48a640 0x3b6bfabb

It was a cable error, but why didn't fault management tell me about it? What do you mean by "The actual fault might not be in the disk reporting the errors (!) It is not a media error."? Might the fault be coming from my SATA controller or something?
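Editor's sketch: the five summary lines above are easier to reason about once they are tallied by ereport class. The snippet below is a minimal Python sketch that assumes input lines in exactly the shape pasted in this message (month, day, time, class, ENA); the function name count_ereports is illustrative and not part of any FMA tool.

```python
from collections import Counter

# The summary lines pasted in this message.
SAMPLE = """\
Feb 01 19:33:01.3665 ereport.io.scsi.cmd.disk.recovered 0x269213b01d700401
Feb 01 19:33:01.3665 ereport.io.scsi.cmd.disk.recovered 0x269213b01d700401
Feb 01 19:33:01.3665 ereport.io.scsi.cmd.disk.recovered 0x269213b01d700401
Feb 01 19:33:04.9969 ereport.io.scsi.cmd.disk.tran 0x269f99ef0b300401
Feb 01 19:33:04.9970 ereport.io.scsi.cmd.disk.tran 0x269f9a165a400401
"""


def count_ereports(text):
    """Count how often each ereport class appears in the summary lines."""
    counts = Counter()
    for line in text.splitlines():
        fields = line.split()
        # Assumed layout: month, day, time, class, then optional columns.
        if len(fields) >= 4 and fields[3].startswith("ereport."):
            counts[fields[3]] += 1
    return counts


if __name__ == "__main__":
    for cls, n in count_ereports(SAMPLE).most_common():
        print(n, cls)
```

A run of recovered reports followed by tran reports on the same device path is the pattern described in this thread for the loose cable; the tally makes that pattern visible at a glance but does not diagnose anything by itself.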
Richard Elling
2011-Feb-02 14:23 UTC
[zfs-discuss] fmadm faulty not showing faulty/offline disks?
On Feb 1, 2011, at 8:54 PM, Krunal Desai wrote:
> On Tue, Feb 1, 2011 at 11:34 PM, Richard Elling
> <richard.elling at gmail.com> wrote:
>> There is a failure going on here. It could be a cable or it could be a bad
>> disk or firmware. The actual fault might not be in the disk reporting the errors (!)
>> It is not a media error.
>
> Errors were as follows:
> Feb 01 19:33:01.3665 ereport.io.scsi.cmd.disk.recovered 0x269213b01d700401
> Feb 01 19:33:01.3665 ereport.io.scsi.cmd.disk.recovered 0x269213b01d700401
> Feb 01 19:33:01.3665 ereport.io.scsi.cmd.disk.recovered 0x269213b01d700401
> Feb 01 19:33:04.9969 ereport.io.scsi.cmd.disk.tran 0x269f99ef0b300401
> Feb 01 19:33:04.9970 ereport.io.scsi.cmd.disk.tran 0x269f9a165a400401
>
> Verbose of a message:
> Feb 01 2011 19:33:04.996932283 ereport.io.scsi.cmd.disk.tran
> nvlist version: 0
>         class = ereport.io.scsi.cmd.disk.tran
>         ena = 0x269f99ef0b300401
>         detector = (embedded nvlist)
>         nvlist version: 0
>                 version = 0x0
>                 scheme = dev
>                 device-path = /pci@0,0/pci8086,2e21@1/pci15d9,a580@0/sd@3,0
>         (end detector)
>
>         devid = id1,sd@n5000c50010ed6a31
>         driver-assessment = fail
>         op-code = 0x0
>         cdb = 0x0 0x0 0x0 0x0 0x0 0x0
>         pkt-reason = 0x18

This error code means the device is gone.

>         pkt-state = 0x1

The command got the bus, but could not access the target.

>         pkt-stats = 0x0
>         __ttl = 0x1
>         __tod = 0x4d48a640 0x3b6bfabb
>
> It was a cable error, but why didn't fault management tell me about
> it? What do you mean by "The actual fault might not be in the disk
> reporting the errors (!)
> It is not a media error."? Fault might be sourcing from my SATA
> controller or something possibly?

Possibly.

-- richard
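Editor's sketch: the two readings given above (pkt-reason 0x18 means the device is gone; pkt-state 0x1 means the command got the bus) can be applied mechanically to the "key = value" lines of a verbose ereport. The Python below is a minimal sketch under stated assumptions. Only the 0x18 and 0x1 meanings come from this thread; the remaining pkt-state bits are recalled from the Solaris scsi_pkt definitions and should be checked against your system's headers before relying on them. The names parse_ereport and describe are illustrative.

```python
# Subset of the verbose ereport quoted in this message.
SAMPLE = """\
        class = ereport.io.scsi.cmd.disk.tran
        driver-assessment = fail
        pkt-reason = 0x18
        pkt-state = 0x1
        pkt-stats = 0x0
"""

# From this thread: 0x18 means the device is gone.
PKT_REASON = {0x18: "device is gone"}

# 0x01 is from this thread; the other bits are ASSUMPTIONS recalled from
# scsi_pkt definitions and need verifying before use.
PKT_STATE_BITS = [
    (0x01, "got the bus"),
    (0x02, "got the target"),
    (0x04, "sent the command"),
    (0x08, "transferred data"),
    (0x10, "got status"),
]


def parse_ereport(text):
    """Collect 'key = value' lines of an ereport dump into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(" = ")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def describe(fields):
    """Return (reason text, list of state texts) for a parsed ereport."""
    reason = int(fields["pkt-reason"], 16)
    state = int(fields["pkt-state"], 16)
    reason_text = PKT_REASON.get(reason, "unknown reason 0x%x" % reason)
    state_text = [name for bit, name in PKT_STATE_BITS if state & bit]
    return reason_text, state_text


if __name__ == "__main__":
    print(describe(parse_ereport(SAMPLE)))
```

Reading the state as a bit mask shows how far the command progressed: with only the first bit set, the request reached the bus and nothing further, which matches the explanation that the target could not be accessed.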
Oyvind Syljuasen
2011-Feb-02 16:59 UTC
[zfs-discuss] fmadm faulty not showing faulty/offline disks?
> I agree that we need to get email updates for failing
> devices.

If FMA discovers it, email can be sent, at least in Solaris Express 11:
http://blogs.sun.com/robj/entry/fma_and_email_notifications

br,
syljua
--
This message posted from opensolaris.org
Carson Gaspar
2011-Feb-03 01:38 UTC
[zfs-discuss] fmadm faulty not showing faulty/offline disks?
On 2/1/11 5:52 PM, Krunal Desai wrote:
> SMART status was reported healthy as well (got smartctl kind of
> working), but I cannot read the SMART data of my disks behind the
> 1068E due to limitations of smartmontools I guess. (e.g. 'smartctl -d
> scsi -a /dev/rdsk/c10t0d0' gives me serial #, model, and just a
> generic 'SMART Ok'). I assume that SUNWhd is licensed only for use on
> the X4500 Thumper and family? I'd like to see if it works with the
> 1068E.

Works For Me (TM).

c7t0d0 is hanging off an LSI SAS3081E-R (SAS1068E chip) rev B3 MPT rev 105 Firmware rev 011d0000 (1.29.00.00) (IT FW)

This is a SATA disk - I don't have any SAS disks behind a LSI1068E to test.

# uname -a
SunOS gandalf.taltos.org 5.11 snv_151a i86pc i386 i86pc

# /usr/local/sbin/smartctl -H -i -d sat /dev/rdsk/c7t0d0
smartctl 5.40 2010-10-16 r3189 [i386-pc-solaris2.11] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.11 family
Device Model:     ST31500341AS
Serial Number:    9VS4HDYH
Firmware Version: CC1H
User Capacity:    1,500,301,910,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Wed Feb  2 17:37:56 2011 PST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
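Editor's sketch: for scripted monitoring, the only line that usually matters in the output above is the health verdict. The Python below is a minimal sketch that pulls it out of captured smartctl text. The first pattern matches the "-d sat" output shown in this message; the second covers the "SMART Health Status:" wording that smartctl prints in "-d scsi" mode. The function name smart_health is illustrative.

```python
import re

# Verdict lines: ATA/SAT wording first, then the SCSI wording.
HEALTH_PATTERNS = (
    re.compile(r"SMART overall-health self-assessment test result:\s*(\S+)"),
    re.compile(r"SMART Health Status:\s*(\S+)"),
)


def smart_health(output):
    """Return the health verdict word from smartctl output, or None."""
    for pattern in HEALTH_PATTERNS:
        match = pattern.search(output)
        if match:
            return match.group(1)
    return None


if __name__ == "__main__":
    sample = (
        "=== START OF READ SMART DATA SECTION ===\n"
        "SMART overall-health self-assessment test result: PASSED\n"
    )
    print(smart_health(sample))
```

A None result is worth treating as an alert in its own right, since it means smartctl produced no verdict at all (wrong device type, missing permissions, or a drive that dropped off the bus).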
Krunal Desai
2011-Feb-03 01:43 UTC
[zfs-discuss] fmadm faulty not showing faulty/offline disks?
> This error code means the device is gone.
> The command got the bus, but could not access the target.

Thanks for that! I updated firmware on both of my USAS-L8i (LSI1068E based), and while controller numbering has shifted around in Solaris (went from c10/c11 to c11/c12, not a big deal I think), suddenly smartctl is able to pull temperatures. Can't get a full SMART listing, but temperatures are working now. Oddly enough, my second LSI controller has skipped c12t0d0 and starts numbering at c12t1d0. It's a good thing that ZFS can figure out what is what, but it will make configuring power management tricky.

I'll post in pm-discuss about the kernel panics I was getting after enabling drive power management.

--
--khd
Krunal Desai
2011-Feb-03 01:47 UTC
[zfs-discuss] fmadm faulty not showing faulty/offline disks?
> # uname -a
> SunOS gandalf.taltos.org 5.11 snv_151a i86pc i386 i86pc

movax@megatron:~# uname -a
SunOS megatron 5.11 snv_151a i86pc i386 i86pc

> # /usr/local/sbin/smartctl -H -i -d sat /dev/rdsk/c7t0d0
> smartctl 5.40 2010-10-16 r3189 [i386-pc-solaris2.11] (local build)
> Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

Fails for me; my version does not recognize the 'sat' option. I've been using -d scsi:

movax@megatron:~# smartctl -h
smartctl version 5.36 [i386-pc-solaris2.8] Copyright (C) 2002-6 Bruce Allen

but,

movax@megatron:~# smartctl -a -d scsi /dev/rdsk/c11t0d0
smartctl version 5.36 [i386-pc-solaris2.8] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

Device: ATA      ST31500341AS     Version: CC1H
Serial number:             9VS14DJD
Device type: disk
Local Time is: Wed Feb  2 20:45:00 2011 EST
Device supports SMART and is Enabled
Temperature Warning Disabled or Not Supported
SMART Health Status: OK

Current Drive Temperature:     49 C

Error Counter logging not supported
Carson Gaspar
2011-Feb-03 01:57 UTC
[zfs-discuss] fmadm faulty not showing faulty/offline disks?
On 2/2/11 5:43 PM, Krunal Desai wrote:
> I updated firmware on both of my USAS-L8i (LSI1068E based), and while
> controller numbering has shifted around in Solaris (went from c10/c11
> to c11/c12, not a big deal I think), suddenly smartctl is able to
> pull temperatures. Can't get a full SMART listing, but temperatures
> are going now. Oddly enough, my second LSI controller has skipped
> c12t0d0 and jumped straight from number c12t1d0 and onwards. It's a
> good thing that ZFS can figure out what is what, but it will make
> configuring power management tricky.

Re: d1 vs d0, you probably want to disable persistent mappings on your 1068E. Using lsiutil, it's option 15 in "expert" mode, then option 12 in the sub-menu. Or it could be something else ;-)

--
Carson
Carson Gaspar
2011-Feb-03 01:59 UTC
[zfs-discuss] fmadm faulty not showing faulty/offline disks?
On 2/2/11 5:47 PM, Krunal Desai wrote:
> Fails for me, my version does not recognize the 'sat' option. I've
> been using -d scsi:
>
> movax@megatron:~# smartctl -h
> smartctl version 5.36 [i386-pc-solaris2.8] Copyright (C) 2002-6 Bruce Allen

So build the current version of smartmontools. As you should have seen in my original response, I'm using 5.40. Bugs in 5.36 are unlikely to be interesting to the maintainers of the package ;-)

--
Carson
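Editor's sketch: before scripting "smartctl -d sat", it helps to check which smartctl is actually installed. The Python below is a minimal sketch that reads the version from either banner style seen in this thread ("smartctl version 5.36 ..." and "smartctl 5.40 ..."). The threshold (5, 40) is simply the version shown working here, not a claim about the first release that supports "-d sat"; the names smartctl_version and KNOWN_GOOD are illustrative.

```python
import re

# Version shown working with '-d sat' in this thread; an assumption used
# as a conservative threshold, not the documented minimum.
KNOWN_GOOD = (5, 40)

BANNER = re.compile(r"smartctl (?:version )?(\d+)\.(\d+)")


def smartctl_version(banner):
    """Return (major, minor) parsed from a smartctl banner, or None."""
    match = BANNER.search(banner)
    if not match:
        return None
    return int(match.group(1)), int(match.group(2))


def is_known_good(banner):
    """True when the banner reports at least the KNOWN_GOOD version."""
    version = smartctl_version(banner)
    return version is not None and version >= KNOWN_GOOD


if __name__ == "__main__":
    old = "smartctl version 5.36 [i386-pc-solaris2.8] Copyright (C) 2002-6 Bruce Allen"
    new = "smartctl 5.40 2010-10-16 r3189 [i386-pc-solaris2.11] (local build)"
    print(smartctl_version(old), is_known_good(old))
    print(smartctl_version(new), is_known_good(new))
```

Comparing (major, minor) tuples avoids the trap of comparing version strings or floats, where 5.4 and 5.40 would be confused.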
Krunal Desai
2011-Feb-03 02:05 UTC
[zfs-discuss] fmadm faulty not showing faulty/offline disks?
> So build the current version of smartmontools. As you should have seen
> in my original response, I'm using 5.40. Bugs in 5.36 are unlikely to
> be interesting to the maintainers of the package ;-)

Oops, missed that in your log. Will try compiling from source and see what happens.

Also, recently it seems like all the links to tools I need are broken. Where can I find an lsiutil binary for Solaris?

--khd
Eric D. Mudama
2011-Feb-03 02:17 UTC
[zfs-discuss] fmadm faulty not showing faulty/offline disks?
On Wed, Feb 2 at 21:05, Krunal Desai wrote:
>> So build the current version of smartmontools. As you should have seen
>> in my original response, I'm using 5.40. Bugs in 5.36 are unlikely to
>> be interesting to the maintainers of the package ;-)
>
> Oops, missed that in your log. Will try compiling from source and see
> what happens.
>
> Also, recently it seems like all the links to tools I need are broken.
> Where can I find a lsiutil binary for Solaris?

If you search for 'lsiutil solaris' on lsi.com, it'll direct you to a zipfile that includes a binary for x86 Solaris. I'm at home now, so I can't test it.

--
Eric D. Mudama
edmudama at bounceswoosh.org
Krunal Desai
2011-Feb-03 02:20 UTC
[zfs-discuss] fmadm faulty not showing faulty/offline disks?
> If you search for 'lsiutil solaris' on lsi.com, it'll direct you to
> zipfile that includes a solaris binary for x86 solaris.

Yep, that worked; grabbed it off some other adapter's page. Thanks!
Richard Elling
2011-Feb-03 03:31 UTC
[zfs-discuss] fmadm faulty not showing faulty/offline disks?
On Feb 2, 2011, at 8:59 AM, Oyvind Syljuasen wrote:
>> I agree that we need to get email updates for failing
>> devices.
>
> If FMA discovers it, email can be sent, at least in Solaris Express 11;
> http://blogs.sun.com/robj/entry/fma_and_email_notifications

For NexentaStor we have a slightly different email delivery of system fault notices. For those who are using the current version, please note that there are improvements coming in configuration and reporting so that we can help detect some specific pathologies often associated with transport errors :-). There is always room for improvement in fault management...

-- richard
Krunal Desai
2011-Feb-17 05:58 UTC
[zfs-discuss] fmadm faulty not showing faulty/offline disks?
On Wed, Feb 2, 2011 at 8:38 PM, Carson Gaspar <carson at taltos.org> wrote:
> Works For Me (TM).
>
> c7t0d0 is hanging off an LSI SAS3081E-R (SAS1068E chip) rev B3 MPT rev 105
> Firmware rev 011d0000 (1.29.00.00) (IT FW)
>
> This is a SATA disk - I don't have any SAS disks behind a LSI1068E to test.

When I try to do a SMART status read (more than just a simple identify), it looks like the 1068E drops the drive for a little bit. I bought the Intel-branded LSI SAS3081E:

Current active firmware version is 01200000 (1.32.00)
Firmware image's version is MPTFW-01.32.00.00-IT
  LSI Logic
x86 BIOS image's version is MPTBIOS-6.34.00.00 (2010.12.07)

Kernel log messages:

Feb 17 00:54:05 megatron scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,2e29@6/pci1000,3140@0 (mpt4):
Feb 17 00:54:05 megatron        Disconnected command timeout for Target 0
Feb 17 00:54:06 megatron scsi: [ID 365881 kern.info] /pci@0,0/pci8086,2e29@6/pci1000,3140@0 (mpt4):
Feb 17 00:54:06 megatron        Log info 0x31140000 received for target 0.
Feb 17 00:54:06 megatron        scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc
Feb 17 00:54:06 megatron scsi: [ID 365881 kern.info] /pci@0,0/pci8086,2e29@6/pci1000,3140@0 (mpt4):
Feb 17 00:54:06 megatron        Log info 0x31130000 received for target 0.
Feb 17 00:54:06 megatron        scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc
Feb 17 00:54:06 megatron scsi: [ID 365881 kern.info] /pci@0,0/pci8086,2e29@6/pci1000,3140@0 (mpt4):
Feb 17 00:54:06 megatron        Log info 0x31130000 received for target 0.
Feb 17 00:54:06 megatron        scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc
Feb 17 00:54:06 megatron scsi: [ID 365881 kern.info] /pci@0,0/pci8086,2e29@6/pci1000,3140@0 (mpt4):
Feb 17 00:54:06 megatron        Log info 0x31130000 received for target 0.
Feb 17 00:54:06 megatron        scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc
Feb 17 00:54:06 megatron scsi: [ID 365881 kern.info] /pci@0,0/pci8086,2e29@6/pci1000,3140@0 (mpt4):
Feb 17 00:54:06 megatron        Log info 0x31130000 received for target 0.
Feb 17 00:54:06 megatron        scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc
Feb 17 00:54:06 megatron scsi: [ID 107833 kern.notice] /pci@0,0/pci8086,2e29@6/pci1000,3140@0 (mpt4):
Feb 17 00:54:06 megatron        mpt_flush_target discovered non-NULL cmd in slot 33, tasktype 0x3
Feb 17 00:54:06 megatron scsi: [ID 365881 kern.info] /pci@0,0/pci8086,2e29@6/pci1000,3140@0 (mpt4):
Feb 17 00:54:06 megatron        Cmd (0xffffff02dea63a40) dump for Target 0 Lun 0:
Feb 17 00:54:06 megatron scsi: [ID 365881 kern.info] /pci@0,0/pci8086,2e29@6/pci1000,3140@0 (mpt4):
Feb 17 00:54:06 megatron                cdb=[ ]
Feb 17 00:54:06 megatron scsi: [ID 365881 kern.info] /pci@0,0/pci8086,2e29@6/pci1000,3140@0 (mpt4):
Feb 17 00:54:06 megatron        pkt_flags=0x8000 pkt_statistics=0x0 pkt_state=0x0
Feb 17 00:54:06 megatron scsi: [ID 365881 kern.info] /pci@0,0/pci8086,2e29@6/pci1000,3140@0 (mpt4):
Feb 17 00:54:06 megatron        pkt_scbp=0x0 cmd_flags=0x2800024
Feb 17 00:54:06 megatron scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,2e29@6/pci1000,3140@0 (mpt4):
Feb 17 00:54:06 megatron        ioc reset abort passthru

Fault management records some transport errors followed by recovery. Any ideas? Disks are ST32000542AS.
Carson Gaspar
2011-Feb-17 15:52 UTC
[zfs-discuss] fmadm faulty not showing faulty/offline disks?
On 2/16/11 9:58 PM, Krunal Desai wrote:
> When I try to do a SMART status read (more than just a simple
> identify), looks like the 1068E drops the drive for a little bit. I
> bought the Intel-branded LSI SAS3081E:
> Current active firmware version is 01200000 (1.32.00)
> Firmware image's version is MPTFW-01.32.00.00-IT
>   LSI Logic
> x86 BIOS image's version is MPTBIOS-6.34.00.00 (2010.12.07)
...
> Fault management records some transport errors followed by recovery.
> Any ideas? Disks are ST32000542AS.

Please give the _exact_ command you are running. I see the same thing, but only if I try to retrieve some of the extended info (-x...). I don't see it with -a.

--
Carson
Krunal Desai
2011-Feb-17 15:58 UTC
[zfs-discuss] fmadm faulty not showing faulty/offline disks?
On Thu, Feb 17, 2011 at 10:52 AM, Carson Gaspar <carson at taltos.org> wrote:
> Please give the _exact_ command you are running. I see the same thing, but
> only if I try and retrieve some of the extended info (-x...). I don't see
> it with -a.

Sure, here it is (apologies in advance if GMail applies its forced wrapping):

movax@megatron:~/downloads# smartctl -a -d sat /dev/rdsk/c1t0d0
smartctl 5.40 2010-10-16 r3189 [i386-pc-solaris2.11] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda LP
Device Model:     ST32000542AS
Serial Number:    <redacted>
Firmware Version: CC34
User Capacity:    2,000,398,934,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Thu Feb 17 00:52:56 2011 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

<drive drops/resets here>