Stuart Anderson
2007-Oct-27 00:59 UTC
[zfs-discuss] X4500 device disconnect problem persists
After applying 125205-07 on two X4500 machines running Sol10U4 and removing "set sata:sata_func_enable = 0x5" from /etc/system to re-enable NCQ, I am again observing drive disconnect error messages. This in spite of the patch description which claims multiple fixes in this area:

6587133 repeated DMA command timeouts and device resets on x4500
6538627 x4500 message logs contain multiple device disk resets but nothing logged in FMA
6564956 "Disparity error" for marvell88sx3 was shown during boot-time

for example,

Has anyone else had any better luck with this?

Thanks.

Oct 26 16:25:34 thumper2 marvell88sx: [ID 670675 kern.info] NOTICE: marvell88sx3: device on port 1 reset: no matching NCQ I/O found
Oct 26 16:25:34 thumper2 marvell88sx: [ID 670675 kern.info] NOTICE: marvell88sx3: device on port 1 reset: device disconnected or device error
Oct 26 16:25:34 thumper2 sata: [ID 801593 kern.notice] NOTICE: /pci@1,0/pci1022,7458@4/pci11ab,11ab@1:
Oct 26 16:25:34 thumper2 port 1: device reset
Oct 26 16:25:34 thumper2 sata: [ID 801593 kern.notice] NOTICE: /pci@1,0/pci1022,7458@4/pci11ab,11ab@1:
Oct 26 16:25:34 thumper2 port 1: link lost
Oct 26 16:25:34 thumper2 sata: [ID 801593 kern.notice] NOTICE: /pci@1,0/pci1022,7458@4/pci11ab,11ab@1:
Oct 26 16:25:34 thumper2 port 1: link established
Oct 26 16:25:34 thumper2 marvell88sx: [ID 812950 kern.warning] WARNING: marvell88sx3: error on port 1:
Oct 26 16:25:34 thumper2 marvell88sx: [ID 517869 kern.info] device disconnected
Oct 26 16:25:34 thumper2 marvell88sx: [ID 517869 kern.info] device connected
Oct 26 16:25:34 thumper2 scsi: [ID 107833 kern.warning] WARNING: /pci@1,0/pci1022,7458@4/pci11ab,11ab@1/disk@1,0 (sd25):
Oct 26 16:25:34 thumper2 Error for Command: read(10) Error Level: Retryable
Oct 26 16:25:34 thumper2 scsi: [ID 107833 kern.notice] Requested Block: 521002402 Error Block: 521002402
Oct 26 16:25:34 thumper2 scsi: [ID 107833 kern.notice] Vendor: ATA Serial Number:
Oct 26 16:25:34 thumper2 scsi: [ID 107833 kern.notice] Sense Key: No Additional Sense
Oct 26 16:25:34 thumper2 scsi: [ID 107833 kern.notice] ASC: 0x0 (no additional sense info), ASCQ: 0x0, FRU: 0x0

--
Stuart Anderson anderson at ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson
Stuart Anderson wrote:
> After applying 125205-07 on two X4500 machines running Sol10U4 and
> removing "set sata:sata_func_enable = 0x5" from /etc/system to
> re-enable NCQ, I am again observing drive disconnect error messages.
> This in spite of the patch description which claims multiple fixes
> in this area:
>
> 6587133 repeated DMA command timeouts and device resets on x4500
> 6538627 x4500 message logs contain multiple device disk resets but nothing logged in FMA
> 6564956 "Disparity error" for marvell88sx3 was shown during boot-time
> for example,
>
> Has anyone else had any better luck with this?
>

I have never seen this before. Please let me know all the patches you have added to your machine.

It would appear that you are having some sort of hardware issue, but apparently you provided only part of what was in /var/adm/messages below, which makes it hard to say for certain. What you have below may be all related to a single device error (note the time stamps). Are you saying this occurs over and over again?

By the way, only the fix for CR 6587133 deals with repeated device resets. The "fix" for CR 6538627 just added to the logged message the reason for the reset, and the fix for CR 6564956 made it so that right after boot the single reset that was required would no longer be required. This reset was only once per port per boot.

Regards,
Lida

> Thanks.
>
>
> Oct 26 16:25:34 thumper2 marvell88sx: [ID 670675 kern.info] NOTICE: marvell88sx3: device on port 1 reset: no matching NCQ I/O found
> Oct 26 16:25:34 thumper2 marvell88sx: [ID 670675 kern.info] NOTICE: marvell88sx3: device on port 1 reset: device disconnected or device error
> Oct 26 16:25:34 thumper2 sata: [ID 801593 kern.notice] NOTICE: /pci@1,0/pci1022,7458@4/pci11ab,11ab@1:
> Oct 26 16:25:34 thumper2 port 1: device reset
> Oct 26 16:25:34 thumper2 sata: [ID 801593 kern.notice] NOTICE: /pci@1,0/pci1022,7458@4/pci11ab,11ab@1:
> Oct 26 16:25:34 thumper2 port 1: link lost
> Oct 26 16:25:34 thumper2 sata: [ID 801593 kern.notice] NOTICE: /pci@1,0/pci1022,7458@4/pci11ab,11ab@1:
> Oct 26 16:25:34 thumper2 port 1: link established
> Oct 26 16:25:34 thumper2 marvell88sx: [ID 812950 kern.warning] WARNING: marvell88sx3: error on port 1:
> Oct 26 16:25:34 thumper2 marvell88sx: [ID 517869 kern.info] device disconnected
> Oct 26 16:25:34 thumper2 marvell88sx: [ID 517869 kern.info] device connected
> Oct 26 16:25:34 thumper2 scsi: [ID 107833 kern.warning] WARNING: /pci@1,0/pci1022,7458@4/pci11ab,11ab@1/disk@1,0 (sd25):
> Oct 26 16:25:34 thumper2 Error for Command: read(10) Error Level: Retryable
> Oct 26 16:25:34 thumper2 scsi: [ID 107833 kern.notice] Requested Block: 521002402 Error Block: 521002402
> Oct 26 16:25:34 thumper2 scsi: [ID 107833 kern.notice] Vendor: ATA Serial Number:
> Oct 26 16:25:34 thumper2 scsi: [ID 107833 kern.notice] Sense Key: No Additional Sense
> Oct 26 16:25:34 thumper2 scsi: [ID 107833 kern.notice] ASC: 0x0 (no additional sense info), ASCQ: 0x0, FRU: 0x0
>
>
I got it too. It's a brand new x4500 (my 2nd eval box, after the other one used to freeze up). I got this while running a Java program that tries to read a 128G file while writing a 100G file in 2 threads with 128K blocks.

Oct 29 00:56:28 zeta1 marvell88sx: [ID 670675 kern.info] NOTICE: marvell88sx2: device on port 1 reset: no matching NCQ I/O found
Oct 29 00:56:28 zeta1 marvell88sx: [ID 670675 kern.info] NOTICE: marvell88sx2: device on port 1 reset: device disconnected or device error
Oct 29 00:56:28 zeta1 sata: [ID 801593 kern.notice] NOTICE: /pci@1,0/pci1022,7458@3/pci11ab,11ab@1:
Oct 29 00:56:28 zeta1 port 1: device reset
Oct 29 00:56:28 zeta1 sata: [ID 801593 kern.notice] NOTICE: /pci@1,0/pci1022,7458@3/pci11ab,11ab@1:
Oct 29 00:56:28 zeta1 port 1: link lost
Oct 29 00:56:28 zeta1 sata: [ID 801593 kern.notice] NOTICE: /pci@1,0/pci1022,7458@3/pci11ab,11ab@1:
Oct 29 00:56:28 zeta1 port 1: link established
Oct 29 00:56:28 zeta1 marvell88sx: [ID 812950 kern.warning] WARNING: marvell88sx2: error on port 1:
Oct 29 00:56:28 zeta1 marvell88sx: [ID 517869 kern.info] device disconnected
Oct 29 00:56:28 zeta1 marvell88sx: [ID 517869 kern.info] device connected
Oct 29 00:56:28 zeta1 scsi: [ID 107833 kern.warning] WARNING: /pci@1,0/pci1022,7458@3/pci11ab,11ab@1/disk@1,0 (sd13):
Oct 29 00:56:28 zeta1 Error for Command: read(10) Error Level: Retryable
Oct 29 00:56:28 zeta1 scsi: [ID 107833 kern.notice] Requested Block: 186994359 Error Block: 186994359
Oct 29 00:56:28 zeta1 scsi: [ID 107833 kern.notice] Vendor: ATA Serial Number:
Oct 29 00:56:28 zeta1 scsi: [ID 107833 kern.notice] Sense Key: No Additional Sense
Oct 29 00:56:28 zeta1 scsi: [ID 107833 kern.notice] ASC: 0x0 (no additional sense info), ASCQ: 0x0, FRU: 0x0

I did a complete reinstall of Solaris 10 U4 and then applied

NAME: Solaris 10_x86 Recommended Patch Cluster
DATE: Oct/26/07

Release: 5.10
Kernel architecture: i86pc
Application architecture: i386
Hardware provider:
Domain:
Kernel version: SunOS 5.10 Generic_127112-02

Looks like it's back to set sata:sata_func_enable = 0x5 for me.

This message posted from opensolaris.org
Willi Burmeister
2007-Oct-29 06:45 UTC
[zfs-discuss] X4500 device disconnect problem persists
Hi,

we have the same problem. Our X4500 has Solaris 10 11/06 and (nearly) every kernel and driver related patch installed. Nothing is set in /etc/system.

fmdump is not showing any errors:

----------------------------------------------------------------------
# fmdump
TIME                 UUID                                 SUNW-MSG-ID
fmdump: /var/fm/fmd/fltlog is empty
----------------------------------------------------------------------

from /var/adm/messages:

----------------------------------------------------------------------
Oct 29 04:49:49 celeborn marvell88sx: [ID 670675 kern.info] NOTICE: marvell88sx4: device on port 1 reset: DMA command timeout
Oct 29 04:49:49 celeborn sata: [ID 801593 kern.notice] NOTICE: /pci@2,0/pci1022,7458@7/pci11ab,11ab@1:
Oct 29 04:49:49 celeborn port 1: device reset
Oct 29 04:49:49 celeborn marvell88sx: [ID 670675 kern.info] NOTICE: marvell88sx4: device on port 1 reset: device disconnected or device error
Oct 29 04:49:49 celeborn sata: [ID 801593 kern.notice] NOTICE: /pci@2,0/pci1022,7458@7/pci11ab,11ab@1:
Oct 29 04:49:49 celeborn port 1: device reset
Oct 29 04:49:49 celeborn sata: [ID 801593 kern.notice] NOTICE: /pci@2,0/pci1022,7458@7/pci11ab,11ab@1:
Oct 29 04:49:49 celeborn port 1: link lost
Oct 29 04:49:49 celeborn sata: [ID 801593 kern.notice] NOTICE: /pci@2,0/pci1022,7458@7/pci11ab,11ab@1:
Oct 29 04:49:49 celeborn port 1: link established
Oct 29 04:49:49 celeborn marvell88sx: [ID 812950 kern.warning] WARNING: marvell88sx4: error on port 1:
Oct 29 04:49:49 celeborn marvell88sx: [ID 517869 kern.info] device disconnected
Oct 29 04:49:49 celeborn marvell88sx: [ID 517869 kern.info] device connected
Oct 29 04:49:49 celeborn scsi: [ID 107833 kern.warning] WARNING: /pci@2,0/pci1022,7458@7/pci11ab,11ab@1/disk@1,0 (sd15):
Oct 29 04:49:49 celeborn Error for Command: write(10) Error Level: Retryable
Oct 29 04:49:49 celeborn scsi: [ID 107833 kern.notice] Requested Block: 107400869 Error Block: 107400869
Oct 29 04:49:49 celeborn scsi: [ID 107833 kern.notice] Vendor: ATA Serial Number:
Oct 29 04:49:49 celeborn scsi: [ID 107833 kern.notice] Sense Key: No Additional Sense
Oct 29 04:49:49 celeborn scsi: [ID 107833 kern.notice] ASC: 0x0 (no additional sense info), ASCQ: 0x0, FRU: 0x0
----------------------------------------------------------------------

----------------------------------------------------------------------
# pca -l missing
Using /usr/local/etc/patchdiag.xref from Oct/18/07
Host: celeborn (SunOS 5.10/Generic_127112-01/i386/i86pc)

Patch  IR   CR RSB Age Synopsis
------ -- - -- --- --- -------------------------------------------------------
125333 01 < 02 -S-  14 JDS 3_x86: Macromedia Flash Player Plugin Patch
125546 -- < 01 ---  19 GNOME 2.6.0_x86: GNOME Performance Meter
127733 -- < 01 ---  11 SunOS 5.10_x86: sd Patch
127748 -- < 01 ---  11 SunOS 5.10_x86: pciehpc patch
127887 -- < 01 ---  11 SunOS 5.10_x86: ipf patch
----------------------------------------------------------------------

Willi
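Because fmdump reports nothing for these events, one way to tell a single incident from a recurring one (the question raised earlier in the thread) is to group the driver's reset notices in /var/adm/messages by controller, port, and timestamp. The following is only a minimal Python sketch, assuming the syslog line format seen in the excerpts in this thread; the function name group_resets and the regular expression are illustrative, not part of any Solaris tool.

```python
import re
from collections import defaultdict

# Matches the marvell88sx reset notices as they appear in this thread, e.g.
# "... celeborn marvell88sx: [ID ...] NOTICE: marvell88sx4: device on port 1 reset: DMA command timeout"
# (assumption: other Solaris releases may word these lines differently).
RESET_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+(?P<host>\S+)\s+marvell88sx:.*?"
    r"NOTICE:\s+(?P<ctrl>marvell88sx\d+):\s+device on port (?P<port>\d+) reset:"
    r"\s+(?P<reason>.+?)\s*$"
)


def group_resets(lines):
    """Map (host, controller, port) -> {timestamp: [reset reasons]}."""
    incidents = defaultdict(lambda: defaultdict(list))
    for line in lines:
        m = RESET_RE.match(line)
        if m:
            key = (m["host"], m["ctrl"], int(m["port"]))
            incidents[key][m["stamp"]].append(m["reason"])
    return incidents


# Three lines taken from the celeborn excerpt; the last one is not a reset
# notice and is ignored by the pattern.
SAMPLE = [
    "Oct 29 04:49:49 celeborn marvell88sx: [ID 670675 kern.info] NOTICE: "
    "marvell88sx4: device on port 1 reset: DMA command timeout",
    "Oct 29 04:49:49 celeborn marvell88sx: [ID 670675 kern.info] NOTICE: "
    "marvell88sx4: device on port 1 reset: device disconnected or device error",
    "Oct 29 04:49:49 celeborn marvell88sx: [ID 517869 kern.info] device disconnected",
]

if __name__ == "__main__":
    for (host, ctrl, port), by_time in group_resets(SAMPLE).items():
        print(f"{host} {ctrl} port {port}: {len(by_time)} distinct timestamp(s)")
        for stamp, reasons in by_time.items():
            print(f"  {stamp}: {'; '.join(reasons)}")
```

Fed a whole /var/adm/messages file (for example via open(path) instead of SAMPLE), a port that shows one timestamp points to a single incident, while many timestamps on the same port point to a recurring problem. Syslog timestamps carry no year, so rotated logs should be checked separately.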
We have this identical problem on all 10 or so of our thumpers. They're running stock Solaris 10, whatever came with them. We think it's starting to cause problems, as we will see a rash of those errors on one of our machines, and then NFS will stop serving.
Dan Poltawski
2007-Nov-08 10:18 UTC
[zfs-discuss] X4500 device disconnect problem persists
That is interesting, again we're having the same problem with our X4500s. I am trying to work out what is causing the problem with NFS: restarting the service makes it try to stop, but it never comes back up. Rebooting the whole box fails too, and it just hangs until a hard reset.
Peter Eriksson
2007-Nov-08 14:38 UTC
[zfs-discuss] X4500 device disconnect problem persists
We too are seeing this problem on some of our Thumpers - the ones with U4 and/or all the latest patches installed. We have one which we stopped patching before the kernel patch that introduced this problem that works fine...

Works:

[0] andromeda:/<2>ncri86pc/sbin# uname -a
SunOS andromeda 5.10 Generic_125101-03 i86pc i386 i86pc

Have problems:

# uname -a
SunOS sagittarius-a 5.10 Generic_127112-01 i86pc i386 i86pc
Dan Poltawski
2007-Nov-13 11:50 UTC
[zfs-discuss] X4500 device disconnect problem persists
I've just discovered patch 125205-07, which wasn't installed on our system because we don't have SUNWhea.

Has anyone with problems tried this patch, and has it helped at all?
Peter Tribble
2007-Nov-13 21:10 UTC
[zfs-discuss] X4500 device disconnect problem persists
On 11/13/07, Dan Poltawski <talktodan at gmail.com> wrote:
> I've just discovered patch 125205-07, which wasn't installed on our system because we don't have SUNWhea..
>
> Has anyone with problems tried this patch, and has it helped at all?

We were having a pretty rough time running S10U4. While I was away on vacation 125205-06 was applied and apparently made some difference, although the problem doesn't seem to have entirely vanished.

(It's gone far enough away that users aren't complaining, but I think we still want to put the -07 version of the patch on when we can, and I too would like confirmation that it's helping and hasn't introduced any other regressions.)

--
-Peter Tribble
http://www.petertribble.co.uk/ - http://ptribble.blogspot.com/
The "reset: no matching NCQ I/O found" issue appears to be related to the error recovery for bad blocks on the disk. In general it should be harmless, but I have looked into this. If there is someone out there who:

1) Is hitting this issue, and
2) Is running recent Solaris Nevada bits (not Solaris 10), and
3) Is willing to try out an experimental driver,

I can provide a new binary (with which I've done some testing already) which would appear to deal with this issue and do better and quicker error recovery.

Remember that the underlying problem still appears to be bad blocks on the disk, so until those blocks are re-written or mapped away there will still be slow response and error messages generated each and every time those blocks are read.

Regards,
Lida
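The bad-block explanation above suggests a simple check: if it holds, the same "Error Block" value should show up repeatedly for the same disk in /var/adm/messages. Below is a rough Python sketch of that tally, assuming the scsi message layout seen in the excerpts in this thread; count_error_blocks and repeated_blocks are made-up helper names, and the sample lines are copied from two of the posted logs.

```python
import re
from collections import Counter

# The scsi WARNING line names the disk, e.g. "... (sd25):"; a following
# kern.notice line carries "Requested Block: N Error Block: N".
# (Assumption: this matches the layout shown in this thread only.)
DISK_RE = re.compile(
    r"^\w{3}\s+\d+\s+[\d:]+\s+(?P<host>\S+)\s+scsi:.*WARNING:.*\((?P<disk>sd\d+)\):\s*$"
)
BLOCK_RE = re.compile(
    r"^\w{3}\s+\d+\s+[\d:]+\s+(?P<host>\S+)\s+scsi:.*Error Block:\s+(?P<block>\d+)"
)


def count_error_blocks(lines):
    """Count how often each (host, disk, error block) triple is reported."""
    current_disk = {}  # host -> disk named in the most recent scsi WARNING
    counts = Counter()
    for line in lines:
        m = DISK_RE.match(line)
        if m:
            current_disk[m["host"]] = m["disk"]
            continue
        m = BLOCK_RE.match(line)
        if m:
            disk = current_disk.get(m["host"], "unknown")
            counts[(m["host"], disk, int(m["block"]))] += 1
    return counts


def repeated_blocks(counts, minimum=2):
    """Keep only the blocks reported at least `minimum` times."""
    return {key: n for key, n in counts.items() if n >= minimum}


SAMPLE = [
    "Oct 26 16:25:34 thumper2 scsi: [ID 107833 kern.warning] WARNING: "
    "/pci@1,0/pci1022,7458@4/pci11ab,11ab@1/disk@1,0 (sd25):",
    "Oct 26 16:25:34 thumper2 Error for Command: read(10) Error Level: Retryable",
    "Oct 26 16:25:34 thumper2 scsi: [ID 107833 kern.notice] "
    "Requested Block: 521002402 Error Block: 521002402",
    "Oct 29 00:56:28 zeta1 scsi: [ID 107833 kern.warning] WARNING: "
    "/pci@1,0/pci1022,7458@3/pci11ab,11ab@1/disk@1,0 (sd13):",
    "Oct 29 00:56:28 zeta1 scsi: [ID 107833 kern.notice] "
    "Requested Block: 186994359 Error Block: 186994359",
]

if __name__ == "__main__":
    counts = count_error_blocks(SAMPLE)
    for (host, disk, block), n in sorted(counts.items()):
        print(f"{host} {disk} block {block}: reported {n} time(s)")
    print("repeated:", repeated_blocks(counts))
```

In the short excerpts posted here each block appears only once, so they neither confirm nor rule out the bad-block explanation; the tally only becomes informative when run over a longer stretch of the log.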
Peter Eriksson
2007-Nov-15 13:20 UTC
[zfs-discuss] X4500 device disconnect problem persists
Speaking of error recovery due to bad blocks - anyone know if the SATA disks that are delivered with the Thumper have "enterprise" or "desktop" firmware/settings by default? If I'm not mistaken, one of the differences is that the "enterprise" variant gives up on bad blocks more quickly and reports them to the operating system, whereas the "desktop" variant will keep on retrying forever (or almost, at least)...
Richard Elling
2007-Nov-15 17:52 UTC
[zfs-discuss] X4500 device disconnect problem persists
Peter Eriksson wrote:
> Speaking of error recovery due to bad blocks - anyone know if the SATA disks that are delivered with the Thumper have "enterprise" or "desktop" firmware/settings by default? If I'm not mistaken one of the differences is that the "enterprise" variant more quickly gives up with bad blocks and reports those to the operating system compared to the "desktop" variant that will keep on retrying forever (or almost at least)...

The 500 GByte is the Enterprise version from Hitachi, E7K500. However, I do not believe there is an industry standard definition of "enterprise" disks.

-- richard
We are having the same problem, first with 125025-05 and then also with 125205-07, on Solaris 10 update 4, now with all patches. We opened a case and got T-PATCH 127871-02; we installed the Marvell driver binary 3 days ago.

T127871-02/SUNWckr/reloc/kernel/misc/sata
T127871-02/SUNWmv88sx/reloc/kernel/drv/marvell88sx
T127871-02/SUNWmv88sx/reloc/kernel/drv/amd64/marvell88sx
T127871-02/SUNWsi3124/reloc/kernel/drv/si3124
T127871-02/SUNWsi3124/reloc/kernel/drv/amd64/si3124

It seems that this resolves the device reset problem and the nfsd crash on an x4500 with one raidz2 pool and a lot of ZFS filesystems.
Peter Eriksson
2007-Dec-29 11:25 UTC
[zfs-discuss] X4500 device disconnect problem persists
Still no news on when a real patch will be released for this issue?
Gerry Haskins
2008-Mar-11 11:03 UTC
[zfs-discuss] X4500 device disconnect problem persists
Do *NOT* install 127871-02 on a Solaris 10 system.

127871-02 is an immature Feature patch associated with Solaris 10 Update 5. Its only purpose is for constructing pre-release "builds" of Solaris 10 Update 5 for internal Sun testing. It is *not* to be installed on pre-U5 systems.

127871-02 comes from a different internal source code branch from normal Sustaining (bug fix) patches. Installing 127871-02 on a Solaris 10 system will leave the system in an undefined state. Please see the warnings in the patch README file.

If this patch has been installed on a production system, please back it out immediately. Please let me know who gave you this patch, as you should not have been given it.

A later revision of this patch (or an accumulating patch) will be released to SunSolve once Solaris 10 Update 5 ships in April/May. This later revision will be a normal Sustaining patch which you can install on any Solaris 10 system. But until then, it is not safe to install 127871.

Best Wishes,
Gerry Haskins
Senior Engineering Manager
Software Product Engineering