Ray Van Dolson
2010-Mar-25 18:25 UTC
[zfs-discuss] Write retry errors to SSD's on SAS backplane (mpt)
We have a Silicon Mechanics server with a SuperMicro X8DT3-F (Rev 1.02) motherboard, an onboard LSI 1068E (firmware 1.28.02.00), and a SuperMicro SAS-846EL1 (Rev 1.1) backplane.

We have four Intel X25-Es attached to the backplane, two acting as ZIL and two as L2ARC. The remaining 21 drives are 1TB SATA.

The system is being used as an NFS datastore for VMware ESX and, while not too heavily loaded, we'll occasionally see these pop up in the logs:

Feb 28 22:46:22 prodsys-t2-zfs1 scsi: [ID 365881 kern.info] /pci@0,0/pci8086,340f@8/pci15d9,1@0 (mpt0):
Feb 28 22:46:22 prodsys-t2-zfs1 Log info 31126000 received for target 31.
Feb 28 22:46:22 prodsys-t2-zfs1 scsi_status=0, ioc_status=804b, scsi_state=c
Feb 28 22:46:22 prodsys-t2-zfs1 scsi: [ID 365881 kern.info] /pci@0,0/pci8086,340f@8/pci15d9,1@0 (mpt0):
Feb 28 22:46:22 prodsys-t2-zfs1 Log info 31126000 received for target 31.
Feb 28 22:46:22 prodsys-t2-zfs1 scsi_status=0, ioc_status=804b, scsi_state=c
Feb 28 22:46:22 prodsys-t2-zfs1 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,340f@8/pci15d9,1@0/sd@1f,0 (sd24):
Feb 28 22:46:22 prodsys-t2-zfs1 Error for Command: write Error Level: Retryable
Feb 28 22:46:22 prodsys-t2-zfs1 scsi: [ID 107833 kern.notice] Requested Block: 591744 Error Block: 591744
Feb 28 22:46:22 prodsys-t2-zfs1 scsi: [ID 107833 kern.notice] Vendor: ATA Serial Number: CVEM002600FD
Feb 28 22:46:22 prodsys-t2-zfs1 scsi: [ID 107833 kern.notice] Sense Key: Unit Attention
Feb 28 22:46:22 prodsys-t2-zfs1 scsi: [ID 107833 kern.notice] ASC: 0x29 (power on, reset, or bus reset occurred), ASCQ: 0x0, FRU: 0x0
Mar 1 01:10:40 prodsys-t2-zfs1 scsi: [ID 365881 kern.info] /pci@0,0/pci8086,340f@8/pci15d9,1@0 (mpt0):
Mar 1 01:10:40 prodsys-t2-zfs1 Log info 31126000 received for target 30.
Mar 1 01:10:40 prodsys-t2-zfs1 scsi_status=0, ioc_status=804b, scsi_state=c
Mar 1 01:10:40 prodsys-t2-zfs1 scsi: [ID 365881 kern.info] /pci@0,0/pci8086,340f@8/pci15d9,1@0 (mpt0):
Mar 1 01:10:40 prodsys-t2-zfs1 Log info 31126000 received for target 30.
Mar 1 01:10:40 prodsys-t2-zfs1 scsi_status=0, ioc_status=804b, scsi_state=c
Mar 1 01:10:41 prodsys-t2-zfs1 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,340f@8/pci15d9,1@0/sd@1e,0 (sd23):
Mar 1 01:10:41 prodsys-t2-zfs1 Error for Command: write Error Level: Retryable
Mar 1 01:10:41 prodsys-t2-zfs1 scsi: [ID 107833 kern.notice] Requested Block: 958744 Error Block: 958744
Mar 1 01:10:41 prodsys-t2-zfs1 scsi: [ID 107833 kern.notice] Vendor: ATA Serial Number: CVEM0033003T
Mar 1 01:10:41 prodsys-t2-zfs1 scsi: [ID 107833 kern.notice] Sense Key: Unit Attention
Mar 1 01:10:41 prodsys-t2-zfs1 scsi: [ID 107833 kern.notice] ASC: 0x29 (power on, reset, or bus reset occurred), ASCQ: 0x0, FRU: 0x0

The errors _only_ occur on whichever drives are being used for ZIL.

The system is fully patched Solaris 10 U8, and the mpt driver is version 1.92:

# modinfo | grep mpt
 40 ffffffffef8bc000 3b5f0 169 1 mpt (MPT HBA Driver v1.92)

The error messages above aren't fatal -- apparently the OS just retries the write and all is well. We haven't seen any performance impact either, but would like to track the problem down.

We've already swapped out the SSD drives. The retries continue to occur as above.

The only thing that "solves" the problem is to attach the SSD drives either to the motherboard's SATA controllers or directly to the LSI controller (bypassing the backplane). This would seem to point the finger at the backplane; however, the other 21 SATA drives never throw errors, and neither do the two SSDs being used for L2ARC.

Could there be some sort of latency or timing issue with the mpt driver that only manifests itself with a high level of writes to SSD devices hanging off a backplane (a potentially longer latency path)? Are there some SCSI command timeout settings I can tweak to perhaps "mask" these errors for the mpt driver?

The vendor will probably want to send us a backplane, but I'm not convinced it will fix the issue. Suggestions or thoughts?

Thanks,
Ray
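For anyone who wants to check mechanically that the retries really are confined to the ZIL devices, the log excerpt above can be tallied per device. What follows is only a sketch in Python, not something from the original post: the function name `tally` and the regular expressions are assumptions about the message layout shown in the excerpt, and the matching keys on the `(sdNN):` instance name and the `target NN` number rather than on the device path.

```python
import re
from collections import Counter

# Assumed patterns, derived only from the log excerpt in this thread.
SD_HEADER = re.compile(r"WARNING: .*\((sd\d+)\):")
RETRY = re.compile(r"Error for Command: (\w+)\s+Error Level: Retryable")
MPT_TARGET = re.compile(r"Log info ([0-9a-f]+) received for target (\d+)\.")


def tally(lines):
    """Count retryable errors per (sd instance, command) and mpt log-info
    events per target number."""
    retries = Counter()
    targets = Counter()
    current_sd = None  # sd instance named by the most recent WARNING header
    for line in lines:
        m = SD_HEADER.search(line)
        if m:
            current_sd = m.group(1)
            continue
        m = RETRY.search(line)
        if m and current_sd:
            retries[(current_sd, m.group(1))] += 1
            current_sd = None
            continue
        m = MPT_TARGET.search(line)
        if m:
            targets[int(m.group(2))] += 1
    return retries, targets


# Abbreviated sample taken from the excerpt (device paths written with "@").
SAMPLE = """\
Feb 28 22:46:22 prodsys-t2-zfs1 Log info 31126000 received for target 31.
Feb 28 22:46:22 prodsys-t2-zfs1 scsi_status=0, ioc_status=804b, scsi_state=c
Feb 28 22:46:22 prodsys-t2-zfs1 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,340f@8/pci15d9,1@0/sd@1f,0 (sd24):
Feb 28 22:46:22 prodsys-t2-zfs1 Error for Command: write Error Level: Retryable
Mar 1 01:10:40 prodsys-t2-zfs1 Log info 31126000 received for target 30.
Mar 1 01:10:41 prodsys-t2-zfs1 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,340f@8/pci15d9,1@0/sd@1e,0 (sd23):
Mar 1 01:10:41 prodsys-t2-zfs1 Error for Command: write Error Level: Retryable
"""

if __name__ == "__main__":
    retries, targets = tally(SAMPLE.splitlines())
    for (sd, cmd), count in sorted(retries.items()):
        print(f"{sd}: {count} retryable {cmd} error(s)")
    for target, count in sorted(targets.items()):
        # The sd unit address in the device path is the target number in hex.
        print(f"target {target} (unit address {target:x}): {count} mpt event(s)")
```

In the excerpt, mpt target 31 lines up with sd unit address 1f (sd24) and target 30 with 1e (sd23), so either counter can be compared against the devices listed under "logs" in `zpool status`.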
Marion Hakanson
2010-Mar-25 18:51 UTC
[zfs-discuss] Write retry errors to SSD's on SAS backplane (mpt)
rvandolson at esri.com said:
> We have a Silicon Mechanics server with a SuperMicro X8DT3-F (Rev 1.02)
> (onboard LSI 1068E (firmware 1.28.02.00) and a SuperMicro SAS-846EL1 (Rev
> 1.1) backplane.
> . . .
> The system is fully patched Solaris 10 U8, and the mpt driver is
> version 1.92:

Since you're running on Solaris-10 (and its mpt driver), have you tried the firmware that Sun recommends for their own 1068E-based HBA's? There are a couple of versions depending on your usage, but they're all earlier revs than the 1.28.02.00 you have:

http://www.lsi.com/support/sun/sg_xpci8sas_e_sRoHS.html

Regards,
Marion
Ray Van Dolson
2010-Mar-25 18:55 UTC
[zfs-discuss] Write retry errors to SSD's on SAS backplane (mpt)
On Thu, Mar 25, 2010 at 11:51:25AM -0700, Marion Hakanson wrote:
> rvandolson at esri.com said:
> > We have a Silicon Mechanics server with a SuperMicro X8DT3-F (Rev 1.02)
> > (onboard LSI 1068E (firmware 1.28.02.00) and a SuperMicro SAS-846EL1 (Rev
> > 1.1) backplane.
> > . . .
> > The system is fully patched Solaris 10 U8, and the mpt driver is
> > version 1.92:
>
> Since you're running on Solaris-10 (and its mpt driver), have you tried
> the firmware that Sun recommends for their own 1068E-based HBA's? There
> are a couple of versions depending on your usage, but they're all earlier
> revs than the 1.28.02.00 you have:
>
> http://www.lsi.com/support/sun/sg_xpci8sas_e_sRoHS.html

No, I haven't. Looks like something that would be worthwhile to try.

Thanks for the suggestion,
Ray
Ray Van Dolson
2010-Apr-02 02:08 UTC
[zfs-discuss] Write retry errors to SSD's on SAS backplane (mpt)
On Thu, Mar 25, 2010 at 11:55:29AM -0700, Ray Van Dolson wrote:
> On Thu, Mar 25, 2010 at 11:51:25AM -0700, Marion Hakanson wrote:
> > rvandolson at esri.com said:
> > > We have a Silicon Mechanics server with a SuperMicro X8DT3-F (Rev 1.02)
> > > (onboard LSI 1068E (firmware 1.28.02.00) and a SuperMicro SAS-846EL1 (Rev
> > > 1.1) backplane.
> > > . . .
> > > The system is fully patched Solaris 10 U8, and the mpt driver is
> > > version 1.92:
> >
> > Since you're running on Solaris-10 (and its mpt driver), have you tried
> > the firmware that Sun recommends for their own 1068E-based HBA's? There
> > are a couple of versions depending on your usage, but they're all earlier
> > revs than the 1.28.02.00 you have:
> >
> > http://www.lsi.com/support/sun/sg_xpci8sas_e_sRoHS.html
>
> No, I haven't. Looks like something that would be worthwhile to try.
>
> Thanks for the suggestion,
>

Well, haven't yet been able to try the firmware suggestion, but we did replace the backplane. No change.

I'm not sure the firmware change would do any good either. As it is now, as long as the SSD drives are attached directly to the LSI controller (no intermediary backplane), everything works fine -- no errors.

As soon as the backplane is put in the equation -- and *only* for SSD devices used as ZIL, we begin seeing the timeout/retries.

Seems like if it were a 1068E firmware issue we'd be seeing the issue whether or not the backplane is in place... but maybe I'm missing something.

Ray
Eric D. Mudama
2010-Apr-02 02:27 UTC
[zfs-discuss] Write retry errors to SSD's on SAS backplane (mpt)
On Thu, Apr 1 at 19:08, Ray Van Dolson wrote:
>Well, haven't yet been able to try the firmware suggestion, but we did
>replace the backplane. No change.
>
>I'm not sure the firmware change would do any good either. As it is
>now, as long as the SSD drives are attached directly to the LSI
>controller (no intermediary backplane), everything works fine -- no
>errors.
>
>As soon as the backplane is put in the equation -- and *only* for SSD
>devices used as ZIL, we begin seeing the timeout/retries.
>
>Seems like if it were a 1068E firmware issue we'd be seeing the issue
>whether or not the backplane is in place... but maybe I'm missing
>something.

It's possible that the backplane leads to enough signal degradation that the setup is now stressing error paths that simply aren't hit with the direct-connect cabling. This is the sort of issue that adapter (or device or expander) firmware changes can mitigate or exacerbate.

--eric

--
Eric D. Mudama
edmudama at mail.bounceswoosh.org
Ray
2010-Oct-06 15:55 UTC
[zfs-discuss] Write retry errors to SSD's on SAS backplane (mpt)
Hi,

I came across this exact same problem when using Intel X25-E Extreme 32GB SSD disks as ZIL and L2ARC devices in a T5220 server. Since I didn't see a definitive solution here, I opened a support case with Oracle. They told me to upgrade the firmware on my SSD disks and LSI Expander, and the problem went away.

Here's the solution they gave me after an analysis of my system:

1. customer is using systemboard 540-7970 (showfru). According to SSH (http://sunsolve.central/handbook_internal/Devices/System_Board/SYSBD_SE_T5120_T5220.html#7970) this is 1068E B2 board. Customer's LSI firmware is at latest already (1.27.02.00). So, 140952-02 not needed.
2. SSD is at 8850 (diskinfo), need to go to 8855->8862 (143211-01, obsolete by 143211-02, rev 8862). No Cougar Card I can see, hence aac drive patch not needed. see README in 143211-02
3. btw, Customer already at KUP 139555-08 & 140796-01.

So, here is what I suggest,

Most importantly, the ssd disk fw patch is the most critical as readme from patch 143211-02 states the possible bug fixes (CR 6918513 & 6827668). Follow the Install Instructions there.

As far as the patch 141043-01 (LSI Expander firmware for 16 Disk backplane on Sun SPARC Enterprise T5220 and T5240 platforms), your disk backplane version maybe already there. But, there is no way to tell as far as I know until one goes through the update process. Perhaps customer can skip for now. Otherwise, they can go through it but skip step 1-3

Install.info from patch 141043-01

1. # patchadd <filepath>126419-02
2. # patchadd <filepath>138888-05
3. # reboot
   **After reboot, need to install firmwareflash package for SPARC systems
4. # pkgadd -d <filepath-to-SUNWfirmwareflash>
5. # firmwareflash -l
   **To list all available ses devices in the system
6. # firmwareflash -d <ses device path> -f LSI_X28EXPDR_16DISK_BootRec_REV5-SPARC_Enterprise_T5220+T5240.rxp
7. # reboot
8. # You must now power cycle the system to run the new boot record firmware just loaded.

So the main thing you need to do is apply firmware upgrade 143211-02, which specifically addresses the issue of retryable writes on SSD disks.

--
This message posted from opensolaris.org
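As a rough illustration of the SSD firmware check in Oracle's analysis (revision 8850 found, 8862 wanted), here is a minimal sketch in Python. It is not part of Oracle's instructions: the device labels and the `inventory` mapping are hypothetical, the revisions would have to be read off by hand from whatever tool reports them, and treating a revision as a plain increasing integer is an assumption based on the 8850 -> 8855 -> 8862 sequence mentioned in the analysis.

```python
# Revision delivered by patch 143211-02, per the Oracle analysis in this post.
REQUIRED_SSD_FW = 8862


def needs_ssd_firmware_update(current_rev: str, required: int = REQUIRED_SSD_FW) -> bool:
    """Return True if the reported revision is older than the required one.

    Assumes revisions are plain decimal numbers that increase over time.
    """
    return int(current_rev.strip()) < required


if __name__ == "__main__":
    # Hypothetical inventory: device label -> firmware revision entered by hand.
    inventory = {"zil-ssd-0": "8850", "zil-ssd-1": "8862"}
    for device, rev in sorted(inventory.items()):
        status = "needs 143211-02" if needs_ssd_firmware_update(rev) else "ok"
        print(f"{device}: firmware {rev} -> {status}")
```

A check like this only flags which disks are behind; the actual update should follow the install instructions in the patch README.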