Hi,

Sorry for not replying to one of the already open threads on this topic; I've just joined the list for the purposes of this discussion and have nothing in my client to reply to yet.

I have an x86_64 OpenSolaris machine running on a Core 2 Quad Q9650 platform with two LSI SAS3081E-R PCI-E 8-port SAS controllers, with 8 drives each. The LSI cards are flashed with IT firmware from Feb 2009 (I think; I can double-check if it's important). The drives are Samsung HD154UI 1.5TB disks.

For quite a while I was using OpenSolaris 2009.06 with the OpenSolaris-provided mpt driver to operate a ZFS raidz2 pool of about 20T, and this worked perfectly fine (no issues or device errors logged for several months, no hanging). A few days ago I decided to reinstall with the latest OpenSolaris in order to take advantage of raidz3. I hadn't known at the time about the current mpt issues, or I might have held off on upgrading.

I installed Solaris Nevada build 127 from the DVD image. I then proceeded to set up a raidz3 pool with the same disks as before, of a slightly smaller size (obviously) than the former raidz2 pool. I started a moderately long-running, heavy-load rsync to copy my data back to the pool from another host. Several times during the day (sometimes a couple of times an hour, or it could go up to a few hours with no errors), I get several syslog errors and warnings about mpt, similar but not identical to what I've seen reported here by others. Also, iostat -en shows hw and trn errors of varying amounts for all the drives (in OpenSolaris 2009.06 I never had any iostat errors).

After a while the machine will hang in a variety of ways. The first time it was pingable, and I could authenticate through ssh, but it would never spawn a shell. The second time it crashed it was unpingable from the network, and the display was black, although the numlock key was still properly toggling the numlock light on the console.

Here's a sample of my errors.
I've included the complete series of errors from one timestamp, and a few lines from a subsequent series of errors a couple of minutes later. (If there's any other info I can provide or more things to test, just let me know.)

Thanks,
--Chad

Nov 29 04:42:55 the-vault scsi: [ID 243001 kern.warning] WARNING: /pci at 0,0/pci8086,277a at 3/pci111d,801c at 0/pci111d,801c at 1/pci1000,3140 at 0 (mpt1):
Nov 29 04:42:55 the-vault mpt_handle_event_sync: IOCStatus=0x8000, IOCLogInfo=0x31120200
Nov 29 04:42:55 the-vault scsi: [ID 243001 kern.warning] WARNING: /pci at 0,0/pci8086,277a at 3/pci111d,801c at 0/pci111d,801c at 1/pci1000,3140 at 0 (mpt1):
Nov 29 04:42:55 the-vault mpt_handle_event: IOCStatus=0x8000, IOCLogInfo=0x31120200
Nov 29 04:42:55 the-vault scsi: [ID 243001 kern.warning] WARNING: /pci at 0,0/pci8086,277a at 3/pci111d,801c at 0/pci111d,801c at 0/pci1000,3140 at 0 (mpt0):
Nov 29 04:42:55 the-vault mpt_handle_event_sync: IOCStatus=0x8000, IOCLogInfo=0x31120200
Nov 29 04:42:55 the-vault scsi: [ID 243001 kern.warning] WARNING: /pci at 0,0/pci8086,277a at 3/pci111d,801c at 0/pci111d,801c at 0/pci1000,3140 at 0 (mpt0):
Nov 29 04:42:55 the-vault mpt_handle_event: IOCStatus=0x8000, IOCLogInfo=0x31120200
Nov 29 04:42:55 the-vault scsi: [ID 243001 kern.warning] WARNING: /pci at 0,0/pci8086,277a at 3/pci111d,801c at 0/pci111d,801c at 0/pci1000,3140 at 0 (mpt0):
Nov 29 04:42:55 the-vault mpt_handle_event_sync: IOCStatus=0x8000, IOCLogInfo=0x31120200
Nov 29 04:42:55 the-vault scsi: [ID 243001 kern.warning] WARNING: /pci at 0,0/pci8086,277a at 3/pci111d,801c at 0/pci111d,801c at 0/pci1000,3140 at 0 (mpt0):
Nov 29 04:42:55 the-vault mpt_handle_event: IOCStatus=0x8000, IOCLogInfo=0x31120200
Nov 29 04:42:55 the-vault scsi: [ID 243001 kern.warning] WARNING: /pci at 0,0/pci8086,277a at 3/pci111d,801c at 0/pci111d,801c at 1/pci1000,3140 at 0 (mpt1):
Nov 29 04:42:55 the-vault mpt_handle_event_sync: IOCStatus=0x8000, IOCLogInfo=0x31120200
Nov 29 04:42:55 the-vault scsi: [ID 243001
kern.warning] WARNING: /pci at 0,0/pci8086,277a at 3/pci111d,801c at 0/pci111d,801c at 1/pci1000,3140 at 0 (mpt1): Nov 29 04:42:55 the-vault mpt_handle_event: IOCStatus=0x8000, IOCLogInfo=0x31120200 Nov 29 04:42:55 the-vault scsi: [ID 243001 kern.warning] WARNING: /pci at 0,0/pci8086,277a at 3/pci111d,801c at 0/pci111d,801c at 1/pci1000,3140 at 0 (mpt1): Nov 29 04:42:55 the-vault mpt_handle_event_sync: IOCStatus=0x8000, IOCLogInfo=0x31120200 Nov 29 04:42:55 the-vault scsi: [ID 243001 kern.warning] WARNING: /pci at 0,0/pci8086,277a at 3/pci111d,801c at 0/pci111d,801c at 1/pci1000,3140 at 0 (mpt1): Nov 29 04:42:55 the-vault mpt_handle_event: IOCStatus=0x8000, IOCLogInfo=0x31120200 Nov 29 04:42:55 the-vault scsi: [ID 243001 kern.warning] WARNING: /pci at 0,0/pci8086,277a at 3/pci111d,801c at 0/pci111d,801c at 1/pci1000,3140 at 0 (mpt1): Nov 29 04:42:55 the-vault mpt_handle_event_sync: IOCStatus=0x8000, IOCLogInfo=0x31120200 Nov 29 04:42:55 the-vault scsi: [ID 243001 kern.warning] WARNING: /pci at 0,0/pci8086,277a at 3/pci111d,801c at 0/pci111d,801c at 1/pci1000,3140 at 0 (mpt1): Nov 29 04:42:55 the-vault mpt_handle_event: IOCStatus=0x8000, IOCLogInfo=0x31120200 Nov 29 04:42:55 the-vault scsi: [ID 365881 kern.info] /pci at 0,0/pci8086,277a at 3/pci111d,801c at 0/pci111d,801c at 0/pci1000,3140 at 0 (mpt0): Nov 29 04:42:55 the-vault Log info 0x31120200 received for target 7. Nov 29 04:42:55 the-vault scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc Nov 29 04:42:55 the-vault scsi: [ID 365881 kern.info] /pci at 0,0/pci8086,277a at 3/pci111d,801c at 0/pci111d,801c at 0/pci1000,3140 at 0 (mpt0): Nov 29 04:42:55 the-vault Log info 0x31120200 received for target 7. Nov 29 04:42:55 the-vault scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc Nov 29 04:42:55 the-vault scsi: [ID 365881 kern.info] /pci at 0,0/pci8086,277a at 3/pci111d,801c at 0/pci111d,801c at 0/pci1000,3140 at 0 (mpt0): Nov 29 04:42:55 the-vault Log info 0x31120200 received for target 7. 
Nov 29 04:42:55 the-vault scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc Nov 29 04:42:55 the-vault scsi: [ID 365881 kern.info] /pci at 0,0/pci8086,277a at 3/pci111d,801c at 0/pci111d,801c at 0/pci1000,3140 at 0 (mpt0): Nov 29 04:42:55 the-vault Log info 0x31120200 received for target 7. Nov 29 04:42:55 the-vault scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc Nov 29 04:42:55 the-vault scsi: [ID 365881 kern.info] /pci at 0,0/pci8086,277a at 3/pci111d,801c at 0/pci111d,801c at 0/pci1000,3140 at 0 (mpt0): Nov 29 04:42:55 the-vault Log info 0x31120200 received for target 7. Nov 29 04:42:55 the-vault scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc Nov 29 04:42:55 the-vault scsi: [ID 365881 kern.info] /pci at 0,0/pci8086,277a at 3/pci111d,801c at 0/pci111d,801c at 1/pci1000,3140 at 0 (mpt1): Nov 29 04:42:55 the-vault Log info 0x31120200 received for target 6. Nov 29 04:42:55 the-vault scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc Nov 29 04:42:55 the-vault scsi: [ID 365881 kern.info] /pci at 0,0/pci8086,277a at 3/pci111d,801c at 0/pci111d,801c at 1/pci1000,3140 at 0 (mpt1): Nov 29 04:42:55 the-vault Log info 0x31120200 received for target 6. Nov 29 04:42:55 the-vault scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc Nov 29 04:42:55 the-vault scsi: [ID 365881 kern.info] /pci at 0,0/pci8086,277a at 3/pci111d,801c at 0/pci111d,801c at 1/pci1000,3140 at 0 (mpt1): Nov 29 04:42:55 the-vault Log info 0x31120200 received for target 6. Nov 29 04:42:55 the-vault scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc Nov 29 04:42:55 the-vault scsi: [ID 365881 kern.info] /pci at 0,0/pci8086,277a at 3/pci111d,801c at 0/pci111d,801c at 1/pci1000,3140 at 0 (mpt1): Nov 29 04:42:55 the-vault Log info 0x31120200 received for target 6. 
Nov 29 04:42:55 the-vault scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc Nov 29 04:42:55 the-vault scsi: [ID 365881 kern.info] /pci at 0,0/pci8086,277a at 3/pci111d,801c at 0/pci111d,801c at 1/pci1000,3140 at 0 (mpt1): Nov 29 04:42:55 the-vault Log info 0x31120200 received for target 9. Nov 29 04:42:55 the-vault scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc Nov 29 04:42:55 the-vault scsi: [ID 365881 kern.info] /pci at 0,0/pci8086,277a at 3/pci111d,801c at 0/pci111d,801c at 1/pci1000,3140 at 0 (mpt1): Nov 29 04:42:55 the-vault Log info 0x31120200 received for target 9. Nov 29 04:42:55 the-vault scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc Nov 29 04:42:55 the-vault scsi: [ID 365881 kern.info] /pci at 0,0/pci8086,277a at 3/pci111d,801c at 0/pci111d,801c at 1/pci1000,3140 at 0 (mpt1): Nov 29 04:42:55 the-vault Log info 0x31120200 received for target 9. Nov 29 04:42:55 the-vault scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc Nov 29 04:42:55 the-vault scsi: [ID 365881 kern.info] /pci at 0,0/pci8086,277a at 3/pci111d,801c at 0/pci111d,801c at 1/pci1000,3140 at 0 (mpt1): Nov 29 04:42:55 the-vault Log info 0x31120200 received for target 9. Nov 29 04:42:55 the-vault scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc Nov 29 04:42:55 the-vault scsi: [ID 365881 kern.info] /pci at 0,0/pci8086,277a at 3/pci111d,801c at 0/pci111d,801c at 0/pci1000,3140 at 0 (mpt0): Nov 29 04:42:55 the-vault Log info 0x31120200 received for target 9. Nov 29 04:42:55 the-vault scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc Nov 29 04:42:55 the-vault scsi: [ID 365881 kern.info] /pci at 0,0/pci8086,277a at 3/pci111d,801c at 0/pci111d,801c at 0/pci1000,3140 at 0 (mpt0): Nov 29 04:42:55 the-vault Log info 0x31120200 received for target 9. 
Nov 29 04:42:55 the-vault scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc Nov 29 04:42:55 the-vault scsi: [ID 365881 kern.info] /pci at 0,0/pci8086,277a at 3/pci111d,801c at 0/pci111d,801c at 0/pci1000,3140 at 0 (mpt0): Nov 29 04:42:55 the-vault Log info 0x31120200 received for target 9. Nov 29 04:42:55 the-vault scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc Nov 29 04:42:55 the-vault scsi: [ID 365881 kern.info] /pci at 0,0/pci8086,277a at 3/pci111d,801c at 0/pci111d,801c at 0/pci1000,3140 at 0 (mpt0): Nov 29 04:42:55 the-vault Log info 0x31120200 received for target 9. Nov 29 04:42:55 the-vault scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc Nov 29 04:42:55 the-vault scsi: [ID 365881 kern.info] /pci at 0,0/pci8086,277a at 3/pci111d,801c at 0/pci111d,801c at 0/pci1000,3140 at 0 (mpt0): Nov 29 04:42:55 the-vault Log info 0x31120200 received for target 9. Nov 29 04:42:55 the-vault scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc Nov 29 04:42:55 the-vault scsi: [ID 365881 kern.info] /pci at 0,0/pci8086,277a at 3/pci111d,801c at 0/pci111d,801c at 0/pci1000,3140 at 0 (mpt0): Nov 29 04:42:55 the-vault Log info 0x31120200 received for target 9. Nov 29 04:42:55 the-vault scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc Nov 29 04:42:55 the-vault scsi: [ID 365881 kern.info] /pci at 0,0/pci8086,277a at 3/pci111d,801c at 0/pci111d,801c at 0/pci1000,3140 at 0 (mpt0): Nov 29 04:42:55 the-vault Log info 0x31120200 received for target 9. Nov 29 04:42:55 the-vault scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc Nov 29 04:42:55 the-vault scsi: [ID 365881 kern.info] /pci at 0,0/pci8086,277a at 3/pci111d,801c at 0/pci111d,801c at 0/pci1000,3140 at 0 (mpt0): Nov 29 04:42:55 the-vault Log info 0x31120200 received for target 9. 
Nov 29 04:42:55 the-vault scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc Nov 29 04:42:55 the-vault scsi: [ID 365881 kern.info] /pci at 0,0/pci8086,277a at 3/pci111d,801c at 0/pci111d,801c at 1/pci1000,3140 at 0 (mpt1): Nov 29 04:42:55 the-vault Log info 0x31120200 received for target 7. Nov 29 04:42:55 the-vault scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc Nov 29 04:42:55 the-vault scsi: [ID 365881 kern.info] /pci at 0,0/pci8086,277a at 3/pci111d,801c at 0/pci111d,801c at 1/pci1000,3140 at 0 (mpt1): Nov 29 04:42:55 the-vault Log info 0x31120200 received for target 7. Nov 29 04:42:55 the-vault scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc Nov 29 04:42:55 the-vault scsi: [ID 365881 kern.info] /pci at 0,0/pci8086,277a at 3/pci111d,801c at 0/pci111d,801c at 1/pci1000,3140 at 0 (mpt1): Nov 29 04:42:55 the-vault Log info 0x31120200 received for target 7. Nov 29 04:42:55 the-vault scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc Nov 29 04:42:55 the-vault scsi: [ID 365881 kern.info] /pci at 0,0/pci8086,277a at 3/pci111d,801c at 0/pci111d,801c at 1/pci1000,3140 at 0 (mpt1): Nov 29 04:42:55 the-vault Log info 0x31120200 received for target 7. Nov 29 04:42:55 the-vault scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc Nov 29 04:42:55 the-vault scsi: [ID 365881 kern.info] /pci at 0,0/pci8086,277a at 3/pci111d,801c at 0/pci111d,801c at 1/pci1000,3140 at 0 (mpt1): Nov 29 04:42:55 the-vault Log info 0x31120200 received for target 8. Nov 29 04:42:55 the-vault scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc Nov 29 04:42:55 the-vault scsi: [ID 365881 kern.info] /pci at 0,0/pci8086,277a at 3/pci111d,801c at 0/pci111d,801c at 1/pci1000,3140 at 0 (mpt1): Nov 29 04:42:55 the-vault Log info 0x31120200 received for target 8. 
Nov 29 04:42:55 the-vault scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc Nov 29 04:42:55 the-vault scsi: [ID 365881 kern.info] /pci at 0,0/pci8086,277a at 3/pci111d,801c at 0/pci111d,801c at 1/pci1000,3140 at 0 (mpt1): Nov 29 04:42:55 the-vault Log info 0x31120200 received for target 8. Nov 29 04:42:55 the-vault scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc Nov 29 04:42:55 the-vault scsi: [ID 365881 kern.info] /pci at 0,0/pci8086,277a at 3/pci111d,801c at 0/pci111d,801c at 1/pci1000,3140 at 0 (mpt1): Nov 29 04:42:55 the-vault Log info 0x31120200 received for target 8. Nov 29 04:42:55 the-vault scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc Nov 29 04:42:55 the-vault scsi: [ID 365881 kern.info] /pci at 0,0/pci8086,277a at 3/pci111d,801c at 0/pci111d,801c at 1/pci1000,3140 at 0 (mpt1): Nov 29 04:42:55 the-vault Log info 0x31120200 received for target 8. Nov 29 04:42:55 the-vault scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc Nov 29 04:42:55 the-vault scsi: [ID 365881 kern.info] /pci at 0,0/pci8086,277a at 3/pci111d,801c at 0/pci111d,801c at 1/pci1000,3140 at 0 (mpt1): Nov 29 04:42:55 the-vault Log info 0x31120200 received for target 8. 
Nov 29 04:42:55 the-vault scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc Nov 29 04:44:25 the-vault scsi: [ID 243001 kern.warning] WARNING: /pci at 0,0/pci8086,277a at 3/pci111d,801c at 0/pci111d,801c at 0/pci1000,3140 at 0 (mpt0): Nov 29 04:44:25 the-vault mpt_handle_event_sync: IOCStatus=0x8000, IOCLogInfo=0x31120200 Nov 29 04:44:25 the-vault scsi: [ID 243001 kern.warning] WARNING: /pci at 0,0/pci8086,277a at 3/pci111d,801c at 0/pci111d,801c at 0/pci1000,3140 at 0 (mpt0): Nov 29 04:44:25 the-vault mpt_handle_event: IOCStatus=0x8000, IOCLogInfo=0x31120200 Nov 29 04:44:25 the-vault scsi: [ID 243001 kern.warning] WARNING: /pci at 0,0/pci8086,277a at 3/pci111d,801c at 0/pci111d,801c at 1/pci1000,3140 at 0 (mpt1): Nov 29 04:44:25 the-vault mpt_handle_event_sync: IOCStatus=0x8000, IOCLogInfo=0x31120200 Nov 29 04:44:25 the-vault scsi: [ID 243001 kern.warning] WARNING: /pci at 0,0/pci8086,277a at 3/pci111d,801c at 0/pci111d,801c at 1/pci1000,3140 at 0 (mpt1): Nov 29 04:44:25 the-vault mpt_handle_event: IOCStatus=0x8000, IOCLogInfo=0x31120200 Nov 29 04:44:25 the-vault scsi: [ID 243001 kern.warning] WARNING: /pci at 0,0/pci8086,277a at 3/pci111d,801c at 0/pci111d,801c at 0/pci1000,3140 at 0 (mpt0):
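The IOCLogInfo value 0x31120200 that repeats in the warnings above can be split into fields. This is a minimal sketch, assuming the bit layout used by the Linux Fusion-MPT driver for SAS log info words (bus type in the top 4 bits, originator in the next 4, then an 8-bit code and a 16-bit subcode). That layout is an assumption borrowed from another driver, not something stated in this thread, and the function name `decode_loginfo` is hypothetical. The sketch deliberately does not try to name what each code means.

```python
def decode_loginfo(value: int) -> dict:
    """Split a 32-bit IOCLogInfo word into four fields.

    Assumed layout (not confirmed by this thread):
      bits 31-28 bus type, bits 27-24 originator,
      bits 23-16 code,     bits 15-0  subcode.
    """
    return {
        "bus_type": (value >> 28) & 0xF,
        "originator": (value >> 24) & 0xF,
        "code": (value >> 16) & 0xFF,
        "subcode": value & 0xFFFF,
    }


# The value reported by mpt_handle_event and mpt_handle_event_sync.
fields = decode_loginfo(0x31120200)
for name, field in fields.items():
    print(f"{name}: {field:#x}")
```

Looking up the resulting code and subcode still requires the vendor's log info definitions; the split only makes it easier to compare values across reports.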
Chad Cantwell wrote:
> Hi,
>
> Sorry for not replying to one of the already open threads on this topic;
> I've just joined the list for the purposes of this discussion and have
> nothing in my client to reply to yet.
>
> I have an x86_64 opensolaris machine running on a Core 2 Quad Q9650
> platform with two LSI SAS3081E-R PCI-E 8 port SAS controllers, with
> 8 drives each.

Are these disks internal to your server's chassis, or external in a jbod? If in a jbod, which one? Also, which cables are you using?

Thank you,
James C. McPherson
--
Senior Kernel Software Engineer, Solaris
Sun Microsystems
http://blogs.sun.com/jmcp http://www.jmcp.homeunix.com/blog
Hi,

I replied to your previous general query already, but in summary, they are in the server chassis. It's a Chenbro case with 16 hotswap bays. It has 4 mini backplanes that each connect via an SFF-8087 cable (1m) to my LSI cards (2 cables / 8 drives per card).

Chad

On Tue, Dec 01, 2009 at 01:02:34PM +1000, James C. McPherson wrote:
> Chad Cantwell wrote:
> >Hi,
> >
> >Sorry for not replying to one of the already open threads on this topic;
> >I've just joined the list for the purposes of this discussion and have
> >nothing in my client to reply to yet.
> >
> >I have an x86_64 opensolaris machine running on a Core 2 Quad Q9650
> >platform with two LSI SAS3081E-R PCI-E 8 port SAS controllers, with
> >8 drives each.
>
> Are these disks internal to your server's chassis, or external in
> a jbod? If in a jbod, which one? Also, which cables are you using?
>
>
> Thank you,
> James C. McPherson
> --
> Senior Kernel Software Engineer, Solaris
> Sun Microsystems
> http://blogs.sun.com/jmcp http://www.jmcp.homeunix.com/blog
Chad Cantwell wrote:
> Hi,
>
> Replied to your previous general query already, but in summary, they are in the
> server chassis. It's a Chenbro 16 hotswap bay case. It has 4 mini backplanes
> that each connect via an SFF-8087 cable (1m) to my LSI cards (2 cables / 8 drives
> per card).

Hi Chad,
thanks for the followup. Just to confirm: have you got this Chenbro chassis connected to the actual server chassis (where the CPU is), or do you have the CPU inside the Chenbro chassis?

Thank you,
James
--
Senior Kernel Software Engineer, Solaris
Sun Microsystems
http://blogs.sun.com/jmcp http://www.jmcp.homeunix.com/blog
Hi,

The Chenbro chassis contains everything: the motherboard/CPU and the disks. As far as I know, the Chenbro backplanes are basically electrical jumpers that the LSI cards shouldn't be aware of. They pass the SATA signals directly from the SFF-8087 cables through to the disks.

Thanks,
Chad

On Tue, Dec 01, 2009 at 01:43:06PM +1000, James C. McPherson wrote:
> Chad Cantwell wrote:
> >Hi,
> >
> >Replied to your previous general query already, but in summary, they are in the
> >server chassis. It's a Chenbro 16 hotswap bay case. It has 4 mini backplanes
> >that each connect via an SFF-8087 cable (1m) to my LSI cards (2 cables / 8 drives
> >per card).
>
> Hi Chad,
> thanks for the followup. Just to confirm - you've got this
> Chenbro chassis connected to the actual server chassis (where
> the cpu is), or do you have the cpu inside the Chenbro chassis?
>
>
> Thank you,
> James
> --
> Senior Kernel Software Engineer, Solaris
> Sun Microsystems
> http://blogs.sun.com/jmcp http://www.jmcp.homeunix.com/blog
After another crash I checked the syslog and there were some different errors than the ones I saw previously during operation: Nov 30 20:26:11 the-vault scsi: [ID 107833 kern.warning] WARNING: /pci at 0,0/pci8086,277a at 3/pci111d,801c at 0/pci111d,801c at 1/pci1000,3140 at 0 (mpt1): Nov 30 20:26:11 the-vault Disconnected command timeout for Target 10 Nov 30 20:59:12 the-vault scsi: [ID 107833 kern.warning] WARNING: /pci at 0,0/pci8086,277a at 3/pci111d,801c at 0/pci111d,801c at 1/pci1000,3140 at 0 (mpt1): Nov 30 20:59:12 the-vault mpt_send_handshake_msg task 3 failed Nov 30 20:59:13 the-vault scsi: [ID 107833 kern.warning] WARNING: /pci at 0,0/pci8086,277a at 3/pci111d,801c at 0/pci111d,801c at 1/pci1000,3140 at 0 (mpt1): Nov 30 20:59:13 the-vault LSI PCI device (1000,ffff) not supported. Nov 30 20:59:13 the-vault scsi: [ID 107833 kern.warning] WARNING: /pci at 0,0/pci8086,277a at 3/pci111d,801c at 0/pci111d,801c at 1/pci1000,3140 at 0 (mpt1): Nov 30 20:59:13 the-vault mpt_config_space_init failed Nov 30 20:59:15 the-vault scsi: [ID 107833 kern.warning] WARNING: /pci at 0,0/pci8086,277a at 3/pci111d,801c at 0/pci111d,801c at 1/pci1000,3140 at 0 (mpt1): Nov 30 20:59:15 the-vault LSI PCI device (1000,ffff) not supported. 
Nov 30 20:59:15 the-vault scsi: [ID 107833 kern.warning] WARNING: /pci at 0,0/pci8086,277a at 3/pci111d,801c at 0/pci111d,801c at 1/pci1000,3140 at 0 (mpt1): Nov 30 20:59:15 the-vault mpt_config_space_init failed Nov 30 20:59:15 the-vault scsi: [ID 107833 kern.warning] WARNING: /pci at 0,0/pci8086,277a at 3/pci111d,801c at 0/pci111d,801c at 1/pci1000,3140 at 0 (mpt1): Nov 30 20:59:15 the-vault mpt_restart_ioc failed Nov 30 21:32:17 the-vault scsi: [ID 107833 kern.warning] WARNING: /pci at 0,0/pci8086,277a at 3/pci111d,801c at 0/pci111d,801c at 1/pci1000,3140 at 0 (mpt1): Nov 30 21:32:17 the-vault mpt_send_handshake_msg task 4 failed Nov 30 21:32:18 the-vault scsi: [ID 107833 kern.warning] WARNING: /pci at 0,0/pci8086,277a at 3/pci111d,801c at 0/pci111d,801c at 1/pci1000,3140 at 0 (mpt1): Nov 30 21:32:18 the-vault LSI PCI device (1000,ffff) not supported. Nov 30 21:32:18 the-vault scsi: [ID 107833 kern.warning] WARNING: /pci at 0,0/pci8086,277a at 3/pci111d,801c at 0/pci111d,801c at 1/pci1000,3140 at 0 (mpt1): Nov 30 21:32:18 the-vault mpt_config_space_init failed Nov 30 21:32:19 the-vault scsi: [ID 107833 kern.warning] WARNING: /pci at 0,0/pci8086,277a at 3/pci111d,801c at 0/pci111d,801c at 1/pci1000,3140 at 0 (mpt1): Nov 30 21:32:19 the-vault LSI PCI device (1000,ffff) not supported. 
Nov 30 21:32:19 the-vault scsi: [ID 107833 kern.warning] WARNING: /pci at 0,0/pci8086,277a at 3/pci111d,801c at 0/pci111d,801c at 1/pci1000,3140 at 0 (mpt1): Nov 30 21:32:19 the-vault mpt_config_space_init failed Nov 30 21:32:19 the-vault scsi: [ID 107833 kern.warning] WARNING: /pci at 0,0/pci8086,277a at 3/pci111d,801c at 0/pci111d,801c at 1/pci1000,3140 at 0 (mpt1): Nov 30 21:32:19 the-vault mpt_restart_ioc failed Nov 30 21:32:19 the-vault scsi: [ID 107833 kern.warning] WARNING: /pci at 0,0/pci8086,277a at 3/pci111d,801c at 0/pci111d,801c at 1/pci1000,3140 at 0 (mpt1): Nov 30 21:32:19 the-vault Rejecting future commands Nov 30 21:32:19 the-vault scsi: [ID 107833 kern.warning] WARNING: /pci at 0,0/pci8086,277a at 3/pci111d,801c at 0/pci111d,801c at 0/pci1000,3140 at 0 (mpt0): Nov 30 21:32:19 the-vault Disconnected command timeout for Target 14 Nov 30 21:32:19 the-vault scsi: [ID 107833 kern.warning] WARNING: /pci at 0,0/pci8086,277a at 3/pci111d,801c at 0/pci111d,801c at 1/pci1000,3140 at 0 (mpt1): Nov 30 21:32:19 the-vault rejecting command, throttle choked Nov 30 21:32:19 the-vault scsi: [ID 107833 kern.warning] WARNING: /pci at 0,0/pci8086,277a at 3/pci111d,801c at 0/pci111d,801c at 1/pci1000,3140 at 0 (mpt1): Nov 30 21:32:19 the-vault rejecting command, throttle choked Nov 30 21:32:19 the-vault scsi: [ID 107833 kern.warning] WARNING: /pci at 0,0/pci8086,277a at 3/pci111d,801c at 0/pci111d,801c at 1/pci1000,3140 at 0 (mpt1): Nov 30 21:32:19 the-vault rejecting command, throttle choked Nov 30 21:32:19 the-vault scsi: [ID 107833 kern.warning] WARNING: /pci at 0,0/pci8086,277a at 3/pci111d,801c at 0/pci111d,801c at 1/pci1000,3140 at 0 (mpt1): Nov 30 21:32:19 the-vault rejecting command, throttle choked Nov 30 21:32:19 the-vault scsi: [ID 107833 kern.warning] WARNING: /pci at 0,0/pci8086,277a at 3/pci111d,801c at 0/pci111d,801c at 1/pci1000,3140 at 0 (mpt1): Nov 30 21:32:19 the-vault rejecting command, throttle choked Nov 30 21:32:19 the-vault scsi: [ID 107833 
kern.warning] WARNING: /pci at 0,0/pci8086,277a at 3/pci111d,801c at 0/pci111d,801c at 1/pci1000,3140 at 0 (mpt1): Nov 30 21:32:19 the-vault rejecting command, throttle choked Nov 30 21:32:19 the-vault scsi: [ID 107833 kern.warning] WARNING: /pci at 0,0/pci8086,277a at 3/pci111d,801c at 0/pci111d,801c at 1/pci1000,3140 at 0 (mpt1): Nov 30 21:32:19 the-vault rejecting command, throttle choked Nov 30 21:32:19 the-vault scsi: [ID 107833 kern.warning] WARNING: /pci at 0,0/pci8086,277a at 3/pci111d,801c at 0/pci111d,801c at 1/pci1000,3140 at 0 (mpt1): Nov 30 21:32:19 the-vault rejecting command, throttle choked Nov 30 21:32:19 the-vault scsi: [ID 107833 kern.warning] WARNING: /pci at 0,0/pci8086,277a at 3/pci111d,801c at 0/pci111d,801c at 1/pci1000,3140 at 0 (mpt1): Nov 30 21:32:19 the-vault rejecting command, throttle choked Nov 30 21:32:19 the-vault scsi: [ID 107833 kern.warning] WARNING: /pci at 0,0/pci8086,277a at 3/pci111d,801c at 0/pci111d,801c at 1/pci1000,3140 at 0 (mpt1): Nov 30 21:32:19 the-vault rejecting command, throttle choked Nov 30 21:32:19 the-vault scsi: [ID 107833 kern.warning] WARNING: /pci at 0,0/pci8086,277a at 3/pci111d,801c at 0/pci111d,801c at 1/pci1000,3140 at 0 (mpt1): Nov 30 21:32:19 the-vault rejecting command, throttle choked Nov 30 21:32:19 the-vault scsi: [ID 107833 kern.warning] WARNING: /pci at 0,0/pci8086,277a at 3/pci111d,801c at 0/pci111d,801c at 1/pci1000,3140 at 0 (mpt1): Nov 30 21:32:19 the-vault rejecting command, throttle choked Nov 30 21:32:19 the-vault scsi: [ID 107833 kern.warning] WARNING: /pci at 0,0/pci8086,277a at 3/pci111d,801c at 0/pci111d,801c at 1/pci1000,3140 at 0 (mpt1): Nov 30 21:32:19 the-vault rejecting command, throttle choked Nov 30 21:32:19 the-vault scsi: [ID 107833 kern.warning] WARNING: /pci at 0,0/pci8086,277a at 3/pci111d,801c at 0/pci111d,801c at 1/pci1000,3140 at 0 (mpt1): Nov 30 21:32:19 the-vault rejecting command, throttle choked Nov 30 21:32:19 the-vault scsi: [ID 107833 kern.warning] WARNING: /pci 
at 0,0/pci8086,277a at 3/pci111d,801c at 0/pci111d,801c at 1/pci1000,3140 at 0 (mpt1): Nov 30 21:32:19 the-vault rejecting command, throttle choked Nov 30 21:32:19 the-vault scsi: [ID 107833 kern.warning] WARNING: /pci at 0,0/pci8086,277a at 3/pci111d,801c at 0/pci111d,801c at 1/pci1000,3140 at 0 (mpt1): Nov 30 21:32:19 the-vault rejecting command, throttle choked Nov 30 21:32:19 the-vault scsi: [ID 107833 kern.warning] WARNING: /pci at 0,0/pci8086,277a at 3/pci111d,801c at 0/pci111d,801c at 1/pci1000,3140 at 0 (mpt1): Nov 30 21:32:19 the-vault rejecting command, throttle choked Nov 30 21:32:19 the-vault scsi: [ID 107833 kern.warning] WARNING: /pci at 0,0/pci8086,277a at 3/pci111d,801c at 0/pci111d,801c at 1/pci1000,3140 at 0 (mpt1): Nov 30 21:32:19 the-vault rejecting command, throttle choked Nov 30 21:32:19 the-vault scsi: [ID 107833 kern.warning] WARNING: /pci at 0,0/pci8086,277a at 3/pci111d,801c at 0/pci111d,801c at 1/pci1000,3140 at 0 (mpt1): Nov 30 21:32:19 the-vault rejecting command, throttle choked Nov 30 21:32:19 the-vault scsi: [ID 107833 kern.warning] WARNING: /pci at 0,0/pci8086,277a at 3/pci111d,801c at 0/pci111d,801c at 1/pci1000,3140 at 0 (mpt1): Nov 30 21:32:19 the-vault rejecting command, throttle choked Nov 30 21:32:19 the-vault scsi: [ID 107833 kern.warning] WARNING: /pci at 0,0/pci8086,277a at 3/pci111d,801c at 0/pci111d,801c at 1/pci1000,3140 at 0 (mpt1): Nov 30 21:32:19 the-vault rejecting command, throttle choked Nov 30 21:32:19 the-vault scsi: [ID 107833 kern.warning] WARNING: /pci at 0,0/pci8086,277a at 3/pci111d,801c at 0/pci111d,801c at 1/pci1000,3140 at 0 (mpt1): Nov 30 21:32:19 the-vault rejecting command, throttle choked Nov 30 21:32:19 the-vault scsi: [ID 107833 kern.warning] WARNING: /pci at 0,0/pci8086,277a at 3/pci111d,801c at 0/pci111d,801c at 1/pci1000,3140 at 0 (mpt1): Nov 30 21:32:19 the-vault rejecting command, throttle choked Nov 30 21:32:19 the-vault scsi: [ID 107833 kern.warning] WARNING: /pci at 0,0/pci8086,277a at 
3/pci111d,801c at 0/pci111d,801c at 1/pci1000,3140 at 0 (mpt1): Nov 30 21:32:19 the-vault rejecting command, throttle choked Nov 30 21:32:19 the-vault scsi: [ID 107833 kern.warning] WARNING: /pci at 0,0/pci8086,277a at 3/pci111d,801c at 0/pci111d,801c at 1/pci1000,3140 at 0 (mpt1): Nov 30 21:32:19 the-vault rejecting command, throttle choked Nov 30 21:32:19 the-vault scsi: [ID 107833 kern.warning] WARNING: /pci at 0,0/pci8086,277a at 3/pci111d,801c at 0/pci111d,801c at 1/pci1000,3140 at 0 (mpt1): Nov 30 21:32:19 the-vault rejecting command, throttle choked Nov 30 21:32:19 the-vault scsi: [ID 107833 kern.warning] WARNING: /pci at 0,0/pci8086,277a at 3/pci111d,801c at 0/pci111d,801c at 1/pci1000,3140 at 0 (mpt1): Nov 30 21:32:19 the-vault rejecting command, throttle choked Nov 30 21:32:19 the-vault scsi: [ID 107833 kern.warning] WARNING: /pci at 0,0/pci8086,277a at 3/pci111d,801c at 0/pci111d,801c at 1/pci1000,3140 at 0 (mpt1): Nov 30 21:32:19 the-vault rejecting command, throttle choked Nov 30 21:32:19 the-vault scsi: [ID 107833 kern.warning] WARNING: /pci at 0,0/pci8086,277a at 3/pci111d,801c at 0/pci111d,801c at 1/pci1000,3140 at 0 (mpt1): Nov 30 21:32:19 the-vault rejecting command, throttle choked Nov 30 21:32:19 the-vault scsi: [ID 107833 kern.warning] WARNING: /pci at 0,0/pci8086,277a at 3/pci111d,801c at 0/pci111d,801c at 1/pci1000,3140 at 0 (mpt1): Nov 30 21:32:19 the-vault rejecting command, throttle choked Nov 30 21:32:19 the-vault scsi: [ID 107833 kern.warning] WARNING: /pci at 0,0/pci8086,277a at 3/pci111d,801c at 0/pci111d,801c at 1/pci1000,3140 at 0 (mpt1): Nov 30 21:32:19 the-vault rejecting command, throttle choked Nov 30 21:32:19 the-vault scsi: [ID 107833 kern.warning] WARNING: /pci at 0,0/pci8086,277a at 3/pci111d,801c at 0/pci111d,801c at 1/pci1000,3140 at 0 (mpt1): Nov 30 21:32:19 the-vault rejecting command, throttle choked Nov 30 21:32:19 the-vault scsi: [ID 107833 kern.warning] WARNING: /pci at 0,0/pci8086,277a at 3/pci111d,801c at 
0/pci111d,801c at 1/pci1000,3140 at 0 (mpt1):
Nov 30 21:32:19 the-vault rejecting command, throttle choked

(This keeps flooding like that for a _long_ time; I'm skipping most of it.)

Nov 30 21:32:21 the-vault scsi: [ID 107833 kern.warning] WARNING: /pci at 0,0/pci8086,277a at 3/pci111d,801c at 0/pci111d,801c at 1/pci1000,3140 at 0 (mpt1):
Nov 30 21:32:21 the-vault rejecting command, throttle choked
Nov 30 21:32:21 the-vault scsi: [ID 107833 kern.warning] WARNING: /pci at 0,0/pci8086,277a at 3/pci111d,801c at 0/pci111d,801c at 1/pci1000,3140 at 0 (mpt1):
Nov 30 21:32:21 the-vault rejecting command, throttle choked
Nov 30 21:32:21 the-vault scsi: [ID 107833 kern.warning] WARNING: /pci at 0,0/pci8086,277a at 3/pci111d,801c at 0/pci111d,801c at 1/pci1000,3140 at 0 (mpt1):
Nov 30 21:32:21 the-vault rejecting command, throttle choked
Nov 30 21:33:02 the-vault fmd: [ID 377184 daemon.error] SUNW-MSG-ID: PCIEX-8000-8R, TYPE: Fault, VER: 1, SEVERITY: Major
Nov 30 21:33:02 the-vault EVENT-TIME: Mon Nov 30 21:33:02 PST 2009
Nov 30 21:33:02 the-vault PLATFORM: System-Product-Name, CSN: System-Serial-Number, HOSTNAME: the-vault
Nov 30 21:33:02 the-vault SOURCE: eft, REV: 1.16
Nov 30 21:33:02 the-vault EVENT-ID: 7886cc0d-4760-60b2-e06a-8158c3334f63
Nov 30 21:33:02 the-vault DESC: The transmitting device sent an invalid request.
Nov 30 21:33:02 the-vault Refer to http://sun.com/msg/PCIEX-8000-8R for more information.
Nov 30 21:33:02 the-vault AUTO-RESPONSE: One or more device instances may be disabled
Nov 30 21:33:02 the-vault IMPACT: Loss of services provided by the device instances associated with this fault
Nov 30 21:33:02 the-vault REC-ACTION: Ensure that the latest drivers and patches are installed. Otherwise schedule a repair procedure to replace the affected device(s). Use fmadm faulty to identify the devices or contact Sun for support.
Nov 30 21:33:10 the-vault fmd: [ID 377184 daemon.error] SUNW-MSG-ID: ZFS-8000-FD, TYPE: Fault, VER: 1, SEVERITY: Major
Nov 30 21:33:10 the-vault EVENT-TIME: Mon Nov 30 21:33:09 PST 2009
Nov 30 21:33:10 the-vault PLATFORM: System-Product-Name, CSN: System-Serial-Number, HOSTNAME: the-vault
Nov 30 21:33:10 the-vault SOURCE: zfs-diagnosis, REV: 1.0
Nov 30 21:33:10 the-vault EVENT-ID: af2c0212-2915-4eee-e9b3-e2a8fe70efba
Nov 30 21:33:10 the-vault DESC: The number of I/O errors associated with a ZFS device exceeded
Nov 30 21:33:10 the-vault acceptable levels. Refer to http://sun.com/msg/ZFS-8000-FD for more information.
Nov 30 21:33:10 the-vault AUTO-RESPONSE: The device has been offlined and marked as faulted. An attempt
Nov 30 21:33:10 the-vault will be made to activate a hot spare if available.
Nov 30 21:33:10 the-vault IMPACT: Fault tolerance of the pool may be compromised.
Nov 30 21:33:10 the-vault REC-ACTION: Run 'zpool status -x' and replace the bad device.
Nov 30 21:33:10 the-vault genunix: [ID 846333 kern.warning] WARNING: constraints forbid retire: /pci at 0,0/pci8086,277a at 3/pci111d,801c at 0/pci111d,801c at 0/pci1000,3140 at 0
Nov 30 22:05:18 the-vault scsi: [ID 107833 kern.warning] WARNING: /pci at 0,0/pci8086,277a at 3/pci111d,801c at 0/pci111d,801c at 0/pci1000,3140 at 0 (mpt0):
Nov 30 22:05:18 the-vault mpt_send_handshake_msg task 3 failed
Nov 30 22:05:19 the-vault scsi: [ID 107833 kern.warning] WARNING: /pci at 0,0/pci8086,277a at 3/pci111d,801c at 0/pci111d,801c at 0/pci1000,3140 at 0 (mpt0):
Nov 30 22:05:19 the-vault LSI PCI device (1000,ffff) not supported.
Nov 30 22:05:19 the-vault scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,277a@3/pci111d,801c@0/pci111d,801c@0/pci1000,3140@0 (mpt0):
Nov 30 22:05:19 the-vault mpt_config_space_init failed
Nov 30 22:05:20 the-vault scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,277a@3/pci111d,801c@0/pci111d,801c@0/pci1000,3140@0 (mpt0):
Nov 30 22:05:20 the-vault LSI PCI device (1000,ffff) not supported.
Nov 30 22:05:20 the-vault scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,277a@3/pci111d,801c@0/pci111d,801c@0/pci1000,3140@0 (mpt0):
Nov 30 22:05:20 the-vault mpt_config_space_init failed
Nov 30 22:05:20 the-vault scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,277a@3/pci111d,801c@0/pci111d,801c@0/pci1000,3140@0 (mpt0):
Nov 30 22:05:20 the-vault mpt_restart_ioc failed
Nov 30 22:38:20 the-vault scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,277a@3/pci111d,801c@0/pci111d,801c@0/pci1000,3140@0 (mpt0):
Nov 30 22:38:20 the-vault mpt_send_handshake_msg task 4 failed
Nov 30 22:38:21 the-vault scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,277a@3/pci111d,801c@0/pci111d,801c@0/pci1000,3140@0 (mpt0):
Nov 30 22:38:21 the-vault LSI PCI device (1000,ffff) not supported.
Nov 30 22:38:21 the-vault scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,277a@3/pci111d,801c@0/pci111d,801c@0/pci1000,3140@0 (mpt0):
Nov 30 22:38:21 the-vault mpt_config_space_init failed
Nov 30 22:38:22 the-vault scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,277a@3/pci111d,801c@0/pci111d,801c@0/pci1000,3140@0 (mpt0):
Nov 30 22:38:22 the-vault LSI PCI device (1000,ffff) not supported.
Nov 30 22:38:22 the-vault scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,277a@3/pci111d,801c@0/pci111d,801c@0/pci1000,3140@0 (mpt0):
Nov 30 22:38:22 the-vault mpt_config_space_init failed
Nov 30 22:38:46 the-vault sshd[636]: [ID 800047 auth.crit] monitor fatal: protocol error during kex, no DH_GEX_REQUEST: 254
Nov 30 22:38:46 the-vault sshd[637]: [ID 800047 auth.crit] fatal: Protocol error in privilege separation; expected packet type 254, got 20
Nov 30 23:11:23 the-vault scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,277a@3/pci111d,801c@0/pci111d,801c@0/pci1000,3140@0 (mpt0):
Nov 30 23:11:23 the-vault mpt_send_handshake_msg task 3 failed
Nov 30 23:11:23 the-vault scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,277a@3/pci111d,801c@0/pci111d,801c@0/pci1000,3140@0 (mpt0):
Nov 30 23:11:23 the-vault LSI PCI device (1000,ffff) not supported.
Nov 30 23:11:23 the-vault scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,277a@3/pci111d,801c@0/pci111d,801c@0/pci1000,3140@0 (mpt0):
Nov 30 23:11:23 the-vault mpt_config_space_init failed
Nov 30 23:11:25 the-vault scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,277a@3/pci111d,801c@0/pci111d,801c@0/pci1000,3140@0 (mpt0):
Nov 30 23:11:25 the-vault LSI PCI device (1000,ffff) not supported.
Nov 30 23:11:25 the-vault scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,277a@3/pci111d,801c@0/pci111d,801c@0/pci1000,3140@0 (mpt0):
Nov 30 23:11:25 the-vault mpt_config_space_init failed
Nov 30 23:11:25 the-vault scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,277a@3/pci111d,801c@0/pci111d,801c@0/pci1000,3140@0 (mpt0):
Nov 30 23:11:25 the-vault mpt_restart_ioc failed

(and that's the last message before I hit the reset button.
Host was unpingable, and just moving the mouse around on the screen was extremely delayed)

Nov 30 23:32:05 the-vault genunix: [ID 540533 kern.notice] ^MSunOS Release 5.11 Version snv_127 64-bit
Nov 30 23:32:05 the-vault genunix: [ID 943908 kern.notice] Copyright 1983-2009 Sun Microsystems, Inc. All rights reserved.

Also, zpool status says it resilvered some data; this is the first time I've seen a note next to any of the devices. Still no zpool errors, though.

# zpool status vault
  pool: vault
 state: ONLINE
 scrub: resilver completed after 0h0m with 0 errors on Mon Nov 30 23:33:16 2009
config:

        NAME         STATE     READ WRITE CKSUM
        vault        ONLINE       0     0     0
          raidz3-0   ONLINE       0     0     0
            c1t6d0   ONLINE       0     0     0
            c1t7d0   ONLINE       0     0     0
            c1t8d0   ONLINE       0     0     0
            c1t9d0   ONLINE       0     0     0
            c1t11d0  ONLINE       0     0     0
            c1t12d0  ONLINE       0     0     0
            c1t13d0  ONLINE       0     0     0
            c1t14d0  ONLINE       0     0     0
            c2t3d0   ONLINE       0     0     0
            c2t4d0   ONLINE       0     0     0
            c2t5d0   ONLINE       0     0     0  11.5K resilvered
            c2t6d0   ONLINE       0     0     0
            c2t7d0   ONLINE       0     0     0
            c2t8d0   ONLINE       0     0     0
            c2t9d0   ONLINE       0     0     0
            c2t10d0  ONLINE       0     0     0

errors: No known data errors
#
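When comparing floods like the ones in this thread, a per-controller tally can make the pattern easier to see than scrolling through the raw log. The following is only a sketch under stated assumptions: the function name mpt_tally is made up for illustration, it assumes the layout shown in the excerpts (a header line containing "(mptN):" followed by the message on the next line), and /var/adm/messages in the usage comment is an assumption about where syslog writes on the machine in question.

```shell
#!/bin/sh
# mpt_tally FILE (hypothetical helper, not part of any Solaris tool)
# Counts the message that follows each "(mptN):" header line, per controller.
# Assumes the syslog layout quoted in this thread:
#   <Mon> <DD> <time> <host> scsi: [...] WARNING: <device path> (mptN):
#   <Mon> <DD> <time> <host> <message text>
mpt_tally() {
    awk '
        match($0, /\(mpt[0-9]+\):/) {
            # "(mpt1):" -> "mpt1": skip "(" and drop the trailing "):"
            inst = substr($0, RSTART + 1, RLENGTH - 3)
            want = 1
            next
        }
        want {
            # Fields 1-4 are month, day, time, host; the rest is the message.
            msg = $5
            for (i = 6; i <= NF; i++) msg = msg " " $i
            sub(/:.*/, "", msg)   # keep only the text before the first colon
            count[inst " " msg]++
            want = 0
        }
        END { for (k in count) print count[k], k }
    ' "$1" | sort
}

# Example (path is an assumption): mpt_tally /var/adm/messages
```

Each output line is a count, the controller instance, and the message, for example a line of the form "3 mpt1 rejecting command, throttle choked". Only the first line after each header is counted, so the scsi_status detail lines under the "Log info" entries are ignored.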
Chad Cantwell wrote:
> After another crash I checked the syslog and there were some different errors than the ones
> I saw previously during operation:
...
> Nov 30 20:59:13 the-vault LSI PCI device (1000,ffff) not supported.
...
> Nov 30 20:59:13 the-vault mpt_config_space_init failed
...
> Nov 30 20:59:15 the-vault mpt_restart_ioc failed
....
> Nov 30 21:33:02 the-vault fmd: [ID 377184 daemon.error] SUNW-MSG-ID: PCIEX-8000-8R, TYPE: Fault, VER: 1, SEVERITY: Major
> Nov 30 21:33:02 the-vault EVENT-TIME: Mon Nov 30 21:33:02 PST 2009
> Nov 30 21:33:02 the-vault PLATFORM: System-Product-Name, CSN: System-Serial-Number, HOSTNAME: the-vault
> Nov 30 21:33:02 the-vault SOURCE: eft, REV: 1.16
> Nov 30 21:33:02 the-vault EVENT-ID: 7886cc0d-4760-60b2-e06a-8158c3334f63
> Nov 30 21:33:02 the-vault DESC: The transmitting device sent an invalid request.
> Nov 30 21:33:02 the-vault Refer to http://sun.com/msg/PCIEX-8000-8R for more information.
> Nov 30 21:33:02 the-vault AUTO-RESPONSE: One or more device instances may be disabled
> Nov 30 21:33:02 the-vault IMPACT: Loss of services provided by the device instances associated with this fault
> Nov 30 21:33:02 the-vault REC-ACTION: Ensure that the latest drivers and patches are installed. Otherwise schedule a repair procedure to replace the affected device(s). Use fmadm faulty to identify the devices or contact Sun for support.

Sorry to have to tell you, but that HBA is dead. Or at
least dying horribly. If you can't init the config space
(that's the pci bus config space), then you've got about
1/2 the nails in the coffin hammered in. Then the failure
to restart the IOC (io controller unit) == the rest of
the lid hammered down.

best regards,
James C. McPherson
--
Senior Kernel Software Engineer, Solaris
Sun Microsystems
http://blogs.sun.com/jmcp http://www.jmcp.homeunix.com/blog
I don't think the hardware has any problems; it only started having errors when I upgraded OpenSolaris. It's working fine again now after a reboot.

Actually, I reread one of your earlier messages. When you said "non-Sun JBOD", I didn't realize at first that this didn't apply to me (in regards to the msi=0 fix), because I didn't realize JBOD was shorthand for an external expander device. Since I'm just using bare metal and passive backplanes, I think the msi=0 fix should apply to me based on what you wrote earlier. Anyway, I've now put

set mpt:mpt_enable_msi = 0

in /etc/system and rebooted, as was suggested earlier. I've resumed my rsync, and so far there have been no errors, but it's only been 20 minutes or so. I should have a good idea by tomorrow whether this fixed the problem (since even when the machine was not crashing, it was tallying up iostat errors fairly rapidly).

Thanks again for your help. Sorry for wasting your time if the previously posted workaround fixes things. I'll let you know tomorrow either way.

Chad
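Since the workaround only takes effect if the line actually made it into /etc/system uncommented, a quick check before drawing conclusions from a test run may be worthwhile. This is a minimal sketch, not an official tool: the function name msi_disabled is made up for illustration, and it only looks for the exact setting quoted in this thread while tolerating different spacing around the equals sign.

```shell
#!/bin/sh
# msi_disabled [FILE] (hypothetical helper)
# Succeeds if FILE (default: /etc/system) contains an uncommented
#   set mpt:mpt_enable_msi = 0
# line. Comment lines in /etc/system start with "*", so their first
# field is never exactly "set" and they are skipped.
msi_disabled() {
    awk '
        $1 == "set" {
            # Join the fields so spacing around "=" does not matter.
            line = ""
            for (i = 1; i <= NF; i++) line = line $i
            if (line == "setmpt:mpt_enable_msi=0") found = 1
        }
        END { exit(found ? 0 : 1) }
    ' "${1:-/etc/system}"
}

# Example:
#   if msi_disabled; then echo "msi workaround present"; fi
```

The check says nothing about whether the setting helps; it only confirms the file contents. After the reboot, iostat -en is still the way to see whether the error counters keep climbing.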
Well, OK, the msi=0 thing didn't help after all. A few minutes after my last message a few errors showed up in iostat, and a few minutes after that the machine locked up hard...

Maybe I will try just doing a scrub instead of my rsync process and see how that does.

Chad
This is basically just a "me too". I'm using different hardware but seeing essentially the same problems. The relevant hardware I have is:

---
SuperMicro MBD-H8Di3+-F-O motherboard with LSI 1068E onboard
SuperMicro SC846E2-R900B 4U chassis with two LSI SASx36 expander chips on the backplane
24 Western Digital RE4-GP 2TB 7.2k RPM SATA drives
---

I have two SFF-8087 to SFF-8087 cables running from the two ports on the motherboard (4 channels each) to two ports on the backplane, each port going to one of the LSI expander chips. The backplane has four additional ports which support cascading additional enclosures together, but I'm not making use of any of this at the moment.

The machine is currently dead at the data center, and it's late, so if you want anything more from me, just let me know and I'll run stuff tomorrow on the machine. But otherwise, the behavior sounds the same as all of the other MPT reports recently. I was not seeing these types of problems with 2009.06, but also wanted to upgrade to get raidz3 support. Just tell me what other commands you might want output from to help diagnose the problem.

--
This message posted from opensolaris.org
Chad Cantwell wrote:
> Hi,
>
> I was using for quite awhile OpenSolaris 2009.06
> with the opensolaris-provided mpt driver to operate a zfs raidz2 pool of
> about ~20T and this worked perfectly fine (no issues or device errors
> logged for several months, no hanging). A few days ago I decided to
> reinstall with the latest OpenSolaris in order to take advantage of
> raidz3.

Just to be clear... The same setup was working fine on osol2009.06,
you upgraded to b127 and it started failing?

Did you keep the osol2009.06 BE around so you can reboot back to it?

If so, have you tried the osol2009.06 mpt driver in the
BE with the latest bits (make sure you make a backup copy
of the mpt driver)?

MRJ
What's the earliest build in which someone has seen this problem? i.e. if we binary chop, has anyone seen it in b118?

I have no idea if the old mpt drivers will work on a new kernel... But if someone wants to try... Something like the following should work...

# first, I would work out of a test BE in case you
# mess something up.
beadm create test-be
beadm activate test-be
reboot

# assuming your latest BE is called snv127, mount it and back up
# the stock mpt driver and conf file.
beadm mount snv127 /mnt
cp /mnt/kernel/drv/mpt.conf /mnt/kernel/drv/mpt.conf.orig
cp /mnt/kernel/drv/amd64/mpt /mnt/kernel/drv/amd64/mpt.orig

# see what builds are out there...
pkg search /kernel/drv/amd64/mpt

# There's probably an easier way to do this...
# grab an older mpt. This will take a while since it's
# not in its own package and ckr has some dependencies
# so it will pull in a bunch of other packages.
# change out 118 with the build you want to grab.
mkdir /tmp/mpt
pkg image-create -f -F -a opensolaris.org=http://pkg.opensolaris.org/dev /tmp/mpt
pkg -R /tmp/mpt/ install SUNWckr@0.5.11-0.118
cp /tmp/mpt/kernel/drv/mpt.conf /mnt/kernel/drv/mpt.conf
cp /tmp/mpt/kernel/drv/amd64/mpt /mnt/kernel/drv/amd64/mpt
rm -rf /tmp/mpt/
bootadm update-archive -R /mnt

MRJ
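If the older driver makes things worse, the ".orig" copies made in the steps above are the way back. What follows is a sketch of that reverse step under stated assumptions: the function name restore_mpt is made up for illustration, it expects the BE to be mounted where the steps above mount it (/mnt by default), and it relies on the backup file names used there. It only calls bootadm when that command is present, so the copy logic can be tried on a scratch directory first.

```shell
#!/bin/sh
# restore_mpt [BE_ROOT] (hypothetical helper)
# Copies mpt.conf.orig and mpt.orig back over the swapped-in files under
# BE_ROOT (default: /mnt) and then rebuilds the boot archive.
restore_mpt() {
    be_root=${1:-/mnt}

    # Refuse to touch anything unless both backups exist.
    for f in kernel/drv/mpt.conf kernel/drv/amd64/mpt; do
        if [ ! -f "$be_root/$f.orig" ]; then
            echo "missing backup: $be_root/$f.orig" >&2
            return 1
        fi
    done

    cp "$be_root/kernel/drv/mpt.conf.orig" "$be_root/kernel/drv/mpt.conf" &&
    cp "$be_root/kernel/drv/amd64/mpt.orig" "$be_root/kernel/drv/amd64/mpt" ||
        return 1

    # Rebuild the boot archive only where bootadm is available.
    if command -v bootadm >/dev/null 2>&1; then
        bootadm update-archive -R "$be_root"
    fi
}

# Example: restore_mpt /mnt
```

Checking for both backups before copying anything avoids ending up with a stock mpt.conf next to an older driver binary, or the reverse.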
We actually tried this, although using the Solaris 10 version of the mpt driver. Surprisingly, it didn't work :-)

Yours
Markus Kovero

-----Original Message-----
From: zfs-discuss-bounces@opensolaris.org On Behalf Of Mark Johnson
Sent: 1 December 2009 15:57
To: zfs-discuss@opensolaris.org
Subject: Re: [zfs-discuss] mpt errors on snv 127

> I have no idea if the old mpt drivers will work on a new kernel...
> But if someone wants to try... Something like the following should work...
> What's the earliest build someone has seen this
> problem? i.e. if we binary chop, has anyone seen it
> in b118?

We have used every "stable" build from b118 up, as b118 was the first reliable one that could be used in a CIFS-heavy environment. The problem occurs on all of them.

- Adam
--
This message posted from opensolaris.org
If someone from Sun will confirm that using the mpt driver from 2009.06 should work, I'd be willing to set up a BE and try it. I still have the snapshot from my 2009.06 install, so I should be able to mount that and grab the files easily enough.
Travis Tabbal wrote:
> If someone from Sun will confirm that it should work to use the mpt
> driver from 2009.06, I'd be willing to set up a BE and try it. I
> still have the snapshot from my 2009.06 install, so I should be able
> to mount that and grab the files easily enough.

I tried, it doesn't work. It's interesting to note that the itmpt driver (much, much older) works just fine. It seems someone has gotten "creative" with the mpt driver's use of the DDI.

--
Carson
First I tried just upgrading to b127, which had a few issues besides the mpt driver. After that I did a clean install of b127, so no, I don't have my osol2009.06 root still there. I wasn't sure how to install another copy and leave it there (I suspect it is possible, since I saw that upgrades create a second root environment, but my forte isn't Solaris, so I just reformatted the root device).

On Tue, Dec 01, 2009 at 08:09:32AM -0500, Mark Johnson wrote:
> Just to be clear... The same setup was working fine on osol2009.06,
> you upgraded to b127 and it started failing?
>
> Did you keep the osol2009.06 be around so you can reboot back to it?
>
> If so, have you tried the osol2009.06 mpt driver in the
> BE with the latest bits (make sure you make a backup copy
> of the mpt driver)?
>
> MRJ
To update everyone, I did a complete zfs scrub, and it generated no errors in iostat, and I have 4.8T of data on the filesystem so it was a fairly lengthy test. The machine also has exhibited no evidence of instability. If I were to start copying a lot of data to the filesystem again though, I'm sure it would generate errors and crash again.

Chad

On Tue, Dec 01, 2009 at 12:29:16AM -0800, Chad Cantwell wrote:
> Well, ok, the msi=0 thing didn't help after all. A few minutes after my last message a few errors showed
> up in iostat, and then in a few minutes more the machine was locked up hard... Maybe I will try just
> doing a scrub instead of my rsync process and see how that does.
>
> Chad
>
> On Tue, Dec 01, 2009 at 12:13:36AM -0800, Chad Cantwell wrote:
> > I don't think the hardware has any problems, it only started having errors when I upgraded OpenSolaris.
> > It's still working fine again now after a reboot. Actually, I reread one of your earlier messages,
> > and I didn't realize at first when you said "non-Sun JBOD" that this didn't apply to me (in regards to
> > the msi=0 fix) because I didn't realize JBOD was shorthand for an external expander device. Since
> > I'm just using baremetal, and passive backplanes, I think the msi=0 fix should apply to me based on
> > what you wrote earlier, anyway I've put
> >         set mpt:mpt_enable_msi = 0
> > now in /etc/system and rebooted as it was suggested earlier. I've resumed my rsync, and so far there
> > have been no errors, but it's only been 20 minutes or so. I should have a good idea by tomorrow if this
> > definitely fixed the problem (since even when the machine was not crashing it was tallying up iostat errors
> > fairly rapidly)
> >
> > Thanks again for your help. Sorry for wasting your time if the previously posted workaround fixes things.
> > I'll let you know tomorrow either way.
> >
> > Chad
> >
> > On Tue, Dec 01, 2009 at 05:57:28PM +1000, James C. McPherson wrote:
> > > Chad Cantwell wrote:
> > > >After another crash I checked the syslog and there were some different errors than the ones
> > > >I saw previously during operation:
> > > ...
> > >
> > > >Nov 30 20:59:13 the-vault       LSI PCI device (1000,ffff) not supported.
> > > ...
> > > >Nov 30 20:59:13 the-vault       mpt_config_space_init failed
> > > ...
> > > >Nov 30 20:59:15 the-vault       mpt_restart_ioc failed
> > > ....
> > >
> > > >Nov 30 21:33:02 the-vault fmd: [ID 377184 daemon.error] SUNW-MSG-ID: PCIEX-8000-8R, TYPE: Fault, VER: 1, SEVERITY: Major
> > > >Nov 30 21:33:02 the-vault EVENT-TIME: Mon Nov 30 21:33:02 PST 2009
> > > >Nov 30 21:33:02 the-vault PLATFORM: System-Product-Name, CSN: System-Serial-Number, HOSTNAME: the-vault
> > > >Nov 30 21:33:02 the-vault SOURCE: eft, REV: 1.16
> > > >Nov 30 21:33:02 the-vault EVENT-ID: 7886cc0d-4760-60b2-e06a-8158c3334f63
> > > >Nov 30 21:33:02 the-vault DESC: The transmitting device sent an invalid request.
> > > >Nov 30 21:33:02 the-vault   Refer to http://sun.com/msg/PCIEX-8000-8R for more information.
> > > >Nov 30 21:33:02 the-vault AUTO-RESPONSE: One or more device instances may be disabled
> > > >Nov 30 21:33:02 the-vault IMPACT: Loss of services provided by the device instances associated with this fault
> > > >Nov 30 21:33:02 the-vault REC-ACTION: Ensure that the latest drivers and patches are installed. Otherwise schedule a repair procedure to replace the affected device(s). Use fmadm faulty to identify the devices or contact Sun for support.
> > >
> > > Sorry to have to tell you, but that HBA is dead. Or at
> > > least dying horribly. If you can't init the config space
> > > (that's the pci bus config space), then you've got about
> > > 1/2 the nails in the coffin hammered in. Then the failure
> > > to restart the IOC (io controller unit) == the rest of
> > > the lid hammered down.
> > >
> > > best regards,
> > > James C. McPherson
> > > --
> > > Senior Kernel Software Engineer, Solaris
> > > Sun Microsystems
> > > http://blogs.sun.com/jmcp http://www.jmcp.homeunix.com/blog
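Since the updates above keep coming back to whether iostat shows errors, here is a minimal sketch of how the h/w and trn counters could be pulled out of `iostat -en` output automatically. It assumes the usual five-column layout (s/w, h/w, trn, tot, device); the function names and the sample text are made up for illustration and are not output from the machine discussed here.

```python
from typing import Dict, NamedTuple


class ErrorCounts(NamedTuple):
    soft: int
    hard: int
    transport: int
    total: int


def parse_iostat_en(text: str) -> Dict[str, ErrorCounts]:
    """Parse `iostat -en` style text into {device: ErrorCounts}.

    Header lines are skipped because their first four fields are not
    all integers.
    """
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 5 or not all(f.isdigit() for f in fields[:4]):
            continue
        soft, hard, transport, total = (int(f) for f in fields[:4])
        result[fields[4]] = ErrorCounts(soft, hard, transport, total)
    return result


def devices_with_hw_or_trn_errors(text: str) -> Dict[str, ErrorCounts]:
    """Keep only devices whose hard or transport error count is non-zero."""
    return {dev: counts for dev, counts in parse_iostat_en(text).items()
            if counts.hard or counts.transport}


# Invented sample in the assumed layout.
SAMPLE = """\
  ---- errors ---
  s/w h/w trn tot device
    0   0   0   0 c7t0d0
    0   3  12  15 c8t1d0
"""

if __name__ == "__main__":
    for device, counts in devices_with_hw_or_trn_errors(SAMPLE).items():
        print(device, counts)
```

Running something like this periodically during a scrub or an rsync would show whether the counters only climb under write-heavy load, which is the pattern described above.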
I eventually performed a few more tests, adjusting some zfs tuning options which had no effect, and trying the itmpt driver which someone had said would work, and regardless my system would always freeze quite rapidly in snv 127 and 128a. Just to double check my hardware, I went back to the opensolaris 2009.06 release version, and everything is working fine. The system has been running a few hours and copied a lot of data and not had any trouble, mpt syslog events, or iostat errors.

One thing I found interesting, and I don't know if it's significant or not, is that under the recent builds and under 2009.06, I had run "echo '::interrupts' | mdb -k" to check the interrupts used. (I don't have the printout handy for snv 127+, though).

I have a dual port gigabit Intel 1000 P PCI-e card, which shows up as e1000g0 and e1000g1. In snv 127+, each of my e1000g devices shares an IRQ with my mpt devices (mpt0, mpt1) on the IRQ listing, whereas in opensolaris 2009.06, all 4 devices are on different IRQs. I don't know if this is significant, but most of my testing when I encountered errors was data transfer via the network, so it could have potentially been interfering with the mpt drivers when it was on the same IRQ. The errors did seem to be less frequent when the server I was copying from was linked at 100 instead of 1000 (one of my tests), but that is as likely to be a result of the slower zpool throughput as it is to be related to the network traffic.
I'll probably stay with 2009.06 for now since it works fine for me, but I can try a newer build again once some more progress is made in this area and people want to see if it's fixed (this machine is mainly to back up another array, so it's not too big a deal to test later when the mpt drivers are looking better and wipe again in the event of problems).

Chad

On Tue, Dec 01, 2009 at 03:06:31PM -0800, Chad Cantwell wrote:
> To update everyone, I did a complete zfs scrub, and it generated no errors in iostat, and I have 4.8T of
> data on the filesystem so it was a fairly lengthy test. The machine also has exhibited no evidence of
> instability. If I were to start copying a lot of data to the filesystem again though, I'm sure it would
> generate errors and crash again.
>
> Chad
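The IRQ-sharing observation in this message could be checked mechanically against a saved interrupt listing. The sketch below is hedged accordingly: it assumes only that each data row starts with a decimal IRQ number and that handler names beginning with the driver name (for example `mpt...` or `e1000g...`) appear somewhere later on the row. The function name and the sample listing are invented for illustration and are not real `::interrupts` output from either build.

```python
from collections import defaultdict
from typing import Dict, List, Set


def irqs_shared_between(listing: str, prefix_a: str, prefix_b: str) -> List[int]:
    """Return IRQ numbers that carry handlers for both name prefixes.

    Rows whose first field is not a decimal number (headers, blanks)
    are ignored. Rows for the same IRQ are merged.
    """
    names_by_irq: Dict[int, Set[str]] = defaultdict(set)
    for line in listing.splitlines():
        fields = [field.strip(",") for field in line.split()]
        if len(fields) < 2 or not fields[0].isdigit():
            continue
        names_by_irq[int(fields[0])].update(fields[1:])

    shared = []
    for irq, names in names_by_irq.items():
        has_a = any(name.startswith(prefix_a) for name in names)
        has_b = any(name.startswith(prefix_b) for name in names)
        if has_a and has_b:
            shared.append(irq)
    return sorted(shared)


# Invented listing; column names and handler names are placeholders.
SAMPLE_LISTING = """\
IRQ  Vect IPL Bus  Type  Share ISR(s)
16   0x82 5   PCI  Fixed 2     e1000g_intr, mpt_intr
17   0x83 5   PCI  Fixed 1     mpt_intr
18   0x84 6   PCI  Fixed 1     e1000g_intr
"""

if __name__ == "__main__":
    print(irqs_shared_between(SAMPLE_LISTING, "mpt", "e1000g"))
```

Comparing the result for a 2009.06 listing against one from snv 127+ would confirm or rule out the difference in IRQ assignment described above without relying on memory.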
Hi all,

Unfortunately for me, there does seem to be a hardware component to my problem. Although my rsync copied almost 4TB of data with no iostat errors after going back to OpenSolaris 2009.06, I/O on one of my mpt cards did eventually hang, with 6 disk lights on and 2 off, until rebooting. There are a few hardware changes made since the last time I did a full backup, so it's possible that whatever problem was introduced didn't happen frequently enough in low I/O usage for me to detect until now, when I was reinstalling and copying massive amounts of data back.

The changes I had made since originally installing osol2009.06 several months ago are:

- stopped using the onboard Marvell Yukon2 ethernet (which used a 3rd-party driver) in favor of an Intel 1000 PT dual port card, which necessitated an extra PCI-e slot, prompting the following item:
- swapped motherboards between 2 machines (they were similar though, with similar onboard hardware, and it shouldn't have been a major change). Originally it was an Asus P5Q Deluxe w/3 PCI-e slots, now it is a slightly older Asus P5W64 w/4 PCI-e slots.
- the Intel 1000 PT dual port card has been aggregated as aggr0 since it was installed (the older Yukon2 was a basic interface)

The above changes were made a while ago, before upgrading OpenSolaris to 127, and things seemed to be working fine for at least 2-3 months with rsync updating (it never hung, had a fatal zfs error, or lost access to data requiring a reboot).

New changes since troubleshooting the snv 127 mpt issues:

- upgraded the LSI 3081 firmware from 1.28.2 (or was it .02) to 1.29, the latest. If this turns out to be an issue, I do have the previous IT firmware that I was using before, which I can flash back.

Another, albeit unlikely, factor: when I originally copied all my data to my first OpenSolaris raidz2 pool, I didn't use rsync at all. I used netcat & tar, and only set up rsync later for updates.
Perhaps the huge initial single rsync of the large tree does something strange that the original initial netcat & tar copy did not (I know, unlikely, but I'm grasping at straws here to determine what has happened).

I'll work on ruling out the potential sources of hardware problems before I report any more on the mpt issues, since my test case would probably confound things at this point. I am affected by the mpt bugs, since I would get the timeouts almost constantly in snv 127+, but since I'm also apparently affected by some other unknown hardware issue, my data on the mpt problems might lead people in the wrong direction at this point.

I will first try to go back to the non-aggregated Yukon ethernet and remove the Intel dual port PCI-e network adapter. Then, if the problem persists, I will try half of my drives on each LSI controller individually to confirm whether one controller has a problem the other does not, or whether one drive in one set is causing a new problem for a particular controller. I hope to have some kind of answer at that point and not have to resort to motherboard swapping again.

Chad

On Thu, Dec 03, 2009 at 10:44:53PM -0800, Chad Cantwell wrote:
> I eventually performed a few more tests, adjusting some zfs tuning options which had no effect, and trying the
> itmpt driver which someone had said would work, and regardless my system would always freeze quite rapidly in
> snv 127 and 128a. Just to double check my hardware, I went back to the opensolaris 2009.06 release version, and
> everything is working fine. The system has been running a few hours and copied a lot of data and not had any
> trouble, mpt syslog events, or iostat errors.
Gday Chad,
the more swaptronics you partake in, the more difficult it
is going to be for us (collectively) to figure out what is
going wrong on your system. Btw, since you're running a build
past 124, you can use the "yge" driver instead of the yukonx
(from Marvell) or myk (from Murayama-san) drivers.

As another comment in this thread has mentioned, a full scrub
can be a serious test of your hardware depending on how much
data you've got to walk over. If you can keep the hardware
variables to a minimum then clarity will be more achievable.

thankyou,
James C. McPherson
--
Senior Kernel Software Engineer, Solaris
Sun Microsystems
http://blogs.sun.com/jmcp http://www.jmcp.homeunix.com/blog
Thanks for the info on the yukon driver. I realize too many variables make things impossible to determine, but I had made these hardware changes a while back, and they seemed to work fine at the time. Since they aren't working now, even in the older OpenSolaris (I've tried 2009.06 and 2008.11 now), the problem seems to be a hardware quirk, and the only way to narrow that down is to change hardware back until it works like it used to in at least the older snv builds. I've ruled out the ethernet controller. I'm leaning toward the current motherboard (Asus P5W64) not playing nicely with the LSI cards, but it will probably be several days until I get to the bottom of this, since it takes a while to test after making a change...

Thanks,
Chad

On Mon, Dec 07, 2009 at 11:09:39AM +1000, James C. McPherson wrote:
> Gday Chad,
> the more swaptronics you partake in, the more difficult it
> is going to be for us (collectively) to figure out what is
> going wrong on your system. Btw, since you're running a build
> past 124, you can use the "yge" driver instead of the yukonx
> (from Marvell) or myk (from Murayama-san) drivers.
Hi,

I would like to add yet another mpt timeout report.

Suddenly the system started to get slow, which was noticeable because some Linux clients were complaining about NFS server timeouts, and after some time I saw a lot of bus reset messages in the /var/adm/messages file. I quickly took a look at the JBOD chassis, and one of the disks had a fixed light. After the physical removal of this disk, the system started to respond again and the resilver process kicked in, since a spare disk took the place of the disconnected disk, as seen with zpool status -v:

zpool status -v DATAPOOL04
  pool: DATAPOOL04
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error. An
        attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: resilver in progress for 1h40m, 8.26% done, 18h32m to go
config:

        NAME           STATE     READ WRITE CKSUM
        DATAPOOL04     DEGRADED     0     0     0
          raidz1       DEGRADED     0     0     0
            c5t27d0    ONLINE       0     0     0  105M resilvered
            c5t29d0    ONLINE       0     0     0  105M resilvered
            c5t30d0    ONLINE       0     0     0  105M resilvered
            spare      DEGRADED     0     0     0
              c5t31d0  REMOVED      0  423K     0
              c5t28d0  ONLINE       0     0     0  9.83G resilvered
            c5t32d0    ONLINE       0     0     0  105M resilvered
        spares
          c5t28d0      INUSE     currently in use

errors: No known data errors

At this moment the system is doing the resilvering, but the messages regarding the disk/disk controller still appear in the log. Could these messages be appearing because the resilver process is a heavy one, or are more disks probably affected?

In cases such as this one, what's the best procedure to follow?

* shut down the server and JBOD, including power off/power on, and see how it goes
* replace the HBA/disk?
* other?

Thanks for the time, and if any other information is required (even ssh access can be granted), please feel free to ask.
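One way to approach the "are more disks affected?" question is to tally the mpt "Log info ... received for target N" lines per target. The following is a minimal sketch under stated assumptions: the log text has already been read from /var/adm/messages, the message wording matches the excerpt in this report, and `log_info_by_target` is a name made up for illustration. Mapping a target number back to a cXtYdZ device is deliberately left out.

```python
import re
from collections import Counter

# Matches lines such as:
#   "Log info 0x31123000 received for target 21."
LOG_INFO_RE = re.compile(r"Log info (0x[0-9a-fA-F]+) received for target (\d+)\.")


def log_info_by_target(log_text: str) -> Counter:
    """Count 'Log info' messages per target number in the given log text."""
    counts = Counter()
    for _log_info, target in LOG_INFO_RE.findall(log_text):
        counts[int(target)] += 1
    return counts


if __name__ == "__main__":
    # Invented three-line excerpt in the same wording as the report.
    excerpt = (
        "Dec 7 13:57:45 san01 Log info 0x31123000 received for target 21.\n"
        "Dec 7 13:57:45 san01 Log info 0x31123000 received for target 21.\n"
        "Dec 7 13:57:52 san01 Log info 0x31123000 received for target 28.\n"
    )
    for target, count in sorted(log_info_by_target(excerpt).items()):
        print(f"target {target}: {count} message(s)")
```

If the counts are spread over several targets rather than concentrated on one, that would point away from a single bad disk, though it does not by itself distinguish resilver load from a controller or expander problem.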
Best regards,
Bruno Sousa

System specs:

* OpenSolaris snv_101b, with two Dual-Core AMD CPUs and 16 GB RAM
* LSI Logic SAS1068E, revision B3, MPT Rev 105, Firmware Rev 011a0000
* 24 disks are attached to this HBA; the disks are Seagate SATA 1TB "Enterprise" class (ATA-ST31000340NS-SN06-931.51GB)
* the LSI HBA is connected with one SFF-8087 connector cable (SAS 846EL1 BP 1-Port Internal Cascading Cable) to a Supermicro SC 846 chassis with a SAS/SATA expander backplane with a single LSI SASX36 expander chip

/var/adm/messages content:

Dec 7 13:57:12 san01 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci10de,376@a/pci1000,30a0@0/sd@17,0 (sd18):
Dec 7 13:57:12 san01 Error for Command: write(10) Error Level: Retryable
Dec 7 13:57:12 san01 scsi: [ID 107833 kern.notice] Requested Block: 48696432 Error Block: 48696432
Dec 7 13:57:12 san01 scsi: [ID 107833 kern.notice] Vendor: ATA Serial Number:
Dec 7 13:57:12 san01 scsi: [ID 107833 kern.notice] Sense Key: Unit_Attention
Dec 7 13:57:12 san01 scsi: [ID 107833 kern.notice] ASC: 0x29 (power on, reset, or bus reset occurred), ASCQ: 0x0, FRU: 0x0
Dec 7 13:57:15 san01 scsi: [ID 243001 kern.warning] WARNING: /pci@0,0/pci10de,376@a/pci1000,30a0@0 (mpt0):
Dec 7 13:57:15 san01 mpt_handle_event_sync: IOCStatus=0x8000, IOCLogInfo=0x31123000
Dec 7 13:57:15 san01 scsi: [ID 243001 kern.warning] WARNING: /pci@0,0/pci10de,376@a/pci1000,30a0@0 (mpt0):
Dec 7 13:57:15 san01 mpt_handle_event: IOCStatus=0x8000, IOCLogInfo=0x31123000
Dec 7 13:57:45 san01 Log info 0x31123000 received for target 21.
Dec 7 13:57:45 san01 scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc
Dec 7 13:57:45 san01 scsi: [ID 365881 kern.notice] /pci@0,0/pci10de,376@a/pci1000,30a0@0 (mpt0):
Dec 7 13:57:45 san01 Log info 0x31123000 received for target 21.
Dec 7 13:57:45 san01 scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc Dec 7 13:57:45 san01 scsi: [ID 365881 kern.notice] /pci at 0,0/pci10de,376 at a/pci1000,30a0 at 0 (mpt0): Dec 7 13:57:45 san01 Log info 0x31123000 received for target 21. Dec 7 13:57:45 san01 scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc Dec 7 13:57:45 san01 scsi: [ID 365881 kern.notice] /pci at 0,0/pci10de,376 at a/pci1000,30a0 at 0 (mpt0): Dec 7 13:57:45 san01 Log info 0x31123000 received for target 21. Dec 7 13:57:45 san01 scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc Dec 7 13:57:45 san01 scsi: [ID 365881 kern.notice] /pci at 0,0/pci10de,376 at a/pci1000,30a0 at 0 (mpt0): Dec 7 13:57:45 san01 Log info 0x31123000 received for target 21. Dec 7 13:57:45 san01 scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc Dec 7 13:57:45 san01 scsi: [ID 365881 kern.notice] /pci at 0,0/pci10de,376 at a/pci1000,30a0 at 0 (mpt0): Dec 7 13:57:45 san01 Log info 0x31123000 received for target 21. Dec 7 13:57:45 san01 scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc Dec 7 13:57:45 san01 scsi: [ID 365881 kern.notice] /pci at 0,0/pci10de,376 at a/pci1000,30a0 at 0 (mpt0): Dec 7 13:57:45 san01 Log info 0x31123000 received for target 21. Dec 7 13:57:45 san01 scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc Dec 7 13:57:45 san01 scsi: [ID 365881 kern.notice] /pci at 0,0/pci10de,376 at a/pci1000,30a0 at 0 (mpt0): Dec 7 13:57:45 san01 Log info 0x31123000 received for target 21. Dec 7 13:57:45 san01 scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc Dec 7 13:57:45 san01 scsi: [ID 365881 kern.notice] /pci at 0,0/pci10de,376 at a/pci1000,30a0 at 0 (mpt0): Dec 7 13:57:45 san01 Log info 0x31123000 received for target 21. Dec 7 13:57:45 san01 scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc Dec 7 13:57:45 san01 scsi: [ID 365881 kern.notice] /pci at 0,0/pci10de,376 at a/pci1000,30a0 at 0 (mpt0): Dec 7 13:57:45 san01 Log info 0x31123000 received for target 21. 
Dec 7 13:57:45 san01 scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc Dec 7 13:57:45 san01 scsi: [ID 365881 kern.notice] /pci at 0,0/pci10de,376 at a/pci1000,30a0 at 0 (mpt0): Dec 7 13:57:45 san01 Log info 0x31123000 received for target 21. Dec 7 13:57:45 san01 scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc Dec 7 13:57:45 san01 scsi: [ID 365881 kern.notice] /pci at 0,0/pci10de,376 at a/pci1000,30a0 at 0 (mpt0): Dec 7 13:57:45 san01 Log info 0x31123000 received for target 21. Dec 7 13:57:45 san01 scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc Dec 7 13:57:45 san01 scsi: [ID 365881 kern.notice] /pci at 0,0/pci10de,376 at a/pci1000,30a0 at 0 (mpt0): Dec 7 13:57:45 san01 Log info 0x31123000 received for target 21. Dec 7 13:57:45 san01 scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc Dec 7 13:57:45 san01 scsi: [ID 365881 kern.notice] /pci at 0,0/pci10de,376 at a/pci1000,30a0 at 0 (mpt0): Dec 7 13:57:45 san01 Log info 0x31123000 received for target 21. Dec 7 13:57:45 san01 scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc Dec 7 13:57:45 san01 scsi: [ID 365881 kern.notice] /pci at 0,0/pci10de,376 at a/pci1000,30a0 at 0 (mpt0): Dec 7 13:57:45 san01 Log info 0x31123000 received for target 21. Dec 7 13:57:45 san01 scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc Dec 7 13:57:45 san01 scsi: [ID 365881 kern.notice] /pci at 0,0/pci10de,376 at a/pci1000,30a0 at 0 (mpt0): Dec 7 13:57:45 san01 Log info 0x31123000 received for target 21. Dec 7 13:57:45 san01 scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc Dec 7 13:57:45 san01 scsi: [ID 365881 kern.notice] /pci at 0,0/pci10de,376 at a/pci1000,30a0 at 0 (mpt0): Dec 7 13:57:45 san01 Log info 0x31123000 received for target 21. Dec 7 13:57:45 san01 scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc Dec 7 13:57:45 san01 scsi: [ID 365881 kern.notice] /pci at 0,0/pci10de,376 at a/pci1000,30a0 at 0 (mpt0): Dec 7 13:57:45 san01 Log info 0x31123000 received for target 21. 
Dec 7 13:57:45 san01 scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc Dec 7 13:57:45 san01 scsi: [ID 365881 kern.notice] /pci at 0,0/pci10de,376 at a/pci1000,30a0 at 0 (mpt0): Dec 7 13:57:45 san01 Log info 0x31123000 received for target 21. Dec 7 13:57:45 san01 scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc Dec 7 13:57:45 san01 scsi: [ID 365881 kern.notice] /pci at 0,0/pci10de,376 at a/pci1000,30a0 at 0 (mpt0): Dec 7 13:57:45 san01 Log info 0x31123000 received for target 21. Dec 7 13:57:45 san01 scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc Dec 7 13:57:45 san01 scsi: [ID 107833 kern.warning] WARNING: /pci at 0,0/pci10de,376 at a/pci1000,30a0 at 0/sd at 15,0 (sd16): Dec 7 13:57:45 san01 Error for Command: write(10) Error Level: Retryable Dec 7 13:57:45 san01 scsi: [ID 107833 kern.notice] Requested Block: 445125208 Error Block: 445125208 Dec 7 13:57:45 san01 scsi: [ID 107833 kern.notice] Vendor: ATA Serial Number: Dec 7 13:57:45 san01 scsi: [ID 107833 kern.notice] Sense Key: Unit_Attention Dec 7 13:57:45 san01 scsi: [ID 107833 kern.notice] ASC: 0x29 (power on, reset, or bus reset occurred), ASCQ: 0x0, FRU: 0x0 Dec 7 13:57:50 san01 scsi: [ID 243001 kern.warning] WARNING: /pci at 0,0/pci10de,376 at a/pci1000,30a0 at 0 (mpt0): Dec 7 13:57:50 san01 mpt_handle_event_sync: IOCStatus=0x8000, IOCLogInfo=0x31123000 Dec 7 13:57:50 san01 scsi: [ID 243001 kern.warning] WARNING: /pci at 0,0/pci10de,376 at a/pci1000,30a0 at 0 (mpt0): Dec 7 13:57:50 san01 mpt_handle_event: IOCStatus=0x8000, IOCLogInfo=0x31123000 Dec 7 13:57:52 san01 scsi: [ID 365881 kern.notice] /pci at 0,0/pci10de,376 at a/pci1000,30a0 at 0 (mpt0): Dec 7 13:57:52 san01 Log info 0x31123000 received for target 28. Dec 7 13:57:52 san01 scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc Dec 7 13:57:52 san01 scsi: [ID 365881 kern.notice] /pci at 0,0/pci10de,376 at a/pci1000,30a0 at 0 (mpt0): Dec 7 13:57:52 san01 Log info 0x31123000 received for target 28. 
Dec 7 13:57:52 san01 scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc Dec 7 13:57:52 san01 scsi: [ID 365881 kern.notice] /pci at 0,0/pci10de,376 at a/pci1000,30a0 at 0 (mpt0): Dec 7 13:57:52 san01 Log info 0x31123000 received for target 28. Dec 7 13:57:52 san01 scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc Dec 7 13:57:52 san01 scsi: [ID 365881 kern.notice] /pci at 0,0/pci10de,376 at a/pci1000,30a0 at 0 (mpt0): Dec 7 13:57:52 san01 Log info 0x31123000 received for target 28. Dec 7 13:57:52 san01 scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc Dec 7 13:57:52 san01 scsi: [ID 365881 kern.notice] /pci at 0,0/pci10de,376 at a/pci1000,30a0 at 0 (mpt0): Dec 7 13:57:52 san01 Log info 0x31123000 received for target 28. Dec 7 13:57:52 san01 scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc Dec 7 13:57:52 san01 scsi: [ID 365881 kern.notice] /pci at 0,0/pci10de,376 at a/pci1000,30a0 at 0 (mpt0): Dec 7 13:57:52 san01 Log info 0x31123000 received for target 28. Dec 7 13:57:52 san01 scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc Dec 7 13:57:52 san01 scsi: [ID 365881 kern.notice] /pci at 0,0/pci10de,376 at a/pci1000,30a0 at 0 (mpt0): Dec 7 13:57:52 san01 Log info 0x31123000 received for target 28. Dec 7 13:57:52 san01 scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc Dec 7 13:57:52 san01 scsi: [ID 365881 kern.notice] /pci at 0,0/pci10de,376 at a/pci1000,30a0 at 0 (mpt0): Dec 7 13:57:52 san01 Log info 0x31123000 received for target 28. Dec 7 13:57:52 san01 scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc Dec 7 13:57:52 san01 scsi: [ID 365881 kern.notice] /pci at 0,0/pci10de,376 at a/pci1000,30a0 at 0 (mpt0): Dec 7 13:57:52 san01 Log info 0x31123000 received for target 28. 
Dec 7 13:57:52 san01 scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc
Dec 7 13:57:52 san01 scsi: [ID 365881 kern.notice] /pci at 0,0/pci10de,376 at a/pci1000,30a0 at 0 (mpt0):

iostat -En results:

Vendor: ATA Product: ST31000340NS Revision: SN06 Serial No:
Size: 1000.20GB <1000204886016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 125 Recoverable: 18
Illegal Request: 6 Predictive Failure Analysis: 0
c5t22d0 Soft Errors: 18 Hard Errors: 106 Transport Errors: 686
Vendor: ATA Product: ST31000340NS Revision: SN06 Serial No:
Size: 1000.20GB <1000204886016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 106 Recoverable: 18
Illegal Request: 6 Predictive Failure Analysis: 0
c5t23d0 Soft Errors: 18 Hard Errors: 80 Transport Errors: 339
Vendor: ATA Product: ST31000340NS Revision: SN06 Serial No:
Size: 1000.20GB <1000204886016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 80 Recoverable: 18
Illegal Request: 6 Predictive Failure Analysis: 0
c5t24d0 Soft Errors: 18 Hard Errors: 59 Transport Errors: 228
Vendor: ATA Product: ST31000340NS Revision: SN06 Serial No:
Size: 1000.20GB <1000204886016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 59 Recoverable: 18
Illegal Request: 6 Predictive Failure Analysis: 0
c5t25d0 Soft Errors: 18 Hard Errors: 55 Transport Errors: 219
Vendor: ATA Product: ST31000340NS Revision: SN06 Serial No:
Size: 1000.20GB <1000204886016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 55 Recoverable: 18
Illegal Request: 6 Predictive Failure Analysis: 0
c5t26d0 Soft Errors: 18 Hard Errors: 63 Transport Errors: 249
Vendor: ATA Product: ST31000340NS Revision: SN06 Serial No:
Size: 1000.20GB <1000204886016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 63 Recoverable: 18
Illegal Request: 6 Predictive Failure Analysis: 0
c5t27d0 Soft Errors: 18 Hard Errors: 11 Transport Errors: 274
Vendor: ATA Product: ST31000340NS Revision: SN06 Serial No:
Size: 1000.20GB <1000204886016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 10 Recoverable: 18
Illegal Request: 6 Predictive Failure Analysis: 0
c5t28d0 Soft Errors: 18 Hard Errors: 182 Transport Errors: 1255
Vendor: ATA Product: ST31000340NS Revision: SN06 Serial No:
Size: 1000.20GB <1000204886016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 182 Recoverable: 18
Illegal Request: 6 Predictive Failure Analysis: 0
c5t29d0 Soft Errors: 18 Hard Errors: 8 Transport Errors: 201
Vendor: ATA Product: ST31000340NS Revision: SN06 Serial No:
Size: 1000.20GB <1000204886016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 8 Recoverable: 18
Illegal Request: 6 Predictive Failure Analysis: 0
c5t30d0 Soft Errors: 18 Hard Errors: 10 Transport Errors: 249
Vendor: ATA Product: ST31000340NS Revision: SN06 Serial No:
Size: 1000.20GB <1000204886016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 10 Recoverable: 18
Illegal Request: 6 Predictive Failure Analysis: 0
c5t31d0 Soft Errors: 12 Hard Errors: 0 Transport Errors: 115
Vendor: ATA Product: ST31000340NS Revision: SN06 Serial No:
Size: 1000.20GB <1000204886016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 12
Illegal Request: 4 Predictive Failure Analysis: 0
c5t32d0 Soft Errors: 18 Hard Errors: 11 Transport Errors: 222
Vendor: ATA Product: ST31000340NS Revision: SN06 Serial No:
Size: 1000.20GB <1000204886016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 11 Recoverable: 18
Illegal Request: 6 Predictive Failure Analysis: 0
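With this many disks it can help to rank the devices by error count rather than read the iostat -En dump by eye. The following is only a minimal sketch: it assumes the per-device summary line has the shape shown above ("c5t28d0 Soft Errors: ... Hard Errors: ... Transport Errors: ..."), and the function names are made up for illustration.

```python
import re

# Per-device summary line of `iostat -En`, e.g.
# "c5t22d0 Soft Errors: 18 Hard Errors: 106 Transport Errors: 686"
# Assumption: device names look like cXtYdZ with decimal numbers.
HEADER = re.compile(
    r"(?P<dev>c\d+t\d+d\d+)\s+Soft Errors:\s*(?P<soft>\d+)\s+"
    r"Hard Errors:\s*(?P<hard>\d+)\s+Transport Errors:\s*(?P<transport>\d+)"
)


def parse_iostat_errors(text):
    """Return (device, soft, hard, transport) tuples found in the text."""
    return [
        (m["dev"], int(m["soft"]), int(m["hard"]), int(m["transport"]))
        for m in HEADER.finditer(text)
    ]


def rank_by_transport(text):
    """Sort devices by transport error count, worst first."""
    return sorted(parse_iostat_errors(text), key=lambda r: r[3], reverse=True)


if __name__ == "__main__":
    # Three summary lines copied from the output above.
    sample = """
    c5t22d0 Soft Errors: 18 Hard Errors: 106 Transport Errors: 686
    c5t28d0 Soft Errors: 18 Hard Errors: 182 Transport Errors: 1255
    c5t31d0 Soft Errors: 12 Hard Errors: 0 Transport Errors: 115
    """
    for dev, soft, hard, transport in rank_by_transport(sample):
        print(f"{dev}: transport={transport} hard={hard} soft={soft}")
```

Because the pattern does not depend on line breaks, it should also work on output that has been flattened onto one line by a mail client.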
Hi all,

During this problem I power-cycled the server, and the bus reset/SCSI timeout issue persisted. I then power-cycled the JBOD array, and after that everything returned to normal: no SCSI timeouts, normal performance, everything is okay now.

Given this, is it safe to assume that the problem may be caused by the SAS expander (a single LSI SASX36 Expander Chip) used by the Supermicro JBOD chassis, and not by the HBA/mpt driver?

Thank you for your time,
Bruno

Based on what we have seen in our environment, we believe that the problem is caused by the SAS expander (a single LSI SASX36 Expander Chip) in the JBOD chassis.

Bruno Sousa wrote:
> Hi,
>
> I would like to add yet another mpt timeout report.
> Suddenly the system started to get slow, noticeable due to the fact
> that some linux clients were complaining about nfs server timeout,
> and after some time i saw a lot of reset bus messages in the
> /var/adm/messages file.
> I quickly took a look at the JBOD chassis, and one of the disks had a
> fixed light, and after the physical removal of this disk, the system
> re-started to respond and the resilver process kicked in, because a
> spare disk took the place of the disconnected disk, as seen with
> zpool status -v :
>
> zpool status -v DATAPOOL04
>   pool: DATAPOOL04
>  state: DEGRADED
> status: One or more devices has experienced an unrecoverable error. An
>         attempt was made to correct the error. Applications are
>         unaffected.
> action: Determine if the device needs to be replaced, and clear the errors
>         using 'zpool clear' or replace the device with 'zpool replace'.
>    see: http://www.sun.com/msg/ZFS-8000-9P
>  scrub: resilver in progress for 1h40m, 8.26% done, 18h32m to go
> config:
>
>         NAME           STATE     READ WRITE CKSUM
>         DATAPOOL04     DEGRADED     0     0     0
>           raidz1       DEGRADED     0     0     0
>             c5t27d0    ONLINE       0     0     0  105M resilvered
>             c5t29d0    ONLINE       0     0     0  105M resilvered
>             c5t30d0    ONLINE       0     0     0  105M resilvered
>             spare      DEGRADED     0     0     0
>               c5t31d0  REMOVED      0  423K     0
>               c5t28d0  ONLINE       0     0     0  9.83G resilvered
>             c5t32d0    ONLINE       0     0     0  105M resilvered
>         spares
>           c5t28d0      INUSE     currently in use
>
> errors: No known data errors
>
> At this moment the system is doing the resilvering, but the messages
> regarding disk/disk controller still appear in the log. Could these
> messages appear due to the fact that the resilver process is a heavy
> one, or are more disks probably affected?
> In cases such as this one, what's the best procedure to do?
>
> * shutdown server and JBOD, including power off/power on and see
>   how it goes
> * replace HBA/disk ?
> * other ?
>
> Thanks for the time, and if any other information is required (even
> ssh access can be granted) please feel free to ask it.
>
> Best regards,
> Bruno Sousa
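One way to see whether the mpt events cluster on a few targets or are spread across the whole enclosure is to count the "Log info ... received for target N" lines per target. This is a rough sketch only; it assumes the message wording seen in the /var/adm/messages excerpt in this thread, and the function name is hypothetical.

```python
import re
from collections import Counter

# Matches the mpt driver message as it appears in this thread, e.g.
# "Log info 0x31123000 received for target 21."
LOG_INFO = re.compile(r"Log info (0x[0-9a-fA-F]+) received for target (\d+)")


def tally_targets(log_text):
    """Count occurrences of each (target, log_info) pair in syslog text."""
    counts = Counter()
    for code, target in LOG_INFO.findall(log_text):
        counts[(int(target), code.lower())] += 1
    return counts


if __name__ == "__main__":
    # Three lines in the same form as the log excerpt quoted in this thread.
    sample = (
        "Dec 7 13:57:45 san01 Log info 0x31123000 received for target 21.\n"
        "Dec 7 13:57:45 san01 Log info 0x31123000 received for target 21.\n"
        "Dec 7 13:57:52 san01 Log info 0x31123000 received for target 28.\n"
    )
    for (target, code), n in tally_targets(sample).most_common():
        print(f"target {target}: {n} x {code}")
```

To run it against a real log, read the file contents (for example /var/adm/messages) into a string and pass that to tally_targets. A handful of noisy targets would point more toward individual disks or slots, while events across most targets would fit better with a shared component such as the expander; either reading is a hint, not proof.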
Bruno Sousa wrote:
> Hi all,
>
> During this problem i did a power-off/power-on in the server and the bus
> reset/scsi timeout issue persisted. After that i decided to
> poweroff/power on the jbod array, and after that everything became normal.
> No scsi timeouts, normal performance, everything is okay now.
> With this is it safe to assume that the problem may be caused by the SAS
> expander (one single LSI SASX36 Expander Chip) used by the supermicro
> jbod chassis, and not by the hba/mpt driver?

Hi Bruno,
that is indeed what I, personally, suspect is the case. Tracking that down and conclusively proving so is, however, another thing entirely.

Could you send the output from prtconf -v for your host please, so that we can have a look at the vital information for the enclosure services and SMP nodes that the SAS Expander presents?

Thank you,
James C. McPherson
--
Senior Kernel Software Engineer, Solaris
Sun Microsystems
http://blogs.sun.com/jmcp http://www.jmcp.homeunix.com/blog
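As a small aid when comparing reports, the IOCLogInfo values in these logs can be split into fields. The sketch below assumes the SAS log-info layout used by the Linux Fusion-MPT driver (subcode in bits 0-15, code in bits 16-23, originator in bits 24-27, bus type in bits 28-31, with bus type 3 meaning SAS and originators 0/1/2 named IOP/PL/IR). Whether the Solaris mpt driver and this particular firmware use exactly the same interpretation is an assumption, and the names in the code are illustrative only.

```python
from typing import NamedTuple


class SasLogInfo(NamedTuple):
    bus_type: int    # bits 31-28; 3 is assumed to mean SAS
    originator: int  # bits 27-24
    code: int        # bits 23-16
    subcode: int     # bits 15-0


# Originator names as assumed from the Linux Fusion-MPT driver.
ORIGINATORS = {0: "IOP", 1: "PL", 2: "IR"}


def decode_loginfo(value):
    """Split a 32-bit IOCLogInfo value into its assumed bit fields."""
    return SasLogInfo(
        bus_type=(value >> 28) & 0xF,
        originator=(value >> 24) & 0xF,
        code=(value >> 16) & 0xFF,
        subcode=value & 0xFFFF,
    )


def describe(value):
    """Render the decoded fields as a one-line summary."""
    info = decode_loginfo(value)
    origin = ORIGINATORS.get(info.originator, "unknown")
    return (
        f"0x{value:08x}: bus_type={info.bus_type} originator={origin} "
        f"code=0x{info.code:02x} subcode=0x{info.subcode:04x}"
    )


if __name__ == "__main__":
    # Value taken from the /var/adm/messages excerpt in this thread.
    print(describe(0x31123000))
```

Under that assumed layout, 0x31123000 decodes to bus type 3, originator PL, code 0x12, subcode 0x3000. The decoder deliberately does not attach meanings to individual codes, since those would need to be checked against the vendor's headers.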
Hi James,

Thank you for your feedback. I will send the prtconf -v output to your email.
I also have another system where I can test things if needed, and if you need extra information or even access to the system, please let me know.

Thank you,
Bruno

James C. McPherson wrote:
> Bruno Sousa wrote:
>> Hi all,
>>
>> During this problem i did a power-off/power-on in the server and the
>> bus reset/scsi timeout issue persisted. After that i decided to
>> poweroff/power on the jbod array, and after that everything became
>> normal.
>> No scsi timeouts, normal performance, everything is okay now.
>> With this is it safe to assume that the problem may be caused by the
>> SAS expander (one single LSI SASX36 Expander Chip) used by the
>> supermicro jbod chassis, and not by the hba/mpt driver?
>
> Hi Bruno,
> that is indeed what I, personally, suspect is the case. Tracking
> that down and conclusively proving so is, however, another thing
> entirely.
>
> Could you send the output from prtconf -v for your host please,
> so that we can have a look at the vital information for the
> enclosure services and SMP nodes that the SAS Expander presents?
>
> Thank you,
> James C. McPherson
> --
> Senior Kernel Software Engineer, Solaris
> Sun Microsystems
> http://blogs.sun.com/jmcp http://www.jmcp.homeunix.com/blog
I can report I/O errors with Chenbro expanders based on the LSI SASx36 IC, tested with builds 111b/121/128a/129. The HBA was LSI 1068 based. If I bypass the expander by adding more HBA controllers, mpt does not have I/O errors.

-nola

On Dec 8, 2009, at 6:48 AM, Bruno Sousa wrote:
> Hi James,
>
> Thank you for your feedback, and i will send the prtconf -v output for
> your email.
> I also have another system where i can test something if that's the
> case, and if you need extra information or even access to the system,
> please let me know it.
>
> Thank you,
> Bruno
>
> James C. McPherson wrote:
>> Bruno Sousa wrote:
>>> Hi all,
>>>
>>> During this problem i did a power-off/power-on in the server and the
>>> bus reset/scsi timeout issue persisted. After that i decided to
>>> poweroff/power on the jbod array, and after that everything became
>>> normal.
>>> No scsi timeouts, normal performance, everything is okay now.
>>> With this is it safe to assume that the problem may be caused by the
>>> SAS expander (one single LSI SASX36 Expander Chip) used by the
>>> supermicro jbod chassis, and not by the hba/mpt driver?
>>
>> Hi Bruno,
>> that is indeed what I, personally, suspect is the case. Tracking
>> that down and conclusively proving so is, however, another thing
>> entirely.
>>
>> Could you send the output from prtconf -v for your host please,
>> so that we can have a look at the vital information for the
>> enclosure services and SMP nodes that the SAS Expander presents?
>>
>> Thank you,
>> James C. McPherson
>> --
>> Senior Kernel Software Engineer, Solaris
>> Sun Microsystems
>> http://blogs.sun.com/jmcp http://www.jmcp.homeunix.com/blog
>
> _______________________________________________
> zfs-discuss mailing list
> zfs-discuss at opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
FYI to everyone: the Asus P5W64 motherboard previously in my OpenSolaris machine was the culprit, not the general mpt issues. At the time the motherboard was originally put in that machine, there was not enough ZFS I/O load to trigger the problem, which led to the false impression that the hardware was fine.

I'm using a 5400 chipset Xeon board now (Asus DSEB-GH) and my LSI cards are working perfectly again: over 2 hours of heavy I/O and no errors or warnings with snv 127 (with the P5W64/LSI combo on build 127 it would never run more than 15 minutes without warnings). I chose this board partly because it has PCI-X slots, and I thought those might be useful for AOC-SAT2-MV8 cards if I couldn't shake the mpt issues, but now that the mpt issues are gone I can continue with that controller if I want.

Thanks everyone for your help,
Chad

On Sun, Dec 06, 2009 at 11:12:50PM -0800, Chad Cantwell wrote:
> Thanks for the info on the yukon driver. I realize too many variables makes
> things impossible to determine, but I had made these hardware changes awhile
> back, and they seemed to work fine at the time. Since they aren't now, even
> in the older OpenSolaris (i've tried 2009.06 and 2008.11 now), the problem
> seems to be a hardware quirk, and the only way to narrow that down is to
> change hardware back until it works like it used to in at least the older
> snv builds. I've ruled out the ethernet controller. I'm leaning toward
> the current motherboard (Asus P5W64) not playing nicely with the LSI cards,
> but it will probably be several days until I get to the bottom of this since
> it takes awhile to test after making a change...
>
> Thanks,
> Chad
>
> On Mon, Dec 07, 2009 at 11:09:39AM +1000, James C. McPherson wrote:
> >
> > Gday Chad,
> > the more swaptronics you partake in, the more difficult it
> > is going to be for us (collectively) to figure out what is
> > going wrong on your system. Btw, since you're running a build
> > past 124, you can use the "yge" driver instead of the yukonx
> > (from Marvell) or myk (from Murayama-san) drivers.
> >
> > As another comment in this thread has mentioned, a full scrub
> > can be a serious test of your hardware depending on how much
> > data you've got to walk over. If you can keep the hardware
> > variables to a minimum then clarity will be more achievable.
> >
> > thankyou,
> > James C. McPherson
> > --
> > Senior Kernel Software Engineer, Solaris
> > Sun Microsystems
> > http://blogs.sun.com/jmcp http://www.jmcp.homeunix.com/blog