Hi,

I'm having trouble with SCSI timeouts, but they appear to happen only when I use ZFS. I've tried to replicate the problem with SVM, but I can't get the timeouts to happen when that is the underlying volume manager. However, the performance with ZFS is much better when it does work.

The symptom is that at some point when the system is somewhat busy, disk I/O seems to hang for about a minute or so (with iostat showing the %busy column at 100%). Then I see a flood of messages like the ones below, the bus is reset, the transaction is retried, and everything continues where it left off.

The messages look like:

Nov 22 18:55:23 nebula adpu320: [ID 138499 kern.warning] WARNING: Timeout on target 0 lun 0. Initiating recovery.
Nov 22 18:55:23 nebula last message repeated 1 time
Nov 22 18:55:23 nebula adpu320: [ID 138499 kern.warning] WARNING: Timeout on target 1 lun 0. Initiating recovery.
Nov 22 18:55:23 nebula last message repeated 1 time
Nov 22 18:55:23 nebula adpu320: [ID 138499 kern.warning] WARNING: Timeout on target 0 lun 0. Initiating recovery.
Nov 22 18:55:23 nebula last message repeated 1 time
Nov 22 18:55:23 nebula adpu320: [ID 138499 kern.warning] WARNING: Timeout on target 4 lun 0. Initiating recovery.
Nov 22 18:55:23 nebula adpu320: [ID 138499 kern.warning] WARNING: Timeout on target 3 lun 0. Initiating recovery.
Nov 22 18:55:23 nebula last message repeated 1 time
Nov 22 18:55:23 nebula adpu320: [ID 138499 kern.warning] WARNING: Timeout on target 4 lun 0. Initiating recovery.
Nov 22 18:55:23 nebula last message repeated 1 time
Nov 22 18:55:23 nebula adpu320: [ID 138499 kern.warning] WARNING: Timeout on target 2 lun 0. Initiating recovery.
Nov 22 18:55:23 nebula adpu320: [ID 138499 kern.warning] WARNING: Timeout on target 4 lun 0. Initiating recovery.
Nov 22 18:55:23 nebula adpu320: [ID 138499 kern.warning] WARNING: Timeout on target 2 lun 0. Initiating recovery.
Nov 22 18:55:23 nebula adpu320: [ID 138499 kern.warning] WARNING: Timeout on target 3 lun 0. Initiating recovery.
Nov 22 18:55:23 nebula adpu320: [ID 138499 kern.warning] WARNING: Timeout on target 0 lun 0. Initiating recovery.
Nov 22 18:55:23 nebula adpu320: [ID 138499 kern.warning] WARNING: Timeout on target 3 lun 0. Initiating recovery.
Nov 22 18:55:23 nebula adpu320: [ID 138499 kern.warning] WARNING: Timeout on target 2 lun 0. Initiating recovery.
Nov 22 18:55:23 nebula adpu320: [ID 138499 kern.warning] WARNING: Timeout on target 3 lun 0. Initiating recovery.
Nov 22 18:55:23 nebula adpu320: [ID 138499 kern.warning] WARNING: Timeout on target 2 lun 0. Initiating recovery.
Nov 22 18:55:23 nebula adpu320: [ID 138499 kern.warning] WARNING: Timeout on target 1 lun 0. Initiating recovery.
Nov 22 18:55:23 nebula last message repeated 1 time
Nov 22 18:55:23 nebula adpu320: [ID 138499 kern.warning] WARNING: Timeout on target 4 lun 0. Initiating recovery.
Nov 22 18:55:23 nebula last message repeated 1 time
Nov 22 18:55:23 nebula scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,3599@6/pci8086,32a@0,2/pci9005,40@3/sd@4,0 (sd38):
Nov 22 18:55:23 nebula Error for Command: write(10) Error Level: Retryable
Nov 22 18:55:23 nebula scsi: [ID 107833 kern.notice] Requested Block: 225914045 Error Block: 225914045
Nov 22 18:55:23 nebula scsi: [ID 107833 kern.notice] Vendor: MAXTOR Serial Number: J80ARRWK
Nov 22 18:55:23 nebula scsi: [ID 107833 kern.notice] Sense Key: Unit Attention
Nov 22 18:55:23 nebula scsi: [ID 107833 kern.notice] ASC: 0x29 (scsi bus reset occurred), ASCQ: 0x2, FRU: 0x0
Nov 22 18:55:23 nebula scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,3599@6/pci8086,32a@0,2/pci9005,40@3/sd@2,0 (sd36):
Nov 22 18:55:23 nebula Error for Command: write(10) Error Level: Retryable
Nov 22 18:55:23 nebula scsi: [ID 107833 kern.notice] Requested Block: 90882344 Error Block: 90882344
Nov 22 18:55:23 nebula scsi: [ID 107833 kern.notice] Vendor: MAXTOR Serial Number: J80BNNFK
Nov 22 18:55:23 nebula scsi: [ID 107833 kern.notice] Sense Key: Unit Attention
Nov 22 18:55:23 nebula scsi: [ID 107833 kern.notice] ASC: 0x29 (scsi bus reset occurred), ASCQ: 0x2, FRU: 0x0
Nov 22 18:55:23 nebula scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,3599@6/pci8086,32a@0,2/pci9005,40@3/sd@3,0 (sd37):
Nov 22 18:55:23 nebula Error for Command: write(10) Error Level: Retryable
Nov 22 18:55:23 nebula scsi: [ID 107833 kern.notice] Requested Block: 225914045 Error Block: 225914045
Nov 22 18:55:23 nebula scsi: [ID 107833 kern.notice] Vendor: MAXTOR Serial Number: J80BDCKK
Nov 22 18:55:23 nebula scsi: [ID 107833 kern.notice] Sense Key: Unit Attention
Nov 22 18:55:23 nebula scsi: [ID 107833 kern.notice] ASC: 0x29 (scsi bus reset occurred), ASCQ: 0x2, FRU: 0x0
Nov 22 18:55:23 nebula scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,3599@6/pci8086,32a@0,2/pci9005,40@3/sd@0,0 (sd34):
Nov 22 18:55:23 nebula Error for Command: write(10) Error Level: Retryable
Nov 22 18:55:23 nebula scsi: [ID 107833 kern.notice] Requested Block: 90882394 Error Block: 90882394
Nov 22 18:55:23 nebula scsi: [ID 107833 kern.notice] Vendor: SEAGATE Serial Number: 3KR0VPBF
Nov 22 18:55:23 nebula scsi: [ID 107833 kern.notice] Sense Key: Unit Attention
Nov 22 18:55:23 nebula scsi: [ID 107833 kern.notice] ASC: 0x29 (scsi bus reset occurred), ASCQ: 0x2, FRU: 0x2
Nov 22 18:55:23 nebula scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,3599@6/pci8086,32a@0,2/pci9005,40@3/sd@1,0 (sd35):
Nov 22 18:55:23 nebula Error for Command: write(10) Error Level: Retryable
Nov 22 18:55:23 nebula scsi: [ID 107833 kern.notice] Requested Block: 90882348 Error Block: 90882348
Nov 22 18:55:23 nebula scsi: [ID 107833 kern.notice] Vendor: SEAGATE Serial Number: 3KR0WLM4
Nov 22 18:55:23 nebula scsi: [ID 107833 kern.notice] Sense Key: Unit Attention
Nov 22 18:55:23 nebula scsi: [ID 107833 kern.notice] ASC: 0x29 (scsi bus reset occurred), ASCQ: 0x2, FRU: 0x2

I have a Dell 2850 with an Adaptec ASC-39320A U320 Dual SCSI 39320A card.
I've connected both channels to a split-bus Dell PowerVault 220S disk array with 11 300GB 10K drives via 2 cables. I have already swapped the HBA and both cables. I've moved disks around and tried subsets of disks, but the problem shows up regardless of the disk configuration, or whether one or both controllers are used.

I've tried raidz2, raidz1, and mirrors, but the pool eventually hangs and issues a timeout (and it does this several times a day). I've tried both raid5 and mirror using SVM, and it never gets the timeout (but the raid5 is quite a bit slower, so I'd like to stick with ZFS). There's no problem if you just put UFS on the raw disks. I've run diskomizer for many hours without a problem, both on raw disks and on UFS on the disks.

I had planned on making this system a master database server. However, I'm still getting the timeouts with it running as a slave, so I'm not comfortable promoting this system to master.

Any suggestions?

Thanks,
Brian
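A quick way to see whether the timeouts cluster on particular targets is to tally the adpu320 warnings per target. The following is a minimal sketch, not something taken from the thread: the function name `count_timeouts` is made up for illustration, and it assumes a syslog file with one record per line (on Solaris this is normally /var/adm/messages). Note that "last message repeated" lines are not counted, so the totals are a lower bound.

```shell
#!/bin/sh
# Tally adpu320 timeout warnings per SCSI target from a syslog-style file.
# count_timeouts is a hypothetical helper name; $1 is the log file to read.
count_timeouts() {
    # Keep only the driver's timeout warnings, extract the number that
    # follows "target", then count how often each target appears.
    grep 'adpu320:.*Timeout on target' "$1" |
        sed 's/.*Timeout on target \([0-9][0-9]*\) lun.*/\1/' |
        sort -n | uniq -c |
        awk '{ printf "target %s: %s timeouts\n", $2, $1 }'
}
```

Calling `count_timeouts /var/adm/messages` prints one line per target. If the counts are spread evenly across all targets, as in the log above, a single bad disk looks less likely than a problem at the bus, controller, or driver level.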
Alexandru Pirvulescu
2010-Jan-14 13:57 UTC
[zfs-discuss] adpu320 scsi timeouts only with ZFS
Any news regarding this issue? I'm having the same problems.

I'm using an external Axus SCSI enclosure (Yotta with 16 drives) and it timed out on scanning LUNs (16 of them b/c Yotta is configured as JBOD).

I've performed firmware upgrade to the Yotta system and now the scanning works, the pool build works, I'm even able to perform some transfers but after some time everything hangs, iostat shows 100% busy on drives and timeouts in dmesg.

Still, if I keep the LUN number to <=8 everything works correctly. I think that eliminates the problems of cabling or any other hardware involved.
All the problems start when using >8 LUNs (sd.conf is configured correctly, adpu320.conf is configured with maximum 16 LUNs).

Linux kernel 2.6.18 had the same problems in the past, but current kernel works without any problem, so they've changed something in the driver which eliminated the issue.

If any extra info is needed I'm ready to provide it.
--
This message posted from opensolaris.org
> Any news regarding this issue? I'm having the same
> problems.

Me too. My v40z with U320 drives in the internal bay will lock up partway through a scrub. I backed the whole SCSI chain down to U160, but it seems a shame that U320 speeds can't be used.
I was frustrated with this problem for months. I've tried different disks, cables, even disk cabinets. The driver hasn't been updated in a long time.

When the timeouts occurred, I/O would freeze for about a minute or two (showing the 100% busy). I even had the problem with fewer than 8 LUNs (and I'm seeing similar symptoms on another box connected to a hardware RAID). I even swapped out controllers (for the same type - 3 of them, all Adaptec).

To fix it, I swapped out the Adaptec controller and put in LSI Logic and all the problems went away.

On Jan 14, 2010, at 8:57 AM, Alexandru Pirvulescu wrote:

> Any news regarding this issue? I'm having the same problems.
>
> I'm using an external Axus SCSI enclosure (Yotta with 16 drives) and
> it timed out on scanning LUNs (16 of them b/c Yotta is configured as
> JBOD).
>
> I've performed firmware upgrade to the Yotta system and now the
> scanning works, the pool build works, I'm even able to perform some
> transfers but after some time everything hangs, iostat shows 100%
> busy on drives and timeouts in dmesg.
>
> Still, if I keep the LUN number to <=8 everything works correctly. I
> think that eliminates the problems of cabling or any other hardware
> involved.
> All the problems start when using >8 LUNs (sd.conf is configured
> correctly, adpu320.conf is configured with maximum 16 LUNs).
>
> Linux kernel 2.6.18 had the same problems in the past, but current
> kernel works without any problem, so they've changed something in
> the driver which eliminated the issue.
>
> If any extra info is needed I'm ready to provide it.
> To fix it, I swapped out the Adaptec controller and
> put in LSI Logic
> and all the problems went away.

I'm using Sun's built-in LSI controller with (I presume) the original internal cable shipped by Sun. Still, no joy for me at U320 speeds. To be precise, when the controller is set at U320, it runs amazingly fast until it freezes, at which point it is quite slow.
Andreas Grüninger
2010-Jan-19 17:02 UTC
[zfs-discuss] adpu320 scsi timeouts only with ZFS
Maybe there are too many I/Os for this controller. You could try these settings.

B130:
echo zfs_txg_synctime_ms/W0t2000 | mdb -kw
echo zfs_vdev_max_pending/W0t5 | mdb -kw

Older versions:
echo zfs_txg_synctime/W0t2 | mdb -kw
echo zfs_vdev_max_pending/W0t5 | mdb -kw

Andreas
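For anyone trying the settings above, it can help to check the current value before overwriting it and to preview the exact write first. The wrapper below is only a sketch under stated assumptions: `set_tunable` and the `APPLY` variable are hypothetical names, not part of mdb or ZFS. It assumes the tunables are 32-bit integers, which matches the `/W` write format used in the post, so `/D` is used to display them. By default it only prints the command; with `APPLY=1` it shows the current value and then writes the new one, which needs root on a live system.

```shell
#!/bin/sh
# Print (or, with APPLY=1, run) the mdb write for one kernel tunable.
# set_tunable and APPLY are illustrative names, not an existing tool.
# The NAME/W0tVALUE form is the one used in the post; 0t marks a decimal value.
set_tunable() {
    name=$1
    value=$2
    # Refuse anything that is not a plain decimal number before it reaches mdb.
    case $value in
        ''|*[!0-9]*) echo "set_tunable: bad value '$value'" >&2; return 1 ;;
    esac
    if [ "${APPLY:-0}" = 1 ]; then
        echo "${name}/D" | mdb -k             # show the current value first
        echo "${name}/W0t${value}" | mdb -kw  # then write the new one
    else
        # Dry run: only show what would be executed.
        echo "echo ${name}/W0t${value} | mdb -kw"
    fi
}
```

For example, `set_tunable zfs_vdev_max_pending 5` prints the command, and `APPLY=1 set_tunable zfs_vdev_max_pending 5` runs it. Values written with `mdb -kw` change only the running kernel and are lost at reboot, so they are easy to back out if they make no difference.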