thr3ads.net - zfs discuss - [zfs-discuss] adpu320 scsi timeouts only with ZFS [Nov 2009]

Brian Kolaci

2009-Nov-23 06:00 UTC

[zfs-discuss] adpu320 scsi timeouts only with ZFS

Hi,

I'm having trouble with SCSI timeouts, but they appear to happen only
when I use ZFS.
I've tried to replicate them with SVM, but I can't get the timeouts to
happen when that is the underlying volume manager. However, the
performance with ZFS is much better when it does work.

The symptom is that at some point when the system is somewhat busy, the 
disk I/O seems to hang for about a minute or so (with iostat showing the 
%busy column at 100%), then I see a flood of messages like below, then 
it resets the bus and retries the transaction and continues on where it 
left off.  The messages look like:

Nov 22 18:55:23 nebula adpu320: [ID 138499 kern.warning] WARNING: Timeout on target 0 lun 0. Initiating recovery.
Nov 22 18:55:23 nebula last message repeated 1 time
Nov 22 18:55:23 nebula adpu320: [ID 138499 kern.warning] WARNING: Timeout on target 1 lun 0. Initiating recovery.
Nov 22 18:55:23 nebula last message repeated 1 time
Nov 22 18:55:23 nebula adpu320: [ID 138499 kern.warning] WARNING: Timeout on target 0 lun 0. Initiating recovery.
Nov 22 18:55:23 nebula last message repeated 1 time
Nov 22 18:55:23 nebula adpu320: [ID 138499 kern.warning] WARNING: Timeout on target 4 lun 0. Initiating recovery.
Nov 22 18:55:23 nebula adpu320: [ID 138499 kern.warning] WARNING: Timeout on target 3 lun 0. Initiating recovery.
Nov 22 18:55:23 nebula last message repeated 1 time
Nov 22 18:55:23 nebula adpu320: [ID 138499 kern.warning] WARNING: Timeout on target 4 lun 0. Initiating recovery.
Nov 22 18:55:23 nebula last message repeated 1 time
Nov 22 18:55:23 nebula adpu320: [ID 138499 kern.warning] WARNING: Timeout on target 2 lun 0. Initiating recovery.
Nov 22 18:55:23 nebula adpu320: [ID 138499 kern.warning] WARNING: Timeout on target 4 lun 0. Initiating recovery.
Nov 22 18:55:23 nebula adpu320: [ID 138499 kern.warning] WARNING: Timeout on target 2 lun 0. Initiating recovery.
Nov 22 18:55:23 nebula adpu320: [ID 138499 kern.warning] WARNING: Timeout on target 3 lun 0. Initiating recovery.
Nov 22 18:55:23 nebula adpu320: [ID 138499 kern.warning] WARNING: Timeout on target 0 lun 0. Initiating recovery.
Nov 22 18:55:23 nebula adpu320: [ID 138499 kern.warning] WARNING: Timeout on target 3 lun 0. Initiating recovery.
Nov 22 18:55:23 nebula adpu320: [ID 138499 kern.warning] WARNING: Timeout on target 2 lun 0. Initiating recovery.
Nov 22 18:55:23 nebula adpu320: [ID 138499 kern.warning] WARNING: Timeout on target 3 lun 0. Initiating recovery.
Nov 22 18:55:23 nebula adpu320: [ID 138499 kern.warning] WARNING: Timeout on target 2 lun 0. Initiating recovery.
Nov 22 18:55:23 nebula adpu320: [ID 138499 kern.warning] WARNING: Timeout on target 1 lun 0. Initiating recovery.
Nov 22 18:55:23 nebula last message repeated 1 time
Nov 22 18:55:23 nebula adpu320: [ID 138499 kern.warning] WARNING: Timeout on target 4 lun 0. Initiating recovery.
Nov 22 18:55:23 nebula last message repeated 1 time
Nov 22 18:55:23 nebula scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,3599@6/pci8086,32a@0,2/pci9005,40@3/sd@4,0 (sd38):
Nov 22 18:55:23 nebula  Error for Command: write(10)               Error Level: Retryable
Nov 22 18:55:23 nebula scsi: [ID 107833 kern.notice]    Requested Block: 225914045                 Error Block: 225914045
Nov 22 18:55:23 nebula scsi: [ID 107833 kern.notice]    Vendor: MAXTOR                             Serial Number: J80ARRWK
Nov 22 18:55:23 nebula scsi: [ID 107833 kern.notice]    Sense Key: Unit Attention
Nov 22 18:55:23 nebula scsi: [ID 107833 kern.notice]    ASC: 0x29 (scsi bus reset occurred), ASCQ: 0x2, FRU: 0x0
Nov 22 18:55:23 nebula scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,3599@6/pci8086,32a@0,2/pci9005,40@3/sd@2,0 (sd36):
Nov 22 18:55:23 nebula  Error for Command: write(10)               Error Level: Retryable
Nov 22 18:55:23 nebula scsi: [ID 107833 kern.notice]    Requested Block: 90882344                  Error Block: 90882344
Nov 22 18:55:23 nebula scsi: [ID 107833 kern.notice]    Vendor: MAXTOR                             Serial Number: J80BNNFK
Nov 22 18:55:23 nebula scsi: [ID 107833 kern.notice]    Sense Key: Unit Attention
Nov 22 18:55:23 nebula scsi: [ID 107833 kern.notice]    ASC: 0x29 (scsi bus reset occurred), ASCQ: 0x2, FRU: 0x0
Nov 22 18:55:23 nebula scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,3599@6/pci8086,32a@0,2/pci9005,40@3/sd@3,0 (sd37):
Nov 22 18:55:23 nebula  Error for Command: write(10)               Error Level: Retryable
Nov 22 18:55:23 nebula scsi: [ID 107833 kern.notice]    Requested Block: 225914045                 Error Block: 225914045
Nov 22 18:55:23 nebula scsi: [ID 107833 kern.notice]    Vendor: MAXTOR                             Serial Number: J80BDCKK
Nov 22 18:55:23 nebula scsi: [ID 107833 kern.notice]    Sense Key: Unit Attention
Nov 22 18:55:23 nebula scsi: [ID 107833 kern.notice]    ASC: 0x29 (scsi bus reset occurred), ASCQ: 0x2, FRU: 0x0
Nov 22 18:55:23 nebula scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,3599@6/pci8086,32a@0,2/pci9005,40@3/sd@0,0 (sd34):
Nov 22 18:55:23 nebula  Error for Command: write(10)               Error Level: Retryable
Nov 22 18:55:23 nebula scsi: [ID 107833 kern.notice]    Requested Block: 90882394                  Error Block: 90882394
Nov 22 18:55:23 nebula scsi: [ID 107833 kern.notice]    Vendor: SEAGATE                            Serial Number: 3KR0VPBF
Nov 22 18:55:23 nebula scsi: [ID 107833 kern.notice]    Sense Key: Unit Attention
Nov 22 18:55:23 nebula scsi: [ID 107833 kern.notice]    ASC: 0x29 (scsi bus reset occurred), ASCQ: 0x2, FRU: 0x2
Nov 22 18:55:23 nebula scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,3599@6/pci8086,32a@0,2/pci9005,40@3/sd@1,0 (sd35):
Nov 22 18:55:23 nebula  Error for Command: write(10)               Error Level: Retryable
Nov 22 18:55:23 nebula scsi: [ID 107833 kern.notice]    Requested Block: 90882348                  Error Block: 90882348
Nov 22 18:55:23 nebula scsi: [ID 107833 kern.notice]    Vendor: SEAGATE                            Serial Number: 3KR0WLM4
Nov 22 18:55:23 nebula scsi: [ID 107833 kern.notice]    Sense Key: Unit Attention
Nov 22 18:55:23 nebula scsi: [ID 107833 kern.notice]    ASC: 0x29 (scsi bus reset occurred), ASCQ: 0x2, FRU: 0x2
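
To see whether the timeouts cluster on particular targets or are spread across the whole bus, the adpu320 warnings above can be tallied per target and LUN. The following is only a minimal sketch: the function name count_timeouts is hypothetical, and it assumes each warning occupies a single line, as it would in an unwrapped syslog file such as /var/adm/messages.

```shell
# Tally adpu320 timeout warnings per target/LUN from syslog text on stdin.
# Assumption: one warning per line (not wrapped as in a mail archive).
# On Solaris, the legacy /usr/bin/awk may lack match(); nawk or
# /usr/xpg4/bin/awk may be needed instead.
count_timeouts() {
    awk '/adpu320:.*Timeout on target/ {
             # Extract the "target N lun M" part of the message.
             if (match($0, /target [0-9]+ lun [0-9]+/))
                 hits[substr($0, RSTART, RLENGTH)]++
         }
         END { for (t in hits) print hits[t], t }' |
        sort -k3,3n -k5,5n   # order by target number, then LUN
}
```

It could be run as `count_timeouts < /var/adm/messages`. Note that syslog's "last message repeated N time" lines are not counted, so the totals are a lower bound.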

I have a Dell 2850 with an Adaptec ASC-39320A U320 dual-channel SCSI
card.  I've connected both channels to a split-bus Dell PowerVault 220S
disk array with 11 300GB 10K drives via 2 cables.  I have already
swapped the HBA and both cables.  I've moved disks around and tried
subsets of disks, but the problem persists regardless of the disk
configuration or whether one or both controllers are used.

I've tried raidz2, raidz1, and mirrors, but it eventually hangs and
issues a timeout (and it does this several times a day).
I've tried both raid5 and mirror using SVM, and it never gets the
timeout (but raid5 is quite a bit slower, so I'd like to stick with
ZFS).
There's no problem if you just put UFS on the raw disks.
I've run diskomizer for many hours without a problem, both on raw
disks and with UFS on the disks.

I had planned on making this system a master database server. However,
I'm still getting the timeouts with it running as a slave, so I'm not
comfortable promoting this system to master.

Any suggestions?

Thanks,

Brian

Alexandru Pirvulescu

2010-Jan-14 13:57 UTC

[zfs-discuss] adpu320 scsi timeouts only with ZFS

Any news regarding this issue? I'm having the same problems.

I'm using an external Axus SCSI enclosure (Yotta with 16 drives) and it
timed out on scanning LUNs (16 of them b/c Yotta is configured as JBOD).

I've performed a firmware upgrade on the Yotta system and now the
scanning works, the pool build works, and I'm even able to perform some
transfers, but after some time everything hangs, iostat shows 100% busy
on the drives, and timeouts appear in dmesg.

Still, if I keep the number of LUNs to <=8 everything works correctly. I think
that rules out cabling or any other hardware involved.
All the problems start when using >8 LUNs (sd.conf is configured correctly,
adpu320.conf is configured with a maximum of 16 LUNs).

Linux kernel 2.6.18 had the same problems in the past, but the current kernel
works without any problem, so they've changed something in the driver which
eliminated the issue.

If any extra info is needed I'm ready to provide it.
-- 
This message posted from opensolaris.org

Marty Scholes

2010-Jan-14 16:32 UTC

[zfs-discuss] adpu320 scsi timeouts only with ZFS

> Any news regarding this issue? I'm having the same
> problems.
Me too.  My v40z with U320 drives in the internal bay will lock up partway
through a scrub.

I backed the whole SCSI chain down to U160, but it seems a shame that U320
speeds can't be used.
-- 
This message posted from opensolaris.org

Brian Kolaci

2010-Jan-14 16:35 UTC

[zfs-discuss] adpu320 scsi timeouts only with ZFS

I was frustrated with this problem for months.  I've tried different
disks, cables, even disk cabinets.  The driver hasn't been updated in
a long time.
When the timeouts occurred, I/O would freeze for about a minute or
two (showing the 100% busy).  I even had the problem with fewer than 8
LUNs (and I'm seeing similar symptoms on another box connected to a
hardware RAID).  I even swapped out controllers (for the same type - 3
of them, all Adaptec).

To fix it, I swapped out the Adaptec controller and put in an LSI Logic
one, and all the problems went away.

On Jan 14, 2010, at 8:57 AM, Alexandru Pirvulescu wrote:
> Any news regarding this issue? I'm having the same problems.
>
> I'm using an external Axus SCSI enclosure (Yotta with 16 drives) and
> it timed out on scanning LUNs (16 of them b/c Yotta is configured as
> JBOD).
>
> I've performed firmware upgrade to the Yotta system and now the
> scanning works, the pool build works, I'm even able to perform some
> transfers but after some time everything hangs, iostat shows 100%
> busy on drives and timeouts in dmesg.
>
> Still, if I keep the LUN number to <=8 everything works correctly. I
> think that eliminates the problems of cabling or any other hardware
> involved.
> All the problems start when using >8 LUNs (sd.conf is configured
> correctly, adpu320.conf is configured with maximum 16 LUNs).
>
> Linux kernel 2.6.18 had the same problems in the past, but current
> kernel works without any problem, so they've changed something in
> the driver which eliminated the issue.
>
> If any extra info is needed I'm ready to provide it.
> -- 
> This message posted from opensolaris.org
> _______________________________________________
> zfs-discuss mailing list
> zfs-discuss at opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

Marty Scholes

2010-Jan-15 16:48 UTC

[zfs-discuss] adpu320 scsi timeouts only with ZFS

> To fix it, I swapped out the Adaptec controller and
> put in LSI Logic
> and all the problems went away.
I'm using Sun's built-in LSI controller with (I presume) the
original internal cable shipped by Sun.

Still, no joy for me at U320 speeds.  To be precise, when the controller is set
at U320, it runs amazingly fast until it freezes, at which point it is quite
slow.
-- 
This message posted from opensolaris.org

Andreas Grüninger

2010-Jan-19 17:02 UTC

[zfs-discuss] adpu320 scsi timeouts only with ZFS

Maybe there are too many I/Os for this controller.

You may try these settings.

B130:
echo zfs_txg_synctime_ms/W0t2000 | mdb -kw
echo zfs_vdev_max_pending/W0t5 | mdb -kw

Older versions:
echo zfs_txg_synctime/W0t2 | mdb -kw
echo zfs_vdev_max_pending/W0t5 | mdb -kw

Andreas
-- 
This message posted from opensolaris.org
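
Because the mdb commands suggested above write directly into the running kernel, it may be worth recording the current value before changing anything. The wrapper below is a rough sketch, not a tested procedure: the function name set_zfs_tunable and the APPLY variable are hypothetical, and it assumes the tunables are 32-bit integers, which is what the W write format in the commands above implies. By default it only prints the command it would run.

```shell
# Print (dry run) or apply a ZFS kernel tunable through mdb.
# With APPLY=1 it writes to the live kernel and needs root;
# without it, nothing is changed and the command is only echoed.
set_zfs_tunable() {
    name=$1
    value=$2
    if [ "${APPLY:-0}" = "1" ]; then
        echo "${name}/D" | mdb -k              # show the current value (decimal)
        echo "${name}/W0t${value}" | mdb -kw   # write the new decimal value
    else
        echo "echo ${name}/W0t${value} | mdb -kw"
    fi
}
```

For example, `set_zfs_tunable zfs_vdev_max_pending 5` prints the command, and `APPLY=1 set_zfs_tunable zfs_vdev_max_pending 5` runs it. Values set this way do not survive a reboot; a line such as `set zfs:zfs_vdev_max_pending = 5` in /etc/system is the usual way to make such a setting persistent.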