Peter Eriksson
2009-Jul-07 20:48 UTC
[zfs-discuss] [perf-discuss] help diagnosing system hang
Interesting... I wonder what differs between your system and mine. With my dirt-simple stress test:

  server1# zpool create X25E c1t15d0
  server1# zfs set sharenfs=rw X25E
  server1# chmod a+w /X25E

  server2# cd /net/server1/X25E
  server2# gtar zxf /var/tmp/emacs-22.3.tar.gz

and a fully patched X4240 running Solaris 10 U7, I still see these errors:

  Jul  7 22:35:04 merope   Error for Command: write(10)   Error Level: Retryable
  Jul  7 22:35:04 merope scsi:   Requested Block: 5301376   Error Block: 5301376
  Jul  7 22:35:04 merope scsi:   Vendor: ATA   Serial Number: CVEM849300BM
  Jul  7 22:35:04 merope scsi:   Sense Key: Unit Attention
  Jul  7 22:35:04 merope scsi:   ASC: 0x29 (power on, reset, or bus reset occurred), ASCQ: 0x0, FRU: 0x0
  Jul  7 22:35:09 merope scsi: WARNING: /pci@0,0/pci10de,375@f/pci1000,3150@0/sd@f,0 (sd32):
  Jul  7 22:35:09 merope   Error for Command: write(10)   Error Level: Retryable
  Jul  7 22:35:09 merope scsi:   Requested Block: 5315248   Error Block: 5315248
  Jul  7 22:35:09 merope scsi:   Vendor: ATA   Serial Number: CVEM849300BM
  Jul  7 22:35:09 merope scsi:   Sense Key: Unit Attention
  Jul  7 22:35:09 merope scsi:   ASC: 0x29 (power on, reset, or bus reset occurred), ASCQ: 0x0, FRU: 0x0

I had an idea that this might be due to NCQ overruns - if I'm not mistaken, the X25-E only supports 32 outstanding commands - so I've started testing various things. Setting sd_max_throttle in /etc/system doesn't seem to make any difference. However... tuning zfs_vdev_max_pending down from 35 to 10 did make a difference.
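For anyone who wants to repeat the experiment: the two tunables mentioned above can be set persistently in /etc/system (takes effect at the next reboot) or poked into a running kernel with mdb. A sketch, assuming the stock Solaris 10 U7 tunable names; the values are just the ones tried here, not recommendations, and writing to a live kernel with "mdb -kw" is obviously something to do with care:

  * /etc/system fragment - applied at boot
  set zfs:zfs_vdev_max_pending = 10
  set sd:sd_max_throttle = 32

  # change the ZFS queue depth on a live system, no reboot
  # (W = write 32-bit value, 0t = decimal radix)
  echo "zfs_vdev_max_pending/W0t10" | mdb -kw

You can read the current values back the same way, e.g. "echo zfs_vdev_max_pending/D | mdb -k".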
Whereas I used to see long "hiccups" in "zpool iostat X25E 10", with it tuned down to 10 things run *much* more smoothly - no hiccups:

                  capacity     operations    bandwidth
    pool        used  avail   read  write   read  write
    ----------  ----  -----   ----  -----   ----  -----
    X25E        284K  29.7G      0      5    332   549K
    X25E        284K  29.7G      0    197      0  5.70M
    X25E        284K  29.7G      0    197      0  2.28M
    X25E       59.8M  29.7G      0    322      0  10.9M
    X25E       59.8M  29.7G      0    418      0  7.97M
    X25E       59.8M  29.7G      0    588      0  10.3M

Still a lot of the same errors on the console though (more often, actually...).

Output from "iostat -zx 10", if it is of interest:

                     extended device statistics
    device   r/s     w/s   kr/s     kw/s  wait  actv  svc_t  %w  %b
    sd32     0.0   718.7    0.0   1437.4   0.0   0.0    0.0   0   3
    sd32     0.0   401.1    0.0   6089.1   0.0   0.8    2.1   0  43
    sd32     0.0  1187.5    0.0  12341.7   0.0   0.7    0.6   2  37
    sd32     0.0   758.2    0.0  14835.1   0.0   1.7    2.3   4  66
    sd32     0.0   403.1    0.0   4606.8   0.0   1.5    3.9   4  77
    sd32     0.0   350.8    0.0   3420.8   0.0   1.6    4.6   4  80
    sd32     0.0   315.9    0.0   8578.2   0.0   0.4    1.1   0   6

I'm really curious what is causing these errors... It's almost as if something else is causing them - perhaps a "flush cache" command that is executed after the writes to the device have completed (since I see the error more often with zfs_vdev_max_pending tuned down).

Another interesting thing is that I only see this for I/O issued from a remote server over NFS. If I write directly to the X25E pool on the server, things run really smoothly.

--
This message posted from opensolaris.org