Alexander Zagrebin
2009-Dec-16  00:59 UTC
8.0-RELEASE: disk IO temporarily hangs up (ZFS or ATA related problem)
Hi!
I use onboard ICH7 SATA controller with two disks attached:
atapci1: <Intel ICH7 SATA300 controller> port
0x30c8-0x30cf,0x30ec-0x30ef,0x30c0-0x30c7,0x30e8-0x30eb,0x30a0-0x30af irq 19
at device 31.2 on pci0
atapci1: [ITHREAD]
ata2: <ATA channel 0> on atapci1
ata2: [ITHREAD]
ata3: <ATA channel 1> on atapci1
ata3: [ITHREAD]
ad4: 1430799MB <Seagate ST31500541AS CC34> at ata2-master SATA150
ad6: 1430799MB <WDC WD15EADS-00P8B0 01.00A01> at ata3-master SATA150
The disks are used for mirrored ZFS pool.
I have noticed that the system periodically locks up on disk operations.
After approx. 10 min of very slow disk i/o (several KB/s) the speed of disk
operations restores to normal.
gstat has shown that the problem is in ad6.
For example, there is a filtered output of iostat -x 1:
                        extended device statistics
device     r/s   w/s    kr/s    kw/s wait svc_t  %b
ad6      818.6   0.0 10840.2     0.0    0   0.4  34
ad6      300.6 642.0  3518.5 24830.3   50  24.8  72
ad6        1.0 639.3    63.7 17118.3    0  62.1  98
ad6      404.5   4.0  6837.7     4.0    0   0.5  18
ad6      504.5   0.0 13667.2     0.0    1   0.7  32
ad6      633.3   0.0 13190.3     0.0    1   0.7  38
ad6      416.3 384.5  8134.7 24606.2    0  16.3  57
ad6      538.9  76.7  9772.8  2982.2   55   2.9  40
ad6       31.9 929.5   801.0 37498.6    0  27.2  82
ad6      635.5   0.0 13087.1     0.0    1   0.6  35
ad6      579.6   0.0 16669.8     0.0    0   0.8  43
ad6      603.6   0.0 11697.4     0.0    1   0.7  40
ad6      538.0   0.0 10438.7     0.0    0   0.9  47
ad6       30.9 898.4   868.6 40585.4    0  36.6  78
ad6      653.3  86.6  8566.6   202.7    1   0.8  40
ad6      737.1   0.0  6429.4     0.0    1   0.6  42
ad6      717.1   0.0  3958.7     0.0    0   0.5  36
ad6     1179.5   0.0  2058.9     0.0    0   0.1  15
ad6     1191.2   0.0  1079.6     0.0    1   0.1  15
ad6      985.1   0.0  5093.9     0.0    0   0.2  23
ad6      761.8   0.0  9801.3     0.0    1   0.4  31
ad6      698.7   0.0  9215.1     0.0    0   0.4  30
ad6      434.2 513.9  5903.1 13658.3   48  10.2  55
ad6        3.0 762.8   191.2 28732.3    0  57.6  99
ad6       10.0   4.0   163.9     4.0    1   1.6   2
Before this line we have a normal operations.
Then the behaviour of ad6 changes (pay attention to high average access time
and percent of "busy" significantly greater than 100):
ad6        0.0   0.0     0.0     0.0    1   0.0   0
ad6        1.0   0.0     0.5     0.0    1 1798.3 179
ad6        1.0   0.0     1.5     0.0    1 1775.4 177
ad6        0.0   0.0     0.0     0.0    1   0.0   0
ad6       10.0   0.0    75.2     0.0    1 180.3 180
ad6        0.0   0.0     0.0     0.0    1   0.0   0
ad6       83.7   0.0   862.9     0.0    1  21.4 179
ad6        0.0   0.0     0.0     0.0    1   0.0   0
ad6        1.0   0.0    63.7     0.0    1 1707.4 170
ad6        1.0   0.0     9.0     0.0    0 1791.0 178
ad6       10.9   0.0   172.2     0.0    2   0.2   0
ad6       24.9   0.0   553.7     0.0    1 143.3 179
ad6        0.0   0.0     0.0     0.0    7   0.0   0
ad6        2.0  23.9    32.4  1529.9    1 336.3 177
ad6        0.0   0.0     0.0     0.0    1   0.0   0
ad6       68.7   0.0  1322.8     0.0    1  26.3 181
ad6        0.0   0.0     0.0     0.0    1   0.0   0
ad6       27.9   0.0   193.7     0.0    1  61.6 172
ad6        1.0   0.0     2.5     0.0    1 1777.4 177
ad6        0.0   0.0     0.0     0.0    1   0.0   0
ad6        1.0   0.0     2.0     0.0    1 1786.9 178
ad6        0.0   0.0     0.0     0.0    1   0.0   0
ad6        2.0   0.0     6.5     0.0    1 899.4 179
ad6        0.0   0.0     0.0     0.0    1   0.0   0
ad6        1.0   0.0     2.0     0.0    1 1786.7 178
ad6        0.0   0.0     0.0     0.0    1   0.0   0
And so on for about 10 minutes.
Then the disk i/o is reverted to normal:
ad6      139.4   0.0  8860.5     0.0    1   4.4  61
ad6      167.3   0.0 10528.7     0.0    1   3.3  55
ad6       60.8 411.5  3707.6  8574.8    1  19.6  87
ad6      163.4   0.0 10334.9     0.0    1   4.4  72
ad6      157.4   0.0  9770.7     0.0    1   5.0  78
ad6      108.5   0.0  6886.8     0.0    0   3.9  43
ad6      101.6   0.0  6381.4     0.0    0   2.6  27
ad6      109.6   0.0  7013.9     0.0    0   2.0  22
ad6      121.4   0.0  7769.7     0.0    0   2.4  29
ad6       92.5   0.0  5922.6     0.0    1   3.4  31
ad6      122.4  19.9  7833.0  1273.7    0   3.9  54
ad6       83.6   0.0  5349.5     0.0    0   3.9  33
ad6        5.0   0.0   318.4     0.0    0   8.1   4
There are no ata error messages neither in the system log, nor on the
console.
The manufacture's diagnostic test is passed on ad6 without any errors.
The ad6 also contains swap partition.
I have tried to run several (10..20) instances of dd, which read and write
data
from and to the swap partition simultaneously, but it has not called the
lockup.
So there is a probability that this problem is ZFS related.
I have been forced to switch ad6 to the offline state... :(
Any suggestions on this problem?
-- 
Alexander Zagrebin
Jeremy Chadwick
2009-Dec-16  01:14 UTC
8.0-RELEASE: disk IO temporarily hangs up (ZFS or ATA related problem)
On Wed, Dec 16, 2009 at 11:59:31AM +0300, Alexander Zagrebin wrote:> {snip} > > There are no ata error messages neither in the system log, nor on the > console. The manufacture's diagnostic test is passed on ad6 without > any errors. The ad6 also contains swap partition. I have tried to > run several (10..20) instances of dd, which read and write data from > and to the swap partition simultaneously, but it has not called the > lockup. So there is a probability that this problem is ZFS related. > > I have been forced to switch ad6 to the offline state... :( > > Any suggestions on this problem?My first guess would be that yes, this is a sign of something ZFS is doing, but I'm not positive. It'd be useful to see "zpool status" output from your system (when ad6 is in the ONLINE state). Have you tried doing "zpool scrub <pool>" to make sure there's no read, write, or checksum errors causing problems behind the scenes? I don't know how this would manifest itself in that manner, but worth a shot. Otherwise, if you could install sysutils/smartmontools and provide the full output of "smartctl -a /dev/ad6" I'll be happy to perform a brief analysis of the drive itself and if it's performing badly or oddly. Please note that SMART stats don't keep track of anything pertaining to the drive's on-PCB cache (which could be going bad, but you'd almost certainly see that at all times not, occasionally). -- | Jeremy Chadwick jdc@parodius.com | | Parodius Networking http://www.parodius.com/ | | UNIX Systems Administrator Mountain View, CA, USA | | Making life hard for others since 1977. PGP: 4BD6C0CB |