Carsten Aulbert
2010-Feb-04 15:30 UTC
[zfs-discuss] zfs/sol10u8 less stable than in sol10u5?
Hi all,
it might not be a ZFS issue (and thus be on the wrong list), but maybe
there's someone here who can give us a good hint:
We operate 13 x4500s and started to experiment with non-Sun-blessed SSDs in
them. We had been running Solaris 10u5; since we wanted to use the SSDs as log
devices, we upgraded to the latest and greatest 10u8 and changed the zpool
layout[1]. However, on the first machine we found many, many problems with
various disks "failing" in different vdevs (I wrote about this on this list in
December, IIRC).
After going through this with Sun, they gave us hints but mostly blamed
(maybe rightfully) the Intel X25-E in there; we considered the 2.5"-to-3.5"
converter to be at fault as well. So for the next test we placed the SSD into
the tray without a conversion unit, but that box (a different one) failed with
the same problems.
Now, having "learned" from this experience, we did the same to another box but
without any SSD, i.e. we jumpstarted the box, installed 10u8, redid the zpool
and started to fill in data. During today's scrub this suddenly happened:
s09:~# zpool status
  pool: atlashome
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: resilver in progress for 0h9m, 3.89% done, 4h2m to go
config:
        NAME          STATE     READ WRITE CKSUM
        atlashome     DEGRADED     0     0     0
          raidz1      ONLINE       0     0     0
            c0t0d0    ONLINE       0     0     0
            c1t0d0    ONLINE       0     0     0
            c4t0d0    ONLINE       0     0     0
            c6t0d0    ONLINE       0     0     0
            c7t0d0    ONLINE       0     0     0
          raidz1      ONLINE       0     0     0
            c0t1d0    ONLINE       0     0     0
            c1t1d0    ONLINE       0     0     0
            c4t1d0    ONLINE       0     0     0
            c5t1d0    ONLINE       0     0     0
            c6t1d0    ONLINE       0     0     0
          raidz1      ONLINE       0     0     0
            c7t1d0    ONLINE       0     0     1
            c0t2d0    ONLINE       0     0     0
            c1t2d0    ONLINE       0     0     2
            c4t2d0    ONLINE       0     0     0
            c5t2d0    ONLINE       0     0     0
          raidz1      ONLINE       0     0     0
            c6t2d0    ONLINE       0     0     0
            c7t2d0    ONLINE       0     0     0
            c0t3d0    ONLINE       0     0     0
            c1t3d0    ONLINE       0     0     0
            c4t3d0    ONLINE       0     0     0
          raidz1      DEGRADED     0     0     0
            c5t3d0    ONLINE       0     0     0
            c6t3d0    ONLINE       0     0     0
            c7t3d0    ONLINE       0     0     0
            c1t4d0    ONLINE       0     0     1
            spare     DEGRADED     0     0     0
              c4t4d0  DEGRADED     5     0    11  too many errors
              c0t4d0  ONLINE       0     0     0  5.38G resilvered
          raidz1      ONLINE       0     0     0
            c5t4d0    ONLINE       0     0     0
            c6t4d0    ONLINE       0     0     0
            c7t4d0    ONLINE       0     0     0
            c0t5d0    ONLINE       0     0     0
            c1t5d0    ONLINE       0     0     0
          raidz1      ONLINE       0     0     0
            c4t5d0    ONLINE       0     0     0
            c5t5d0    ONLINE       0     0     0
            c6t5d0    ONLINE       0     0     0
            c7t5d0    ONLINE       0     0     0
            c0t6d0    ONLINE       0     0     1
          raidz1      ONLINE       0     0     0
            c1t6d0    ONLINE       0     0     0
            c4t6d0    ONLINE       0     0     0
            c5t6d0    ONLINE       0     0     0
            c6t6d0    ONLINE       0     0     0
            c7t6d0    ONLINE       0     0     1
          raidz1      ONLINE       0     0     0
            c0t7d0    ONLINE       0     0     0
            c1t7d0    ONLINE       0     0     0
            c4t7d0    ONLINE       0     0     0
            c5t7d0    ONLINE       0     0     0
            c6t7d0    ONLINE       0     0     0
        spares
          c0t4d0      INUSE     currently in use
          c7t7d0      AVAIL
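(For what it's worth, the follow-up that the status message suggests would look
roughly like this -- a sketch only, using the device names from the output
above, and nothing to be run before the resilver has finished:)

```shell
# If the errors on c4t4d0 were transient: reset the counters and
# return the hot spare c0t4d0 to the pool.
zpool clear atlashome c4t4d0
zpool detach atlashome c0t4d0

# If c4t4d0 really is bad: detach it instead, which makes the
# resilvered spare c0t4d0 a permanent member of the vdev.
zpool detach atlashome c4t4d0
```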
Also similar to the other hosts are the much, much higher soft/hard error
counts in iostat:
s09:~# iostat -En|grep Soft
c2t0d0 Soft Errors: 1 Hard Errors: 2 Transport Errors: 0
c3t0d0 Soft Errors: 2 Hard Errors: 0 Transport Errors: 0
c5t0d0 Soft Errors: 2805 Hard Errors: 0 Transport Errors: 0
c6t0d0 Soft Errors: 4003 Hard Errors: 1 Transport Errors: 0
c4t0d0 Soft Errors: 4003 Hard Errors: 2 Transport Errors: 0
c1t0d0 Soft Errors: 4002 Hard Errors: 0 Transport Errors: 0
c6t1d0 Soft Errors: 4002 Hard Errors: 0 Transport Errors: 0
c1t1d0 Soft Errors: 4002 Hard Errors: 1 Transport Errors: 0
c4t1d0 Soft Errors: 4002 Hard Errors: 0 Transport Errors: 0
c5t1d0 Soft Errors: 4003 Hard Errors: 1 Transport Errors: 0
c1t2d0 Soft Errors: 4002 Hard Errors: 0 Transport Errors: 0
c0t0d0 Soft Errors: 4002 Hard Errors: 0 Transport Errors: 0
c5t2d0 Soft Errors: 4002 Hard Errors: 0 Transport Errors: 0
c4t2d0 Soft Errors: 4002 Hard Errors: 0 Transport Errors: 0
c1t3d0 Soft Errors: 4000 Hard Errors: 0 Transport Errors: 0
c6t2d0 Soft Errors: 4002 Hard Errors: 1 Transport Errors: 0
c0t1d0 Soft Errors: 4002 Hard Errors: 2 Transport Errors: 0
c4t3d0 Soft Errors: 4000 Hard Errors: 0 Transport Errors: 0
c5t3d0 Soft Errors: 4001 Hard Errors: 1 Transport Errors: 0
c6t3d0 Soft Errors: 4000 Hard Errors: 0 Transport Errors: 0
c1t4d0 Soft Errors: 4002 Hard Errors: 0 Transport Errors: 0
c0t2d0 Soft Errors: 4002 Hard Errors: 0 Transport Errors: 0
c4t4d0 Soft Errors: 4004 Hard Errors: 6 Transport Errors: 0
c5t4d0 Soft Errors: 4002 Hard Errors: 0 Transport Errors: 0
c5t5d0 Soft Errors: 4002 Hard Errors: 0 Transport Errors: 0
c6t4d0 Soft Errors: 4002 Hard Errors: 0 Transport Errors: 0
c4t5d0 Soft Errors: 4003 Hard Errors: 2 Transport Errors: 0
c1t5d0 Soft Errors: 4002 Hard Errors: 0 Transport Errors: 0
c0t3d0 Soft Errors: 4001 Hard Errors: 1 Transport Errors: 0
c5t6d0 Soft Errors: 4002 Hard Errors: 0 Transport Errors: 0
c1t6d0 Soft Errors: 4003 Hard Errors: 1 Transport Errors: 0
c4t6d0 Soft Errors: 4002 Hard Errors: 0 Transport Errors: 0
c0t4d0 Soft Errors: 4001 Hard Errors: 0 Transport Errors: 0
c6t5d0 Soft Errors: 4002 Hard Errors: 0 Transport Errors: 0
c5t7d0 Soft Errors: 4000 Hard Errors: 1 Transport Errors: 0
c4t7d0 Soft Errors: 4000 Hard Errors: 0 Transport Errors: 0
c0t5d0 Soft Errors: 4002 Hard Errors: 0 Transport Errors: 0
c6t6d0 Soft Errors: 4002 Hard Errors: 0 Transport Errors: 0
c1t7d0 Soft Errors: 4000 Hard Errors: 0 Transport Errors: 0
c0t6d0 Soft Errors: 4001 Hard Errors: 1 Transport Errors: 0
c6t7d0 Soft Errors: 4000 Hard Errors: 0 Transport Errors: 0
c0t7d0 Soft Errors: 4000 Hard Errors: 0 Transport Errors: 0
c7t0d0 Soft Errors: 4002 Hard Errors: 0 Transport Errors: 0
c7t1d0 Soft Errors: 4003 Hard Errors: 2 Transport Errors: 0
c7t2d0 Soft Errors: 4003 Hard Errors: 1 Transport Errors: 0
c7t3d0 Soft Errors: 4001 Hard Errors: 1 Transport Errors: 0
c7t4d0 Soft Errors: 4002 Hard Errors: 0 Transport Errors: 0
c7t5d0 Soft Errors: 4002 Hard Errors: 0 Transport Errors: 0
c7t6d0 Soft Errors: 4002 Hard Errors: 0 Transport Errors: 0
c7t7d0 Soft Errors: 3997 Hard Errors: 0 Transport Errors: 0
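(To compare these counters across hosts or across reboots, I run a small awk
summary over saved `iostat -En` output; a minimal sketch, with a few sample
lines standing in for the real output -- the field positions match the
per-device format above:)

```shell
# A few saved lines standing in for real `iostat -En` output:
sample='c2t0d0 Soft Errors: 1 Hard Errors: 2 Transport Errors: 0
c5t0d0 Soft Errors: 2805 Hard Errors: 0 Transport Errors: 0
c4t4d0 Soft Errors: 4004 Hard Errors: 6 Transport Errors: 0'

# Fields: $1 = device, $4 = soft error count, $7 = hard error count.
summary=$(printf '%s\n' "$sample" | awk '/Soft Errors/ {
    soft += $4; hard += $7
    if ($7 > 0) printf "%s hard=%d\n", $1, $7
} END { printf "total soft=%d hard=%d\n", soft, hard }')
printf '%s\n' "$summary"
```

On a live box you would pipe `iostat -En` straight into the awk part instead
of using saved sample lines.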
(after an uptime of only a couple of days):
s09:~# uptime
4:27pm up 2 day(s), 21:31, 1 user, load average: 0.17, 0.34, 1.45
s09:~# uname -a
SunOS s09 5.10 Generic_142901-03 i86pc i386 i86pc
We checked these numbers before the upgrade: there were no hard errors and an
order of magnitude fewer soft errors, after tens of days of uptime.
Is anyone aware of a regression when going to 10u8? Might it be ZFS-related,
or can the hardware of three x4500s rot away within days of an upgrade when
the environment has not changed at all?
Thanks a lot in advance for any hint
Carsten
[1] Before, we used 3 vdevs with 15, 15 and 16 disks; now we use 9 vdevs with
5 disks each (plus 2 hot spares).
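(For reference, the new layout would have been created with something like the
following -- a hypothetical reconstruction, with the vdev grouping and the two
spares read off the `zpool status` output above:)

```shell
# Hypothetical reconstruction of the 9 x 5-disk raidz1 layout,
# grouping taken from the zpool status output:
zpool create atlashome \
  raidz1 c0t0d0 c1t0d0 c4t0d0 c6t0d0 c7t0d0 \
  raidz1 c0t1d0 c1t1d0 c4t1d0 c5t1d0 c6t1d0 \
  raidz1 c7t1d0 c0t2d0 c1t2d0 c4t2d0 c5t2d0 \
  raidz1 c6t2d0 c7t2d0 c0t3d0 c1t3d0 c4t3d0 \
  raidz1 c5t3d0 c6t3d0 c7t3d0 c1t4d0 c4t4d0 \
  raidz1 c5t4d0 c6t4d0 c7t4d0 c0t5d0 c1t5d0 \
  raidz1 c4t5d0 c5t5d0 c6t5d0 c7t5d0 c0t6d0 \
  raidz1 c1t6d0 c4t6d0 c5t6d0 c6t6d0 c7t6d0 \
  raidz1 c0t7d0 c1t7d0 c4t7d0 c5t7d0 c6t7d0 \
  spare c0t4d0 c7t7d0
```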