gtirloni at sysdroid.com
2010-Jan-27 12:07 UTC
[zfs-discuss] Strange random errors getting automatically repaired
Hello,
Has anyone ever seen vdevs getting removed and added back to the
pool very quickly? That seems to be what's happening here.
This started happening on dozens of machines at different locations
a few days ago. Most of them are running OpenSolaris b111, with a
few on b126.
Could this be bit rot and/or silent corruption being detected and fixed?
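For anyone wanting to reproduce the check, something like the following
should show whether checksum errors are actually accumulating in the pool
itself ("tank" is just a placeholder for the real pool name):

# kick off a scrub, then watch the per-vdev READ/WRITE/CKSUM columns
# and any files reported as damaged
zpool scrub tank
zpool status -v tank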
Jan 27 01:18:01 hostname fmd: [ID 441519 daemon.notice] SUNW-MSG-ID:
FMD-8000-4M, TYPE: Repair, VER: 1, SEVERITY: Minor
Jan 27 01:18:01 hostname EVENT-TIME: Thu Dec 24 08:50:34 BRST 2009
Jan 27 01:18:01 hostname PLATFORM: X7DB8, CSN: 0123456789, HOSTNAME: hostname
Jan 27 01:18:01 hostname SOURCE: fmd, REV: 1.2
Jan 27 01:18:01 hostname EVENT-ID: 0cb73c5a-d444-ede6-e49f-fce4aad8a1cd
Jan 27 01:18:01 hostname DESC: All faults associated with an event id
have been addressed.
Jan 27 01:18:01 hostname Refer to http://sun.com/msg/FMD-8000-4M for
more information.
Jan 27 01:18:01 hostname AUTO-RESPONSE: Some system components
offlined because of the original fault may have been brought back
online.
Jan 27 01:18:01 hostname IMPACT: Performance degradation of the system
due to the original fault may have been recovered.
Jan 27 01:18:01 hostname REC-ACTION: Use fmdump -v -u <EVENT-ID> to
identify the repaired components.
Jan 27 01:18:01 hostname fmd: [ID 441519 daemon.notice] SUNW-MSG-ID:
FMD-8000-6U, TYPE: Resolved, VER: 1, SEVERITY: Minor
Jan 27 01:18:01 hostname EVENT-TIME: Thu Dec 24 08:50:34 BRST 2009
Jan 27 01:18:01 hostname PLATFORM: X7DB8, CSN: 0123456789, HOSTNAME: hostname
Jan 27 01:18:01 hostname SOURCE: fmd, REV: 1.2
Jan 27 01:18:01 hostname EVENT-ID: 0cb73c5a-d444-ede6-e49f-fce4aad8a1cd
Jan 27 01:18:01 hostname DESC: All faults associated with an event id
have been addressed.
Jan 27 01:18:01 hostname Refer to http://sun.com/msg/FMD-8000-6U for
more information.
Jan 27 01:18:01 hostname AUTO-RESPONSE: All system components offlined
because of the original fault have been brought back online.
Jan 27 01:18:01 hostname IMPACT: Performance degradation of the system
due to the original fault has been recovered.
Jan 27 01:18:01 hostname REC-ACTION: Use fmdump -v -u <EVENT-ID> to
identify the repaired components.
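For reference, the per-event view those messages recommend would be
something like this (UUID taken from the log above; the fmdump output
below covers the same event):
# fmdump -v -u 0cb73c5a-d444-ede6-e49f-fce4aad8a1cd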
# fmdump -e -t 23Jan2010
TIME CLASS
#
# fmdump
TIME UUID SUNW-MSG-ID
Jan 27 01:18:01.2372 0cb73c5a-d444-ede6-e49f-fce4aad8a1cd FMD-8000-4M Repaired
Jan 27 01:18:01.2391 0cb73c5a-d444-ede6-e49f-fce4aad8a1cd FMD-8000-6U Resolved
# fmdump -V
TIME UUID SUNW-MSG-ID
Jan 27 01:18:01.2372 0cb73c5a-d444-ede6-e49f-fce4aad8a1cd FMD-8000-4M Repaired
TIME CLASS ENA
Dec 24 08:50:34.4470 ereport.fs.zfs.vdev.corrupt_data 0x533bf0e964a01801
Dec 23 16:08:42.0738 ereport.fs.zfs.probe_failure 0xe87b448c8ba00c01
Dec 23 16:08:42.0739 ereport.fs.zfs.io 0xe87b446b04f00001
Dec 23 16:08:42.0739 ereport.fs.zfs.io 0xe87b44664b300401
Dec 23 16:08:42.0738 ereport.fs.zfs.io 0xe87b445710a01001
Dec 23 16:08:42.0739 ereport.fs.zfs.io 0xe87b4461a4d00c01
nvlist version: 0
version = 0x0
class = list.repaired
uuid = 0cb73c5a-d444-ede6-e49f-fce4aad8a1cd
code = FMD-8000-4M
diag-time = 1261651834 766268
de = (embedded nvlist)
nvlist version: 0
version = 0x0
scheme = fmd
authority = (embedded nvlist)
nvlist version: 0
version = 0x0
product-id = X7DB8
chassis-id = 0123456789
server-id = hostname
(end authority)
mod-name = fmd
mod-version = 1.2
(end de)
fault-list-sz = 0x1
fault-list = (array of embedded nvlists)
(start fault-list[0])
nvlist version: 0
version = 0x0
class = fault.fs.zfs.device
certainty = 0x64
asru = (embedded nvlist)
nvlist version: 0
version = 0x0
scheme = zfs
pool = 0x9f4842f183c4c7cc
vdev = 0xd207014426714df9
(end asru)
resource = (embedded nvlist)
nvlist version: 0
version = 0x0
scheme = zfs
pool = 0x9f4842f183c4c7cc
vdev = 0xd207014426714df9
(end resource)
(end fault-list[0])
fault-status = 0x6
__ttl = 0x1
__tod = 0x4b5fb069 0xe23eb38
TIME UUID SUNW-MSG-ID
Jan 27 01:18:01.2391 0cb73c5a-d444-ede6-e49f-fce4aad8a1cd FMD-8000-6U Resolved
TIME CLASS ENA
Dec 24 08:50:34.4470 ereport.fs.zfs.vdev.corrupt_data 0x533bf0e964a01801
Dec 23 16:08:42.0738 ereport.fs.zfs.probe_failure 0xe87b448c8ba00c01
Dec 23 16:08:42.0739 ereport.fs.zfs.io 0xe87b446b04f00001
Dec 23 16:08:42.0739 ereport.fs.zfs.io 0xe87b44664b300401
Dec 23 16:08:42.0738 ereport.fs.zfs.io 0xe87b445710a01001
Dec 23 16:08:42.0739 ereport.fs.zfs.io 0xe87b4461a4d00c01
nvlist version: 0
version = 0x0
class = list.resolved
uuid = 0cb73c5a-d444-ede6-e49f-fce4aad8a1cd
code = FMD-8000-6U
diag-time = 1261651834 766268
de = (embedded nvlist)
nvlist version: 0
version = 0x0
scheme = fmd
authority = (embedded nvlist)
nvlist version: 0
version = 0x0
product-id = X7DB8
chassis-id = 0123456789
server-id = hostname
(end authority)
mod-name = fmd
mod-version = 1.2
(end de)
fault-list-sz = 0x1
fault-list = (array of embedded nvlists)
(start fault-list[0])
nvlist version: 0
version = 0x0
class = fault.fs.zfs.device
certainty = 0x64
asru = (embedded nvlist)
nvlist version: 0
version = 0x0
scheme = zfs
pool = 0x9f4842f183c4c7cc
vdev = 0xd207014426714df9
(end asru)
resource = (embedded nvlist)
nvlist version: 0
version = 0x0
scheme = zfs
pool = 0x9f4842f183c4c7cc
vdev = 0xd207014426714df9
(end resource)
(end fault-list[0])
fault-status = 0x6
__ttl = 0x1
__tod = 0x4b5fb069 0xe411fc8
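In case it helps, the fault above only identifies the vdev by GUID. A rough
sketch of mapping it back to a device ("tank" is a placeholder pool name;
zdb prints GUIDs in decimal, hence the conversion):

# convert the hex vdev GUID from the fault to the decimal form zdb uses
# (bc wants upper-case hex digits with ibase=16)
echo 'ibase=16; D207014426714DF9' | bc
# then look for that GUID in the cached pool configuration; the matching
# vdev entry should carry the device path a few lines further down
zdb -C tank | grep -A 4 'guid: <decimal value from bc>'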
Thanks,
--
Giovanni
Mark Bennett
2010-Jan-28 00:11 UTC
[zfs-discuss] Strange random errors getting automatically repaired
Hi Giovanni,

I have seen these while testing the mpt timeout issue, and on other
systems during resilvering of failed disks and while running a scrub.

Once so far on this test scrub, and several on yesterday's.

I checked the iostat errors, and they weren't that high on that device,
compared to other disks.

c2t34d0   ONLINE       0     0     1  25.5K repaired

          ---- errors ---
  s/w h/w trn tot device
    0   8  61  69 c2t30d0
    0   2  17  19 c2t31d0
    0   5  41  46 c2t32d0
    0   5  33  38 c2t33d0
    0   3  31  34 c2t34d0  <<<<<<
    0  10  81  91 c2t35d0
    0   4  22  26 c2t36d0
    0   6  44  50 c2t37d0
    0   3  21  24 c2t38d0
    0   5  49  54 c2t39d0
    0   9  77  86 c2t40d0
    0   6  58  64 c2t41d0
    0   5  50  55 c2t42d0
    0   4  34  38 c2t43d0
    0   6  37  43 c2t44d0
    0   9  75  84 c2t45d0
    0  13  82  95 c2t46d0
    0   7  57  64 c2t47d0
--
This message posted from opensolaris.org
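For a single suspect drive, the detailed counters behind that summary can
be pulled with something like this (device name taken from the list above):

# per-device soft/hard/transport error breakdown plus vendor/model info
iostat -En c2t34d0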
gtirloni at sysdroid.com
2010-Jan-28 12:26 UTC
[zfs-discuss] Strange random errors getting automatically repaired
On Wed, Jan 27, 2010 at 10:11 PM, Mark Bennett <mark.bennett at public.co.nz> wrote:
> Hi Giovanni,
>
> I have seen these while testing the mpt timeout issue, and on other
> systems during resilvering of failed disks and while running a scrub.
>
> Once so far on this test scrub, and several on yesterday's.
>
> I checked the iostat errors, and they weren't that high on that device,
> compared to other disks.
>
> c2t34d0   ONLINE       0     0     1  25.5K repaired

I'm not seeing any errors at all (and the servers are very loaded):

# iostat -eXn
          ---- errors ---
  s/w h/w trn tot device
    0   0   0   0 c3t0d0
    0   0   0   0 c3t1d0
    0   0   0   0 c3t2d0
    0   0   0   0 c3t3d0
    0   0   0   0 c3t4d0
    0   0   0   0 c3t5d0
    0   0   0   0 c3t6d0
    0   0   0   0 c3t7d0
    0   0   0   0 c3t8d0
    0   0   0   0 c3t9d0
    0   0   0   0 c3t10d0
    0   0   0   0 c3t11d0
    0   0   0   0 c3t12d0
    0   0   0   0 c3t13d0
    0   0   0   0 c3t14d0
    0   0   0   0 c3t15d0
    0   0   0   0 c3t16d0
    0   0   0   0 c3t17d0
    0   0   0   0 c3t18d0
    0   0   0   0 c3t19d0
    0   0   0   0 c3t20d0
    0   0   0   0 c3t21d0

Right now this is a mystery, but I'm reading more about FMA and how it
could have decided something was wrong (since I can't find anything in
its error log).

--
Giovanni
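For anyone comparing notes, the FMA side can be poked at with something like:

# resources FMA currently considers faulty
fmadm faulty
# statistics for the fmd modules that handled the events
fmstat
# the raw telemetry behind the error log, in full nvlist form
fmdump -e -V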