gtirloni at sysdroid.com
2010-Jan-27 12:07 UTC
[zfs-discuss] Strange random errors getting automatically repaired
Hello,

Has anyone ever seen vdevs getting removed and added back to the pool very quickly? That seems to be what's happening here. It started happening on dozens of machines at different locations a few days ago. They are running OpenSolaris b111, and a few b126. Could this be bit rot and/or silent corruption being detected and fixed?

Jan 27 01:18:01 hostname fmd: [ID 441519 daemon.notice] SUNW-MSG-ID: FMD-8000-4M, TYPE: Repair, VER: 1, SEVERITY: Minor
Jan 27 01:18:01 hostname EVENT-TIME: Thu Dec 24 08:50:34 BRST 2009
Jan 27 01:18:01 hostname PLATFORM: X7DB8, CSN: 0123456789, HOSTNAME: hostname
Jan 27 01:18:01 hostname SOURCE: fmd, REV: 1.2
Jan 27 01:18:01 hostname EVENT-ID: 0cb73c5a-d444-ede6-e49f-fce4aad8a1cd
Jan 27 01:18:01 hostname DESC: All faults associated with an event id have been addressed.
Jan 27 01:18:01 hostname Refer to http://sun.com/msg/FMD-8000-4M for more information.
Jan 27 01:18:01 hostname AUTO-RESPONSE: Some system components offlined because of the original fault may have been brought back online.
Jan 27 01:18:01 hostname IMPACT: Performance degradation of the system due to the original fault may have been recovered.
Jan 27 01:18:01 hostname REC-ACTION: Use fmdump -v -u <EVENT-ID> to identify the repaired components.
Jan 27 01:18:01 hostname fmd: [ID 441519 daemon.notice] SUNW-MSG-ID: FMD-8000-6U, TYPE: Resolved, VER: 1, SEVERITY: Minor
Jan 27 01:18:01 hostname EVENT-TIME: Thu Dec 24 08:50:34 BRST 2009
Jan 27 01:18:01 hostname PLATFORM: X7DB8, CSN: 0123456789, HOSTNAME: hostname
Jan 27 01:18:01 hostname SOURCE: fmd, REV: 1.2
Jan 27 01:18:01 hostname EVENT-ID: 0cb73c5a-d444-ede6-e49f-fce4aad8a1cd
Jan 27 01:18:01 hostname DESC: All faults associated with an event id have been addressed.
Jan 27 01:18:01 hostname Refer to http://sun.com/msg/FMD-8000-6U for more information.
Jan 27 01:18:01 hostname AUTO-RESPONSE: All system components offlined because of the original fault have been brought back online.
Jan 27 01:18:01 hostname IMPACT: Performance degradation of the system due to the original fault has been recovered.
Jan 27 01:18:01 hostname REC-ACTION: Use fmdump -v -u <EVENT-ID> to identify the repaired components.

# fmdump -e -t 23Jan2010
TIME                 CLASS

# fmdump
TIME                 UUID                                 SUNW-MSG-ID
Jan 27 01:18:01.2372 0cb73c5a-d444-ede6-e49f-fce4aad8a1cd FMD-8000-4M Repaired
Jan 27 01:18:01.2391 0cb73c5a-d444-ede6-e49f-fce4aad8a1cd FMD-8000-6U Resolved

# fmdump -V
TIME                 UUID                                 SUNW-MSG-ID
Jan 27 01:18:01.2372 0cb73c5a-d444-ede6-e49f-fce4aad8a1cd FMD-8000-4M Repaired

  TIME                 CLASS                             ENA
  Dec 24 08:50:34.4470 ereport.fs.zfs.vdev.corrupt_data  0x533bf0e964a01801
  Dec 23 16:08:42.0738 ereport.fs.zfs.probe_failure      0xe87b448c8ba00c01
  Dec 23 16:08:42.0739 ereport.fs.zfs.io                 0xe87b446b04f00001
  Dec 23 16:08:42.0739 ereport.fs.zfs.io                 0xe87b44664b300401
  Dec 23 16:08:42.0738 ereport.fs.zfs.io                 0xe87b445710a01001
  Dec 23 16:08:42.0739 ereport.fs.zfs.io                 0xe87b4461a4d00c01

nvlist version: 0
        version = 0x0
        class = list.repaired
        uuid = 0cb73c5a-d444-ede6-e49f-fce4aad8a1cd
        code = FMD-8000-4M
        diag-time = 1261651834 766268
        de = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = fmd
                authority = (embedded nvlist)
                nvlist version: 0
                        version = 0x0
                        product-id = X7DB8
                        chassis-id = 0123456789
                        server-id = hostname
                (end authority)
                mod-name = fmd
                mod-version = 1.2
        (end de)
        fault-list-sz = 0x1
        fault-list = (array of embedded nvlists)
        (start fault-list[0])
        nvlist version: 0
                version = 0x0
                class = fault.fs.zfs.device
                certainty = 0x64
                asru = (embedded nvlist)
                nvlist version: 0
                        version = 0x0
                        scheme = zfs
                        pool = 0x9f4842f183c4c7cc
                        vdev = 0xd207014426714df9
                (end asru)
                resource = (embedded nvlist)
                nvlist version: 0
                        version = 0x0
                        scheme = zfs
                        pool = 0x9f4842f183c4c7cc
                        vdev = 0xd207014426714df9
                (end resource)
        (end fault-list[0])
        fault-status = 0x6
        __ttl = 0x1
        __tod = 0x4b5fb069 0xe23eb38

TIME                 UUID                                 SUNW-MSG-ID
Jan 27 01:18:01.2391 0cb73c5a-d444-ede6-e49f-fce4aad8a1cd FMD-8000-6U Resolved

  TIME                 CLASS                             ENA
  Dec 24 08:50:34.4470 ereport.fs.zfs.vdev.corrupt_data  0x533bf0e964a01801
  Dec 23 16:08:42.0738 ereport.fs.zfs.probe_failure      0xe87b448c8ba00c01
  Dec 23 16:08:42.0739 ereport.fs.zfs.io                 0xe87b446b04f00001
  Dec 23 16:08:42.0739 ereport.fs.zfs.io                 0xe87b44664b300401
  Dec 23 16:08:42.0738 ereport.fs.zfs.io                 0xe87b445710a01001
  Dec 23 16:08:42.0739 ereport.fs.zfs.io                 0xe87b4461a4d00c01

nvlist version: 0
        version = 0x0
        class = list.resolved
        uuid = 0cb73c5a-d444-ede6-e49f-fce4aad8a1cd
        code = FMD-8000-6U
        diag-time = 1261651834 766268
        de = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = fmd
                authority = (embedded nvlist)
                nvlist version: 0
                        version = 0x0
                        product-id = X7DB8
                        chassis-id = 0123456789
                        server-id = hostname
                (end authority)
                mod-name = fmd
                mod-version = 1.2
        (end de)
        fault-list-sz = 0x1
        fault-list = (array of embedded nvlists)
        (start fault-list[0])
        nvlist version: 0
                version = 0x0
                class = fault.fs.zfs.device
                certainty = 0x64
                asru = (embedded nvlist)
                nvlist version: 0
                        version = 0x0
                        scheme = zfs
                        pool = 0x9f4842f183c4c7cc
                        vdev = 0xd207014426714df9
                (end asru)
                resource = (embedded nvlist)
                nvlist version: 0
                        version = 0x0
                        scheme = zfs
                        pool = 0x9f4842f183c4c7cc
                        vdev = 0xd207014426714df9
                (end resource)
        (end fault-list[0])
        fault-status = 0x6
        __ttl = 0x1
        __tod = 0x4b5fb069 0xe411fc8

Thanks,

--
Giovanni
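[Editor's note: for anyone chasing the same symptom, the pool and vdev GUIDs buried in the fmdump -V nvlists above can be pulled out with a short filter so they can be matched against the pool configuration (e.g. with zdb). This is a generic sketch, not part of the original post; guid_fields is a made-up helper name, and the script only assumes the "pool = 0x..." / "vdev = 0x..." line layout shown above.]

```shell
# guid_fields: read `fmdump -V` output on stdin and print the unique
# pool/vdev GUID pairs it mentions. Hypothetical helper, not a Solaris
# tool; assumes lines of the form "pool = 0x..." and "vdev = 0x..."
# as in the nvlist dump above.
guid_fields() {
    awk '$1 == "pool" || $1 == "vdev" { print $1, $3 }' | sort -u
}

# Example (on the affected host, as root):
#   fmdump -V -u 0cb73c5a-d444-ede6-e49f-fce4aad8a1cd | guid_fields
```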
Mark Bennett
2010-Jan-28 00:11 UTC
[zfs-discuss] Strange random errors getting automatically repaired
Hi Giovanni,

I have seen these while testing the mpt timeout issue, and on other systems during resilvering of failed disks and while running a scrub.

Once so far on this test scrub, and several on yesterday's.

I checked the iostat errors, and they weren't that high on that device compared to other disks.

c2t34d0  ONLINE       0     0     1  25.5K repaired

---- errors ---
s/w h/w trn tot device
  0   8  61  69 c2t30d0
  0   2  17  19 c2t31d0
  0   5  41  46 c2t32d0
  0   5  33  38 c2t33d0
  0   3  31  34 c2t34d0  <<<<<<
  0  10  81  91 c2t35d0
  0   4  22  26 c2t36d0
  0   6  44  50 c2t37d0
  0   3  21  24 c2t38d0
  0   5  49  54 c2t39d0
  0   9  77  86 c2t40d0
  0   6  58  64 c2t41d0
  0   5  50  55 c2t42d0
  0   4  34  38 c2t43d0
  0   6  37  43 c2t44d0
  0   9  75  84 c2t45d0
  0  13  82  95 c2t46d0
  0   7  57  64 c2t47d0

-- This message posted from opensolaris.org
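[Editor's note: when scanning error tables like the one above across many disks, a small filter helps surface the outliers. A minimal sketch; the flag_trn name and the default threshold of 50 are arbitrary choices for illustration, and the column order (s/w h/w trn tot device) is taken from the output above.]

```shell
# flag_trn: read `iostat -eXn` error-table output on stdin and print
# devices whose transport ("trn") error count exceeds a threshold
# (first argument, default 50 -- an arbitrary illustration value).
# Assumes the column layout shown above: s/w h/w trn tot device.
flag_trn() {
    awk -v t="${1:-50}" 'NF == 5 && $1 ~ /^[0-9]+$/ && $3 + 0 > t + 0 {
        printf "%s: %d transport errors\n", $5, $3
    }'
}

# Example: iostat -eXn | flag_trn 50
```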
gtirloni at sysdroid.com
2010-Jan-28 12:26 UTC
[zfs-discuss] Strange random errors getting automatically repaired
On Wed, Jan 27, 2010 at 10:11 PM, Mark Bennett <mark.bennett at public.co.nz> wrote:
> Hi Giovanni,
>
> I have seen these while testing the mpt timeout issue, and on other systems
> during resilvering of failed disks and while running a scrub.
>
> Once so far on this test scrub, and several on yesterday's.
>
> I checked the iostat errors, and they weren't that high on that device,
> compared to other disks.
>
> c2t34d0  ONLINE       0     0     1  25.5K repaired

I'm not seeing any errors at all (and the servers are very loaded):

# iostat -eXn
---- errors ---
s/w h/w trn tot device
  0   0   0   0 c3t0d0
  0   0   0   0 c3t1d0
  0   0   0   0 c3t2d0
  0   0   0   0 c3t3d0
  0   0   0   0 c3t4d0
  0   0   0   0 c3t5d0
  0   0   0   0 c3t6d0
  0   0   0   0 c3t7d0
  0   0   0   0 c3t8d0
  0   0   0   0 c3t9d0
  0   0   0   0 c3t10d0
  0   0   0   0 c3t11d0
  0   0   0   0 c3t12d0
  0   0   0   0 c3t13d0
  0   0   0   0 c3t14d0
  0   0   0   0 c3t15d0
  0   0   0   0 c3t16d0
  0   0   0   0 c3t17d0
  0   0   0   0 c3t18d0
  0   0   0   0 c3t19d0
  0   0   0   0 c3t20d0
  0   0   0   0 c3t21d0

Right now this is a mystery, but I'm reading more about FMA and how it could have decided something was wrong (since I can't find anything in its error log).

--
Giovanni