gtirloni at sysdroid.com
2010-Jan-27 12:07 UTC
[zfs-discuss] Strange random errors getting automatically repaired
Hello,
Has anyone ever seen vdevs getting removed and added back to the
pool very quickly? That seems to be what's happening here.
This started happening on dozens of machines at different locations
a few days ago. Most of them are running OpenSolaris b111, with a
few on b126.
Could this be bit rot and/or silent corruption being detected and fixed?
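For anyone wanting to reproduce the check, something like the following
should show whether checksum errors are actually accumulating in the pool
itself ("tank" is just a placeholder for the real pool name):

# kick off a scrub, then watch the per-vdev READ/WRITE/CKSUM columns
# and any files reported as damaged
zpool scrub tank
zpool status -v tank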
Jan 27 01:18:01 hostname fmd: [ID 441519 daemon.notice] SUNW-MSG-ID:
FMD-8000-4M, TYPE: Repair, VER: 1, SEVERITY: Minor
Jan 27 01:18:01 hostname EVENT-TIME: Thu Dec 24 08:50:34 BRST 2009
Jan 27 01:18:01 hostname PLATFORM: X7DB8, CSN: 0123456789, HOSTNAME: hostname
Jan 27 01:18:01 hostname SOURCE: fmd, REV: 1.2
Jan 27 01:18:01 hostname EVENT-ID: 0cb73c5a-d444-ede6-e49f-fce4aad8a1cd
Jan 27 01:18:01 hostname DESC: All faults associated with an event id
have been addressed.
Jan 27 01:18:01 hostname Refer to http://sun.com/msg/FMD-8000-4M for
more information.
Jan 27 01:18:01 hostname AUTO-RESPONSE: Some system components
offlined because of the original fault may have been brought back
online.
Jan 27 01:18:01 hostname IMPACT: Performance degradation of the system
due to the original fault may have been recovered.
Jan 27 01:18:01 hostname REC-ACTION: Use fmdump -v -u <EVENT-ID> to
identify the repaired components.
Jan 27 01:18:01 hostname fmd: [ID 441519 daemon.notice] SUNW-MSG-ID:
FMD-8000-6U, TYPE: Resolved, VER: 1, SEVERITY: Minor
Jan 27 01:18:01 hostname EVENT-TIME: Thu Dec 24 08:50:34 BRST 2009
Jan 27 01:18:01 hostname PLATFORM: X7DB8, CSN: 0123456789, HOSTNAME: hostname
Jan 27 01:18:01 hostname SOURCE: fmd, REV: 1.2
Jan 27 01:18:01 hostname EVENT-ID: 0cb73c5a-d444-ede6-e49f-fce4aad8a1cd
Jan 27 01:18:01 hostname DESC: All faults associated with an event id
have been addressed.
Jan 27 01:18:01 hostname Refer to http://sun.com/msg/FMD-8000-6U for
more information.
Jan 27 01:18:01 hostname AUTO-RESPONSE: All system components offlined
because of the original fault have been brought back online.
Jan 27 01:18:01 hostname IMPACT: Performance degradation of the system
due to the original fault has been recovered.
Jan 27 01:18:01 hostname REC-ACTION: Use fmdump -v -u <EVENT-ID> to
identify the repaired components.
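For reference, the per-event view those messages recommend would be
something like this (UUID taken from the log above; the fmdump output
below covers the same event):
# fmdump -v -u 0cb73c5a-d444-ede6-e49f-fce4aad8a1cd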
# fmdump -e -t 23Jan2010
TIME CLASS
#
# fmdump
TIME UUID SUNW-MSG-ID
Jan 27 01:18:01.2372 0cb73c5a-d444-ede6-e49f-fce4aad8a1cd FMD-8000-4M Repaired
Jan 27 01:18:01.2391 0cb73c5a-d444-ede6-e49f-fce4aad8a1cd FMD-8000-6U Resolved
# fmdump -V
TIME UUID SUNW-MSG-ID
Jan 27 01:18:01.2372 0cb73c5a-d444-ede6-e49f-fce4aad8a1cd FMD-8000-4M Repaired
TIME CLASS ENA
Dec 24 08:50:34.4470 ereport.fs.zfs.vdev.corrupt_data 0x533bf0e964a01801
Dec 23 16:08:42.0738 ereport.fs.zfs.probe_failure 0xe87b448c8ba00c01
Dec 23 16:08:42.0739 ereport.fs.zfs.io 0xe87b446b04f00001
Dec 23 16:08:42.0739 ereport.fs.zfs.io 0xe87b44664b300401
Dec 23 16:08:42.0738 ereport.fs.zfs.io 0xe87b445710a01001
Dec 23 16:08:42.0739 ereport.fs.zfs.io 0xe87b4461a4d00c01
nvlist version: 0
version = 0x0
class = list.repaired
uuid = 0cb73c5a-d444-ede6-e49f-fce4aad8a1cd
code = FMD-8000-4M
diag-time = 1261651834 766268
de = (embedded nvlist)
nvlist version: 0
version = 0x0
scheme = fmd
authority = (embedded nvlist)
nvlist version: 0
version = 0x0
product-id = X7DB8
chassis-id = 0123456789
server-id = hostname
(end authority)
mod-name = fmd
mod-version = 1.2
(end de)
fault-list-sz = 0x1
fault-list = (array of embedded nvlists)
(start fault-list[0])
nvlist version: 0
version = 0x0
class = fault.fs.zfs.device
certainty = 0x64
asru = (embedded nvlist)
nvlist version: 0
version = 0x0
scheme = zfs
pool = 0x9f4842f183c4c7cc
vdev = 0xd207014426714df9
(end asru)
resource = (embedded nvlist)
nvlist version: 0
version = 0x0
scheme = zfs
pool = 0x9f4842f183c4c7cc
vdev = 0xd207014426714df9
(end resource)
(end fault-list[0])
fault-status = 0x6
__ttl = 0x1
__tod = 0x4b5fb069 0xe23eb38
TIME UUID SUNW-MSG-ID
Jan 27 01:18:01.2391 0cb73c5a-d444-ede6-e49f-fce4aad8a1cd FMD-8000-6U Resolved
TIME CLASS ENA
Dec 24 08:50:34.4470 ereport.fs.zfs.vdev.corrupt_data 0x533bf0e964a01801
Dec 23 16:08:42.0738 ereport.fs.zfs.probe_failure 0xe87b448c8ba00c01
Dec 23 16:08:42.0739 ereport.fs.zfs.io 0xe87b446b04f00001
Dec 23 16:08:42.0739 ereport.fs.zfs.io 0xe87b44664b300401
Dec 23 16:08:42.0738 ereport.fs.zfs.io 0xe87b445710a01001
Dec 23 16:08:42.0739 ereport.fs.zfs.io 0xe87b4461a4d00c01
nvlist version: 0
version = 0x0
class = list.resolved
uuid = 0cb73c5a-d444-ede6-e49f-fce4aad8a1cd
code = FMD-8000-6U
diag-time = 1261651834 766268
de = (embedded nvlist)
nvlist version: 0
version = 0x0
scheme = fmd
authority = (embedded nvlist)
nvlist version: 0
version = 0x0
product-id = X7DB8
chassis-id = 0123456789
server-id = hostname
(end authority)
mod-name = fmd
mod-version = 1.2
(end de)
fault-list-sz = 0x1
fault-list = (array of embedded nvlists)
(start fault-list[0])
nvlist version: 0
version = 0x0
class = fault.fs.zfs.device
certainty = 0x64
asru = (embedded nvlist)
nvlist version: 0
version = 0x0
scheme = zfs
pool = 0x9f4842f183c4c7cc
vdev = 0xd207014426714df9
(end asru)
resource = (embedded nvlist)
nvlist version: 0
version = 0x0
scheme = zfs
pool = 0x9f4842f183c4c7cc
vdev = 0xd207014426714df9
(end resource)
(end fault-list[0])
fault-status = 0x6
__ttl = 0x1
__tod = 0x4b5fb069 0xe411fc8
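In case it helps, the fault above only identifies the vdev by GUID. A rough
sketch of mapping it back to a device ("tank" is a placeholder pool name;
zdb prints GUIDs in decimal, hence the conversion):

# convert the hex vdev GUID from the fault to the decimal form zdb uses
# (bc wants upper-case hex digits with ibase=16)
echo 'ibase=16; D207014426714DF9' | bc
# then look for that GUID in the cached pool configuration; the matching
# vdev entry should carry the device path a few lines further down
zdb -C tank | grep -A 4 'guid: <decimal value from bc>'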
Thanks,
--
Giovanni
Mark Bennett
2010-Jan-28 00:11 UTC
[zfs-discuss] Strange random errors getting automatically repaired
Hi Giovanni,

I have seen these while testing the mpt timeout issue, and on other
systems during resilvering of failed disks and while running a scrub.

Once so far on this test scrub, and several on yesterday's.

I checked the iostat errors, and they weren't that high on that device,
compared to other disks.

c2t34d0   ONLINE       0     0     1  25.5K repaired

          ---- errors ---
  s/w h/w trn tot device
    0   8  61  69 c2t30d0
    0   2  17  19 c2t31d0
    0   5  41  46 c2t32d0
    0   5  33  38 c2t33d0
    0   3  31  34 c2t34d0  <<<<<<
    0  10  81  91 c2t35d0
    0   4  22  26 c2t36d0
    0   6  44  50 c2t37d0
    0   3  21  24 c2t38d0
    0   5  49  54 c2t39d0
    0   9  77  86 c2t40d0
    0   6  58  64 c2t41d0
    0   5  50  55 c2t42d0
    0   4  34  38 c2t43d0
    0   6  37  43 c2t44d0
    0   9  75  84 c2t45d0
    0  13  82  95 c2t46d0
    0   7  57  64 c2t47d0
--
This message posted from opensolaris.org
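For a single suspect drive, the detailed counters behind that summary can
be pulled with something like this (device name taken from the list above):

# per-device soft/hard/transport error breakdown plus vendor/model info
iostat -En c2t34d0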
gtirloni at sysdroid.com
2010-Jan-28 12:26 UTC
[zfs-discuss] Strange random errors getting automatically repaired
On Wed, Jan 27, 2010 at 10:11 PM, Mark Bennett <mark.bennett at public.co.nz> wrote:
> Hi Giovanni,
>
> I have seen these while testing the mpt timeout issue, and on other
> systems during resilvering of failed disks and while running a scrub.
>
> Once so far on this test scrub, and several on yesterday's.
>
> I checked the iostat errors, and they weren't that high on that device,
> compared to other disks.
>
> c2t34d0   ONLINE       0     0     1  25.5K repaired

I'm not seeing any errors at all (and the servers are very loaded):

# iostat -eXn
          ---- errors ---
  s/w h/w trn tot device
    0   0   0   0 c3t0d0
    0   0   0   0 c3t1d0
    0   0   0   0 c3t2d0
    0   0   0   0 c3t3d0
    0   0   0   0 c3t4d0
    0   0   0   0 c3t5d0
    0   0   0   0 c3t6d0
    0   0   0   0 c3t7d0
    0   0   0   0 c3t8d0
    0   0   0   0 c3t9d0
    0   0   0   0 c3t10d0
    0   0   0   0 c3t11d0
    0   0   0   0 c3t12d0
    0   0   0   0 c3t13d0
    0   0   0   0 c3t14d0
    0   0   0   0 c3t15d0
    0   0   0   0 c3t16d0
    0   0   0   0 c3t17d0
    0   0   0   0 c3t18d0
    0   0   0   0 c3t19d0
    0   0   0   0 c3t20d0
    0   0   0   0 c3t21d0

Right now this is a mystery, but I'm reading more about FMA and how it
could have decided something was wrong (since I can't find anything in
its error log).

--
Giovanni
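For anyone comparing notes, the FMA side can be poked at with something like:

# resources FMA currently considers faulty
fmadm faulty
# statistics for the fmd modules that handled the events
fmstat
# the raw telemetry behind the error log, in full nvlist form
fmdump -e -V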