Hi,

today I issued a scrub on one of my zpools and after some time I noticed
that one of the vdevs became degraded due to some drive having cksum
errors. The spare kicked in and the drive got resilvered, but why does the
spare drive now also show almost the same number of cksum errors as the
degraded drive?

root at solaris11c:~# zpool status obelixData
  pool: obelixData
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
  scan: resilvered 1,12T in 10h50m with 0 errors on Sun May 27 21:15:32 2012
config:

        NAME                         STATE     READ WRITE CKSUM
        obelixData                   DEGRADED     0     0     0
          mirror-0                   ONLINE       0     0     0
            c9t2100001378AC02DDd1    ONLINE       0     0     0
            c9t2100001378AC02F4d1    ONLINE       0     0     0
          mirror-1                   ONLINE       0     0     0
            c9t2100001378AC02F4d0    ONLINE       0     0     0
            c9t2100001378AC02DDd0    ONLINE       0     0     0
          mirror-2                   ONLINE       0     0     0
            c9t2100001378AC02DDd2    ONLINE       0     0     0
            c9t2100001378AC02F4d2    ONLINE       0     0     0
          mirror-3                   ONLINE       0     0     0
            c9t2100001378AC02DDd3    ONLINE       0     0     0
            c9t2100001378AC02F4d3    ONLINE       0     0     0
          mirror-4                   ONLINE       0     0     0
            c9t2100001378AC02DDd5    ONLINE       0     0     0
            c9t2100001378AC02F4d5    ONLINE       0     0     0
          mirror-5                   ONLINE       0     0     0
            c9t2100001378AC02DDd4    ONLINE       0     0     0
            c9t2100001378AC02F4d4    ONLINE       0     0     0
          mirror-6                   ONLINE       0     0     0
            c9t2100001378AC02DDd6    ONLINE       0     0     0
            c9t2100001378AC02F4d6    ONLINE       0     0     0
          mirror-7                   ONLINE       0     0     0
            c9t2100001378AC02DDd7    ONLINE       0     0     0
            c9t2100001378AC02F4d7    ONLINE       0     0     0
          mirror-8                   ONLINE       0     0     0
            c9t2100001378AC02DDd8    ONLINE       0     0     0
            c9t2100001378AC02F4d8    ONLINE       0     0     0
          mirror-9                   DEGRADED     0     0     0
            c9t2100001378AC02DDd9    ONLINE       0     0     0
            spare-1                  DEGRADED     0     0    10
              c9t2100001378AC02F4d9  DEGRADED     0     0    22  too many errors
              c9t2100001378AC02BFd1  ONLINE       0     0    23
          mirror-10                  ONLINE       0     0     0
            c9t2100001378AC02DDd10   ONLINE       0     0     0
            c9t2100001378AC02F4d10   ONLINE       0     0     0
          mirror-11                  ONLINE       0     0     0
            c9t2100001378AC02DDd11   ONLINE       0     0     0
            c9t2100001378AC02F4d11   ONLINE       0     0     0
          mirror-12                  ONLINE       0     0     0
            c9t2100001378AC02DDd12   ONLINE       0     0     0
            c9t2100001378AC02F4d12   ONLINE       0     0     0
          mirror-13                  ONLINE       0     0     0
            c9t2100001378AC02DDd13   ONLINE       0     0     0
            c9t2100001378AC02F4d13   ONLINE       0     0     0
          mirror-14                  ONLINE       0     0     0
            c9t2100001378AC02DDd14   ONLINE       0     0     0
            c9t2100001378AC02F4d14   ONLINE       0     0     0
        logs
          mirror-15                  ONLINE       0     0     0
            c9t2100001378AC02D9d0    ONLINE       0     0     0
            c9t2100001378AC02BFd0    ONLINE       0     0     0
        spares
          c9t2100001378AC02BFd1      INUSE     currently in use

What would be the best way to proceed? The drive c9t2100001378AC02BFd1 is
the spare drive, which is tagged as ONLINE, but it shows 23 cksum errors,
while the drive that became degraded only shows 22 cksum errors.

What would be the best procedure to continue? Would one now first run
another scrub and detach the degraded drive afterwards, or detach the
degraded drive immediately and run a scrub afterwards?

Thanks,
budy

-- 
Stephan Budach
Jung von Matt/it-services GmbH
Glashüttenstraße 79
20357 Hamburg

Tel: +49 40-4321-1353
Fax: +49 40-4321-1114
E-Mail: stephan.budach at jvm.de
Internet: http://www.jvm.com

Geschäftsführer: Frank Wilhelm, Stephan Budach (stellv.)
AG HH HRB 98380
> From: zfs-discuss-bounces at opensolaris.org
> [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Stephan Budach
>
> today I issued a scrub on one of my zpools and after some time I noticed
> that one of the vdevs became degraded due to some drive having cksum
> errors. The spare kicked in and the drive got resilvered, but why does
> the spare drive now also show almost the same number of cksum errors as
> the degraded drive?
>
> What would be the best way to proceed? The drive c9t2100001378AC02BFd1
> is the spare drive, which is tagged as ONLINE, but it shows 23 cksum
> errors, while the drive that became degraded only shows 22 cksum errors.
>
> What would be the best procedure to continue? Would one now first run
> another scrub and detach the degraded drive afterwards, or detach the
> degraded drive immediately and run a scrub afterwards?

Either you have two bad disks, or you have (or had) a problem somewhere
that can span disks, such as the bus, the host bus adapter, or RAM.
Remember, you don't necessarily need to have the problem now - if there
was a problem in the past and corrupted data got written to disk, then
later (meaning now) you would run your scrub and get cksum errors, because
of having previously written corrupted data.

Your first step should be to look for obvious hardware conditions, like
overheating or huge piles of dust. Assuming you find none, get a new disk,
and zpool replace it. If the problem persists, you have to assume you have
(or had) a problem with something that spans multiple disks.
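In case it helps, a minimal sketch of that replacement step (c9tNEWd9 is a
placeholder for whatever name the new disk gets, not one of the existing
devices):

    zpool replace obelixData c9t2100001378AC02F4d9 c9tNEWd9   # c9tNEWd9 = placeholder for the new disk
    zpool status -v obelixData      # watch the resilver, then re-check the error counters
    zpool clear obelixData          # reset the counters once the hardware is trusted again

My understanding is that once the replacement has resilvered, the in-use
hot spare should detach back to the spares list on its own.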
On May 27, 2012, at 12:52 PM, Stephan Budach wrote:

> Hi,
>
> today I issued a scrub on one of my zpools and after some time I noticed
> that one of the vdevs became degraded due to some drive having cksum
> errors. The spare kicked in and the drive got resilvered, but why does
> the spare drive now also show almost the same number of cksum errors as
> the degraded drive?

The answer is not available via zpool status. You will need to look at
the FMA diagnosis:
    fmadm faulty

More clues can be found in the FMA error reports:
    fmdump -eV

 -- richard

> [... quoted zpool status output and signature trimmed; see the original
> post above ...]

-- 
ZFS Performance and Training
Richard.Elling at RichardElling.com
+1-760-896-4422
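For the archive: the two commands can be chained. Take the EVENT-ID that
fmadm faulty prints and feed it to fmdump, which then shows only the error
reports behind that particular diagnosis:

    fmadm faulty
    fmdump -eV -u <EVENT-ID>    # <EVENT-ID> = the UUID from the fmadm faulty output

(This is exactly what Stephan does further down in the thread.)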
Am 28.05.12 00:35, schrieb Richard Elling:
>
> On May 27, 2012, at 12:52 PM, Stephan Budach wrote:
>
>> Hi,
>>
>> today I issued a scrub on one of my zpools and after some time I
>> noticed that one of the vdevs became degraded due to some drive
>> having cksum errors. The spare kicked in and the drive got
>> resilvered, but why does the spare drive now also show almost the
>> same number of cksum errors as the degraded drive?
>
> The answer is not available via zpool status. You will need to look at
> the FMA diagnosis:
>     fmadm faulty
>
> More clues can be found in the FMA error reports:
>     fmdump -eV

Thanks - I had taken a look at the FMA diagnosis, but hadn't shared it in
my first post. FMA only shows one instance as of yesterday:

root at solaris11c:~# fmadm faulty |less
--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Mai 27 10:24:24 f0601f5f-cb8b-67bc-bd63-e71948ea8428  ZFS-8000-GH    Major

Host        : solaris11c
Platform    : SUN-FIRE-X4170-M2-SERVER   Chassis_id  : 1046FMM0NH
Product_sn  : 1046FMM0NH

Fault class : fault.fs.zfs.vdev.checksum
Affects     : zfs://pool=obelixData/vdev=52e3ca377dbdbec9
                  faulted but still providing degraded service
Problem in  : zfs://pool=obelixData/vdev=52e3ca377dbdbec9
                  faulted but still providing degraded service

Description : The number of checksum errors associated with a ZFS device
              exceeded acceptable levels.  Refer to
              http://sun.com/msg/ZFS-8000-GH for more information.

Response    : The device has been marked as degraded. An attempt will be
              made to activate a hot spare if available.

Impact      : Fault tolerance of the pool may be compromised.

Action      : Run 'zpool status -x' and replace the bad device.

--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Mär 15 16:34:52 5ad04cb0-af03-e84b-cd8a-a07aff7aec2c  PCIEX-8000-J5  Major

I thought this to be the instance when the vdev initially got degraded,
and there have been no more errors afterwards while the resilver took
place, so I tend to think that the spare drive is indeed okay.

Thanks,
budy

> -- richard
>
> [... quoted zpool status output trimmed; see the original post above ...]
Hi all,

just to wrap this issue up: as FMA didn't report any other error than the
one which led to the degradation of the one mirror, I detached the
original drive from the zpool, which flagged the mirror vdev as ONLINE
(although there was still a cksum error count of 23 on the spare drive).

Afterwards I attached the formerly degraded drive again to the good drive
in that mirror and let the resilver finish, which didn't show any errors
at all. Finally I detached the former spare drive and re-added it as a
spare drive again.

Now, I will run a scrub once more to verify the zpool.

Cheers,
budy
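For anyone finding this thread later, the steps described above map
roughly onto the following commands - a sketch reconstructed from the
description, with device names taken from the zpool status output earlier
in the thread, not a verbatim transcript:

    zpool detach obelixData c9t2100001378AC02F4d9       # drop the degraded original; the spare keeps serving mirror-9
    zpool attach obelixData c9t2100001378AC02DDd9 c9t2100001378AC02F4d9
                                                        # re-attach it to the good drive and let it resilver
    zpool detach obelixData c9t2100001378AC02BFd1       # release the former spare from the mirror
    zpool add obelixData spare c9t2100001378AC02BFd1    # put it back on the spares list
    zpool scrub obelixData                              # verify the whole pool again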
On May 28, 2012, at 9:21 PM, Stephan Budach wrote:

> Hi all,
>
> just to wrap this issue up: as FMA didn't report any other error than
> the one which led to the degradation of the one mirror, I detached the
> original drive from the zpool, which flagged the mirror vdev as ONLINE
> (although there was still a cksum error count of 23 on the spare drive).

You showed the result of the FMA diagnosis, but not the error reports.
One feature of the error reports on modern Solaris is that the expected
and reported bit images are described, showing the nature and extent of
the corruption.

> Afterwards I attached the formerly degraded drive again to the good
> drive in that mirror and let the resilver finish, which didn't show any
> errors at all. Finally I detached the former spare drive and re-added
> it as a spare drive again.

Good. Perhaps they were transient.

> Now, I will run a scrub once more to verify the zpool.

Good idea.
 -- richard

-- 
ZFS Performance and Training
Richard.Elling at RichardElling.com
+1-760-896-4422
Hi Richard,

Am 29.05.12 06:54, schrieb Richard Elling:
>
> On May 28, 2012, at 9:21 PM, Stephan Budach wrote:
>
>> Hi all,
>>
>> just to wrap this issue up: as FMA didn't report any other error than
>> the one which led to the degradation of the one mirror, I detached
>> the original drive from the zpool, which flagged the mirror vdev as
>> ONLINE (although there was still a cksum error count of 23 on the
>> spare drive).
>
> You showed the result of the FMA diagnosis, but not the error reports.
> One feature of the error reports on modern Solaris is that the expected
> and reported bit images are described, showing the nature and extent of
> the corruption.

Are you referring to these errors:

root at solaris11c:~# fmdump -e -u f0601f5f-cb8b-67bc-bd63-e71948ea8428
TIME                 CLASS
Mai 27 10:24:23.3654 ereport.fs.zfs.checksum
Mai 27 10:24:23.3652 ereport.fs.zfs.checksum
Mai 27 10:24:23.3650 ereport.fs.zfs.checksum
Mai 27 10:24:23.3648 ereport.fs.zfs.checksum
Mai 27 10:24:23.3646 ereport.fs.zfs.checksum
Mai 27 10:24:23.2696 ereport.fs.zfs.checksum
Mai 27 10:24:23.2694 ereport.fs.zfs.checksum
Mai 27 10:24:23.2692 ereport.fs.zfs.checksum
Mai 27 10:24:23.2690 ereport.fs.zfs.checksum
Mai 27 10:24:23.2688 ereport.fs.zfs.checksum
Mai 27 10:24:23.2686 ereport.fs.zfs.checksum

And to pick one in detail:

root at solaris11c:~# fmdump -eV -u f0601f5f-cb8b-67bc-bd63-e71948ea8428
TIME                           CLASS
Mai 27 2012 10:24:23.365451280 ereport.fs.zfs.checksum
nvlist version: 0
        class = ereport.fs.zfs.checksum
        ena = 0xdfb23b0bc9700001
        detector = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = zfs
                pool = 0x855ebc6738ef6dd6
                vdev = 0x52e3ca377dbdbec9
        (end detector)

        pool = obelixData
        pool_guid = 0x855ebc6738ef6dd6
        pool_context = 0
        pool_failmode = wait
        vdev_guid = 0x52e3ca377dbdbec9
        vdev_type = disk
        vdev_path = /dev/dsk/c9t2100001378AC02F4d9s0
        vdev_devid = id1,sd@n2047001378ac02f4/a
        parent_guid = 0x695bf14bdabd6714
        parent_type = mirror
        zio_err = 50
        zio_offset = 0x2d8b974600
        zio_size = 0x20000
        zio_objset = 0x81ea9
        zio_object = 0x5594
        zio_level = 0
        zio_blkid = 0x3c
        cksum_expected = 0x12869460bd5d 0x49e4661395e6973
            0xc974c2622ce7a035 0x81fe9ef14082a245
        cksum_actual = 0x1bba2b185478 0x707883eac587dd3
            0x54de998365cc6a8d 0x6822e5f4add45237
        cksum_algorithm = fletcher4
        bad_ranges = 0x0 0x20000
        bad_ranges_min_gap = 0x8
        bad_range_sets = 0x357a5
        bad_range_clears = 0x3935b
        bad_set_histogram = 0x8f3 0xdd4 0x52c 0x13d0 0xd76 0xea0 0xec1
            0x100f 0x8f0 0xdc7 0x51e 0x13e7 0xd6b 0xe87 0xf30 0xf9c 0x8cd
            0xddc 0x51a 0x1458 0xd93 0xf0a 0xf04 0x102d 0x8b4 0xdea 0x51a
            0x141d 0xdd3 0xefc 0xf18 0x1003 0x8bc 0xde9 0x52f 0x13a4 0xdd9
            0xf07 0xea2 0x100d 0x8c1 0xdf4 0x4e6 0x1368 0xdce 0xed9 0xf27
            0x1002 0x8bf 0xdf4 0x4fe 0x1396 0xd7d 0xee0 0xf2b 0xfcc 0x8d8
            0xdd7 0x4fc 0x13b8 0xd8e 0xe8b 0xedb 0x100e
        bad_cleared_histogram = 0x0 0x46 0x211a 0xc77 0x124f 0x1146 0x113b
            0x1020 0x0 0x35 0x20df 0xc9f 0x12dc 0x110c 0x10fc 0x1018 0x0
            0x37 0x2103 0xcbb 0x12a9 0x113d 0x1100 0xf8d 0x0 0x35 0x210d
            0xc6e 0x121a 0x1171 0x108f 0x1020 0x0 0x46 0x20ec 0xc3f 0x12ba
            0x10ce 0x1172 0x1009 0x0 0x47 0x20a4 0xc5e 0x129f 0x1102
            0x112e 0x1031 0x0 0x4a 0x20d1 0xc64 0x126b 0x1159 0x111c
            0x1074 0x0 0x3a 0x20ed 0xc5b 0x1245 0x1160 0x111c 0xfc0
        __ttl = 0x1
        __tod = 0x4fc1e4b7 0x15c85810

They were all from the same vdev_path and ranged through these block IDs:

        zio_blkid = 0x3c
        zio_blkid = 0x3e
        zio_blkid = 0x40
        zio_blkid = 0x3a
        zio_blkid = 0x3d
        zio_blkid = 0xf
        zio_blkid = 0xc
        zio_blkid = 0x10
        zio_blkid = 0x12
        zio_blkid = 0x14
        zio_blkid = 0x11

I really was a bit surprised by the cksum errors on the spare drive,
especially since no errors had been logged for the spare drive while it
was resilvering. We'll see what the scrub will tell us.

Thanks,
budy
> From: zfs-discuss-bounces at opensolaris.org
> [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Stephan Budach
>
> Now, I will run a scrub once more to verify the zpool.

If you have a drive (or two drives) with bad sectors, they will only be
detected if the bad sectors actually get used. Given that your pool is
less than 100% full, you might still have bad hardware going undetected
even if you pass your scrub.

You might consider creating a big file (dd if=/dev/zero of=bigfile.junk
bs=1024k) and then, when you're out of disk space, scrubbing again.
(Obviously, you would be unable to make new writes to the pool as long as
it's filled...)

And since certain types of checksum errors will only occur when you
*change* the bits on disk, repeat the same test:

    rm bigfile.junk ; dd if=/dev/urandom of=bigfile.junk bs=1024k

and then scrub again.
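Put together, the cycle would look roughly like this - a sketch only,
assuming the pool is mounted at /obelixData (a guess) and that the
zero-filled blocks actually get written out to disk:

    dd if=/dev/zero of=/obelixData/bigfile.junk bs=1024k      # fill until the pool runs out of space
    zpool scrub obelixData                                    # scrub with (nearly) every sector in use
    rm /obelixData/bigfile.junk
    dd if=/dev/urandom of=/obelixData/bigfile.junk bs=1024k   # rewrite the same space with changing bits
    zpool scrub obelixData
    rm /obelixData/bigfile.junk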
Hi--

You don't say what release this is, but I think that seeing the checksum
error accumulation on the spare was a zpool status formatting bug that I
have seen myself. This is fixed in a later Solaris release.

Thanks,

Cindy

On 05/28/12 22:21, Stephan Budach wrote:
> Hi all,
>
> just to wrap this issue up: as FMA didn't report any other error than
> the one which led to the degradation of the one mirror, I detached the
> original drive from the zpool, which flagged the mirror vdev as ONLINE
> (although there was still a cksum error count of 23 on the spare drive).
>
> Afterwards I attached the formerly degraded drive again to the good
> drive in that mirror and let the resilver finish, which didn't show any
> errors at all. Finally I detached the former spare drive and re-added it
> as a spare drive again.
>
> Now, I will run a scrub once more to verify the zpool.
>
> Cheers,
> budy
On May 29, 2012, at 8:12 AM, Cindy Swearingen wrote:

> Hi--
>
> You don't say what release this is, but I think that seeing the checksum
> error accumulation on the spare was a zpool status formatting bug that I
> have seen myself. This is fixed in a later Solaris release.

Once again, Cindy beats me to it :-)

Verify that the ereports are logged against the original device and not
the spare. If there are no ereports for the spare, then Cindy gets the
prize :-)
 -- richard

> Thanks,
>
> Cindy
>
> [... earlier quoted wrap-up trimmed; see Stephan's post above ...]

-- 
ZFS Performance and Training
Richard.Elling at RichardElling.com
+1-760-896-4422
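One quick way to eyeball that - a sketch, relying on the fact that every
ereport.fs.zfs.checksum record carries a vdev_path field, as seen in the
fmdump output Stephan posted earlier:

    fmdump -eV | grep vdev_path | sort | uniq -c

If the spare's device path never shows up, the 23 errors shown against it
would indeed point to the zpool status formatting bug Cindy mentioned
rather than to a second bad drive.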
Am 29.05.12 18:59, schrieb Richard Elling:
> On May 29, 2012, at 8:12 AM, Cindy Swearingen wrote:
>
>> Hi--
>>
>> You don't say what release this is, but I think that seeing the
>> checksum error accumulation on the spare was a zpool status formatting
>> bug that I have seen myself. This is fixed in a later Solaris release.
>
> Once again, Cindy beats me to it :-)
>
> Verify that the ereports are logged against the original device and not
> the spare. If there are no ereports for the spare, then Cindy gets the
> prize :-)
>  -- richard

Yeah, I verified that the error reports were only logged against the
original device, as I stated earlier, so Cindy wins! :)

Now if I only knew how to get the actual S11 release level of my box.
Neither uname -a nor cat /etc/release gives me a clue, since they display
all the same data when run on different hosts that are on different
updates.

Thanks,
budy
On 2012-May-29 22:04:39 +1000, Edward Ned Harvey
<opensolarisisdeadlongliveopensolaris at nedharvey.com> wrote:
> If you have a drive (or two drives) with bad sectors, they will only be
> detected if the bad sectors actually get used. Given that your pool is
> less than 100% full, you might still have bad hardware going undetected
> even if you pass your scrub.

One way around this is to 'dd' each drive to /dev/null (or do a "long"
test using smartmontools). This ensures that the drive thinks all sectors
are readable.

> You might consider creating a big file (dd if=/dev/zero of=bigfile.junk
> bs=1024k) and then, when you're out of disk space, scrubbing again.
> (Obviously, you would be unable to make new writes to the pool as long
> as it's filled...)

I'm not sure how ZFS handles "no large free blocks", so you might need to
repeat this more than once to fill the disk. This could leave your drive
seriously fragmented. If you do try this, I'd recommend creating a
snapshot first and then rolling back to it, rather than just deleting the
junk file. Also, this (obviously) won't work at all on a filesystem with
compression enabled.

-- 
Peter Jeremy
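For example - a sketch only, using one of the pool's disks as the target;
the smartctl options (in particular -d scsi) may need adjusting for the
drive type and whether smartmontools is installed on this box at all:

    dd if=/dev/rdsk/c9t2100001378AC02F4d9s0 of=/dev/null bs=1024k    # read every sector once
    smartctl -d scsi -t long /dev/rdsk/c9t2100001378AC02F4d9s0       # or let the drive test itself
    smartctl -d scsi -l selftest /dev/rdsk/c9t2100001378AC02F4d9s0   # check the self-test result later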
In message <4FC509E8.8080600 at jvm.de>, Stephan Budach writes:
> Now if I only knew how to get the actual S11 release level of my box.
> Neither uname -a nor cat /etc/release gives me a clue, since they
> display all the same data when run on different hosts that are on
> different updates.

$ pkg info entire

John
groenveld at acm.org