Bill Sommerfeld
2008-Jul-18 00:34 UTC
[zfs-discuss] checksum errors on root pool after upgrade to snv_94
I ran a scrub on a root pool after upgrading to snv_94, and got checksum errors:

  pool: r00t
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub completed after 0h26m with 1 errors on Thu Jul 17 14:52:14 2008
config:

        NAME          STATE     READ WRITE CKSUM
        r00t          ONLINE       0     0     2
          mirror      ONLINE       0     0     2
            c4t0d0s0  ONLINE       0     0     4
            c4t1d0s0  ONLINE       0     0     4

I ran it again, and it's now reporting the same errors, but still says "applications are unaffected":

  pool: r00t
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub completed after 0h27m with 2 errors on Thu Jul 17 20:24:15 2008
config:

        NAME          STATE     READ WRITE CKSUM
        r00t          ONLINE       0     0     4
          mirror      ONLINE       0     0     4
            c4t0d0s0  ONLINE       0     0     8
            c4t1d0s0  ONLINE       0     0     8

errors: No known data errors

I wonder if I'm running into some combination of:

6725341 Running 'zpool scrub' repeatedly on a pool show an ever increasing error count

and maybe:

6437568 ditto block repair is incorrectly propagated to root vdev

Any way to dig further to determine what's going on?

- Bill
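For readers who want to compare two such reports mechanically, here is a minimal Python sketch that pulls the CKSUM column out of the config table and prints how much each counter grew between the two scrubs. The function name `cksum_counts` and the regular expression are illustrative assumptions based only on the table layout shown in this message; this is not part of any ZFS tool.

```python
import re

# One row of the "config:" table: vdev name, state, then READ/WRITE/CKSUM.
# Assumption: the layout matches the `zpool status` output quoted above.
ROW = re.compile(
    r"^\s*(\S+)\s+(ONLINE|DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED)"
    r"\s+(\d+)\s+(\d+)\s+(\d+)"
)

def cksum_counts(status_text):
    """Return {vdev_name: CKSUM counter} parsed from `zpool status` text.

    Vdevs that share a name (e.g. several "mirror" rows) overwrite each
    other; that is good enough for the single-mirror pool shown here.
    """
    counts = {}
    for line in status_text.splitlines():
        m = ROW.match(line)
        if m:
            counts[m.group(1)] = int(m.group(5))
    return counts

first = """\
        NAME          STATE     READ WRITE CKSUM
        r00t          ONLINE       0     0     2
          mirror      ONLINE       0     0     2
            c4t0d0s0  ONLINE       0     0     4
            c4t1d0s0  ONLINE       0     0     4
"""

second = """\
        NAME          STATE     READ WRITE CKSUM
        r00t          ONLINE       0     0     4
          mirror      ONLINE       0     0     4
            c4t0d0s0  ONLINE       0     0     8
            c4t1d0s0  ONLINE       0     0     8
"""

before = cksum_counts(first)
after = cksum_counts(second)
# Per-vdev growth between the first and second scrub.
growth = {name: after[name] - before[name] for name in before}
print(growth)
```

Each counter grows by exactly its first-scrub value, which fits a scrub that re-counts the same errors on every run rather than newly failing blocks.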
Jürgen Keil
2008-Jul-18 17:28 UTC
[zfs-discuss] checksum errors on root pool after upgrade to snv_94
> I ran a scrub on a root pool after upgrading to snv_94, and got checksum errors:

Hmm, after reading this, I started a zpool scrub on my mirrored pool, on a system that is running post snv_94 bits: It also found checksum errors

# zpool status files
  pool: files
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub completed after 0h46m with 9 errors on Fri Jul 18 13:33:56 2008
config:

        NAME          STATE     READ WRITE CKSUM
        files         DEGRADED     0     0    18
          mirror      DEGRADED     0     0    18
            c8t0d0s6  DEGRADED     0     0    36  too many errors
            c9t0d0s6  DEGRADED     0     0    36  too many errors

errors: No known data errors

Adding the -v option to zpool status returned:

errors: Permanent errors have been detected in the following files:

        <metadata>:<0x0>

OTOH, trying to verify checksums with zdb -c didn't find any problems:

# zdb -cvv files

Traversing all blocks to verify checksums and verify nothing leaked ...

        No leaks (block sum matches space maps exactly)

        bp count:          2804880
        bp logical:   121461614592   avg: 43303
        bp physical:   84585684992   avg: 30156   compression: 1.44
        bp allocated:  85146115584   avg: 30356   compression: 1.43
        SPA allocated: 85146115584   used: 79.30%

951.08u 419.55s 2:24:34.32 15.8%
#

This message posted from opensolaris.org
Jürgen Keil
2008-Jul-18 18:28 UTC
[zfs-discuss] checksum errors on root pool after upgrade to snv_94
> > I ran a scrub on a root pool after upgrading to snv_94, and got checksum errors:
>
> Hmm, after reading this, I started a zpool scrub on my mirrored pool,
> on a system that is running post snv_94 bits: It also found checksum errors
...
> OTOH, trying to verify checksums with zdb -c didn't
> find any problems:

And a zpool scrub under snv_85 doesn't find checksum errors, either.
Rustam Aliyev
2008-Jul-18 19:49 UTC
[zfs-discuss] checksum errors on root pool after upgrade to snv_94
I'm living with this error for almost 4 months and probably have record number of checksum errors:

core# zpool status -xv
  pool: box5
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        box5        ONLINE       0     0   856
          mirror    ONLINE       0     0   428
            c1d0    ONLINE       0     0   856
            c2d0    ONLINE       0     0   856
          mirror    ONLINE       0     0   428
            c2d1    ONLINE       0     0   856
            c1d1    ONLINE       0     0   856

errors: Permanent errors have been detected in the following files:

        box5:<0x0>

I've Sol 10 U5 though.

--
Rustam.
Miles Nordin
2008-Jul-20 11:26 UTC
[zfs-discuss] checksum errors on root pool after upgrade to snv_94
>>>>> "jk" == Jürgen Keil <jk at tools.de> writes:

    jk> And a zpool scrub under snv_85 doesn't find checksum errors,
    jk> either.

how about a second scrub with snv_94? are the checksum errors gone the second time around?

I get checksum errors counted all the time when it is really just resilvering.
Bill Sommerfeld
2008-Jul-20 18:26 UTC
[zfs-discuss] checksum errors on root pool after upgrade to snv_94
On Fri, 2008-07-18 at 10:28 -0700, Jürgen Keil wrote:

> > I ran a scrub on a root pool after upgrading to snv_94, and got checksum errors:
>
> Hmm, after reading this, I started a zpool scrub on my mirrored pool,
> on a system that is running post snv_94 bits: It also found checksum errors
>
> # zpool status files
>   pool: files
>  state: DEGRADED
> status: One or more devices has experienced an unrecoverable error.  An
>         attempt was made to correct the error.  Applications are unaffected.
> action: Determine if the device needs to be replaced, and clear the errors
>         using 'zpool clear' or replace the device with 'zpool replace'.
>    see: http://www.sun.com/msg/ZFS-8000-9P
>  scrub: scrub completed after 0h46m with 9 errors on Fri Jul 18 13:33:56 2008
> config:
>
>         NAME          STATE     READ WRITE CKSUM
>         files         DEGRADED     0     0    18
>           mirror      DEGRADED     0     0    18
>             c8t0d0s6  DEGRADED     0     0    36  too many errors
>             c9t0d0s6  DEGRADED     0     0    36  too many errors
>
> errors: No known data errors

out of curiosity, is this a root pool?

A second system of mine with a mirrored root pool (and an additional large multi-raidz pool) shows the same symptoms on the mirrored root pool only.

once is accident. twice is coincidence. three times is enemy action :-)

I'll file a bug as soon as I can (I'm travelling at the moment with spotty connectivity), citing my and your reports.

- Bill
dick hoogendijk
2008-Jul-20 18:43 UTC
[zfs-discuss] checksum errors on root pool after upgrade to snv_94
On Sun, 20 Jul 2008 11:26:16 -0700 Bill Sommerfeld <sommerfeld at sun.com> wrote:

> once is accident. twice is coincidence. three times is enemy
> action :-)

I have no access to b94 yet, but as it is, it probably is better to skip this one when it comes out then.

--
Dick Hoogendijk -- PGP/GnuPG key: 01D2433D
++ http://nagual.nl/ + SunOS sxce snv91 ++
Jürgen Keil
2008-Jul-21 08:28 UTC
[zfs-discuss] checksum errors on root pool after upgrade to snv_94
Miles Nordin wrote:

> "jk" == Jürgen Keil <jk at tools.de> writes:
> jk> And a zpool scrub under snv_85 doesn't find checksum errors, either.
> how about a second scrub with snv_94? are the checksum errors gone
> the second time around?

Nope.

I've now seen this problem on 4 zpools on three different systems. Post snv_94 (bfu'ed) reports checksum errors during scrub, and the scrub under the original nevada release (snv_85, snv_89 and snv_91) didn't report checksum errors.
Jürgen Keil
2008-Jul-21 09:18 UTC
[zfs-discuss] checksum errors on root pool after upgrade to snv_94
Bill Sommerfeld wrote:

> On Fri, 2008-07-18 at 10:28 -0700, Jürgen Keil wrote:
> > > I ran a scrub on a root pool after upgrading to snv_94, and got checksum errors:
> >
> > Hmm, after reading this, I started a zpool scrub on my mirrored pool,
> > on a system that is running post snv_94 bits: It also found checksum errors
>
> out of curiosity, is this a root pool?

It started as a standard pool, and is using version 3 zpool format.

I'm using a small ufs root, and have /usr as a zfs filesystem on that pool.

At some point in the past I did set up a zfs root and /usr filesystem for experimenting with xVM unstable bits.

> A second system of mine with a mirrored root pool (and an additional
> large multi-raidz pool) shows the same symptoms on the mirrored root
> pool only.
>
> once is accident. twice is coincidence. three times is enemy action :-)
>
> I'll file a bug as soon as I can (I'm travelling at the moment with
> spotty connectivity), citing my and your reports.

Btw. I also found the scrub checksum errors on a non-mirrored zpool (laptop with only one hdd). And on one zpool that was using a non-mirrored, striped pool on two S-ATA drives.

I think that in my case the cause for the scrub checksum errors is an open ZIL transaction on an *unmounted* zfs filesystem. In the past such a zfs state prevented creating snapshots for the unmounted zfs, see bugs 6482985 and 6462803. That is still the case. But now it also seems to trigger checksum errors for a zpool scrub.

Stack backtrace for the ECKSUM (which gets translated into EIO errors in arc_read_done()):

  1  64703  arc_read_nolock:return, rval 5
              zfs`zil_read_log_block+0x140
              zfs`zil_parse+0x155
              zfs`traverse_zil+0x55
              zfs`scrub_visitbp+0x284
              zfs`scrub_visit_rootbp+0x4e
              zfs`scrub_visitds+0x82
              zfs`dsl_pool_scrub_sync+0x109
              zfs`dsl_pool_sync+0x158
              zfs`spa_sync+0x254
              zfs`txg_sync_thread+0x226
              unix`thread_start+0x8

Does a "zdb -ivv {pool}" report any ZIL headers with a claim_txg != 0 on your pools? Is the dataset that is associated with such a ZIL an unmounted zfs?

# zdb -ivv files | grep claim_txg
        ZIL header: claim_txg 5164405, seq 0
        ZIL header: claim_txg 0, seq 0
        ZIL header: claim_txg 0, seq 0
        ZIL header: claim_txg 0, seq 0
        ZIL header: claim_txg 0, seq 0
        ZIL header: claim_txg 5164405, seq 0
        ZIL header: claim_txg 0, seq 0

# zdb -ivvvv files/matrix-usr
Dataset files/matrix-usr [ZPL], ID 216, cr_txg 5091978, 2.39G, 192089 objects

    ZIL header: claim_txg 5164405, seq 0

        first block: [L0 ZIL intent log] 1000L/1000P DVA[0]=<0:12421e0000:1000> zilog uncompressed LE contiguous birth=5163908 fill=0 cksum=c368086f1485f7c4:39a549a81d769386:d8:3

        Block seqno 3, already claimed, [L0 ZIL intent log] 1000L/1000P DVA[0]=<0:12421e0000:1000> zilog uncompressed LE contiguous birth=5163908 fill=0 cksum=c368086f1485f7c4:39a549a81d769386:d8:3

On two of my zpools I've eliminated the zpool scrub checksum errors by mounting / unmounting the zfs with the unplayed ZIL.
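The claim_txg check described above can be scripted. The following is a minimal Python sketch that scans saved `zdb -ivv` output and lists the datasets whose ZIL header has a non-zero claim_txg. It assumes the output has the "Dataset ..." and "ZIL header: ..." line shapes shown in this message; the function name `claimed_zils` is illustrative, and the second dataset in the sample (`files/other`) is made-up test data, not taken from any real pool.

```python
import re

# Line shapes taken from the zdb excerpt in this message (an assumption
# about the format, not a documented interface).
DATASET = re.compile(r"^Dataset (\S+) \[")
ZIL_HDR = re.compile(r"ZIL header: claim_txg (\d+), seq (\d+)")

def claimed_zils(zdb_text):
    """Return [(dataset, claim_txg)] for ZIL headers with claim_txg != 0."""
    found = []
    dataset = None  # most recent "Dataset ..." line seen so far
    for raw in zdb_text.splitlines():
        line = raw.strip()
        m = DATASET.match(line)
        if m:
            dataset = m.group(1)
            continue
        m = ZIL_HDR.search(line)
        if m and int(m.group(1)) != 0:
            found.append((dataset, int(m.group(1))))
    return found

# Sample input: the first dataset is from the message above, the second
# one is hypothetical and only there to show a header that is skipped.
sample = """\
Dataset files/matrix-usr [ZPL], ID 216, cr_txg 5091978, 2.39G, 192089 objects

    ZIL header: claim_txg 5164405, seq 0

Dataset files/other [ZPL], ID 300, cr_txg 10, 1.00G, 10 objects

    ZIL header: claim_txg 0, seq 0
"""

print(claimed_zils(sample))
```

Any dataset this reports is a candidate for the mount / unmount workaround; whether it is actually unmounted still has to be checked separately.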
Jürgen Keil
2008-Jul-21 14:57 UTC
[zfs-discuss] checksum errors on root pool after upgrade to snv_94
Rustam wrote:

> I'm living with this error for almost 4 months and probably have record
> number of checksum errors:
> # zpool status -xv
>   pool: box5
...
> errors: Permanent errors have been detected in the
> following files:
>
>         box5:<0x0>
>
> I've Sol 10 U5 though.

I suspect that this (S10u5) is a different issue, because for my system's pool it seems to be caused by the opensolaris putback on July 07th for these fixes:

6343667 scrub/resilver has to start over when a snapshot is taken
6343693 'zpool status' gives delayed start for 'zpool scrub'
6670746 scrub on degraded pool return the status of 'resilver completed'?
6675685 DTL entries are lost resulting in checksum errors
6706404 get_history_one() can dereference off end of hist_event_table[]
6715414 assertion failed: ds->ds_owner != tag in dsl_dataset_rele()
6716437 ztest gets SEGV in arc_released()
6722838 bfu does not update grub
Jürgen Keil
2008-Jul-22 08:57 UTC
[zfs-discuss] checksum errors on root pool after upgrade to snv_94
Bill Sommerfeld wrote:

> On Fri, 2008-07-18 at 10:28 -0700, Jürgen Keil wrote:
> > > I ran a scrub on a root pool after upgrading to snv_94, and got checksum errors:
> >
> > Hmm, after reading this, I started a zpool scrub on my mirrored pool,
> > on a system that is running post snv_94 bits: It also found checksum errors
>
> once is accident. twice is coincidence. three times is enemy action :-)
>
> I'll file a bug as soon as I can

I filed 6727872, for the problem with zpool scrub checksum errors on unmounted zfs filesystems with an unplayed ZIL.
Jürgen Keil
2008-Jul-23 16:49 UTC
[zfs-discuss] checksum errors on root pool after upgrade to snv_94
I wrote:

> Bill Sommerfeld wrote:
> > On Fri, 2008-07-18 at 10:28 -0700, Jürgen Keil wrote:
> > > > I ran a scrub on a root pool after upgrading to snv_94, and got checksum errors:
> > >
> > > Hmm, after reading this, I started a zpool scrub on my mirrored pool,
> > > on a system that is running post snv_94 bits: It also found checksum errors
> >
> > once is accident. twice is coincidence. three times is enemy action :-)
> >
> > I'll file a bug as soon as I can
>
> I filed 6727872, for the problem with zpool scrub checksum errors
> on unmounted zfs filesystems with an unplayed ZIL.

6727872 has already been fixed, in what will become snv_96.

For my zpool, zpool scrub doesn't report checksum errors any more.

But: something is still a bit strange with the data reported by zpool status. The error counts displayed by zpool status are all 0 (during the scrub, and when the scrub has completed), but when zpool scrub completes it tells me that "scrub completed after 0h58m with 6 errors". But it doesn't list the errors.

# zpool status -v files
  pool: files
 state: ONLINE
status: The pool is formatted using an older on-disk format.  The pool can
        still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'.  Once this is done, the
        pool will no longer be accessible on older software versions.
 scrub: scrub in progress for 0h57m, 99.39% done, 0h0m to go
config:

        NAME          STATE     READ WRITE CKSUM
        files         ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c8t0d0s6  ONLINE       0     0     0
            c9t0d0s6  ONLINE       0     0     0

errors: No known data errors

# zpool status -v files
  pool: files
 state: ONLINE
status: The pool is formatted using an older on-disk format.  The pool can
        still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'.  Once this is done, the
        pool will no longer be accessible on older software versions.
 scrub: scrub completed after 0h58m with 6 errors on Wed Jul 23 18:23:00 2008
config:

        NAME          STATE     READ WRITE CKSUM
        files         ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c8t0d0s6  ONLINE       0     0     0
            c9t0d0s6  ONLINE       0     0     0

errors: No known data errors
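The inconsistency described above (a non-zero error count on the "scrub:" line while every per-vdev counter is 0) can be detected from saved `zpool status` text. A minimal Python sketch follows, assuming the output layout shown in this message; `scrub_mismatch` and both regular expressions are illustrative names, not part of any ZFS tool.

```python
import re

# Assumed line shapes, taken from the `zpool status` output above.
SCRUB = re.compile(r"scrub completed after \S+ with (\d+) errors")
ROW = re.compile(
    r"^\s*\S+\s+(?:ONLINE|DEGRADED|FAULTED)\s+(\d+)\s+(\d+)\s+(\d+)"
)

def scrub_mismatch(status_text):
    """Return True if the scrub summary reports errors but all READ/WRITE/
    CKSUM counters are zero, False if they agree, or None if the text has
    no completed-scrub line."""
    m = SCRUB.search(status_text)
    if m is None:
        return None
    scrub_errors = int(m.group(1))
    counter_total = 0
    for line in status_text.splitlines():
        row = ROW.match(line)
        if row:
            counter_total += sum(int(value) for value in row.groups())
    return scrub_errors > 0 and counter_total == 0

completed = """\
 scrub: scrub completed after 0h58m with 6 errors on Wed Jul 23 18:23:00 2008
config:

        NAME          STATE     READ WRITE CKSUM
        files         ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c8t0d0s6  ONLINE       0     0     0
            c9t0d0s6  ONLINE       0     0     0
"""

print(scrub_mismatch(completed))
```

For the completed-scrub report above the function returns True; for a report taken while the scrub is still in progress it returns None, because there is no summary line to compare against.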