Bill Sommerfeld

2008-Jul-18 00:34 UTC

[zfs-discuss] checksum errors on root pool after upgrade to snv_94

I ran a scrub on a root pool after upgrading to snv_94, and got checksum
errors:

  pool: r00t
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub completed after 0h26m with 1 errors on Thu Jul 17 14:52:14 2008
config:

        NAME          STATE     READ WRITE CKSUM
        r00t          ONLINE       0     0     2
          mirror      ONLINE       0     0     2
            c4t0d0s0  ONLINE       0     0     4
            c4t1d0s0  ONLINE       0     0     4

I ran it again, and it's now reporting the same errors, but still says
"applications are unaffected":

  pool: r00t
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub completed after 0h27m with 2 errors on Thu Jul 17 20:24:15 2008
config:

        NAME          STATE     READ WRITE CKSUM
        r00t          ONLINE       0     0     4
          mirror      ONLINE       0     0     4
            c4t0d0s0  ONLINE       0     0     8
            c4t1d0s0  ONLINE       0     0     8

errors: No known data errors
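
A small sketch for comparing the CKSUM column of the two reports above, in case it helps others check whether their counts also grow from one scrub to the next. The function names are made up for illustration, and the parser assumes the counters are printed as plain integers, as they are here.

```python
import re

# One row of the "config:" table: name, state, READ, WRITE, CKSUM.
# Assumption: counters are plain integers, as in the reports above.
ROW = re.compile(
    r"^\s+(\S+)\s+(?:ONLINE|DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED)"
    r"\s+(\d+)\s+(\d+)\s+(\d+)"
)

def cksum_counts(status_text):
    """Return [(vdev name, CKSUM count)] in table order."""
    rows = []
    for line in status_text.splitlines():
        m = ROW.match(line)
        if m:
            rows.append((m.group(1), int(m.group(4))))
    return rows

def cksum_growth(before, after):
    """Pair rows by position and return [(vdev name, increase in CKSUM)]."""
    return [
        (name, new - old)
        for (name, old), (_, new) in zip(cksum_counts(before), cksum_counts(after))
    ]

# Config tables copied from the two scrub reports.
FIRST = """\
        NAME          STATE     READ WRITE CKSUM
        r00t          ONLINE       0     0     2
          mirror      ONLINE       0     0     2
            c4t0d0s0  ONLINE       0     0     4
            c4t1d0s0  ONLINE       0     0     4
"""
SECOND = """\
        NAME          STATE     READ WRITE CKSUM
        r00t          ONLINE       0     0     4
          mirror      ONLINE       0     0     4
            c4t0d0s0  ONLINE       0     0     8
            c4t1d0s0  ONLINE       0     0     8
"""

if __name__ == "__main__":
    for name, delta in cksum_growth(FIRST, SECOND):
        print(f"{name}: +{delta}")
```

Rows are paired by position rather than by name because a pool with several mirrors repeats the name "mirror".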


I wonder if I'm running into some combination of:

6725341 Running 'zpool scrub' repeatedly on a pool show an ever
increasing error count

and maybe:

6437568 ditto block repair is incorrectly propagated to root vdev

Any way to dig further to determine what's going on?

					- Bill

Jürgen Keil

2008-Jul-18 17:28 UTC

> I ran a scrub on a root pool after upgrading to snv_94, and got checksum
> errors:

Hmm, after reading this, I started a zpool scrub on my mirrored pool,
on a system that is running post snv_94 bits.  It also found checksum errors:

# zpool status files
  pool: files
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
	attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
	using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub completed after 0h46m with 9 errors on Fri Jul 18 13:33:56 2008
config:

	NAME          STATE     READ WRITE CKSUM
	files         DEGRADED     0     0    18
	  mirror      DEGRADED     0     0    18
	    c8t0d0s6  DEGRADED     0     0    36  too many errors
	    c9t0d0s6  DEGRADED     0     0    36  too many errors

errors: No known data errors


Adding the -v option to zpool status returned:


errors: Permanent errors have been detected in the following files:

        <metadata>:<0x0>



OTOH, trying to verify checksums with zdb -c didn't find any problems:

# zdb -cvv files

Traversing all blocks to verify checksums and verify nothing leaked ...

	No leaks (block sum matches space maps exactly)

	bp count:         2804880
	bp logical:    121461614592	 avg:  43303
	bp physical:   84585684992	 avg:  30156	compression:   1.44
	bp allocated:  85146115584	 avg:  30356	compression:   1.43
	SPA allocated: 85146115584	used: 79.30%

951.08u 419.55s 2:24:34.32 15.8%
#
 
 
This message posted from opensolaris.org

Jürgen Keil

2008-Jul-18 18:28 UTC

> > I ran a scrub on a root pool after upgrading to snv_94, and got
> > checksum errors:
>
> Hmm, after reading this, I started a zpool scrub on my mirrored pool,
> on a system that is running post snv_94 bits:  It also found checksum
> errors
> ...
> OTOH, trying to verify checksums with zdb -c didn't
> find any problems:

And a zpool scrub under snv_85 doesn't find checksum errors, either.
 
 

Rustam Aliyev

2008-Jul-18 19:49 UTC

I've been living with this error for almost 4 months and probably have a
record number of checksum errors:

core# zpool status -xv
  pool: box5
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: none requested
config:
 
        NAME        STATE     READ WRITE CKSUM
        box5        ONLINE       0     0   856
          mirror    ONLINE       0     0   428
            c1d0    ONLINE       0     0   856
            c2d0    ONLINE       0     0   856
          mirror    ONLINE       0     0   428
            c2d1    ONLINE       0     0   856
            c1d1    ONLINE       0     0   856
 
errors: Permanent errors have been detected in the following files:
 
        box5:<0x0>


I have Sol 10 U5 though.

--
Rustam.



Miles Nordin

2008-Jul-20 11:26 UTC

>>>>> "jk" == Jürgen Keil <jk at tools.de> writes:

    jk> And a zpool scrub under snv_85 doesn't find checksum errors,
    jk> either.

how about a second scrub with snv_94?  are the checksum errors gone
the second time around?

I get checksum errors counted all the time when it is really just
resilvering.

Bill Sommerfeld

2008-Jul-20 18:26 UTC

On Fri, 2008-07-18 at 10:28 -0700, Jürgen Keil wrote:
> > I ran a scrub on a root pool after upgrading to snv_94, and got
> > checksum errors:
>
> Hmm, after reading this, I started a zpool scrub on my mirrored pool,
> on a system that is running post snv_94 bits:  It also found checksum
> errors
>
> # zpool status files
>   pool: files
>  state: DEGRADED
> status: One or more devices has experienced an unrecoverable error.  An
> 	attempt was made to correct the error.  Applications are unaffected.
> action: Determine if the device needs to be replaced, and clear the errors
> 	using 'zpool clear' or replace the device with 'zpool replace'.
>    see: http://www.sun.com/msg/ZFS-8000-9P
>  scrub: scrub completed after 0h46m with 9 errors on Fri Jul 18 13:33:56 2008
> config:
>
> 	NAME          STATE     READ WRITE CKSUM
> 	files         DEGRADED     0     0    18
> 	  mirror      DEGRADED     0     0    18
> 	    c8t0d0s6  DEGRADED     0     0    36  too many errors
> 	    c9t0d0s6  DEGRADED     0     0    36  too many errors
>
> errors: No known data errors

out of curiosity, is this a root pool?

A second system of mine with a mirrored root pool (and an additional
large multi-raidz pool) shows the same symptoms on the mirrored root
pool only.

once is accident.  twice is coincidence.  three times is enemy
action :-)

I'll file a bug as soon as I can (I'm travelling at the moment with
spotty connectivity), citing my and your reports.

					- Bill

dick hoogendijk

2008-Jul-20 18:43 UTC

On Sun, 20 Jul 2008 11:26:16 -0700
Bill Sommerfeld <sommerfeld at sun.com> wrote:
> once is accident.  twice is coincidence.  three times is enemy
> action :-)
I have no access to b94 yet, but as it is, it probably is better to
skip this one when it comes out then.

-- 
Dick Hoogendijk -- PGP/GnuPG key: 01D2433D
++ http://nagual.nl/ + SunOS sxce snv91 ++

Jürgen Keil

2008-Jul-21 08:28 UTC

Miles Nordin wrote:
> "jk" == Jürgen Keil <jk at tools.de> writes:
> jk> And a zpool scrub under snv_85 doesn't find checksum errors, either.
>
> how about a second scrub with snv_94?  are the checksum errors gone
> the second time around?

Nope.

I've now seen this problem on 4 zpools on three different systems.
Post snv_94 (bfu'ed) reports checksum errors during scrub, and the
scrub under the original nevada release (snv_85, snv_89 and snv_91)
didn't report checksum errors.


Jürgen Keil

2008-Jul-21 09:18 UTC

Bill Sommerfeld wrote:
> On Fri, 2008-07-18 at 10:28 -0700, Jürgen Keil wrote:
> > > I ran a scrub on a root pool after upgrading to snv_94, and got
> > > checksum errors:
> >
> > Hmm, after reading this, I started a zpool scrub on my mirrored pool,
> > on a system that is running post snv_94 bits:  It also found checksum
> > errors
>
> out of curiosity, is this a root pool?

It started as a standard pool, and is using version 3 zpool format.

I'm using a small ufs root, and have /usr as a zfs filesystem on
that pool.

At some point in the past I did set up a zfs root and /usr filesystem
for experimenting with xVM unstable bits.

> A second system of mine with a mirrored root pool (and an additional
> large multi-raidz pool) shows the same symptoms on the mirrored root
> pool only.
> 
> once is accident.  twice is coincidence.  three times is enemy action :-)
> 
> I'll file a bug as soon as I can (I'm travelling at the moment with
> spotty connectivity), citing my and your reports.

Btw. I also found the scrub checksum errors on a non-mirrored zpool
(laptop with only one hdd).

And on one zpool that was using a non-mirrored, striped pool on two
S-ATA drives.


I think that in my case the cause for the scrub checksum errors is an
open ZIL transaction on an *unmounted* zfs filesystem.  In the past
such a zfs state prevented creating snapshots for the unmounted zfs,
see bugs 6482985 and 6462803.  That is still the case.  But now it also
seems to trigger checksum errors for a zpool scrub.

Stack backtrace for the ECKSUM (which gets translated into EIO errors
in arc_read_done()):

  1  64703           arc_read_nolock:return, rval 5
              zfs`zil_read_log_block+0x140
              zfs`zil_parse+0x155
              zfs`traverse_zil+0x55
              zfs`scrub_visitbp+0x284
              zfs`scrub_visit_rootbp+0x4e
              zfs`scrub_visitds+0x82
              zfs`dsl_pool_scrub_sync+0x109
              zfs`dsl_pool_sync+0x158
              zfs`spa_sync+0x254
              zfs`txg_sync_thread+0x226
              unix`thread_start+0x8




Does a "zdb -ivv {pool}" report any ZIL headers with a claim_txg != 0
on your pools?  Is the dataset that is associated with such a ZIL an
unmounted zfs?

    # zdb -ivv files | grep claim_txg
    ZIL header: claim_txg 5164405, seq 0
    ZIL header: claim_txg 0, seq 0
    ZIL header: claim_txg 0, seq 0
    ZIL header: claim_txg 0, seq 0
    ZIL header: claim_txg 0, seq 0
    ZIL header: claim_txg 5164405, seq 0
    ZIL header: claim_txg 0, seq 0


# zdb -ivvvv files/matrix-usr
Dataset files/matrix-usr [ZPL], ID 216, cr_txg 5091978, 2.39G, 192089 objects

    ZIL header: claim_txg 5164405, seq 0

	first block: [L0 ZIL intent log] 1000L/1000P DVA[0]=<0:12421e0000:1000> zilog uncompressed LE contiguous birth=5163908 fill=0 cksum=c368086f1485f7c4:39a549a81d769386:d8:3

	Block seqno 3, already claimed, [L0 ZIL intent log] 1000L/1000P DVA[0]=<0:12421e0000:1000> zilog uncompressed LE contiguous birth=5163908 fill=0 cksum=c368086f1485f7c4:39a549a81d769386:d8:3
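
To answer the claim_txg question on a pool with many datasets, the zdb output can be filtered with a short script. The following is only a sketch: it assumes that "zdb -ivv {pool}" prints a "Dataset ..." line before each "ZIL header" line, as in the single-dataset output above, and the function name is hypothetical.

```python
import re

# Line formats taken from the zdb output above.
DATASET = re.compile(r"^Dataset (\S+) \[")
ZIL_HEADER = re.compile(r"^\s*ZIL header: claim_txg (\d+), seq (\d+)")

def claimed_zils(zdb_text):
    """Return [(dataset, claim_txg)] for ZIL headers with claim_txg != 0."""
    hits = []
    dataset = None  # most recent "Dataset ..." line seen so far
    for line in zdb_text.splitlines():
        m = DATASET.match(line)
        if m:
            dataset = m.group(1)
            continue
        m = ZIL_HEADER.match(line)
        if m and int(m.group(1)) != 0:
            hits.append((dataset, int(m.group(1))))
    return hits

# First entry is from the output above; the second dataset is made up
# purely to show a header that is skipped.
SAMPLE = """\
Dataset files/matrix-usr [ZPL], ID 216, cr_txg 5091978, 2.39G, 192089 objects

    ZIL header: claim_txg 5164405, seq 0

Dataset files/example [ZPL], ID 30, cr_txg 100, 1.00G, 10 objects

    ZIL header: claim_txg 0, seq 0
"""

if __name__ == "__main__":
    for name, txg in claimed_zils(SAMPLE):
        print(f"{name}: claim_txg {txg}")
```

Each dataset it reports can then be checked by hand to see whether it is currently unmounted.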


On two of my zpools I've eliminated the zpool scrub checksum errors by
mounting and then unmounting the zfs filesystem with the unplayed ZIL.
 
 

Jürgen Keil

2008-Jul-21 14:57 UTC

Rustam wrote:
> I'm living with this error for almost 4 months and probably have record
> number of checksum errors:
> # zpool status -xv
>   pool: box5
> ...
> errors: Permanent errors have been detected in the
> following files:
>
>         box5:<0x0>
>
> I've Sol 10 U5 though.

I suspect that this (S10u5) is a different issue, because for my
system's pool it seems to be caused by the opensolaris putback
on July 07th for these fixes:

6343667 scrub/resilver has to start over when a snapshot is taken
6343693 'zpool status' gives delayed start for 'zpool scrub'
6670746 scrub on degraded pool return the status of 'resilver completed'?
6675685 DTL entries are lost resulting in checksum errors
6706404 get_history_one() can dereference off end of hist_event_table[]
6715414 assertion failed: ds->ds_owner != tag in dsl_dataset_rele()
6716437 ztest gets SEGV in arc_released()
6722838 bfu does not update grub


Jürgen Keil

2008-Jul-22 08:57 UTC

Bill Sommerfeld wrote:
> On Fri, 2008-07-18 at 10:28 -0700, Jürgen Keil wrote:
> > > I ran a scrub on a root pool after upgrading to snv_94, and got
> > > checksum errors:
> >
> > Hmm, after reading this, I started a zpool scrub on my mirrored pool,
> > on a system that is running post snv_94 bits:  It also found checksum
> > errors
>
> once is accident.  twice is coincidence.  three times is enemy action :-)
>
> I'll file a bug as soon as I can

I filed 6727872, for the problem with zpool scrub checksum errors
on unmounted zfs filesystems with an unplayed ZIL.
 
 

Jürgen Keil

2008-Jul-23 16:49 UTC

I wrote:
> Bill Sommerfeld wrote:
> > On Fri, 2008-07-18 at 10:28 -0700, Jürgen Keil wrote:
> > > > I ran a scrub on a root pool after upgrading to snv_94, and
> > > > got checksum errors:
> > >
> > > Hmm, after reading this, I started a zpool scrub on my mirrored
> > > pool, on a system that is running post snv_94 bits:  It also found
> > > checksum errors
> >
> > once is accident.  twice is coincidence.  three times is enemy action :-)
> >
> > I'll file a bug as soon as I can
>
> I filed 6727872, for the problem with zpool scrub checksum errors
> on unmounted zfs filesystems with an unplayed ZIL.

6727872 has already been fixed, in what will become snv_96.

For my zpool, zpool scrub doesn't report checksum errors any more.

But: something is still a bit strange with the data reported by zpool status.
The error counts displayed by zpool status are all 0 (during the scrub, and when
the scrub has completed), but when zpool scrub completes it tells me that
"scrub completed after 0h58m with 6 errors".  But it doesn't
list the errors.

# zpool status -v files
  pool: files
 state: ONLINE
status: The pool is formatted using an older on-disk format.  The pool can
	still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'.  Once this is done, the
	pool will no longer be accessible on older software versions.
 scrub: scrub in progress for 0h57m, 99.39% done, 0h0m to go
config:

	NAME          STATE     READ WRITE CKSUM
	files         ONLINE       0     0     0
	  mirror      ONLINE       0     0     0
	    c8t0d0s6  ONLINE       0     0     0
	    c9t0d0s6  ONLINE       0     0     0

errors: No known data errors


# zpool status -v files
  pool: files
 state: ONLINE
status: The pool is formatted using an older on-disk format.  The pool can
	still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'.  Once this is done, the
	pool will no longer be accessible on older software versions.
 scrub: scrub completed after 0h58m with 6 errors on Wed Jul 23 18:23:00 2008
config:

	NAME          STATE     READ WRITE CKSUM
	files         ONLINE       0     0     0
	  mirror      ONLINE       0     0     0
	    c8t0d0s6  ONLINE       0     0     0
	    c9t0d0s6  ONLINE       0     0     0

errors: No known data errors
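
The inconsistency is easy to test for in saved zpool status output. A minimal sketch, assuming the "scrub completed ... with N errors" wording and the plain integer counters shown above; the function name is hypothetical.

```python
import re

# Summary line wording as printed by zpool status above.
SCRUB = re.compile(r"scrub completed after \S+ with (\d+) errors")
# READ / WRITE / CKSUM counters of each row in the "config:" table.
ROW = re.compile(
    r"^[ \t]+\S+[ \t]+(?:ONLINE|DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED)"
    r"[ \t]+(\d+)[ \t]+(\d+)[ \t]+(\d+)",
    re.MULTILINE,
)

def scrub_mismatch(status_text):
    """Return the scrub error count if the summary reports errors while
    every counter in the config table is zero; otherwise return None."""
    m = SCRUB.search(status_text)
    if not m:
        return None
    reported = int(m.group(1))
    counters = [int(n) for row in ROW.findall(status_text) for n in row]
    if reported > 0 and sum(counters) == 0:
        return reported
    return None

# Taken from the second status report above (tabs replaced by spaces).
SAMPLE = """\
  pool: files
 state: ONLINE
 scrub: scrub completed after 0h58m with 6 errors on Wed Jul 23 18:23:00 2008
config:

        NAME          STATE     READ WRITE CKSUM
        files         ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c8t0d0s6  ONLINE       0     0     0
            c9t0d0s6  ONLINE       0     0     0

errors: No known data errors
"""

if __name__ == "__main__":
    print(scrub_mismatch(SAMPLE))
```

A report whose counters are nonzero, or whose scrub found no errors, yields None.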
 
 