Hi all,

A quick question about the checksum error detection routines in ZFS. Obviously ZFS can detect and handle checksum errors in a redundant configuration, but what about a non-redundant one? We connected a single RAID5 array to a V440 acting as an NFS server, and while doing backups and the like we see the "zpool status -v" checksum error counters increment once in a while. Nevertheless, the command keeps telling us that applications are not affected. How can ZFS detect and correct those? I assume it doesn't verify after write, as that would kill performance. Is it caused by read errors that vanish after retrying the read? Would someone please explain how the mechanism works in that case?

Of course, in the meantime we have attached another box in a mirror configuration ;)

Thanks in advance,
Thomas

-----------------------------------------------------------------
GPG fingerprint: B1 EE D2 39 2C 82 26 DA A5 4D E0 50 35 75 9E ED
It's possible (if unlikely) that you are only getting checksum errors on metadata. Since ZFS always internally mirrors its metadata, even on non-redundant pools, it can recover from metadata corruption which does not affect all copies. (If there is only one LUN, the mirroring happens at different locations on the same LUN.) In the event of a data checksum error on a non-redundant pool, the application would see an I/O error. If the reported recovered errors are common, I'd suspect some sort of software-induced metadata corruption; you should be moving much more data than metadata in a typical system.
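[The data-vs-metadata distinction described above is easy to see on a throwaway, file-backed pool. The following is only a rough sketch, not a definitive procedure: the pool name "testpool", the backing file path, the file sizes, and the dd offset are all made up for illustration. Damage that lands on file data shows up as a permanent error on that file and an I/O error to the reader; damage that lands on metadata is repaired from the second (ditto) copy and only bumps the CKSUM counter.]

    mkfile 256m /var/tmp/vdev0
    zpool create testpool /var/tmp/vdev0
    cp /usr/dict/words /testpool/words
    zpool export testpool

    # Scribble over a region well past the front labels of the backing file.
    dd if=/dev/urandom of=/var/tmp/vdev0 bs=512 count=64 seek=131072 conv=notrunc

    zpool import -d /var/tmp testpool
    zpool scrub testpool
    zpool status -v testpool   # CKSUM counts rise; damaged file data is listed
                               # as a permanent error, damaged metadata is
                               # healed from the ditto copy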
On Fri, 16 Mar 2007, Anton B. Rang wrote:

> It's possible (if unlikely) that you are only getting checksum errors on
> metadata. Since ZFS always internally mirrors its metadata, even on
> non-redundant pools, it can recover from metadata corruption which does not
> affect all copies. (If there is only one LUN, the mirroring happens at
> different locations on the same LUN.)

I thought about that, but looking at the NFS server the real data should be much, much larger than the metadata, so I would consider it unlikely. Also, in the now-redundant setup we see checksum errors on both attached RAIDs.

Any hints on how to track the problem down to the HBA, cables, RAID and so on? We see similar things on all our machines, with few exceptions. Talking to local Sun folks, we have been "warned" before that checksum errors will show up and that this is considered normal. Nevertheless, I really want to know what they are about.

Thomas

-----------------------------------------------------------------
GPG fingerprint: B1 EE D2 39 2C 82 26 DA A5 4D E0 50 35 75 9E ED
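[For tracking this down below ZFS, the usual Solaris telemetry is a reasonable starting point. A rough checklist only, using standard Solaris 10 tools; the exact ereport classes you will see depend on the HBA and driver in use.]

    # ZFS posts an FMA ereport for every checksum failure; the verbose dump
    # shows which pool and vdev each one was seen on.
    fmdump -e        # one-line summary of the error telemetry
    fmdump -eV       # full reports; look for ereport.fs.zfs.checksum entries

    # Transport-level trouble (cables, HBA, array firmware) usually also shows
    # up as soft/hard/transport errors in the per-device counters.
    iostat -En

    # Any faults already diagnosed by FMA?
    fmadm faulty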
Hello Thomas,

Saturday, March 17, 2007, 11:46:14 AM, you wrote:

TN> Any hints on how to track the problem down to the HBA, cables, RAID and so
TN> on? We see similar things on all our machines, with few exceptions. Talking
TN> to local Sun folks, we have been "warned" before that checksum errors will
TN> show up and that this is considered normal. Nevertheless, I really want to
TN> know what they are about.

I have had a CR open for months now about the same problem - lots of CKSUM errors that all seem to be metadata-related only, which is highly unlikely.

--
Best regards,
Robert                          mailto:rmilkowski at task.gda.pl
                                http://milek.blogspot.com
Hello Robert,

Saturday, March 17, 2007, 6:49:05 PM, you wrote:

RM> I have had a CR open for months now about the same problem - lots of
RM> CKSUM errors that all seem to be metadata-related only, which is highly
RM> unlikely.

We've reinstalled the servers with U3 and SC 3.2, and for the last few days there has not been a single CKSUM error (the same pools were imported) - so maybe something was wrong with U2.

--
Best regards,
Robert                          mailto:rmilkowski at task.gda.pl
                                http://milek.blogspot.com
Hello Robert,

Wednesday, March 21, 2007, 10:36:15 AM, you wrote:

RM> We've reinstalled the servers with U3 and SC 3.2, and for the last few days
RM> there has not been a single CKSUM error (the same pools were imported) -
RM> so maybe something was wrong with U2.

One of those servers has again reported some CKSUM errors, and in the same way, so it looks like only metadata was involved. The problem is still there, then, but to a much lesser extent on U3.

--
Best regards,
Robert                          mailto:rmilkowski at task.gda.pl
                                http://milek.blogspot.com
Hello Robert,

Thursday, March 29, 2007, 12:37:28 AM, you wrote:

RM> One of those servers has again reported some CKSUM errors, and in the same
RM> way, so it looks like only metadata was involved. The problem is still
RM> there, then, but to a much lesser extent on U3.

bash-3.00# uname -a
SunOS XXXXX 5.10 Generic_118833-36 sun4u sparc SUNW,Sun-Fire-V240
bash-3.00#

[...]

  pool: nfs-s5-s6
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: none requested
config:

        NAME                                     STATE     READ WRITE CKSUM
        nfs-s5-s6                                ONLINE       0     0     7
          c4t600C0FF00000000009258F4855B59001d0  ONLINE       0     0     7

errors: No known data errors

  pool: nfs-s5-s7
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: none requested
config:

        NAME                                     STATE     READ WRITE CKSUM
        nfs-s5-s7                                ONLINE       0     0     6
          c4t600C0FF00000000009258F28706F5201d0  ONLINE       0     0     6

errors: No known data errors

  pool: nfs-s5-s8
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: none requested
config:

        NAME                                     STATE     READ WRITE CKSUM
        nfs-s5-s8                                ONLINE       0     0    10
          c4t600C0FF00000000009258F3E4C4C5601d0  ONLINE       0     0    10

errors: No known data errors
bash-3.00#

--
Best regards,
Robert                          mailto:rmilkowski at task.gda.pl
                                http://milek.blogspot.com
Isn't it more likely that these are errors on data as well? I think ZFS retries read operations when there's a checksum failure, so maybe these are transient hardware problems (faulty cables, high temperature...)? This would explain the non-existence of unrecoverable errors.

Robert Milkowski wrote:
> One of those servers has again reported some CKSUM errors, and in the same
> way, so it looks like only metadata was involved. The problem is still
> there, then, but to a much lesser extent on U3.
>
> [...]
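[One way to probe the transient-hardware hypothesis is to clear the counters and force a full re-read of everything on the pool: if a scrub comes back clean, the earlier CKSUM hits were most likely introduced somewhere in the I/O path rather than sitting persistently on the media. A rough sketch only, reusing "nfs-s5-s6" from the earlier output as the example pool name.]

    zpool clear nfs-s5-s6        # reset the READ/WRITE/CKSUM counters
    zpool scrub nfs-s5-s6        # re-read and verify every allocated block
    zpool status -v nfs-s5-s6    # after the scrub completes, check whether the
                                 # CKSUM column stayed at zero
    fmdump -e                    # and whether any new checksum ereports arrived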
Hello Ricardo,

Friday, April 6, 2007, 5:33:14 AM, you wrote:

RC> Isn't it more likely that these are errors on data as well? I think ZFS
RC> retries read operations when there's a checksum failure, so maybe these
RC> are transient hardware problems (faulty cables, high temperature...)?
RC> This would explain the non-existence of unrecoverable errors.

Wouldn't ZFS then retry for metadata as well?

--
Best regards,
Robert                          mailto:rmilkowski at task.gda.pl
                                http://milek.blogspot.com