> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Achim Wolpers
>
> I'm not sure if these files are
> really corrupted,
>
> All files have the identical md5 sum compared to the
> corresponding files of a different box, also running the same version of
> OI.
>
> How do I find out if these files are corrupted? If they appear to be ok,
> how do I get rid of the errors?
Given that they all have the same md5sum as another copy on another box that
you have solid reason to believe is not corrupted, I think it's pretty safe
for you to conclude that the corruption on the suspect box was actually a
miscomputed checksum. So, to get rid of the errors:

Just copy the files from the other box and overwrite the files on the
supposedly corrupted box. This will force the supposedly corrupt system to
calculate new checksums and start using the new data with correct
checksums. But that's only half of the problem. The old blocks with the
bad checksums are still referenced by any snapshots that contain them, so
as far as I know, you'll have to wait for the corrupted data to cycle its
way out through your normal snapshot rotation. Or you could start
destroying snapshots. But some of the folks here are much better with zdb
and so forth than I am - there may be a way to correct the incorrect cksum.
"Your child swallowed a plastic bead? Don't worry, it will pass."
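
If you want to repeat the md5 comparison across a whole list of files before
overwriting anything, here is a minimal sketch in Python. It assumes the
known-good copy is reachable as a local path (for example over an NFS
mount); the function names and arguments are made up for illustration and
are not part of any ZFS tool. Files that cannot be read at all are reported
as well, since a read can fail with an I/O error.

```python
import hashlib
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(suspect_root, reference_root, relative_paths):
    """Return the relative paths that differ, are missing, or are unreadable.

    suspect_root and reference_root are hypothetical directory names: the
    first is the tree on the box reporting errors, the second is the
    known-good copy.
    """
    mismatched = []
    for rel in relative_paths:
        suspect = Path(suspect_root) / rel
        reference = Path(reference_root) / rel
        try:
            same = md5_of(suspect) == md5_of(reference)
        except OSError:
            # Missing file or a read error on either side.
            same = False
        if not same:
            mismatched.append(rel)
    return mismatched
```

An empty result means every listed file matches the reference copy, which
supports the "bad checksum, good data" conclusion. Anything that does show
up is a candidate for restoring from the other box.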
> How can two healthy pools get that messed up, when a RAM DIMM gets
> broken?
Two healthy pools? I thought you only mentioned one pool.
No matter. Here's the answer:
Suppose you are a processor. You have instructions to follow, and you have
paper to write on, to keep track of all the variables you're using, which
are too many to keep inside your short term memory all the time. But when
you're not looking, somebody comes along and changes what you wrote in your
notepad. You were in the middle of calculating a cksum for some block of
data, and a cosmic flare or something caused your calculation to get messed
up. Of course you didn't know it.
So you wrote the data to disk, and you wrote the wrong cksum to disk too.
Later you read it back, and the cksum fails, which does not tell you the
data is corrupt - it tells you that either the data or the cksum is corrupt.
You don't know which, so the best thing to do is simply restore the data
from a known good copy.
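
To make that last point concrete, here is a small toy simulation in Python.
It is only an illustration under stated assumptions: SHA-256 from hashlib
stands in for whatever checksum the pool actually uses, and the bit flip is
injected by hand to play the role of the bad DIMM. None of the names below
come from ZFS.

```python
import hashlib


def checksum(data: bytes) -> bytes:
    """Stand-in checksum; the real pool may use a different algorithm."""
    return hashlib.sha256(data).digest()


def flip_bit(buf: bytes, bit_index: int) -> bytes:
    """Return a copy of buf with one bit inverted, like a memory error."""
    damaged = bytearray(buf)
    damaged[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(damaged)


def verify(data: bytes, stored_sum: bytes) -> bool:
    """What a later read does: recompute and compare."""
    return checksum(data) == stored_sum


block = b"important file contents " * 100
good_sum = checksum(block)

# Case A: the data is fine, but the checksum was damaged before the write.
case_a_data, case_a_sum = block, flip_bit(good_sum, 5)

# Case B: the checksum is fine, but the data was damaged.
case_b_data, case_b_sum = flip_bit(block, 5), good_sum

# Both cases fail verification in exactly the same way.
case_a_ok = verify(case_a_data, case_a_sum)
case_b_ok = verify(case_b_data, case_b_sum)

# Only a comparison against a known good copy tells them apart.
case_a_matches_good_copy = case_a_data == block
case_b_matches_good_copy = case_b_data == block
```

Both `case_a_ok` and `case_b_ok` come out false, so the failed check alone
cannot say which side is wrong. Only the comparison against the good copy
separates the two cases, which is why the md5 match with the other box
matters here.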