Hi!
Today I encountered data corruption on two ZFS pools due to a RAM failure in my
OI box running on a Dell T710. My rpool now looks like this (after reboot):
  pool: rpool
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
  scan: scrub repaired 0 in 1h1m with 1 errors on Tue Jan 31 19:59:50 2012
config:

        NAME                         STATE   READ WRITE CKSUM
        rpool                        ONLINE     0     0     1
          mirror-0                   ONLINE     0     0     2
            c4t50014EE10313DE5Dd0s0  ONLINE     0     0     2
            c4t50014EE158688073d0s0  ONLINE     0     0     2

errors: Permanent errors have been detected in the following files:
//var/pkg/lost+found/var/lib/gdm-20120131T195638Z/core
//usr/lib/svn/libsvn_delta-1.so.0.0.0
//lib/libpam.so.1
//usr/lib/libXtsol.so.1
//usr/lib/gnome-settings-daemon
//usr/ruby/1.8/lib/ruby/1.8/optparse.rb
//usr/gnu/bin/rm
//var/log/syslog
//usr/lib/amd64/libpciaccess.so.0
//usr/local/lib/libiconv.so.2.5.0
/rpool/service/svn/dvg/db/revs/0/653
/rpool/service/svn/privat/db/revs/0/451
/rpool/service/svn/privat/db/revs/0/716
/rpool/service/svn/privat/db/revs/1/1276
/rpool/service/svn/privat/db/revs/0/377
/rpool/service/svn/privat/db/revs/0/835
/rpool/service/svn/privat/db/revs/0/364
I have 17 files that are reported as permanently corrupted. The corruption of
gdm/core was found while scrubbing the pool. All the other 16 files were
displayed as corrupted after the pool fell into a degraded state. I'm not sure
if these files are really corrupted, though: I can access all of them, and e.g.
/usr/gnu/bin/rm works without faults. All files have md5 sums identical to
those of the corresponding files on a different box running the same
version of OI.
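
A minimal sketch of the md5 comparison described above, in case it helps anyone repeat it. It assumes the reference sums were collected on the known-good box and brought over as a path-to-digest mapping; the function names `md5_of` and `compare_with_reference` are made up for illustration and are not part of any ZFS tooling.

```python
import hashlib


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_with_reference(paths, reference):
    """Classify each path as 'match', 'MISMATCH', 'unreadable' or 'no reference'.

    `reference` maps a path to the MD5 hex digest taken on the known-good box
    (hypothetical input, gathered however is convenient).
    """
    results = {}
    for path in paths:
        expected = reference.get(path)
        if expected is None:
            results[path] = "no reference"
            continue
        try:
            actual = md5_of(path)
        except OSError:
            # The file is missing or the read failed with an I/O error.
            results[path] = "unreadable"
            continue
        results[path] = "match" if actual == expected else "MISMATCH"
    return results
```

Feeding it the paths from the error list and the sums from the other box gives one verdict per file. A "match" only says the bytes that could be read agree with the reference copy; it does not clear the error recorded by the pool.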
How do I find out whether these files are corrupted? If they turn out to be OK,
how do I get rid of the errors?
How can two healthy pools get that messed up when a RAM DIMM breaks?
Achim
> From: zfs-discuss-bounces at opensolaris.org
> [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Achim Wolpers
>
> I'm not sure if these files are really corrupted,
>
> All files have the identical md5 sum compared to the corresponding
> files of a different box, also running the same version of OI.
>
> How do I find out, if these files are corrupted? If they appear to be ok,
> how do I get rid of the errors?

Given that they all have the same md5sum as another copy on another box that
you have solid reason to believe is not corrupted... Then I think it's pretty
safe for you to conclude the corruption in the corrupt box was actually a
miscomputed checksum.

So... To get rid of the errors... Just copy the files from the other box, and
overwrite the files in the supposedly corrupted box. This will force the
supposedly corrupt system to calculate new checksums, and start using the new
data with correct checksums.

But that's only half of the problem. As far as I know, you'll have to wait for
the corrupted data to cycle its way out through your normal snapshot rotation.
Or you could start destroying snapshots. But some of the folks here are much
better with zdb and so forth than I am - there may be a way to correct the
incorrect cksum.

"Your child swallowed a plastic bead? Don't worry, it will pass."

> How can two healthy pools get that messed up, when a RAM DIMM gets
> broken?

Two healthy pools? I thought you only mentioned one pool. No matter. Here's
the answer:

Suppose you are a processor. You have instructions to follow, and you have
paper to write on, to keep track of all the variables you're using, which are
too many to keep inside your short term memory all the time. But when you're
not looking, somebody comes along and changes what you wrote in your notepad.

You were in the middle of calculating a cksum for some block of data, and a
cosmic flare or something caused your calculation to get messed up. Of course
you didn't know it. So you wrote the data to disk, and you wrote the wrong
checksum to disk too. Later you read it back, and the cksum fails, which does
not tell you the data is corrupt - it tells you either the data or the cksum
is corrupt. You don't know which, so the best thing to do is simply restore
the data from a known good copy.
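
To make the "either the data or the cksum" point concrete, here is a toy sketch. It is not how ZFS is implemented: SHA-256 stands in for whatever checksum the pool actually uses, and `flip_bit` is a made-up helper that imitates a single bad bit in RAM.

```python
import hashlib


def checksum(data: bytes) -> str:
    # SHA-256 is only a stand-in for the pool's real checksum (assumption).
    return hashlib.sha256(data).hexdigest()


def flip_bit(data: bytes, bit: int) -> bytes:
    """Return a copy of `data` with one bit inverted, like a RAM fault."""
    buf = bytearray(data)
    buf[bit // 8] ^= 1 << (bit % 8)
    return bytes(buf)


def verify(data: bytes, stored: str) -> bool:
    """What the reader does later: recompute and compare."""
    return checksum(data) == stored


block = b"contents of one block"

# Case 1: the data was damaged after its checksum had been computed.
case1 = (flip_bit(block, 3), checksum(block))

# Case 2: the data is fine, but the checksum itself was damaged.
damaged_sum = flip_bit(bytes.fromhex(checksum(block)), 3).hex()
case2 = (block, damaged_sum)

for data, stored in (case1, case2):
    # Both cases fail verification in exactly the same way.
    print(verify(data, stored))
```

Both cases print `False`, and nothing in the failure says which half is wrong. Only an independent copy, such as the one on the other box, can settle it.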