On Tue, Jan 07, 2003 at 08:50:28AM -0500, Maurice Volaski wrote:
> Box with Redhat 7.1 and kernel 2.4.20 has a hardware RAID (about 750
> gigs data) attached via an Adaptec 29160 LP card...
>
> I began seeing numerous SCSI errors in my logs for our external
> hardware RAID and input/output errors on test attempts to copy files
> via cp. Rebooted and immediately saw the errors as the disk was
> initially accessed by the Adaptec driver. The RAID controller did not
> report any problems and when I swapped cards, those errors stopped.
>
> However, fsck.ext3 (1.27) immediately detected errors.
>
> The question is how do I know to trust fsck? What exactly does it
> mean that a file shares blocks with other files? How does fsck know
> which file the block really belongs to or could it actually mean files are
> corrupted and fsck is letting them get by?

What probably happened here is the RAID controller got confused, and
wrote parts of the inode table to the wrong location on disk. So for
example, suppose the block from the inode table describing inodes 8-15
got written on top of the block in the inode table which is supposed
to describe inodes 32-39. This will result in inodes 8 and 32 (and
likewise 9 and 33, and so on up through 15 and 39) claiming the same
blocks, and thus fsck will complain.
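
To make that scenario concrete, here is a toy Python model of it. This
is only a sketch, not how e2fsck or ext3 is implemented: the "inode
table" is a plain dictionary, and the function names, the 8 inodes per
table block, and the block numbers starting at 1000 are all made up for
illustration.

```python
# Toy model only: each inode is represented by the list of data blocks
# it claims, and one inode-table block is assumed to hold 8 inodes.
INODES_PER_BLOCK = 8

def make_inode_table(first_inode, last_inode, blocks_per_inode=2):
    """Give every inode its own, non-overlapping run of data blocks."""
    table = {}
    next_block = 1000  # arbitrary starting data block number
    for ino in range(first_inode, last_inode + 1):
        table[ino] = list(range(next_block, next_block + blocks_per_inode))
        next_block += blocks_per_inode
    return table

def misdirected_write(table, src_first, dst_first):
    """Simulate one inode-table block landing on top of another one."""
    for offset in range(INODES_PER_BLOCK):
        table[dst_first + offset] = list(table[src_first + offset])

def find_shared_blocks(table):
    """Return {block: [inodes claiming it]} for blocks claimed more than once."""
    owners = {}
    for ino, blocks in sorted(table.items()):
        for blk in blocks:
            owners.setdefault(blk, []).append(ino)
    return {blk: inos for blk, inos in owners.items() if len(inos) > 1}

table = make_inode_table(8, 39)
assert find_shared_blocks(table) == {}   # healthy: no block is claimed twice

misdirected_write(table, src_first=8, dst_first=32)
shared = find_shared_blocks(table)
print(shared[1000])   # [8, 32]: both inodes now claim block 1000
```

Note that nothing in the table after the bad write says which of the
two claimants is the "right" one; the original contents of inodes
32-39 simply are not there any more.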

Does fsck know which file a block "really" belongs to? Nope; it
doesn't have psychic abilities. In this scenario, the information of
which blocks are associated with inodes 32-39 is gone, and was
replaced with the blocks associated with inodes 8-15. What e2fsck
will do in this case is to allocate new blocks and fill them with a
copy of the data, so that inodes 8-15 and 32-39 have their own unique
set of data blocks. However, it doesn't restore the missing data --
it can't. What this does do is make the filesystem consistent so that
it's safe to mount the filesystem, and then the system administrator
must sort through the files to determine which files have valid data,
and which ones do not. This is one reason why e2fsck is so meticulous
about printing full pathnames during pass 1B/1C/1D processing.
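
The cloning step can be sketched in the same toy style. Again this is
an illustration under assumptions, not e2fsck's actual code: the
"disk" is a dictionary, free blocks are assumed to start just past the
highest block in use, and which inode gets to keep the original block
is an arbitrary choice made by this sketch.

```python
def clone_shared_blocks(inodes, disk):
    """Give each inode a private copy of any block another inode also claims.

    inodes: {inode number: [block numbers]}
    disk:   {block number: bytes}
    Toy model only; the lowest-numbered claimant keeps the original block.
    """
    seen = set()
    next_free = max(disk) + 1   # pretend everything past the end is free
    for ino in sorted(inodes):
        for i, blk in enumerate(inodes[ino]):
            if blk in seen:
                disk[next_free] = disk[blk]   # copy the contents as they are now
                inodes[ino][i] = next_free
                seen.add(next_free)
                next_free += 1
            else:
                seen.add(blk)

disk = {100: b"data written by file A", 101: b"more of file A"}
inodes = {8: [100, 101], 32: [100, 101]}   # inode 32's real block list is lost
clone_shared_blocks(inodes, disk)

print(inodes)                # every inode now has its own blocks
print(disk[inodes[32][0]])   # but inode 32 still holds a copy of file A's data
```

The result is consistent (no block has two owners), but inode 32 ends
up with a copy of somebody else's data, which is why a human has to
decide afterwards which files are actually good.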

(Since those pathnames are what you will be working from afterwards,
if you know that there is a lot of filesystem damage, it can be very
useful to run e2fsck under script, so you have a full transcript of
e2fsck's output.)
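
As a rough sketch of what "under script" means, the Python below just
builds the command line. The assumptions here are mine: that your
script(1) supports -c (the util-linux one does), and that /dev/sdX1 and
the transcript filename are placeholders you would replace with your
own device and path.

```python
import shlex

def transcript_command(device, logfile):
    """Build a script(1) command line that records an e2fsck run in logfile."""
    # -f forces a check even if the filesystem is marked clean.
    fsck = "e2fsck -f " + shlex.quote(device)
    return ["script", "-c", fsck, logfile]

# Placeholder device and filename; substitute your own.
cmd = transcript_command("/dev/sdX1", "e2fsck-transcript.txt")
print(" ".join(shlex.quote(part) for part in cmd))

# To actually run it (as root, with the filesystem unmounted):
#   import subprocess
#   subprocess.run(cmd, check=False)  # e2fsck exits non-zero when it fixes errors
```

Everything e2fsck prints, including the pathnames from passes 1B-1D,
then ends up in the transcript file for later reference.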

> What I am asking is should I trust fsck's apparent success or should
> I choose to reformat and restore?

If you have reliable backups, by all means use them. On the other
hand, if there is some precious data that was not backed up, it might
be worth going through the filesystem to see what you can save before
you give up on it.

Good luck!

- Ted

P.S. Consider yourself fortunate that you have backups! This is not
the first time that I've seen the case where a RAID controller goes
insane, and wipes out huge amounts of data. And it's fairly common
that sysadmins assume that RAID means that they don't need to do
backups since they're protected against disk failures, and then get
totally screwed when the RAID controller goes insane.