Given what you've described, the only drive that it would make sense to
pull out would be the one that was dropped and then re-inserted.
On Jan 4, 2008 10:20 PM, Dennison Williams <evoltech at 2inches.com>
wrote:
> > Did you try and re-insert the kicked-out drive as if it was clean, or did
> > you try to re-sync it to the existing filesystem? If the former, then
> > that's a HUGE mistake because the data on the drive is no longer in sync
> > with what is on the other drives (unless the entire filesystem was made
> > read-only when (or before) the drive was dropped out).
>
> I re-inserted it with:
> mdadm /dev/md0 --add /dev/sde
> At which point it seemed to resync with the raid device (ie. the output
> of /proc/mdstat showed that it was incrementally syncing)
>
> > Check the SMART logs for each of the drives to see if they've had any
> > problems.
>
> there are messages like this:
> /dev/sdc, failed to read SMART Attribute Data
> ...but this wasn't one of the disks that was removed from the raid device
If there are complaints about sdc, then I'd be inclined to run a long SMART
self-test on it. It's possible that the real problem started here.
A badblocks read test (or just a dd if=/dev/sdc of=/dev/null) would also test
the I/O path between the drive and the CPU. If there are complaints about
that drive then, at this point, you should consider it suspect.
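Something along these lines is what I have in mind. Treat it as a rough
sketch: the function names, the DRY_RUN switch and the bs=1M block size are
my own assumptions for illustration, and /dev/sdc is just the drive from
your log. By default it only prints the commands so nothing touches the
disk until you set DRY_RUN=0.

```shell
#!/bin/sh
# Sketch only: DRY_RUN and the helper names are illustrative assumptions.
DRY_RUN="${DRY_RUN:-1}"   # 1 = print the commands instead of running them

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Read every sector of the device and discard the data. Any I/O error
# shows up in dd's output and in the kernel log.
read_test() {
    run dd if="$1" of=/dev/null bs=1M
}

# Ask the drive to start its extended (long) SMART self-test.
# The test runs inside the drive and can take hours.
start_long_test() {
    run smartctl -t long "$1"
}

# Show the drive's self-test log. Run this once the long test has had
# time to finish.
show_selftest_log() {
    run smartctl -l selftest "$1"
}
```

For example, "read_test /dev/sdc" followed by "start_long_test /dev/sdc",
and then "show_selftest_log /dev/sdc" some hours later.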
>
> > Try pulling the (candidate) compromised drive out of the array and see if
> > the (degraded) filesystem works OK and has good data. If it does, then I'd
> > guess that the pulled drive had bad data written to it somehow --- re-add it
> > (as if it was hot-swapped in), and hope it doesn't happen again.
> > Try that with each of the drives, in turn until you find the badly written
> > drive. If one of the drives has badly written data, the system really can't
> > tell, for sure, which one is wrong.
>
> I want to make sure I understand you here. Say my raid device is
> comprised of four devices /dev/md0 = /dev/sd[abcd], are you suggesting
> that for each drive I do something like this:
>
> mdadm /dev/md0 --fail /dev/sda --remove /dev/sda
Don't bother. If the drive got resynced, then pulling it won't do any good
unless software RAID gets silently confused by random data on one plex.
>
>
> then try to mount up the FS as usual to see if it is there? Wouldn't
> this point be moot if the device already re-assembled itself?
>
Yes, it would be moot.
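Since pulling the resynced drive won't tell you anything, one alternative
worth considering is md's own consistency check, which reads all members
and counts disagreements. The sketch below assumes your kernel exposes the
md sysfs files sync_action and mismatch_cnt under /sys/block/md0/md; the
function names are made up for illustration, and the directory is passed in
as an argument so you can point it wherever your array lives.

```shell
#!/bin/sh
# Sketch only: helper names are illustrative. $1 is the array's md sysfs
# directory, assumed here to be something like /sys/block/md0/md.

# Ask md to start a read-only consistency check of the array.
start_check() {
    echo check > "$1/sync_action"
}

# Succeeds (exit status 0) once no check or resync is in progress.
check_idle() {
    [ "$(cat "$1/sync_action")" = "idle" ]
}

# Print the mismatch count recorded by the most recent check.
md_mismatch_count() {
    cat "$1/mismatch_cnt"
}
```

Run "start_check /sys/block/md0/md" as root, wait until check_idle
succeeds, then look at md_mismatch_count. A non-zero count suggests the
members disagree somewhere, though it can't tell you which one is wrong.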
>
> >
> > [[ unless the array was read-only when the drive was dropped, then you will
> > only have any hope of good data with the dropped drive pulled ]]
>
> It wasn't read-only, but nothing was writing to it.
>
> Thanks for your time and prompt response.
> Sincerely,
> Dennison Williams
>
Unless noatime was set, the drive was being written to (if only atime
data). If all that got scrambled was atime data, you should still have been
able to mount the drive.
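If you want to check how the filesystem was mounted, something like the
following would do it. This is a minimal sketch: the function names are my
own, and /mnt/raid in the example is a hypothetical mount point since you
didn't say where the array is mounted. It reads /proc/mounts unless you
hand it another file in the same format.

```shell
#!/bin/sh
# Sketch only: helper names are illustrative assumptions.

# Print the mount options for a mount point.
# $1 = mount point, $2 = mounts table (defaults to /proc/mounts)
mount_opts() {
    awk -v mp="$1" '$2 == mp { opts = $4 } END { if (opts != "") print opts }' \
        "${2:-/proc/mounts}"
}

# Succeeds (exit status 0) if the mount point is mounted with noatime.
has_noatime() {
    case ",$(mount_opts "$1" "$2")," in
        *,noatime,*) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example: has_noatime /mnt/raid && echo "reads did not update atime"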
--
Stephen Samuel http://www.bcgreen.com
778-861-7641