Hi,

We encountered a multi-disk failure on one of our mdadm RAID6 8+2 OSTs. 2 drives failed in the array within the space of a couple of hours and were replaced. It is questionable whether both drives are actually bad, because we are seeing the same behavior in a test environment where a bad drive is actually causing a good drive to be kicked out of an array. Unfortunately, another of the drives encountered IO errors during the resync process and failed, causing the array to go out to lunch. The resync process was attempted twice with the same result.

Fortunately I am able (at least for now) to assemble the array with the remaining 8 of the 10 drives, and I am able to fsck, mount via ldiskfs and Lustre, and am in the process of copying files from the vulnerable OST to a backup location using "lfs find --obd <target> /scratch | cpio -puvdm ...".

My question is: what is the best way to restore the OST? Obviously I will need to somehow restore the array to its full 8+2 configuration. Whether we need to start from scratch or use some other means, that is our first priority. But I would like to make the recovery as transparent to the users as possible.

One possible option that we are considering is simply removing the OST from Lustre, fixing the array, and copying the recovered files to a newly created OST (not desirable). Another is to fix the OST (not remove it from Lustre), delete the files that exist, and then copy the recovered files back. The problem that comes to mind in either scenario is: what happens if a file is part of a striped file? Does it lose its affinity with the rest of the stripe?

Another scenario that we are wondering about: if we mount the OST via ldiskfs and copy everything on the file system to a backup location, fix the array while maintaining the same tunefs.lustre configuration, and then move everything back using the same method as it was backed up, will the files be presented to Lustre (MDS and clients) just as before when mounted as a Lustre file system?

Thanks in advance for your advice and help.

Joe Mervini
Sandia National Laboratories
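[A minimal sketch of the evacuation pipeline described above, assuming a hypothetical OST UUID of scratch-OST0007_UUID and /backup as the copy destination:

    # Find every file in /scratch with at least one object on the failing OST
    # and copy it into /backup, preserving paths, permissions and mtimes.
    lfs find --obd scratch-OST0007_UUID /scratch | cpio -puvdm /backup

cpio -p runs in pass-through mode, reading the file list from stdin and recreating it under the destination directory.]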
On 2010-05-20, at 20:25, Mervini, Joseph A wrote:
> We encountered a multi-disk failure on one of our mdadm RAID6 8+2 OSTs. 2 drives failed in the array within the space of a couple of hours and were replaced.

I guess the need for +3 parity is closer than we think...

> Fortunately I am able (at least for now) to assemble the array with the existing 8/10 arrays and am able to fsck, mount via ldiskfs and lustre and am in the process of copying files from the vulnerable OST to a backup location using "lfs find --obd <target> /scratch|cpio -puvdm ..."

I'm assuming at this point you also have the OST in question deactivated on the MDS (lctl --device N deactivate), so that it is not getting new files as well?

If you track the original files that were successfully copied, you could rename the new files back over top of the old ones, and remove any trace of the old file. Another option would be to use the "lfs_migrate" script (see bugzilla), which essentially does this, with a data check in between. Note that it isn't totally safe for a live system, since it has no way to know which files are in use while it is copying them, but I'm assuming at this point that is irrelevant.

> My question is: What is the best way to restore the OST? Obviously I will need to somehow restore the array to its full 8+2 configuration. Whether we need to start from scratch or use some other means, that is our first priority. But I would like to make the recovery as transparent to the users as possible.
>
> One possible option that we are considering is simply removing the OST from Lustre, fixing the array and copying the recovered files to a newly created OST (not desirable).

I'd try to avoid this option; it leaves the old OST around forever. If you decide to erase the old OST, one option is to just copy over the base config files (/CONFIGS/*, /CATALOGS, /O/0/LAST_ID, /last_rcvd) to the new OST. This of course should be done after migrating or otherwise deleting the objects that are on this OST.

> Another is to fix the OST (not remove it from Lustre), delete the files that exist and then copy the recovered files back. The problem that comes to mind in either scenario is what happens if a file is part of a striped file? Does it lose its affinity with the rest of the stripe?

I'm not sure what you mean by "affinity" here. If you copy the file to a new file it will normally get the default striping, but there is no way from userspace to "break" the striping of a file. If an object on that OST is missing, then the copy will return EIO for that file and you need to restore it from backup.

Note there is a lustre-patched tar which would keep the original file striping. Any tool can preserve the file striping via xattrs, and doesn't have to know anything about Lustre internals:

    char buf[65536];
    ssize_t xattr_size = getxattr("/path/to/file", "lustre.lov", buf, sizeof(buf));
    mknod("/path/to/new_file", S_IFREG | 0644, 0);
    setxattr("/path/to/new_file", "lustre.lov", buf, xattr_size, 0);

> Another scenario that we are wondering about is if we mount the OST via ldiskfs and copy everything on the file system to a backup location, fix the array maintaining the same tunefs.lustre configuration, then move everything back using the same method as it was backed up, will the files be presented to lustre (mds and clients) just as it was before when mounted as a lustre file system?

The easiest option is to just do a block-device-level copy of the whole filesystem to a new LUN, and then run e2fsck on that.
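[A sketch of the config-file copy described above, assuming the old and replacement OST devices are /dev/old_ost and /dev/new_ost (hypothetical names) and that the objects have already been migrated off or deleted from the old OST:

    # Mount both OSTs directly as ldiskfs (no Lustre services running on them).
    mkdir -p /mnt/ost_old /mnt/ost_new
    mount -t ldiskfs /dev/old_ost /mnt/ost_old
    mount -t ldiskfs /dev/new_ost /mnt/ost_new

    # Carry over only the per-OST configuration and state files.
    cp -a /mnt/ost_old/CONFIGS /mnt/ost_new/
    cp -a /mnt/ost_old/CATALOGS /mnt/ost_new/
    mkdir -p /mnt/ost_new/O/0
    cp -a /mnt/ost_old/O/0/LAST_ID /mnt/ost_new/O/0/
    cp -a /mnt/ost_old/last_rcvd /mnt/ost_new/

    umount /mnt/ost_old /mnt/ost_new
]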
Next best is to format a new OST with mkfs.lustre, set the label by hand to match the old OST name via "tune2fs -L {fsname}-OSTNNNN", then mount it via ldiskfs and copy the files over.

Note when copying (or backing up and restoring) the objects on the OST you should preserve the xattrs using some tool that can handle this (e.g. RHEL tar, or rsync 3.x), since there is recovery information stored in the object xattrs. The OST xattrs are not needed for normal operation, but if you have disk corruption and can run e2fsck and then ll_recover_lost_found_objs you'll be happy to get your data back.

The clients and OST code will not be able to tell the difference between the old and replacement OSTs.

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.
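[A hedged sketch of the two recovery paths from the message above; the device names, fsname "scratch", OST label "scratch-OST0007" and the MGS NID are all hypothetical placeholders:

    # Path 1: block-level copy of the old OST to a new LUN, then check it.
    dd if=/dev/old_ost of=/dev/new_lun bs=4M
    e2fsck -fy /dev/new_lun

    # Path 2: format a fresh OST, relabel it to the old OST's name, then copy
    # the files while preserving xattrs (rsync 3.x with -X, run as root).
    mkfs.lustre --ost --fsname=scratch --mgsnode=10.0.0.1@tcp0 /dev/new_lun
    tune2fs -L scratch-OST0007 /dev/new_lun
    mkdir -p /mnt/ost_old /mnt/ost_new
    mount -t ldiskfs /dev/old_ost /mnt/ost_old
    mount -t ldiskfs /dev/new_lun /mnt/ost_new
    rsync -aX /mnt/ost_old/ /mnt/ost_new/
]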
>> We encountered a multi-disk failure on one of our mdadm RAID6
>> 8+2 OSTs. 2 drives failed in the array within the space of a
>> couple of hours and were replaced.

There are many reports of multi-drive failures, some pretty impressive, e.g. 10 out of 20 on a long-running array after a restart. Because of common modes, that is not unexpected, as failures are not uncorrelated (especially when rebuilding!).

> I guess the need for +3 parity is closer than we think...

Some people are pushing this, and I guess that you are thinking about the arguments here:

  http://blogs.sun.com/ahl/entry/acm_triple_parity_raid

But I think it is simply stupid -- adding more parity makes things slower and less reliable (e.g. more complexity), especially if one takes "advantage" of the false sense of security of more parity to have wider arrays. I'd rather have, in the few cases where it makes sense, a narrower RAID5 than a wider RAID6, for example (e.g. two 4+1 RAID5s instead of one 8+2 RAID6). The usual arguments apply:

  http://WWW.BAARF.com/

plus that "stupid" is usually rewarded by "management", who see the obvious reduction in cost but don't see the reductions in performance, simplicity and reliability. Note that one argument in the page above is "fills a niche", and as long as it is acknowledged that it is a minuscule niche, that is fine; but then "need for +3 parity" is a rather wider statement.

If an 8+2 array had 2 drive failures, perhaps instead of looking at more parity it would be better to look at common modes of failure; and not just vibration, heat or electrical common modes, but also the thoroughly moronic practice of many RAID vendors (e.g. EMC, DDN, NexSAN in my direct experience, but most/all do that) of putting into their arrays drives not only of the same manufacturer and model, but even with nearly consecutive serial numbers from the same delivery and even the same carton.

And in any case, if one uses something like Lustre 1.x, which is a parallel metafilesystem with no data redundancy (and for very good reasons; mirroring in 2.x is something that I have very mixed feelings about), using parity RAID is doubly stupid, as the storage layer has to provide all the redundancy. And in any case one cannot build storage systems that never fail; what matters more is what happens when they do fail. As to that, fortunately, Lustre does pretty well.