thr3ads.net - Lustre discuss - [Lustre-discuss] disappeared data from OST [Feb 2010]


Herbert Fruchtl

2010-Feb-15 21:08 UTC

[Lustre-discuss] disappeared data from OST

Hi there,

We have 27TB of data on 6 OSTs distributed over 3 OSSes. Lustre version 1.6.7.2 on
CentOS 4.6.

After a power spike this weekend that crashed several machines (not the
OSS'es...) and/or possibly hitting 100% file space usage on one of them (we have
been dangerously close for a while), it hung this morning. After restarting, it
showed many files as missing. I decided to unmount them all and do an fsck.

I unmounted the file system from the MDS, logged in to the OSSes and started
unmounting the OSTs. This went OK on two of the three, but on the third one, the
umount command hangs with an error message that has something with _BUG in it (I
can look it up tomorrow, if I still have it on the screen; I'm at home now).
Worryingly, if I do a "df" on that machine, I get 3% file usage:
[root@oss1 ~]# df
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/sda5            236062880   5911252 218160312   3% /
/dev/sda1               101086     10993     84874  12% /boot
none                   1803084         0   1803084   0% /dev/shm
/dev/sdb             236062880   5911252 218160312   3% /mnt/oss1-ost1
/dev/sdc             236062880   5911252 218160312   3% /mnt/oss1-ost2

It should be 98% or thereabouts! Now I am afraid that if I carry on (probably
just cycling the power, since "reboot" also hangs), it will come back in the
same state, i.e. 95% of the data gone. Is this already irreparably the case, or
am I just paranoid?
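
A side note on the "df" listing above: the rows for /dev/sda5, /dev/sdb and /dev/sdc report exactly the same block counts. One possible reading (an assumption, not something confirmed in this thread) is that "df" is reporting the root file system for mount entries that are stale, rather than the OSTs having lost data. A minimal Python sketch that flags such identical rows follows; the function name `find_identical_usage` is hypothetical, and the input is simply the listing quoted above.

```python
from collections import defaultdict

# The "df" listing from the message, copied verbatim.
DF_OUTPUT = """\
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/sda5            236062880   5911252 218160312   3% /
/dev/sda1               101086     10993     84874  12% /boot
none                   1803084         0   1803084   0% /dev/shm
/dev/sdb             236062880   5911252 218160312   3% /mnt/oss1-ost1
/dev/sdc             236062880   5911252 218160312   3% /mnt/oss1-ost2
"""


def find_identical_usage(df_text):
    """Group df rows by (blocks, used, available) and return the groups
    that contain more than one row.

    Identical figures across different devices are only a hint that
    something is off (assumption: e.g. stale mount entries); they are
    not proof of anything.
    """
    groups = defaultdict(list)
    for line in df_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 6:
            continue  # ignore blank or wrapped lines
        device = fields[0]
        blocks, used, avail = int(fields[1]), int(fields[2]), int(fields[3])
        mount_point = fields[5]
        groups[(blocks, used, avail)].append((device, mount_point))
    return {key: rows for key, rows in groups.items() if len(rows) > 1}


suspicious = find_identical_usage(DF_OUTPUT)
for (blocks, used, avail), rows in suspicious.items():
    print(f"{len(rows)} entries share blocks={blocks} used={used} avail={avail}:")
    for device, mount_point in rows:
        print(f"  {device} on {mount_point}")
```

If rows are flagged, comparing /proc/mounts with the "df" output on that machine would be a reasonable next check before drawing any conclusion about the data.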

Any suggestions would be appreciated (in other words: HELP!!!!).

Before this, I had tried an "lfsck -c -l -f" on the mounted file system, but the
sudden drop in disk usage on oss1 definitely only happened after I killed this
and tried to umount by hand.

Cheers,

   Herbert
-- 
Herbert Fruchtl
Senior Scientific Computing Officer
School of Chemistry, School of Mathematics and Statistics
University of St Andrews
--
The University of St Andrews is a charity registered in Scotland:
No SC013532

Peter Grandi

2010-Feb-15 23:30 UTC


> After a power spike this weekend that crashed several machines
> (not the OSS'es...) and/or possibly hitting 100% file space
> usage on one of them (we have been dangerously close for a
> while), it hung this morning.

That's fairly clear, but did you do any checks as to whether all
the drives involved are entirely error free? How do you know
your storage system is still good to use?

Also did you have battery backup for at least the storage HAs?

> After restarting, it showed many files as missing. [ ... ]
> Now I am afraid that if I carry on (probably just cycling the
> power, since "reboot" also hangs), it will come back in the
> same state, i.e. 95% of the data gone. Is this already
> irreparably the case, or am I just paranoid?  Any suggestions
> would be appreciated (in other words: HELP!!!!).

There is one simple solution: restore backups. That's what they
are for, situations like this. It is probably much faster than
any attempt at recovery, if the backups are on suitable media.
I think that in many cases restoring from backup is faster than
running 'fsck' over damaged filesystems.

As to that, I reckon that it is often little appreciated that
the most cost-effective way to back up a large Lustre storage
pool efficiently may be another Lustre storage pool, and Lustre
can make pretty good backup servers (excellent sequential write
rates from cheap low-IOPS drives, over Ethernet).

Mag Gam

2010-Feb-17 02:52 UTC


Peter:

I am glad you mention this.

What is an appropriate backup tool for Lustre? I know with 2.x they
will introduce ChangeLogs, but for people using 1.6.x what is a good
tool? I suppose 'rsync' or drbd for realtime? What do you
recommend?



On Mon, Feb 15, 2010 at 6:30 PM, Peter Grandi <pg_lus at lus.for.sabi.co.uk> wrote:
>
>> After a power spike this weekend that crashed several machines
>> (not the OSS'es...) and/or possibly hitting 100% file space
>> usage on one of them (we have been dangerously close for a
>> while), it hung this morning.
>
> That's fairly clear, but did you do any checks as to whether all
> the drives involved are entirely error free? How do you know
> your storage system is still good to use?
>
> Also did you have battery backup for at least the storage HAs?
>
>> After restarting, it showed many files as missing. [ ... ]
>> Now I am afraid that if I carry on (probably just cycling the
>> power, since "reboot" also hangs), it will come back in the
>> same state, i.e. 95% of the data gone. Is this already
>> irreparably the case, or am I just paranoid?  Any suggestions
>> would be appreciated (in other words: HELP!!!!).
>
> There is one simple solution: restore backups. That's what they
> are for, situations like this. It is probably much faster than
> any attempt at recovery, if the backups are on suitable media.
> I think that in many cases restoring from backup is faster than
> running 'fsck' over damaged filesystems.
>
> As to that, I reckon that it is often little appreciated that
> the most cost effective way to backup efficiently a large Lustre
> storage pool may be another Lustre storage pool, and Lustre can
> make pretty good backup servers (excellent sequential write
> rates from cheap low IOPS drives, over Ethernet).
