Hi there,

We have 27TB of data on 6 OSTs distributed over 3 OSSes. Lustre version 1.6.7.2 on CentOS 4.6. After a power spike this weekend that crashed several machines (not the OSSes...) and/or possibly hitting 100% file space usage on one of them (we have been dangerously close for a while), it hung this morning. After restarting, it showed many files as missing. I decided to unmount them all and do an fsck.

I unmounted the file system from the MDS, logged in to the OSSes and started unmounting the OSTs. This went OK on two of the three, but on the third one, the umount command hangs with an error message that has something with _BUG in it (I can look it up tomorrow, if I still have it on the screen; I'm at home now).

Worryingly, if I do a "df" on that machine, I get 3% file usage:

[root@oss1 ~]# df
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/sda5            236062880   5911252 218160312   3% /
/dev/sda1               101086     10993     84874  12% /boot
none                   1803084         0   1803084   0% /dev/shm
/dev/sdb             236062880   5911252 218160312   3% /mnt/oss1-ost1
/dev/sdc             236062880   5911252 218160312   3% /mnt/oss1-ost2

It should be 98% or thereabouts! Now I am afraid that if I carry on (probably just cycling the power, since "reboot" also hangs), it will come back in the same state, i.e. 95% of the data gone. Is this already irreparably the case, or am I just paranoid? Any suggestions would be appreciated (in other words: HELP!!!!).

Before this, I had tried an "lfsck -c -l -f" on the mounted file system, but the sudden drop in disk usage on oss1 definitely only happened after I killed this and tried to umount by hand.

Cheers,

Herbert

--
Herbert Fruchtl
Senior Scientific Computing Officer
School of Chemistry, School of Mathematics and Statistics
University of St Andrews
--
The University of St Andrews is a charity registered in Scotland: No SC013532
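
One detail in the df output above may be worth checking before drawing conclusions: the two OST lines carry exactly the same block counts as the root filesystem line. A possible reading (an assumption, not something confirmed anywhere in this thread) is that df is reporting the root filesystem underneath the mount points rather than the OSTs themselves, in which case the 3% would say nothing about the OST contents. The sketch below just automates that comparison on the pasted output; the function names `parse_df` and `mounts_matching_root` are made up for illustration.

```python
# Minimal sketch: flag mount points whose df numbers are identical to "/".
# DF_OUTPUT is the output pasted in the message above; the helper names
# are hypothetical and not part of any Lustre tool.

DF_OUTPUT = """\
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/sda5 236062880 5911252 218160312 3% /
/dev/sda1 101086 10993 84874 12% /boot
none 1803084 0 1803084 0% /dev/shm
/dev/sdb 236062880 5911252 218160312 3% /mnt/oss1-ost1
/dev/sdc 236062880 5911252 218160312 3% /mnt/oss1-ost2
"""


def parse_df(text):
    """Parse plain `df` output into a list of dicts, one per filesystem."""
    rows = []
    for line in text.strip().splitlines()[1:]:  # skip the header line
        fields = line.split()
        if len(fields) != 6:
            continue  # wrapped or malformed line; ignored in this sketch
        device, blocks, used, avail, _pct, mount = fields
        rows.append({
            "device": device,
            "blocks": int(blocks),
            "used": int(used),
            "avail": int(avail),
            "mount": mount,
        })
    return rows


def mounts_matching_root(rows):
    """Return mount points reporting exactly the same sizes as "/"."""
    root = next(r for r in rows if r["mount"] == "/")
    key = (root["blocks"], root["used"], root["avail"])
    return [
        r["mount"]
        for r in rows
        if r["mount"] != "/" and (r["blocks"], r["used"], r["avail"]) == key
    ]


if __name__ == "__main__":
    for mount in mounts_matching_root(parse_df(DF_OUTPUT)):
        print(mount, "reports the same numbers as /")
```

On the output quoted above this flags both /mnt/oss1-ost1 and /mnt/oss1-ost2. Identical numbers are only a hint, so checking /proc/mounts on the OSS would be the more direct way to see what is actually mounted there.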
> After a power spike this weekend that crashed several machines
> (not the OSSes...) and/or possibly hitting 100% file space
> usage on one of them (we have been dangerously close for a
> while), it hung this morning.

That's fairly clear, but did you do any checks as to whether all the drives involved are entirely error free? How do you know your storage system is still good to use?

Also did you have battery backup for at least the storage HAs?

> After restarting, it showed many files as missing. [ ... ]
> Now I am afraid that if I carry on (probably just cycling the
> power, since "reboot" also hangs), it will come back in the
> same state, i.e. 95% of the data gone. Is this already
> irreparably the case, or am I just paranoid? Any suggestions
> would be appreciated (in other words: HELP!!!!).

There is one simple solution: restore backups. That's what they are for, situations like this. It is probably much faster than any attempt at recovery, if the backups are on suitable media. I think that in many cases restoring from backup is faster than running 'fsck' over damaged filesystems.

As to that, I reckon that it is often little appreciated that the most cost effective way to back up a large Lustre storage pool efficiently may be another Lustre storage pool, and Lustre can make pretty good backup servers (excellent sequential write rates from cheap low IOPS drives, over Ethernet).
Peter: I am glad you mention this. What is an appropriate backup tool for Lustre? I know with 2.x they will introduce ChangeLogs, but for people using 1.6.x what is a good tool? I suppose 'rsync', or drbd for realtime? What do you recommend?

On Mon, Feb 15, 2010 at 6:30 PM, Peter Grandi <pg_lus at lus.for.sabi.co.uk> wrote:
> [ ... ]
> There is one simple solution: restore backups. That's what they
> are for, situations like this. It is probably much faster than
> any attempt at recovery, if the backups are on suitable media.
> I think that in many cases restoring from backup is faster than
> running 'fsck' over damaged filesystems.
>
> As to that, I reckon that it is often little appreciated that
> the most cost effective way to backup efficiently a large Lustre
> storage pool may be another Lustre storage pool, and Lustre can
> make pretty good backup servers (excellent sequential write
> rates from cheap low IOPS drives, over Ethernet).