Peter Jones
2009-Apr-02 23:15 UTC
[Lustre-discuss] WARNING: Potential directory corruptions on the MDS with 1.6.7
A bug has been identified in 1.6.7 that can cause directory corruptions on the MDT. A patch and full details are in bug 18695: https://bugzilla.lustre.org/show_bug.cgi?id=18695

We recommend that anyone running 1.6.7 on the MDS unmount the MDT, run e2fsck against the MDT device, and apply the patch from bug 18695 as soon as possible.

Please note that the landing that caused the regression was the one for bug 11063, so anyone running with that patch on an earlier 1.6.x release should also follow the above procedure.

This fix will be included in 1.8.0, and we will also create an ad hoc 1.6.7.1 release to provide this fix as soon as possible. 1.6.7 will be withdrawn from the Sun Download Center.
Thomas Wakefield
2009-Apr-03 16:55 UTC
[Lustre-discuss] WARNING: Potential directory corruptions on the MDS with 1.6.7
Any idea on the timeline for 1.6.7.1? Will it be out today, or just sometime soon?

Thanks,

Thomas Wakefield
Systems Administrator
COLA/IGES
Peter Kjellstrom
2009-Apr-06 14:41 UTC
[Lustre-discuss] WARNING: Potential directory corruptions on the MDS with 1.6.7
On Friday 03 April 2009, Thomas Wakefield wrote:
> Any idea on the timeline for 1.6.7.1? Will it be out today, or just
> sometime soon?

Knowing whether it's hours away or awaiting a complete QA cycle would be nice. That would decide whether one should wait or rebuild with the patch.

/Peter
Oleg Drokin
2009-Apr-06 19:03 UTC
[Lustre-discuss] WARNING: Potential directory corruptions on the MDS with 1.6.7
Hello!

On Apr 6, 2009, at 10:41 AM, Peter Kjellstrom wrote:
> On Friday 03 April 2009, Thomas Wakefield wrote:
>> Any idea on the timeline for 1.6.7.1? Will it be out today, or just
>> sometime soon?
> Knowing if it's hours away or awaiting a complete qa-cycle would be
> nice. That would decide if one should wait or rebuild with the patch.

We have a tag in CVS and it is undergoing a QA cycle as we speak. I cannot give you a specific timeline, though, since I am not the one doing the final release verification. It might be that somebody more clued in on the status will step up and update us.

Bye,
Oleg
Andrea Rucks
2009-Apr-08 21:58 UTC
[Lustre-discuss] WARNING: Potential directory corruptions on the MDS with 1.6.7
Hi there,

For a new production system, we downloaded Lustre 1.6.7 a couple of weeks ago and have installed and configured it. I just read this warning:

> A bug has been identified in 1.6.7 that can cause directory corruptions
> on the MDT. A patch and full details are in bug 18695 -
> https://bugzilla.lustre.org/show_bug.cgi?id=18695
>
> We recommend to anyone running 1.6.7 on the MDS to unmount the MDT, run
> e2fsck against the MDT device and apply the patch from bug 18695 as soon
> as possible.

I've visited this page, but I am uncertain how to "apply the patch" (does it require compiling?). Are there any instructions available, or could someone point me to a FAQ?

Cheers and thanks,

Ms. Andrea D. Rucks
Sr. Unix Systems Administrator, Lawson ITS Unix Server Team
Lawson
380 St. Peter Street
St. Paul, MN 55102
Tel: 651-767-6252
http://www.lawson.com
Andreas Dilger
2009-Apr-09 09:10 UTC
[Lustre-discuss] WARNING: Potential directory corruptions on the MDS with 1.6.7
On Apr 08, 2009 16:58 -0500, Andrea Rucks wrote:
> I've visited this page, but am uncertain as to how to "apply the patch"
> (does it require compiling?). Are there any instructions available, or
> perhaps someone could point me to a FAQ?

I would instead suggest downgrading the RPMs on your MDS to 1.6.6 until the 1.6.7.1 packages are available.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
Mag Gam
2009-Apr-09 12:20 UTC
[Lustre-discuss] WARNING: Potential directory corruptions on the MDS with 1.6.7
Is there a bugzilla entry for 1.6.7 where we can see all the patches available? I would like to test out 1.6.7 but patch it myself as a learning experience. I also want to build a hardened version of 1.6.7 and possibly distribute it.
Brian J. Murrell
2009-Apr-09 12:39 UTC
[Lustre-discuss] WARNING: Potential directory corruptions on the MDS with 1.6.7
On Thu, 2009-04-09 at 08:20 -0400, Mag Gam wrote:
> Is there a bugzilla entry for 1.6.7 where we can see all the patches
> available?

All of the patches for what? That made 1.6.6 become 1.6.7?

> I would like to test out 1.6.7 but patch it myself for
> learning experience.

So, yes, I do think you mean all of the patches that turn 1.6.6 into 1.6.7, yes?

In any case, all of the patches that go into a release x.y.z are tracked by the bugzilla bug (alias) xyz-tracking. Alternatively, you could probably search for all bugs with the landedx.y.z flag set.

Be forewarned, though: there may be ordering dependencies between patches, where one patch assumes an earlier one has already been applied. If you try to apply those out of order, they will likely not go in very easily.

> Also I want to build a hardened version of 1.6.7

What do you mean by "hardened", exactly?

b.
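The ordering caveat above can be checked mechanically. The following is a minimal sketch, not a Lustre tool: it assumes GNU patch, a hypothetical "series" file that lists patch files one per line in the order they should be applied, and patches whose paths need one leading component stripped (the -p1 level is an assumption and may differ for patches attached to a given bug). Each patch is dry-run against the tree as left by the previous patches, and the loop stops at the first one that does not apply cleanly.

```shell
#!/bin/sh
# apply_series TREE SERIES_FILE
# Apply the patches named in SERIES_FILE (one path per line, in order) to TREE.
# Assumptions: GNU patch, and patch paths that need -p1 stripping.
apply_series() {
    tree=$1
    series=$2
    while IFS= read -r p; do
        # Skip blank lines and '#' comments in the series file.
        case $p in ''|'#'*) continue ;; esac
        # Dry-run against the current state of the tree; nothing is modified.
        if ! patch -d "$tree" -p1 --dry-run -s < "$p" >/dev/null 2>&1; then
            echo "STOP: $p does not apply cleanly" >&2
            return 1
        fi
        # The dry run succeeded, so apply the patch for real.
        patch -d "$tree" -p1 -s < "$p" || return 1
        echo "applied: $p"
    done < "$series"
}
```

A call such as `apply_series /usr/src/lustre-1.6.7 /tmp/series` would then either apply every listed patch or stop with the tree patched only up to the last clean one. The patch paths in the series file should be absolute, or relative to the directory the function is called from.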
Andrea Rucks
2009-Apr-09 19:42 UTC
[Lustre-discuss] WARNING: Potential directory corruptions on the MDS with 1.6.7
Hi Andreas and Lustre folks,

After asking the noob question on how to patch Lustre, I figured out how to apply the patch from bug 18695 and compile Lustre 1.6.7 from the source code last night. I read up on how the patch command works, applied the 4 patches and then compiled it. The only issue I encountered was with 18695-client-patch.patch: hunk #1 of 2 wouldn't go into file.c, so I manually edited the file and added the one-liner where the patch wanted it. I hated to do it that way, but I'm working on a tight deadline to turn this over for testing.

For peace of mind, as we hadn't migrated our data into our filesystems yet, I re-ran my mkfs.lustre --reformat scripts on all my MGS / MDS / OSS filesystems. Our RHEL 5.3 XEN systems are loading the patched MGS / MDS / OSS server and client Lustre modules and mounting the filesystems. We're crossing our fingers and hoping we won't have the data corruption issues the folks at Tokyo University experienced.

For any new Lustre admin who's interested in what I did to apply the patch, here's a high-level view:

1. Copied bug18695.patch, bug18695_LASSERT.patch, and 18695-regression-test.patch to the server where I'd loaded the Lustre source code.

2. Took a tarball of the existing /usr/src/lustre-1.6.7 directory (in case I messed things up).

3. Searched for more information on patch, as I'm definitely not a developer. I read the man page, found this blog article from Sun that has a blurb about patch: http://blogs.sun.com/mberg/entry/building_lustre_1_6_4, and found this helpful, old doc that talks about unified diffs, rcsdiff and patch: http://www.linuxjournal.com/article/1237

4. Found the files being patched and did the following for the Lustre server side source code:

    SYNTAX: patch <file-that-needs-patching> <patch-that-fixes file>

    cd /usr/src/lustre-1.6.7
    patch lustre/mds/mds_open.c /tmp/bug18695.patch
    patch lustre/lvfs/fsfilt_ext3.c /tmp/bug18695_LASSERT.patch
    patch lustre/tests/sanityN.sh /tmp/18695-regression-test.patch

5. Followed the walkthrough from Joey Jablonski's blog for compiling Lustre (very helpful and similar to the Lustre Quick Start Guide): http://mergingbusinessandit.blogspot.com/2008/10/building-lustre-1651-against-latest.html

6. Uninstalled my old lustre-1.6.7*rpms:

    umount /<lustre filesystem(s)>
    modprobe -r lustre
    rpm -qa | grep lustre
    rpm -e --nodeps <old lustre-1.6.7*rpm(s)>

7. Installed the newly compiled, bug 18695 patched lustre-1.6.7*rpms (I already had a compiled and patched Lustre XEN kernel and Sun's e2fsprogs in place from before):

    rpm -ivh <newly compiled lustre-1.6.7*rpms>
    depmod -a
    modprobe lustre

8. Followed the walkthrough from Joey's blog on how to build a Lustre filesystem, mounted it up and ran some file creation loops through it: http://mergingbusinessandit.blogspot.com/2008/10/building-new-lustre-filesystem.html

Joey's blogs were real life savers. Hope this helps someone else!

Cheers,

Ms. Andrea D. Rucks
Sr. Unix Systems Administrator, Lawson ITS Unix Server Team
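The "patch <file> <patchfile>" form used in step 4 can be preceded by a dry run, which would have flagged the rejected hunk in file.c before anything was modified. This is a hedged sketch assuming GNU patch; the helper name try_patch is hypothetical and not part of Lustre or of the bug 18695 attachments.

```shell
#!/bin/sh
# try_patch FILE PATCHFILE
# Check that every hunk applies before modifying FILE (GNU patch assumed).
try_patch() {
    target=$1
    diff_file=$2
    # --dry-run reports success or failure without touching the target file.
    if patch --dry-run "$target" "$diff_file" >/dev/null 2>&1; then
        # -b asks patch to keep a backup of the original file.
        patch -b "$target" "$diff_file" >/dev/null && echo "patched: $target"
    else
        echo "not applied (at least one hunk rejected): $target" >&2
        return 1
    fi
}
```

For example, `try_patch lustre/mds/mds_open.c /tmp/bug18695.patch` run from the top of the source tree either patches the file or leaves it untouched. When a hunk is rejected, applying the patch for real writes the failed hunk to a .rej file next to the target, which shows exactly what has to be merged by hand.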
Michael D. Seymour
2009-Apr-09 21:11 UTC
[Lustre-discuss] WARNING: Potential directory corruptions on the MDS with 1.6.7
Hi Peter, all,

Is there an accepted procedure for recovering from any errors introduced by this bug, i.e. performing e2fsck with the --mdsdb option on the MDT, then lfsck on the OSTs? Or does one simply run e2fsck on the unmounted MDT, downgrade and remount?

I performed the following on one of our 17 TB Lustre filesystems, which contains disposable data:

    umount mdt
    e2fsck /dev/md2    # mdt device
    Say yes to all repair queries
    downgrade to 1.6.6
    mount mdt

This resulted in <100 files out of 587k that had ? ? ? directory entries, but everything else seems fine. I have not performed any checks of file consistency.

We have a second Lustre file system that stores permanent data, and I don't want to risk any lost or corrupt files.

Thanks for any help,
Mike S.

--
Michael D. Seymour                Phone: 416-978-1776
Scientific Computing Support      Fax: 416-978-3921
Canadian Institute for Theoretical Astrophysics, University of Toronto
Andreas Dilger
2009-Apr-09 23:05 UTC
[Lustre-discuss] WARNING: Potential directory corruptions on the MDS with 1.6.7
On Apr 09, 2009 17:11 -0400, Michael D. Seymour wrote:
> Is there an accepted procedure for recovering from any introduced errors from
> this bug? i.e. performing e2fsck with the --mdsdb option on the MDT, lfsck on
> the OSTs? Or simply do an e2fsck on the unmounted MDT, downgrade and remount?

No, there is no Lustre-specific mechanism for recovery from this problem. The e2fsck run may result in files being put into the underlying lost+found directory, which you might consider moving into a newly created ROOT/lost+found directory by mounting the MDS as "-t ldiskfs". You shouldn't just move the filesystem lost+found directory itself, as that can cause trouble at a later time.

Cheers, Andreas
Michael D. Seymour
2009-Apr-10 00:26 UTC
[Lustre-discuss] WARNING: Potential directory corruptions on the MDS with 1.6.7
Andreas Dilger wrote:
> No, there is no lustre-specific mechanism for recovery for this
> problem. This may result in files being put into the underlying
> lost+found directory, which you might consider moving into a
> newly-created ROOT/lost+found directory by mounting the MDS as
> "-t ldiskfs". You shouldn't just move the filesystem lost+found
> directory, as that can cause trouble at a later time.

So this would be the course of action.

> umount /lustre/mdt
> e2fsck /dev/md2 # mdt device
> # Say yes to all repair queries
# Here then one would:
mkdir /root/MDT-lost+found
mount -t ldiskfs /dev/md2 /mnt/tmp
rsync -a /mnt/tmp/lost+found/ /root/MDT-lost+found
> downgrade to 1.6.6
> mount -t lustre /dev/md2 /lustre/mdt

I am unclear what use the leftover files placed in lost+found from the MDT fs could be.

Thanks Andreas for your help,
Mike
Andreas Dilger
2009-Apr-10 17:27 UTC
[Lustre-discuss] WARNING: Potential directory corruptions on the MDS with 1.6.7
On Apr 09, 2009 20:26 -0400, Michael D. Seymour wrote:
> So this would be the course of action.
>
> > umount /lustre/mdt
> > e2fsck /dev/md2 # mdt device
> > # Say yes to all repair queries
> # Here then one would:
> mkdir /root/MDT-lost+found
> mount -t ldiskfs /dev/md2 /mnt/tmp
> rsync -a /mnt/tmp/lost+found/ /root/MDT-lost+found

No, you need to keep the files inside the MDT filesystem:

    mount -t ldiskfs /dev/md2 /mnt/tmp
    mkdir /mnt/tmp/ROOT/lost+found
    mv /mnt/tmp/lost+found/* /mnt/tmp/ROOT/lost+found/

Cheers, Andreas
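Putting the messages in this thread together, the whole sequence can be sketched as one script. This is an illustrative outline under stated assumptions, not a procedure endorsed by the Lustre developers: the device and mount points are the ones quoted in the thread and will differ on other systems, the variable and function names are hypothetical, and the final unmount of the temporary ldiskfs mount is an added assumption (the thread does not spell it out). By default the script only prints the commands; nothing runs unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch of the MDT recovery sequence discussed in this thread.
# All paths are the examples from the thread; adjust for the real system.
MDT_DEV=${MDT_DEV:-/dev/md2}       # MDT block device
TMP_MNT=${TMP_MNT:-/mnt/tmp}       # temporary ldiskfs mount point
MDT_MNT=${MDT_MNT:-/lustre/mdt}    # normal Lustre MDT mount point
DRY_RUN=${DRY_RUN:-1}              # 1 = print commands only (default)

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

recover_mdt() {
    run umount "$MDT_MNT"
    # Interactive check; the thread answers yes to all repair queries.
    run e2fsck "$MDT_DEV"
    # Mount the backing filesystem directly to reach its lost+found.
    run mount -t ldiskfs "$MDT_DEV" "$TMP_MNT"
    run mkdir -p "$TMP_MNT/ROOT/lost+found"
    # Move the recovered entries, not the lost+found directory itself.
    # In dry-run mode this lists nothing unless the device is already mounted.
    for f in "$TMP_MNT"/lost+found/*; do
        [ -e "$f" ] || continue
        run mv "$f" "$TMP_MNT/ROOT/lost+found/"
    done
    run umount "$TMP_MNT"
    # Downgrade to 1.6.6 or install patched packages here, before remounting.
    run mount -t lustre "$MDT_DEV" "$MDT_MNT"
}

recover_mdt
```

Keeping the recovered entries under ROOT/lost+found leaves them inside the MDT filesystem, which is the point of the correction above; copying them to /root on the MDS node would take them out of it.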