We are running Lustre x86_64 2.6.22.14 version 1.6.7 with 1 MGS/MDT and 14 OSTs. The past few days I've been fighting a problem with one user who is storing roughly 50,000 >1k files in many subdirectories in our Lustre filesystem. Saturday I had to repair the metadata server with e2fsck, and this morning everything was fine until he started another batch of jobs submitted to our PBS queue.

Lustre: laredofs-MDT0000: sending delayed replies to recovered clients
Lustre: laredofs-MDT0000: recovery complete: rc 0
Lustre: MDS laredofs-MDT0000: laredofs-OST0000_UUID now active, resetting orphans
Lustre: MDS laredofs-MDT0000: laredofs-OST0001_UUID now active, resetting orphans
Lustre: Client laredofs-client has started
nph-mascot.exe[3816]: segfault at 0000000000000018 rip 000000000041bb45 rsp 00007fffa412e890 error 6

The above messages lasted all week with no errors. This morning I see this:

Lustre: 22640:0:(lustre_fsfilt.h:330:fsfilt_setattr()) laredofs-MDT0000: slow setattr 48s
Lustre: 22644:0:(lustre_fsfilt.h:229:fsfilt_start_log()) laredofs-MDT0000: slow journal start 32s
LDISKFS-fs error (device sdb2): ldiskfs_add_entry: bad entry in directory #12731224: inode out of bounds - offset=1900, inode=1953587812, rec_len=204, name_len=194
Aborting journal on device sdb2.
Remounting filesystem read-only
LustreError: 22627:0:(fsfilt-ldiskfs.c:280:fsfilt_ldiskfs_start()) error starting handle for op 8 (33 credits): rc -30
LustreError: 22627:0:(mds_reint.c:154:mds_finish_transno()) fsfilt_start: -30

I managed to unmount the OSTs and the MDT, but a kernel panic prevented me from running e2fsck on the metadata server, so I simply rebooted it. Then I ran e2fsck and it found inode problems, all within his directories; luckily e2fsck fixed them. Now everything is back to normal and his jobs are processing.

Currently, the stripe set on his directory is 128k (this is our default stripe). I'm curious if I need to set a smaller stripe on his directories with those 50,000+ files.

-- 
Jeremy Mann
jeremy at biochem.uthscsa.edu
University of Texas Health Science Center
Bioinformatics Core Facility
http://www.bioinformatics.uthscsa.edu
Phone: (210) 567-2672
Johann Lombardi
2010-Feb-25 17:20 UTC
[Lustre-discuss] Adjusting stripe for 50,000+ files?
On Thu, Feb 25, 2010 at 11:04:25AM -0600, Jeremy Mann wrote:
> We are running Lustre x86_64 2.6.22.14 version 1.6.7 with 1 MGS/MDT and 14
                                                 ^^^^^
Please note that we released 1.6.7.1 shortly after 1.6.7 in order to address a MDS corruption which was bug 18695.

> LDISKFS-fs error (device sdb2): ldiskfs_add_entry: bad entry in directory
> #12731224: inode out of bounds - offset=1900, inode=1953587812,
> rec_len=204, name_len=194

This really looks like you are hitting bug 18695. I would really recommend upgrading to 1.6.7.2 or 1.8.2. Please note that your MDS can be severely damaged by bug 18695, so you should run e2fsck against the MDT device before upgrading.

Johann
Johann Lombardi wrote:
> On Thu, Feb 25, 2010 at 11:04:25AM -0600, Jeremy Mann wrote:
>> We are running Lustre x86_64 2.6.22.14 version 1.6.7 with 1 MGS/MDT and 14
>                                                  ^^^^^
> Please note that we released 1.6.7.1 shortly after 1.6.7 in order to
> address a MDS corruption which was bug 18695.
>
>> LDISKFS-fs error (device sdb2): ldiskfs_add_entry: bad entry in directory
>> #12731224: inode out of bounds - offset=1900, inode=1953587812,
>> rec_len=204, name_len=194
>
> This really looks like you are hitting bug 18695. I would really recommend
> upgrading to 1.6.7.2 or 1.8.2. Please note that your MDS can be severely
> damaged by bug 18695, so you should run e2fsck against the MDT device
> before upgrading.

Thank you Johann, I will get right on this.

-- 
Jeremy Mann
jeremy at biochem.uthscsa.edu
University of Texas Health Science Center
Bioinformatics Core Facility
http://www.bioinformatics.uthscsa.edu
Phone: (210) 567-2672
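
For anyone who wants to watch their MDS logs for the same symptom, the LDISKFS-fs error quoted in this thread can be picked out of a syslog with a small script. The following is only a rough sketch: the regular expression is inferred from the single message posted above, the names `BAD_ENTRY` and `find_bad_entries` are hypothetical, and other kernels or Lustre versions may word the message differently. A match means only that the message is present, not that bug 18695 is the cause.

```python
import re

# Pattern inferred from the one LDISKFS-fs error line quoted in this thread.
# This is an assumption, not a documented log format.
BAD_ENTRY = re.compile(
    r"LDISKFS-fs error \(device (?P<device>[^)]+)\): "
    r"ldiskfs_add_entry: bad entry in directory #(?P<dir_inode>\d+): "
    r"(?P<reason>.+?) - "
    r"offset=(?P<offset>\d+), inode=(?P<inode>\d+), "
    r"rec_len=(?P<rec_len>\d+), name_len=(?P<name_len>\d+)"
)

NUMERIC_FIELDS = ("dir_inode", "offset", "inode", "rec_len", "name_len")


def find_bad_entries(log_text):
    """Return one dict per 'bad entry in directory' message in log_text."""
    # Mail clients and syslog may wrap the message across lines,
    # so collapse all whitespace runs to single spaces before matching.
    flat = " ".join(log_text.split())
    results = []
    for match in BAD_ENTRY.finditer(flat):
        entry = match.groupdict()
        for key in NUMERIC_FIELDS:
            entry[key] = int(entry[key])
        results.append(entry)
    return results


# The message exactly as posted in this thread, wrapped as in the quoted reply.
sample = """\
LDISKFS-fs error (device sdb2): ldiskfs_add_entry: bad entry in directory
#12731224: inode out of bounds - offset=1900, inode=1953587812,
rec_len=204, name_len=194
"""

if __name__ == "__main__":
    for entry in find_bad_entries(sample):
        print(entry["device"], entry["dir_inode"], entry["reason"])
```

Each returned dict names the device and the directory inode, which is enough to know which MDT device to take offline and check with e2fsck.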