rklundt@sandia.gov
2007-Jan-10 10:39 UTC
[Lustre-devel] [Bug 11519] lfsck failure 'getting LOV config'
Please don't reply to lustre-devel. Instead, comment in Bugzilla by using the following link: https://bugzilla.lustre.org/show_bug.cgi?id=11519

Not sure where my comments went...

On the 320 OST fs, while trying to recover orphans, lfsck -d failed with EINVAL and the message:

llapi_lov_get_uuids: getting LOV config

The Lustre version is 1.4.6 in Cray release 1.4.35. e2fsprogs is version 1.39.cfs1. strace showed the ioctl in that routine failing after a read-only open of /scratch_grande. Logs are not easily available from this machine.
pbojanic@clusterfs.com
2007-Jan-10 11:46 UTC
[Lustre-devel] [Bug 11519] lfsck failure 'getting LOV config'
What                     |Removed             |Added
----------------------------------------------------------------------------
CC                       |                    |alex@clusterfs.com
OtherBugsDependingOnThis |3431                |718
AssignedTo               |bugs@clusterfs.com  |girish@clusterfs.com

Alex, Girish, here is an lfsck reported issue. Can Alex take an initial look and determine a course of action? If this is related to the other e2fsprogs work that Girish has been assigned, then it should be rolled into the same planning effort.
rklundt@sandia.gov
2007-Jan-10 12:16 UTC
[Lustre-devel] [Bug 11519] lfsck failure 'getting LOV config'
What     |Removed      |Added
----------------------------------------------------------------------------
Severity |S3 (normal)  |S1 (critical)

(In reply to comment #1)
> Alex, Girish, here is an lfsck reported issue. Can Alex take an initial look and
> determine a course of action? If this is related to the other e2fsprogs work
> that Girish has been assigned, then it should be rolled into the same planning effort.

We have experienced another problem on this fs, details in bug 11520, which is preventing us from using the fs at all. We're advised by Peggy G at Cray that completing the orphan cleanup may help with that problem, so I've been told to raise the severity of this problem too.
rklundt@sandia.gov
2007-Jan-10 15:26 UTC
[Lustre-devel] [Bug 11519] lfsck failure 'getting LOV config'
(In reply to comment #3)
> Ruth,
>
> is it possible to rerun lfsck with lustre debug enabled and learn from
> the log where EINVAL is returned from?

There is talk of trying the operation on the black side tomorrow during pm. Let me know if there are specific debug mask/level settings you would like to see.
rklundt@sandia.gov
2007-Jan-10 16:01 UTC
[Lustre-devel] [Bug 11519] lfsck failure 'getting LOV config'
Created an attachment (id=9311)
 --> (https://bugzilla.lustre.org/attachment.cgi?id=9311&action=view)
lustre log on ost nid15027
rklundt@sandia.gov
2007-Jan-10 16:03 UTC
[Lustre-devel] [Bug 11519] lfsck failure 'getting LOV config'
Created an attachment (id=9312)
 --> (https://bugzilla.lustre.org/attachment.cgi?id=9312&action=view)
syslog on nid15127
rklundt@sandia.gov
2007-Jan-10 16:26 UTC
[Lustre-devel] [Bug 11519] lfsck failure 'getting LOV config'
What                         |Removed |Added
----------------------------------------------------------------------------
Attachment #9311 is obsolete |0       |1

(From update of attachment 9311)
Sorry, wrong bug; this goes on 11520.
rklundt@sandia.gov
2007-Jan-10 16:27 UTC
[Lustre-devel] [Bug 11519] lfsck failure 'getting LOV config'
What                         |Removed |Added
----------------------------------------------------------------------------
Attachment #9312 is obsolete |0       |1

(From update of attachment 9312)
Sorry, wrong bug; this goes on 11520.
rklundt@sandia.gov
2007-Jan-10 16:29 UTC
[Lustre-devel] [Bug 11519] lfsck failure 'getting LOV config'
(In reply to comment #5)
> (In reply to comment #5)
> > There is talk of trying the operation on the black side tomorrow during pm. Let
> > me know if there are specific debug mask/level settings you would like to see.
>
> well, -1 would be the best. if sharing the logs with us isn't possible, then
> I'd like to ask on-site people to use -1 debug level and track where -EINVAL
> is returned from.

OK, we're planning to redo the lfsck step tomorrow. We will bring up grande clean, echo -1 > /proc/sys/portals/debug on the client node and all service nodes, and run lfsck with strace. Please let us know if any of that is wrong or not enough.
rklundt@sandia.gov
2007-Jan-11 10:21 UTC
[Lustre-devel] [Bug 11519] lfsck failure 'getting LOV config'
We're about to do this at 11 AM MDT; that's in about 45 minutes. We've been instructed by Cray to transcribe what we find into this bug, so this is just a heads up. We have a ~2 hour window for doing stuff - if someone is paying attention during that time we might get further with the diagnosis.

thanks,
Ruth
rklundt@sandia.gov
2007-Jan-11 11:51 UTC
[Lustre-devel] [Bug 11519] lfsck failure 'getting LOV config'
We ran with debug set to -1, and ran the lfsck again. Nothing showed up in syslogs. We are trying lctl dk to see if something will come out there.

Ruth
rklundt@sandia.gov
2007-Jan-11 12:01 UTC
[Lustre-devel] [Bug 11519] lfsck failure 'getting LOV config'
(In reply to comment #15)
> (In reply to comment #16)
> > We ran with debug set to -1, and ran the lfsck again. Nothing showed up in
> > syslogs.
>
> yes, this is expected.
>
> > We are trying lctl dk to see if something will come out there.
>
> we need to learn from lustre logs source of EINVAL.

(In reply to comment #14)
> We ran with debug set to -1, and ran the lfsck again. Nothing showed up in syslogs.
>
> We are trying lctl dk to see if something will come out there.
>
> Ruth

So we see the EINVAL coming from lov_iocontrol, at lov_obd.c:1997, with rc=18446744073709551594 : -22 : ffffffffffffffea. Right before that is lustre_lib.h:397:obd_ioctl_freedata, process leaving.

Hope this helps,
Ruth
rklundt@sandia.gov
2007-Jan-11 14:28 UTC
[Lustre-devel] [Bug 11519] lfsck failure 'getting LOV config'
Created an attachment (id=9321)
 --> (https://bugzilla.lustre.org/attachment.cgi?id=9321&action=view)
strace of lfsck command
rklundt@sandia.gov
2007-Jan-11 14:29 UTC
[Lustre-devel] [Bug 11519] lfsck failure 'getting LOV config'
Created an attachment (id=9322)
 --> (https://bugzilla.lustre.org/attachment.cgi?id=9322&action=view)
lctl dk output on login node during lfsck command
rklundt@sandia.gov
2007-Jan-11 20:31 UTC
[Lustre-devel] [Bug 11519] lfsck failure 'getting LOV config'
(In reply to comment #19)
> Ruth, will it be possible to add a few lines to the code to get more details?
> especially, I'm interested in knowing:
> 1) count in lov_iocontrol()
> 2) max_ost_count and *ost_count in llapi_lov_get_uuids()
>
> it looks like count is greater than max_ost_count.

Alex,

This is a production machine; we have to clear off the users in order to run debug experiments. I would like to know that someone is attempting to reproduce/test this problem offline. I can imagine an fs with 320 very small OSTs showing the same behavior.

I'll ask about the machine availability for this nevertheless.

Ruth
rklundt@sandia.gov
2007-Jan-12 09:39 UTC
[Lustre-devel] [Bug 11519] lfsck failure 'getting LOV config'
(In reply to comment #20)
> (In reply to comment #19)
> > Ruth, will it be possible to add a few lines to the code to get more details?
> > especially, I'm interested in knowing:
> > 1) count in lov_iocontrol()
> > 2) max_ost_count and *ost_count in llapi_lov_get_uuids()
> >
> > it looks like count is greater than max_ost_count.
>
> Alex,
>
> This is a production machine; we have to clear off the users in order to run
> debug experiments. I would like to know that someone is attempting to
> reproduce/test this problem offline. I can imagine an fs with 320 very small
> OSTs showing the same behavior.
>
> I'll ask about the machine availability for this nevertheless.
>
> Ruth

This is planned for Monday morning. If there is any progress in the meantime, please keep us posted.

Thanks,
Ruth
sss@cray.com
2007-Jan-12 10:11 UTC
[Lustre-devel] [Bug 11519] lfsck failure 'getting LOV config'
(In reply to comment #20)
> I can imagine an fs with 320 very small osts showing the same behavior.

Maybe this is a dumb question, but has lfsck ever been run successfully on this file system (i.e. before the duplicate object problem appeared)? Does it run OK on scratch_grande on the unclassified side?
rklundt@sandia.gov
2007-Jan-12 10:17 UTC
[Lustre-devel] [Bug 11519] lfsck failure 'getting LOV config'
What       |Removed |Added
----------------------------------------------------------------------------
Status     |NEW     |RESOLVED
Resolution |        |FIXED

(In reply to comment #16)
> (In reply to comment #20)
> > I can imagine an fs with 320 very small osts showing the same behavior.
>
> Maybe this is a dumb question but has lfsck ever been run successfully on
> this file system (i.e. before the duplicate object problem appeared)?
> Does it run OK on scratch_grande on the unclassified side?

To my knowledge this is the first time lfsck has been tried on 320 OSTs, and the folks running the machine are reluctant to experiment with the other, non-broken fs, in case something in the process caused the inconsistency we experienced in bug 11520.

ruth
sss@cray.com
2007-Jan-12 10:21 UTC
[Lustre-devel] [Bug 11519] lfsck failure 'getting LOV config'
What       |Removed  |Added
----------------------------------------------------------------------------
Status     |RESOLVED |REOPENED
Resolution |FIXED    |

Keep bug open
rklundt@sandia.gov
2007-Jan-12 10:51 UTC
[Lustre-devel] [Bug 11519] lfsck failure 'getting LOV config'
I need some advice on building lfsck. I downloaded e2fsprogs-1.39.cfs1-1.src.rpm from the lustre ftp site, and configured it to point at lustre sources. It seems to have found a different DB definition than expected, since the compile fails on line 68 below:

    54
    55 int e2fsck_lfsck_cleanupdb(e2fsck_t ctx)
    56 {
    57         int i;
    58         int rc = 0;
    59         DB *dbp;
    60
    61         if (ctx->lfsck_oinfo == NULL) {
    62                 return (0);
    63         }
    64
    65         for (i = 0; i < ctx->lfsck_oinfo->ost_count; i++) {
    66                 if (ctx->lfsck_oinfo->ofile_ctx[i].dbp != NULL) {
    67                         dbp = ctx->lfsck_oinfo->ofile_ctx[i].dbp;
    68                         rc += dbp->close(dbp, 0);
    69                         ctx->lfsck_oinfo->ofile_ctx[i].dbp = NULL;
    70                 }

make[2]: Entering directory `/usr/local5/rklundt/e2fsprogs-1.39.cfs1/build/e2fsck'
CC ../../e2fsck/pass6.c
../../e2fsck/pass6.c: In function `e2fsck_lfsck_cleanupdb':
../../e2fsck/pass6.c:68: error: too many arguments to function

config.log shows DB_LIB='-ldb-4.2', which doesn't look quite right? This is a SUSE Linux Enterprise 9.2 dist on x86_64.

Thanks for any advice.
Ruth
rklundt@sandia.gov
2007-Jan-12 11:55 UTC
[Lustre-devel] [Bug 11519] lfsck failure 'getting LOV config'
(In reply to comment #26)

It was picking up /usr/local/include/db.h instead of /usr/include/db.h - it's fine now.
rklundt@sandia.gov
2007-Jan-12 14:34 UTC
[Lustre-devel] [Bug 11519] lfsck failure 'getting LOV config'
The code in llapi_lov_get_uuids computes max_ost_count based on things that don't change:

    max_ost_count = (OBD_MAX_IOCTL_BUFFER - size_round(sizeof(data)) -
                     size_round(sizeof(desc))) /
                    (sizeof(*uuidp) + sizeof(*obdgens));

It's 171 all the time here. Then it's used to calculate how much uuid buffer to use, and to fill in data.ioc_inllen2 with a value which is definitely going to be less than the actual OST count times the size of the uuid entry. We'll confirm that Monday.

I'm thinking a build with a config option raising the size of OBD_MAX_IOCTL_BUFFER might solve the lfsck problem. I'm not sure of other effects of that change though. Besides lov.ko and lfsck, what other modules/utilities might have to be swapped? Or is there a reason this won't work at all?
rklundt@sandia.gov
2007-Jan-15 09:57 UTC
[Lustre-devel] [Bug 11519] lfsck failure 'getting LOV config'
(In reply to comment #29)
> Ruth, I think it's safe to double it.

We are past the EINVAL, by building an lov.ko with 16384 config'd for OBD_MAX_IOCTL_BUFFER.

I have a question about a message we are getting from the OST where the LBUG came from in bug 11520. It reports:

ERROR 11 dangling inodes found.

Should we use the -c flag to fix that, or?

Ruth
bogl@cray.com
2007-Jan-15 11:56 UTC
[Lustre-devel] [Bug 11519] lfsck failure 'getting LOV config'
(In reply to comment #30)
>
> We are past the EINVAL, by building an lov.ko with 16384 config'd for
> OBD_MAX_IOCTL_BUFFER.

This appears to be a configure cmd option. According to ./configure --help:

  --with-obd-buffer-size=size
                          set lctl ioctl maximum bytes (default=8192)

Should we (Cray) perhaps be building routinely with this upped to 16384 (2x default)?
rklundt@sandia.gov
2007-Jan-15 12:04 UTC
[Lustre-devel] [Bug 11519] lfsck failure 'getting LOV config'
(In reply to comment #23)
> (In reply to comment #30)
> >
> > We are past the EINVAL, by building an lov.ko with 16384 config'd for
> > OBD_MAX_IOCTL_BUFFER.
>
> This appears to be a configure cmd option. According to ./configure --help:
>
>   --with-obd-buffer-size=size
>                           set lctl ioctl maximum bytes (default=8192)
>
> Should we (Cray) perhaps be building routinely with this upped to 16384 (2x
> default)?

Yes, that's exactly what I did. I think it's appropriate to permanently increase it.
rklundt@sandia.gov
2007-Jan-22 09:54 UTC
[Lustre-devel] [Bug 11519] lfsck failure 'getting LOV config'
What     |Removed       |Added
----------------------------------------------------------------------------
Severity |S1 (critical) |S3 (normal)
Priority |P1            |P3

(In reply to comment #36)
> Hi Ruth
>
> I am just following up to establish the status of this issue. Is your cluster
> still down? If so, then what can we do to help? If not, then I suggest that we
> lower the severity of this ticket to reflect that.
>
> Regards
>
> Peter

I have a workaround so that lfsck can run on the large fs. The file system is still mounted read-only, but the machine is running with other file systems. They have scheduled another fix attempt for tomorrow morning. I reduced the severity.

Thanks,
Ruth