Daniel Kulinski
2009-Aug-05 19:12 UTC
[Lustre-discuss] Inode errors at time of job failure
What would cause the following error to appear?

LustreError: 10991:0:(file.c:2930:ll_inode_revalidate_fini()) failure -2 inode 14520180

This happened at the same time a job failed. Error number 2 is ENOENT, which means that this inode does not exist? Is there a way to query the MDS to find out which file this inode should have belonged to?

Thanks,
Dan Kulinski
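Kernel code reports failures as negated errno values, so the "failure -2" above is -ENOENT. A quick sanity check of the mapping, assuming python3 is available on the client (nothing here is Lustre-specific):

```shell
# "failure -2" in the log is a negated errno: errno 2 is ENOENT.
# Confirm the number-to-message mapping with Python's strerror:
python3 -c 'import errno, os; print(errno.ENOENT, os.strerror(2))'
# prints: 2 No such file or directory
```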
Hello!

On Aug 5, 2009, at 3:12 PM, Daniel Kulinski wrote:

> What would cause the following error to appear?

Typically this is some sort of a race where you presume an inode exists (because you have some traces of it in memory), but it is not there anymore (on the MDS, anyway). So when the client comes to fetch the inode attributes, there is nothing anymore. Normally this should not happen, because Lustre uses locking to ensure caching consistency, but in some cases this is not true (e.g. open oftentimes returns a dentry without a lock). Also, if a client was evicted, cached open files could not be revoked right away until they are closed.

> LustreError: 10991:0:(file.c:2930:ll_inode_revalidate_fini())
> failure -2 inode 14520180
> This happened at the same time a job failed. Error number 2 is
> ENOENT which means that this inode does not exist?

Right.

> Is there a way to query the MDS to find out which file this inode
> should have belonged to?

Well, there is lfs find, which can search by inode number, but since there is no such inode anymore, there is no way to find out what name it was attached to (and the name likely does not exist either).

Did you have a client eviction before this message, by any chance? What was the job doing at the time?

Bye,
    Oleg
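While an inode still exists, plain POSIX `find -inum` can map an inode number back to a path on any mounted filesystem, Lustre included. A minimal sketch, demonstrated against a throwaway directory (the directory and file name below are made up for the demo; on a real system you would point find at the Lustre mount point):

```shell
# Hedged sketch: map an inode number back to a path with POSIX find.
# Runs in a throwaway directory; substitute your Lustre mount point.
dir=$(mktemp -d)
echo data > "$dir/somefile"
ino=$(stat -c %i "$dir/somefile")   # the inode number, cf. 14520180 in the log
find "$dir" -inum "$ino"            # prints the path(s) owning that inode
rm -rf "$dir"
```

This only works while the inode is live; once the MDS has deleted it, as in the error above, there is nothing left for find to match.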
Hi,

these ll_inode_revalidate_fini errors are unfortunately quite well known to us. So what would you guess if this happens again and again, on a number of clients - the MDT softly dying away? Because we haven't seen any mass evictions (and no reasons for any) in connection with these errors. Or could the problem with the cached open files also be present if the communication interruption does not show up as an eviction in the logs?

Regards,
Thomas

Oleg Drokin wrote:
> Hello!
>
> On Aug 5, 2009, at 3:12 PM, Daniel Kulinski wrote:
>
>> What would cause the following error to appear?
>
> Typically this is some sort of a race where you presume an inode exists
> (because you have some traces of it in memory), but it is not there
> anymore (on the MDS, anyway). So when the client comes to fetch the
> inode attributes, there is nothing anymore. Normally this should not
> happen, because Lustre uses locking to ensure caching consistency, but
> in some cases this is not true (e.g. open oftentimes returns a dentry
> without a lock). Also, if a client was evicted, cached open files could
> not be revoked right away until they are closed.
>
>> LustreError: 10991:0:(file.c:2930:ll_inode_revalidate_fini())
>> failure -2 inode 14520180
>> This happened at the same time a job failed. Error number 2 is
>> ENOENT which means that this inode does not exist?
>
> Right.
>
>> Is there a way to query the MDS to find out which file this inode
>> should have belonged to?
>
> Well, there is lfs find, which can search by inode number, but since
> there is no such inode anymore, there is no way to find out what name
> it was attached to (and the name likely does not exist either).
>
> Did you have a client eviction before this message, by any chance?
> What was the job doing at the time?
>
> Bye,
> Oleg
Hello!

On Aug 6, 2009, at 12:57 PM, Thomas Roth wrote:

> Hi,
> these ll_inode_revalidate_fini errors are unfortunately quite known to us.
> So what would you guess if that happens again and again, on a number of
> clients - MDT softly dying away?

No, I do not think this is an MDT problem of any sort at present; it looks more like some strange client interaction. Are there any negative side effects in your case aside from log clutter? Jobs failing, or anything like that?

> Because we haven't seen any mass evictions (and no reasons for that) in
> connection with these errors.
> Or could the problem with the cached open files also be present if the
> communication interruption does not show up as an eviction in the logs?

It has nothing to do with open files if there are no evictions. I checked in Bugzilla and found bug 16377, which looks like this report too, though the logs in there are somewhat confusing. It almost appears as if the failing dentry is reported as a mountpoint by the VFS, but then it is not, since the following inode_revalidate call ends up in Lustre again. Do you have "lookup on mtpt" sorts of errors coming from namei.c?

If you can reproduce the problem with ls or another tool at will, can you please execute this on a client (comment #17 in bug 16377):

# script
Script started, file is typescript
# lctl clear
# echo -1 > /proc/sys/lnet/debug
[ reproduce problem ]
# lctl dk > /tmp/ls.debug
# exit
Script done, file is typescript

and attach the resulting ls.debug to the bug?

Also, what Lustre version are you using?

Bye,
    Oleg
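The capture steps above can be wrapped in a small script so every affected user collects the same data. This is only a sketch under several assumptions: it must run as root on a Lustre client of that era, the output path is an arbitrary choice, and it bails out politely where lctl is not installed:

```shell
#!/bin/sh
# Hedged wrapper around the lctl debug-capture steps quoted above.
# Assumes root on a Lustre 1.6-era client; OUT is an arbitrary choice.
OUT=${1:-/tmp/ls.debug}
if ! command -v lctl >/dev/null 2>&1; then
    echo "lctl not found; run this on a Lustre client" >&2
    exit 0
fi
lctl clear                       # empty the existing debug buffer
echo -1 > /proc/sys/lnet/debug   # enable all debug message classes
echo "Reproduce the problem now, then press Enter" >&2
read dummy
lctl dk > "$OUT"                 # dump the kernel debug buffer to a file
echo "Debug log written to $OUT"
```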
Hi Oleg,

thanks for your reply. I'm not able to reproduce this error at will, though. There are files reported missing by our users, but I couldn't correlate these with the ll_inode_revalidate_fini errors, at least not directly. In fact, some of the missing files reappeared later, as reported in bug 16377, while others are gone for good.

In comment #29 of bug 16377, Brian Murrell stated that this can be caused by on-disk corruption. A file system check on the MDT claimed to correct a large number of problems during our last downtime a month ago. (The said disappearance of files wasn't correlated with this fsck ;-)). So I'm still not reassured concerning the health of this MDT.

We are running Lustre 1.6.7.2 on the servers; the clients are mainly still on 1.6.5.1.

Regards,
Thomas

Oleg Drokin wrote:
> Hello!
>
> On Aug 6, 2009, at 12:57 PM, Thomas Roth wrote:
>
>> Hi,
>> these ll_inode_revalidate_fini errors are unfortunately quite known to us.
>> So what would you guess if that happens again and again, on a number of
>> clients - MDT softly dying away?
>
> No, I do not think this is an MDT problem of any sort at present, more
> like some strange client interaction. Are there any negative side effects
> in your case aside from log clutter? Jobs failing or anything like that?
>
>> Because we haven't seen any mass evictions (and no reasons for that) in
>> connection with these errors.
>> Or could the problem with the cached open files also be present if the
>> communication interruption does not show up as an eviction in the logs?
>
> It has nothing to do with open files if there are no evictions.
> I checked in Bugzilla and found bug 16377, which looks like this report
> too, though the logs in there are somewhat confusing. It almost appears
> as if the failing dentry is reported as a mountpoint by the VFS, but
> then it is not, since the following inode_revalidate call ends up in
> Lustre again. Do you have "lookup on mtpt" sorts of errors coming from
> namei.c?
>
> If you can reproduce the problem with ls or another tool at will,
> can you please execute this on a client (comment #17 in bug 16377):
>
> # script
> Script started, file is typescript
> # lctl clear
> # echo -1 > /proc/sys/lnet/debug
> [ reproduce problem ]
> # lctl dk > /tmp/ls.debug
> # exit
> Script done, file is typescript
>
> and attach the resulting ls.debug to the bug?
>
> Also, what Lustre version are you using?
>
> Bye,
> Oleg
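On the lingering "which file did inode 14520180 belong to" question: when the MDT backend is ldiskfs (ext3-based, as in Lustre 1.6), debugfs from e2fsprogs can map an inode number to its name(s) directly on the backing device, as long as the inode still exists on disk. A hedged sketch; /dev/mdtdev is a placeholder for the real MDT block device, and both commands are best run while the MDT is unmounted:

```shell
# Hedged sketch: resolve an inode number to path name(s) on an ldiskfs MDT.
# MDTDEV is a placeholder; point it at the real MDT block device,
# preferably with the MDT unmounted.
MDTDEV=${MDTDEV:-/dev/mdtdev}
if [ -b "$MDTDEV" ]; then
    debugfs -R 'ncheck 14520180' "$MDTDEV"  # prints inode -> pathname pairs
    e2fsck -fn "$MDTDEV"                    # forced, read-only consistency check
else
    echo "set MDTDEV to the MDT block device" >&2
fi
```

Since e2fsck -fn answers "no" to every repair prompt, it is safe for re-checking the MDT health Thomas is worried about without modifying the device.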