Kewley, David
2007-Jun-27 17:25 UTC
[Lustre-discuss] kernel: excessive revalidate_it loops
For the past several months, we've been running Lustre 1.4.7 on our production cluster. Periodically we get the Subject: error message in the syslog. This most often happens while a user MPI job is running, often on the node that is running rank 0, and often very early in the run. Several applications can trigger this message.

I see that the kernel code that prints this message is contained in the Lustre 1.4.7 patches, specifically the addition of namei.c:revalidate_special(). I have only a shallow and very incomplete understanding of what circumstances can cause the error message to be logged. Looking at the code, it appears to happen when ten successive (rapid) attempts to revalidate a dentry suffer a certain class of failure.

I do not know:

* what circumstances cause revalidate_special() to be called
* what types of failure cause the loop to be re-executed (up to ten times)
* what are typical circumstances in which the loop terminates with the Subject: message
* what dentry validation is, really

So I'm asking you: What might be causing the failures, how can we check, and how can we avoid them in the future?

-----

Let me elaborate a little on why I care. I have two concerns.

First, we've been seeing this message sporadically over the last several months. I've never been able to find any commonality until today (see next paragraph), and Google has not been very friendly. I want to figure out whether these messages reflect a problem I need to solve for my users.

My second, more important concern is that recently a particular application with particular input parameters has been causing nodes to die, and we don't know why. This affects that user (jobs die) and other users (loss of nodes). I just noticed today that the node deaths appear to be correlated with the appearance of the Subject: error message.

The node "death" is simply that many processes get general protection errors logged in syslog.
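
The correlation noted above, between the Subject: message and the general protection errors, could be checked by scanning syslog and pairing the two kinds of entries by host and time. The following is only a minimal Python sketch under stated assumptions: it assumes traditional syslog lines of the form "Mon DD HH:MM:SS host message", assumes the general protection entries contain the text "general protection", and uses a 10-second window as a stand-in for "a few seconds". The names parse_line and correlate are hypothetical helpers, not part of Lustre or any existing tool.

```python
import re
from datetime import datetime, timedelta

# Assumed traditional syslog layout: "Jun 27 17:25:01 host message..."
LINE_RE = re.compile(r"^(\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+(\S+)\s+(.*)$")

REVALIDATE_TEXT = "excessive revalidate_it loops"  # text from the Subject: line
GP_TEXT = "general protection"                     # assumed substring of the GP entries


def parse_line(line, year=2007):
    """Return (timestamp, host, message), or None if the line does not match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    stamp, host, message = m.groups()
    # Traditional syslog omits the year, so the caller supplies it.
    when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
    return when, host, message


def correlate(lines, window_seconds=10, year=2007):
    """For each revalidate message, count GP errors on the same host
    logged at or after it, within window_seconds."""
    window = timedelta(seconds=window_seconds)
    revalidate, gp = [], []
    for line in lines:
        parsed = parse_line(line, year)
        if parsed is None:
            continue  # skip lines that are not in the assumed format
        when, host, message = parsed
        if REVALIDATE_TEXT in message:
            revalidate.append((when, host))
        elif GP_TEXT in message:
            gp.append((when, host))
    results = []
    for when, host in revalidate:
        count = sum(1 for t, h in gp if h == host and when <= t <= when + window)
        results.append((host, when, count))
    return results
```

Calling correlate() on the lines of a collected syslog file yields one (host, time, count) tuple per Subject: message. A count of zero would mark an occurrence where the node apparently survived, while a large count would mark a node death; the log file location and the window size would need adjusting for a real cluster.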
The great majority of these general protection log entries are for processes that are not related to the job processes, except for the fact that they run on the same node. I wonder whether there is some non-obvious resource starvation, or a kernel bug, or ...

Once the general protection errors start getting logged for a node, the node is unusable without a reboot. If you already have an interactive shell open, you can do certain things but not others.

The Subject: message can be logged even when the node does not die and the job keeps running. When the node *does* die in this way, though, signs of the node death always start occurring within a few seconds of the Subject: message.

I'd appreciate any suggestions.

Thanks,
David

--
David Kewley
Dell Services - Americas Technology Consulting
Consultant
Cell Phone: 602-460-7617
David_Kewley@Dell.com

I speak only for myself; my views do not necessarily reflect Dell's views.

Dell Services: http://www.dell.com/services/
How am I doing? Email my manager Dustin_Johnson@Dell.com with any feedback.