Thomas Roth
2010-May-03 15:49 UTC
[Lustre-discuss] lock callback timer expired, lock on destroyed export, locks stolen, busy with active RPCs, operation 400 on unconnected MDS
Hi all,

just want to share my recent insight and increase the number of Google hits for those who suffer from

- MDT / filesystem becoming suddenly unusable
- LustreError: ... lock callback timer expired ...
- LustreError: ... lock on destroyed export ...
- Lustre: ... Stealing 1 locks ...
- Lustre: ... All locks stolen ...
- LustreError: ... busy with active 2 RPCs ...
- LustreError: ... operation 400 on unconnected MDS ...

All of these and more we have seen on the MDT of our 1.6.7.2-Cluster after running for one year without major problems. For the last 2 weeks the system hasn't had an uptime of more than 30h, though.

We found a user job submission script that probably caused all this by starting

- several hundred (900) jobs simultaneously
- all of them opening one and the same file for batch system errors and one and the same file for its output.

So if someone is sitting in front of an uncooperative MDT, dazed and confused as I was, perhaps this is the direction to investigate.

Still I'd like to learn more about "operation X on unconnected MDS", on the net I only found my own question from two years ago.

Regards,
Thomas

--
--------------------------------------------------------------------
Thomas Roth
Department: Informationstechnologie
GSI Helmholtzzentrum für Schwerionenforschung GmbH
Planckstraße 1
64291 Darmstadt
Gesellschaft mit beschränkter Haftung
Sitz der Gesellschaft: Darmstadt
Handelsregister: Amtsgericht Darmstadt, HRB 1528
Geschäftsführung: Professor Dr. Dr. h.c. Horst Stöcker, Christiane Neumann, Dr. Hartmut Eickhoff
Vorsitzende des Aufsichtsrates: Dr. Beatrix Vierkorn-Rudolph
Stellvertreter: Ministerialdirigent Dr. Rolf Bernhardt
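To check whether a server log shows the same cluster of symptoms listed in the message above, a small scanner along the following lines might help. This is only a sketch: the patterns are the message fragments quoted in the post, the text elided there as "..." is deliberately left unmatched, and the names `SYMPTOMS` and `count_symptoms` are made up for this illustration.

```python
import re
from collections import Counter

# Fragments copied from the messages quoted in the post. Numbers such as the
# lock count or the operation code are matched generically, since they vary.
SYMPTOMS = {
    "lock callback timer expired": re.compile(r"lock callback timer expired"),
    "lock on destroyed export": re.compile(r"lock on destroyed export"),
    "stealing locks": re.compile(r"Stealing \d+ locks"),
    "all locks stolen": re.compile(r"All locks stolen"),
    "busy with active RPCs": re.compile(r"busy with active \d+ RPCs"),
    "operation on unconnected MDS": re.compile(r"operation \d+ on unconnected MDS"),
}


def count_symptoms(lines):
    """Count how many log lines contain each of the known message fragments."""
    counts = Counter()
    for line in lines:
        for name, pattern in SYMPTOMS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts
```

It accepts any iterable of lines, so an open log file can be passed in directly; which file holds the Lustre messages depends on the local syslog setup.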
Oleg Drokin
2010-May-03 18:30 UTC
[Lustre-discuss] lock callback timer expired, lock on destroyed export, locks stolen, busy with active RPCs, operation 400 on unconnected MDS
Hello!

On May 3, 2010, at 11:49 AM, Thomas Roth wrote:
> We found a user job submission script that probably caused all this by
> starting
> - several hundred (900) jobs simultaneously
> - all of them opening one and the same file for batch system errors and
> one and the same file for its output.

You probably should keep an eye on developments in bug 20373, which should help to avert this kind of problem for the use case you describe. The existing "good" patch in there should help somewhat, and the other patch under development will help some more once it's completed.

> Still I'd like to learn more about "operation X on unconnected MDS", on
> the net I only found my own question from two years ago.

This means MDS got a request X from a client that it believes is no longer connected to it (because the client was evicted, I guess).

Bye,
Oleg
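On the submission side, the pattern described in the thread (900 jobs all opening the same error file and the same output file) can be avoided by deriving the file names from the job ID. The sketch below is not something proposed in the thread, and whether it relieves the MDT in a given setup is untested here; `per_job_paths`, the directory, and the naming scheme are all hypothetical.

```python
from pathlib import Path


def per_job_paths(log_dir, job_name, job_id):
    """Return (stdout_path, stderr_path) that are unique to one job.

    The job ID is part of the file name, so no two jobs open the same file.
    """
    base = Path(log_dir)
    stem = f"{job_name}.{job_id}"
    return base / f"{stem}.out", base / f"{stem}.err"
```

A submission script would call this once per job and hand the two paths to the batch system's output and error options, whatever those are called locally.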