Hi all,

our MDT gets stuck and unresponsive under very high load (Lustre 1.6.7.1, kernel 2.6.22, 8 cores, 32 GB RAM). The only thing that stands out is one ll_mdt_?? process running at 100% CPU. Nothing unusual was happening on the cluster before that. After a reboot, as well as after moving the service to another server, this behaviour reappears. The initial stages - mounting the MGS, mounting the MDT, recovery - work fine, but then the load goes up and the system is rendered unusable.

At the moment I don't know what to do, except shutting down all servers and possibly doing a writeconf everywhere.

I see that a similar problem was reported by Mag in March this year, but no clues or solutions appeared. Any ideas?

Yours,
Thomas
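[A minimal sketch for pinning down which MDT service thread is spinning and capturing its kernel stack before the node becomes unusable - this assumes standard procps/sysrq tools and the usual ll_mdt_NN thread naming; it is not taken from the thread itself:]

  # Top CPU consumers among the MDT service threads
  ps -eo pid,pcpu,comm --sort=-pcpu | grep ll_mdt | head -5

  # Dump kernel stacks of all tasks to the kernel log (sysrq 't'),
  # so the code path the busy thread is stuck in shows up in syslog
  echo t > /proc/sysrq-trigger

  # Inspect the resulting traces
  dmesg | tail -200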
Hi Tom:

There was a known issue with 1.6.7.1. What I did was downgrade to 1.6.6 and everything worked well. Or you can try upgrading, but there is definitely something wrong with that version...

If you like, I can help you offline. I should be free this weekend (I have a long weekend).

On Thu, Jul 2, 2009 at 8:22 AM, Thomas Roth <t.roth at gsi.de> wrote:
> [...]
Hi,

I didn't take notice of a discussion of such problems with 1.6.7.1. Do you know something more specific about it? We don't want to downgrade, since our users are happier after the last upgrade (1.6.5 -> 1.6.7). And we don't have the 1.6.7.2 (Debian) packages yet. But I could try to speed that up and force an upgrade if you told me that 1.6.7.1 isn't really reliable.

For the moment the problem seems to have been fixed by a shutdown, fs-check and writeconf of all servers. However, I don't want to do that every other week ...

Thanks a lot for your help,
Thomas

Mag Gam wrote:
> [...]

--
--------------------------------------------------------------------
Thomas Roth
Department: Informationstechnologie
Location: SB3 1.262
Phone: +49-6159-71 1453  Fax: +49-6159-71 2986

GSI Helmholtzzentrum für Schwerionenforschung GmbH
Planckstraße 1
D-64291 Darmstadt
www.gsi.de

Gesellschaft mit beschränkter Haftung
Sitz der Gesellschaft: Darmstadt
Handelsregister: Amtsgericht Darmstadt, HRB 1528

Geschäftsführer: Professor Dr. Horst Stöcker

Vorsitzende des Aufsichtsrates: Dr. Beatrix Vierkorn-Rudolph,
Stellvertreter: Ministerialdirigent Dr. Rolf Bernhardt
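[A minimal sketch of that shutdown / fs-check / writeconf sequence for a single target, assuming ldiskfs-backed targets and the Lustre 1.6 tunefs.lustre tool; device and mount-point names are placeholders, not taken from the thread:]

  # Run on each server, with all clients off and the target unmounted.
  umount /mnt/lustre/mdt                      # stop the target first
  e2fsck -f /dev/mdt_dev                      # ldiskfs is ext3-based, so e2fsck applies
  tunefs.lustre --writeconf /dev/mdt_dev      # mark the config logs for regeneration
  mount -t lustre /dev/mdt_dev /mnt/lustre/mdt

  # After a writeconf, remount the MGS/MDT before the OSTs so the
  # configuration logs are rewritten in the right order.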
http://lists.lustre.org/pipermail/lustre-discuss/2009-March/009928.html

Look familiar?

On Fri, Jul 3, 2009 at 7:32 AM, Thomas Roth <t.roth at gsi.de> wrote:
> [...]
Mag Gam wrote:
> http://lists.lustre.org/pipermail/lustre-discuss/2009-March/009928.html
>
> Look familiar?

Yes, I've read the thread - that's why I addressed you in addition to the list ;-)

But I was not aware that this is supposed to be a bug in this particular Lustre version.

Right now the MDT stops cooperating without any ll_mdt processes going up. The load is around 0.5 on the MDT, but no connections are possible. In the log I only noted some "still busy with 2 active RPCs" messages. I just hope I don't have to writeconf the MDT again - I learned on this list that this would be necessary if these RPCs are never finished.

Regards,
Thomas
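[A quick way to check whether the MDT is actually stuck, rather than still in recovery, before resorting to another writeconf - a minimal sketch, assuming the Lustre 1.6 /proc layout; "lustre-MDT0000" is a placeholder target name:]

  # On the MDS:
  lctl dl                                                   # list local Lustre devices, state and refcounts
  cat /proc/fs/lustre/mds/lustre-MDT0000/recovery_status    # COMPLETE vs. RECOVERING, evicted clients, etc.

  # If a target refuses to unmount with "still busy with N active RPCs",
  # the refcounts from lctl dl usually show which device is still pinned.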
Exactly the symptoms I had. How long were you running this for? Also, how easy is it for you to reproduce this error?

This should clear up your doubts. But you said you are running 1.6.7.1, which is bizarre, because I was running 1.6.7. Maybe this could be a different bug?

http://lists.lustre.org/pipermail/lustre-discuss/2009-April/010167.html

On Fri, Jul 3, 2009 at 10:44 AM, Thomas Roth <t.roth at gsi.de> wrote:
> [...]
Hi,

Mag Gam wrote:
> Exactly the symptoms I had. How long were you running this for? Also,
> how easy is it for you to reproduce this error?

The MDS-going-on-strike instances happened only twice since we upgraded the cluster from Lustre 1.6.5.1 to 1.6.7.1 at the end of April. Since last week everything seems to work fine again. The difference: I had to move data off one OST whose RAID reports hardware errors. To do that, I ran "lfs find --obd <OST> /lustre/<dir>", at first massively parallel, then with 6 processes, and for the last few directories only step by step. Of course I'm bewildered that such a well-defined operation should be able to break the MDT's operation, while the things our users do in their unlimited ingenuity did not.

On the other hand, there is that issue with switching on quota. As I have reported earlier, "lfs quotacheck -ug" also leads to enormous loads on the MDT, finally stopping everything. Maybe it's more of a hardware issue.

> This should clear up your doubts. But you said you are running 1.6.7.1,
> which is bizarre, because I was running 1.6.7. Maybe this could be a
> different bug?
>
> http://lists.lustre.org/pipermail/lustre-discuss/2009-April/010167.html

Well, that was the bug causing data corruption on the MDT. There were patches for 1.6.7.0 and then the patched release 1.6.7.1 to correct that. But now we experienced this stop of operation of the MDT. After curing it in the way I described earlier, there were no data corruptions or losses that could be attributed to this outage.

Regards,
Thomas
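[For reference, a minimal sketch of that drain-an-OST procedure, assuming the Lustre 1.6 approach of deactivating the OST for new allocations on the MDS and then copying the affected files so they are recreated on healthy OSTs; device numbers, UUIDs and paths are placeholders, not the ones used at GSI:]

  # On the MDS: find the OSC device number for the failing OST and
  # deactivate it, so no new objects are allocated there.
  lctl dl | grep osc
  lctl --device 11 deactivate          # 11 is a placeholder device number

  # On a client: list the files with objects on that OST, then copy and
  # rename them so the new copies are striped over the remaining OSTs.
  lfs find --obd lustre-OST0004_UUID /lustre/some/dir > /tmp/files_on_ost4
  while read f; do
      cp -a "$f" "$f.tmp" && mv "$f.tmp" "$f"
  done < /tmp/files_on_ost4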
So, are you all good now? Thanks for the explanation, BTW!

On Tue, Jul 7, 2009 at 7:42 AM, Thomas Roth <t.roth at gsi.de> wrote:
> [...]