Hi all,

our MDT gets stuck and unresponsive under very high load (Lustre 1.6.7.1, kernel 2.6.22, 8 cores, 32 GB RAM). The only thing that stands out is one ll_mdt_?? process running at 100% CPU. Nothing unusual was happening on the cluster before that. After a reboot, as well as after moving the service to another server, this behaviour reappears. The initial stages - mounting the MGS, mounting the MDT, recovery - work fine, but then the load goes up and the system is rendered unusable.

At the moment I don't know what to do, except shutting down all servers and possibly doing a writeconf everywhere.

I see that a similar problem was reported by Mag in March this year, but no clues or solutions appeared. Any ideas?

Yours,
Thomas
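[A minimal sketch for pinning down which MDT service thread is spinning and capturing its kernel stack before the node becomes unusable - this assumes standard procps/sysrq tools and the usual ll_mdt_NN thread naming; it is not taken from the thread itself:]

  # Top CPU consumers among the MDT service threads
  ps -eo pid,pcpu,comm --sort=-pcpu | grep ll_mdt | head -5

  # Dump kernel stacks of all tasks to the kernel log (sysrq 't'),
  # so the code path the busy thread is stuck in shows up in syslog
  echo t > /proc/sysrq-trigger

  # Inspect the resulting traces
  dmesg | tail -200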
Hi Tom:

There was a known issue with 1.6.7.1. What I did was downgrade to 1.6.6 and everything worked well. Or you can try upgrading, but there is definitely something wrong with that version...

If you like, I can help you offline. I should be free this weekend (I have a long weekend).

On Thu, Jul 2, 2009 at 8:22 AM, Thomas Roth <t.roth at gsi.de> wrote:
> [...]
Hi,

I didn't take notice of a discussion of such problems with 1.6.7.1. Do you know something more specific about it? We don't want to downgrade, since our users are happier after the last upgrade (1.6.5 -> 1.6.7). And we don't have the 1.6.7.2 (Debian) packages yet. But I could try to speed that up and force an upgrade if you told me that 1.6.7.1 isn't really reliable.

For the moment the problem seems to have been fixed by a shutdown, fs-check and writeconf of all servers. However, I don't want to do that every other week ...

Thanks a lot for your help,
Thomas

Mag Gam wrote:
> [...]

--
--------------------------------------------------------------------
Thomas Roth
Department: Informationstechnologie
Location: SB3 1.262
Phone: +49-6159-71 1453  Fax: +49-6159-71 2986

GSI Helmholtzzentrum für Schwerionenforschung GmbH
Planckstraße 1
D-64291 Darmstadt
www.gsi.de

Gesellschaft mit beschränkter Haftung
Sitz der Gesellschaft: Darmstadt
Handelsregister: Amtsgericht Darmstadt, HRB 1528

Geschäftsführer: Professor Dr. Horst Stöcker

Vorsitzende des Aufsichtsrates: Dr. Beatrix Vierkorn-Rudolph,
Stellvertreter: Ministerialdirigent Dr. Rolf Bernhardt
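[A minimal sketch of that shutdown / fs-check / writeconf sequence for a single target, assuming ldiskfs-backed targets and the Lustre 1.6 tunefs.lustre tool; device and mount-point names are placeholders, not taken from the thread:]

  # Run on each server, with all clients off and the target unmounted.
  umount /mnt/lustre/mdt                      # stop the target first
  e2fsck -f /dev/mdt_dev                      # ldiskfs is ext3-based, so e2fsck applies
  tunefs.lustre --writeconf /dev/mdt_dev      # mark the config logs for regeneration
  mount -t lustre /dev/mdt_dev /mnt/lustre/mdt

  # After a writeconf, remount the MGS/MDT before the OSTs so the
  # configuration logs are rewritten in the right order.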
http://lists.lustre.org/pipermail/lustre-discuss/2009-March/009928.html

Look familiar?

On Fri, Jul 3, 2009 at 7:32 AM, Thomas Roth <t.roth at gsi.de> wrote:
> [...]
Mag Gam wrote:
> http://lists.lustre.org/pipermail/lustre-discuss/2009-March/009928.html
>
> Look familiar?

Yes, I've read the thread - that's why I addressed you in addition to the list ;-)

But I was not aware that this is supposed to be a bug in this particular Lustre version.

Right now the MDT stops cooperating without any ll_mdt processes going up. The load is around 0.5 on the MDT, but no connections are possible. In the log I only noted some "still busy with 2 active RPCs" messages. I just hope I don't have to writeconf the MDT again - I learned on this list that this would be necessary if these RPCs are never finished.

Regards,
Thomas
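[A quick way to check whether the MDT is actually stuck, rather than still in recovery, before resorting to another writeconf - a minimal sketch, assuming the Lustre 1.6 /proc layout; "lustre-MDT0000" is a placeholder target name:]

  # On the MDS:
  lctl dl                                                   # list local Lustre devices, state and refcounts
  cat /proc/fs/lustre/mds/lustre-MDT0000/recovery_status    # COMPLETE vs. RECOVERING, evicted clients, etc.

  # If a target refuses to unmount with "still busy with N active RPCs",
  # the refcounts from lctl dl usually show which device is still pinned.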
Exactly the symptoms I had. How long were you running this for? Also, how easy is it for you to reproduce this error?

This should clear up your doubts. But you said you are running 1.6.7.1, which is bizarre, because I was running 1.6.7. Maybe this could be a different bug?

http://lists.lustre.org/pipermail/lustre-discuss/2009-April/010167.html

On Fri, Jul 3, 2009 at 10:44 AM, Thomas Roth <t.roth at gsi.de> wrote:
> [...]
Hi,

Mag Gam wrote:
> Exactly the symptoms I had. How long were you running this for? Also,
> how easy is it for you to reproduce this error?

The MDS-going-on-strike instances happened only twice since we upgraded the cluster from Lustre 1.6.5.1 to 1.6.7.1 at the end of April. Since last week everything seems to work fine again. The difference: I had to move data off one OST whose RAID reports hardware errors. To do that, I ran "lfs find --obd <OST> /lustre/<dir>", at first massively parallel, then with 6 processes, and for the last few directories only step by step. Of course I'm bewildered that such a well-defined operation should be able to break the MDT's operation, while the things our users do in their unlimited ingenuity did not.

On the other hand, there is that issue with switching on quota. As I have reported earlier, "lfs quotacheck -ug" also leads to enormous loads on the MDT, finally stopping everything. Maybe it's more of a hardware issue.

> This should clear up your doubts. But you said you are running 1.6.7.1,
> which is bizarre, because I was running 1.6.7. Maybe this could be a
> different bug?
>
> http://lists.lustre.org/pipermail/lustre-discuss/2009-April/010167.html

Well, that was the bug causing data corruption on the MDT. There were patches for 1.6.7.0 and then the patched release 1.6.7.1 to correct that. But now we experienced this stop of operation of the MDT. After curing it in the way I described earlier, there were no data corruptions or losses that could be attributed to this outage.

Regards,
Thomas
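[For reference, a minimal sketch of that drain-an-OST procedure, assuming the Lustre 1.6 approach of deactivating the OST for new allocations on the MDS and then copying the affected files so they are recreated on healthy OSTs; device numbers, UUIDs and paths are placeholders, not the ones used at GSI:]

  # On the MDS: find the OSC device number for the failing OST and
  # deactivate it, so no new objects are allocated there.
  lctl dl | grep osc
  lctl --device 11 deactivate          # 11 is a placeholder device number

  # On a client: list the files with objects on that OST, then copy and
  # rename them so the new copies are striped over the remaining OSTs.
  lfs find --obd lustre-OST0004_UUID /lustre/some/dir > /tmp/files_on_ost4
  while read f; do
      cp -a "$f" "$f.tmp" && mv "$f.tmp" "$f"
  done < /tmp/files_on_ost4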
So, are you all good now? Thanks for the explanation, BTW!

On Tue, Jul 7, 2009 at 7:42 AM, Thomas Roth <t.roth at gsi.de> wrote:
> [...]