Hi Wojciech,
Wojciech Turek wrote:
> Hi,
>
> It doesn't look healthy. I assume that those messages and the
> numbers are from the client side; what do you see on the MDS server
> itself?
I haven't gotten a good correlation between the client and MDS
messages yet.
Of course, on the MDS I see evictions, refused connections due to
leftover active RPCs, as well as timeouts because "Request x55349122 took
longer than estimated" - the whole spectrum, I think.
> It seems to me that your network connection to the MDS is flaky, hence
> so many disconnection messages. It may not hurt your bandwidth
> performance noticeably, but it will certainly kill your metadata
> performance. I suggest running some tests and seeing for yourself. From
> your email I see that you are using Ethernet to connect the MDS to the
> rest of the cluster. It may be worth checking the cable or the
> interface for errors and dropped packets.
The major trouble as seen from the user's side is of the type "Node A
doesn't see Lustre". Jobs dispatched to such a node then cannot run and
exit with failure. On inspection the node is doing fine, Lustre is
mounted and accessible - it just took too long to reactivate the
connection. So indeed, metadata performance is dead.
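
Regarding the interface check: I'll keep an eye on the error and drop
counters on the client NICs. Below is a small Python sketch I'd use for
that - it just reads the standard Linux counters from /sys/class/net;
the interface name "eth0" is only a placeholder for whatever the nodes
actually use.

#!/usr/bin/env python
# Print NIC error/drop counters from sysfs (Linux only).
# Assumption: the interface name is given as the first argument,
# falling back to the placeholder "eth0".
import os
import sys

def read_counter(iface, name):
    path = os.path.join('/sys/class/net', iface, 'statistics', name)
    with open(path) as f:
        return int(f.read().strip())

if __name__ == '__main__':
    iface = sys.argv[1] if len(sys.argv) > 1 else 'eth0'
    for name in ('rx_errors', 'tx_errors', 'rx_dropped', 'tx_dropped'):
        print('%s %s: %d' % (iface, name, read_counter(iface, name)))

Run once, wait a while, run again - steadily growing rx_errors or
rx_dropped would point at the cable or interface rather than at Lustre
itself.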
>
> I have a 600-node cluster here, 100% utilized with jobs most of the
> time; Lustre is serving the /home and /scratch file systems, and I
> don't see these messages in the logs. I use Lustre 1.6.6 for RHEL4.
Thanks. That's what I wanted to hear.
Regards,
Thomas
>
> cheers
>
> Wojciech
>
> Thomas Roth wrote:
>> Hi all,
>>
>> In a cluster with 375 clients, over a 12-hour period I get about 500
>> messages of the type
>>
>> > Connection to service MGS via nid A.B.C.D at tcp was lost; in
>> progress operations using this service will fail.
>>
>> and about 800 messages of the type
>>
>> > Connection to service MDT0000 via nid A.B.C.D at tcp was lost; in
>> progress operations using this service will wait for recovery to
>> complete.
>>
>> Those clients are batch farm nodes; they continuously run all kinds of
>> user jobs that read and write data on Lustre.
>>
>> I have no way of telling how bad this situation is, since I know only
>> the error logs of our cluster. I have seen these messages right from
>> the start of testing this cluster, but did not try to count them,
>> since the performance then was splendid.
>>
>> So what is your experience? Should there be no errors of this kind at
>> all, is it something to be expected on a busy network, should there be
>> a few connection losses due to specific machine problems, or is this
>> just normal?
>>
>> Thanks,
>> Thomas
>>
>
--
--------------------------------------------------------------------
Thomas Roth
Department: Informationstechnologie
Location: SB3 1.262
Phone: +49-6159-71 1453 Fax: +49-6159-71 2986
GSI Helmholtzzentrum für Schwerionenforschung GmbH
Planckstraße 1
D-64291 Darmstadt
www.gsi.de
Gesellschaft mit beschränkter Haftung
Sitz der Gesellschaft: Darmstadt
Handelsregister: Amtsgericht Darmstadt, HRB 1528
Geschäftsführer: Professor Dr. Horst Stöcker
Vorsitzende des Aufsichtsrates: Dr. Beatrix Vierkorn-Rudolph,
Stellvertreter: Ministerialdirigent Dr. Rolf Bernhardt