Hi Wojciech,
Wojciech Turek wrote:
> Hi,
>
> It doesn't look healthy. I assume that those messages and the
> numbers are from the client side; what do you see on the MDS server
> itself?
I haven't gotten a good correlation between the client and MDS
messages yet.
Of course, on the MDS I see evictions, refused connections due to
leftover active RPCs, as well as timeouts because "Request x55349122 took
longer than estimated" - the whole spectrum, I think.
> It seems to me that your network connection to the MDS is flaky, hence
> so many disconnection messages. It may not hurt your bandwidth
> performance noticeably, but it will certainly kill your metadata
> performance. I suggest running some tests and seeing for yourself. From
> your email I see that you are using Ethernet to connect the MDS to the
> rest of the cluster. It may be worth checking the cable or the
> interface for errors and dropped packets.
The major trouble as seen from the user's side is of the type "Node A
doesn't see Lustre". Jobs dispatched to such a node then cannot run and
exit with failure. On inspection the node is doing fine, Lustre is
mounted and accessible - it just took too long to reactivate the
connection. So indeed, metadata performance is dead.
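
Regarding the interface check: I'll keep an eye on the error and drop
counters on the client NICs. Below is a small Python sketch I'd use for
that - it just reads the standard Linux counters from /sys/class/net;
the interface name "eth0" is only a placeholder for whatever the nodes
actually use.

#!/usr/bin/env python
# Print NIC error/drop counters from sysfs (Linux only).
# Assumption: the interface name is given as the first argument,
# falling back to the placeholder "eth0".
import os
import sys

def read_counter(iface, name):
    path = os.path.join('/sys/class/net', iface, 'statistics', name)
    with open(path) as f:
        return int(f.read().strip())

if __name__ == '__main__':
    iface = sys.argv[1] if len(sys.argv) > 1 else 'eth0'
    for name in ('rx_errors', 'tx_errors', 'rx_dropped', 'tx_dropped'):
        print('%s %s: %d' % (iface, name, read_counter(iface, name)))

Run once, wait a while, run again - steadily growing rx_errors or
rx_dropped would point at the cable or interface rather than at Lustre
itself.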
>
> I have a 600-node cluster here, 100% utilized with jobs most of the
> time; Lustre is serving the /home and /scratch file systems, and I
> don't see these messages in the logs. I use Lustre 1.6.6 for RHEL4.
Thanks. That's what I wanted to hear.
Regards,
Thomas
>
> cheers
>
> Wojciech
>
> Thomas Roth wrote:
>> Hi all,
>>
>> In a cluster with 375 clients, over a 12-hour period I get about 500
>> messages of the type
>>
>> > Connection to service MGS via nid A.B.C.D at tcp was lost; in
>> progress operations using this service will fail.
>>
>> and about 800 messages of the type
>>
>> > Connection to service MDT0000 via nid A.B.C.D at tcp was lost; in
>> progress operations using this service will wait for recovery to
>> complete.
>>
>> Those clients are batch farm nodes; they continuously run all kinds of
>> user jobs that read and write data on Lustre.
>>
>> I have no way of telling how bad this situation is, since I know only
>> the error logs of our cluster. I have seen these messages right from
>> the start of testing this cluster, but did not try to count them,
>> since the performance then was splendid.
>>
>> So what is your experience? Should there be no errors of this kind at
>> all, is it something to be expected on a busy network, should there be
>> a few connection losses due to specific machine problems, or is this
>> just normal?
>>
>> Thanks,
>> Thomas
>>
>
--
--------------------------------------------------------------------
Thomas Roth
Department: Informationstechnologie
Location: SB3 1.262
Phone: +49-6159-71 1453 Fax: +49-6159-71 2986
GSI Helmholtzzentrum für Schwerionenforschung GmbH
Planckstraße 1
D-64291 Darmstadt
www.gsi.de
Gesellschaft mit beschränkter Haftung
Sitz der Gesellschaft: Darmstadt
Handelsregister: Amtsgericht Darmstadt, HRB 1528
Geschäftsführer: Professor Dr. Horst Stöcker
Vorsitzende des Aufsichtsrates: Dr. Beatrix Vierkorn-Rudolph,
Stellvertreter: Ministerialdirigent Dr. Rolf Bernhardt