On Mon, Nov 09, 2009 at 02:48:34PM +0100, Heiko Schröter wrote:
> Hello,
>
> we do encounter peaks of up to 30% package loss in our Gigabit Network.
It would be helpful if you'd elaborate on where the 30% figure came from.
> This is sporadic, say once every hour remaining for some seconds. We cannot
> specify if it extends into minutes.
> We do relate this to a very high peak load on the net.
>
> Could it be that lustre 'reconnect' messages or
> 'lnet_try_match_md()' are correlated to this ?
I'm not sure which 'reconnect' you meant, but usually they're rate
limited and backed off exponentially, so I'd be surprised if
reconnection requests were overwhelming the network.
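To illustrate why backed-off retries are unlikely to flood a link, here is a rough sketch of capped exponential backoff. The function name and all numbers are made up for illustration; they are not Lustre's actual timers or code.

```shell
# Hypothetical illustration only: these are NOT Lustre's real reconnect timers.
# Usage: backoff_delays <initial_delay> <max_delay> <attempts>
# Prints the wait (in seconds) before each retry, doubling up to a cap.
backoff_delays() (
    delay=$1
    cap=$2
    attempts=$3
    i=0
    while [ "$i" -lt "$attempts" ]; do
        echo "$delay"
        delay=$((delay * 2))
        # Never wait longer than the cap.
        if [ "$delay" -gt "$cap" ]; then
            delay=$cap
        fi
        i=$((i + 1))
    done
)
```

With delays growing like this, the number of retries per minute drops quickly, so sustained reconnect traffic stays small.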
The 'lnet_try_match_md()' errors are usually caused by buffer
management problems in Lustre services, which would result in incoming
messages being dropped. If the other end resends those messages
aggressively, that could be a problem, but right now there is too
little information to tell.
> i.e. the mds has problems to match infos between osts and mgs ...
> What happens inside lustre when it stumbles across famous 'package
> loss' on the net ? (Any timeout/retry counters ???)
Usually packet loss is handled by TCP. If you enable network error
console logging, you'll see errors once TCP has given up on
retransmission:
    echo +neterror > /proc/sys/lnet/printk
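To see whether TCP is really retransmitting during one of the peaks, the kernel's own counters can be compared before and after. This is only a sketch, assuming a Linux node whose /proc/net/snmp exposes the usual OutSegs and RetransSegs fields; tcp_retrans is a made-up helper name, not a Lustre tool.

```shell
# Hypothetical helper: print the TCP sent/retransmitted segment counters.
# $1: a file in /proc/net/snmp format (defaults to the live counters).
tcp_retrans() {
    awk '
        # First "Tcp:" line is the header: remember each column position.
        $1 == "Tcp:" && !have_header {
            for (i = 2; i <= NF; i++) col[$i] = i
            have_header = 1
            next
        }
        # Second "Tcp:" line holds the values for those columns.
        $1 == "Tcp:" {
            printf "OutSegs=%s RetransSegs=%s\n", $(col["OutSegs"]), $(col["RetransSegs"])
            exit
        }
    ' "${1:-/proc/net/snmp}"
}
```

Running tcp_retrans shortly before and after a loss peak and comparing how much RetransSegs grew relative to OutSegs gives a rough idea of how much TCP had to retransmit in that interval.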
Thanks,
Isaac