thr3ads.net - Lustre discuss - [Lustre-discuss] Network Package loss [Nov 2009]

Heiko Schröter

2009-Nov-09 13:48 UTC

[Lustre-discuss] Network Package loss

Hello,

we are seeing peaks of up to 30% packet loss on our Gigabit network.
This is sporadic, roughly once an hour, and lasts for some seconds. We cannot
say whether it ever extends into minutes.
We attribute this to very high peak load on the network.

Could the Lustre 'reconnect' or 'lnet_try_match_md()' messages be
correlated with this?
I.e. the MDS has problems matching information between the OSTs and the MGS ...
What happens inside Lustre when it runs into the famous 'packet loss'
on the network? (Are there any timeout/retry counters?)

Regards
Heiko

Isaac Huang

2009-Nov-10 01:41 UTC

On Mon, Nov 09, 2009 at 02:48:34PM +0100, Heiko Schröter wrote:
> Hello,
> 
> we do encounter peaks of upto 30% package loss in our Gigabit Network.

It would be helpful if you'd elaborate on where the 30% came from.

> This is sporadic, say once every hour remaining for some seconds. We cannot
> specify if it extends into minutes.
> We do relate this to a very high peak load on the net.
> 
> Could it be that lustre 'reconnect' messages or
> 'lnet_try_match_md()' are correlated to this ?

I'm not sure which 'reconnect' you meant, but usually they're rate
limited and backed off exponentially, so I'd be surprised if
reconnection requests were overwhelming the network.
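
To illustrate why rate-limited, exponentially backed-off retries are unlikely to flood a link, here is a minimal Python sketch. It is not Lustre's reconnect code: the function name `backoff_delays` and the base, factor and cap values are assumptions chosen only for the example.

```python
def backoff_delays(base=1.0, factor=2.0, cap=60.0, attempts=8):
    """Return the wait (in seconds) before each retry attempt.

    Hypothetical parameters, not taken from Lustre: the wait starts at
    `base`, is multiplied by `factor` after every failed attempt, and is
    never allowed to exceed `cap`.
    """
    delays = []
    delay = base
    for _ in range(attempts):
        delays.append(delay)
        # Grow the wait geometrically, but clamp it to the cap.
        delay = min(cap, delay * factor)
    return delays


if __name__ == "__main__":
    waits = backoff_delays()
    print(waits)       # [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0, 60.0]
    print(sum(waits))  # 183.0
```

With these made-up values a single client sends only 8 requests in about three minutes, so the request rate drops quickly the longer the peer stays unreachable.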

The 'lnet_try_match_md()' errors are usually caused by buffer
management problems in Lustre services, which result in incoming
messages being dropped. If the other end resends those messages
aggressively, that could be a problem, but right now there is too
little information to tell.
> i.e. the mds has problems to match infos between osts and mgs ...
> What happens inside lustre when it stumbles across famous 'package
> loss' on the net ? (Any timeout/retry counters ???)

Usually packet loss is handled by TCP. If you enable network error
console logging, you will see errors once TCP has given up
retransmission:

    echo +neterror > /proc/sys/lnet/printk
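
For a sense of how long TCP itself may keep retransmitting before it gives up, the following Python sketch adds up the retransmission timeouts. It assumes the usual Linux defaults (`tcp_retries2` = 15, a minimum RTO of 200 ms, a maximum RTO of 120 s) and a timeout that doubles after each retry. The function name is hypothetical, real timeouts depend on the measured round-trip time, and the LNet socket driver may apply its own shorter timeouts.

```python
def tcp_give_up_seconds(retries=15, rto_min=0.2, rto_max=120.0):
    """Estimate the time until TCP abandons an unacknowledged segment.

    Assumed Linux defaults, not values from this thread. The timer fires
    once for the original transmission and once per retransmission, and
    the timeout doubles each time up to rto_max.
    """
    total = 0.0
    rto = rto_min
    for _ in range(retries + 1):
        total += rto
        rto = min(rto_max, rto * 2)
    return total


if __name__ == "__main__":
    print(round(tcp_give_up_seconds(), 1))  # 924.6
```

Under these assumptions a connection can stall for roughly 15 minutes before TCP reports an error, which is why short bursts of loss lasting a few seconds would normally be absorbed by retransmission and never show up as neterror messages.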

Thanks,
Isaac
