eeb@clusterfs.com

2007-Apr-25 13:16 UTC

[Lustre-devel] [Bug 11544] ptlrpc_check_set() LBUG

Please don't reply to lustre-devel. Instead, comment in Bugzilla by
using the following link:
https://bugzilla.lustre.org/show_bug.cgi?id=11544



Bernd,

I strongly suspect that this bug is expressed only when there are network errors
that cause lustre to recover its connections.  In the logs you last posted, an
RPC request sent from the client to 192.168.41.102@o2ib (beo-102) completed with
failure (status 12, IB_WC_RETRY_EXC_ERR - message retries exceeded).

BTW, these messages do not appear on the console by default, since attempts to
communicate with a node that's in reset can also cause them.  However, if you
want low-level network errors to appear on the console and in the syslogs, you
can run
"echo + neterror > /proc/sys/lnet/printk"

If network failures always occur on 1 client, then I'd want to see if the
problem moves with the client's HCA.  Similarly, if they always occur with the
same server, I'd want to see if it moves with the server's HCA.  Otherwise, I'd
suspect that the fabric has a problem - in fact that's my strongest suspicion.
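
To make that triage concrete, here is a rough sketch of how the failed RPCs
in the logs could be tallied per node.  It is only an illustration: the
function name, the (client, server) event format and the 0.8 threshold are
all assumptions of mine, not anything lustre or LNET provides.  It only
points at a suspect; swapping the HCA and seeing whether the problem moves
is still the real test.

```python
from collections import Counter


def localize_failures(events, dominance=0.8):
    """Guess where network failures originate.

    events: list of (client, server) pairs, one per failed RPC
            (hypothetical format, extracted from the logs by hand or script).
    dominance: assumed fraction of all failures a single node must be
               involved in before it is singled out.
    """
    if not events:
        return ("none", None)

    total = len(events)
    clients = Counter(client for client, _ in events)
    servers = Counter(server for _, server in events)
    top_client, client_hits = clients.most_common(1)[0]
    top_server, server_hits = servers.most_common(1)[0]

    client_dominant = client_hits / total >= dominance
    server_dominant = server_hits / total >= dominance

    if client_dominant and server_dominant:
        # Same pair every time: cannot tell the two ends apart yet.
        return ("ambiguous", (top_client, top_server))
    if client_dominant:
        return ("client", top_client)
    if server_dominant:
        return ("server", top_server)
    # Failures are spread over many nodes: suspect the fabric itself.
    return ("fabric", None)
```

A result of ("client", ...) or ("server", ...) says which HCA to swap first;
("fabric", None) means no single node stands out.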

At the time that the error noted above was logged, lustre had posted many 1MByte
RDMAs.  In fact lustre can load the fabric with incessant, many-to-many RDMAs
which stress the fabric harder than many test programs or applications.  Maybe
checking and zeroing the fabric error counters periodically to see where most
errors are accumulating could help isolate the problem.
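
As a sketch of what "see where most errors are accumulating" could look like,
the snippet below compares two snapshots of per-port error counters and ranks
the ports by growth.  How the snapshots are collected is left out (a tool such
as perfquery from infiniband-diags is one possibility), and the function name,
snapshot layout and port labels are assumptions made for the example.

```python
def rank_error_growth(before, after):
    """Rank ports by how many errors they accumulated between two snapshots.

    before, after: dict mapping a port label to a dict of
                   counter name -> value (hypothetical layout).
    Returns a list of (port, growth) pairs, worst port first.
    """
    growth = {}
    for port, counters in after.items():
        previous = before.get(port, {})
        total = 0
        for name, value in counters.items():
            old = previous.get(name, 0)
            # If the counter went backwards it was zeroed in between,
            # so the new value is itself the number of new errors.
            total += value - old if value >= old else value
        growth[port] = total
    return sorted(growth.items(), key=lambda item: item[1], reverse=True)


# Example with made-up port labels and values.
before = {
    "switch1/port3": {"SymbolErrorCounter": 10, "PortRcvErrors": 2},
    "switch2/port7": {"SymbolErrorCounter": 0},
}
after = {
    "switch1/port3": {"SymbolErrorCounter": 12, "PortRcvErrors": 2},
    "switch2/port7": {"SymbolErrorCounter": 40},
}
worst_first = rank_error_growth(before, after)
```

Running this periodically while lustre is loading the fabric should show
whether errors cluster on a few ports or are spread across the fabric.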

Of course, since you have our best reproducer for this bug, we'd rather you
didn't fix the fabric until we've solved it :)
