eeb@clusterfs.com
2007-Apr-25 13:16 UTC
[Lustre-devel] [Bug 11544] ptlrpc_check_set() LBUG
Please don't reply to lustre-devel. Instead, comment in Bugzilla by using the following link: https://bugzilla.lustre.org/show_bug.cgi?id=11544

Bernd,

I strongly suspect that this bug is expressed only when there are network errors that cause lustre to recover its connections. In the logs you last posted, an RPC request sent from the client to 192.168.41.102@o2ib (beo-102) completed with failure (status 12, IB_WC_RETRY_EXC_ERR - message retries exceeded).

BTW, these messages do not appear on the console by default, since attempts to communicate with a node that's in reset can also cause them. However, if you want low-level network errors to appear on the console and in the syslogs, you can run

    echo + neterror > /proc/sys/lnet/printk

If network failures always occur on one client, then I'd want to see if the problem moves with the client's HCA. Similarly, if they always occur with the same server, I'd want to see if it moves with the server's HCA. Otherwise, I'd suspect that the fabric has a problem - in fact that's my strongest suspicion.

At the time that the error noted above was logged, lustre had posted many 1MByte RDMAs. In fact lustre can load the fabric with incessant, many-to-many RDMAs, which stress the fabric harder than many test programs or applications do. Checking and zeroing the fabric error counters periodically, to see where most errors are accumulating, could help isolate the problem.

Of course, since you have our best reproducer for this bug, we'd rather you didn't fix the fabric until we've solved it :)
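
As a rough illustration of the counter-checking idea above, here is a minimal Python sketch that compares two snapshots of per-port error counters and ranks the ports by how many new errors they accumulated. It makes no assumption about how the snapshots are collected (that depends on your IB stack and tools). The port labels, counter names, and sample values are made up for the example and are not taken from the posted logs.

```python
from typing import Dict, List, Tuple

# A snapshot maps a port label (hypothetical, e.g. "sw1/p2") to its error
# counters, e.g. {"symbol_error": 3, "link_downed": 0}.
Snapshot = Dict[str, Dict[str, int]]


def counter_deltas(before: Snapshot, after: Snapshot) -> Dict[str, int]:
    """Return the total number of new errors per port between two snapshots."""
    deltas: Dict[str, int] = {}
    for port, counters in after.items():
        old = before.get(port, {})
        total = 0
        for name, value in counters.items():
            prev = old.get(name, 0)
            # A counter that went backwards was presumably zeroed between
            # the snapshots, so its current value is the whole increase.
            total += value - prev if value >= prev else value
        deltas[port] = total
    return deltas


def worst_ports(before: Snapshot, after: Snapshot,
                top: int = 3) -> List[Tuple[str, int]]:
    """Return up to `top` ports with new errors, worst first."""
    deltas = counter_deltas(before, after)
    # Sort by descending error count, then by label for a stable order.
    ranked = sorted(deltas.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(port, n) for port, n in ranked if n > 0][:top]


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    before = {
        "sw1/p1": {"symbol_error": 2, "link_downed": 0},
        "sw1/p2": {"symbol_error": 10, "link_downed": 1},
        "hca-a/p1": {"symbol_error": 0},
    }
    after = {
        "sw1/p1": {"symbol_error": 2, "link_downed": 0},
        "sw1/p2": {"symbol_error": 250, "link_downed": 3},
        "hca-a/p1": {"symbol_error": 4},
    }
    for port, errors in worst_ports(before, after):
        print(port, errors)
```

If one port or one switch keeps turning up at the top of the list under lustre load, that is where I'd start looking; if the errors follow an HCA when it is moved, that points at the HCA rather than the fabric.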