Thomas Roth
2009-Apr-09 17:17 UTC
[Lustre-discuss] MDT connection refusal: still busy with 2 active RPCs
Hi all,

our cluster is becoming increasingly unusable because of refused connections. Typical log entries on the MDS look like this:

ldlm_lib.ctarget_handle_connect lustre-MDT0000: refuse reconnection from 77cbd453-ee72-fe75-cb06-c49179e0a011 at Lustre-Client@tcp to 0xffff810111341000; still busy with 2 active RPCs

These messages are surrounded by a growing number of "triggered watchdogs" and log dumps, which contain pretty much what can also be seen in /var/log/kern.log.

I have searched through every hit Google gave me for "busy with N active RPCs", but found no conclusive answer as to what causes this behavior and, more importantly, how to repair it.

In the end, all connectivity to the MDT was lost, so I had to restart the MDS.

Thanks,
Thomas
Brian J. Murrell
2009-Apr-09 17:44 UTC
[Lustre-discuss] MDT connection refusal: still busy with 2 active RPCs
On Thu, 2009-04-09 at 19:17 +0200, Thomas Roth wrote:
> Hi all,

Hi.

> ldlm_lib.ctarget_handle_connect lustre-MDT0000: refuse reconnection from
> 77cbd453-ee72-fe75-cb06-c49179e0a011 at Lustre-Client@tcp to
> 0xffff810111341000; still busy with 2 active RPCs
>
> These messages are surrounded by an increasing amount of "triggered
> watchdogs" and Log-dumps, which contain pretty much what can also be
> seen in /var/log/kern.log.

If the server never makes progress with those outstanding RPCs, that's usually a deadlock issue.

If you are not already on 1.6.7 I would suggest upgrading to 1.6.7.1 (or patching 1.6.7 yourself with the MDS corruption bug fix) when it becomes available and see if the problem persists.

b.
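To judge whether the server "never makes progress" for particular clients, it can help to tally the refusal messages per client UUID and see whether the same clients keep being refused with the same number of active RPCs. The following is only a rough sketch: the regular expression is inferred from the single log line quoted in this thread, and the names REFUSE_RE and tally_refusals are hypothetical helpers, not part of Lustre. Real log lines may be formatted differently.

```python
import re
from collections import Counter

# Pattern inferred from the one log line quoted in this thread (an assumption);
# adjust it if your kernel log formats the message differently.
REFUSE_RE = re.compile(
    r"(?P<target>\S+): refuse reconnection from (?P<uuid>[0-9a-f-]+)"
    r".*?still busy with (?P<rpcs>\d+) active RPCs"
)


def tally_refusals(lines):
    """Count refused reconnections per client UUID.

    Returns (counts, max_rpcs): how often each UUID was refused, and the
    largest "active RPCs" value reported for it.
    """
    counts = Counter()
    max_rpcs = {}
    for line in lines:
        match = REFUSE_RE.search(line)
        if match is None:
            continue  # not a refusal message
        uuid = match.group("uuid")
        counts[uuid] += 1
        max_rpcs[uuid] = max(max_rpcs.get(uuid, 0), int(match.group("rpcs")))
    return counts, max_rpcs


if __name__ == "__main__":
    # /var/log/kern.log is the path mentioned in the thread.
    with open("/var/log/kern.log", errors="replace") as log:
        counts, max_rpcs = tally_refusals(log)
    for uuid, n in counts.most_common():
        print(f"{uuid}: refused {n} times, up to {max_rpcs[uuid]} active RPCs")
```

A client that is refused again and again with an unchanged RPC count over a long period would be consistent with the stuck-RPC situation described above, though the tally alone does not prove a deadlock.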