Thomas Roth
2009-Mar-05 17:56 UTC
[Lustre-discuss] MDS refuses connections (no visible reason)
Hi all,

after running for days without any problems, our MDS has been refusing cooperation for two hours now.
The log files show nothing until
>Mar 5 16:46:24 mds1 kernel: Lustre: 17841:0:(ldlm_lib.c:525:target_handle_reconnect()) MDT0000: 481fa70b-590d-31b6-f621-c6125a54bfff reconnecting
>Mar 5 16:46:24 mds1 kernel: Lustre: 17841:0:(ldlm_lib.c:760:target_handle_connect()) MDT0000: refuse reconnection from 481fa70b-590d-31b6-f621-c6125a54bfff at 1.2.3.4@tcp to 0xffff8107ef44a000; still busy with 2 active RPCs

I thought that such a thing would be between the MDT and this particular client. However, the log goes on like that with many other clients.

Now the MDS is refusing any connection, bringing the system to a standstill.

The situation also triggered the dumping of ca. 130 log dumps to /tmp. Most of these are small and contain just
>Watchdog triggered for pid 17866: it was inactive for 12000s
>Unable to dump stack because of missing export

A few are larger and contain more complaints about lengthy requests and possible timeouts:
>ptlrpc_server_handle_request Request x75091039 took longer than estimated (42+4208s); client may timeout.
or
>ptlrpc_server_handle_request Dropping timed-out request from 12345-140.181.114.222 at tcp: deadline 1000+923s ago

None of these seem critical?
Maybe all clients have timed out for some reason?
Even so, I'd assume the MDS would still be responsive, say to a mount request from a fresh client, one that cannot possibly have any leftover transactions pending on it?

Right now the only thing I see to do is to reboot the server. Of course that is not a nice procedure on a system we advertised as stable and reliable to our users...

So any help will be much appreciated.

Regards,
Thomas
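To see how many distinct clients are being refused, the "refuse reconnection" lines can be tallied per client. The following is only a rough sketch: the pattern is modeled on the single log line quoted above (unwrapped onto one line, as syslog writes it), the function name `summarize_refusals` is made up for illustration, and other Lustre versions may word the message differently.

```python
import re

# Pattern modeled on the refusal line quoted in the message above.
# The archive shows "<uuid> at <nid>"; "<uuid>@<nid>" is accepted as well,
# on the assumption that the raw log uses "@" there.
REFUSAL = re.compile(
    r"refuse reconnection from (?P<uuid>[0-9a-f-]+)(?:@| at )(?P<nid>\S+) to "
    r".*?; still busy with (?P<rpcs>\d+) active RPCs"
)


def summarize_refusals(lines):
    """Return {client_uuid: (nid, times_refused, peak_active_rpcs)}."""
    summary = {}
    for line in lines:
        match = REFUSAL.search(line)
        if match is None:
            continue  # not a refusal message
        uuid = match.group("uuid")
        rpcs = int(match.group("rpcs"))
        nid, refusals, peak = summary.get(uuid, (match.group("nid"), 0, 0))
        summary[uuid] = (nid, refusals + 1, max(peak, rpcs))
    return summary
```

Calling `summarize_refusals(open(path))` on the MDS syslog file (the path depends on the local syslog setup) gives one entry per refused client, which makes it easy to tell a single stuck client from a server-wide problem.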
Patricia Santos Marco
2009-Aug-17 18:14 UTC
[Lustre-discuss] MDS refuses connections (no visible reason)
Over the last day our MDS has been refusing connections too. The logs are the same, and we have to reboot the MDS server. What is the reason for this?
Oleg Drokin
2009-Aug-17 19:22 UTC
[Lustre-discuss] MDS refuses connections (no visible reason)
Hello!

On Aug 17, 2009, at 2:14 PM, Patricia Santos Marco wrote:
> The last day our MDS refusing connections too. The logs are the same,
> and we should reboot the MDS server. What is the reason for this?

That means some requests from this client are still being processed. The server has a self-preservation mechanism that tries to protect it from a client resending the same RPC again and again (which leads to slow server processing, if not worse) and occupying more and more server threads.
The hung threads either hit an LBUG that you can see in the logs, or the watchdogs should have triggered, showing what the threads were doing (also visible in the logs) before the clients timed out.

Bye,
    Oleg
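The refusal rule described in this reply can be illustrated with a toy model. This is not Lustre code: the class and method names (`ToyTarget`, `Export`, `handle_reconnect`) are invented for the sketch, and it only captures the one idea that a reconnect is refused while earlier RPCs from the same client are still being processed.

```python
class Export:
    """Toy per-client state kept by the server (hypothetical)."""

    def __init__(self):
        self.active_rpcs = 0  # requests from this client still in progress


class ToyTarget:
    """Toy stand-in for a server target; illustrates the refusal rule only."""

    def __init__(self):
        self.exports = {}

    def start_rpc(self, client):
        self.exports.setdefault(client, Export()).active_rpcs += 1

    def finish_rpc(self, client):
        self.exports[client].active_rpcs -= 1

    def handle_reconnect(self, client):
        """Accept the reconnect unless the client still has RPCs in flight."""
        export = self.exports.get(client)
        if export is not None and export.active_rpcs > 0:
            return False  # "still busy with N active RPCs"
        self.exports.setdefault(client, Export())
        return True
```

In this model, a request handled by a thread that never returns never reaches `finish_rpc`, so the affected client is refused on every reconnect attempt. The model does not cover server threads being used up, which the reply names as the thing the mechanism is guarding against.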
Patricia Santos Marco
2009-Aug-18 08:27 UTC
[Lustre-discuss] MDS refuses connections (no visible reason)
Our MDT runs Lustre 1.6.7. I see in this message http://lists.lustre.org/pipermail/lustre-discuss/2009-April/010167.html that this version has a bug that causes directory corruption on the MDT. Can this bug produce this kind of error?
Mag Gam
2009-Aug-18 12:23 UTC
[Lustre-discuss] MDS refuses connections (no visible reason)
Just curious: if you didn't compile your own kernel, how do you apply this patch? Is our only option to upgrade via RPMs, or is there another way to apply the patch?
Oleg Drokin
2009-Aug-18 16:14 UTC
[Lustre-discuss] MDS refuses connections (no visible reason)
Hello!

On Aug 18, 2009, at 4:27 AM, Patricia Santos Marco wrote:
> Our MDT have lustre 1.6.7, I see in this message http://lists.lustre.org/pipermail/lustre-discuss/2009-April/010167.html
> that this version have a bug that cause directory corruptions on
> the MDT. Can this bug produce this kind of errors?

The corruption in that bug does not lead to hanging threads, from what I can see, so it is likely not relevant to your case.
Also, there are very distinct messages you would see in the system logs if you hit that bug (like bad inode messages).

Bye,
    Oleg
Oleg Drokin
2009-Aug-18 16:15 UTC
[Lustre-discuss] MDS refuses connections (no visible reason)
Hello!

On Aug 18, 2009, at 8:23 AM, Mag Gam wrote:
> just curious, if you didn't compile your own kernel, how do you apply
> this patch? Is our only option to upgrade via RPMS or is there another
> way to apply the patch?

This patch is to Lustre itself, not to the kernel. So you just need the Lustre sources, to which you would apply the patch, and then recompile against your existing running kernel.
Or you can use the provided update RPMs.

Bye,
    Oleg