Hi everyone, 2 months ago, we switched our ~80 node cluster from NFS to lustre. 1 MDS, 4 OSTs, lustre 1.6.7.2 on a rocks 4.2.1/centos 4.2/linux 2.6.9-78.0.22. We were quite happy with lustre''s performance, especially because bottlenecks caused by /home disk access were history. Saturday, the cluster went down (= was inaccessible). After some investigation I found out that the reason seems to be an overloaded MDS. Over the following 4 days, this happened multiple times and could only be resolved by 1) killing all user jobs and 2) hard-resetting the MDS. The MDS did not respond to any command, if I managed to get a video signal (not often), load was >170. Additionally, 2 times kernel oops got displayed, but unfortunately I have to record of them. The clients showed the following error:> Oct 8 09:58:55 majorana kernel: LustreError: 3787:0:(events.c:66:request_out_callback()) @@@ type 4, status -5 req at f6222800 x8702488/t0 o250->MGS at 10.255.255.206@tcp:26/25 lens 304/456 e 0 to 1 dl 1254988740 ref 2 fl Rpc:N/0/0 rc 0/0 > Oct 8 09:58:55 majorana kernel: LustreError: 3787:0:(events.c:66:request_out_callback()) Skipped 33 previous similar messagesSo, my question is: what could cause such a load? The cluster was not exessively used... Is this a bug or a user''s job that creates the load? How can I protect lustre against this kind of failure? Thanks in advance, Arne -- Arne Brutschy Ph.D. Student Email arne.brutschy(AT)ulb.ac.be IRIDIA CP 194/6 Web iridia.ulb.ac.be/~abrutschy Universite'' Libre de Bruxelles Tel +32 2 650 3168 Avenue Franklin Roosevelt 50 Fax +32 2 650 2715 1050 Bruxelles, Belgium (Fax at IRIDIA secretary)
Hi Arne, could be memory pressure and the OOM running and shooting at things. How much memory does you server has? Michael Am Freitag, den 09.10.2009, 10:26 +0200 schrieb Arne Brutschy:> Hi everyone, > > 2 months ago, we switched our ~80 node cluster from NFS to lustre. 1 > MDS, 4 OSTs, lustre 1.6.7.2 on a rocks 4.2.1/centos 4.2/linux > 2.6.9-78.0.22. > > We were quite happy with lustre''s performance, especially because > bottlenecks caused by /home disk access were history. > > Saturday, the cluster went down (= was inaccessible). After some > investigation I found out that the reason seems to be an overloaded MDS. > Over the following 4 days, this happened multiple times and could only > be resolved by 1) killing all user jobs and 2) hard-resetting the MDS. > > The MDS did not respond to any command, if I managed to get a video > signal (not often), load was >170. Additionally, 2 times kernel oops got > displayed, but unfortunately I have to record of them. > > The clients showed the following error: > > Oct 8 09:58:55 majorana kernel: LustreError: 3787:0:(events.c:66:request_out_callback()) @@@ type 4, status -5 req at f6222800 x8702488/t0 o250->MGS at 10.255.255.206@tcp:26/25 lens 304/456 e 0 to 1 dl 1254988740 ref 2 fl Rpc:N/0/0 rc 0/0 > > Oct 8 09:58:55 majorana kernel: LustreError: 3787:0:(events.c:66:request_out_callback()) Skipped 33 previous similar messages > > So, my question is: what could cause such a load? The cluster was not > exessively used... Is this a bug or a user''s job that creates the load? > How can I protect lustre against this kind of failure? > > Thanks in advance, > Arne >-- Michael Kluge, M.Sc. Technische Universit?t Dresden Center for Information Services and High Performance Computing (ZIH) D-01062 Dresden Germany Contact: Willersbau, Room A 208 Phone: (+49) 351 463-34217 Fax: (+49) 351 463-37773 e-mail: michael.kluge at tu-dresden.de WWW: http://www.tu-dresden.de/zih -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/x-pkcs7-signature Size: 5997 bytes Desc: not available Url : http://lists.lustre.org/pipermail/lustre-discuss/attachments/20091009/608ec349/attachment.bin
On Fri, 2009-10-09 at 10:26 +0200, Arne Brutschy wrote:> > The clients showed the following error: > > Oct 8 09:58:55 majorana kernel: LustreError: 3787:0:(events.c:66:request_out_callback()) @@@ type 4, status -5 req at f6222800 x8702488/t0 o250->MGS at 10.255.255.206@tcp:26/25 lens 304/456 e 0 to 1 dl 1254988740 ref 2 fl Rpc:N/0/0 rc 0/0 > > Oct 8 09:58:55 majorana kernel: LustreError: 3787:0:(events.c:66:request_out_callback()) Skipped 33 previous similar messages > > So, my question is: what could cause such a load? The cluster was not > exessively used... Is this a bug or a user''s job that creates the load? > How can I protect lustre against this kind of failure?Without any more information we could not possibly know. If you really are getting oopses then you will need console logs (i.e. serial console) so that we can see the stack trace. b. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 197 bytes Desc: This is a digitally signed message part Url : http://lists.lustre.org/pipermail/lustre-discuss/attachments/20091009/744dcfe2/attachment.bin
Hi, 8GB of ram, 2x 4core Intel Xeon E5410 @ 2.33GHz Arne On Fr, 2009-10-09 at 12:16 +0200, Michael Kluge wrote:> Hi Arne, > > could be memory pressure and the OOM running and shooting at things. How > much memory does you server has? > > > Michael > > Am Freitag, den 09.10.2009, 10:26 +0200 schrieb Arne Brutschy: > > Hi everyone, > > > > 2 months ago, we switched our ~80 node cluster from NFS to lustre. 1 > > MDS, 4 OSTs, lustre 1.6.7.2 on a rocks 4.2.1/centos 4.2/linux > > 2.6.9-78.0.22. > > > > We were quite happy with lustre''s performance, especially because > > bottlenecks caused by /home disk access were history. > > > > Saturday, the cluster went down (= was inaccessible). After some > > investigation I found out that the reason seems to be an overloaded MDS. > > Over the following 4 days, this happened multiple times and could only > > be resolved by 1) killing all user jobs and 2) hard-resetting the MDS. > > > > The MDS did not respond to any command, if I managed to get a video > > signal (not often), load was >170. Additionally, 2 times kernel oops got > > displayed, but unfortunately I have to record of them. > > > > The clients showed the following error: > > > Oct 8 09:58:55 majorana kernel: LustreError: 3787:0:(events.c:66:request_out_callback()) @@@ type 4, status -5 req at f6222800 x8702488/t0 o250->MGS at 10.255.255.206@tcp:26/25 lens 304/456 e 0 to 1 dl 1254988740 ref 2 fl Rpc:N/0/0 rc 0/0 > > > Oct 8 09:58:55 majorana kernel: LustreError: 3787:0:(events.c:66:request_out_callback()) Skipped 33 previous similar messages > > > > So, my question is: what could cause such a load? The cluster was not > > exessively used... Is this a bug or a user''s job that creates the load? > > How can I protect lustre against this kind of failure? > > > > Thanks in advance, > > Arne > >-- Arne Brutschy Ph.D. Student Email arne.brutschy(AT)ulb.ac.be IRIDIA CP 194/6 Web iridia.ulb.ac.be/~abrutschy Universite'' Libre de Bruxelles Tel +32 2 650 3168 Avenue Franklin Roosevelt 50 Fax +32 2 650 2715 1050 Bruxelles, Belgium (Fax at IRIDIA secretary)
Hmm. Should be enough. I guess you need to set up a loghost for syslog then and a reliable serial console to get stack traces. Everything else would be just a wild guess (as the question for the ram size was). Michael> Hi, > > 8GB of ram, 2x 4core Intel Xeon E5410 @ 2.33GHz > > Arne > > On Fr, 2009-10-09 at 12:16 +0200, Michael Kluge wrote: > > Hi Arne, > > > > could be memory pressure and the OOM running and shooting at things. How > > much memory does you server has? > > > > > > Michael > > > > Am Freitag, den 09.10.2009, 10:26 +0200 schrieb Arne Brutschy: > > > Hi everyone, > > > > > > 2 months ago, we switched our ~80 node cluster from NFS to lustre. 1 > > > MDS, 4 OSTs, lustre 1.6.7.2 on a rocks 4.2.1/centos 4.2/linux > > > 2.6.9-78.0.22. > > > > > > We were quite happy with lustre''s performance, especially because > > > bottlenecks caused by /home disk access were history. > > > > > > Saturday, the cluster went down (= was inaccessible). After some > > > investigation I found out that the reason seems to be an overloaded MDS. > > > Over the following 4 days, this happened multiple times and could only > > > be resolved by 1) killing all user jobs and 2) hard-resetting the MDS. > > > > > > The MDS did not respond to any command, if I managed to get a video > > > signal (not often), load was >170. Additionally, 2 times kernel oops got > > > displayed, but unfortunately I have to record of them. > > > > > > The clients showed the following error: > > > > Oct 8 09:58:55 majorana kernel: LustreError: 3787:0:(events.c:66:request_out_callback()) @@@ type 4, status -5 req at f6222800 x8702488/t0 o250->MGS at 10.255.255.206@tcp:26/25 lens 304/456 e 0 to 1 dl 1254988740 ref 2 fl Rpc:N/0/0 rc 0/0 > > > > Oct 8 09:58:55 majorana kernel: LustreError: 3787:0:(events.c:66:request_out_callback()) Skipped 33 previous similar messages > > > > > > So, my question is: what could cause such a load? The cluster was not > > > exessively used... Is this a bug or a user''s job that creates the load? > > > How can I protect lustre against this kind of failure? > > > > > > Thanks in advance, > > > Arne > > >-- Michael Kluge, M.Sc. Technische Universit?t Dresden Center for Information Services and High Performance Computing (ZIH) D-01062 Dresden Germany Contact: Willersbau, Room A 208 Phone: (+49) 351 463-34217 Fax: (+49) 351 463-37773 e-mail: michael.kluge at tu-dresden.de WWW: http://www.tu-dresden.de/zih -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/x-pkcs7-signature Size: 5997 bytes Desc: not available Url : http://lists.lustre.org/pipermail/lustre-discuss/attachments/20091009/9e93182e/attachment.bin
Hi, thanks for replying! I understand that without further information we can''t do much about the oopses. I was more hoping for some information regarding possible sources of such an overload. Is it normal that a MDS gets overloaded like this, while the OSTs have nothing to do, and what can I do about it? How can I find the source of the problem? More specifically, what are the operations that lead to a lot of MDS load and none for the OSTs? Although our MDS (8GB ram, 2x4core, SATA) is not a top-notch server, it''s fairly recent and I feel the load we''re experiencing is not handable by a single MDS. My problem is that I can''t make out major problems in the user''s jobs running on the cluster, and I can''t quantify nor track down the problem because I don''t know what behavior might have caused it. As I said, ooppses appeared only twice, and all other problems where just apparent by a non-responsive MDS. Thanks, Arne On Fr, 2009-10-09 at 07:44 -0400, Brian J. Murrell wrote:> On Fri, 2009-10-09 at 10:26 +0200, Arne Brutschy wrote: > > > > The clients showed the following error: > > > Oct 8 09:58:55 majorana kernel: LustreError: 3787:0:(events.c:66:request_out_callback()) @@@ type 4, status -5 req at f6222800 x8702488/t0 o250->MGS at 10.255.255.206@tcp:26/25 lens 304/456 e 0 to 1 dl 1254988740 ref 2 fl Rpc:N/0/0 rc 0/0 > > > Oct 8 09:58:55 majorana kernel: LustreError: 3787:0:(events.c:66:request_out_callback()) Skipped 33 previous similar messages > > > > So, my question is: what could cause such a load? The cluster was not > > exessively used... Is this a bug or a user''s job that creates the load? > > How can I protect lustre against this kind of failure? > > Without any more information we could not possibly know. If you really > are getting oopses then you will need console logs (i.e. serial console) > so that we can see the stack trace. > > b. > > _______________________________________________ > Lustre-discuss mailing list > Lustre-discuss at lists.lustre.org > http://lists.lustre.org/mailman/listinfo/lustre-discuss-- Arne Brutschy Ph.D. Student Email arne.brutschy(AT)ulb.ac.be IRIDIA CP 194/6 Web iridia.ulb.ac.be/~abrutschy Universite'' Libre de Bruxelles Tel +32 2 650 3168 Avenue Franklin Roosevelt 50 Fax +32 2 650 2715 1050 Bruxelles, Belgium (Fax at IRIDIA secretary)
On Fri, 2009-10-09 at 16:15 +0200, Arne Brutschy wrote:> Hi,Hi,> thanks for replying!NP.> I understand that without further information we can''t do much about the > oopses.Not just the oopses. It''s pretty hard to diagnose any problem without data.> I was more hoping for some information regarding possible > sources of such an overload.The possibilities are endless. Lustre is a complex system with many, many interactions with many other components. Sure, I could make stab in the dark guesses like too little memory, too slow disk, etc., but none of those are actually useful for anything more than a shopping trip and even then new hardware might still not solve your problem.> How can I find the source of the problem?Examine the logs. Lustre logs a lot of information when things are going wrong. b. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 197 bytes Desc: This is a digitally signed message part Url : http://lists.lustre.org/pipermail/lustre-discuss/attachments/20091009/e0e9a691/attachment.bin
LMT (http://code.google.com/p/lmt) might be able to give some hints if users are using the FS in a ''wild'' fashion. For the question "what can cause this behaviour of my MDS" I guess the answer is like: a million things ;) There is no way of being more specific with more input about the problem itself. Michael Am Freitag, den 09.10.2009, 16:15 +0200 schrieb Arne Brutschy:> Hi, > > thanks for replying! > > I understand that without further information we can''t do much about the > oopses. I was more hoping for some information regarding possible > sources of such an overload. Is it normal that a MDS gets overloaded > like this, while the OSTs have nothing to do, and what can I do about > it? How can I find the source of the problem? > > More specifically, what are the operations that lead to a lot of MDS > load and none for the OSTs? Although our MDS (8GB ram, 2x4core, SATA) is > not a top-notch server, it''s fairly recent and I feel the load we''re > experiencing is not handable by a single MDS. > > My problem is that I can''t make out major problems in the user''s jobs > running on the cluster, and I can''t quantify nor track down the problem > because I don''t know what behavior might have caused it. > > As I said, ooppses appeared only twice, and all other problems where > just apparent by a non-responsive MDS. > > Thanks, > Arne > > > On Fr, 2009-10-09 at 07:44 -0400, Brian J. Murrell wrote: > > On Fri, 2009-10-09 at 10:26 +0200, Arne Brutschy wrote: > > > > > > The clients showed the following error: > > > > Oct 8 09:58:55 majorana kernel: LustreError: 3787:0:(events.c:66:request_out_callback()) @@@ type 4, status -5 req at f6222800 x8702488/t0 o250->MGS at 10.255.255.206@tcp:26/25 lens 304/456 e 0 to 1 dl 1254988740 ref 2 fl Rpc:N/0/0 rc 0/0 > > > > Oct 8 09:58:55 majorana kernel: LustreError: 3787:0:(events.c:66:request_out_callback()) Skipped 33 previous similar messages > > > > > > So, my question is: what could cause such a load? The cluster was not > > > exessively used... Is this a bug or a user''s job that creates the load? > > > How can I protect lustre against this kind of failure? > > > > Without any more information we could not possibly know. If you really > > are getting oopses then you will need console logs (i.e. serial console) > > so that we can see the stack trace. > > > > b. > > > > _______________________________________________ > > Lustre-discuss mailing list > > Lustre-discuss at lists.lustre.org > > http://lists.lustre.org/mailman/listinfo/lustre-discuss-- Michael Kluge, M.Sc. Technische Universit?t Dresden Center for Information Services and High Performance Computing (ZIH) D-01062 Dresden Germany Contact: Willersbau, Room A 208 Phone: (+49) 351 463-34217 Fax: (+49) 351 463-37773 e-mail: michael.kluge at tu-dresden.de WWW: http://www.tu-dresden.de/zih -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/x-pkcs7-signature Size: 5997 bytes Desc: not available Url : http://lists.lustre.org/pipermail/lustre-discuss/attachments/20091009/ff83e9bd/attachment.bin
Hi, ok, thanks then. I was hoping there''s something like "doing this or that operation in abundance will cause this sort of behavior". :) Well then, I will add a serial console and the lustre monitoring. I will then try to send the MDS into overload, as we have by now managed to isolate the user who is apparently causing the overload. Thanks for now and have a nice weekend Arne On Fr, 2009-10-09 at 10:23 -0400, Brian J. Murrell wrote:> On Fri, 2009-10-09 at 16:15 +0200, Arne Brutschy wrote: > > Hi, > > Hi, > > > thanks for replying! > > NP. > > > I understand that without further information we can''t do much about the > > oopses. > > Not just the oopses. It''s pretty hard to diagnose any problem without > data. > > > I was more hoping for some information regarding possible > > sources of such an overload. > > The possibilities are endless. Lustre is a complex system with many, > many interactions with many other components. Sure, I could make stab > in the dark guesses like too little memory, too slow disk, etc., but > none of those are actually useful for anything more than a shopping trip > and even then new hardware might still not solve your problem. > > > How can I find the source of the problem? > > Examine the logs. Lustre logs a lot of information when things are > going wrong. > > b. > > _______________________________________________ > Lustre-discuss mailing list > Lustre-discuss at lists.lustre.org > http://lists.lustre.org/mailman/listinfo/lustre-discuss-- Arne Brutschy Ph.D. Student Email arne.brutschy(AT)ulb.ac.be IRIDIA CP 194/6 Web iridia.ulb.ac.be/~abrutschy Universite'' Libre de Bruxelles Tel +32 2 650 3168 Avenue Franklin Roosevelt 50 Fax +32 2 650 2715 1050 Bruxelles, Belgium (Fax at IRIDIA secretary)
Hello, thanks for your suggestions. I tried to investigate further the reason for the MDS overload. It happened 3 times last week, even after I locked out the user that I thought was causing it. As suggested, I set up a serial console monitor on the MDS and installed the lmt (not particularly easy :). This morning, the MDS was overloaded again. The cpu load spike beyond 150 and the logs where showing the following errors: -------8<---------- Oct 20 12:02:38 compute-2-0 kernLustre: 9806:0:(service.c:1317:ptlrpc_server_handle_request()) @@@ Request x64452376 took longer than estimated (100+87s); client may timeout. req at f6081c00 x64452376/t1397543027 o101->5f1b0ba3-5e5b-d6b0-17a2-8e974cb5df12@:0/0 lens 496/472 e 0 to 0 dl 1256032871 ref 1 fl Complete:/0/0 rc 301/301 el: LustreError: 0:0:(ldlm_lockd.c:305:waiting_locks_callback()) ### lock callback timer expired after 187s: evicting client at 10.255.255.190 at tcp ns: mds-lustre-MDT0000_UUID lock: f0249540/0xefbdbd73c4253265 lrc: 3/0,0 mode: CR/CR res: 1343024/2832666510 bits 0x3 rrc: 174 type: IBT flags: 0x4000030 remote: 0x73cb76d4b7121d86 expref: 1253 pid: 9807 timeout: 8475134 Lustre: 9747:0:(service.c:1317:ptlrpc_server_handle_request()) @@@ Request x78099102 took longer than estimated (100+92s); client may timeout. req at f4843a00 x78099102/t0 o101->99653b1f-c07d-b7c3-93fc-11f9ba3630f8@:0/0 lens 432/416 e 0 to 0 dl 1256032871 ref 1 fl Complete:/0/0 rc 0/0 LustreError: 9886:0:(ldlm_lockd.c:1161:ldlm_handle_enqueue()) ### lock on destroyed export f631b000 ns: mds-lustre-MDT0000_UUID lock: d23069c0/0xefbdbd73c4253281 lrc: 3/0,0 mode: CR/CR res: 1343024/2832666510 bits 0x3 rrc: 168 type: IBT flags: 0x4000030 remote: 0x9422ea29787ca6a7 expref: 3 pid: 9886 timeout: 0 Oct 20 12:02:43 compute-2-0 kernel: LustreError: 9886:0:(ldlm_lockd.c:1161:ldlm_handle_enqueue()) ### lock on destroyed export f631b000 ns: mds-lustre-MDT0000_UUID lock: d23069c0/0xefbdbd73c4253281 lrc: 3/0,0 mode: CR/CR res: 1343024/2832666510 bits 0x3 rrc: 168 type: IBT flags: 0x4000030 remote: 0x9422ea29787ca6a7 expref: 3 pid: 9886 timeout: 0 Lustre: 4380:0:(import.c:507:import_select_connection()) lustre-OST0002-osc: tried all connections, increasing latency to 26s LustreError: 167-0: This client was evicted by lustre-OST0002; in progress operations using this service will fail. LustreError: 12203:0:(quota_ctl.c:288:client_quota_ctl()) ptlrpc_queue_wait failed, rc: -3 Oct 20 12:02:45 LustreError: 12203:0:(quota_ctl.c:288:client_quota_ctl()) Skipped 1 previous similar message compute-2-0 kernLustreError: 12203:0:(quota_master.c:1661:qmaster_recovery_main()) qmaster recovery failed! (id:552 type:0 rc:-3) el: LustreError: 167-0: This client was evicted by lustre-OST0002; in progress operations using this service will fail. Oct 20 12:02:45 compute-2-0 kernel: LustreError: 12203:0:(quota_ctl.c:288:client_quota_ctl()) ptlrpc_queue_wait failed, rc: -3 Oct 20 12:02:45 compute-2-0 kernel: LustreError: 12203:0:(quota_ctl.c:288:client_quota_ctl()) Skipped 1 previous similar message Oct 20 12:02:45 compute-2-0 kernel: LustreError: 12203:0:(quota_master.c:1661:qmaster_recovery_main()) qmaster recovery failed! (id:552 type:0 rc:-3) LustreError: 11770:0:(llog_server.c:471:llog_origin_handle_cancel()) Cancel 62 of 124 llog-records failed: -2 Oct 20 12:02:47 compute-2-0 kernel: LustreError: 11770:0:(llog_server.c:471:llog_origin_handle_cancel()) Cancel 62 of 124 llog-records failed: -2 LustreError: 9771:0:(handler.c:1601:mds_handle()) operation 35 on unconnected MDS from 12345-10.255.255.190 at tcp Oct 20 12:02:49 compute-2-0 kernel: LustreError: 9771:0:(handler.c:1601:mds_handle()) operation 35 on unconnected MDS from 12345-10.255.255.190 at tcp Lustre: 0:0:(watchdog.c:148:lcw_cb()) Watchdog triggered for pid 9891: it was inactive for 200s Lustre: 0:0:(linux-debug.c:222:libcfs_debug_dumpstack()) showing stack for process 9816 ll_mdt_75 S 00000246 ll_mdt_94 3660 9816 1 S 00000246 9817 9815 (L-TLB) 3556 9882 1 eef1f8e4 00000046 988300000000 00000246 9881f9743d7c f60c17b0 (L-TLB) 4add8a03 ee1337a4 d1e28c40 00000046 f9677b82 00000000 00000000 00000246 f9743d7c c5664780 f6af2cb0 c5663de0 4add8a03 e02c1640 00000001 f9677b82 00000000 a135db00 00000000 c566c780 0098e380 c566bde0 00000002 f60c17b0 f6cf3170 00000000 a135db00 f6cf32dc 0098e380 00000000 f6af2cb0 f9678064 00000286 ecfaf870 eef1e000 ecfaf9dc d1e28c40 00000000 Call Trace: f9678064 00000286 ee132000 e02c1640 Call Trace: [<f9677b82>] [<f9677b82>] ldlm_expired_completion_wait+0xb2/0x510 [ptlrpc] ldlm_expired_completion_wait+0xb2/0x510 [ptlrpc] [<f9678064>] [<f9678064>] is_granted_or_cancelled+0x44/0x80 [ptlrpc] is_granted_or_cancelled+0x44/0x80 [ptlrpc] [<f96789fd>] [<f96789fd>] ldlm_completion_ast+0x95d/0xfb0 [ptlrpc] ldlm_completion_ast+0x95d/0xfb0 [ptlrpc] [<f969b06a>] [<f969b06a>] ldlm_process_inodebits_lock+0x27a/0x7a0 [ptlrpc] ldlm_process_inodebits_lock+0x27a/0x7a0 [ptlrpc] [<c02de9c6>] [<c02de9c6>] __cond_resched+0x14/0x39 __cond_resched+0x14/0x39 [<c011e7ac>] [<c011e7ac>] default_wake_function+0x0/0xc default_wake_function+0x0/0xc [<f9652cb7>] [<f9652cb7>] ldlm_lock_enqueue+0x1e7/0x11f0 [ptlrpc] ldlm_lock_enqueue+0x1e7/0x11f0 [ptlrpc] [<f9677ad0>] [<f9677ad0>] ldlm_expired_completion_wait+0x0/0x510 [ptlrpc] [<f9677ac0>] ldlm_expired_completion_wait+0x0/0x510 [ptlrpc] [<f9677ac0>] interrupted_completion_wait+0x0/0x10 [ptlrpc] interrupted_completion_wait+0x0/0x10 [ptlrpc] [<f9677ad0>] [<f9677ad0>] ldlm_expired_completion_wait+0x0/0x510 [ptlrpc] [<f9677ac0>] ldlm_expired_completion_wait+0x0/0x510 [ptlrpc] [<f9677ac0>] interrupted_completion_wait+0x0/0x10 [ptlrpc] interrupted_completion_wait+0x0/0x10 [ptlrpc] [<f96780a0>] [<f96780a0>] ldlm_completion_ast+0x0/0xfb0 [ptlrpc] [<f967a00a>] ldlm_completion_ast+0x0/0xfb0 [ptlrpc] [<f967a00a>] ldlm_cli_enqueue_local+0x91a/0xac0 [ptlrpc] ldlm_cli_enqueue_local+0x91a/0xac0 [ptlrpc] [<f9679050>] [<f9679050>] ldlm_blocking_ast+0x0/0x690 [ptlrpc] [<f96780a0>] ldlm_blocking_ast+0x0/0x690 [ptlrpc] [<f96780a0>] ldlm_completion_ast+0x0/0xfb0 [ptlrpc] ldlm_completion_ast+0x0/0xfb0 [ptlrpc] [<f8f468eb>] [<f8f468eb>] enqueue_ordered_locks+0x1db/0x850 [mds] enqueue_ordered_locks+0x1db/0x850 [mds] [<f9679050>] [<f9679050>] ldlm_blocking_ast+0x0/0x690 [ptlrpc] [<f96780a0>] ldlm_blocking_ast+0x0/0x690 [ptlrpc] [<f96780a0>] ldlm_completion_ast+0x0/0xfb0 [ptlrpc] ldlm_completion_ast+0x0/0xfb0 [ptlrpc] [<c0168e0f>] [<c0168e44>] __lookup_hash+0x47/0x95 __lookup_hash+0x7c/0x95 [<f8f48a8c>] [<f8f48a8c>] mds_get_parent_child_locked+0x3cc/0xe30 [mds] mds_get_parent_child_locked+0x3cc/0xe30 [mds] [<f8cdf24f>] ksocknal_queue_tx_locked+0x1cf/0x630 [ksocklnd] [<f895c025>] upcall_cache_get_entry+0x295/0xd10 [lvfs] [<f8f6fdce>] mds_open+0x70e/0x53c0 [mds] [<f8f204d1>] mds_getattr_lock+0x241/0x1490 [mds] [<c011e832>] [<f96c8a3d>] lustre_msg_get_flags+0x10d/0x1f0 [ptlrpc] [<f8f34040>] fixup_handle_for_resent_req+0x40/0x480 [mds] [<f96c8a3d>] lustre_msg_get_flags+0x10d/0x1f0 [ptlrpc] [<f8f34e71>] mds_intent_policy+0x9f1/0x1500 [mds] [<f8bbdfdb>] LNetMDBind+0x1fb/0x6b0 [lnet] [<f964bc9e>] __wake_up+0x29/0x3c [<f895c025>] ldlm_lock_remove_from_lru+0x5e/0x210 [ptlrpc] upcall_cache_get_entry+0x295/0xd10 [lvfs] [<f8cdf8e3>] ksocknal_launch_packet+0x143/0x5c0 [ksocklnd] [<f96c7ea3>] [<f8f34480>] mds_intent_policy+0x0/0x1500 [mds] [<f96531a3>] lustre_msg_string+0x43/0x510 [ptlrpc] [<c0170c30>] ldlm_lock_enqueue+0x6d3/0x11f0 [ptlrpc] [<f89dc8e0>] lustre_hash_add+0x180/0x4d0 [obdclass] [<f968a6ab>] ldlm_handle_enqueue+0x67b/0x23f0 [ptlrpc] [<f9688450>] ldlm_server_blocking_ast+0x0/0xc10 [ptlrpc] [<f9689060>] ldlm_server_completion_ast+0x0/0x990 [ptlrpc] [<f9688450>] ldlm_server_blocking_ast+0x0/0xc10 [ptlrpc] [<f8f29674>] mds_handle+0x21e4/0x5830 [mds] [<f96cac0c>] lustre_msg_get_conn_cnt+0x10c/0x1f0 [ptlrpc] [<c011cdc3>] activate_task+0x88/0x95 [<f96d4d2e>] ptlrpc_check_req+0xe/0x220 [ptlrpc] [<f96c9a8c>] dput+0x34/0x19b [<f8f55c48>] mds_reint_rec+0xe8/0x3a0 [mds] lustre_msg_get_opc+0x10c/0x1f0 [ptlrpc] [<c011de69>] [<f96c4260>] lustre_pack_reply_v2+0x1f0/0x620 [ptlrpc] [<f8f7b3d0>] mds_update_unpack+0x320/0x550 [mds] [<c02de9c6>] find_busiest_group+0xdd/0x295 [<c010aea2>] do_gettimeofday+0x1a/0x9c [<f96d9f8a>] ptlrpc_server_handle_request+0x28a/0x1e80 [ptlrpc] [<c010aea2>] do_gettimeofday+0x1a/0x9c [<f894218c>] lcw_update_time+0x1c/0x1a0 [libcfs] [<c02de2dd>]<3>LustreError: 9881:0:(handler.c:1601:mds_handle()) operation 400 on unconnected MDS from 12345-10.255.255.191 at tcp __cond_resched+0x14/0x39 [<f8f2438e>] mds_reint+0xfe/0x770 [mds] [<f96c8a3d>] lustre_msg_get_flags+0x10d/0x1f0 [ptlrpc] [<f895ee68>] lprocfs_counter_add+0xf8/0x120 [lvfs] [<f8f34f28>] mds_intent_policy+0xaa8/0x1500 [mds] [<f96547a4>] ldlm_run_cp_ast_work+0x174/0x330 [ptlrpc] [<f9653d87>] ldlm_reprocess_queue+0xc7/0x390 [ptlrpc] [<f969cbd8>] ldlm_pool_recalc+0x108/0x120 [ptlrpc] [<f964b045>] lock_res_and_lock+0x45/0xc0 [ptlrpc] [<f8f34480>] mds_intent_policy+0x0/0x1500 [mds] [<f96531a3>] ldlm_lock_enqueue+0x6d3/0x11f0 [ptlrpc] [<f89dc8e0>] schedule+0x855/0x8f3 lustre_hash_add+0x180/0x4d0 [obdclass] [<c02de30d>] [<f968a6ab>] ldlm_handle_enqueue+0x67b/0x23f0 [ptlrpc] [<f9688450>] ldlm_server_blocking_ast+0x0/0xc10 [ptlrpc] [<f9689060>] ldlm_server_completion_ast+0x0/0x990 [ptlrpc] [<f9688450>] ldlm_server_blocking_ast+0x0/0xc10 [ptlrpc] [<f8f29674>] mds_handle+0x21e4/0x5830 [mds] [<f96cac0c>] lustre_msg_get_conn_cnt+0x10c/0x1f0 [ptlrpc] [<c011cdc3>] activate_task+0x88/0x95 [<f96d4d2e>] ptlrpc_check_req+0xe/0x220 [ptlrpc] [<f96c9a8c>] lustre_msg_get_opc+0x10c/0x1f0 [ptlrpc] [<c010aea2>] do_gettimeofday+0x1a/0x9c [<f96d9f8a>] ptlrpc_server_handle_request+0x28a/0x1e80 [ptlrpc] [<c010aea2>] do_gettimeofday+0x1a/0x9c [<f894218c>] lcw_update_time+0x1c/0x1a0 [libcfs] [<f894218c>] lcw_update_time+0x1c/0x1a0 [libcfs] [<c011e832>] __wake_up+0x29/0x3c [<f89423c0>] lc_watchdog_touch_ms+0xb0/0x280 [libcfs] [<f894263c>] lc_watchdog_disable+0x7c/0x240 [libcfs] [<f89345b3>] cfs_block_sigs+0x73/0xa0 [libcfs] [<f96dcb3b>] ptlrpc_main+0x62b/0xdb0 [ptlrpc] [<f96dc500>] ptlrpc_retry_rqbds+0x0/0x10 [ptlrpc] [<f96dc500>] ptlrpc_retry_rqbds+0x0/0x10 [ptlrpc] [<c02e03b2>] schedule+0x885/0x8f3 [<f89423c0>] lc_watchdog_touch_ms+0xb0/0x280 [libcfs] [<f89345b3>] cfs_block_sigs+0x73/0xa0 [libcfs] [<f96dcb3b>] ptlrpc_main+0x62b/0xdb0 [ptlrpc] [<c011e7ac>] default_wake_function+0x0/0xc [<f96dc500>] ptlrpc_retry_rqbds+0x0/0x10 [ptlrpc] [<c02e03b2>] ret_from_fork+0x6/0x14 [<f96dc500>] ptlrpc_retry_rqbds+0x0/0x10 [ptlrpc] [<f96dc510>] ptlrpc_main+0x0/0xdb0 [ptlrpc] [<c01041f5>] kernel_thread_helper+0x5/0xb -------8<---------- Followed by pages and pages of call traces. The system locked up and I had to reboot the MDS (I managed by using the console, ssh wasn''t possible anymore). I attached the full log from the reboot yesterday to todays reboot. Reading the logs myself, it seems to be that the MDS looses connection to one of the OSS, and tries to recover from it. Afterwards, things to go haywire and the qmaster seems to crash? But I have trouble decrypting the logs. Can you see what''s going on? Cheers, Arne PS: a small recap: 1 MDS, 4 OSS, ~80 clients, ~350 cpus, max of ~1000 active jobs. On Fr, 2009-10-09 at 10:23 -0400, Brian J. Murrell wrote:> On Fri, 2009-10-09 at 16:15 +0200, Arne Brutschy wrote: > > Hi, > > Hi, > > > thanks for replying! > > NP. > > > I understand that without further information we can''t do much about the > > oopses. > > Not just the oopses. It''s pretty hard to diagnose any problem without > data. > > > I was more hoping for some information regarding possible > > sources of such an overload. > > The possibilities are endless. Lustre is a complex system with many, > many interactions with many other components. Sure, I could make stab > in the dark guesses like too little memory, too slow disk, etc., but > none of those are actually useful for anything more than a shopping trip > and even then new hardware might still not solve your problem. > > > How can I find the source of the problem? > > Examine the logs. Lustre logs a lot of information when things are > going wrong. > > b. > > _______________________________________________ > Lustre-discuss mailing list > Lustre-discuss at lists.lustre.org > http://lists.lustre.org/mailman/listinfo/lustre-discuss-- Arne Brutschy Ph.D. Student Email arne.brutschy(AT)ulb.ac.be IRIDIA CP 194/6 Web iridia.ulb.ac.be/~abrutschy Universite'' Libre de Bruxelles Tel +32 2 650 3168 Avenue Franklin Roosevelt 50 Fax +32 2 650 2715 1050 Bruxelles, Belgium (Fax at IRIDIA secretary) -------------- next part -------------- A non-text attachment was scrubbed... Name: mds.log.bz2 Type: application/x-bzip Size: 21908 bytes Desc: not available Url : http://lists.lustre.org/pipermail/lustre-discuss/attachments/20091020/4363e24d/attachment-0001.bin
On Tue, 2009-10-20 at 16:13 +0200, Arne Brutschy wrote:> > As suggested, I set up a serial console monitor on the MDS and installed > the lmt (not particularly easy :). This morning, the MDS was overloaded > again. The cpu load spike beyond 150 and the logs where showing the > following errors:Well, prior to this there are lots of quota errors you really ought to look into fixing. Try searching bugzilla for similar problems. What version of Lustre is this MDS running? As for the MDS problems, the first sign of trouble looks like: Lustre: Request x7178129 sent from lustre-OST0002-osc to NID 10.255.255.202 at tcp 50s ago has timed out (limit 50s). Lustre: lustre-OST0002-osc: Connection to service lustre-OST0002 via nid 10.255.255.202 at tcp was lost; in progress operations using this service will wait for recovery to complete. Lustre: Request x7178131 sent from lustre-OST0002-osc to NID 10.255.255.202 at tcp 50s ago has timed out (limit 50s). Lustre: Request x7178132 sent from lustre-OST0002-osc to NID 10.255.255.202 at tcp 50s ago has timed out (limit 50s). Lustre: Request x7178178 sent from lustre-OST0002-osc to NID 10.255.255.202 at tcp 5s ago has timed out (limit 5s). Which leads you do OST0002, which is at 10.255.255.202, yes? So what''s going on there that the MDS lost it''s connection to the OST there? b. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 197 bytes Desc: This is a digitally signed message part Url : http://lists.lustre.org/pipermail/lustre-discuss/attachments/20091020/c00bdca8/attachment.bin
On Di, 2009-10-20 at 11:20 -0400, Brian J. Murrell wrote:> Well, prior to this there are lots of quota errors you really ought to > look into fixing. Try searching bugzilla for similar problems. What > version of Lustre is this MDS running?As far as I understand right, these ''errors'' just report that a user ran out of quota. All the cluster is running lustre 1.6.7.2> As for the MDS problems, the first sign of trouble looks like: > > Lustre: Request x7178129 sent from lustre-OST0002-osc to NID 10.255.255.202 at tcp 50s ago has timed out (limit 50s). > Lustre: lustre-OST0002-osc: Connection to service lustre-OST0002 via nid 10.255.255.202 at tcp was lost; in progress operations using this service will wait for recovery to complete. > Lustre: Request x7178131 sent from lustre-OST0002-osc to NID 10.255.255.202 at tcp 50s ago has timed out (limit 50s). > Lustre: Request x7178132 sent from lustre-OST0002-osc to NID 10.255.255.202 at tcp 50s ago has timed out (limit 50s). > Lustre: Request x7178178 sent from lustre-OST0002-osc to NID 10.255.255.202 at tcp 5s ago has timed out (limit 5s). > > Which leads you do OST0002, which is at 10.255.255.202, yes? > So what''s going on there that the MDS lost it''s connection to the OST > there?Well, thats the big question. The OSS''s are basically idle, as we mostly use small files. The OSS''s log for this time: Oct 19 11:59:33 compute-2-3 kernel: LustreError: 166-1: MGC10.255.255.206 at tcp: Connection to service MGS via nid 10.255.255.206 at tcp was lost; in progress operations using this service will fail. Oct 19 12:00:08 compute-2-3 kernel: LustreError: 4421:0:(ldlm_lib.c:552:target_handle_reconnect()) lustre-mdtlov_UUID reconnecting from MGC10.255.255.206 at tcp_0, handle mismatch (ours 0x9f3a42f18876fee7, theirs 0x1b158acd85bf638) Oct 19 12:00:08 compute-2-3 kernel: LustreError: 4421:0:(ldlm_lib.c:786:target_handle_connect()) lustre-OST0003: NID 10.255.255.206 at tcp (lustre-mdtlov_UUID) reconnected with 1 conn_cnt; cookies not random? Oct 19 12:00:08 compute-2-3 kernel: LustreError: 4421:0:(ldlm_lib.c:1643:target_send_reply_msg()) @@@ processing error (-114) req at f56e9e00 x10/t0 o8-><?>@<?>:0/0 lens 240/144 e 0 to 0 dl 1255946508 ref 1 fl Interpret:/0/0 rc -114/0 Just reporting that it lost connection. It seems that somehow the MDS looses connection to at least one of the OSS, and then the situation deteriorates as it gets overwhelmed with reconnect events which it refuses. The in the end, it seems that every client is refused by the MFS. This seems to be similar to this problem: http://lists.lustre.org/pipermail/lustre-discuss/2009-April/010244.html Currently, the connection to clients and between MDS/OSS''s is a simple gigabit link (2 48port managed switches). We don''t have a separate storage network, as traffic on the job-side of things is very low. Looking at the interface statistics, I can''t see any errors or dropped packages during normal operation. Maybe I should connect the lustre servers among themselves and see if the latency/timeout problems persist. Regards, Arne -- Arne Brutschy Ph.D. Student Email arne.brutschy(AT)ulb.ac.be IRIDIA CP 194/6 Web iridia.ulb.ac.be/~abrutschy Universite'' Libre de Bruxelles Tel +32 2 650 3168 Avenue Franklin Roosevelt 50 Fax +32 2 650 2715 1050 Bruxelles, Belgium (Fax at IRIDIA secretary)