We are running a patched (mballoc and others) 1.6.3 Lustre on the x86_64 platform under CentOS 4.5. Kernel is ...

Linux submit.ufhpc 2.6.18-8.1.14.el5Lustre #1 SMP Fri Oct 12 15:51:56 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux

and is the same on the client and server side. We have both tcp and o2ib NIDs.

It seems to work pretty well most of the time, but yesterday a large (256-way) job started and it seems that the MDS pretty much just fell over. This occurred at 18:36 (see log below). The whole cluster went into recovery mode as though we had rebooted the MDS. Prior to that we were getting some "target send reply msg" and "no handle for file close" errors.

We know that each thread of the 256-way job opens and writes to a file independently but roughly concurrently, so the MDS probably got hit pretty hard all at once. Nonetheless, we hit it just as hard in our pre-production testing and it held up very well. So the questions are...

1. Does anyone recognize the errors below as known bugs for which there are patches/fixes we missed?

2. Looking at the tracebacks, I can't help but wonder if there is some ldlm parameter we need to tweak for a cluster of our size.

Note that we have 320 o2ib clients and 80 tcp clients (400 nodes/1600 cores). Our MDS is a 4-way Opteron 2212 w/ 8GB RAM. We have 8 OSSs and 24 OSTs (3 per OSS). Each OSS is a 4-way Opteron 275 with 4GB RAM and 3 2Gb/s FC HCAs connected to Xyratex 4900FC storage arrays.

Any ideas? Suggestions?

Thanks,

Charlie Taylor
UF HPC Center

Jan 10 13:34:54 hpcmds kernel: LustreError: 24816:0:(ldlm_lib.c:1437:target_send_reply_msg()) @@@ processing error (-116) req@ffff810122fe1e00 x183667281/t0 o35->775cf9dd-3a19-227f-8fe6-60b97358df87@NET_0x500000a0d1851_UUID:-1 lens 296/1736 ref 0 fl Interpret:/0/0 rc -116/0
Jan 10 13:34:54 hpcmds kernel: LustreError: 24816:0:(ldlm_lib.c:1437:target_send_reply_msg()) Skipped 2 previous similar messages
Jan 10 13:36:54 hpcmds kernel: LustreError: 24817:0:(mds_open.c:1474:mds_close()) @@@ no handle for file close ino 30269183: cookie 0x23710b5b3ed594eb req@ffff810116300000 x183716808/t0 o35->775cf9dd-3a19-227f-8fe6-60b97358df87@NET_0x500000a0d1851_UUID:-1 lens 296/1736 ref 0 fl Interpret:/0/0 rc 0/0
Jan 10 13:36:54 hpcmds kernel: LustreError: 24817:0:(mds_open.c:1474:mds_close()) Skipped 6 previous similar messages
Jan 10 13:36:54 hpcmds kernel: LustreError: 24817:0:(ldlm_lib.c:1437:target_send_reply_msg()) @@@ processing error (-116) req@ffff810116300000 x183716808/t0 o35->775cf9dd-3a19-227f-8fe6-60b97358df87@NET_0x500000a0d1851_UUID:-1 lens 296/1736 ref 0 fl Interpret:/0/0 rc -116/0
Jan 10 13:36:54 hpcmds kernel: LustreError: 24817:0:(ldlm_lib.c:1437:target_send_reply_msg()) Skipped 6 previous similar messages
Jan 10 13:43:43 hpcmds kernel: LustreError: 24824:0:(mds_open.c:1474:mds_close()) @@@ no handle for file close ino 30298600: cookie 0x23710b5b3c868998 req@ffff810119cf1400 x183809041/t0 o35->775cf9dd-3a19-227f-8fe6-60b97358df87@NET_0x500000a0d1851_UUID:-1 lens 296/1736 ref 0 fl Interpret:/0/0 rc 0/0
Jan 10 13:43:43 hpcmds kernel: LustreError: 24824:0:(mds_open.c:1474:mds_close()) Skipped 1 previous similar message
Jan 10 13:43:43 hpcmds kernel: LustreError: 24824:0:(ldlm_lib.c:1437:target_send_reply_msg()) @@@ processing error (-116) req@ffff810119cf1400 x183809041/t0 o35->775cf9dd-3a19-227f-8fe6-60b97358df87@NET_0x500000a0d1851_UUID:-1 lens 296/1736 ref 0 fl Interpret:/0/0 rc -116/0
Jan 10 13:43:43 hpcmds kernel: LustreError: 24824:0:(ldlm_lib.c:1437:target_send_reply_msg()) Skipped 1 previous similar message
Jan 10 13:44:02 hpcmds kernel: LustreError: 24786:0:(mds_open.c:1474:mds_close()) @@@ no handle for file close ino 30205356: cookie 0x23710b5b3dfc4c91 req@ffff8101189a3600 x183817543/t0 o35->775cf9dd-3a19-227f-8fe6-60b97358df87@NET_0x500000a0d1851_UUID:-1 lens 296/1736 ref 0 fl Interpret:/0/0 rc 0/0
Jan 10 13:44:02 hpcmds kernel: LustreError: 24786:0:(mds_open.c:1474:mds_close()) Skipped 5 previous similar messages
Jan 10 13:44:02 hpcmds kernel: LustreError: 24786:0:(ldlm_lib.c:1437:target_send_reply_msg()) @@@ processing error (-116) req@ffff8101189a3600 x183817543/t0 o35->775cf9dd-3a19-227f-8fe6-60b97358df87@NET_0x500000a0d1851_UUID:-1 lens 296/1736 ref 0 fl Interpret:/0/0 rc -116/0
Jan 10 13:44:02 hpcmds kernel: LustreError: 24786:0:(ldlm_lib.c:1437:target_send_reply_msg()) Skipped 5 previous similar messages
Jan 10 13:44:52 hpcmds kernel: LustreError: 24818:0:(mds_open.c:1474:mds_close()) @@@ no handle for file close ino 24185989: cookie 0x23710b5b3ed9874c req@ffff81012fa67c00 x183830524/t0 o35->775cf9dd-3a19-227f-8fe6-60b97358df87@NET_0x500000a0d1851_UUID:-1 lens 296/1736 ref 0 fl Interpret:/0/0 rc 0/0
Jan 10 13:44:52 hpcmds kernel: LustreError: 24818:0:(mds_open.c:1474:mds_close()) Skipped 1 previous similar message
Jan 10 13:44:52 hpcmds kernel: LustreError: 24818:0:(ldlm_lib.c:1437:target_send_reply_msg()) @@@ processing error (-116) req@ffff81012fa67c00 x183830524/t0 o35->775cf9dd-3a19-227f-8fe6-60b97358df87@NET_0x500000a0d1851_UUID:-1 lens 296/1736 ref 0 fl Interpret:/0/0 rc -116/0
Jan 10 13:44:52 hpcmds kernel: LustreError: 24818:0:(ldlm_lib.c:1437:target_send_reply_msg()) Skipped 1 previous similar message
Jan 10 13:47:00 hpcmds kernel: LustreError: 24798:0:(mds_open.c:1474:mds_close()) @@@ no handle for file close ino 28959573: cookie 0x23710b5b3eda7dd4 req@ffff8101161d0c00 x183871936/t0 o35->775cf9dd-3a19-227f-8fe6-60b97358df87@NET_0x500000a0d1851_UUID:-1 lens 296/1736 ref 0 fl Interpret:/0/0 rc 0/0
Jan 10 13:47:00 hpcmds kernel: LustreError: 24798:0:(mds_open.c:1474:mds_close()) Skipped 1 previous similar message
Jan 10 13:47:00 hpcmds kernel: LustreError: 24798:0:(ldlm_lib.c:1437:target_send_reply_msg()) @@@ processing error (-116) req@ffff8101161d0c00 x183871936/t0 o35->775cf9dd-3a19-227f-8fe6-60b97358df87@NET_0x500000a0d1851_UUID:-1 lens 296/1736 ref 0 fl Interpret:/0/0 rc -116/0
Jan 10 13:47:00 hpcmds kernel: LustreError: 24798:0:(ldlm_lib.c:1437:target_send_reply_msg()) Skipped 1 previous similar message
Jan 10 13:51:11 hpcmds kernel: LustreError: 3730:0:(mds_open.c:1474:mds_close()) @@@ no handle for file close ino 30137171: cookie 0x23710b5b3c790ddb req@ffff810119320400 x183904052/t0 o35->775cf9dd-3a19-227f-8fe6-60b97358df87@NET_0x500000a0d1851_UUID:-1 lens 296/1736 ref 0 fl Interpret:/0/0 rc 0/0
Jan 10 13:51:11 hpcmds kernel: LustreError: 3730:0:(ldlm_lib.c:1437:target_send_reply_msg()) @@@ processing error (-116) req@ffff810119320400 x183904052/t0 o35->775cf9dd-3a19-227f-8fe6-60b97358df87@NET_0x500000a0d1851_UUID:-1 lens 296/1736 ref 0 fl Interpret:/0/0 rc -116/0
Jan 10 13:53:00 hpcmds kernel: Lustre: ufhpc-MDT0000: haven't heard from client 8514a628-1f97-0759-0932-de3e470cbb4f (at 10.13.26.48@o2ib) in 2252 seconds. I think it's dead, and I am evicting it.
Jan 10 13:53:00 hpcmds kernel: Lustre: MGS: haven't heard from client 3e041a15-91e3-5c94-78fa-3b511d86bcd9 (at 10.13.26.48@o2ib) in 2252 seconds. I think it's dead, and I am evicting it.
Jan 10 13:58:23 hpcmds kernel: LustreError: 5546:0:(mds_open.c:1474:mds_close()) @@@ no handle for file close ino 30296257: cookie 0x23710b5b3c18eca5 req@ffff81011f3d3000 x184041701/t0 o35->775cf9dd-3a19-227f-8fe6-60b97358df87@NET_0x500000a0d1851_UUID:-1 lens 296/1736 ref 0 fl Interpret:/0/0 rc 0/0
Jan 10 13:58:23 hpcmds kernel: LustreError: 5546:0:(mds_open.c:1474:mds_close()) Skipped 2 previous similar messages
Jan 10 13:58:23 hpcmds kernel: LustreError: 5546:0:(ldlm_lib.c:1437:target_send_reply_msg()) @@@ processing error (-116) req@ffff81011f3d3000 x184041701/t0 o35->775cf9dd-3a19-227f-8fe6-60b97358df87@NET_0x500000a0d1851_UUID:-1 lens 296/1736 ref 0 fl Interpret:/0/0 rc -116/0
Jan 10 13:58:23 hpcmds kernel: LustreError: 5546:0:(ldlm_lib.c:1437:target_send_reply_msg()) Skipped 2 previous similar messages
Jan 10 14:09:20 hpcmds kernel: LustreError: 24825:0:(mds_open.c:1474:mds_close()) @@@ no handle for file close ino 29841036: cookie 0x23710b5b3d470bf2 req@ffff81011860cc00 x184196858/t0 o35->775cf9dd-3a19-227f-8fe6-60b97358df87@NET_0x500000a0d1851_UUID:-1 lens 296/1736 ref 0 fl Interpret:/0/0 rc 0/0
Jan 10 14:09:20 hpcmds kernel: LustreError: 24825:0:(mds_open.c:1474:mds_close()) Skipped 4 previous similar messages
Jan 10 14:09:20 hpcmds kernel: LustreError: 24825:0:(ldlm_lib.c:1437:target_send_reply_msg()) @@@ processing error (-116) req@ffff81011860cc00 x184196858/t0 o35->775cf9dd-3a19-227f-8fe6-60b97358df87@NET_0x500000a0d1851_UUID:-1 lens 296/1736 ref 0 fl Interpret:/0/0 rc -116/0
Jan 10 14:09:20 hpcmds kernel: LustreError: 24825:0:(ldlm_lib.c:1437:target_send_reply_msg()) Skipped 4 previous similar messages
Jan 10 14:23:15 hpcmds kernel: LustreError: 5544:0:(mds_open.c:1474:mds_close()) @@@ no handle for file close ino 30101082: cookie 0x23710b5b3e8e62f1 req@ffff810117d1e400 x184423975/t0 o35->775cf9dd-3a19-227f-8fe6-60b97358df87@NET_0x500000a0d1851_UUID:-1 lens 296/1736 ref 0 fl Interpret:/0/0 rc 0/0
Jan 10 14:23:15 hpcmds kernel: LustreError: 5544:0:(mds_open.c:1474:mds_close()) Skipped 3 previous similar messages
Jan 10 14:23:15 hpcmds kernel: LustreError: 5544:0:(ldlm_lib.c:1437:target_send_reply_msg()) @@@ processing error (-116) req@ffff810117d1e400 x184423975/t0 o35->775cf9dd-3a19-227f-8fe6-60b97358df87@NET_0x500000a0d1851_UUID:-1 lens 296/1736 ref 0 fl Interpret:/0/0 rc -116/0
Jan 10 14:23:15 hpcmds kernel: LustreError: 5544:0:(ldlm_lib.c:1437:target_send_reply_msg()) Skipped 3 previous similar messages
Jan 10 18:36:57 hpcmds kernel: LustreError: 0:0:(ldlm_lockd.c:210:waiting_locks_callback()) ### lock callback timer expired: evicting client e23549c4-cd61-b385-8f2d-1c125e41e7ef@NET_0x500000a0d1a78_UUID nid 10.13.26.120@o2ib ns: mds-ufhpc-MDT0000_UUID lock: ffff8100bb893680/0x23710b5b59ab4680 lrc: 1/0,0 mode: CR/CR res: 30066994/507692538 bits 0x3 rrc: 309 type: IBT flags: 4000030 remote: 0x4a51d1af1d561084 expref: 13 pid 31402
Jan 10 18:36:57 hpcmds kernel: LustreError: 31365:0:(handler.c:1498:mds_handle()) operation 101 on unconnected MDS from 12345-10.13.26.120@o2ib
Jan 10 18:36:57 hpcmds kernel: LustreError: 31365:0:(ldlm_lib.c:1437:target_send_reply_msg()) @@@ processing error (-107) req@ffff8101191b9400 x10882458/t0 o101-><?>@<?>:-1 lens 512/0 ref 0 fl Interpret:/0/0 rc -107/0
Jan 10 18:36:57 hpcmds kernel: LustreError: 31365:0:(ldlm_lib.c:1437:target_send_reply_msg()) Skipped 12 previous similar messages
Jan 10 18:45:17 hpcmds kernel: [<ffffffff8856bd62>] :obdclass:class_handle2object+0xd2/0x160
Jan 10 18:45:17 hpcmds kernel: [<ffffffff885f04e0>] :ptlrpc:ldlm_server_blocking_ast+0x0/0x6b0
Jan 10 18:45:17 hpcmds kernel: [<ffffffff885ec6a0>] :ptlrpc:ldlm_handle_enqueue+0xda0/0x1420
Jan 10 18:45:17 hpcmds kernel: [<ffffffff888430bd>] :mds:mds_handle+0x478d/0x59af
Jan 10 18:45:17 hpcmds kernel: [<ffffffff8860a030>] :ptlrpc:lustre_swab_ptlrpc_body+0x0/0x90
Jan 10 18:45:17 hpcmds kernel: [<ffffffff8856bd62>] :obdclass:class_handle2object+0xd2/0x160
Jan 10 18:45:17 hpcmds kernel: [<ffffffff885f04e0>] :ptlrpc:ldlm_server_blocking_ast+0x0/0x6b0
Jan 10 18:45:17 hpcmds kernel: [<ffffffff88607be5>] :ptlrpc:lustre_swab_buf+0xc5/0xf0
Jan 10 18:45:17 hpcmds kernel: [<ffffffff888430bd>] :mds:mds_handle+0x478d/0x59af
Jan 10 18:45:17 hpcmds kernel: [<ffffffff8860a030>] :ptlrpc:lustre_swab_ptlrpc_body+0x0/0x90
Jan 10 18:45:17 hpcmds kernel: [<ffffffff8856bd62>] :obdclass:class_handle2object+0xd2/0x160
Jan 10 18:45:17 hpcmds kernel: [<ffffffff88611617>] :ptlrpc:ptlrpc_main+0x13c7/0x1d70
Jan 10 18:45:17 hpcmds kernel: [<ffffffff88607be5>] :ptlrpc:lustre_swab_buf+0xc5/0xf0
Jan 10 18:45:17 hpcmds kernel: [<ffffffff80086a74>] default_wake_function+0x0/0xe
Jan 10 18:45:17 hpcmds kernel: [<ffffffff8001c26c>] __mod_timer+0xb0/0xbe
Jan 10 18:45:17 hpcmds kernel: [<ffffffff8860a030>] :ptlrpc:lustre_swab_ptlrpc_body+0x0/0x90
Jan 10 18:45:17 hpcmds kernel: [<ffffffff8005be25>] child_rip+0xa/0x11
Jan 10 18:45:17 hpcmds kernel: [<ffffffff88611617>] :ptlrpc:ptlrpc_main+0x13c7/0x1d70
Jan 10 18:45:17 hpcmds kernel: [<ffffffff88607be5>] :ptlrpc:lustre_swab_buf+0xc5/0xf0
Jan 10 18:45:17 hpcmds kernel: [<ffffffff88610250>] :ptlrpc:ptlrpc_main+0x0/0x1d70
Jan 10 18:45:17 hpcmds kernel: [<ffffffff80086a74>] default_wake_function+0x0/0xe
Jan 10 18:45:17 hpcmds kernel: [<ffffffff8001c26c>] __mod_timer+0xb0/0xbe
Jan 10 18:45:17 hpcmds kernel: [<ffffffff8005be1b>] child_rip+0x0/0x11
Jan 10 18:45:17 hpcmds kernel: [<ffffffff8005be25>] child_rip+0xa/0x11
Jan 10 18:45:17 hpcmds kernel:
Jan 10 18:45:17 hpcmds kernel: ll_mdt_100 S ffff81022c2315e0 0 31377 1 31378 31376 [<ffffffff88611617>] :ptlrpc:ptlrpc_main+0x13c7/0x1d70
Jan 10 18:45:17 hpcmds kernel: (L-TLB)
Jan 10 18:45:17 hpcmds kernel: ffff81022c2315e0 ffff81022c231510 ffff81022c231510 000000000000000a
Jan 10 18:45:17 hpcmds kernel: ffff8102206c90c0 ffff81013396f040 000abe5ed9a075d0 [<ffffffff88610250>] :ptlrpc:ptlrpc_main+0x0/0x1d70
Jan 10 18:45:17 hpcmds kernel: 0000000000009a43
Jan 10 18:45:17 hpcmds kernel: ffff8102206c92a8 0000000000000002 ffffffff885ceb51 ffffffffffffffff
Jan 10 18:45:17 hpcmds kernel: Call Trace:
Jan 10 18:45:17 hpcmds kernel: [<ffffffff80086a74>] default_wake_function+0x0/0xe
Jan 10 18:45:17 hpcmds kernel: [<ffffffff8005be1b>] child_rip+0x0/0x11
Jan 10 18:45:17 hpcmds kernel: [<ffffffff8005be25>] child_rip+0xa/0x11
Jan 10 18:45:17 hpcmds kernel:
Jan 10 18:45:17 hpcmds kernel: ll_mdt_17 S ffff81011b57f6a0 0 26533 1 26534 26532 (L-TLB)
Jan 10 18:45:17 hpcmds kernel: ffff81011b57f6a0 [<ffffffff885ceb51>] :ptlrpc:ldlm_run_cp_ast_work+0x161/0x1f0
Jan 10 18:45:17 hpcmds kernel: ffff81011b57f5d0 ffff81011b57f5d0 000000000000000a
Jan 10 18:45:17 hpcmds kernel: ffff810118f48040 ffff8101339010c0 000abe5ed9897be8 0000000000008bd4
Jan 10 18:45:17 hpcmds kernel: ffff810118f48228 0000000000000001 ffffffff885ceb51 ffffffffffffffff
Jan 10 18:45:17 hpcmds kernel: Call Trace:
Jan 10 18:45:17 hpcmds kernel: [<ffffffff88610250>] :ptlrpc:ptlrpc_main+0x0/0x1d70
Jan 10 18:45:17 hpcmds kernel: [<ffffffff8005be1b>] child_rip+0x0/0x11
Jan 10 18:45:17 hpcmds kernel:
Jan 10 18:45:17 hpcmds kernel: [<ffffffff885e6650>] :ptlrpc:ldlm_expired_completion_wait+0x0/0x250
Jan 10 18:45:17 hpcmds kernel: [<ffffffff885ceb51>] :ptlrpc:ldlm_run_cp_ast_work+0x161/0x1f0
Jan 10 18:45:17 hpcmds kernel: [<ffffffff800611f7>] schedule_timeout+0x8a/0xad
Jan 10 18:45:17 hpcmds kernel: [<ffffffff80092c5e>] process_timeout+0x0/0x5
Jan 10 18:45:17 hpcmds kernel: [<ffffffff885e6650>] :ptlrpc:ldlm_expired_completion_wait+0x0/0x250
Jan 10 18:45:17 hpcmds kernel: [<ffffffff800611f7>] schedule_timeout+0x8a/0xad
Jan 10 18:45:17 hpcmds kernel: [<ffffffff885e7aad>] :ptlrpc:ldlm_completion_ast+0x35d/0x6a0
Jan 10 18:45:17 hpcmds kernel: [<ffffffff80092c5e>] process_timeout+0x0/0x5
Jan 10 18:45:17 hpcmds kernel: [<ffffffff885d05e9>] :ptlrpc:ldlm_lock_enqueue+0x559/0x5c0
Jan 10 18:45:17 hpcmds kernel: [<ffffffff885e7aad>] :ptlrpc:ldlm_completion_ast+0x35d/0x6a0
Jan 10 18:45:17 hpcmds kernel: [<ffffffff80086a74>] default_wake_function+0x0/0xe
Jan 10 18:45:17 hpcmds kernel: [<ffffffff885d05e9>] :ptlrpc:ldlm_lock_enqueue+0x559/0x5c0
Jan 10 18:45:17 hpcmds kernel: [<ffffffff885cce6a>] :ptlrpc:ldlm_lock_addref_internal_nolock+0x3a/0x90
Jan 10 18:45:17 hpcmds kernel: [<ffffffff80086a74>] default_wake_function+0x0/0xe
Jan 10 18:45:17 hpcmds kernel: [<ffffffff885e4460>] :ptlrpc:ldlm_blocking_ast+0x0/0x2d0
Jan 10 18:45:17 hpcmds kernel: [<ffffffff885cce6a>] :ptlrpc:ldlm_lock_addref_internal_nolock+0x3a/0x90

Followed by tons more of the above tracebacks.
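On the ldlm-tuning question (2), the 1.6-era knobs live under /proc/fs/lustre and in module options. The sketch below is a rough pointer only, not a verified recipe: the proc file names, the mds_num_threads option, and the values shown are assumptions drawn from 1.6-era documentation, so check them against your own tree before use.

    # On a client: how many MDT locks it holds, and its current lock LRU cap
    cat /proc/fs/lustre/ldlm/namespaces/*-mdc-*/lock_count
    cat /proc/fs/lustre/ldlm/namespaces/*-mdc-*/lru_size

    # Shrink the per-client LRU so ~400 clients pin fewer locks on the MDS.
    # (400 is a placeholder value; redirection does not glob, hence the loop.)
    for f in /proc/fs/lustre/ldlm/namespaces/*-mdc-*/lru_size; do
        echo 400 > "$f"
    done

    # On the MDS, the service thread count is a module option, set e.g. in
    # /etc/modprobe.conf and picked up at the next module load:
    #   options mds mds_num_threads=512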
Hi,

We have similar problems when running benchIO and metaBench with a large number of clients. I found the following patches, which may help resolve the problem. I haven't tried them yet on our Lustre installation, but I am going to do that over the weekend.

https://bugzilla.lustre.org/attachment.cgi?id=13540
https://bugzilla.lustre.org/attachment.cgi?id=13739

cheers,

Wojciech Turek

On 11 Jan 2008, at 12:18, Charles Taylor wrote:

> We are running a patched (mballoc and others) 1.6.3 Lustre on the
> x86_64 platform under CentOS 4.5. Kernel is ...
>
> [...]
Hi,

I forgot to ask: if your MDS is failing over, what happens on the failover MDS? Does it recover connections to all of the clients? Can you also describe your failover solution (e.g. heartbeat etc.)?

cheers,

Wojciech Turek

On 11 Jan 2008, at 12:18, Charles Taylor wrote:

> We are running a patched (mballoc and others) 1.6.3 Lustre on the
> x86_64 platform under CentOS 4.5. Kernel is ...
>
> [...]
On Fri, 2008-01-11 at 07:18 -0500, Charles Taylor wrote:

> We are running a patched (mballoc and others) 1.6.3 Lustre on the
> x86_64 platform under CentOS 4.5. Kernel is ...

This looks like bug 13917.

b.
Hi,

Yes, I came to the same conclusion. I have tried to apply the patches from this bug, but the patching process FAILED:

[lbuild@mds01 lustre-1.6.3]$ patch -p0 < ../../../patch_13540
patching file lustre/ldlm/ldlm_request.c
Hunk #2 succeeded at 772 (offset 2 lines).
Hunk #4 succeeded at 866 (offset 2 lines).
Hunk #6 succeeded at 989 (offset 46 lines).
Hunk #7 FAILED at 1061.
Hunk #8 FAILED at 1134.
Hunk #9 succeeded at 1169 (offset 36 lines).
Hunk #10 succeeded at 1228 (offset 46 lines).
2 out of 10 hunks FAILED -- saving rejects to file lustre/ldlm/ldlm_request.c.rej
[lbuild@mds01 lustre-1.6.3]$ cat lustre/ldlm/ldlm_request.c.rej
***************
*** 1005,1012 ****
                                   int count, int max, int flags)
  {
          cfs_time_t cur = cfs_time_current();
-         struct ldlm_lock *lock, *next;
-         int rc, added = 0, left;
          ENTRY;
  
          spin_lock(&ns->ns_unused_lock);
--- 1061,1068 ----
                                   int count, int max, int flags)
  {
          cfs_time_t cur = cfs_time_current();
+         struct ldlm_lock *lock;
+         int added = 0;
          ENTRY;
  
          spin_lock(&ns->ns_unused_lock);
***************
*** 1078,1107 ****
          spin_unlock(&ns->ns_unused_lock);
  
          /* Handle only @added inserted locks. */
-         left = added;
-         list_for_each_entry_safe(lock, next, cancels, l_bl_ast) {
-                 if (left-- == 0)
-                         break;
- 
-                 rc = ldlm_cli_cancel_local(lock);
-                 if (rc == LDLM_FL_BL_AST) {
-                         CFS_LIST_HEAD(head);
- 
-                         LDLM_DEBUG(lock, "Cancel lock separately");
-                         list_del_init(&lock->l_bl_ast);
-                         list_add(&lock->l_bl_ast, &head);
-                         ldlm_cli_cancel_req(lock->l_conn_export, &head, 1);
-                         rc = LDLM_FL_LOCAL_ONLY;
-                 }
-                 if (rc == LDLM_FL_LOCAL_ONLY) {
-                         /* CANCEL RPC should not be sent to server. */
-                         list_del_init(&lock->l_bl_ast);
-                         LDLM_LOCK_PUT(lock);
-                         added--;
-                 }
- 
-         }
-         RETURN(added);
  }
  
  /* when called with LDLM_ASYNC the blocking callback will be handled
--- 1134,1140 ----
          spin_unlock(&ns->ns_unused_lock);
  
          /* Handle only @added inserted locks. */
+         RETURN (ldlm_cancel_list(cancels, added));
  }
  
  /* when called with LDLM_ASYNC the blocking callback will be handled
[lbuild@mds01 lustre-1.6.3]$

Any idea why? Can I get help with that?

Regards,

Wojciech Turek

On 11 Jan 2008, at 18:29, Brian J. Murrell wrote:

> On Fri, 2008-01-11 at 07:18 -0500, Charles Taylor wrote:
>> We are running a patched (mballoc and others) 1.6.3 Lustre on the
>> x86_64 platform under CentOS 4.5. Kernel is ...
>
> This looks like bug 13917.
>
> b.
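For what it is worth, the two failed hunks are exactly the ones that replace the old inline cancel loop with a call to ldlm_cancel_list(); that helper does not appear in a stock 1.6.3 ldlm_request.c, which suggests the attachment was generated against a newer b1_6 tree or depends on the other attachment being applied first. A quick way to check, using nothing beyond stock patch and grep (a sketch, not a verified recipe):

    # See what would fail, without modifying the tree
    patch -p0 --dry-run < ../../../patch_13540

    # Does the helper the failed hunks expect exist yet?
    grep -n ldlm_cancel_list lustre/ldlm/ldlm_request.c

    # Compare the rejects against the current source
    cat lustre/ldlm/ldlm_request.c.rej

If the grep finds nothing, try applying the two attachments in the other order; failing that, the two rejected hunks are small enough to port by hand.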
On Friday 11 January 2008 10:51:34 am Wojciech Turek wrote:

> Hi,
>
> Yes, I came to the same conclusion. I have tried to apply the patches
> from this bug, but the patching process FAILED.

This may be a little off-topic, but since the patches for bug #13917 have been included in 1.6.4.1, is manually patching really worth the hassle? Unless you have a specific reason to stick with 1.6.3, of course.

Cheers,
--
Kilian
It is just a confidence/testing issue. Since 1.6.4.1 is not a "bug-fix only" release, it will be essentially untested for us. Since we run a production cluster, this is not acceptable. We put 1.6.3 into service originally without the thorough testing it needed, and we were *VERY* sorry. Applying a specific patch to a known code base is a good bit safer, though not foolproof.

Charlie Taylor
UF HPC Center

On Jan 11, 2008, at 1:55 PM, Kilian CAVALOTTI wrote:

> On Friday 11 January 2008 10:51:34 am Wojciech Turek wrote:
>> Hi,
>>
>> Yes, I came to the same conclusion. I have tried to apply the patches
>> from this bug, but the patching process FAILED.
>
> This may be a little off-topic, but since the patches for bug #13917
> have been included in 1.6.4.1, is manually patching really worth the
> hassle? Unless you have a specific reason to stick with 1.6.3, of
> course.
>
> Cheers,
> --
> Kilian
Hi,

The reason is that we have a quite stable configuration of Lustre 1.6.3, and it is running on a production system. We have plans to upgrade to 1.6.4.1, but first we need to test it. Experience has taught us that upgrading to the latest release is not always a good thing.

Could you give us some help with these patches?

Cheers,

Wojciech

On 11 Jan 2008, at 18:55, Kilian CAVALOTTI wrote:

> This may be a little off-topic, but since the patches for bug #13917
> have been included in 1.6.4.1, is manually patching really worth the
> hassle? Unless you have a specific reason to stick with 1.6.3, of
> course.
>
> Cheers,
> --
> Kilian
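On the testing point raised in the last two messages: a crude way to approximate the 256-way concurrent open/write/close pattern from the original report is a parallel-shell fan-out along the lines of the sketch below. The host names, path, and sizes are placeholders, and the use of pdsh is an assumption about the site tooling.

    # Each node repeatedly rewrites its own file on the test filesystem,
    # giving the MDS a burst of roughly concurrent opens and closes
    # (each dd invocation is one open/write/close cycle).
    pdsh -w node[001-256] '
        f=/lustre/scratch/stress/$(hostname).dat
        for i in $(seq 1 100); do
            dd if=/dev/zero of="$f" bs=1M count=4 conv=notrunc 2>/dev/null
        done'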