Kai Germaschewski
2007-Aug-21  09:14 UTC
[Lustre-discuss] lustre 1.6.1: ldlm_cli_enqueue: -2 errors
We've been playing with using Lustre as root fs for our x86_64 based cluster. We've run into quite some stability problems, with arbitrary processes on the nodes disappearing, like sshd, gmond, the Myrinet mapper or whatever.

We're running 2.6.18-vanilla + lustre 1.6.1, the filesystem being mounted read-only. MGS/MDS/OST are all on one server node. I've trouble understanding most of the things that Lustre is writing to the logs; any pointers to additional docs would be appreciated.

One consistently recurring problem is

LustreError: 11169:0:(mdc_locks.c:420:mdc_enqueue()) ldlm_cli_enqueue: -2

on the client.

Last night, in addition, clients seemed to be evicted regularly (and reconnecting) even though they were up, which may be where the random processes died. Currently, we're running with only one client, which seems to be stable except for the error above repeating itself.

I'll be happy to provide any additional info needed.

--Kai
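For readers puzzling over the trailing "-2": kernel code conventionally returns negated errno values, so -2 corresponds to -ENOENT ("No such file or directory"). The sketch below is only an illustration of that decoding step; the helper name describe_rc is hypothetical and not part of Lustre or any of its tools.

```python
import errno
import os


def describe_rc(rc: int) -> str:
    """Translate a negative kernel-style return code into its errno name.

    describe_rc is a hypothetical helper for illustration only.
    """
    code = abs(rc)
    # errno.errorcode maps numeric codes to symbolic names on this platform.
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{rc}: {name} ({os.strerror(code)})"


if __name__ == "__main__":
    # The return code reported by mdc_enqueue() in the message above.
    print(describe_rc(-2))
```

This only tells you what the number means, not why the lookup failed; the numeric-to-name mapping is taken from the platform the script runs on.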
Martin Pokorny
2007-Aug-21  10:02 UTC
[Lustre-discuss] lustre 1.6.1: ldlm_cli_enqueue: -2 errors
Kai Germaschewski wrote:
> We've been playing with using Lustre as root fs for our x86_64 based
> cluster. We've run into quite some stability problems, with arbitrary
> processes on the nodes disappearing, like sshd, gmond, the Myrinet mapper
> or whatever.

My cluster is seeing similar problems. I've got a heterogeneous cluster with both x86_64 and i386 nodes, and I'm not using Lustre as the root fs, but I've noticed problems similar to those you've described.

> We're running 2.6.18-vanilla + lustre 1.6.1, the filesystem being mounted
> read-only. MGS/MDS/OST are all on one server node. I've trouble
> understanding most of the things that lustre is writing to the logs, any
> pointers to additional docs would be appreciated.

Here I'm running the 2.6.9-55.EL_lustre-1.6.1smp kernel. One MGS/MDS, a few OSS nodes, and a few clients. Mostly I'm using the node hosting the MGS/MDS as a Lustre client. Network is TCP.

> One consistently recurring problem is
>
> LustreError: 11169:0:(mdc_locks.c:420:mdc_enqueue()) ldlm_cli_enqueue: -2
>
> on the client.

I'm seeing exactly the same messages.

> Last night, in addition clients seemed to be evicted regularly (and
> reconnecting) even though they were up, which may be where the random
> processes died. Currently, we're running with only one client, which seems
> to be stable except for the error above repeating itself.

Occasionally I see messages similar to the following:

LustreError: 3707:0:(client.c:962:ptlrpc_expire_one_request()) @@@ timeout (sent at 1187711017, 50s ago) req@0000010038a01800 x25961999/t0 o400->lustre-OST0002_UUID@10.64.95.251@tcp:28 lens 128/128 ref 1 fl Rpc:N/0/0 rc 0/-22

which is concurrent with a long pause in fs access. As far as I can tell, recovery is then successful, and the jobs keep running. The main effect seems to be that file operations on the Lustre filesystem are greatly slowed.

--
Martin
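To line such timeout messages up with job logs, it can help to pull out the "sent at" value and the age. The following is a rough sketch that assumes the "sent at" number is seconds since the Unix epoch (this is an assumption; the message itself does not say so). The regular expression and the function name parse_timeout are made up for this example.

```python
import re
from datetime import datetime, timezone

# The timeout message quoted above, copied verbatim.
LINE = (
    "LustreError: 3707:0:(client.c:962:ptlrpc_expire_one_request()) @@@ "
    "timeout (sent at 1187711017, 50s ago) req@0000010038a01800 "
    "x25961999/t0 o400->lustre-OST0002_UUID@10.64.95.251@tcp:28 lens "
    "128/128 ref 1 fl Rpc:N/0/0 rc 0/-22"
)

# Hypothetical pattern: matches only the "timeout (sent at N, Ms ago)" part.
TIMEOUT_RE = re.compile(r"timeout \(sent at (?P<sent>\d+), (?P<age>\d+)s ago\)")


def parse_timeout(line):
    """Return (send time as UTC datetime, age in seconds), or None if no match.

    Assumes the 'sent at' field is a Unix timestamp in seconds.
    """
    match = TIMEOUT_RE.search(line)
    if match is None:
        return None
    sent = datetime.fromtimestamp(int(match.group("sent")), tz=timezone.utc)
    return sent, int(match.group("age"))


if __name__ == "__main__":
    sent, age = parse_timeout(LINE)
    print(f"request sent {sent.isoformat()}, expired after {age}s")
```

Under that assumption the request above was sent on 2007-08-21 at 15:43:37 UTC, which gives a concrete time to compare against server-side logs.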
Kai Germaschewski
2007-Aug-21  12:46 UTC
[Lustre-discuss] lustre 1.6.1: ldlm_cli_enqueue: -2 errors
On Tue, 21 Aug 2007, Kai Germaschewski wrote:
> One consistently recurring problem is
>
> LustreError: 11169:0:(mdc_locks.c:420:mdc_enqueue()) ldlm_cli_enqueue: -2
>
> on the client.

Here's another, potentially not even related, issue:

I mounted the filesystem (h101:/root) on another server node at /mount/root for testing. Accessing it worked fine, with no errors generated on either the client or the server(s). I unmounted it and mounted it read-only, then cd'd into /mount/root and ran "chroot .". This gave:

Lustre: Client root-client has started
Lustre: client ffff8103fba00c00 umount complete
Lustre: Client root-client has started
LustreError: 24772:0:(mdc_locks.c:409:mdc_enqueue()) ASSERTION(rc != -ENOENT) failed
LustreError: 24772:0:(tracefile.c:433:libcfs_assertion_failed()) LBUG
Lustre: 24772:0:(linux-debug.c:166:libcfs_debug_dumpstack()) showing stack for process 24772
bash R running task 0 24772 24717 (NOTLB)
 ffffffff88017379 ffffffff8042f3b7 ffff810003f07200 ffff8101fabb1528
 ffff810003f07200 ffffffff8042f641 0000000000000000 ffffffff807570e0
 0000000000007715 000000000000772a fffffffffffffe99 ffffffff80816fc0
Call Trace:
 [<ffffffff8042f641>] vt_console_print+0x21a/0x230
 [<ffffffff8042f641>] vt_console_print+0x21a/0x230
 [<ffffffff8021623a>] release_console_sem+0x1a7/0x1eb
 [<ffffffff8021623a>] release_console_sem+0x1a7/0x1eb
 [<ffffffff802921df>] kallsyms_lookup+0x49/0x91
 [<ffffffff802921df>] kallsyms_lookup+0x49/0x91
 [<ffffffff802627ef>] printk_address+0x96/0xa0
 [<ffffffff802904a5>] __module_text_address+0x64/0x73
 [<ffffffff80289451>] __kernel_text_address+0x1a/0x26
 [<ffffffff80289451>] __kernel_text_address+0x1a/0x26
 [<ffffffff802622bc>] dump_trace+0x247/0x274
 [<ffffffff8026231d>] show_trace+0x34/0x47
 [<ffffffff80262408>] _show_stack+0xd8/0xe5
 [<ffffffff8800eba2>] :libcfs:lbug_with_loc+0x79/0xa3
 [<ffffffff880157c9>] :libcfs:trace_refill_stock+0x0/0x63
 [<ffffffff8811d7eb>] :mdc:mdc_enqueue+0x9dc/0x1465
 [<ffffffff88264845>] :lustre:ll_mdc_blocking_ast+0x0/0x4af
 [<ffffffff88186628>] :ptlrpc:ldlm_completion_ast+0x0/0x721
 [<ffffffff88175a47>] :ptlrpc:ldlm_resource_putref+0x192/0x374
 [<ffffffff8811e5cb>] :mdc:mdc_intent_lock+0x357/0x969
 [<ffffffff88186628>] :ptlrpc:ldlm_completion_ast+0x0/0x721
 [<ffffffff88264845>] :lustre:ll_mdc_blocking_ast+0x0/0x4af
 [<ffffffff88183889>] :ptlrpc:ldlm_cancel_lru+0x80/0x32d
 [<ffffffff88172d2c>] :ptlrpc:__ldlm_handle2lock+0x2f5/0x354
 [<ffffffff88262168>] :lustre:ll_i2gids+0x5d/0xfe
 [<ffffffff8826227e>] :lustre:ll_prepare_mdc_op_data+0x75/0xfb
 [<ffffffff88264122>] :lustre:ll_lookup_it+0x347/0x831
 [<ffffffff88264845>] :lustre:ll_mdc_blocking_ast+0x0/0x4af
 [<ffffffff8822e802>] :lustre:ll_intent_drop_lock+0x85/0x93
 [<ffffffff8822ec4b>] :lustre:ll_revalidate_it+0x20c/0xabc
 [<ffffffff88264699>] :lustre:ll_lookup_nd+0x8d/0xfe
 [<ffffffff80220a50>] d_alloc+0x153/0x18f
 [<ffffffff802af88f>] real_lookup+0x7b/0x11e
 [<ffffffff8020cb20>] do_lookup+0x67/0xbf
 [<ffffffff80209afe>] __link_path_walk+0xa44/0xef1
 [<ffffffff80209f97>] __link_path_walk+0xedd/0xef1
 [<ffffffff8020e4e6>] link_path_walk+0x4a/0xbe
 [<ffffffff80225332>] do_filp_open+0x50/0x71
 [<ffffffff8020c98c>] do_path_lookup+0x1ab/0x1cd
 [<ffffffff80221775>] __path_lookup_intent_open+0x54/0x92
 [<ffffffff80219a7a>] open_namei+0x94/0x67e
 [<ffffffff802b02b9>] __user_walk_fd_it+0x48/0x53
 [<ffffffff80226643>] vfs_stat_fd+0x44/0x78
 [<ffffffff80225332>] do_filp_open+0x50/0x71
 [<ffffffff8822e918>] :lustre:ll_intent_release+0x0/0x127
 [<ffffffff802187d2>] do_sys_open+0x44/0xc8
 [<ffffffff802591ce>] system_call+0x7e/0x83
LustreError: dumping log to /tmp/lustre-log.1187721493.24772
LustreError: 24772:0:(linux-debug.c:91:libcfs_run_upcall()) Error -2 invoking LNET upcall /usr/lib/lustre/lnet_upcall LBUG,/home/kai/lustre-1.6.0.1-ql2-rc2/lnet/libcfs/tracefile.c,libcfs_assertion_failed,433; check /proc/sys/lnet/upcall

--Kai
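When sifting through many such console messages, splitting the common prefix into fields makes it easier to group them by source location. This is a small sketch based only on the lines shown in this thread. The first number appears to be the process ID (it matches "showing stack for process 24772"); the meaning of the second number is not stated here, so it is kept as an opaque field. The pattern and the name parse_lustre_line are hypothetical.

```python
import re

# Hypothetical pattern derived from the message layout seen in this thread:
#   <level>: <pid>:<n>:(<file>:<line>:<function>()) <message>
PREFIX_RE = re.compile(
    r"^(?P<level>LustreError|Lustre): "
    r"(?P<pid>\d+):(?P<extra>\d+):"
    r"\((?P<file>[\w.\-]+):(?P<line>\d+):(?P<func>\w+)\(\)\) "
    r"(?P<msg>.*)$"
)


def parse_lustre_line(text):
    """Split one console line into a dict of fields, or return None.

    Lines without the pid/location prefix (for example
    'Lustre: Client root-client has started') do not match.
    """
    match = PREFIX_RE.match(text)
    if match is None:
        return None
    fields = match.groupdict()
    fields["pid"] = int(fields["pid"])
    fields["line"] = int(fields["line"])
    return fields


if __name__ == "__main__":
    sample = (
        "LustreError: 24772:0:(mdc_locks.c:409:mdc_enqueue()) "
        "ASSERTION(rc != -ENOENT) failed"
    )
    print(parse_lustre_line(sample))
```

Grouping by file, line, and function shows quickly that both the recurring "-2" message and the assertion come from mdc_enqueue() in mdc_locks.c.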
Wojciech Turek
2007-Aug-23  14:40 UTC
[Lustre-discuss] lustre 1.6.1: ldlm_cli_enqueue: -2 errors
Hi,

On 21 Aug 2007, at 17:02, Martin Pokorny wrote:
> Kai Germaschewski wrote:
>> We've been playing with using Lustre as root fs for our x86_64
>> based cluster. We've run into quite some stability problems, with
>> arbitrary processes on the nodes disappearing, like sshd, gmond,
>> the Myrinet mapper or whatever.
>
> My cluster is seeing similar problems. I've got a heterogeneous
> cluster with both x86_64 and i386 nodes, and I'm not using Lustre
> as the root fs, but I've noticed similar problems as you've described.

We have a similar problem on our x86_64 cluster. We are using Lustre as a scratch file system for running jobs.

>> We're running 2.6.18-vanilla + lustre 1.6.1, the filesystem being
>> mounted read-only. MGS/MDS/OST are all on one server node. I've
>> trouble understanding most of the things that lustre is writing to
>> the logs, any pointers to additional docs would be appreciated.
>
> Here I'm running the 2.6.9-55.EL_lustre-1.6.1smp kernel. One MGS/
> MDS, a few OSS nodes, and a few clients. Mostly I'm using the node
> hosting the MGS/MDS as a Lustre client. Network is TCP.

We are running the same kernel, 2.6.9-55.EL_lustre-1.6.1smp: one MGS/MDS, one OSS, and hundreds of nodes.

>> One consistently recurring problem is
>> LustreError: 11169:0:(mdc_locks.c:420:mdc_enqueue())
>> ldlm_cli_enqueue: -2
>> on the client.
>
> I'm seeing exactly the same messages.

Exactly the same behavior. These messages appear in the logs when jobs start.

>> Last night, in addition clients seemed to be evicted regularly
>> (and reconnecting) even though they were up, which may be where
>> the random processes died. Currently, we're running with only one
>> client, which seems to be stable except for the error above
>> repeating itself.
>
> Occasionally I see messages similar to the following:
>
> LustreError: 3707:0:(client.c:962:ptlrpc_expire_one_request()) @@@
> timeout (sent at 1187711017, 50s ago) req@0000010038a01800
> x25961999/t0 o400->lustre-OST0002_UUID@10.64.95.251@tcp:28 lens
> 128/128 ref 1 fl Rpc:N/0/0 rc 0/-22
>
> which is concurrent with a long pause in fs access. As far as I can
> tell, recovery is then successful, and the jobs keep running. The
> main effect seems to be that file operations on the Lustre
> filesystem are greatly slowed.

I can see exactly the same behavior, but sometimes the clients don't recover.

> --
> Martin

Wojciech

Mr Wojciech Turek
Assistant System Manager
University of Cambridge
High Performance Computing service
email: wjt27@cam.ac.uk
tel. +441223763517