Kai Germaschewski
2007-Aug-21 09:14 UTC
[Lustre-discuss] lustre 1.6.1: ldlm_cli_enqueue: -2 errors
We've been playing with using Lustre as the root fs for our x86_64 based cluster. We've run into quite some stability problems, with arbitrary processes on the nodes disappearing, like sshd, gmond, the Myrinet mapper or whatever. We're running 2.6.18-vanilla + lustre 1.6.1, the filesystem being mounted read-only. MGS/MDS/OST are all on one server node. I have trouble understanding most of the things that lustre is writing to the logs; any pointers to additional docs would be appreciated.

One consistently recurring problem is

LustreError: 11169:0:(mdc_locks.c:420:mdc_enqueue()) ldlm_cli_enqueue: -2

on the client. Last night, in addition, clients seemed to be evicted regularly (and then reconnecting) even though they were up, which may be where the random processes died. Currently, we're running with only one client, which seems to be stable except for the error above repeating itself.

I'll be happy to provide any additional info needed.

--Kai
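[A sketch, not from the thread: since -2 is -ENOENT, this message usually means an intent lookup for a name that does not exist on the MDS. One way to see which pathname is involved, assuming the stock 1.6 /proc debug interface on the client (the flag names and the dump path here are assumptions, not anything reported above):

    echo +dlmtrace > /proc/sys/lnet/debug   # add DLM request tracing to the debug mask
    echo +vfstrace > /proc/sys/lnet/debug   # add VFS-level tracing, to see the triggering path
    # ... reproduce the "ldlm_cli_enqueue: -2" console message ...
    lctl dk /tmp/lustre-debug.txt           # dump and clear the kernel debug buffer

Grepping the dump for ENOENT around the timestamp of the console message should show which lookup failed; for plain negative lookups (e.g. shell PATH searches) this can be harmless noise.]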
Martin Pokorny
2007-Aug-21 10:02 UTC
[Lustre-discuss] lustre 1.6.1: ldlm_cli_enqueue: -2 errors
Kai Germaschewski wrote:
> We've been playing with using Lustre as the root fs for our x86_64 based
> cluster. We've run into quite some stability problems, with arbitrary
> processes on the nodes disappearing, like sshd, gmond, the Myrinet mapper
> or whatever.

My cluster is seeing similar problems. I've got a heterogeneous cluster with both x86_64 and i386 nodes, and I'm not using Lustre as the root fs, but I've noticed similar problems to those you've described.

> We're running 2.6.18-vanilla + lustre 1.6.1, the filesystem being mounted
> read-only. MGS/MDS/OST are all on one server node. I have trouble
> understanding most of the things that lustre is writing to the logs; any
> pointers to additional docs would be appreciated.

Here I'm running the 2.6.9-55.EL_lustre-1.6.1smp kernel. One MGS/MDS, a few OSS nodes, and a few clients. Mostly I'm using the node hosting the MGS/MDS as a Lustre client. Network is TCP.

> One consistently recurring problem is
>
> LustreError: 11169:0:(mdc_locks.c:420:mdc_enqueue()) ldlm_cli_enqueue: -2
>
> on the client.

I'm seeing exactly the same messages.

> Last night, in addition, clients seemed to be evicted regularly (and then
> reconnecting) even though they were up, which may be where the random
> processes died. Currently, we're running with only one client, which seems
> to be stable except for the error above repeating itself.

Occasionally I see messages similar to the following:

LustreError: 3707:0:(client.c:962:ptlrpc_expire_one_request()) @@@ timeout (sent at 1187711017, 50s ago) req@0000010038a01800 x25961999/t0 o400->lustre-OST0002_UUID@10.64.95.251@tcp:28 lens 128/128 ref 1 fl Rpc:N/0/0 rc 0/-22

which is concurrent with a long pause in fs access. As far as I can tell, recovery is then successful, and the jobs keep running. The main effect seems to be that file operations on the Lustre filesystem are greatly slowed.

--
Martin
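[A sketch, not from the thread: the o400 requests timing out above are pings, and on 1.6 RPC timeouts are derived from the global obd timeout. One might check it and, as an experiment, raise it on clients and servers alike; the path is the stock 1.6 /proc interface, and the value 300 is only an illustrative assumption:

    cat /proc/sys/lustre/timeout        # default is 100 (seconds) on 1.6
    echo 300 > /proc/sys/lustre/timeout

This only gives the server or network more slack rather than fixing whatever is stalling, and the setting is not persistent across reboots, so it is best treated as a diagnostic knob.]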
Kai Germaschewski
2007-Aug-21 12:46 UTC
[Lustre-discuss] lustre 1.6.1: ldlm_cli_enqueue: -2 errors
On Tue, 21 Aug 2007, Kai Germaschewski wrote:

> One consistently recurring problem is
>
> LustreError: 11169:0:(mdc_locks.c:420:mdc_enqueue()) ldlm_cli_enqueue: -2
>
> on the client.

Here's another, potentially not even related issue: I mounted the filesystem (h101:/root) onto another server node into /mount/root for testing. Accessing it worked fine, with no errors generated on either the client or the server(s). I unmounted it and mounted it read-only, then cd'd into /mount/root and ran "chroot .". This gave:

Lustre: Client root-client has started
Lustre: client ffff8103fba00c00 umount complete
Lustre: Client root-client has started
LustreError: 24772:0:(mdc_locks.c:409:mdc_enqueue()) ASSERTION(rc != -ENOENT) failed
LustreError: 24772:0:(tracefile.c:433:libcfs_assertion_failed()) LBUG
Lustre: 24772:0:(linux-debug.c:166:libcfs_debug_dumpstack()) showing stack for process 24772
bash          R  running task       0 24772  24717                     (NOTLB)
 ffffffff88017379 ffffffff8042f3b7 ffff810003f07200 ffff8101fabb1528
 ffff810003f07200 ffffffff8042f641 0000000000000000 ffffffff807570e0
 0000000000007715 000000000000772a fffffffffffffe99 ffffffff80816fc0
Call Trace:
 [<ffffffff8042f641>] vt_console_print+0x21a/0x230
 [<ffffffff8042f641>] vt_console_print+0x21a/0x230
 [<ffffffff8021623a>] release_console_sem+0x1a7/0x1eb
 [<ffffffff8021623a>] release_console_sem+0x1a7/0x1eb
 [<ffffffff802921df>] kallsyms_lookup+0x49/0x91
 [<ffffffff802921df>] kallsyms_lookup+0x49/0x91
 [<ffffffff802627ef>] printk_address+0x96/0xa0
 [<ffffffff802904a5>] __module_text_address+0x64/0x73
 [<ffffffff80289451>] __kernel_text_address+0x1a/0x26
 [<ffffffff80289451>] __kernel_text_address+0x1a/0x26
 [<ffffffff802622bc>] dump_trace+0x247/0x274
 [<ffffffff8026231d>] show_trace+0x34/0x47
 [<ffffffff80262408>] _show_stack+0xd8/0xe5
 [<ffffffff8800eba2>] :libcfs:lbug_with_loc+0x79/0xa3
 [<ffffffff880157c9>] :libcfs:trace_refill_stock+0x0/0x63
 [<ffffffff8811d7eb>] :mdc:mdc_enqueue+0x9dc/0x1465
 [<ffffffff88264845>] :lustre:ll_mdc_blocking_ast+0x0/0x4af
 [<ffffffff88186628>] :ptlrpc:ldlm_completion_ast+0x0/0x721
 [<ffffffff88175a47>] :ptlrpc:ldlm_resource_putref+0x192/0x374
 [<ffffffff8811e5cb>] :mdc:mdc_intent_lock+0x357/0x969
 [<ffffffff88186628>] :ptlrpc:ldlm_completion_ast+0x0/0x721
 [<ffffffff88264845>] :lustre:ll_mdc_blocking_ast+0x0/0x4af
 [<ffffffff88183889>] :ptlrpc:ldlm_cancel_lru+0x80/0x32d
 [<ffffffff88172d2c>] :ptlrpc:__ldlm_handle2lock+0x2f5/0x354
 [<ffffffff88262168>] :lustre:ll_i2gids+0x5d/0xfe
 [<ffffffff8826227e>] :lustre:ll_prepare_mdc_op_data+0x75/0xfb
 [<ffffffff88264122>] :lustre:ll_lookup_it+0x347/0x831
 [<ffffffff88264845>] :lustre:ll_mdc_blocking_ast+0x0/0x4af
 [<ffffffff8822e802>] :lustre:ll_intent_drop_lock+0x85/0x93
 [<ffffffff8822ec4b>] :lustre:ll_revalidate_it+0x20c/0xabc
 [<ffffffff88264699>] :lustre:ll_lookup_nd+0x8d/0xfe
 [<ffffffff80220a50>] d_alloc+0x153/0x18f
 [<ffffffff802af88f>] real_lookup+0x7b/0x11e
 [<ffffffff8020cb20>] do_lookup+0x67/0xbf
 [<ffffffff80209afe>] __link_path_walk+0xa44/0xef1
 [<ffffffff80209f97>] __link_path_walk+0xedd/0xef1
 [<ffffffff8020e4e6>] link_path_walk+0x4a/0xbe
 [<ffffffff80225332>] do_filp_open+0x50/0x71
 [<ffffffff8020c98c>] do_path_lookup+0x1ab/0x1cd
 [<ffffffff80221775>] __path_lookup_intent_open+0x54/0x92
 [<ffffffff80219a7a>] open_namei+0x94/0x67e
 [<ffffffff802b02b9>] __user_walk_fd_it+0x48/0x53
 [<ffffffff80226643>] vfs_stat_fd+0x44/0x78
 [<ffffffff80225332>] do_filp_open+0x50/0x71
 [<ffffffff8822e918>] :lustre:ll_intent_release+0x0/0x127
 [<ffffffff802187d2>] do_sys_open+0x44/0xc8
 [<ffffffff802591ce>] system_call+0x7e/0x83
LustreError: dumping log to /tmp/lustre-log.1187721493.24772
LustreError: 24772:0:(linux-debug.c:91:libcfs_run_upcall()) Error -2 invoking LNET upcall /usr/lib/lustre/lnet_upcall LBUG,/home/kai/lustre-1.6.0.1-ql2-rc2/lnet/libcfs/tracefile.c,libcfs_assertion_failed,433; check /proc/sys/lnet/upcall

--Kai
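[A sketch, not from the thread: the secondary "Error -2 invoking LNET upcall" just means the configured upcall script /usr/lib/lustre/lnet_upcall does not exist; the real problem is the ASSERTION(rc != -ENOENT) LBUG above it. To quiet the upcall error one can point the upcall at a script that exists, or disable it (the NONE value here is an assumption about the stock interface, not something confirmed in this thread):

    cat /proc/sys/lnet/upcall          # currently configured upcall
    echo NONE > /proc/sys/lnet/upcall  # assumed: disables the upcall entirely

The LBUG itself, mdc_enqueue() hitting -ENOENT on a read-only client, looks worth reporting with the dumped /tmp/lustre-log file attached.]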
Wojciech Turek
2007-Aug-23 14:40 UTC
[Lustre-discuss] lustre 1.6.1: ldlm_cli_enqueue: -2 errors
Hi,

On 21 Aug 2007, at 17:02, Martin Pokorny wrote:

> Kai Germaschewski wrote:
>> We've been playing with using Lustre as the root fs for our x86_64
>> based cluster. We've run into quite some stability problems, with
>> arbitrary processes on the nodes disappearing, like sshd, gmond,
>> the Myrinet mapper or whatever.
>
> My cluster is seeing similar problems. I've got a heterogeneous
> cluster with both x86_64 and i386 nodes, and I'm not using Lustre
> as the root fs, but I've noticed similar problems to those you've described.

We have a similar problem on our x86_64 cluster. We are using Lustre as the scratch file system for running jobs.

>> We're running 2.6.18-vanilla + lustre 1.6.1, the filesystem being
>> mounted read-only. MGS/MDS/OST are all on one server node. I have
>> trouble understanding most of the things that lustre is writing to
>> the logs; any pointers to additional docs would be appreciated.
>
> Here I'm running the 2.6.9-55.EL_lustre-1.6.1smp kernel. One MGS/
> MDS, a few OSS nodes, and a few clients. Mostly I'm using the node
> hosting the MGS/MDS as a Lustre client. Network is TCP.

We are running the same kernel, 2.6.9-55.EL_lustre-1.6.1smp: one MGS/MDS, one OSS, and hundreds of nodes.

>> One consistently recurring problem is
>>
>> LustreError: 11169:0:(mdc_locks.c:420:mdc_enqueue())
>> ldlm_cli_enqueue: -2
>>
>> on the client.
>
> I'm seeing exactly the same messages.

Exactly the same behavior. These messages appear in the logs when jobs start.

>> Last night, in addition, clients seemed to be evicted regularly
>> (and then reconnecting) even though they were up, which may be where
>> the random processes died. Currently, we're running with only one
>> client, which seems to be stable except for the error above
>> repeating itself.
>
> Occasionally I see messages similar to the following:
>
> LustreError: 3707:0:(client.c:962:ptlrpc_expire_one_request()) @@@
> timeout (sent at 1187711017, 50s ago) req@0000010038a01800
> x25961999/t0 o400->lustre-OST0002_UUID@10.64.95.251@tcp:28 lens
> 128/128 ref 1 fl Rpc:N/0/0 rc 0/-22
>
> which is concurrent with a long pause in fs access. As far as I can
> tell, recovery is then successful, and the jobs keep running. The
> main effect seems to be that file operations on the Lustre
> filesystem are greatly slowed.

I can see exactly the same behavior, but sometimes clients don't recover.

Wojciech

Mr Wojciech Turek
Assistant System Manager
University of Cambridge
High Performance Computing service
email: wjt27@cam.ac.uk
tel. +441223763517
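[A sketch, not from the thread: when clients appear to be evicted and sometimes never come back, it can help to compare what both ends think. Against the 1.6 /proc layout (the wildcarded target paths are assumptions about a default install):

    # on the MDS/OSS: eviction messages and per-target recovery state
    dmesg | grep -i evict
    cat /proc/fs/lustre/mds/*/recovery_status
    cat /proc/fs/lustre/obdfilter/*/recovery_status
    # on a client: are all configured devices still UP?
    lctl dl

If the server side shows it evicting a client that believes it is healthy, that usually points at network drops or an obd timeout that is too short for the load.]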