Dear Lustre,

We have some OSS's which are triggering this bug. I can't find anything like it in bugzilla. If this is a known bug, can you include what you searched bugzilla for to find it?

Thanks,
jeff

Lustre 1.6.4.2

OSS
---------------------------
Jul 8 14:24:54 oss10 kernel: LustreError: 4572:0:(ldlm_lockd.c:646:ldlm_server_completion_ast()) ### enqueue wait took 7744763506us from 1215533749 ns: filter-lustre0-OST0009_UUID lock: 00000101a2b59580/0x9275e8a2d17f9488 lrc: 2/0,0 mode: PW/PW res: 68100256/0 rrc: 74 type: EXT [0->33554431] (req 0->4095) flags: 20 remote: 0xf87d4d490599950 expref: 117 pid: 4757
Jul 8 14:24:54 oss10 kernel: LustreError: 4572:0:(ldlm_lockd.c:646:ldlm_server_completion_ast()) Skipped 64 previous similar messages
Jul 8 14:24:54 oss10 kernel: LustreError: 4208:0:(ldlm_lockd.c:646:ldlm_server_completion_ast()) ### enqueue wait took 7744804313us from 1215533749 ns: filter-lustre0-OST0009_UUID lock: 00000101b36ab180/0x9275e8a2d17f94e3 lrc: 2/0,0 mode: PW/PW res: 68100256/0 rrc: 80 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 80010020 remote: 0xdea38daf6f6ba077 expref: 100 pid: 4542
Jul 8 14:24:54 oss10 kernel: LustreError: 4208:0:(ldlm_lockd.c:646:ldlm_server_completion_ast()) Skipped 7 previous similar messages
Jul 8 14:25:44 oss10 kernel: LustreError: 0:0:(ldlm_lockd.c:210:waiting_locks_callback()) ### lock callback timer expired: evicting client eb58835d-0352-3bde-c8ba-57386954006d at NET_0x200000a01024f_UUID nid 10.1.2.79 at tcp ns: filter-lustre0-OST0009_UUID lock: 000001009a873ac0/0x9275e8a2d17f9568 lrc: 1/0,0 mode: PW/PW res: 68100256/0 rrc: 150 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 80000020 remote: 0x6913d224342a59cf expref: 93 pid: 4762
Jul 8 14:25:44 oss10 kernel: LustreError: 4221:0:(ldlm_lockd.c:646:ldlm_server_completion_ast()) ### enqueue wait took 7794802347us from 1215533749 ns: filter-lustre0-OST0009_UUID lock: 000001009c118c80/0x9275e8a2d17f957d lrc: 2/0,0 mode: PW/PW res: 68100256/0 rrc: 150 type: EXT [0->33554431] (req 0->4095) flags: 10020 remote: 0xa4583fd67fdc786c expref: 100 pid: 4268
Jul 8 14:25:44 oss10 kernel: LustreError: 4221:0:(ldlm_lockd.c:646:ldlm_server_completion_ast()) Skipped 9 previous similar messages
Jul 8 14:26:29 oss10 kernel: LustreError: 4814:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error (-107) req at 0000010037d59a00 x6135877/t0 o400-><?>@<?>:-1 lens 128/0 ref 0 fl Interpret:/0/0 rc -107/0

Client
-------------------------
Jul 8 18:34:30 c086 kernel: LustreError: 11-0: an error occurred while communicating with 10.1.1.40 at tcp. The ldlm_enqueue operation failed with -107
Jul 8 18:34:30 c086 kernel: Lustre: lustre0-OST0009-osc-00000100081c2800: Connection to service lustre0-OST0009 via nid 10.1.1.40 at tcp was lost; in progress operations using this service will wait for recovery to complete.
Jul 8 18:34:30 c086 kernel: LustreError: 167-0: This client was evicted by lustre0-OST0009; in progress operations using this service will fail.
Jul 8 18:34:30 c086 kernel: LustreError: 19481:0:(file.c:1052:ll_glimpse_size()) obd_enqueue returned rc -5, returning -EIO
Jul 8 18:34:30 c086 kernel: Lustre: lustre0-OST0009-osc-00000100081c2800: Connection restored to service lustre0-OST0009 using nid 10.1.1.40 at tcp.

--
Jeff Blasius / jeff.blasius at yale.edu
Phone: (203)432-9940
51 Prospect Rm. 011
High Performance Computing (HPC)
UNIX Systems Administrator, Linux Systems Design & Support (LSDS)
Yale University Information Technology Services (ITS)

On Jul 08, 2008 22:04 -0400, Jeff Blasius wrote:
> Jul 8 14:24:54 oss10 kernel: LustreError:
> 4572:0:(ldlm_lockd.c:646:ldlm_server_completion_ast()) ### enqueue
> wait took 7744763506us from 1215533749 ns: filter-lustre0-OST0009_UUID
> lock: 00000101a2b59580/0x9275e8a2d17f9488 lrc: 2/0,0 mode: PW/PW res:
> 68100256/0 rrc: 74 type: EXT [0->33554431] (req 0->4095) flags: 20
> remote: 0xf87d4d490599950 expref: 117 pid: 4757
> Jul 8 14:24:54 oss10 kernel: LustreError:
> 4572:0:(ldlm_lockd.c:646:ldlm_server_completion_ast()) Skipped 64
> previous similar messages

It looks like you have many processes writing to the start of the
same file. That causes unavoidable lock contention, and is most
likely a bug in your program (e.g. the binary is linked with gprof
and all of them are overwriting the same output file).

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

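The diagnosis above can be read off the quoted message itself: the lock mode is PW (protected write), the requested extent is 0->4095 (the first 4 KiB of the object), and the same resource 68100256/0 appears in every entry. The following is a minimal sketch of pulling those fields out of such a line. It assumes the field layout shown in the messages in this thread; the regular expression and the name `parse_enqueue_wait` are hypothetical and not part of any Lustre tool, and treating the number after "from" as Unix epoch seconds is likewise an assumption.

```python
import re
from datetime import datetime, timezone

# Hypothetical pattern; it assumes the field order seen in the log lines
# quoted in this thread and may not match other Lustre versions.
LDLM_RE = re.compile(
    r"enqueue wait took (?P<wait_us>\d+)us from (?P<start>\d+)"
    r".*?mode: (?P<mode>\w+)/\w+"
    r" res: (?P<res>\d+/\d+)"
    r".*?type: EXT \[(?P<ext_start>\d+)->(?P<ext_end>\d+)\]"
    r" \(req (?P<req_start>\d+)->(?P<req_end>\d+)\)"
)

# One of the OSS messages from this thread, joined onto a single line.
SAMPLE = (
    "Jul 8 14:24:54 oss10 kernel: LustreError: "
    "4572:0:(ldlm_lockd.c:646:ldlm_server_completion_ast()) ### enqueue "
    "wait took 7744763506us from 1215533749 ns: filter-lustre0-OST0009_UUID "
    "lock: 00000101a2b59580/0x9275e8a2d17f9488 lrc: 2/0,0 mode: PW/PW res: "
    "68100256/0 rrc: 74 type: EXT [0->33554431] (req 0->4095) flags: 20 "
    "remote: 0xf87d4d490599950 expref: 117 pid: 4757"
)


def parse_enqueue_wait(line):
    """Return the lock fields of an 'enqueue wait took' message, or None."""
    m = LDLM_RE.search(line)
    if m is None:
        return None
    return {
        "wait_seconds": int(m["wait_us"]) / 1e6,
        # Assumption: the value after "from" is seconds since the Unix epoch.
        "enqueued_at": datetime.fromtimestamp(int(m["start"]), tz=timezone.utc),
        "mode": m["mode"],
        "resource": m["res"],
        "extent": (int(m["ext_start"]), int(m["ext_end"])),
        "requested_extent": (int(m["req_start"]), int(m["req_end"])),
    }


if __name__ == "__main__":
    for key, value in parse_enqueue_wait(SAMPLE).items():
        print(f"{key}: {value}")
```

Grouping parsed lines by `resource` and `requested_extent` is one quick way to check whether many waiters are queued on the same byte range of one object, which is the pattern described in the reply above.
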
Thank you so much!

This user was sure this wasn't the case. Eventually we decided to restart the MDS. This triggered the D state python processes to return a trace indicating the problem. It turns out python started a process (popen) whose stderr was opened on a default file name in all 160 in-flight processes. This was an open, not an append, but it was enough contention to block access to the entire directory.

-jeff

On Wed, Jul 9, 2008 at 7:17 AM, Andreas Dilger <adilger at sun.com> wrote:
> On Jul 08, 2008 22:04 -0400, Jeff Blasius wrote:
>> Jul 8 14:24:54 oss10 kernel: LustreError:
>> 4572:0:(ldlm_lockd.c:646:ldlm_server_completion_ast()) ### enqueue
>> wait took 7744763506us from 1215533749 ns: filter-lustre0-OST0009_UUID
>> lock: 00000101a2b59580/0x9275e8a2d17f9488 lrc: 2/0,0 mode: PW/PW res:
>> 68100256/0 rrc: 74 type: EXT [0->33554431] (req 0->4095) flags: 20
>> remote: 0xf87d4d490599950 expref: 117 pid: 4757
>> Jul 8 14:24:54 oss10 kernel: LustreError:
>> 4572:0:(ldlm_lockd.c:646:ldlm_server_completion_ast()) Skipped 64
>> previous similar messages
>
> It looks like you have many processes writing to the start of the
> same file. That causes unavoidable lock contention, and is most
> likely a bug in your program (e.g. the binary is linked with gprof
> and all of them are overwriting the same output file).
>
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
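
For a job like the one described above, where every process launches a child whose stderr lands in the same default file, one possible workaround is to give each process its own stderr file. This is only a sketch under assumptions: the thread does not show the user's script, so the helper names `stderr_path` and `run_with_private_stderr`, the file naming scheme, and the demo command are all hypothetical.

```python
import os
import socket
import subprocess
import sys
import tempfile


def stderr_path(log_dir, prefix="stderr"):
    """Build a file name unique to this host and process (hypothetical scheme)."""
    name = f"{prefix}.{socket.gethostname()}.{os.getpid()}.log"
    return os.path.join(log_dir, name)


def run_with_private_stderr(cmd, log_dir):
    """Run cmd with stderr sent to a per-process file instead of a shared one."""
    path = stderr_path(log_dir)
    with open(path, "w") as err:
        # Each caller writes to its own file, so concurrent jobs do not
        # all open and overwrite a single object on the file system.
        rc = subprocess.call(cmd, stderr=err)
    return rc, path


if __name__ == "__main__":
    # Demo only: a child process that writes one line to stderr.
    demo_dir = tempfile.mkdtemp()
    child = [sys.executable, "-c", "import sys; sys.stderr.write('oops\\n')"]
    rc, path = run_with_private_stderr(child, demo_dir)
    print(rc, path)
```

Including the host name as well as the PID matters on a cluster, since PIDs are only unique per node. If the stderr output is not needed at all, passing `stderr=subprocess.DEVNULL` avoids touching the shared file system entirely.
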