Dear Lustre,
We have some OSSs that are triggering this bug. I can't find
anything like it in Bugzilla. If this is a known bug, can you include
what you searched Bugzilla for to find it?
Thanks,
jeff
Lustre 1.6.4.2
OSS
---------------------------
Jul 8 14:24:54 oss10 kernel: LustreError:
4572:0:(ldlm_lockd.c:646:ldlm_server_completion_ast()) ### enqueue
wait took 7744763506us from 1215533749 ns: filter-lustre0-OST0009_UUID
lock: 00000101a2b59580/0x9275e8a2d17f9488 lrc: 2/0,0 mode: PW/PW res:
68100256/0 rrc: 74 type: EXT [0->33554431] (req 0->4095) flags: 20
remote: 0xf87d4d490599950 expref: 117 pid: 4757
Jul 8 14:24:54 oss10 kernel: LustreError:
4572:0:(ldlm_lockd.c:646:ldlm_server_completion_ast()) Skipped 64
previous similar messages
Jul 8 14:24:54 oss10 kernel: LustreError:
4208:0:(ldlm_lockd.c:646:ldlm_server_completion_ast()) ### enqueue
wait took 7744804313us from 1215533749 ns: filter-lustre0-OST0009_UUID
lock: 00000101b36ab180/0x9275e8a2d17f94e3 lrc: 2/0,0 mode: PW/PW res:
68100256/0 rrc: 80 type: EXT [0->18446744073709551615] (req
0->18446744073709551615) flags: 80010020 remote: 0xdea38daf6f6ba077
expref: 100 pid: 4542
Jul 8 14:24:54 oss10 kernel: LustreError:
4208:0:(ldlm_lockd.c:646:ldlm_server_completion_ast()) Skipped 7
previous similar messages
Jul 8 14:25:44 oss10 kernel: LustreError:
0:0:(ldlm_lockd.c:210:waiting_locks_callback()) ### lock callback
timer expired: evicting client
eb58835d-0352-3bde-c8ba-57386954006d@NET_0x200000a01024f_UUID nid
10.1.2.79@tcp ns: filter-lustre0-OST0009_UUID lock:
000001009a873ac0/0x9275e8a2d17f9568 lrc: 1/0,0 mode: PW/PW res:
68100256/0 rrc: 150 type: EXT [0->18446744073709551615] (req
0->18446744073709551615) flags: 80000020 remote: 0x6913d224342a59cf
expref: 93 pid: 4762
Jul 8 14:25:44 oss10 kernel: LustreError:
4221:0:(ldlm_lockd.c:646:ldlm_server_completion_ast()) ### enqueue
wait took 7794802347us from 1215533749 ns: filter-lustre0-OST0009_UUID
lock: 000001009c118c80/0x9275e8a2d17f957d lrc: 2/0,0 mode: PW/PW res:
68100256/0 rrc: 150 type: EXT [0->33554431] (req 0->4095) flags: 10020
remote: 0xa4583fd67fdc786c expref: 100 pid: 4268
Jul 8 14:25:44 oss10 kernel: LustreError:
4221:0:(ldlm_lockd.c:646:ldlm_server_completion_ast()) Skipped 9
previous similar messages
Jul 8 14:26:29 oss10 kernel: LustreError:
4814:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error
(-107) req@0000010037d59a00 x6135877/t0 o400-><?>@<?>:-1 lens 128/0
ref 0 fl Interpret:/0/0 rc -107/0
Client
-------------------------
Jul 8 18:34:30 c086 kernel: LustreError: 11-0: an error occurred while
communicating with 10.1.1.40@tcp. The ldlm_enqueue operation failed with
-107
Jul 8 18:34:30 c086 kernel: Lustre: lustre0-OST0009-osc-00000100081c2800:
Connection to service lustre0-OST0009 via nid 10.1.1.40@tcp was lost; in
progress operations using this service will wait for recovery to complete.
Jul 8 18:34:30 c086 kernel: LustreError: 167-0: This client was evicted by
lustre0-OST0009; in progress operations using this service will fail.
Jul 8 18:34:30 c086 kernel: LustreError:
19481:0:(file.c:1052:ll_glimpse_size()) obd_enqueue returned rc -5,
returning -EIO
Jul 8 18:34:30 c086 kernel: Lustre: lustre0-OST0009-osc-00000100081c2800:
Connection restored to service lustre0-OST0009 using nid 10.1.1.40@tcp.
--
Jeff Blasius / jeff.blasius at yale.edu
Phone: (203)432-9940 51 Prospect Rm. 011
High Performance Computing (HPC)
UNIX Systems Administrator, Linux Systems Design & Support (LSDS)
Yale University Information Technology Services (ITS)
On Jul 08, 2008 22:04 -0400, Jeff Blasius wrote:
> Jul 8 14:24:54 oss10 kernel: LustreError:
> 4572:0:(ldlm_lockd.c:646:ldlm_server_completion_ast()) ### enqueue
> wait took 7744763506us from 1215533749 ns: filter-lustre0-OST0009_UUID
> lock: 00000101a2b59580/0x9275e8a2d17f9488 lrc: 2/0,0 mode: PW/PW res:
> 68100256/0 rrc: 74 type: EXT [0->33554431] (req 0->4095) flags: 20
> remote: 0xf87d4d490599950 expref: 117 pid: 4757
> Jul 8 14:24:54 oss10 kernel: LustreError:
> 4572:0:(ldlm_lockd.c:646:ldlm_server_completion_ast()) Skipped 64
> previous similar messages

It looks like you have many processes writing to the start of the
same file. That causes unavoidable lock contention, and is most
likely a bug in your program (e.g. the binary is linked with gprof
and all of them are overwriting the same output file).

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
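
One way to check the diagnosis above against a larger log is to count how many lock messages refer to the same resource and how many of the requested extents begin at offset 0. The following is only a rough sketch in Python 3: the regular expression is an assumption based on the message layout shown in this thread, not an official Lustre log format, and the names LOCK_RE and summarize_locks are made up for illustration.

```python
import re
from collections import Counter

# Assumed layout, taken from the messages quoted in this thread:
#   res: <id>/<sub> rrc: <n> type: EXT [<start>-><end>] (req <start>-><end>)
LOCK_RE = re.compile(
    r"res:\s*(?P<res>\d+/\d+)\s+rrc:\s*(?P<rrc>\d+)\s+type:\s*EXT\s*"
    r"\[(?P<gstart>\d+)->(?P<gend>\d+)\]\s*"
    r"\(req\s*(?P<rstart>\d+)->(?P<rend>\d+)\)"
)

def summarize_locks(log_text):
    """Count extent-lock messages per resource, and how many requests start at 0."""
    # Collapse whitespace so messages wrapped by syslog or mail still match.
    flat = " ".join(log_text.split())
    per_resource = Counter()
    starts_at_zero = Counter()
    for m in LOCK_RE.finditer(flat):
        per_resource[m["res"]] += 1
        if int(m["rstart"]) == 0:
            starts_at_zero[m["res"]] += 1
    return per_resource, starts_at_zero

# Two of the lock dumps from the OSS log earlier in this thread.
SAMPLE = """
lock: 00000101a2b59580/0x9275e8a2d17f9488 lrc: 2/0,0 mode: PW/PW res:
68100256/0 rrc: 74 type: EXT [0->33554431] (req 0->4095) flags: 20
lock: 00000101b36ab180/0x9275e8a2d17f94e3 lrc: 2/0,0 mode: PW/PW res:
68100256/0 rrc: 80 type: EXT [0->18446744073709551615] (req
0->18446744073709551615) flags: 80010020 remote: 0xdea38daf6f6ba077
"""

if __name__ == "__main__":
    totals, zero_starts = summarize_locks(SAMPLE)
    for res, count in totals.most_common():
        print(res, count, zero_starts[res])
```

If nearly every message names one resource and the requests start at 0, that is consistent with many writers hitting the beginning of a single object.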
Thank you so much!
This user was sure this wasn't the case. Eventually we decided to
restart the MDS. This caused the Python processes stuck in D state to
return a trace indicating the problem.
It turns out Python started a process (popen) whose standard error was
opened on a default file name by all 160 in-flight processes. This was
an open, not an append, but it was enough contention to block access to
the entire directory.
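
For anyone hitting the same thing, one possible workaround is to give each process its own stderr file instead of a shared default name. This is a minimal sketch in Python 3 using the standard subprocess module; it is not the user's actual code, and the helper names (unique_stderr_path, run_with_private_stderr) and the file naming scheme are assumptions made for illustration.

```python
import os
import socket
import subprocess

def unique_stderr_path(directory, prefix="stderr"):
    """Build a path that differs per host and per launching process."""
    name = f"{prefix}.{socket.gethostname()}.{os.getpid()}.log"
    return os.path.join(directory, name)

def run_with_private_stderr(cmd, directory):
    """Run cmd with its stderr sent to a per-process file, not a shared one."""
    path = unique_stderr_path(directory)
    with open(path, "ab") as err:
        proc = subprocess.Popen(cmd, stderr=err)
        returncode = proc.wait()
    return returncode, path
```

With host name and PID in the file name, 160 concurrent jobs open 160 different files, so they no longer contend for a lock on one object.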
-jeff