FWIW, we got our MGS/MDS and OSSs upgraded to 1.6.4.2 and they seem to be fine. The clients are still running 1.6.3. Unfortunately, the upgrade did not resolve our issue. One of our users has an MPI app where every thread opens the same input file (actually several in succession). Although we have run this job successfully before on up to 512 procs, it is not working now. Lustre seems to lock up when all the threads go after the same file (to open it), and we see things such as...

Feb 18 15:42:11 r3b-s16 kernel: LustreError: 11-0: an error occurred while communicating with 10.13.24.40@o2ib. The ldlm_enqueue operation failed with -107
Feb 18 15:42:11 r3b-s16 kernel: LustreError: Skipped 21 previous similar messages
Feb 18 15:52:51 r3b-s16 kernel: LustreError: 11-0: an error occurred while communicating with 10.13.24.40@o2ib. The ldlm_enqueue operation failed with -107
Feb 18 15:52:51 r3b-s16 kernel: LustreError: Skipped 19 previous similar messages

10.13.24.40@o2ib is our MDS. We have 512 ll_mdt threads (the max). The actual error in the code on some of the threads is that the file was not found (even though it was clearly there), and this only happens after about an 8-minute timeout.

Note that we have the file system mounted with the "-o flock" option. Is this part of the problem, or are we hitting yet another bug?

Thanks,

Charlie Taylor
UF HPC Center
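(For readers following the thread: the failing access pattern is, in outline, something like the sketch below. The file name and error-message format are illustrative assumptions; the actual application code is not shown anywhere in this thread.)

    /* Sketch only: N MPI ranks all open the same input file at once,
     * which sends a burst of open (ldlm_enqueue) requests at the MDS.
     * "input.dat" is a made-up name. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank;
        FILE *fp;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        fp = fopen("input.dat", "r");   /* all 512 ranks hit the MDS together */
        if (fp == NULL) {
            fprintf(stderr, "ERROR (proc. %05d) - cannot open file\n", rank);
            MPI_Abort(MPI_COMM_WORLD, 1); /* one failure takes down the whole job */
        }
        /* ... read input, compute ... */
        fclose(fp);
        MPI_Finalize();
        return 0;
    }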
Hello!

On Feb 18, 2008, at 4:29 PM, Charles Taylor wrote:
> Unfortunately, the upgrade did not resolve our issue. One of our
> users has an MPI app where every thread opens the same input file
> (actually several in succession). Although we have run this job
> successfully before on up to 512 procs, it is not working now.
> Lustre seems to be locking up when all the threads go after the same
> file (to open) and we see things such as ...

Can you upload the full log, from the start of the problematic job to the end, somewhere? Also, when the first watchdog timeouts hit, it would be nice if you could do sysrq-t on the MDS too to get traces of all threads (you need a big dmesg buffer for them to fit, or use a serial console). Does the job use flocks/fcntl locks at all? If not, then don't worry about mounting with -o flock.

Bye,
    Oleg
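(For reference: on a typical Linux server, the thread dump Oleg asks for can be triggered roughly as follows, assuming the magic SysRq interface is enabled in the kernel; the buffer size is just an example.)

    # enable sysrq if needed, then dump all thread stacks to the kernel log
    echo 1 > /proc/sys/kernel/sysrq
    echo t > /proc/sysrq-trigger
    dmesg > /tmp/mds-thread-traces.txt

    # the ring buffer must be large enough to hold every trace; it can be
    # grown at boot with a kernel parameter, e.g. log_buf_len=16M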
Well, the log on the MDS at the time of the failure looks like...

Feb 18 15:25:50 hpcmds kernel: LustreError: 7162:0:(mgs_handler.c:515:mgs_handle()) lustre_mgs: operation 101 on unconnected MGS
Feb 18 15:25:50 hpcmds kernel: LustreError: 7162:0:(mgs_handler.c:515:mgs_handle()) Skipped 263 previous similar messages
Feb 18 15:29:25 hpcmds kernel: LustreError: 6057:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error (-107) req@ffff81011acf7c50 x1602651/t0 o101-><?>@<?>:-1 lens 232/0 ref 0 fl Interpret:/0/0 rc -107/0
Feb 18 15:29:25 hpcmds kernel: LustreError: 6057:0:(ldlm_lib.c:1442:target_send_reply_msg()) Skipped 427 previous similar messages
Feb 18 15:31:28 hpcmds kernel: LustreError: 7150:0:(mds_open.c:1474:mds_close()) @@@ no handle for file close ino 43116025: cookie 0x1938027bf9d67349 req@ffff8100ae3bfc00 x10000789/t0 o35->beb7df79-6127-c0ca-9d36-2a96817a77a9@:-1 lens 296/1736 ref 0 fl Interpret:/0/0 rc 0/0
Feb 18 15:31:28 hpcmds kernel: LustreError: 7150:0:(mds_open.c:1474:mds_close()) Skipped 161 previous similar messages
Feb 18 15:33:17 hpcmds kernel: LustreError: 0:0:(ldlm_lockd.c:210:waiting_locks_callback()) ### lock callback timer expired: evicting client 2bdea9d4-43c3-a0b0-2822-c49ecfe6e044@NET_0x500000a0d1935_UUID nid 10.13.25.53@o2ib ns: mds-ufhpc-MDT0000_UUID lock: ffff810053d3f100/0x688cfbc7df2ef487 lrc: 1/0,0 mode: CR/CR res: 21878337/3424633214 bits 0x3 rrc: 582 type: IBT flags: 4000030 remote: 0x95c1d2685c2c76d9 expref: 21 pid 6090
Feb 18 15:33:17 hpcmds kernel: LustreError: 0:0:(ldlm_lockd.c:210:waiting_locks_callback()) Skipped 3 previous similar messages
Feb 18 15:33:17 hpcmds kernel: LustreError: 6265:0:(ldlm_lockd.c:962:ldlm_handle_enqueue()) ### lock on destroyed export ffff8101096ec000 ns: mds-ufhpc-MDT0000_UUID lock: ffff810225fe12c0/0x688cfbc7df2ef505 lrc: 2/0,0 mode: CR/CR res: 21878337/3424633214 bits 0x3 rrc: 579 type: IBT flags: 4000030 remote: 0x95c1d2685c2c76e0 expref: 6 pid 6265
Feb 18 15:33:17 hpcmds kernel: LustreError: 6265:0:(ldlm_lockd.c:962:ldlm_handle_enqueue()) Skipped 3 previous similar messages
Feb 18 15:33:17 hpcmds kernel: Lustre: 6061:0:(mds_reint.c:127:mds_finish_transno()) commit transaction for disconnected client 2bdea9d4-43c3-a0b0-2822-c49ecfe6e044: rc 0

We don't have any watchdog timeouts associated with the event, so I don't have any tracebacks from those. On one of the clients we have...

Feb 18 15:33:17 r1b-s23 kernel: LustreError: 11-0: an error occurred while communicating with 10.13.24.40@o2ib. The ldlm_enqueue operation failed with -107
Feb 18 15:33:17 r1b-s23 kernel: LustreError: Skipped 2 previous similar messages
Feb 18 15:33:17 r1b-s23 kernel: Lustre: ufhpc-MDT0000-mdc-ffff81012d370800: Connection to service ufhpc-MDT0000 via nid 10.13.24.40@o2ib was lost; in progress operations using this service will wait for recovery to complete.
Feb 18 15:33:17 r1b-s23 kernel: Lustre: Skipped 2 previous similar messages
Feb 18 15:33:17 r1b-s23 kernel: LustreError: 167-0: This client was evicted by ufhpc-MDT0000; in progress operations using this service will fail.
Feb 18 15:33:17 r1b-s23 kernel: LustreError: Skipped 2 previous similar messages
Feb 18 15:33:17 r1b-s23 kernel: LustreError: 12004:0:(mdc_locks.c:423:mdc_finish_enqueue()) ldlm_cli_enqueue: -5
Feb 18 15:33:17 r1b-s23 kernel: LustreError: 12004:0:(mdc_locks.c:423:mdc_finish_enqueue()) Skipped 3 previous similar messages
Feb 18 15:33:17 r1b-s23 kernel: Lustre: ufhpc-MDT0000-mdc-ffff81012d370800: Connection restored to service ufhpc-MDT0000 using nid 10.13.24.40@o2ib.
Feb 18 15:33:17 r1b-s23 kernel: Lustre: Skipped 2 previous similar messages

ct

On Feb 18, 2008, at 4:42 PM, Oleg Drokin wrote:
> Hello!
>
> On Feb 18, 2008, at 4:29 PM, Charles Taylor wrote:
>
>> Unfortunately, the upgrade did not resolve our issue. One of our
>> users has an MPI app where every thread opens the same input file
>> (actually several in succession). Although we have run this job
>> successfully before on up to 512 procs, it is not working now.
>> Lustre seems to be locking up when all the threads go after the same
>> file (to open) and we see things such as ...
>
> Can you upload the full log, from the start of the problematic job
> to the end, somewhere?
> Also, when the first watchdog timeouts hit, it would be nice if you
> could do sysrq-t on the MDS too to get traces of all threads (you
> need a big dmesg buffer for them to fit, or use a serial console).
> Does the job use flocks/fcntl locks at all? If not, then don't worry
> about mounting with -o flock.
>
> Bye,
>    Oleg
Hello!

On Feb 18, 2008, at 4:55 PM, Charles Taylor wrote:
> Feb 18 15:25:50 hpcmds kernel: LustreError: 7162:0:(mgs_handler.c:515:mgs_handle()) lustre_mgs: operation 101 on unconnected MGS
> [...]
> Feb 18 15:33:17 hpcmds kernel: Lustre: 6061:0:(mds_reint.c:127:mds_finish_transno()) commit transaction for disconnected client 2bdea9d4-43c3-a0b0-2822-c49ecfe6e044: rc 0

This looks like the middle of an eviction storm; by this point the MDS and MGS have already evicted tons of clients for unknown reasons (the reasons should be in the log before those messages).

Bye,
    Oleg
Well, yes. But the evictions are the result of the job trying to start. Absent that, there are no evictions. A bunch of threads trying to open the same file should not cause the clients to be evicted. That's an odd way of dealing with concurrency. :)

Charlie

On Feb 18, 2008, at 4:57 PM, Oleg Drokin wrote:
> Hello!
>
> On Feb 18, 2008, at 4:55 PM, Charles Taylor wrote:
>> Feb 18 15:25:50 hpcmds kernel: LustreError: 7162:0:(mgs_handler.c:515:mgs_handle()) lustre_mgs: operation 101 on unconnected MGS
>> [...]
>
> This looks like the middle of an eviction storm; by this point the
> MDS and MGS have already evicted tons of clients for unknown reasons
> (the reasons should be in the log before those messages).
>
> Bye,
>    Oleg
Hello!

On Feb 18, 2008, at 5:04 PM, Charles Taylor wrote:
> Well, yes. But the evictions are the result of the job trying to
> start. Absent that, there are no evictions. A bunch of threads
> trying to open the same file should not cause the clients to be
> evicted. That's an odd way of dealing with concurrency. :)

Right, but I need those messages about the evictions to see why the clients are being evicted.

Bye,
    Oleg
We also see these on some of the clients...

Feb 18 15:32:47 r5b-s42 kernel: LustreError: 11-0: an error occurred while communicating with 10.13.24.40@o2ib. The mds_close operation failed with -116
Feb 18 15:32:47 r5b-s42 kernel: LustreError: Skipped 3 previous similar messages
Feb 18 15:32:47 r5b-s42 kernel: LustreError: 7828:0:(file.c:97:ll_close_inode_openhandle()) inode 17243099 mdc close failed: rc = -116
Feb 18 15:32:47 r5b-s42 kernel: LustreError: 7828:0:(file.c:97:ll_close_inode_openhandle()) Skipped 1 previous similar message

I'm assuming some of the threads succeed in opening the file. When one fails, it calls mpi_abort(), at which point all the threads that successfully opened the file try to close it. Apparently they can't close the file at that point either. I'm guessing, of course, but it seems plausible.

ct

On Feb 18, 2008, at 4:57 PM, Oleg Drokin wrote:
> Hello!
>
> On Feb 18, 2008, at 4:55 PM, Charles Taylor wrote:
>> Feb 18 15:25:50 hpcmds kernel: LustreError: 7162:0:(mgs_handler.c:515:mgs_handle()) lustre_mgs: operation 101 on unconnected MGS
>> [...]
>
> This looks like the middle of an eviction storm; by this point the
> MDS and MGS have already evicted tons of clients for unknown reasons
> (the reasons should be in the log before those messages).
>
> Bye,
>    Oleg
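(For reference: the negative codes seen so far in this thread are ordinary Linux errno values returned by Lustre, and a throwaway decoder makes them readable.)

    /* Decode the return codes appearing in this thread. */
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        int codes[] = { 107, 116, 5 };  /* -107, -116, -5 in the logs */
        int i;

        for (i = 0; i < 3; i++)
            printf("-%d: %s\n", codes[i], strerror(codes[i]));
        return 0;
    }

    /* Typical output:
     *   -107: Transport endpoint is not connected  (ENOTCONN)
     *   -116: Stale NFS file handle                (ESTALE)
     *   -5:   Input/output error                   (EIO)
     */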
Hello!

On Feb 18, 2008, at 5:13 PM, Charles Taylor wrote:
> Feb 18 15:32:47 r5b-s42 kernel: LustreError: 11-0: an error occurred while communicating with 10.13.24.40@o2ib. The mds_close operation failed with -116
> Feb 18 15:32:47 r5b-s42 kernel: LustreError: Skipped 3 previous similar messages
> Feb 18 15:32:47 r5b-s42 kernel: LustreError: 7828:0:(file.c:97:ll_close_inode_openhandle()) inode 17243099 mdc close failed: rc = -116
> Feb 18 15:32:47 r5b-s42 kernel: LustreError: 7828:0:(file.c:97:ll_close_inode_openhandle()) Skipped 1 previous similar message

These mean the client was evicted (and later successfully reconnected) after opening the file successfully.

We need all the failure/eviction info since the job started to make any meaningful progress, because as of now I have no idea why the clients were evicted.

Bye,
    Oleg
Yes, I understand. Right now we are just trying to isolate our problems so that we don't provide information that is not related to the issue. Just to recap: we were running pretty well with our patched 1.6.3 implementation. However, we could not start a 512-way job in which each thread tries to open a single copy of the same file. Inevitably, one or more threads would get a "cannot open file" error and call mpi_abort(), even though the file is there and many other threads open it successfully. We thought we were hitting Lustre bug 13917, which is supposed to be fixed in 1.6.4.2, so we upgraded our MGS/MDS and OSSs to 1.6.4.2. We have *not* upgraded the clients (400+ of them) and were hoping to avoid that for the moment.

The upgrade seemed to go well and the file system is accessible on all the clients. However, our 512-way application still cannot run. We tried modifying the app so that each thread opens its own copy of the input file (i.e. file.in.<rank>, with the input file duplicated 512 times). This allowed the job to start, but it eventually failed anyway with another "cannot open file" error:

ERROR (proc. 00410) - cannot open file: ./skews_ms2p0.mixt.cva_00411_5.30000E-04

This seems to clearly indicate a problem with Lustre and/or our implementation.

On a perhaps separate note (perhaps not), since the upgrade yesterday we have been seeing the messages below every ten minutes. Perhaps we need to shut down and impose some sanity on all this, but in reality this is the only job that is having trouble (out of hundreds, sometimes thousands) and the file system seems to be operating just fine otherwise.

Any insight is appreciated at this point. We've put a lot of effort into Lustre and would like to stick with it, but right now it looks like it can't scale to a 512-way job.
Thanks for the help,

Charlie

Feb 19 07:07:09 hpcmds kernel: LustreError: 6057:0:(ldlm_lib.c:1442:target_send_reply_msg()) Skipped 202 previous similar messages
Feb 19 07:12:41 hpcmds kernel: LustreError: 6056:0:(mgs_handler.c:515:mgs_handle()) lustre_mgs: operation 101 on unconnected MGS
Feb 19 07:12:41 hpcmds kernel: LustreError: 6056:0:(mgs_handler.c:515:mgs_handle()) Skipped 201 previous similar messages
Feb 19 07:17:12 hpcmds kernel: LustreError: 7162:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error (-107) req@ffff810107255850 x36818597/t0 o101-><?>@<?>:-1 lens 232/0 ref 0 fl Interpret:/0/0 rc -107/0
Feb 19 07:17:12 hpcmds kernel: LustreError: 7162:0:(ldlm_lib.c:1442:target_send_reply_msg()) Skipped 207 previous similar messages
Feb 19 07:22:42 hpcmds kernel: LustreError: 6056:0:(mgs_handler.c:515:mgs_handle()) lustre_mgs: operation 101 on unconnected MGS
Feb 19 07:22:42 hpcmds kernel: LustreError: 6056:0:(mgs_handler.c:515:mgs_handle()) Skipped 209 previous similar messages
Feb 19 07:27:16 hpcmds kernel: LustreError: 6056:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error (-107) req@ffff81011e056c50 x679809/t0 o101-><?>@<?>:-1 lens 232/0 ref 0 fl Interpret:/0/0 rc -107/0
Feb 19 07:27:16 hpcmds kernel: LustreError: 6056:0:(ldlm_lib.c:1442:target_send_reply_msg()) Skipped 207 previous similar messages
Feb 19 07:32:50 hpcmds kernel: LustreError: 7162:0:(mgs_handler.c:515:mgs_handle()) lustre_mgs: operation 101 on unconnected MGS
Feb 19 07:32:50 hpcmds kernel: LustreError: 7162:0:(mgs_handler.c:515:mgs_handle()) Skipped 205 previous similar messages
Feb 19 07:37:16 hpcmds kernel: LustreError: 6057:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error (-107) req@ffff810108157450 x140057135/t0 o101-><?>@<?>:-1 lens 232/0 ref 0 fl Interpret:/0/0 rc -107/0
Feb 19 07:37:16 hpcmds kernel: LustreError: 6057:0:(ldlm_lib.c:1442:target_send_reply_msg()) Skipped 201 previous similar messages
Feb 19 07:42:52 hpcmds kernel: LustreError: 6057:0:(mgs_handler.c:515:mgs_handle()) lustre_mgs: operation 101 on unconnected MGS
Feb 19 07:42:52 hpcmds kernel: LustreError: 6057:0:(mgs_handler.c:515:mgs_handle()) Skipped 205 previous similar messages
Feb 19 07:47:17 hpcmds kernel: LustreError: 7162:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error (-107) req@ffff81010824c850 x5243687/t0 o101-><?>@<?>:-1 lens 232/0 ref 0 fl Interpret:/0/0 rc -107/0
Feb 19 07:47:17 hpcmds kernel: LustreError: 7162:0:(ldlm_lib.c:1442:target_send_reply_msg()) Skipped 207 previous similar messages
Feb 19 07:52:59 hpcmds kernel: LustreError: 6057:0:(mgs_handler.c:515:mgs_handle()) lustre_mgs: operation 101 on unconnected MGS
Feb 19 07:52:59 hpcmds kernel: LustreError: 6057:0:(mgs_handler.c:515:mgs_handle()) Skipped 209 previous similar messages
Feb 19 07:57:27 hpcmds kernel: LustreError: 6056:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error (-107) req@ffff81010869cc50 x4530492/t0 o101-><?>@<?>:-1 lens 232/0 ref 0 fl Interpret:/0/0 rc -107/0
Feb 19 07:57:27 hpcmds kernel: LustreError: 6056:0:(ldlm_lib.c:1442:target_send_reply_msg()) Skipped 207 previous similar messages
Feb 19 08:03:03 hpcmds kernel: LustreError: 6057:0:(mgs_handler.c:515:mgs_handle()) lustre_mgs: operation 101 on unconnected MGS
Feb 19 08:03:03 hpcmds kernel: LustreError: 6057:0:(mgs_handler.c:515:mgs_handle()) Skipped 203 previous similar messages
Feb 19 08:07:30 hpcmds kernel: LustreError: 6056:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error (-107) req@ffff810107257450 x6548994/t0 o101-><?>@<?>:-1 lens 232/0 ref 0 fl Interpret:/0/0 rc -107/0
Feb 19 08:07:30 hpcmds kernel: LustreError: 6056:0:(ldlm_lib.c:1442:target_send_reply_msg()) Skipped 205 previous similar messages
Feb 19 08:13:05 hpcmds kernel: LustreError: 7162:0:(mgs_handler.c:515:mgs_handle()) lustre_mgs: operation 101 on unconnected MGS
Feb 19 08:13:05 hpcmds kernel: LustreError: 7162:0:(mgs_handler.c:515:mgs_handle()) Skipped 207 previous similar messages
Feb 19 08:17:33 hpcmds kernel: LustreError: 6056:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error (-107) req@ffff81011e056c50 x680167/t0 o101-><?>@<?>:-1 lens 232/0 ref 0 fl Interpret:/0/0 rc -107/0
Feb 19 08:17:33 hpcmds kernel: LustreError: 6056:0:(ldlm_lib.c:1442:target_send_reply_msg()) Skipped 209 previous similar messages
Feb 19 08:23:07 hpcmds kernel: LustreError: 6057:0:(mgs_handler.c:515:mgs_handle()) lustre_mgs: operation 101 on unconnected MGS
Feb 19 08:23:07 hpcmds kernel: LustreError: 6057:0:(mgs_handler.c:515:mgs_handle()) Skipped 205 previous similar messages

On Feb 19, 2008, at 12:15 AM, Oleg Drokin wrote:
> Hello!
>
> On Feb 18, 2008, at 5:13 PM, Charles Taylor wrote:
>> Feb 18 15:32:47 r5b-s42 kernel: LustreError: 11-0: an error occurred while communicating with 10.13.24.40@o2ib. The mds_close operation failed with -116
>> [...]
>
> These mean the client was evicted (and later successfully
> reconnected) after opening the file successfully.
>
> We need all the failure/eviction info since the job started to make
> any meaningful progress, because as of now I have no idea why the
> clients were evicted.
>
> Bye,
>    Oleg
One more thing worth mentioning: we have no more callback or watchdog timer expired messages, so 1.6.4.2 seems to have fixed that. It just seems like if 512 threads try to open the same file at roughly the same time, we run out of some resource on the MDS or OSSs that keeps Lustre from satisfying the request.

Charlie

On Feb 19, 2008, at 8:45 AM, Charles Taylor wrote:
> Yes, I understand. Right now we are just trying to isolate our
> problems so that we don't provide information that is not related
> to the issue. [...]
>
> Any insight is appreciated at this point. We've put a lot of effort
> into Lustre and would like to stick with it, but right now it looks
> like it can't scale to a 512-way job.
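(If the suspicion is MDS service thread exhaustion: in the 1.6 series the MDS thread count can be set with a module option, roughly as below. The values are examples only, and since the poster reports 512 ll_mdt threads, the compile-time maximum, there may be no headroom to add here.)

    # /etc/modprobe.conf on the MDS -- example values, assuming the
    # mds_num_threads option of the 1.6-era mds module
    options mds mds_num_threads=512

    # the running count can be eyeballed with:
    ps ax | grep -c ll_mdt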
Ok, on the host that recorded the "cannot open file" error, we have this in the log at the time of the failure:

Feb 18 22:46:37 r2a-s33 kernel: LustreError: 21216:0:(file.c:1040:ll_glimpse_size()) obd_enqueue returned rc -5, returning -EIO
Feb 18 22:46:37 r2a-s33 kernel: LustreError: 21216:0:(file.c:1040:ll_glimpse_size()) Skipped 1 previous similar message

Is this a known problem? Is there some resource we need to increase?

Thanks,

Charlie Taylor
UF HPC Center

On Feb 19, 2008, at 8:45 AM, Charles Taylor wrote:
> Yes, I understand. Right now we are just trying to isolate our
> problems so that we don't provide information that is not related
> to the issue. [...]
>
> Any insight is appreciated at this point. We've put a lot of effort
> into Lustre and would like to stick with it, but right now it looks
> like it can't scale to a 512-way job.
On Mon, Feb 18, 2008 at 04:29:17PM -0500, Charles Taylor wrote:
> FWIW, we got our MGS/MDS and OSSs upgraded to 1.6.4.2 and they seem
> to be fine. The clients are still running 1.6.3.
> Unfortunately, the upgrade did not resolve our issue.

In another thread, I understood that you upgraded from 1.6.3 to 1.6.4.2 because you thought that you were hitting bug 13917. However, the ELC fix in bug 13917 must be installed on the client side.

Johann
Hello!

On Feb 19, 2008, at 8:45 AM, Charles Taylor wrote:
> On a perhaps separate note (perhaps not), since the upgrade
> yesterday we have been seeing the messages below every ten minutes.
> Perhaps we need to shut down and impose some sanity on all this,
> but in reality this is the only job that is having trouble (out of
> hundreds, sometimes thousands) and the file system seems to be
> operating just fine otherwise.

The messages you mention mean that your clients are disconnected from the MGS for some reason. The reason is likely to be mentioned on the MDS server during the eviction.

Bye,
    Oleg
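(A quick way to inspect connection state from a client, using tools that exist in 1.6: `lctl dl` lists each local Lustre device and whether it is UP, and `lfs check servers` actively pings the MDS and OSTs.)

    # on a client: list Lustre devices and their state
    lctl dl

    # actively check reachability of all servers (run as root)
    lfs check servers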
On Tue, 2008-02-19 at 08:53 -0500, Charles Taylor wrote:
> One more thing worth mentioning: we have no more callback or
> watchdog timer expired messages, so 1.6.4.2 seems to have fixed
> that. It just seems like if 512 threads try to open the same file
> at roughly the same time, we run out of some resource on the MDS
> or OSSs that keeps Lustre from satisfying the request.

As Johann said many messages ago in this thread, this looks like bug 13917, which requires that you upgrade the *clients* to 1.6.4.x. Since you are at 1.6.4.2 on the servers, you might as well be on the clients too. The bug that causes this problem is a client-side bug and was not fixed until 1.6.4.
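(To confirm which version each node is actually running before and after the client upgrade, the standard proc interface can be used; the pdsh fanout and host list below are assumptions for a cluster this size.)

    # on any node: report the running Lustre version
    cat /proc/fs/lustre/version

    # fan out across the 400+ clients (pdsh and the host pattern
    # are hypothetical examples)
    pdsh -w 'r[1-8]a-s[1-48]' 'head -1 /proc/fs/lustre/version'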