FWIW, we got our MGS/MDS and OSSs upgraded to 1.6.4.2 and they seem to be fine. The clients are still running 1.6.3. Unfortunately, the upgrade did not resolve our issue. One of our users has an MPI app where every thread opens the same input file (actually several in succession). Although we have run this job successfully before on up to 512 procs, it is not working now. Lustre seems to lock up when all the threads go after the same file (to open it), and we see things such as...

Feb 18 15:42:11 r3b-s16 kernel: LustreError: 11-0: an error occurred while communicating with 10.13.24.40@o2ib. The ldlm_enqueue operation failed with -107
Feb 18 15:42:11 r3b-s16 kernel: LustreError: Skipped 21 previous similar messages
Feb 18 15:52:51 r3b-s16 kernel: LustreError: 11-0: an error occurred while communicating with 10.13.24.40@o2ib. The ldlm_enqueue operation failed with -107
Feb 18 15:52:51 r3b-s16 kernel: LustreError: Skipped 19 previous similar messages

10.13.24.40@o2ib is our MDS. We have 512 ll_mdt threads (the max). The actual error in the code on some of the threads is that the file was not found (even though it was clearly there), and this only happens after about an 8-minute timeout.

Note that we have the file system mounted with the "-o flock" option. Is this part of the problem, or are we hitting yet another bug?

Thanks,

Charlie Taylor
UF HPC Center
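(For readers following the thread: the failing access pattern is, in outline, something like the sketch below. The file name and error-message format are illustrative assumptions; the actual application code is not shown anywhere in this thread.)

    /* Sketch only: N MPI ranks all open the same input file at once,
     * which sends a burst of open (ldlm_enqueue) requests at the MDS.
     * "input.dat" is a made-up name. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank;
        FILE *fp;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        fp = fopen("input.dat", "r");   /* all 512 ranks hit the MDS together */
        if (fp == NULL) {
            fprintf(stderr, "ERROR (proc. %05d) - cannot open file\n", rank);
            MPI_Abort(MPI_COMM_WORLD, 1); /* one failure takes down the whole job */
        }
        /* ... read input, compute ... */
        fclose(fp);
        MPI_Finalize();
        return 0;
    }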
Hello!

On Feb 18, 2008, at 4:29 PM, Charles Taylor wrote:
> Unfortunately, the upgrade did not resolve our issue. One of our
> users has an MPI app where every thread opens the same input file
> (actually several in succession). Although we have run this job
> successfully before on up to 512 procs, it is not working now.
> Lustre seems to be locking up when all the threads go after the same
> file (to open) and we see things such as ...

Can you upload the full log, from the start of the problematic job to the end, somewhere? Also, when the first watchdog timeouts hit, it would be nice if you could do sysrq-t on the MDS too to get traces of all threads (you need a big dmesg buffer for them to fit, or use a serial console). Does the job use flocks/fcntl locks at all? If not, then don't worry about mounting with -o flock.

Bye,
    Oleg
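(For reference: on a typical Linux server, the thread dump Oleg asks for can be triggered roughly as follows, assuming the magic SysRq interface is enabled in the kernel; the buffer size is just an example.)

    # enable sysrq if needed, then dump all thread stacks to the kernel log
    echo 1 > /proc/sys/kernel/sysrq
    echo t > /proc/sysrq-trigger
    dmesg > /tmp/mds-thread-traces.txt

    # the ring buffer must be large enough to hold every trace; it can be
    # grown at boot with a kernel parameter, e.g. log_buf_len=16M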
Well, the log on the MDS at the time of the failure looks like...

Feb 18 15:25:50 hpcmds kernel: LustreError: 7162:0:(mgs_handler.c:515:mgs_handle()) lustre_mgs: operation 101 on unconnected MGS
Feb 18 15:25:50 hpcmds kernel: LustreError: 7162:0:(mgs_handler.c:515:mgs_handle()) Skipped 263 previous similar messages
Feb 18 15:29:25 hpcmds kernel: LustreError: 6057:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error (-107) req@ffff81011acf7c50 x1602651/t0 o101-><?>@<?>:-1 lens 232/0 ref 0 fl Interpret:/0/0 rc -107/0
Feb 18 15:29:25 hpcmds kernel: LustreError: 6057:0:(ldlm_lib.c:1442:target_send_reply_msg()) Skipped 427 previous similar messages
Feb 18 15:31:28 hpcmds kernel: LustreError: 7150:0:(mds_open.c:1474:mds_close()) @@@ no handle for file close ino 43116025: cookie 0x1938027bf9d67349 req@ffff8100ae3bfc00 x10000789/t0 o35->beb7df79-6127-c0ca-9d36-2a96817a77a9@:-1 lens 296/1736 ref 0 fl Interpret:/0/0 rc 0/0
Feb 18 15:31:28 hpcmds kernel: LustreError: 7150:0:(mds_open.c:1474:mds_close()) Skipped 161 previous similar messages
Feb 18 15:33:17 hpcmds kernel: LustreError: 0:0:(ldlm_lockd.c:210:waiting_locks_callback()) ### lock callback timer expired: evicting client 2bdea9d4-43c3-a0b0-2822-c49ecfe6e044@NET_0x500000a0d1935_UUID nid 10.13.25.53@o2ib ns: mds-ufhpc-MDT0000_UUID lock: ffff810053d3f100/0x688cfbc7df2ef487 lrc: 1/0,0 mode: CR/CR res: 21878337/3424633214 bits 0x3 rrc: 582 type: IBT flags: 4000030 remote: 0x95c1d2685c2c76d9 expref: 21 pid 6090
Feb 18 15:33:17 hpcmds kernel: LustreError: 0:0:(ldlm_lockd.c:210:waiting_locks_callback()) Skipped 3 previous similar messages
Feb 18 15:33:17 hpcmds kernel: LustreError: 6265:0:(ldlm_lockd.c:962:ldlm_handle_enqueue()) ### lock on destroyed export ffff8101096ec000 ns: mds-ufhpc-MDT0000_UUID lock: ffff810225fe12c0/0x688cfbc7df2ef505 lrc: 2/0,0 mode: CR/CR res: 21878337/3424633214 bits 0x3 rrc: 579 type: IBT flags: 4000030 remote: 0x95c1d2685c2c76e0 expref: 6 pid 6265
Feb 18 15:33:17 hpcmds kernel: LustreError: 6265:0:(ldlm_lockd.c:962:ldlm_handle_enqueue()) Skipped 3 previous similar messages
Feb 18 15:33:17 hpcmds kernel: Lustre: 6061:0:(mds_reint.c:127:mds_finish_transno()) commit transaction for disconnected client 2bdea9d4-43c3-a0b0-2822-c49ecfe6e044: rc 0

We don't have any watchdog timeouts associated with the event, so I don't have any tracebacks from those. On one of the clients we have...

Feb 18 15:33:17 r1b-s23 kernel: LustreError: 11-0: an error occurred while communicating with 10.13.24.40@o2ib. The ldlm_enqueue operation failed with -107
Feb 18 15:33:17 r1b-s23 kernel: LustreError: Skipped 2 previous similar messages
Feb 18 15:33:17 r1b-s23 kernel: Lustre: ufhpc-MDT0000-mdc-ffff81012d370800: Connection to service ufhpc-MDT0000 via nid 10.13.24.40@o2ib was lost; in progress operations using this service will wait for recovery to complete.
Feb 18 15:33:17 r1b-s23 kernel: Lustre: Skipped 2 previous similar messages
Feb 18 15:33:17 r1b-s23 kernel: LustreError: 167-0: This client was evicted by ufhpc-MDT0000; in progress operations using this service will fail.
Feb 18 15:33:17 r1b-s23 kernel: LustreError: Skipped 2 previous similar messages
Feb 18 15:33:17 r1b-s23 kernel: LustreError: 12004:0:(mdc_locks.c:423:mdc_finish_enqueue()) ldlm_cli_enqueue: -5
Feb 18 15:33:17 r1b-s23 kernel: LustreError: 12004:0:(mdc_locks.c:423:mdc_finish_enqueue()) Skipped 3 previous similar messages
Feb 18 15:33:17 r1b-s23 kernel: Lustre: ufhpc-MDT0000-mdc-ffff81012d370800: Connection restored to service ufhpc-MDT0000 using nid 10.13.24.40@o2ib.
Feb 18 15:33:17 r1b-s23 kernel: Lustre: Skipped 2 previous similar messages

ct

On Feb 18, 2008, at 4:42 PM, Oleg Drokin wrote:
> Hello!
>
> On Feb 18, 2008, at 4:29 PM, Charles Taylor wrote:
>
>> Unfortunately, the upgrade did not resolve our issue. One of our
>> users has an MPI app where every thread opens the same input file
>> (actually several in succession). Although we have run this job
>> successfully before on up to 512 procs, it is not working now.
>> Lustre seems to be locking up when all the threads go after the same
>> file (to open) and we see things such as ...
>
> Can you upload the full log, from the start of the problematic job
> to the end, somewhere?
> Also, when the first watchdog timeouts hit, it would be nice if you
> could do sysrq-t on the MDS too to get traces of all threads (you
> need a big dmesg buffer for them to fit, or use a serial console).
> Does the job use flocks/fcntl locks at all? If not, then don't worry
> about mounting with -o flock.
>
> Bye,
>    Oleg
Hello!

On Feb 18, 2008, at 4:55 PM, Charles Taylor wrote:
> Feb 18 15:25:50 hpcmds kernel: LustreError: 7162:0:(mgs_handler.c:515:mgs_handle()) lustre_mgs: operation 101 on unconnected MGS
> [...]
> Feb 18 15:33:17 hpcmds kernel: Lustre: 6061:0:(mds_reint.c:127:mds_finish_transno()) commit transaction for disconnected client 2bdea9d4-43c3-a0b0-2822-c49ecfe6e044: rc 0

This looks like the middle of an eviction storm; by this point the MDS and MGS have already evicted tons of clients for unknown reasons (the reasons should be in the log before those messages).

Bye,
    Oleg
Well, yes. But the evictions are the result of the job trying to start. Absent that, there are no evictions. A bunch of threads trying to open the same file should not cause the clients to be evicted. That's an odd way of dealing with concurrency. :)

Charlie

On Feb 18, 2008, at 4:57 PM, Oleg Drokin wrote:
> Hello!
>
> On Feb 18, 2008, at 4:55 PM, Charles Taylor wrote:
>> Feb 18 15:25:50 hpcmds kernel: LustreError: 7162:0:(mgs_handler.c:515:mgs_handle()) lustre_mgs: operation 101 on unconnected MGS
>> [...]
>
> This looks like the middle of an eviction storm; by this point the
> MDS and MGS have already evicted tons of clients for unknown reasons
> (the reasons should be in the log before those messages).
>
> Bye,
>    Oleg
Hello!

On Feb 18, 2008, at 5:04 PM, Charles Taylor wrote:
> Well, yes. But the evictions are the result of the job trying to
> start. Absent that, there are no evictions. A bunch of threads
> trying to open the same file should not cause the clients to be
> evicted. That's an odd way of dealing with concurrency. :)

Right, but I need those messages about the evictions to see why the clients are being evicted.

Bye,
    Oleg
We also see these on some of the clients...

Feb 18 15:32:47 r5b-s42 kernel: LustreError: 11-0: an error occurred while communicating with 10.13.24.40@o2ib. The mds_close operation failed with -116
Feb 18 15:32:47 r5b-s42 kernel: LustreError: Skipped 3 previous similar messages
Feb 18 15:32:47 r5b-s42 kernel: LustreError: 7828:0:(file.c:97:ll_close_inode_openhandle()) inode 17243099 mdc close failed: rc = -116
Feb 18 15:32:47 r5b-s42 kernel: LustreError: 7828:0:(file.c:97:ll_close_inode_openhandle()) Skipped 1 previous similar message

I'm assuming some of the threads succeed in opening the file. When one fails, it calls mpi_abort(), at which point all the threads that successfully opened the file try to close it. Apparently they can't close the file at that point either. I'm guessing, of course, but it seems plausible.

ct

On Feb 18, 2008, at 4:57 PM, Oleg Drokin wrote:
> Hello!
>
> On Feb 18, 2008, at 4:55 PM, Charles Taylor wrote:
>> Feb 18 15:25:50 hpcmds kernel: LustreError: 7162:0:(mgs_handler.c:515:mgs_handle()) lustre_mgs: operation 101 on unconnected MGS
>> [...]
>
> This looks like the middle of an eviction storm; by this point the
> MDS and MGS have already evicted tons of clients for unknown reasons
> (the reasons should be in the log before those messages).
>
> Bye,
>    Oleg
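(For reference: the negative codes seen so far in this thread are ordinary Linux errno values returned by Lustre, and a throwaway decoder makes them readable.)

    /* Decode the return codes appearing in this thread. */
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        int codes[] = { 107, 116, 5 };  /* -107, -116, -5 in the logs */
        int i;

        for (i = 0; i < 3; i++)
            printf("-%d: %s\n", codes[i], strerror(codes[i]));
        return 0;
    }

    /* Typical output:
     *   -107: Transport endpoint is not connected  (ENOTCONN)
     *   -116: Stale NFS file handle                (ESTALE)
     *   -5:   Input/output error                   (EIO)
     */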
Hello!

On Feb 18, 2008, at 5:13 PM, Charles Taylor wrote:
> Feb 18 15:32:47 r5b-s42 kernel: LustreError: 11-0: an error occurred while communicating with 10.13.24.40@o2ib. The mds_close operation failed with -116
> Feb 18 15:32:47 r5b-s42 kernel: LustreError: Skipped 3 previous similar messages
> Feb 18 15:32:47 r5b-s42 kernel: LustreError: 7828:0:(file.c:97:ll_close_inode_openhandle()) inode 17243099 mdc close failed: rc = -116
> Feb 18 15:32:47 r5b-s42 kernel: LustreError: 7828:0:(file.c:97:ll_close_inode_openhandle()) Skipped 1 previous similar message

These mean the client was evicted (and later successfully reconnected) after opening the file successfully.

We need all the failure/eviction info since the job started to make any meaningful progress, because as of now I have no idea why the clients were evicted.

Bye,
    Oleg
Yes, I understand. Right now we are just trying to isolate our problems so that we don't provide information that is not related to the issue. Just to recap: we were running pretty well with our patched 1.6.3 implementation. However, we could not start a 512-way job in which each thread tries to open a single copy of the same file. Inevitably, one or more threads would get a "cannot open file" error and call mpi_abort(), even though the file is there and many other threads open it successfully. We thought we were hitting Lustre bug 13917, which is supposed to be fixed in 1.6.4.2, so we upgraded our MGS/MDS and OSSs to 1.6.4.2. We have *not* upgraded the clients (400+ of them) and were hoping to avoid that for the moment.

The upgrade seemed to go well and the file system is accessible on all the clients. However, our 512-way application still cannot run. We tried modifying the app so that each thread opens its own copy of the input file (i.e. file.in.<rank>, with the input file duplicated 512 times). This allowed the job to start, but it eventually failed anyway with another "cannot open file" error:

ERROR (proc. 00410) - cannot open file: ./skews_ms2p0.mixt.cva_00411_5.30000E-04

This seems to clearly indicate a problem with Lustre and/or our implementation.

On a perhaps separate note (perhaps not), since the upgrade yesterday we have been seeing the messages below every ten minutes. Perhaps we need to shut down and impose some sanity on all this, but in reality this is the only job that is having trouble (out of hundreds, sometimes thousands) and the file system seems to be operating just fine otherwise.

Any insight is appreciated at this point. We've put a lot of effort into Lustre and would like to stick with it, but right now it looks like it can't scale to a 512-way job.
Thanks for the help,

Charlie

Feb 19 07:07:09 hpcmds kernel: LustreError: 6057:0:(ldlm_lib.c:1442:target_send_reply_msg()) Skipped 202 previous similar messages
Feb 19 07:12:41 hpcmds kernel: LustreError: 6056:0:(mgs_handler.c:515:mgs_handle()) lustre_mgs: operation 101 on unconnected MGS
Feb 19 07:12:41 hpcmds kernel: LustreError: 6056:0:(mgs_handler.c:515:mgs_handle()) Skipped 201 previous similar messages
Feb 19 07:17:12 hpcmds kernel: LustreError: 7162:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error (-107) req@ffff810107255850 x36818597/t0 o101-><?>@<?>:-1 lens 232/0 ref 0 fl Interpret:/0/0 rc -107/0
Feb 19 07:17:12 hpcmds kernel: LustreError: 7162:0:(ldlm_lib.c:1442:target_send_reply_msg()) Skipped 207 previous similar messages
Feb 19 07:22:42 hpcmds kernel: LustreError: 6056:0:(mgs_handler.c:515:mgs_handle()) lustre_mgs: operation 101 on unconnected MGS
Feb 19 07:22:42 hpcmds kernel: LustreError: 6056:0:(mgs_handler.c:515:mgs_handle()) Skipped 209 previous similar messages
Feb 19 07:27:16 hpcmds kernel: LustreError: 6056:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error (-107) req@ffff81011e056c50 x679809/t0 o101-><?>@<?>:-1 lens 232/0 ref 0 fl Interpret:/0/0 rc -107/0
Feb 19 07:27:16 hpcmds kernel: LustreError: 6056:0:(ldlm_lib.c:1442:target_send_reply_msg()) Skipped 207 previous similar messages
Feb 19 07:32:50 hpcmds kernel: LustreError: 7162:0:(mgs_handler.c:515:mgs_handle()) lustre_mgs: operation 101 on unconnected MGS
Feb 19 07:32:50 hpcmds kernel: LustreError: 7162:0:(mgs_handler.c:515:mgs_handle()) Skipped 205 previous similar messages
Feb 19 07:37:16 hpcmds kernel: LustreError: 6057:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error (-107) req@ffff810108157450 x140057135/t0 o101-><?>@<?>:-1 lens 232/0 ref 0 fl Interpret:/0/0 rc -107/0
Feb 19 07:37:16 hpcmds kernel: LustreError: 6057:0:(ldlm_lib.c:1442:target_send_reply_msg()) Skipped 201 previous similar messages
Feb 19 07:42:52 hpcmds kernel: LustreError: 6057:0:(mgs_handler.c:515:mgs_handle()) lustre_mgs: operation 101 on unconnected MGS
Feb 19 07:42:52 hpcmds kernel: LustreError: 6057:0:(mgs_handler.c:515:mgs_handle()) Skipped 205 previous similar messages
Feb 19 07:47:17 hpcmds kernel: LustreError: 7162:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error (-107) req@ffff81010824c850 x5243687/t0 o101-><?>@<?>:-1 lens 232/0 ref 0 fl Interpret:/0/0 rc -107/0
Feb 19 07:47:17 hpcmds kernel: LustreError: 7162:0:(ldlm_lib.c:1442:target_send_reply_msg()) Skipped 207 previous similar messages
Feb 19 07:52:59 hpcmds kernel: LustreError: 6057:0:(mgs_handler.c:515:mgs_handle()) lustre_mgs: operation 101 on unconnected MGS
Feb 19 07:52:59 hpcmds kernel: LustreError: 6057:0:(mgs_handler.c:515:mgs_handle()) Skipped 209 previous similar messages
Feb 19 07:57:27 hpcmds kernel: LustreError: 6056:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error (-107) req@ffff81010869cc50 x4530492/t0 o101-><?>@<?>:-1 lens 232/0 ref 0 fl Interpret:/0/0 rc -107/0
Feb 19 07:57:27 hpcmds kernel: LustreError: 6056:0:(ldlm_lib.c:1442:target_send_reply_msg()) Skipped 207 previous similar messages
Feb 19 08:03:03 hpcmds kernel: LustreError: 6057:0:(mgs_handler.c:515:mgs_handle()) lustre_mgs: operation 101 on unconnected MGS
Feb 19 08:03:03 hpcmds kernel: LustreError: 6057:0:(mgs_handler.c:515:mgs_handle()) Skipped 203 previous similar messages
Feb 19 08:07:30 hpcmds kernel: LustreError: 6056:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error (-107) req@ffff810107257450 x6548994/t0 o101-><?>@<?>:-1 lens 232/0 ref 0 fl Interpret:/0/0 rc -107/0
Feb 19 08:07:30 hpcmds kernel: LustreError: 6056:0:(ldlm_lib.c:1442:target_send_reply_msg()) Skipped 205 previous similar messages
Feb 19 08:13:05 hpcmds kernel: LustreError: 7162:0:(mgs_handler.c:515:mgs_handle()) lustre_mgs: operation 101 on unconnected MGS
Feb 19 08:13:05 hpcmds kernel: LustreError: 7162:0:(mgs_handler.c:515:mgs_handle()) Skipped 207 previous similar messages
Feb 19 08:17:33 hpcmds kernel: LustreError: 6056:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error (-107) req@ffff81011e056c50 x680167/t0 o101-><?>@<?>:-1 lens 232/0 ref 0 fl Interpret:/0/0 rc -107/0
Feb 19 08:17:33 hpcmds kernel: LustreError: 6056:0:(ldlm_lib.c:1442:target_send_reply_msg()) Skipped 209 previous similar messages
Feb 19 08:23:07 hpcmds kernel: LustreError: 6057:0:(mgs_handler.c:515:mgs_handle()) lustre_mgs: operation 101 on unconnected MGS
Feb 19 08:23:07 hpcmds kernel: LustreError: 6057:0:(mgs_handler.c:515:mgs_handle()) Skipped 205 previous similar messages

On Feb 19, 2008, at 12:15 AM, Oleg Drokin wrote:
> Hello!
>
> On Feb 18, 2008, at 5:13 PM, Charles Taylor wrote:
>> Feb 18 15:32:47 r5b-s42 kernel: LustreError: 11-0: an error occurred while communicating with 10.13.24.40@o2ib. The mds_close operation failed with -116
>> [...]
>
> These mean the client was evicted (and later successfully
> reconnected) after opening the file successfully.
>
> We need all the failure/eviction info since the job started to make
> any meaningful progress, because as of now I have no idea why the
> clients were evicted.
>
> Bye,
>    Oleg
One more thing worth mentioning: we have no more callback or watchdog timer expired messages, so 1.6.4.2 seems to have fixed that. It just seems like if 512 threads try to open the same file at roughly the same time, we run out of some resource on the MDS or OSSs that keeps Lustre from satisfying the request.

Charlie

On Feb 19, 2008, at 8:45 AM, Charles Taylor wrote:
> Yes, I understand. Right now we are just trying to isolate our
> problems so that we don't provide information that is not related
> to the issue. [...]
>
> Any insight is appreciated at this point. We've put a lot of effort
> into Lustre and would like to stick with it, but right now it looks
> like it can't scale to a 512-way job.
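(If the suspicion is MDS service thread exhaustion: in the 1.6 series the MDS thread count can be set with a module option, roughly as below. The values are examples only, and since the poster reports 512 ll_mdt threads, the compile-time maximum, there may be no headroom to add here.)

    # /etc/modprobe.conf on the MDS -- example values, assuming the
    # mds_num_threads option of the 1.6-era mds module
    options mds mds_num_threads=512

    # the running count can be eyeballed with:
    ps ax | grep -c ll_mdt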
Ok, on the host that recorded the "cannot open file" error, we have this in the log at the time of the failure:

Feb 18 22:46:37 r2a-s33 kernel: LustreError: 21216:0:(file.c:1040:ll_glimpse_size()) obd_enqueue returned rc -5, returning -EIO
Feb 18 22:46:37 r2a-s33 kernel: LustreError: 21216:0:(file.c:1040:ll_glimpse_size()) Skipped 1 previous similar message

Is this a known problem? Is there some resource we need to increase?

Thanks,

Charlie Taylor
UF HPC Center

On Feb 19, 2008, at 8:45 AM, Charles Taylor wrote:
> Yes, I understand. Right now we are just trying to isolate our
> problems so that we don't provide information that is not related
> to the issue. [...]
>
> Any insight is appreciated at this point. We've put a lot of effort
> into Lustre and would like to stick with it, but right now it looks
> like it can't scale to a 512-way job.
On Mon, Feb 18, 2008 at 04:29:17PM -0500, Charles Taylor wrote:
> FWIW, we got our MGS/MDS and OSSs upgraded to 1.6.4.2 and they seem
> to be fine. The clients are still running 1.6.3.
> Unfortunately, the upgrade did not resolve our issue.

In another thread, I understood that you upgraded from 1.6.3 to 1.6.4.2 because you thought that you were hitting bug 13917. However, the ELC fix in bug 13917 must be installed on the client side.

Johann
Hello!

On Feb 19, 2008, at 8:45 AM, Charles Taylor wrote:
> On a perhaps separate note (perhaps not), since the upgrade
> yesterday we have been seeing the messages below every ten minutes.
> Perhaps we need to shut down and impose some sanity on all this,
> but in reality this is the only job that is having trouble (out of
> hundreds, sometimes thousands) and the file system seems to be
> operating just fine otherwise.

The messages you mention mean that your clients are disconnected from the MGS for some reason. The reason is likely to be mentioned on the MDS server during the eviction.

Bye,
    Oleg
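(A quick way to inspect connection state from a client, using tools that exist in 1.6: `lctl dl` lists each local Lustre device and whether it is UP, and `lfs check servers` actively pings the MDS and OSTs.)

    # on a client: list Lustre devices and their state
    lctl dl

    # actively check reachability of all servers (run as root)
    lfs check servers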
On Tue, 2008-02-19 at 08:53 -0500, Charles Taylor wrote:
> One more thing worth mentioning: we have no more callback or
> watchdog timer expired messages, so 1.6.4.2 seems to have fixed
> that. It just seems like if 512 threads try to open the same file
> at roughly the same time, we run out of some resource on the MDS
> or OSSs that keeps Lustre from satisfying the request.

As Johann said many messages ago in this thread, this looks like bug 13917, which requires that you upgrade the *clients* to 1.6.4.x. Since you are at 1.6.4.2 on the servers, you might as well be on the clients too. The bug that causes this problem is a client-side bug and was not fixed until 1.6.4.
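(To confirm which version each node is actually running before and after the client upgrade, the standard proc interface can be used; the pdsh fanout and host list below are assumptions for a cluster this size.)

    # on any node: report the running Lustre version
    cat /proc/fs/lustre/version

    # fan out across the 400+ clients (pdsh and the host pattern
    # are hypothetical examples)
    pdsh -w 'r[1-8]a-s[1-48]' 'head -1 /proc/fs/lustre/version'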