Jan H. Julian
2007-Mar-03 17:58 UTC
[Lustre-discuss] (lov_request.c:180:lov_update_enqueue_set()) error: enqueue objid
We are starting to investigate extremely slow performance on one of our test jobs using lustre.1.4.7.1 and have encountered the following error message in the job output:

Mar 3 09:45:52 mt006 kernel: LustreError: 2463:0:(lov_request.c:180:lov_update_enqueue_set()) error: enqueue objid 0x3920c71 subobj 0x101fe on OST idx 21: rc = -4
Mar 3 09:45:52 mt006 kernel: LustreError: 2463:0:(lov_request.c:180:lov_update_enqueue_set()) Skipped 2 previous similar messages

I think an earlier discussion mentioned that a message similar to this was actually a recovery message. Could someone please tell me what this means? No Lustre dumps related to this error have been found on the cluster.

--
Jan Julian  University of Alaska, ARSC  mailto:julian@arsc.edu
(907) 450-8641  910 Yukon Drive, Suite 001  http://www.arsc.edu
Fax: 450-8605  Fairbanks, AK 99775-6020  USA
Andreas Dilger
2007-Mar-03 19:02 UTC
[Lustre-discuss] (lov_request.c:180:lov_update_enqueue_set()) error: enqueue objid
On Mar 03, 2007 15:57 -0900, Jan H. Julian wrote:
> We are starting to investigate extremely slow performance on one of
> our test jobs using lustre.1.4.7.1 and have encountered the following
> error message in the job output:
>
> >Mar 3 09:45:52 mt006 kernel: LustreError:
> >2463:0:(lov_request.c:180:lov_update_enqueue_set()) error: enqueue
> >objid 0x3920c71 subobj 0x101fe on OST idx 21: rc = -4
> Mar 3 09:45:52 mt006 kernel: LustreError:
> 2463:0:(lov_request.c:180:lov_update_enqueue_set()) Skipped 2
> previous similar messages

This is -4 = -EINTR (from /usr/include/asm/errno.h), so it just means your job was killed with CTRL-C when it was stuck. The server was not responsive to the client and should be investigated.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
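The errno lookup described above can be reproduced without opening the header file. The following is a minimal sketch, assuming a Linux host where Python's `errno` module mirrors the kernel's numbering (EINTR = 4); the helper name `decode_rc` is hypothetical and not part of Lustre.

```python
import errno
import os


def decode_rc(rc):
    """Map a kernel-style return code (negative errno) to its symbolic name.

    Assumes the host's errno numbering matches the node that logged the
    message, which holds for Linux-to-Linux comparisons.
    """
    code = abs(rc)
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


if __name__ == "__main__":
    name, text = decode_rc(-4)
    print(f"rc = -4 -> {name}: {text}")
```

The same helper can be applied to any other `rc = -N` value that shows up in LustreError lines.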
Jan H. Julian
2007-Mar-12 16:45 UTC
[Lustre-discuss] (lov_request.c:180:lov_update_enqueue_set()) error: enqueue objid
Andreas,

Thanks for the reply. I think that you are referring to the lustre server and client rather than the mpi server (mother superior) and the clients in your response below. We are seeing many messages of this type which correspond to the job being ended in PBS and a message sent to the job std out "FATAL from PE 34: open_param_file: Input file INPUT/HIM_input does not exist." I think you are saying that the lustre server is unresponsive, but would like to confirm. We're not seeing many other messages which we can tie to the job exit.

Jan

At 10:02 AM +0800 3/4/07, Andreas Dilger wrote:
>On Mar 03, 2007 15:57 -0900, Jan H. Julian wrote:
>> We are starting to investigate extremely slow performance on one of
>> our test jobs using lustre.1.4.7.1 and have encountered the following
>> error message in the job output:
>>
>> >Mar 3 09:45:52 mt006 kernel: LustreError:
>> >2463:0:(lov_request.c:180:lov_update_enqueue_set()) error: enqueue
>> >objid 0x3920c71 subobj 0x101fe on OST idx 21: rc = -4
>> Mar 3 09:45:52 mt006 kernel: LustreError:
>> 2463:0:(lov_request.c:180:lov_update_enqueue_set()) Skipped 2
>> previous similar messages
>
>This is -4 = -EINTR (from /usr/include/asm/errno.h), so it just means
>your job was killed with CTRL-C when it was stuck. The server was
>not responsive to the client and should be investigated.
>
>Cheers, Andreas
>--
>Andreas Dilger
>Principal Software Engineer
>Cluster File Systems, Inc.

--
Jan Julian  University of Alaska, ARSC  mailto:julian@arsc.edu
(907) 450-8641  910 Yukon Drive, Suite 001  http://www.arsc.edu
Fax: 450-8605  Fairbanks, AK 99775-6020  USA
Andreas Dilger
2007-Mar-13 00:41 UTC
[Lustre-discuss] (lov_request.c:180:lov_update_enqueue_set()) error: enqueue objid
On Mar 12, 2007 15:44 -0800, Jan H. Julian wrote:
> Thanks for the reply. I think that you are referring to the lustre
> server and client rather than the mpi server (mother superior) and
> the clients in your response below.

Correct.

> We are seeing many messages of this type which correspond to the job
> being ended in PBS and a message sent to the job std out "FATAL from
> PE 34: open_param_file: Input file INPUT/HIM_input does not exist."
> I think you are saying that the lustre server is unresponsive, but
> would like to confirm. We're not seeing many other messages which we
> can tie to the job exit.

In particular, this Lustre client is reporting that while waiting for a lock enqueue to OST index #21 the process was killed. That would lead investigation to whatever node OST index #21 is on to verify its health.

> At 10:02 AM +0800 3/4/07, Andreas Dilger wrote:
> >On Mar 03, 2007 15:57 -0900, Jan H. Julian wrote:
> >> We are starting to investigate extremely slow performance on one of
> >> our test jobs using lustre.1.4.7.1 and have encountered the following
> >> error message in the job output:
> >>
> >> >Mar 3 09:45:52 mt006 kernel: LustreError:
> >> >2463:0:(lov_request.c:180:lov_update_enqueue_set()) error: enqueue
> >> >objid 0x3920c71 subobj 0x101fe on OST idx 21: rc = -4
> >> Mar 3 09:45:52 mt006 kernel: LustreError:
> >> 2463:0:(lov_request.c:180:lov_update_enqueue_set()) Skipped 2
> >> previous similar messages
> >
> >This is -4 = -EINTR (from /usr/include/asm/errno.h), so it just means
> >your job was killed with CTRL-C when it was stuck. The server was
> >not responsive to the client and should be investigated.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
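When many of these messages appear, it may help to tally which OST index each one names before deciding which server node to check. The sketch below is one possible way to pull the fields out of a syslog line. The regular expression is written against the single message format quoted in this thread and is an assumption about other Lustre versions; the names `LOV_ENQUEUE_RE` and `parse_enqueue_error` are hypothetical.

```python
import re

# Pattern derived from the message quoted in this thread; other Lustre
# releases may word the message differently (assumption).
LOV_ENQUEUE_RE = re.compile(
    r"enqueue\s+objid\s+(?P<objid>0x[0-9a-fA-F]+)\s+"
    r"subobj\s+(?P<subobj>0x[0-9a-fA-F]+)\s+"
    r"on\s+OST\s+idx\s+(?P<ost_idx>\d+):\s+rc\s+=\s+(?P<rc>-?\d+)"
)


def parse_enqueue_error(line):
    """Return the object ids, OST index, and rc from a log line, or None."""
    match = LOV_ENQUEUE_RE.search(line)
    if match is None:
        return None
    return {
        "objid": int(match.group("objid"), 16),
        "subobj": int(match.group("subobj"), 16),
        "ost_idx": int(match.group("ost_idx")),
        "rc": int(match.group("rc")),
    }


if __name__ == "__main__":
    sample = (
        "Mar 3 09:45:52 mt006 kernel: LustreError: "
        "2463:0:(lov_request.c:180:lov_update_enqueue_set()) error: enqueue "
        "objid 0x3920c71 subobj 0x101fe on OST idx 21: rc = -4"
    )
    print(parse_enqueue_error(sample))
```

Lines that do not match, such as the "Skipped 2 previous similar messages" notice, return None and can be ignored when counting errors per OST index.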