Hi,

I have this on the client. It died and then recovered, but in the meantime the job that was running on that client died as well. Could anybody give me any ideas where to look for a solution?
We are running lustre 1.6.1 on 2.6.9-55.EL_lustre-1.6.1smp

Find attached lctl dk log_file and /tmp/logfile from client node.

The following is /var/log/messages from the client:
################################
Aug 24 15:06:41 node-c64 kernel: LustreError: 14972:0:(lustre_dlm.h:660:lock_bitlock()) ASSERTION(lock->l_pidb == 0) failed
Aug 24 15:06:41 node-c64 kernel: LustreError: 14972:0:(tracefile.c:433:libcfs_assertion_failed()) LBUG
Aug 24 15:06:41 node-c64 kernel: Lustre: 14972:0:(linux-debug.c:168:libcfs_debug_dumpstack()) showing stack for process 14972
Aug 24 15:06:41 node-c64 kernel: zeus3d.exe    R  running task    0 14972 14964 14976 (NOTLB)
Aug 24 15:06:41 node-c64 kernel: ffffff3134393732 ffffffffa029438b ffffffffa0294370 0000000000003a7c
Aug 24 15:06:41 node-c64 kernel: 0000000000000000 ffffffffa0292b8a 000000000000067b ffffffffa0294383
Aug 24 15:06:41 node-c64 kernel: ffffffffa0294377 0000000000003a7c
Aug 24 15:06:41 node-c64 kernel: Call Trace:<ffffffff80232861>{complement_pos+12} <ffffffff801f832f>{vgacon_cursor+0}
Aug 24 15:06:41 node-c64 kernel: <ffffffff802350f1>{poke_blanked_console+67} <ffffffff8023748b>{vt_console_print+726}
Aug 24 15:06:41 node-c64 kernel: <ffffffff80137bf6>{release_console_sem+369} <ffffffff80137e24>{vprintk+498}
Aug 24 15:06:41 node-c64 kernel: <ffffffff80137e24>{vprintk+498} <ffffffff80137ece>{printk+141}
Aug 24 15:06:41 node-c64 kernel: <ffffffff80148c7f>{__kernel_text_address+26} <ffffffff801115c0>{show_trace+375}
Aug 24 15:06:41 node-c64 kernel: <ffffffff801116fc>{show_stack+241} <ffffffffa0288cb3>{:libcfs:lbug_with_loc+115}
Aug 24 15:06:41 node-c64 kernel: <ffffffffa028f894>{:libcfs:libcfs_assertion_failed+84}
Aug 24 15:06:41 node-c64 kernel: <ffffffffa038606f>{:ptlrpc:lock_res_and_lock+111} <ffffffffa039ded8>{:ptlrpc:ldlm_cli_cancel_local+184}
Aug 24 15:06:41 node-c64 kernel: <ffffffffa03a034e>{:ptlrpc:ldlm_cancel_resource_local+606}
Aug 24 15:06:41 node-c64 kernel: <ffffffff80132010>{try_to_wake_up+876} <ffffffffa0443f1b>{:mdc:mdc_resource_get_unused+443}
Aug 24 15:06:41 node-c64 kernel: <ffffffffa0446e02>{:mdc:mdc_enqueue+850} <ffffffffa03b0c2f>{:ptlrpc:__ptlrpc_free_req+1663}
Aug 24 15:06:41 node-c64 kernel: <ffffffffa03b0e29>{:ptlrpc:__ptlrpc_req_finished+393}
Aug 24 15:06:41 node-c64 kernel: <ffffffffa04483f1>{:mdc:mdc_intent_lock+1057} <ffffffff80132010>{try_to_wake_up+876}
Aug 24 15:06:41 node-c64 kernel: <ffffffffa04cfbf0>{:lustre:ll_mdc_blocking_ast+0} <ffffffffa039bf50>{:ptlrpc:ldlm_completion_ast+0}
Aug 24 15:06:41 node-c64 kernel: <ffffffffa04cfbf0>{:lustre:ll_mdc_blocking_ast+0} <ffffffffa039bf50>{:ptlrpc:ldlm_completion_ast+0}
Aug 24 15:06:41 node-c64 kernel: <ffffffffa04d02ad>{:lustre:ll_i2gids+93} <ffffffffa04d042b>{:lustre:ll_prepare_mdc_op_data+139}
Aug 24 15:06:41 node-c64 kernel: <ffffffffa04d0d31>{:lustre:ll_lookup_it+913} <ffffffffa04cfbf0>{:lustre:ll_mdc_blocking_ast+0}
Aug 24 15:06:41 node-c64 kernel: <ffffffff80185981>{link_path_walk+179} <ffffffffa04b2978>{:lustre:ll_inode_permission+184}
Aug 24 15:06:41 node-c64 kernel: <ffffffffa04d1345>{:lustre:ll_lookup_nd+149} <ffffffff8018ec90>{d_alloc+436}
Aug 24 15:06:41 node-c64 kernel: <ffffffff80185db0>{__lookup_hash+263} <ffffffff80186577>{open_namei+331}
Aug 24 15:06:41 node-c64 kernel: <ffffffff801770a5>{filp_open+95} <ffffffff8015ff7b>{cache_alloc_refill+390}
Aug 24 15:06:41 node-c64 kernel: <ffffffffa0498ba0>{:lustre:ll_intent_release+0} <ffffffff801772a5>{sys_open+57}
Aug 24 15:06:42 node-c64 kernel: <ffffffff8011022a>{system_call+126}
Aug 24 15:06:42 node-c64 kernel: LustreError: dumping log to /tmp/lustre-log.1187964401.14972
Aug 24 15:07:41 node-c64 kernel: LustreError: 11-0: an error ocurred while communicating with (no nid) The obd_ping operation failed with -107
Aug 24 15:07:41 node-c64 kernel: Lustre: home-md-MDT0000-mdc-0000010006aba400: Connection to service home-md-MDT0000 via nid 10.143.245.3@tcp was lost; in progress operations using this service will wait for recovery to complete.
Aug 24 15:07:41 node-c64 kernel: LustreError: 167-0: This client was evicted by home-md-MDT0000; in progress operations using this service will fail.
Aug 24 15:07:41 node-c64 kernel: Lustre: home-md-MDT0000-mdc-0000010006aba400: Connection restored to service home-md-MDT0000 using nid 10.143.245.3@tcp.

/var/log/messages from MDS/MGS:

Aug 24 15:07:31 storage03 kernel: LustreError: 0:0:(ldlm_lockd.c:214:waiting_locks_callback()) ### lock callback timer expired: evicting client 6c71ca6e-e5b7-0101-616b-fc98b14982f9@NET_0x200000a8f0340_UUID nid 10.143.3.64@tcp ns: mds-home-md-MDT0000_UUID lock: 0000010122200d40/0x605e057ec5c51085 lrc: 1/0,0 mode: CR/CR res: 32409917/1775389910 bits 0x3 rrc: 105 type: IBT flags: 4000030 remote: 0x6c74bf325b27703 expref: 9 pid 17954
Aug 24 15:07:31 storage03 kernel: LustreError: 0:0:(ldlm_lockd.c:214:waiting_locks_callback()) Skipped 8 previous similar messages
Aug 24 15:07:41 storage03 kernel: LustreError: 17980:0:(handler.c:1499:mds_handle()) operation 400 on unconnected MDS from 12345-10.143.3.64@tcp
Aug 24 15:07:41 storage03 kernel: LustreError: 17980:0:(handler.c:1499:mds_handle()) Skipped 3 previous similar messages
Aug 24 15:07:41 storage03 kernel: LustreError: 17980:0:(ldlm_lib.c:1395:target_send_reply_msg()) @@@ processing error (-107) req@000001011d0f4200 x9691/t0 o400-><?>@<?>:-1 lens 128/0 ref 0 fl Interpret:/0/0 rc -107/0
Aug 24 15:07:41 storage03 kernel: LustreError: 17980:0:(ldlm_lib.c:1395:target_send_reply_msg()) Skipped 67 previous similar messages
Aug 24 15:13:36 storage03 kernel: LustreError: 17124:0:(mds_open.c:1461:mds_close()) @@@ no handle for file close ino 32424818: cookie 0x605e057ec2607c51 req@0000010084fbb000 x10016/t0 o35->7c60a35b-0787-d4e1-b54b-946d8faea7c1@NET_0x200000a8f033d_UUID:-1 lens 296/448 ref 0 fl Interpret:/0/0 rc 0/0
Aug 24 15:13:36 storage03 kernel: LustreError: 17124:0:(mds_open.c:1461:mds_close()) Skipped 1 previous similar message
Aug 24 15:13:36 storage03 kernel: LustreError: 17124:0:(ldlm_lib.c:1395:target_send_reply_msg()) @@@ processing error (-116) req@0000010084fbb000 x10016/t0 o35->7c60a35b-0787-d4e1-b54b-946d8faea7c1@NET_0x200000a8f033d_UUID:-1 lens 296/448 ref 0 fl Interpret:/0/0 rc -116/0

Wojciech Turek

Mr Wojciech Turek
Assistant System Manager
University of Cambridge
High Performance Computing service
email: wjt27@cam.ac.uk
tel. +441223763517

-------------- next part --------------
Skipped content of type multipart/mixed
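For reference, a minimal sketch of how a client debug log like the attached one can be captured on a Lustre 1.6 client; the output file names below are only examples:

    # Dump the kernel debug buffer to a text file ("dk" is short for debug_kernel)
    lctl dk /tmp/lustre-debug.txt

    # Dumps written automatically at LBUG time (e.g. /tmp/lustre-log.1187964401.14972)
    # are binary; lctl debug_file converts them to readable text
    lctl debug_file /tmp/lustre-log.1187964401.14972 /tmp/lustre-log.1187964401.14972.txt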
Wojciech Turek
2007-Aug-24 11:02 UTC
Fwd: [Lustre-discuss] libcfs_assertion_failed()) LBUG
Hi,

I forgot to mention that there are user processes blocked on that client node. The Lustre file system is still mounted on the node and I can access it, except for the directory where the stuck binaries are located. If I try to open this directory my terminal session hangs, but I don't get anything in syslog. ps shows the following:

F S UID      PID  PPID  PGID   SID  C PRI NI ADDR    SZ WCHAN  STIME TTY      TIME CMD
0 D jcbp2  14968     1 14930 14930 81  85  0    - 45118 -      03:13 ?    11:52:25 /home/jcbp2/runs/256_400_128_y2pi_nodissip/zeus3d.exe
0 D jcbp2  14969     1 14930 14930 81  76  0    - 45118 -      03:13 ?    11:53:18 /home/jcbp2/runs/256_400_128_y2pi_nodissip/zeus3d.exe
0 D jcbp2  14972     1 14930 14930 81  76  0    - 45118 lbug_w 03:13 ?    11:53:20 /home/jcbp2/runs/256_400_128_y2pi_nodissip/zeus3d.exe
0 D jcbp2  14973     1 14930 14930 81  76  0    - 45118 -      03:13 ?    11:52:33 /home/jcbp2/runs/256_400_128_y2pi_nodissip/zeus3d.exe

In the WCHAN column of the third line you can see lbug_w. I guess this is some kind of special wait, and that is why I can't get rid of those processes. Is it possible to clear this lbug_w state and unblock these processes? We would like to recover this node to a completely healthy state without rebooting it.

Thanks for your help,

Wojciech

Begin forwarded message:

> From: Wojciech Turek <wjt27@cam.ac.uk>
> Date: 24 August 2007 17:45:32 BDT
> To: Lustre-discuss <lustre-discuss@clusterfs.com>
> Subject: [Lustre-discuss] libcfs_assertion_failed()) LBUG
>
> Hi,
>
> I have this on the client. It died and then recovered, but in the
> meantime the job that was running on that client died as well. Could
> anybody give me any ideas where to look for a solution?
> We are running lustre 1.6.1 on 2.6.9-55.EL_lustre-1.6.1smp
>
> [...]

Mr Wojciech Turek
Assistant System Manager
University of Cambridge
High Performance Computing service
email: wjt27@cam.ac.uk
tel. +441223763517

-------------- next part --------------
Skipped content of type multipart/mixed
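For reference, a minimal sketch of how the blocked threads and their wait channels can be listed with standard procps options; the lbug_w entry is the thread sleeping in the libcfs lbug_with_loc() path after the assertion failure:

    # Show uninterruptible (D-state) processes and the kernel function they sleep in
    ps -eo pid,ppid,stat,wchan:30,cmd | awk 'NR==1 || $3 ~ /D/'

    # Optionally dump all task stack traces to the kernel log (needs sysrq enabled)
    echo t > /proc/sysrq-trigger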
On Fri, Aug 24, 2007 at 05:45:32PM +0100, Wojciech Turek wrote:
>I have this on the client. It died and then recovered, but in the
>meantime the job that was running on that client died as well. Could
>anybody give me any ideas where to look for a solution?
>We are running lustre 1.6.1 on 2.6.9-55.EL_lustre-1.6.1smp

the patch in this bugzilla entry fixes it for me:
https://bugzilla.lustre.org/show_bug.cgi?id=13220

cheers,
robin

>Find attached lctl dk log_file and /tmp/logfile from client node
>
>following is /var/log/messages from client
>################################
>Aug 24 15:06:41 node-c64 kernel: LustreError: 14972:0:(lustre_dlm.h:660:lock_bitlock()) ASSERTION(lock->l_pidb == 0) failed
>Aug 24 15:06:41 node-c64 kernel: LustreError: 14972:0:(tracefile.c:433:libcfs_assertion_failed()) LBUG
>Aug 24 15:06:41 node-c64 kernel: Lustre: 14972:0:(linux-debug.c:168:libcfs_debug_dumpstack()) showing stack for process 14972
>Aug 24 15:06:41 node-c64 kernel: zeus3d.exe R running task