Daniel Basabe
2009-Aug-20 10:43 UTC
[Lustre-discuss] LBUG in lustre 1.8.1 when client mounts something with bind option
Hi,

Recently I upgraded Lustre to 1.8.1 from 1.6.6. Before that I never had problems with Lustre. My clients mount the Lustre filesystem under /clusterha, and at that point everything works OK. But when I try, for example, this:

# mount -o bind,rw /clusterha/home /home

it produces an LBUG on the MGS:

LustreError: 5164:0:(pack_generic.c:655:lustre_shrink_reply_v2()) ASSERTION(msg->lm_bufcount > segment) failed
LustreError: 5164:0:(pack_generic.c:655:lustre_shrink_reply_v2()) LBUG
Lustre: 5164:0:(linux-debug.c:264:libcfs_debug_dumpstack()) showing stack for process 5164
ll_mdt_18     R  running task       0  5164      1          5165  5163 (L-TLB)
 0000000000000000 ffffffff887f6d5a ffff8104173e0280 ffffffff887f646e
 ffffffff887f6462 0000000000000086 0000000000000002 ffffffff801616e5
 0000000000000001 0000000000000000 ffffffff802f6aa0 0000000000000000
Call Trace:
 [<ffffffff8009daf8>] autoremove_wake_function+0x9/0x2e
 [<ffffffff80088819>] __wake_up_common+0x3e/0x68
 [<ffffffff80088819>] __wake_up_common+0x3e/0x68
 [<ffffffff8002e6ba>] __wake_up+0x38/0x4f
 [<ffffffff800a540a>] kallsyms_lookup+0xc2/0x17b
 [<ffffffff800a540a>] kallsyms_lookup+0xc2/0x17b
 [<ffffffff800a540a>] kallsyms_lookup+0xc2/0x17b
 [<ffffffff800a540a>] kallsyms_lookup+0xc2/0x17b
 [<ffffffff8006bb5d>] printk_address+0x9f/0xab
 [<ffffffff8008f800>] printk+0x8/0xbd
 [<ffffffff8008f84a>] printk+0x52/0xbd
 [<ffffffff800a2e08>] module_text_address+0x33/0x3c
 [<ffffffff8009c088>] kernel_text_address+0x1a/0x26
 [<ffffffff8006b843>] dump_trace+0x211/0x23a
 [<ffffffff8006b8a0>] show_trace+0x34/0x47
 [<ffffffff8006b9a5>] _show_stack+0xdb/0xea
 [<ffffffff887ebada>] :libcfs:lbug_with_loc+0x7a/0xd0
 [<ffffffff887f3c70>] :libcfs:tracefile_init+0x0/0x110
 [<ffffffff8894c218>] :ptlrpc:lustre_shrink_reply_v2+0xa8/0x240
 [<ffffffff88c53529>] :mds:mds_getattr_lock+0xc59/0xce0
 [<ffffffff8894aea4>] :ptlrpc:lustre_msg_add_version+0x34/0x110
 [<ffffffff8883c923>] :lnet:lnet_ni_send+0x93/0xd0
 [<ffffffff8883ed23>] :lnet:lnet_send+0x973/0x9a0
 [<ffffffff88c4dfca>] :mds:fixup_handle_for_resent_req+0x5a/0x2c0
 [<ffffffff88c59a76>] :mds:mds_intent_policy+0x636/0xc10
 [<ffffffff8890d6f6>] :ptlrpc:ldlm_resource_putref+0x1b6/0x3a0
 [<ffffffff8890ad46>] :ptlrpc:ldlm_lock_enqueue+0x186/0xb30
 [<ffffffff88926acf>] :ptlrpc:ldlm_export_lock_get+0x6f/0xe0
 [<ffffffff88889e48>] :obdclass:lustre_hash_add+0x218/0x2e0
 [<ffffffff8892f530>] :ptlrpc:ldlm_server_blocking_ast+0x0/0x83d
 [<ffffffff8892d669>] :ptlrpc:ldlm_handle_enqueue+0xc19/0x1210
 [<ffffffff88c57630>] :mds:mds_handle+0x4080/0x4cb0
 [<ffffffff80148d4f>] __next_cpu+0x19/0x28
 [<ffffffff80148d4f>] __next_cpu+0x19/0x28
 [<ffffffff80088f32>] find_busiest_group+0x20d/0x621
 [<ffffffff8894fa15>] :ptlrpc:lustre_msg_get_conn_cnt+0x35/0xf0
 [<ffffffff80089d89>] enqueue_task+0x41/0x56
 [<ffffffff8895472d>] :ptlrpc:ptlrpc_check_req+0x1d/0x110
 [<ffffffff88956e67>] :ptlrpc:ptlrpc_server_handle_request+0xa97/0x1160
 [<ffffffff8003dc3f>] lock_timer_base+0x1b/0x3c
 [<ffffffff80088819>] __wake_up_common+0x3e/0x68
 [<ffffffff8895a908>] :ptlrpc:ptlrpc_main+0x1218/0x13e0
 [<ffffffff8008a3ef>] default_wake_function+0x0/0xe
 [<ffffffff800b48dd>] audit_syscall_exit+0x327/0x342
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff889596f0>] :ptlrpc:ptlrpc_main+0x0/0x13e0
 [<ffffffff8005dfa7>] child_rip+0x0/0x11
LustreError: dumping log to /tmp/lustre-log.1250760001.5164
Lustre: 0:0:(watchdog.c:181:lcw_cb()) Watchdog triggered for pid 5164: it was inactive for 200.00s
Lustre: 0:0:(linux-debug.c:264:libcfs_debug_dumpstack()) showing stack for process 5164
ll_mdt_18     D ffff81000102df80     0  5164      1          5165  5163 (L-TLB)
 ffff810411625810 0000000000000046 0000000000000000 0000000000000000
 ffff8104116257d0 0000000000000009 ffff810413360080 ffff81042fe9d100
 00008b3a197b862f 0000000000000ed5 ffff810413360268 000000050000028f
Call Trace:
 [<ffffffff8008a3ef>] default_wake_function+0x0/0xe
 [<ffffffff887ebb26>] :libcfs:lbug_with_loc+0xc6/0xd0
 [<ffffffff887f3c70>] :libcfs:tracefile_init+0x0/0x110
 [<ffffffff8894c218>] :ptlrpc:lustre_shrink_reply_v2+0xa8/0x240
 [<ffffffff88c53529>] :mds:mds_getattr_lock+0xc59/0xce0
 [<ffffffff8894aea4>] :ptlrpc:lustre_msg_add_version+0x34/0x110
 [<ffffffff8883c923>] :lnet:lnet_ni_send+0x93/0xd0
 [<ffffffff8883ed23>] :lnet:lnet_send+0x973/0x9a0
 [<ffffffff88c4dfca>] :mds:fixup_handle_for_resent_req+0x5a/0x2c0
 [<ffffffff88c59a76>] :mds:mds_intent_policy+0x636/0xc10
 [<ffffffff8890d6f6>] :ptlrpc:ldlm_resource_putref+0x1b6/0x3a0
 [<ffffffff8890ad46>] :ptlrpc:ldlm_lock_enqueue+0x186/0xb30
 [<ffffffff88926acf>] :ptlrpc:ldlm_export_lock_get+0x6f/0xe0
 [<ffffffff88889e48>] :obdclass:lustre_hash_add+0x218/0x2e0
 [<ffffffff8892f530>] :ptlrpc:ldlm_server_blocking_ast+0x0/0x83d
 [<ffffffff8892d669>] :ptlrpc:ldlm_handle_enqueue+0xc19/0x1210
 [<ffffffff88c57630>] :mds:mds_handle+0x4080/0x4cb0
 [<ffffffff80148d4f>] __next_cpu+0x19/0x28
 [<ffffffff80148d4f>] __next_cpu+0x19/0x28
 [<ffffffff80088f32>] find_busiest_group+0x20d/0x621
 [<ffffffff8894fa15>] :ptlrpc:lustre_msg_get_conn_cnt+0x35/0xf0
 [<ffffffff80089d89>] enqueue_task+0x41/0x56
 [<ffffffff8895472d>] :ptlrpc:ptlrpc_check_req+0x1d/0x110
 [<ffffffff88956e67>] :ptlrpc:ptlrpc_server_handle_request+0xa97/0x1160
 [<ffffffff8003dc3f>] lock_timer_base+0x1b/0x3c
 [<ffffffff80088819>] __wake_up_common+0x3e/0x68
 [<ffffffff8895a908>] :ptlrpc:ptlrpc_main+0x1218/0x13e0
 [<ffffffff8008a3ef>] default_wake_function+0x0/0xe
 [<ffffffff800b48dd>] audit_syscall_exit+0x327/0x342
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff889596f0>] :ptlrpc:ptlrpc_main+0x0/0x13e0
 [<ffffffff8005dfa7>] child_rip+0x0/0x11
LustreError: dumping log to /tmp/lustre-log.1250760201.5164
Lustre: 5162:0:(service.c:786:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/5), not sending early reply  req@ffff81040f602400 x1311420275108778/t0 o101->3f31386b-70e3-8c4f-6ecf-83adfc123156@NET_0x20000c0a80a03_UUID:0/0 lens 544/600 e 24 to 0 dl 1250760601 ref 2 fl Interpret:/0/0 rc 0/0

And the client hangs. With 1.6.6 this action worked fine.
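In case it helps to reproduce, the client-side sequence is essentially the following. Note this is only a sketch: the client mount line is not quoted above, so the device string here is built from the fsname "shared" and the MGS NIDs that appear in the mountdata dump further below, and may differ slightly from the exact command used.

# 1. Mount the Lustre filesystem on the client (sketch: device string assumed
#    from the MGS NIDs and fsname "shared" in the mountdata dump below).
mount -t lustre 10.0.0.200@o2ib0,192.168.10.200@tcp0:/shared /clusterha

# 2. Bind-mount a subdirectory of it onto /home; this is the step that
#    triggers the LBUG on the MDS/MGS with 1.8.1 (it worked with 1.6.6).
mount -o bind,rw /clusterha/home /home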
Another difference from the previous configuration is that in the new one I've created a link aggregation for the tcp0 device, both on the MGS side and on the client side (a few sanity checks for this setup are sketched in the P.S. below):

# cat /etc/modprobe.conf
alias eth0 bnx2
alias eth1 bnx2
alias scsi_hostadapter cciss
alias scsi_hostadapter1 ata_piix
alias scsi_hostadapter2 qla2xxx
alias bond0 bonding
options bond0 mode=4
alias ib0 ib_ipoib
alias ib1 ib_ipoib
options lnet accept=all networks=o2ib0(ib0),tcp0(bond0)
alias net-pf-27 ib_sdp

My current configuration has three OSTs connected to the MGS over InfiniBand (o2ib) and TCP Ethernet (bond0):

MGS:
Reading CONFIGS/mountdata
Read previous values:
Target:     shared-MDT0000
Index:      0
Lustre FS:  shared
Mount type: ldiskfs
Flags:      0x5 (MDT MGS )
Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr
Parameters: failover.node=10.0.0.200@o2ib0,192.168.10.200@tcp0 mdt.group_upcall=/usr/sbin/l_getgroups

OST 0:
Target:     shared-OST0000
Index:      0
Lustre FS:  shared
Mount type: ldiskfs
Flags:      0x2 (OST )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: failover.node=10.0.0.7@o2ib0,192.168.10.7@tcp0 mgsnode=10.0.0.201@o2ib,192.168.10.201@tcp mgsnode=10.0.0.200@o2ib,192.168.10.200@tcp

OST 1:
Target:     shared-OST0001
Index:      1
Lustre FS:  shared
Mount type: ldiskfs
Flags:      0x2 (OST )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: failover.node=10.0.0.6@o2ib0,192.168.10.6@tcp0 mgsnode=10.0.0.201@o2ib,192.168.10.201@tcp mgsnode=10.0.0.200@o2ib,192.168.10.200@tcp

OST 2:
Target:     shared-OST0002
Index:      2
Lustre FS:  shared
Mount type: ldiskfs
Flags:      0x2 (OST )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: failover.node=10.0.0.201@o2ib0,192.168.10.201@tcp0 mgsnode=10.0.0.201@o2ib,192.168.10.201@tcp mgsnode=10.0.0.200@o2ib,192.168.10.200@tcp

I attach the dump log. Does anyone know what is happening?

Thanks. Regards.

--
Daniel Basabe del Pino
------------------------
HPC Systems Administrator, BULL / Secretaría General Adjunta de Informática CSIC
Phone: 915642963 Ext: 272

-------------- next part --------------
A non-text attachment was scrubbed...
Name: lustre-log.1250760001.5164
Type: application/octet-stream
Size: 1052892 bytes
Desc: not available
Url: http://lists.lustre.org/pipermail/lustre-discuss/attachments/20090820/229b3364/attachment-0001.obj
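P.S. A couple of quick checks to confirm the bonded tcp0 + o2ib0 setup above is healthy. This is only a verification sketch; lctl list_nids and lctl ping are standard Lustre commands, and the NIDs used here are the MGS ones from the mountdata dump above.

# Check the bond is up in 802.3ad (mode=4) with both slaves attached.
cat /proc/net/bonding/bond0

# Check LNET sees both networks on this node (expect one o2ib0 and one tcp0 NID).
lctl list_nids

# From a client, check the MGS is reachable over each network.
lctl ping 10.0.0.200@o2ib0
lctl ping 192.168.10.200@tcp0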
Johann Lombardi
2009-Aug-20 11:25 UTC
[Lustre-discuss] LBUG in lustre 1.8.1 when client mounts something with bind option
On Aug 20, 2009, at 12:43 PM, Daniel Basabe wrote:

> LustreError: 5164:0:(pack_generic.c:655:lustre_shrink_reply_v2()) ASSERTION(msg->lm_bufcount > segment) failed
> LustreError: 5164:0:(pack_generic.c:655:lustre_shrink_reply_v2()) LBUG
> Lustre: 5164:0:(linux-debug.c:264:libcfs_debug_dumpstack()) showing stack for process 5164

This problem is under investigation in bugzilla ticket 20020.

Johann