Lundgren, Andrew
2009-Oct-06 17:14 UTC
[Lustre-discuss] 1.8.0 Loosing connection to the MDT for several minutes and then recovering.
We have a few 1.8.0 clusters running. We have seen multiple instances now where the clients lose connectivity to the MDT. The MDS logs indicate that there is some sort of problem on the MDT.

The following is a typical output:

Oct 6 02:56:08 mint1502 kernel: LustreError: 28999:0:(handler.c:161:mds_sendpage()) @@@ bulk failed: timeout 0(4096), evicting 7523f416-2579-5f49-cd3f-54d2438b84bb@NET_0x20000ce213b0b_UUID
Oct 6 02:56:08 mint1502 kernel: req@ffff8100ac9f4000 x1314647461000449/t0 o37->7523f416-2579-5f49-cd3f-54d2438b84bb@NET_0x20000ce213b0b_UUID:0/0 lens 408/360 e 1 to 0 dl 1254797793 ref 1 fl Interpret:/0/0 rc 0/0
Oct 6 02:56:16 mint1502 kernel: Lustre: Request x1312747398000879 sent from content-OST001d-osc to NID 207.123.49.16@tcp 7s ago has timed out (limit 7s).
Oct 6 02:56:16 mint1502 kernel: LustreError: 166-1: content-OST001d-osc: Connection to service content-OST001d via nid 207.123.49.16@tcp was lost; in progress operations using this service will fail.
Oct 6 02:56:16 mint1502 kernel: LustreError: Skipped 1 previous similar message
Oct 6 02:56:17 mint1502 kernel: LustreError: 166-1: content-OST001c-osc: Connection to service content-OST001c via nid 207.123.49.16@tcp was lost; in progress operations using this service will fail.
Oct 6 02:56:17 mint1502 kernel: LustreError: Skipped 1 previous similar message
Oct 6 02:56:18 mint1502 kernel: LustreError: 138-a: content-MDT0000: A client on nid 207.123.49.16@tcp was evicted due to a lock blocking callback to 207.123.49.16@tcp timed out: rc -107
Oct 6 02:56:18 mint1502 kernel: LustreError: 138-a: content-MDT0000: A client on nid 207.123.49.9@tcp was evicted due to a lock blocking callback to 207.123.49.9@tcp timed out: rc -107
Oct 6 02:56:18 mint1502 kernel: BUG: soft lockup - CPU#2 stuck for 10s! [ll_mdt_rdpg_04:28999]
Oct 6 02:56:18 mint1502 kernel: CPU 2:
Oct 6 02:56:18 mint1502 kernel: Modules linked in: mds(U) fsfilt_ldiskfs(U) mgs(U) mgc(U) ldiskfs(U) crc16(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) sunrpc(U) bonding(U) ip_conntrack_netbios_ns(U) ipt_REJECT(U) xt_tcpudp(U) xt_state(U) ip_conntrack(U) nfnetlink(U) iptable_filter(U) ip_tables(U) x_tables(U) dm_round_robin(U) dm_rdac(U) dm_multipath(U) video(U) sbs(U) backlight(U) i2c_ec(U) button(U) battery(U) asus_acpi(U) acpi_memhotplug(U) ac(U) parport_pc(U) lp(U) parport(U) joydev(U) i2c_i801(U) e1000e(U) sr_mod(U) i2c_core(U) i5000_edac(U) cdrom(U) pata_acpi(U) shpchp(U) edac_mc(U) serio_raw(U) sg(U) pcspkr(U) dm_snapshot(U) dm_zero(U) dm_mirror(U) dm_mod(U) usb_storage(U) lpfc(U) scsi_transport_fc(U) ahci(U) ata_piix(U) libata(U) mptsas(U) mptscsih(U) mptbase(U) scsi_transport_sas(U) sd_mod(U) scsi_mod(U) ext3(U) jbd(U) uhci_hcd(U) ohci_hcd(U) ehci_hcd(U)
Oct 6 02:56:18 mint1502 kernel: Pid: 28999, comm: ll_mdt_rdpg_04 Tainted: G 2.6.18-92.1.17.el5_lustre.1.8.0smp #1
Oct 6 02:56:18 mint1502 kernel: RIP: 0010:[<ffffffff8858f470>] [<ffffffff8858f470>] :obdclass:lustre_hash_for_each_empty+0x1f0/0x290
Oct 6 02:56:18 mint1502 kernel: RSP: 0018:ffff8104c02ad850 EFLAGS: 00000206
Oct 6 02:56:18 mint1502 kernel: RAX: ffff810448dfd200 RBX: 000000000000328c RCX: 000000000000ba75
Oct 6 02:56:18 mint1502 kernel: RDX: ffffffffffff5e7d RSI: ffffffff802f0d80 RDI: ffffc200109d78cc
Oct 6 02:56:18 mint1502 kernel: RBP: ffffffff8860c8f2 R08: ffff810001016e60 R09: 0000000000000000
Oct 6 02:56:18 mint1502 kernel: R10: ffff8100b7a28500 R11: 0000000000000150 R12: ffff810448dfd200
Oct 6 02:56:18 mint1502 kernel: R13: 0000000000000286 R14: 0000000000000286 R15: ffff8104c02ad7f0
Oct 6 02:56:18 mint1502 kernel: FS: 00002b49ff256220(0000) GS:ffff81051fc53d40(0000) knlGS:0000000000000000
Oct 6 02:56:18 mint1502 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Oct 6 02:56:18 mint1502 kernel: CR2: 0000003d82e985d0 CR3: 0000000000201000 CR4: 00000000000006e0
Oct 6 02:56:18 mint1502 kernel:
Oct 6 02:56:18 mint1502 kernel: Call Trace:
Oct 6 02:56:18 mint1502 kernel: [<ffffffff885962b8>] :obdclass:class_disconnect+0x378/0x400
Oct 6 02:56:18 mint1502 kernel: [<ffffffff8860e590>] :ptlrpc:ldlm_cancel_locks_for_export_cb+0x0/0xc0
Oct 6 02:56:18 mint1502 kernel: [<ffffffff8891b301>] :mds:mds_disconnect+0x121/0xe30
Oct 6 02:56:18 mint1502 kernel: [<ffffffff8003d514>] lock_timer_base+0x1b/0x3c
Oct 6 02:56:19 mint1502 kernel: [<ffffffff8004a824>] try_to_del_timer_sync+0x51/0x5a
Oct 6 02:56:19 mint1502 kernel: [<ffffffff885920e4>] :obdclass:class_fail_export+0x384/0x4c0
Oct 6 02:56:19 mint1502 kernel: [<ffffffff889150d1>] :mds:mds_readpage+0x1571/0x18a0
Oct 6 02:56:19 mint1502 kernel: [<ffffffff886538d5>] :ptlrpc:lustre_msg_set_limit+0x35/0xf0
Oct 6 02:56:19 mint1502 kernel: [<ffffffff8008abb9>] default_wake_function+0x0/0xe
Oct 6 02:56:19 mint1502 kernel: [<ffffffff88918470>] :mds:mds_handle+0x2140/0x4c80
Oct 6 02:56:19 mint1502 kernel: [<ffffffff80143809>] __next_cpu+0x19/0x28
Oct 6 02:56:19 mint1502 kernel: [<ffffffff886539c5>] :ptlrpc:lustre_msg_get_conn_cnt+0x35/0xf0
Oct 6 02:56:19 mint1502 kernel: [<ffffffff8865867d>] :ptlrpc:ptlrpc_check_req+0x1d/0x110
Oct 6 02:56:19 mint1502 kernel: [<ffffffff8865ac03>] :ptlrpc:ptlrpc_server_handle_request+0xa93/0x1160
Oct 6 02:56:19 mint1502 kernel: [<ffffffff80062f4b>] thread_return+0x0/0xdf
Oct 6 02:56:19 mint1502 kernel: [<ffffffff800891f6>] __wake_up_common+0x3e/0x68
Oct 6 02:56:19 mint1502 kernel: [<ffffffff8865e138>] :ptlrpc:ptlrpc_main+0x1218/0x13e0
Oct 6 02:56:19 mint1502 kernel: [<ffffffff8008abb9>] default_wake_function+0x0/0xe
Oct 6 02:56:19 mint1502 kernel: [<ffffffff800b4382>] audit_syscall_exit+0x31b/0x336
Oct 6 02:56:19 mint1502 kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
Oct 6 02:56:19 mint1502 kernel: [<ffffffff8865cf20>] :ptlrpc:ptlrpc_main+0x0/0x13e0
Oct 6 02:56:19 mint1502 kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11
Oct 6 02:56:19 mint1502 kernel:

Those go on for a while until it recovers:

Oct 6 03:01:22 mint1502 kernel: LustreError: Skipped 1 previous similar message
Oct 6 03:02:48 mint1502 kernel: Lustre: Request x1312747398018956 sent from content-OST000a-osc to NID 207.123.49.37@tcp 26s ago has timed out (limit 26s).
Oct 6 03:02:48 mint1502 kernel: Lustre: Skipped 14 previous similar messages
Oct 6 03:03:12 mint1502 kernel: Lustre: 4713:0:(import.c:508:import_select_connection()) content-OST000a-osc: tried all connections, increasing latency to 26s
Oct 6 03:03:12 mint1502 kernel: Lustre: 4713:0:(import.c:508:import_select_connection()) Skipped 9 previous similar messages
Oct 6 03:03:35 mint1502 kernel: Lustre: MGS: haven't heard from client e5534e18-4e9d-cbf4-7f7f-deda49425fdb (at 207.123.49.37@tcp) in 228 seconds. I think it's dead, and I am evicting it.
Oct 6 03:03:35 mint1502 kernel: Lustre: Skipped 3 previous similar messages
Oct 6 03:04:02 mint1502 kernel: Lustre: 4712:0:(import.c:837:ptlrpc_connect_interpret()) content-OST000a_UUID@207.123.49.37@tcp changed server handle from 0x25a7c6432e7fffec to 0x25a7c643369c7038
Oct 6 03:04:02 mint1502 kernel: but is still in recovery
Oct 6 03:04:02 mint1502 kernel: LustreError: 167-0: This client was evicted by content-OST000a; in progress operations using this service will fail.
Oct 6 03:04:02 mint1502 kernel: LustreError: Skipped 1 previous similar message
Oct 6 03:04:02 mint1502 kernel: Lustre: 4712:0:(import.c:837:ptlrpc_connect_interpret()) content-OST000b_UUID@207.123.49.37@tcp changed server handle from 0x25a7c6432e7ffff3 to 0x25a7c643369c703f
Oct 6 03:04:02 mint1502 kernel: but is still in recovery
Oct 6 03:04:02 mint1502 kernel: Lustre: 19512:0:(quota_master.c:1644:mds_quota_recovery()) Only 58/57 OSTs are active, abort quota recovery
Oct 6 03:04:02 mint1502 kernel: Lustre: 19512:0:(quota_master.c:1644:mds_quota_recovery()) Skipped 1 previous similar message
Oct 6 03:04:02 mint1502 kernel: Lustre: content-OST000a-osc: Connection restored to service content-OST000a using nid 207.123.49.37@tcp.
Oct 6 03:04:02 mint1502 kernel: Lustre: Skipped 1 previous similar message
Oct 6 03:04:02 mint1502 kernel: Lustre: MDS content-MDT0000: content-OST000a_UUID now active, resetting orphans
Oct 6 03:04:02 mint1502 kernel: Lustre: Skipped 1 previous similar message

Anyone seen this before?

--
Andrew
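The "limit 7s" and "limit 26s" figures in the messages above come from Lustre 1.8's adaptive timeouts, which scale each RPC deadline from recently observed service times. As a rough sketch of where those knobs live (assuming the stock 1.8 /proc interface; paths can differ between builds):

    # Base obd timeout (seconds) and the adaptive-timeout bounds:
    cat /proc/sys/lustre/timeout
    cat /proc/sys/lustre/at_min /proc/sys/lustre/at_max
    # Raising at_min (40 is an example value, not a recommendation) gives
    # slow RPCs more headroom before evictions, but it only papers over
    # whatever is stalling the MDT threads:
    echo 40 > /proc/sys/lustre/at_min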
Lundgren, Andrew
2009-Oct-06 17:20 UTC
[Lustre-discuss] 1.8.0 Loosing connection to the MDT for several minutes and then recovering.
Oh man, that should have read LOSING!

From: lustre-discuss-bounces@lists.lustre.org [mailto:lustre-discuss-bounces@lists.lustre.org] On Behalf Of Lundgren, Andrew
Sent: Tuesday, October 06, 2009 11:14 AM
To: lustre-discuss
Subject: [Lustre-discuss] 1.8.0 Loosing connection to the MDT for several minutes and then recovering.

[original message quoted in full; trimmed]
Đào Thị Thảo
2009-Oct-07 03:43 UTC
[Lustre-discuss] 1.8.0 Loosing connection to the MDT for several minutes and then recovering.
hi,
I have the same problem as Lundgren, and I don't understand why it happens even though my network is stable. The problem keeps recurring, and sometimes a client never gets its connection to a service (the OSTs or the MDS) restored; my provisional measure is rebooting the node. Andrew, can you explain in more detail and suggest how to fix it?

On Wed, Oct 7, 2009 at 12:20 AM, Lundgren, Andrew <Andrew.Lundgren@level3.com> wrote:
> Oh man, that should have read LOSING!
[rest of quoted message trimmed]
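When a client cannot get its connection restored, a full reboot is usually more than is needed. A minimal sketch for a 1.8-era client (the mount point /mnt/lustre and the mgsnode:/fsname device are placeholders):

    # Per-target import state: FULL means healthy, DISCONN means the
    # connection is still down:
    lctl get_param osc.*.ost_server_uuid
    lctl get_param mdc.*.mds_server_uuid
    # If a target stays DISCONN after the servers are back, forcing a
    # remount normally clears it without rebooting the node:
    umount -f /mnt/lustre
    mount -t lustre mgsnode@tcp0:/fsname /mnt/lustre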
Lundgren, Andrew
2009-Oct-07 05:13 UTC
[Lustre-discuss] 1.8.0 Losing connection to the MDT for several minutes and then recovering.
I haven't done anything yet. The machines seem to reconnect without intervention, but the problem occurs again later.

From: Đào Thị Thảo [mailto:thaodt@isds.vn]
Sent: Tuesday, October 06, 2009 9:43 PM
To: Lundgren, Andrew
Cc: lustre-discuss
Subject: Re: [Lustre-discuss] 1.8.0 Loosing connection to the MDT for several minutes and then recovering.

[quoted thread trimmed]
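Since the evictions recur, it helps to measure how often they happen and correlate the timestamps with load or network events. A small sketch, assuming the servers log to the default /var/log/messages:

    # Count MDS-side evictions and list when the soft lockups hit:
    grep -c 'evicted due to a lock blocking callback' /var/log/messages
    grep 'BUG: soft lockup' /var/log/messages | awk '{print $1, $2, $3}'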
Christopher J. Walker
2009-Oct-09 10:25 UTC
[Lustre-discuss] 1.8.0 Losing connection to the MDT for several minutes and then recovering.
Lundgren, Andrew wrote:
> I haven't done anything yet. The machines seem to reconnect without
> intervention. But the problem occurs again later.
[rest of quoted thread trimmed]
I've just seen it with a 1.8.1 server (and 1.6 clients):

Oct 4 18:21:30 sn01 kernel: CPU 5:
Oct 4 18:21:30 sn01 kernel: Modules linked in: mds(U) fsfilt_ldiskfs(U) mgs(U) mgc(U) ldiskfs(U) crc16(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) autofs4(U) hidp(U) rfcomm(U) l2cap(U) bluetooth(U) sunrpc(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) dm_mirror(U) dm_log(U) dm_multipath(U) scsi_dh(U) dm_mod(U) video(U) hwmon(U) backlight(U) sbs(U) i2c_ec(U) i2c_core(U) button(U) battery(U) asus_acpi(U) acpi_memhotplug(U) ac(U) parport_pc(U) lp(U) parport(U) sg(U) shpchp(U) netxen_nic(U) ide_cd(U) serio_raw(U) i5000_edac(U) pcspkr(U) edac_mc(U) bnx2(U) cdrom(U) ata_piix(U) libata(U) mptsas(U) mptscsih(U) scsi_transport_sas(U) mptbase(U) megaraid_sas(U) sd_mod(U) scsi_mod(U) ext3(U) jbd(U) ehci_hcd(U) ohci_hcd(U) uhci_hcd(U)
Oct 4 18:21:30 sn01 kernel: Pid: 831, comm: ll_mdt_39 Tainted: G 2.6.18-128.1.14.el5_lustre.1.8.1 #1
Oct 4 18:21:30 sn01 kernel: RIP: 0010:[<ffffffff88504ba0>] [<ffffffff88504ba0>] :obdclass:lustre_hash_for_each_empty+0x220/0x2b0
Oct 4 18:21:30 sn01 kernel: RSP: 0018:ffff810123af7a30 EFLAGS: 00000202
Oct 4 18:21:30 sn01 kernel: RAX: 000000000000ffff RBX: 0000000000002662 RCX: 00000000000083dc
Oct 4 18:21:30 sn01 kernel: RDX: 00000000000002dc RSI: ffffffff802f9e80 RDI: ffffc2001225361c
Oct 4 18:21:30 sn01 kernel: RBP: ffffffff885838e2 R08: ffff81000102fee0 R09: 0000000000000000
Oct 4 18:21:30 sn01 kernel: R10: ffff8100aa9d5cc0 R11: 0000000000000150 R12: ffff8100acf36800
Oct 4 18:21:30 sn01 kernel: R13: 0000000000000286 R14: 0000000000000286 R15: ffff810123af79d0
Oct 4 18:21:30 sn01 kernel: FS: 00002ab6a6887230(0000) GS:ffff810106751740(0000) knlGS:0000000000000000
Oct 4 18:21:30 sn01 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Oct 4 18:21:30 sn01 kernel: CR2: 00002ab47fe89000 CR3: 0000000000201000 CR4: 00000000000006e0
Oct 4 18:21:30 sn01 kernel:
Oct 4 18:21:30 sn01 kernel: Call Trace:
Oct 4 18:21:30 sn01 kernel: [<ffffffff8850ba88>] :obdclass:class_disconnect+0x398/0x420
Oct 4 18:21:30 sn01 kernel: [<ffffffff888a65b1>] :mds:mds_disconnect+0x121/0xe40
Oct 4 18:21:30 sn01 kernel: [<ffffffff885c5ea4>] :ptlrpc:lustre_msg_add_version+0x34/0x110
Oct 4 18:21:30 sn01 kernel: [<ffffffff885cb834>] :ptlrpc:lustre_msg_set_timeout+0x34/0x110
Oct 4 18:21:30 sn01 kernel: [<ffffffff885c8a79>] :ptlrpc:lustre_pack_reply+0x29/0xb0
Oct 4 18:21:30 sn01 kernel: [<ffffffff88591381>] :ptlrpc:target_handle_disconnect+0x461/0x560
Oct 4 18:21:30 sn01 kernel: [<ffffffff885c4f35>] :ptlrpc:lustre_msg_get_opc+0x35/0xf0
Oct 4 18:21:30 sn01 kernel: [<ffffffff888a2324>] :mds:mds_handle+0xd74/0x4cb0
Oct 4 18:21:30 sn01 kernel: [<ffffffff80148d4f>] __next_cpu+0x19/0x28
Oct 4 18:21:30 sn01 kernel: [<ffffffff80088f32>] find_busiest_group+0x20d/0x621
Oct 4 18:21:30 sn01 kernel: [<ffffffff885caa15>] :ptlrpc:lustre_msg_get_conn_cnt+0x35/0xf0
Oct 4 18:21:30 sn01 kernel: [<ffffffff80089d89>] enqueue_task+0x41/0x56
Oct 4 18:21:30 sn01 kernel: [<ffffffff885cf72d>] :ptlrpc:ptlrpc_check_req+0x1d/0x110
Oct 4 18:21:30 sn01 kernel: [<ffffffff885d1e67>] :ptlrpc:ptlrpc_server_handle_request+0xa97/0x1160
Oct 4 18:21:30 sn01 kernel: [<ffffffff8003dc3f>] lock_timer_base+0x1b/0x3c
Oct 4 18:21:30 sn01 kernel: [<ffffffff80088819>] __wake_up_common+0x3e/0x68
Oct 4 18:21:30 sn01 kernel: [<ffffffff885d5908>] :ptlrpc:ptlrpc_main+0x1218/0x13e0
Oct 4 18:21:30 sn01 kernel: [<ffffffff8008a3ef>] default_wake_function+0x0/0xe
Oct 4 18:21:30 sn01 kernel: [<ffffffff800b48dd>] audit_syscall_exit+0x327/0x342
Oct 4 18:21:30 sn01 kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
Oct 4 18:21:30 sn01 kernel: [<ffffffff885d46f0>] :ptlrpc:ptlrpc_main+0x0/0x13e0
Oct 4 18:21:30 sn01 kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11
Oct 4 18:21:30 sn01 kernel:
Oct 4 18:21:40 sn01 kernel: BUG: soft lockup - CPU#5 stuck for 10s! [ll_mdt_39:831]
Oct 4 18:21:40 sn01 kernel: CPU 5:
Oct 4 18:21:40 sn01 kernel: Modules linked in: mds(U) fsfilt_ldiskfs(U) mgs(U) mgc(U) ldiskfs(U) crc16(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) autofs4(U) hidp(U) rfcomm(U) l2cap(U) bluetooth(U) sunrpc(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) dm_mirror(U) dm_log(U) dm_multipath(U) scsi_dh(U) dm_mod(U) video(U) hwmon(U) backlight(U) sbs(U) i2c_ec(U) i2c_core(U) button(U) battery(U) asus_acpi(U) acpi_memhotplug(U) ac(U) parport_pc(U) lp(U) parport(U) sg(U) shpchp(U) netxen_nic(U) ide_cd(U) serio_raw(U) i5000_edac(U) pcspkr(U) edac_mc(U) bnx2(U) cdrom(U) ata_piix(U) libata(U) mptsas(U) mptscsih(U) scsi_transport_sas(U) mptbase(U) megaraid_sas(U) sd_mod(U) scsi_mod(U) ext3(U) jbd(U) ehci_hcd(U) ohci_hcd(U) uhci_hcd(U)
Oct 4 18:21:40 sn01 kernel: Pid: 831, comm: ll_mdt_39 Tainted: G 2.6.18-128.1.14.el5_lustre.1.8.1 #1
Oct 4 18:21:40 sn01 kernel: RIP: 0010:[<ffffffff88504b98>] [<ffffffff88504b98>] :obdclass:lustre_hash_for_each_empty+0x218/0x2b0
Oct 4 18:21:40 sn01 kernel: RSP: 0018:ffff810123af7a30 EFLAGS: 00000206
Oct 4 18:21:40 sn01 kernel: RAX: 000000000000ffff RBX: 0000000000000543 RCX: 000000000000bf15
Oct 4 18:21:40 sn01 kernel: RDX: 000000000000036e RSI: ffffffff802f9e80 RDI: ffffc2001223243c
Oct 4 18:21:40 sn01 kernel: RBP: ffffffff885838e2 R08: ffff81000102fee0 R09: 0000000000000000
Oct 4 18:21:40 sn01 kernel: R10: ffff8100a59c1e00 R11: 0000000000000150 R12: ffff8100a5f6f800
Oct 4 18:21:40 sn01 kernel: R13: 0000000000000286 R14: 0000000000000286 R15: ffff810123af79d0
Oct 4 18:21:40 sn01 kernel: FS: 00002ab6a6887230(0000) GS:ffff810106751740(0000) knlGS:0000000000000000
Oct 4 18:21:40 sn01 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Oct 4 18:21:40 sn01 kernel: CR2: 00002ab47fe89000 CR3: 0000000000201000 CR4: 00000000000006e0
Oct 4 18:21:40 sn01 kernel:
Oct 4 18:21:40 sn01 kernel: [<ffffffff888a65b1>] :mds:mds_disconnect+0x121/0xe40
Oct 4 18:21:40 sn01 kernel: [<ffffffff885c5ea4>] :ptlrpc:lustre_msg_add_version+0x34/0x110
Oct 4 18:21:40 sn01 kernel: [<ffffffff885cb834>] :ptlrpc:lustre_msg_set_timeout+0x34/0x110
Oct 4 18:21:40 sn01 kernel: [<ffffffff885c8a79>] :ptlrpc:lustre_pack_reply+0x29/0xb0
Oct 4 18:21:40 sn01 kernel: [<ffffffff88591381>] :ptlrpc:target_handle_disconnect+0x461/0x560
Oct 4 18:21:40 sn01 kernel: [<ffffffff885c4f35>] :ptlrpc:lustre_msg_get_opc+0x35/0xf0
Oct 4 18:21:40 sn01 kernel: [<ffffffff888a2324>] :mds:mds_handle+0xd74/0x4cb0
Oct 4 18:21:40 sn01 kernel: [<ffffffff80148d4f>] __next_cpu+0x19/0x28
Oct 4 18:21:40 sn01 kernel: [<ffffffff80088f32>] find_busiest_group+0x20d/0x621
Oct 4 18:21:40 sn01 kernel: [<ffffffff885caa15>] :ptlrpc:lustre_msg_get_conn_cnt+0x35/0xf0
Oct 4 18:21:40 sn01 kernel: [<ffffffff80089d89>] enqueue_task+0x41/0x56
Oct 4 18:21:40 sn01 kernel: [<ffffffff885cf72d>] :ptlrpc:ptlrpc_check_req+0x1d/0x110
Oct 4 18:21:40 sn01 kernel: [<ffffffff885d1e67>] :ptlrpc:ptlrpc_server_handle_request+0xa97/0x1160
Oct 4 18:21:40 sn01 kernel: [<ffffffff8003dc3f>] lock_timer_base+0x1b/0x3c
Oct 4 18:21:40 sn01 kernel: [<ffffffff80088819>] __wake_up_common+0x3e/0x68
Oct 4 18:21:40 sn01 kernel: [<ffffffff885d5908>] :ptlrpc:ptlrpc_main+0x1218/0x13e0
Oct 4 18:21:40 sn01 kernel: [<ffffffff8008a3ef>] default_wake_function+0x0/0xe
Oct 4 18:21:40 sn01 kernel: [<ffffffff800b48dd>] audit_syscall_exit+0x327/0x342
Oct 4 18:21:40 sn01 kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
Oct 4 18:21:40 sn01 kernel: [<ffffffff885d46f0>] :ptlrpc:ptlrpc_main+0x0/0x13e0
Oct 4 18:21:40 sn01 kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11
Oct 4 18:21:40 sn01 kernel: s ago has timed out (limit 7s). 0/0 rc 0/0 will wait for recovery to complete.
Oct 4 18:21:48 sn01 kernel: Lustre: Skipped 1 previous similar message
Oct 4 18:21:48 sn01 kernel: Lustre: lustre_0-OST001c-osc: Connection restored to service lustre_0-OST001c using nid 10.1.4.118@tcp.
Oct 4 18:21:50 sn01 kernel: BUG: soft lockup - CPU#5 stuck for 10s! [ll_mdt_39:831]
Oct 4 18:21:50 sn01 kernel: CPU 5:
tbase(U) megaraid_sas(U) sd_mod(U) scsi_mod(U) ext3(U) jbd(U) ehci_hcd(U) ohci_hcd(U) uhci_hcd(U)
Oct 4 18:21:50 sn01 kernel: Pid: 831, comm: ll_mdt_39 Tainted: G 2.6.18-128.1.14.el5_lustre.1.8.1 #1
Oct 4 18:21:50 sn01 kernel: RIP: 0010:[<ffffffff88504ba0>] [<ffffffff88504ba0>] :obdclass:lustre_hash_for_each_empty+0x220/0x2b0
Oct 4 18:21:50 sn01 kernel: RSP: 0018:ffff810123af7a30 EFLAGS: 00000202

Which repeats several times, before

Oct 4 18:24:31 sn01 kernel: Lustre: 831:0:(service.c:1369:ptlrpc_server_handle_request()) @@@ Request x44330119 took longer than estimated (100+90s); client may timeout. req@ffff810188454c00 x44330119/t0 o39->e64e59dd-a66e-9210-ed51-5bcdebd7470d@NET_0x200000a010713_UUID:0/0 lens 128/128 e 0 to 0 dl 1254676981 ref 1 fl Complete:/0/0 rc 0/0
Oct 4 18:24:31 sn01 kernel: LustreError: 4417:0:(events.c:66:request_out_callback()) @@@ type 4, status -5 req@ffff81009432b800 x1310439138925892/t0 o400->lustre_0-OST001f_UUID@10.1.4.118@tcp:28/4 lens 192/384 e 0 to 1 dl 1254676908 ref 1 fl Complete:XN/0/0 rc -11/0
Oct 4 18:24:31 sn01 kernel: LustreError: 4417:0:(events.c:66:request_out_callback()) Skipped 6 previous similar messages
Oct 4 18:24:31 sn01 kernel: Lustre: 864:0:(ldlm_lib.c:541:target_handle_reconnect()) lustre_0-MDT0000: dbb6a3f0-9aca-15cb-c931-831d85f03526 reconnecting
Oct 4 18:24:31 sn01 kernel: Lustre: 885:0:(ldlm_lib.c:766:target_handle_connect()) lustre_0-MDT0000: exp ffff8101888d4200 already connecting
Oct 4 18:24:31 sn01 kernel: LustreError: 885:0:(ldlm_lib.c:1850:target_send_reply_msg()) @@@ processing error (-114) req@ffff8101ad1b6c00 x2087169/t0 o38->dbb6a3f0-9aca-15cb-c931-831d85f03526@NET_0x200000a01070d_UUID:0/0 lens 304/200 e 0 to 0 dl 1254677171 ref 1 fl Interpret:/0/0 rc -114/0
Oct 4 18:24:31 sn01 kernel: Lustre: 4417:0:(lib-move.c:1782:lnet_parse_put()) Dropping PUT from 12345-10.1.4.118@tcp portal 4 match 1310439138926075 offset 0 length 264: 2
Oct 4 18:24:31 sn01 kernel: Lustre: 4417:0:(lib-move.c:1782:lnet_parse_put()) Dropping PUT from 12345-10.1.4.118@tcp portal 4 match 1310439138926074 offset 0 length 264: 2
Oct 4 18:24:31 sn01 kernel: Lustre: 4417:0:(lib-move.c:1782:lnet_parse_put()) Dropping PUT from 12345-10.1.4.118@tcp portal 4 match 1310439138926073 offset 0 length 264: 2
Oct 4 18:24:31 sn01 kernel: Lustre: 4417:0:(lib-move.c:1782:lnet_parse_put()) Dropping PUT from 12345-10.1.4.118@tcp portal 4 match 1310439138926133 offset 0 length 264: 2
Oct 4 18:24:31 sn01 kernel: Lustre: 4417:0:(lib-move.c:1782:lnet_parse_put()) Dropping PUT from 12345-10.1.4.118@tcp portal 4 match 1310439138926135 offset 0 length 264: 2
Oct 4 18:24:31 sn01 kernel: Lustre: 4417:0:(lib-move.c:1782:lnet_parse_put()) Dropping PUT from 12345-10.1.4.118@tcp portal 4 match 1310439138926136 offset 0 length 264: 2
Oct 4 18:24:31 sn01 kernel: Lustre: 4417:0:(lib-move.c:1782:lnet_parse_put()) Dropping PUT from 12345-10.1.4.118@tcp portal 4 match 1310439138926134 offset 0 length 264: 2
Oct 4 18:24:31 sn01 kernel: Lustre: 4417:0:(lib-move.c:1782:lnet_parse_put()) Dropping PUT from 12345-10.1.4.118@tcp portal 4 match 1310439138926184 offset 0 length 264: 2
Oct 4 18:24:31 sn01 kernel: Lustre: 4417:0:(lib-move.c:1782:lnet_parse_put()) Dropping PUT from 12345-10.1.4.118@tcp portal 4 match 1310439138926186 offset 0 length 264: 2
Oct 4 18:24:31 sn01 kernel: Lustre: 4417:0:(lib-move.c:1782:lnet_parse_put()) Dropping PUT from 12345-10.1.4.118@tcp portal 4 match 1310439138926185 offset 0 length 264: 2
Oct 4 18:24:31 sn01 kernel: Lustre: 4417:0:(lib-move.c:1782:lnet_parse_put()) Dropping PUT from 12345-10.1.4.118@tcp portal 4 match 1310439138926187 offset 0 length 264: 2
Oct 4 18:24:31 sn01 kernel: Lustre: lustre_0-OST001c-osc: Connection restored to service lustre_0-OST001c using nid 10.1.4.118@tcp.

Chris
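The "(100+90s)" in that last warning is the adaptive-timeout service estimate (base estimate plus extra margin) that the MDS had advertised for the request. The 1.8 manual exposes per-service estimate histories through a "timeouts" parameter; the exact parameter name below is an assumption for the MDS service, so locate the real entry first:

    # Per-service adaptive-timeout estimates. The parameter name is an
    # assumption; find the actual proc file before relying on it:
    find /proc/fs/lustre -name timeouts
    lctl get_param mdt.MDS.mds.timeouts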
Brian J. Murrell
2009-Oct-09 12:45 UTC
[Lustre-discuss] 1.8.0 Losing connection to the MDT for several minutes and then recovering.
On Fri, 2009-10-09 at 11:25 +0100, Christopher J. Walker wrote:
> I've just seen it with a 1.8.1 server (and 1.6 clients):
>
> Oct 4 18:21:30 sn01 kernel: Pid: 831, comm: ll_mdt_39 Tainted: G 2.6.18-128.1.14.el5_lustre.1.8.1 #1
> Oct 4 18:21:30 sn01 kernel: RIP: 0010:[<ffffffff88504ba0>] [<ffffffff88504ba0>] :obdclass:lustre_hash_for_each_empty+0x220/0x2b0

This is bug 19557.

b.
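Both reports show the same spot: an MDT service thread tearing down a client export (mds_disconnect into class_disconnect) while walking lustre_hash_for_each_empty() over that export's hash entries. With enough entries to drain, the loop can hold the CPU past the kernel's 10-second soft-lockup watchdog, so the BUG lines fire repeatedly even though the teardown eventually completes, which matches the stall-then-recover behaviour in this thread. A quick hedged check that a server is hitting this signature rather than a genuine hang (log path assumes syslog defaults):

    # The watchdog warning and the hash-teardown frame should appear together:
    grep -E 'soft lockup|lustre_hash_for_each_empty' /var/log/messages | tail

The fix tracked in bug 19557 reportedly landed in a later 1.8.x maintenance release; check the bug report itself for the exact version.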