Thomas Wakefield
2009-Mar-20 14:15 UTC
[Lustre-discuss] MDS crashing after upgrade to 1.6.7
My MDS has crashed a couple of times after about 48 hours of running on 1.6.7, and I can't decide if it is a bug or a disk issue. Prior to the upgrade, my uptime was 250+ days, so I think the hardware is stable. Any help would be very appreciated.

The server is a Dell 2950: dual OS drives, dual RAID 1 drives for the MDS, running LVM for snapshots.

Log from right before the crash:

Mar 20 09:51:59 disk0 kernel: device-mapper: snapshots: Invalidating snapshot: Unable to allocate exception.
Mar 20 09:51:59 disk0 kernel: Buffer I/O error on device dm-1, logical block 22676124
Mar 20 09:51:59 disk0 kernel: lost page write due to I/O error on dm-1
Mar 20 09:51:59 disk0 kernel: Buffer I/O error on device dm-1, logical block 22676125
Mar 20 09:51:59 disk0 kernel: lost page write due to I/O error on dm-1
Mar 20 09:51:59 disk0 kernel: Buffer I/O error on device dm-1, logical block 22676127
Mar 20 09:51:59 disk0 kernel: lost page write due to I/O error on dm-1
Mar 20 09:51:59 disk0 kernel: Buffer I/O error on device dm-1, logical block 22676374
Mar 20 09:51:59 disk0 kernel: lost page write due to I/O error on dm-1
Mar 20 09:51:59 disk0 kernel: Buffer I/O error on device dm-1, logical block 22676375
Mar 20 09:51:59 disk0 kernel: lost page write due to I/O error on dm-1
Mar 20 09:51:59 disk0 kernel: Buffer I/O error on device dm-1, logical block 22676376
Mar 20 09:51:59 disk0 kernel: lost page write due to I/O error on dm-1
Mar 20 09:51:59 disk0 kernel: Buffer I/O error on device dm-1, logical block 22676377
Mar 20 09:51:59 disk0 kernel: Buffer I/O error on device dm-1, logical block 11042873
Mar 20 09:51:59 disk0 kernel: lost page write due to I/O error on dm-1
Mar 20 09:51:59 disk0 kernel: Buffer I/O error on device dm-1, logical block 11042919
Mar 20 09:51:59 disk0 kernel: lost page write due to I/O error on dm-1
Mar 20 09:51:59 disk0 kernel: Buffer I/O error on device dm-1, logical block 11042920
Mar 20 09:51:59 disk0 kernel: lost page write due to I/O error on dm-1
Mar 20 09:51:59 disk0 kernel: lost page write due to I/O error on dm-1
Mar 20 09:51:59 disk0 kernel: LDISKFS-fs error (device dm-1): ldiskfs_get_inode_loc: unable to read inode block - inode=21347250, block=21333880
Mar 20 09:58:24 disk0 syslogd 1.4.1: restart.

Then, after a reboot, here is a dmesg BUG that concerned me:

LustreError: 10661:0:(ldlm_lib.c:1643:target_send_reply_msg()) Skipped 390 previous similar messages
LustreError: 10689:0:(handler.c:1601:mds_handle()) Skipped 390 previous similar messages
BUG: soft lockup - CPU#3 stuck for 10s! [ll_mdt_19:10680]
CPU 3:
Modules linked in: mds(U) fsfilt_ldiskfs(U) mgs(U) mgc(U) ldiskfs(U) crc16(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ksocklnd(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) ipmi_devintf(U) mptctl(U) mptbase(U) ipmi_si(U) ipmi_msghandler(U) dell_rbu(U) autofs4(U) hidp(U) rfcomm(U) l2cap(U) bluetooth(U) sunrpc(U) rdma_ucm(U) ib_sdp(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U) ipoib_helper(U) ib_cm(U) ib_sa(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) ib_uverbs(U) ib_umad(U) iw_cxgb3(U) cxgb3(U) ib_ipath(U) mlx4_ib(U) mlx4_core(U) dm_multipath(U) video(U) sbs(U) backlight(U) i2c_ec(U) i2c_core(U) button(U) battery(U) asus_acpi(U) acpi_memhotplug(U) ac(U) parport_pc(U) lp(U) parport(U) pata_acpi(U) joydev(U) ata_piix(U) libata(U) sr_mod(U) ib_mthca(U) ib_mad(U) shpchp(U) ib_core(U) sg(U) pcspkr(U) ide_cd(U) bnx2(U) cdrom(U) serio_raw(U) i5000_edac(U) edac_mc(U) dm_snapshot(U) dm_zero(U) dm_mirror(U) dm_mod(U) usb_storage(U) megaraid_sas(U) sd_mod(U) scsi_mod(U) ext3(U) jbd(U) ehci_hcd(U) ohci_hcd(U) uhci_hcd(U)
Pid: 10680, comm: ll_mdt_19 Tainted: G 2.6.18-92.1.17.el5_lustre.1.6.7smp #1
RIP: 0010:[<ffffffff8870e3e0>] [<ffffffff8870e3e0>] :obdclass:lustre_hash_for_each_empty+0x1f0/0x290
RSP: 0018:ffff81040dc339e0 EFLAGS: 00000206
RAX: ffff81040ab94a00 RBX: 0000000000001f0e RCX: 0000000000007b38
RDX: 000000000000020a RSI: ffffffff802f0d80 RDI: ffffc20010d5e0ec
RBP: ffffffff88788912 R08: ffff81000101f3e0 R09: 0000000000000000
R10: ffff81040ab920c0 R11: 0000000000000150 R12: ffff81040ab94a00
R13: 0000000000000286 R14: 0000000000000286 R15: ffff81040dc33980
FS: 00002aca2e21d220(0000) GS:ffff81010eb9a4c0(0000) knlGS: 0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00000000efdee288 CR3: 0000000414759000 CR4: 00000000000006e0

Call Trace:
 [<ffffffff88714bf8>] :obdclass:class_disconnect+0x378/0x400
 [<ffffffff8878a5b0>] :ptlrpc:ldlm_cancel_locks_for_export_cb+0x0/0xc0
 [<ffffffff88aa605b>] :mds:mds_disconnect+0x11b/0x310
 [<ffffffff88674598>] :libcfs:libcfs_ip_addr2str+0x38/0x40
 [<ffffffff88712f44>] :obdclass:class_disconnect_export_list+0x5e4/0x780
 [<ffffffff88714f67>] :obdclass:class_disconnect_stale_exports+0x137/0x1a0
 [<ffffffff8879262c>] :ptlrpc:target_abort_recovery+0x19c/0x390
 [<ffffffff887c8e15>] :ptlrpc:lustre_msg_get_opc+0x35/0xf0
 [<ffffffff88aa0d97>] :mds:mds_handle+0x7f7/0x4d30
 [<ffffffff80143809>] __next_cpu+0x19/0x28
 [<ffffffff80143809>] __next_cpu+0x19/0x28
 [<ffffffff800898e3>] find_busiest_group+0x20d/0x621
 [<ffffffff887c85a5>] :ptlrpc:lustre_msg_get_conn_cnt+0x35/0xf0
 [<ffffffff887d0cfa>] :ptlrpc:ptlrpc_server_request_get+0x6a/0x150
 [<ffffffff887d2b7d>] :ptlrpc:ptlrpc_check_req+0x1d/0x110
 [<ffffffff887d5103>] :ptlrpc:ptlrpc_server_handle_request+0xa93/0x1150
 [<ffffffff80062f4b>] thread_return+0x0/0xdf
 [<ffffffff8006d8a2>] do_gettimeofday+0x40/0x8f
 [<ffffffff886787c6>] :libcfs:lcw_update_time+0x16/0x100
 [<ffffffff8003d514>] lock_timer_base+0x1b/0x3c
 [<ffffffff8001c5b0>] __mod_timer+0xb0/0xbe
 [<ffffffff887d85f8>] :ptlrpc:ptlrpc_main+0x1218/0x13e0
 [<ffffffff8008abb9>] default_wake_function+0x0/0xe
 [<ffffffff800b4382>] audit_syscall_exit+0x31b/0x336
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff887d73e0>] :ptlrpc:ptlrpc_main+0x0/0x13e0
 [<ffffffff8005dfa7>] child_rip+0x0/0x11

Lustre: 10663:0:(ldlm_lib.c:541:target_handle_reconnect()) data-MDT0000: 3b041098-4f33-6c6a-6e5c-3af7b34062e8 reconnecting
LustreError: 10663:0:(handler.c:1601:mds_handle()) operation 101 on unconnected MDS from 12345-192.168.64.14@o2ib
LustreError: 10681:0:(ldlm_lib.c:1643:target_send_reply_msg()) @@@ processing error (-107) req@ffff81040962ec00 x52078437/t0 o101-><?>@<?>:0/0 lens 232/0 e 0 to 0 dl 1237557961 ref 1 fl Interpret:/0/0 rc -107/0
LustreError: 10681:0:(ldlm_lib.c:1643:target_send_reply_msg()) Skipped 777 previous similar messages
LustreError: 10663:0:(handler.c:1601:mds_handle()) Skipped 779 previous similar messages
Lustre: 10676:0:(ldlm_lib.c:541:target_handle_reconnect()) data-MDT0000: 2a1f051a-1043-af7c-75c5-65af1bced83b reconnecting
Lustre: 10676:0:(ldlm_lib.c:541:target_handle_reconnect()) Skipped 1 previous similar message
Lustre: 10663:0:(ldlm_lib.c:541:target_handle_reconnect()) data-MDT0000: 99fbab1c-b702-fab2-a584-6c31ac18a5f4 reconnecting
Lustre: 10663:0:(ldlm_lib.c:541:target_handle_reconnect()) Skipped 2 previous similar messages
BUG: soft lockup - CPU#3 stuck for 10s! [ll_mdt_19:10680]
CPU 3:
Modules linked in: mds(U) fsfilt_ldiskfs(U) mgs(U) mgc(U) ldiskfs(U) crc16(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ksocklnd(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) ipmi_devintf(U) mptctl(U) mptbase(U) ipmi_si(U) ipmi_msghandler(U) dell_rbu(U) autofs4(U) hidp(U) rfcomm(U) l2cap(U) bluetooth(U) sunrpc(U) rdma_ucm(U) ib_sdp(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U) ipoib_helper(U) ib_cm(U) ib_sa(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) ib_uverbs(U) ib_umad(U) iw_cxgb3(U) cxgb3(U) ib_ipath(U) mlx4_ib(U) mlx4_core(U) dm_multipath(U) video(U) sbs(U) backlight(U) i2c_ec(U) i2c_core(U) button(U) battery(U) asus_acpi(U) acpi_memhotplug(U) ac(U) parport_pc(U) lp(U) parport(U) pata_acpi(U) joydev(U) ata_piix(U) libata(U) sr_mod(U) ib_mthca(U) ib_mad(U) shpchp(U) ib_core(U) sg(U) pcspkr(U) ide_cd(U) bnx2(U) cdrom(U) serio_raw(U) i5000_edac(U) edac_mc(U) dm_snapshot(U) dm_zero(U) dm_mirror(U) dm_mod(U) usb_storage(U) megaraid_sas(U) sd_mod(U) scsi_mod(U) ext3(U) jbd(U) ehci_hcd(U) ohci_hcd(U) uhci_hcd(U)
Pid: 10680, comm: ll_mdt_19 Tainted: G 2.6.18-92.1.17.el5_lustre.1.6.7smp #1
RIP: 0010:[<ffffffff800649e6>] [<ffffffff800649e6>] _write_lock+0x7/0xf
RSP: 0018:ffff81040dc339d8 EFLAGS: 00000246
RAX: ffff81040826ea00 RBX: 0000000000006227 RCX: 000000000000aee7
RDX: 00000000000002e4 RSI: ffffffff802f0d80 RDI: ffffc20010da127c
RBP: 0000000000000286 R08: ffff81000101f3e0 R09: 0000000000000000
R10: ffff81040826fe00 R11: 0000000000000150 R12: 0000000000000286
R13: ffff81040dc33980 R14: ffff81040826fe58 R15: ffff81040826fe00
FS: 00002aca2e21d220(0000) GS:ffff81010eb9a4c0(0000) knlGS: 0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00000000efdee288 CR3: 0000000414759000 CR4: 00000000000006e0

Call Trace:
 [<ffffffff8870e3fe>] :obdclass:lustre_hash_for_each_empty+0x20e/0x290
 [<ffffffff88714bf8>] :obdclass:class_disconnect+0x378/0x400
 [<ffffffff8878a5b0>] :ptlrpc:ldlm_cancel_locks_for_export_cb+0x0/0xc0
 [<ffffffff88aa605b>] :mds:mds_disconnect+0x11b/0x310
 [<ffffffff88674598>] :libcfs:libcfs_ip_addr2str+0x38/0x40
 [<ffffffff88712f44>] :obdclass:class_disconnect_export_list+0x5e4/0x780
 [<ffffffff88714f67>] :obdclass:class_disconnect_stale_exports+0x137/0x1a0
 [<ffffffff8879262c>] :ptlrpc:target_abort_recovery+0x19c/0x390
 [<ffffffff887c8e15>] :ptlrpc:lustre_msg_get_opc+0x35/0xf0
 [<ffffffff88aa0d97>] :mds:mds_handle+0x7f7/0x4d30
 [<ffffffff80143809>] __next_cpu+0x19/0x28
 [<ffffffff80143809>] __next_cpu+0x19/0x28
 [<ffffffff800898e3>] find_busiest_group+0x20d/0x621
 [<ffffffff887c85a5>] :ptlrpc:lustre_msg_get_conn_cnt+0x35/0xf0
 [<ffffffff887d0cfa>] :ptlrpc:ptlrpc_server_request_get+0x6a/0x150
 [<ffffffff887d2b7d>] :ptlrpc:ptlrpc_check_req+0x1d/0x110
 [<ffffffff887d5103>] :ptlrpc:ptlrpc_server_handle_request+0xa93/0x1150
 [<ffffffff80062f4b>] thread_return+0x0/0xdf
 [<ffffffff8006d8a2>] do_gettimeofday+0x40/0x8f
 [<ffffffff886787c6>] :libcfs:lcw_update_time+0x16/0x100
 [<ffffffff8003d514>] lock_timer_base+0x1b/0x3c
 [<ffffffff8001c5b0>] __mod_timer+0xb0/0xbe
 [<ffffffff887d85f8>] :ptlrpc:ptlrpc_main+0x1218/0x13e0
 [<ffffffff8008abb9>] default_wake_function+0x0/0xe
 [<ffffffff800b4382>] audit_syscall_exit+0x31b/0x336
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff887d73e0>] :ptlrpc:ptlrpc_main+0x0/0x13e0
 [<ffffffff8005dfa7>] child_rip+0x0/0x11

Lustre: data-MDT0000: sending delayed replies to recovered clients
Lustre: data-MDT0000: recovery complete: rc 0
Lustre: MDS data-MDT0000: data-OST0000_UUID now active, resetting orphans
Lustre: MDS data-MDT0000: data-OST0001_UUID now active, resetting orphans
LustreError: 10616:0:(llog_server.c:471:llog_origin_handle_cancel()) Cancel 62 of 124 llog-records failed: -2
LustreError: 10617:0:(llog_server.c:471:llog_origin_handle_cancel()) Cancel 62 of 124 llog-records failed: -2
LustreError: 10668:0:(handler.c:1601:mds_handle()) operation 101 on unconnected MDS from 12345-192.168.64.14@o2ib
LustreError: 10681:0:(ldlm_lib.c:1643:target_send_reply_msg()) @@@ processing error (-107) req@ffff81040c50c400 x52079502/t0 o101-><?>@<?>:0/0 lens 232/0 e 0 to 0 dl 1237557977 ref 1 fl Interpret:/0/0 rc -107/0
LustreError: 10681:0:(ldlm_lib.c:1643:target_send_reply_msg()) Skipped 1058 previous similar messages
LustreError: 10668:0:(handler.c:1601:mds_handle()) Skipped 1060 previous similar messages
LustreError: 10687:0:(handler.c:1601:mds_handle()) operation 101 on unconnected MDS from 12345-192.168.64.14@o2ib
LustreError: 10679:0:(ldlm_lib.c:1643:target_send_reply_msg()) @@@ processing error (-107) req@ffff81042dcecc00 x52082030/t0 o101-><?>@<?>:0/0 lens 232/0 e 0 to 0 dl 1237558009 ref 1 fl Interpret:/0/0 rc -107/0
LustreError: 10679:0:(ldlm_lib.c:1643:target_send_reply_msg()) Skipped 2518 previous similar messages
LustreError: 10687:0:(handler.c:1601:mds_handle()) Skipped 2518 previous similar messages

Thomas Wakefield Jr.
Systems Administrator, COLA/IGES
301-902-1268
twake at cola.iges.org
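For what it's worth, the dm-snapshot message "Invalidating snapshot: Unable to allocate exception." is what the kernel logs when it cannot allocate a new copy-on-write exception, most commonly because the snapshot volume has filled up, and device-mapper then drops the snapshot. A minimal sketch of how to watch for and handle that, using hypothetical LVM names (volume group vg0, snapshot volume mdt_snap):

    # Show how full each snapshot is; the field is snap_percent ("Snap%")
    # on EL5-era LVM2, data_percent on newer releases.
    lvs -o lv_name,lv_size,origin,snap_percent vg0

    # Grow a snapshot that is approaching 100% before it gets invalidated...
    lvextend -L +5G /dev/vg0/mdt_snap

    # ...or remove a snapshot that is no longer needed.
    lvremove /dev/vg0/mdt_snap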
To me it looks like a disk issue. A very strange one at that.

On Mar 20, 2009, at 10:15 AM, Thomas Wakefield wrote:

> My MDS has crashed a couple of times after about 48 hours of running on 1.6.7, and I can't decide if it is a bug or a disk issue. Prior to the upgrade, my uptime was 250+ days, so I think the hardware is stable.
> [...]
> Mar 20 09:51:59 disk0 kernel: device-mapper: snapshots: Invalidating snapshot: Unable to allocate exception.
> Mar 20 09:51:59 disk0 kernel: Buffer I/O error on device dm-1, logical block 22676124
> [...]
> Mar 20 09:51:59 disk0 kernel: LDISKFS-fs error (device dm-1): ldiskfs_get_inode_loc: unable to read inode block - inode=21347250, block=21333880
> [...]
Brian J. Murrell
2009-Mar-20 16:15 UTC
[Lustre-discuss] MDS crashing after upgrade to 1.6.7
On Fri, 2009-03-20 at 10:15 -0400, Thomas Wakefield wrote:

> Mar 20 09:51:59 disk0 kernel: device-mapper: snapshots: Invalidating snapshot: Unable to allocate exception.

This is an LVM (device-mapper, in fact) error. Nothing to do with Lustre.

> Mar 20 09:51:59 disk0 kernel: Buffer I/O error on device dm-1, logical block 22676124
> Mar 20 09:51:59 disk0 kernel: lost page write due to I/O error on dm-1
> Mar 20 09:51:59 disk0 kernel: Buffer I/O error on device dm-1, logical block 22676125
> Mar 20 09:51:59 disk0 kernel: lost page write due to I/O error on dm-1

These are all generic kernel I/O messages, probably related to the prior LVM error.

> Mar 20 09:51:59 disk0 kernel: LDISKFS-fs error (device dm-1): ldiskfs_get_inode_loc: unable to read inode block - inode=21347250, block=21333880

This looks like ldiskfs reporting an error likely caused by the previous errors.

> Then, after a reboot, here is a dmesg BUG that concerned me:

Probably a result of the problems with your disk. I would suggest getting your disk problems fixed, completely, first, then doing an fsck on the MDT to see what damage the disk problems caused the filesystem.

b.
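A minimal sketch of that check-and-repair step, assuming a hypothetical MDT device /dev/vg0/mdt mounted at /mnt/mdt, and the e2fsprogs build shipped for Lustre installed; the check must be run with the MDT unmounted:

    # Stop Lustre on this node so the MDT device is not in use.
    umount /mnt/mdt              # hypothetical mount point

    # Read-only pass first, to gauge how much damage there is:
    e2fsck -fn /dev/vg0/mdt

    # If the report looks sane, run a repairing pass:
    e2fsck -fp /dev/vg0/mdt

Here -p only preens automatically fixable problems; heavier corruption would need an interactive run (or -y) and, ideally, a backup of the MDT first.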
I have been having the same problem on one of the clusters where I upgraded the Lustre servers from 1.6.6 to 1.6.7. In this cluster all the worker nodes/clients run the Lustre 1.6.6 kernel: 2.6.18-92.1.10.el5_lustre.1.6.6smp. I don't get any hardware errors on the Lustre servers, but I get pretty much the same errors in the call traces in /var/log/kernel on the MDT servers and OST servers. This was the last call trace that I got on my MDT:

Mar 20 13:05:40 lustre3 kernel: BUG: soft lockup - CPU#1 stuck for 10s! [ldlm_cn_02:4302]
Mar 20 13:05:40 lustre3 kernel: CPU 1:
Mar 20 13:05:40 lustre3 kernel: Modules linked in: obdfilter(U) ost(U) mds(U) fsfilt_ldiskfs(U) mgs(U) mgc(U) ldiskfs(U) crc16(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) ipmi_devintf(U) ipmi_si(U) ipmi_msghandler(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) autofs4(U) hidp(U) l2cap(U) bluetooth(U) sunrpc(U) vfat(U) fat(U) dm_multipath(U) video(U) sbs(U) backlight(U) i2c_ec(U) button(U) battery(U) asus_acpi(U) acpi_memhotplug(U) ac(U) parport_pc(U) lp(U) parport(U) joydev(U) i2c_nforce2(U) sr_mod(U) shpchp(U) forcedeth(U) i2c_core(U) serio_raw(U) cdrom(U) pcspkr(U) sg(U) dm_snapshot(U) dm_zero(U) dm_mirror(U) dm_mod(U) usb_storage(U) qla2xxx(U) scsi_transport_fc(U) sata_nv(U) pata_acpi(U) libata(U) sd_mod(U) scsi_mod(U) raid1(U) ext3(U) jbd(U) uhci_hcd(U) ohci_hcd(U) ehci_hcd(U)
Mar 20 13:05:40 lustre3 kernel: Pid: 4302, comm: ldlm_cn_02 Tainted: G 2.6.18-92.1.17.el5_lustre.1.6.7smp #1
Mar 20 13:05:40 lustre3 kernel: RIP: 0010:[<ffffffff885eb3a9>] [<ffffffff885eb3a9>] :obdclass:class_handle2object+0xe9/0x160
Mar 20 13:05:40 lustre3 kernel: RSP: 0018:ffff81020720bb90 EFLAGS: 00000216
Mar 20 13:05:40 lustre3 kernel: RAX: ffffc200009be818 RBX: bbdfa43ac476d656 RCX: ffff81000100e8e0
Mar 20 13:05:40 lustre3 kernel: RDX: ffff81026da41a00 RSI: 0000000000000000 RDI: bbdfa43ac476d656
Mar 20 13:05:40 lustre3 kernel: RBP: ffffffff88686959 R08: ffff81025ca06200 R09: 5a5a5a5a5a5a5a5a
Mar 20 13:05:40 lustre3 kernel: R10: 5a5a5a5a5a5a5a5a R11: 5a5a5a5a5a5a5a5a R12: ffff81020720bbf0
Mar 20 13:05:40 lustre3 kernel: R13: ffffffff8866e61f R14: ffffc200040f6220 R15: ffff81025ca06200
Mar 20 13:05:40 lustre3 kernel: FS: 00002ac881c2d220(0000) GS:ffff810107799440(0000) knlGS:0000000000000000
Mar 20 13:05:40 lustre3 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Mar 20 13:05:40 lustre3 kernel: CR2: 00002acecca1d000 CR3: 000000041fe59000 CR4: 00000000000006e0
Mar 20 13:05:40 lustre3 kernel:
Mar 20 13:05:40 lustre3 kernel: Call Trace:
Mar 20 13:05:40 lustre3 kernel: [<ffffffff88656526>] :ptlrpc:ldlm_resource_putref+0x1b6/0x3a0

I am planning on upgrading the worker nodes and clients to the latest Red Hat kernel, 2.6.18-128.1.1.el5, and installing patch-less clients to see if that fixes the problem. I don't have a problem on the other cluster, where the servers are running Lustre 1.6.7 and the clients are running a 2.6.21 kernel.org kernel with patch-less clients.

Nirmal
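As a rough sketch of that plan (the kernel-devel path and version strings below are illustrative, not the exact ones in use here), the versions actually running on a node can be confirmed, and the patch-less client modules are typically built against the unpatched vendor kernel, along these lines:

    # Kernel and Lustre versions in use on a node:
    uname -r
    cat /proc/fs/lustre/version

    # Building patch-less client modules against an unpatched RHEL kernel
    # (requires the matching kernel-devel package):
    cd lustre-1.6.7
    ./configure --with-linux=/usr/src/kernels/2.6.18-128.1.1.el5-x86_64
    make rpms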