Thomas Wakefield
2009-Mar-20 14:15 UTC
[Lustre-discuss] MDS crashing after upgrade to 1.6.7
My MDS has crashed a couple of times after about 48 hours of running on 1.6.7, and I can't decide if it is a bug or a disk issue. Prior to the upgrade, my uptime was 250+ days, so I think the hardware is stable. Any help would be very appreciated.

The server is a Dell 2950: dual OS drives, dual RAID 1 drives for the MDS, running LVM for snapshots.

Log from right before the crash:

Mar 20 09:51:59 disk0 kernel: device-mapper: snapshots: Invalidating snapshot: Unable to allocate exception.
Mar 20 09:51:59 disk0 kernel: Buffer I/O error on device dm-1, logical block 22676124
Mar 20 09:51:59 disk0 kernel: lost page write due to I/O error on dm-1
Mar 20 09:51:59 disk0 kernel: Buffer I/O error on device dm-1, logical block 22676125
Mar 20 09:51:59 disk0 kernel: lost page write due to I/O error on dm-1
Mar 20 09:51:59 disk0 kernel: Buffer I/O error on device dm-1, logical block 22676127
Mar 20 09:51:59 disk0 kernel: lost page write due to I/O error on dm-1
Mar 20 09:51:59 disk0 kernel: Buffer I/O error on device dm-1, logical block 22676374
Mar 20 09:51:59 disk0 kernel: lost page write due to I/O error on dm-1
Mar 20 09:51:59 disk0 kernel: Buffer I/O error on device dm-1, logical block 22676375
Mar 20 09:51:59 disk0 kernel: lost page write due to I/O error on dm-1
Mar 20 09:51:59 disk0 kernel: Buffer I/O error on device dm-1, logical block 22676376
Mar 20 09:51:59 disk0 kernel: lost page write due to I/O error on dm-1
Mar 20 09:51:59 disk0 kernel: Buffer I/O error on device dm-1, logical block 22676377
Mar 20 09:51:59 disk0 kernel: Buffer I/O error on device dm-1, logical block 11042873
Mar 20 09:51:59 disk0 kernel: lost page write due to I/O error on dm-1
Mar 20 09:51:59 disk0 kernel: Buffer I/O error on device dm-1, logical block 11042919
Mar 20 09:51:59 disk0 kernel: lost page write due to I/O error on dm-1
Mar 20 09:51:59 disk0 kernel: Buffer I/O error on device dm-1, logical block 11042920
Mar 20 09:51:59 disk0 kernel: lost page write due to I/O error on dm-1
Mar 20 09:51:59 disk0 kernel: lost page write due to I/O error on dm-1
Mar 20 09:51:59 disk0 kernel: LDISKFS-fs error (device dm-1): ldiskfs_get_inode_loc: unable to read inode block - inode=21347250, block=21333880
Mar 20 09:58:24 disk0 syslogd 1.4.1: restart.

Then, after a reboot, here is a dmesg BUG that concerned me:

LustreError: 10661:0:(ldlm_lib.c:1643:target_send_reply_msg()) Skipped 390 previous similar messages
LustreError: 10689:0:(handler.c:1601:mds_handle()) Skipped 390 previous similar messages
BUG: soft lockup - CPU#3 stuck for 10s! [ll_mdt_19:10680]
CPU 3:
Modules linked in: mds(U) fsfilt_ldiskfs(U) mgs(U) mgc(U) ldiskfs(U) crc16(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ksocklnd(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) ipmi_devintf(U) mptctl(U) mptbase(U) ipmi_si(U) ipmi_msghandler(U) dell_rbu(U) autofs4(U) hidp(U) rfcomm(U) l2cap(U) bluetooth(U) sunrpc(U) rdma_ucm(U) ib_sdp(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U) ipoib_helper(U) ib_cm(U) ib_sa(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) ib_uverbs(U) ib_umad(U) iw_cxgb3(U) cxgb3(U) ib_ipath(U) mlx4_ib(U) mlx4_core(U) dm_multipath(U) video(U) sbs(U) backlight(U) i2c_ec(U) i2c_core(U) button(U) battery(U) asus_acpi(U) acpi_memhotplug(U) ac(U) parport_pc(U) lp(U) parport(U) pata_acpi(U) joydev(U) ata_piix(U) libata(U) sr_mod(U) ib_mthca(U) ib_mad(U) shpchp(U) ib_core(U) sg(U) pcspkr(U) ide_cd(U) bnx2(U) cdrom(U) serio_raw(U) i5000_edac(U) edac_mc(U) dm_snapshot(U) dm_zero(U) dm_mirror(U) dm_mod(U) usb_storage(U) megaraid_sas(U) sd_mod(U) scsi_mod(U) ext3(U) jbd(U) ehci_hcd(U) ohci_hcd(U) uhci_hcd(U)
Pid: 10680, comm: ll_mdt_19 Tainted: G 2.6.18-92.1.17.el5_lustre.1.6.7smp #1
RIP: 0010:[<ffffffff8870e3e0>] [<ffffffff8870e3e0>] :obdclass:lustre_hash_for_each_empty+0x1f0/0x290
RSP: 0018:ffff81040dc339e0 EFLAGS: 00000206
RAX: ffff81040ab94a00 RBX: 0000000000001f0e RCX: 0000000000007b38
RDX: 000000000000020a RSI: ffffffff802f0d80 RDI: ffffc20010d5e0ec
RBP: ffffffff88788912 R08: ffff81000101f3e0 R09: 0000000000000000
R10: ffff81040ab920c0 R11: 0000000000000150 R12: ffff81040ab94a00
R13: 0000000000000286 R14: 0000000000000286 R15: ffff81040dc33980
FS: 00002aca2e21d220(0000) GS:ffff81010eb9a4c0(0000) knlGS: 0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00000000efdee288 CR3: 0000000414759000 CR4: 00000000000006e0

Call Trace:
 [<ffffffff88714bf8>] :obdclass:class_disconnect+0x378/0x400
 [<ffffffff8878a5b0>] :ptlrpc:ldlm_cancel_locks_for_export_cb+0x0/0xc0
 [<ffffffff88aa605b>] :mds:mds_disconnect+0x11b/0x310
 [<ffffffff88674598>] :libcfs:libcfs_ip_addr2str+0x38/0x40
 [<ffffffff88712f44>] :obdclass:class_disconnect_export_list+0x5e4/0x780
 [<ffffffff88714f67>] :obdclass:class_disconnect_stale_exports+0x137/0x1a0
 [<ffffffff8879262c>] :ptlrpc:target_abort_recovery+0x19c/0x390
 [<ffffffff887c8e15>] :ptlrpc:lustre_msg_get_opc+0x35/0xf0
 [<ffffffff88aa0d97>] :mds:mds_handle+0x7f7/0x4d30
 [<ffffffff80143809>] __next_cpu+0x19/0x28
 [<ffffffff80143809>] __next_cpu+0x19/0x28
 [<ffffffff800898e3>] find_busiest_group+0x20d/0x621
 [<ffffffff887c85a5>] :ptlrpc:lustre_msg_get_conn_cnt+0x35/0xf0
 [<ffffffff887d0cfa>] :ptlrpc:ptlrpc_server_request_get+0x6a/0x150
 [<ffffffff887d2b7d>] :ptlrpc:ptlrpc_check_req+0x1d/0x110
 [<ffffffff887d5103>] :ptlrpc:ptlrpc_server_handle_request+0xa93/0x1150
 [<ffffffff80062f4b>] thread_return+0x0/0xdf
 [<ffffffff8006d8a2>] do_gettimeofday+0x40/0x8f
 [<ffffffff886787c6>] :libcfs:lcw_update_time+0x16/0x100
 [<ffffffff8003d514>] lock_timer_base+0x1b/0x3c
 [<ffffffff8001c5b0>] __mod_timer+0xb0/0xbe
 [<ffffffff887d85f8>] :ptlrpc:ptlrpc_main+0x1218/0x13e0
 [<ffffffff8008abb9>] default_wake_function+0x0/0xe
 [<ffffffff800b4382>] audit_syscall_exit+0x31b/0x336
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff887d73e0>] :ptlrpc:ptlrpc_main+0x0/0x13e0
 [<ffffffff8005dfa7>] child_rip+0x0/0x11

Lustre: 10663:0:(ldlm_lib.c:541:target_handle_reconnect()) data-MDT0000: 3b041098-4f33-6c6a-6e5c-3af7b34062e8 reconnecting
LustreError: 10663:0:(handler.c:1601:mds_handle()) operation 101 on unconnected MDS from 12345-192.168.64.14@o2ib
LustreError: 10681:0:(ldlm_lib.c:1643:target_send_reply_msg()) @@@ processing error (-107) req@ffff81040962ec00 x52078437/t0 o101-><?>@<?>:0/0 lens 232/0 e 0 to 0 dl 1237557961 ref 1 fl Interpret:/0/0 rc -107/0
LustreError: 10681:0:(ldlm_lib.c:1643:target_send_reply_msg()) Skipped 777 previous similar messages
LustreError: 10663:0:(handler.c:1601:mds_handle()) Skipped 779 previous similar messages
Lustre: 10676:0:(ldlm_lib.c:541:target_handle_reconnect()) data-MDT0000: 2a1f051a-1043-af7c-75c5-65af1bced83b reconnecting
Lustre: 10676:0:(ldlm_lib.c:541:target_handle_reconnect()) Skipped 1 previous similar message
Lustre: 10663:0:(ldlm_lib.c:541:target_handle_reconnect()) data-MDT0000: 99fbab1c-b702-fab2-a584-6c31ac18a5f4 reconnecting
Lustre: 10663:0:(ldlm_lib.c:541:target_handle_reconnect()) Skipped 2 previous similar messages
BUG: soft lockup - CPU#3 stuck for 10s! [ll_mdt_19:10680]
CPU 3:
Modules linked in: mds(U) fsfilt_ldiskfs(U) mgs(U) mgc(U) ldiskfs(U) crc16(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ksocklnd(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) ipmi_devintf(U) mptctl(U) mptbase(U) ipmi_si(U) ipmi_msghandler(U) dell_rbu(U) autofs4(U) hidp(U) rfcomm(U) l2cap(U) bluetooth(U) sunrpc(U) rdma_ucm(U) ib_sdp(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U) ipoib_helper(U) ib_cm(U) ib_sa(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) ib_uverbs(U) ib_umad(U) iw_cxgb3(U) cxgb3(U) ib_ipath(U) mlx4_ib(U) mlx4_core(U) dm_multipath(U) video(U) sbs(U) backlight(U) i2c_ec(U) i2c_core(U) button(U) battery(U) asus_acpi(U) acpi_memhotplug(U) ac(U) parport_pc(U) lp(U) parport(U) pata_acpi(U) joydev(U) ata_piix(U) libata(U) sr_mod(U) ib_mthca(U) ib_mad(U) shpchp(U) ib_core(U) sg(U) pcspkr(U) ide_cd(U) bnx2(U) cdrom(U) serio_raw(U) i5000_edac(U) edac_mc(U) dm_snapshot(U) dm_zero(U) dm_mirror(U) dm_mod(U) usb_storage(U) megaraid_sas(U) sd_mod(U) scsi_mod(U) ext3(U) jbd(U) ehci_hcd(U) ohci_hcd(U) uhci_hcd(U)
Pid: 10680, comm: ll_mdt_19 Tainted: G 2.6.18-92.1.17.el5_lustre.1.6.7smp #1
RIP: 0010:[<ffffffff800649e6>] [<ffffffff800649e6>] _write_lock+0x7/0xf
RSP: 0018:ffff81040dc339d8 EFLAGS: 00000246
RAX: ffff81040826ea00 RBX: 0000000000006227 RCX: 000000000000aee7
RDX: 00000000000002e4 RSI: ffffffff802f0d80 RDI: ffffc20010da127c
RBP: 0000000000000286 R08: ffff81000101f3e0 R09: 0000000000000000
R10: ffff81040826fe00 R11: 0000000000000150 R12: 0000000000000286
R13: ffff81040dc33980 R14: ffff81040826fe58 R15: ffff81040826fe00
FS: 00002aca2e21d220(0000) GS:ffff81010eb9a4c0(0000) knlGS: 0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00000000efdee288 CR3: 0000000414759000 CR4: 00000000000006e0

Call Trace:
 [<ffffffff8870e3fe>] :obdclass:lustre_hash_for_each_empty+0x20e/0x290
 [<ffffffff88714bf8>] :obdclass:class_disconnect+0x378/0x400
 [<ffffffff8878a5b0>] :ptlrpc:ldlm_cancel_locks_for_export_cb+0x0/0xc0
 [<ffffffff88aa605b>] :mds:mds_disconnect+0x11b/0x310
 [<ffffffff88674598>] :libcfs:libcfs_ip_addr2str+0x38/0x40
 [<ffffffff88712f44>] :obdclass:class_disconnect_export_list+0x5e4/0x780
 [<ffffffff88714f67>] :obdclass:class_disconnect_stale_exports+0x137/0x1a0
 [<ffffffff8879262c>] :ptlrpc:target_abort_recovery+0x19c/0x390
 [<ffffffff887c8e15>] :ptlrpc:lustre_msg_get_opc+0x35/0xf0
 [<ffffffff88aa0d97>] :mds:mds_handle+0x7f7/0x4d30
 [<ffffffff80143809>] __next_cpu+0x19/0x28
 [<ffffffff80143809>] __next_cpu+0x19/0x28
 [<ffffffff800898e3>] find_busiest_group+0x20d/0x621
 [<ffffffff887c85a5>] :ptlrpc:lustre_msg_get_conn_cnt+0x35/0xf0
 [<ffffffff887d0cfa>] :ptlrpc:ptlrpc_server_request_get+0x6a/0x150
 [<ffffffff887d2b7d>] :ptlrpc:ptlrpc_check_req+0x1d/0x110
 [<ffffffff887d5103>] :ptlrpc:ptlrpc_server_handle_request+0xa93/0x1150
 [<ffffffff80062f4b>] thread_return+0x0/0xdf
 [<ffffffff8006d8a2>] do_gettimeofday+0x40/0x8f
 [<ffffffff886787c6>] :libcfs:lcw_update_time+0x16/0x100
 [<ffffffff8003d514>] lock_timer_base+0x1b/0x3c
 [<ffffffff8001c5b0>] __mod_timer+0xb0/0xbe
 [<ffffffff887d85f8>] :ptlrpc:ptlrpc_main+0x1218/0x13e0
 [<ffffffff8008abb9>] default_wake_function+0x0/0xe
 [<ffffffff800b4382>] audit_syscall_exit+0x31b/0x336
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff887d73e0>] :ptlrpc:ptlrpc_main+0x0/0x13e0
 [<ffffffff8005dfa7>] child_rip+0x0/0x11

Lustre: data-MDT0000: sending delayed replies to recovered clients
Lustre: data-MDT0000: recovery complete: rc 0
Lustre: MDS data-MDT0000: data-OST0000_UUID now active, resetting orphans
Lustre: MDS data-MDT0000: data-OST0001_UUID now active, resetting orphans
LustreError: 10616:0:(llog_server.c:471:llog_origin_handle_cancel()) Cancel 62 of 124 llog-records failed: -2
LustreError: 10617:0:(llog_server.c:471:llog_origin_handle_cancel()) Cancel 62 of 124 llog-records failed: -2
LustreError: 10668:0:(handler.c:1601:mds_handle()) operation 101 on unconnected MDS from 12345-192.168.64.14@o2ib
LustreError: 10681:0:(ldlm_lib.c:1643:target_send_reply_msg()) @@@ processing error (-107) req@ffff81040c50c400 x52079502/t0 o101-><?>@<?>:0/0 lens 232/0 e 0 to 0 dl 1237557977 ref 1 fl Interpret:/0/0 rc -107/0
LustreError: 10681:0:(ldlm_lib.c:1643:target_send_reply_msg()) Skipped 1058 previous similar messages
LustreError: 10668:0:(handler.c:1601:mds_handle()) Skipped 1060 previous similar messages
LustreError: 10687:0:(handler.c:1601:mds_handle()) operation 101 on unconnected MDS from 12345-192.168.64.14@o2ib
LustreError: 10679:0:(ldlm_lib.c:1643:target_send_reply_msg()) @@@ processing error (-107) req@ffff81042dcecc00 x52082030/t0 o101-><?>@<?>:0/0 lens 232/0 e 0 to 0 dl 1237558009 ref 1 fl Interpret:/0/0 rc -107/0
LustreError: 10679:0:(ldlm_lib.c:1643:target_send_reply_msg()) Skipped 2518 previous similar messages
LustreError: 10687:0:(handler.c:1601:mds_handle()) Skipped 2518 previous similar messages

Thomas Wakefield Jr.
Systems Administrator, COLA/IGES
301-902-1268
twake at cola.iges.org
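For what it's worth, the dm-snapshot message "Invalidating snapshot: Unable to allocate exception." is what the kernel logs when it cannot allocate a new copy-on-write exception, most commonly because the snapshot volume has filled up, and device-mapper then drops the snapshot. A minimal sketch of how to watch for and handle that, using hypothetical LVM names (volume group vg0, snapshot volume mdt_snap):

    # Show how full each snapshot is; the field is snap_percent ("Snap%")
    # on EL5-era LVM2, data_percent on newer releases.
    lvs -o lv_name,lv_size,origin,snap_percent vg0

    # Grow a snapshot that is approaching 100% before it gets invalidated...
    lvextend -L +5G /dev/vg0/mdt_snap

    # ...or remove a snapshot that is no longer needed.
    lvremove /dev/vg0/mdt_snap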
To me it looks like a disk issue. A very strange one at that.

On Mar 20, 2009, at 10:15 AM, Thomas Wakefield wrote:

> My MDS has crashed a couple of times after about 48 hours of running on 1.6.7, and I can't decide if it is a bug or a disk issue. Prior to the upgrade, my uptime was 250+ days, so I think the hardware is stable.
> [...]
> Mar 20 09:51:59 disk0 kernel: device-mapper: snapshots: Invalidating snapshot: Unable to allocate exception.
> Mar 20 09:51:59 disk0 kernel: Buffer I/O error on device dm-1, logical block 22676124
> [...]
> Mar 20 09:51:59 disk0 kernel: LDISKFS-fs error (device dm-1): ldiskfs_get_inode_loc: unable to read inode block - inode=21347250, block=21333880
> [...]
Brian J. Murrell
2009-Mar-20 16:15 UTC
[Lustre-discuss] MDS crashing after upgrade to 1.6.7
On Fri, 2009-03-20 at 10:15 -0400, Thomas Wakefield wrote:

> Mar 20 09:51:59 disk0 kernel: device-mapper: snapshots: Invalidating snapshot: Unable to allocate exception.

This is an LVM (device-mapper, in fact) error. Nothing to do with Lustre.

> Mar 20 09:51:59 disk0 kernel: Buffer I/O error on device dm-1, logical block 22676124
> Mar 20 09:51:59 disk0 kernel: lost page write due to I/O error on dm-1
> Mar 20 09:51:59 disk0 kernel: Buffer I/O error on device dm-1, logical block 22676125
> Mar 20 09:51:59 disk0 kernel: lost page write due to I/O error on dm-1

These are all generic kernel I/O messages, probably related to the prior LVM error.

> Mar 20 09:51:59 disk0 kernel: LDISKFS-fs error (device dm-1): ldiskfs_get_inode_loc: unable to read inode block - inode=21347250, block=21333880

This looks like ldiskfs reporting an error likely caused by the previous errors.

> Then, after a reboot, here is a dmesg BUG that concerned me:

Probably a result of the problems with your disk. I would suggest getting your disk problems fixed, completely, first, then doing an fsck on the MDT to see what damage the disk problems caused the filesystem.

b.
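A minimal sketch of that check-and-repair step, assuming a hypothetical MDT device /dev/vg0/mdt mounted at /mnt/mdt, and the e2fsprogs build shipped for Lustre installed; the check must be run with the MDT unmounted:

    # Stop Lustre on this node so the MDT device is not in use.
    umount /mnt/mdt              # hypothetical mount point

    # Read-only pass first, to gauge how much damage there is:
    e2fsck -fn /dev/vg0/mdt

    # If the report looks sane, run a repairing pass:
    e2fsck -fp /dev/vg0/mdt

Here -p only preens automatically fixable problems; heavier corruption would need an interactive run (or -y) and, ideally, a backup of the MDT first.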
I have been having the same problem on one of the clusters where I upgraded the Lustre servers from 1.6.6 to 1.6.7. In this cluster all the worker nodes/clients run the Lustre 1.6.6 kernel: 2.6.18-92.1.10.el5_lustre.1.6.6smp. I don't get any hardware errors on the Lustre servers, but I get pretty much the same errors in the call traces in /var/log/kernel on the MDT servers and OST servers. This was the last call trace that I got on my MDT:

Mar 20 13:05:40 lustre3 kernel: BUG: soft lockup - CPU#1 stuck for 10s! [ldlm_cn_02:4302]
Mar 20 13:05:40 lustre3 kernel: CPU 1:
Mar 20 13:05:40 lustre3 kernel: Modules linked in: obdfilter(U) ost(U) mds(U) fsfilt_ldiskfs(U) mgs(U) mgc(U) ldiskfs(U) crc16(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) ipmi_devintf(U) ipmi_si(U) ipmi_msghandler(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) autofs4(U) hidp(U) l2cap(U) bluetooth(U) sunrpc(U) vfat(U) fat(U) dm_multipath(U) video(U) sbs(U) backlight(U) i2c_ec(U) button(U) battery(U) asus_acpi(U) acpi_memhotplug(U) ac(U) parport_pc(U) lp(U) parport(U) joydev(U) i2c_nforce2(U) sr_mod(U) shpchp(U) forcedeth(U) i2c_core(U) serio_raw(U) cdrom(U) pcspkr(U) sg(U) dm_snapshot(U) dm_zero(U) dm_mirror(U) dm_mod(U) usb_storage(U) qla2xxx(U) scsi_transport_fc(U) sata_nv(U) pata_acpi(U) libata(U) sd_mod(U) scsi_mod(U) raid1(U) ext3(U) jbd(U) uhci_hcd(U) ohci_hcd(U) ehci_hcd(U)
Mar 20 13:05:40 lustre3 kernel: Pid: 4302, comm: ldlm_cn_02 Tainted: G 2.6.18-92.1.17.el5_lustre.1.6.7smp #1
Mar 20 13:05:40 lustre3 kernel: RIP: 0010:[<ffffffff885eb3a9>] [<ffffffff885eb3a9>] :obdclass:class_handle2object+0xe9/0x160
Mar 20 13:05:40 lustre3 kernel: RSP: 0018:ffff81020720bb90 EFLAGS: 00000216
Mar 20 13:05:40 lustre3 kernel: RAX: ffffc200009be818 RBX: bbdfa43ac476d656 RCX: ffff81000100e8e0
Mar 20 13:05:40 lustre3 kernel: RDX: ffff81026da41a00 RSI: 0000000000000000 RDI: bbdfa43ac476d656
Mar 20 13:05:40 lustre3 kernel: RBP: ffffffff88686959 R08: ffff81025ca06200 R09: 5a5a5a5a5a5a5a5a
Mar 20 13:05:40 lustre3 kernel: R10: 5a5a5a5a5a5a5a5a R11: 5a5a5a5a5a5a5a5a R12: ffff81020720bbf0
Mar 20 13:05:40 lustre3 kernel: R13: ffffffff8866e61f R14: ffffc200040f6220 R15: ffff81025ca06200
Mar 20 13:05:40 lustre3 kernel: FS: 00002ac881c2d220(0000) GS:ffff810107799440(0000) knlGS:0000000000000000
Mar 20 13:05:40 lustre3 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Mar 20 13:05:40 lustre3 kernel: CR2: 00002acecca1d000 CR3: 000000041fe59000 CR4: 00000000000006e0
Mar 20 13:05:40 lustre3 kernel:
Mar 20 13:05:40 lustre3 kernel: Call Trace:
Mar 20 13:05:40 lustre3 kernel: [<ffffffff88656526>] :ptlrpc:ldlm_resource_putref+0x1b6/0x3a0

I am planning on upgrading the worker nodes and clients to the latest Red Hat kernel, 2.6.18-128.1.1.el5, and installing patch-less clients to see if that fixes the problem. I don't have a problem on the other cluster, where the servers are running Lustre 1.6.7 and the clients are running a 2.6.21 kernel.org kernel with patch-less clients.

Nirmal
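As a rough sketch of that plan (the kernel-devel path and version strings below are illustrative, not the exact ones in use here), the versions actually running on a node can be confirmed, and the patch-less client modules are typically built against the unpatched vendor kernel, along these lines:

    # Kernel and Lustre versions in use on a node:
    uname -r
    cat /proc/fs/lustre/version

    # Building patch-less client modules against an unpatched RHEL kernel
    # (requires the matching kernel-devel package):
    cd lustre-1.6.7
    ./configure --with-linux=/usr/src/kernels/2.6.18-128.1.1.el5-x86_64
    make rpms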