Yesterday we had an OST fail. I had to e2fsck it to fix it (although there was some corruption).

All servers are:
RH 5.3 with Lustre 1.8.1.1
Mellanox QDR IB
Fiber-connected storage.

I left the OST mounted yesterday but it was deactivated on the MDS.
[root@mds-0-0 osc]# cat /proc/fs/lustre/osc/scratch-OST000e-osc/active
0

At some point this morning the OSS locked up and had to be rebooted. I couldn't even access it on the console.

I see I/O errors when trying to copy files that exist on that OST (lfs find -O scratch-OST000e_UUID /scratch)

Now we're seeing CPU lockups on the OSS:

Soft lockup - CPU#4 stuck for 10s! [ll_ost_26:10467]
Dec 31 12:12:12 oss-0-0 kernel: BUG: soft lockup - CPU#2 stuck for 10s! [ll_ost_500:12066]
Dec 31 12:12:22 oss-0-0 kernel: BUG: soft lockup - CPU#4 stuck for 10s! [ll_ost_26:10467]
Dec 31 12:12:22 oss-0-0 kernel: BUG: soft lockup - CPU#2 stuck for 10s! [ll_ost_500:12066]
Dec 31 12:12:32 oss-0-0 kernel: BUG: soft lockup - CPU#4 stuck for 10s! [ll_ost_26:10467]
Dec 31 12:12:32 oss-0-0 kernel: BUG: soft lockup - CPU#2 stuck for 10s! [ll_ost_500:12066]
Dec 31 12:12:42 oss-0-0 kernel: BUG: soft lockup - CPU#4 stuck for 10s! [ll_ost_26:10467]
Dec 31 12:12:42 oss-0-0 kernel: BUG: soft lockup - CPU#2 stuck for 10s! [ll_ost_500:12066]

dmesg:

BUG: soft lockup - CPU#2 stuck for 10s! [ll_ost_500:12066]
CPU 2:
Modules linked in: obdfilter(U) fsfilt_ldiskfs(U) mgc(U) ldiskfs(U) crc16(U) ost(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ptlrpc(U) obdclass(U) lvfs(U) ksocklnd(U) ko2iblnd(U) lnet(U) libcfs(U) autofs4(U) ipmi_devintf(U) ipmi_si(U) ipmi_msghandler(U) sunrpc(U) ip_conntrack_netbios_ns(U) ipt_REJECT(U) xt_state(U) ip_conntrack(U) nfnetlink(U) iptable_filter(U) ip_tables(U) ip6t_REJECT(U) xt_tcpudp(U) ip6table_filter(U) ip6_tables(U) x_tables(U) rdma_ucm(U) ib_sdp(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U) ipoib_helper(U) ib_cm(U) ib_sa(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) ib_uverbs(U) ib_umad(U) mlx4_ib(U) ib_mthca(U) ib_mad(U) ib_core(U) dm_mirror(U) dm_multipath(U) scsi_dh(U) video(U) hwmon(U) backlight(U) sbs(U) i2c_ec(U) button(U) battery(U) asus_acpi(U) acpi_memhotplug(U) ac(U) parport_pc(U) lp(U) parport(U) joydev(U) sr_mod(U) cdrom(U) mlx4_core(U) pcspkr(U) igb(U) i2c_i801(U) i2c_core(U) dm_raid45(U) dm_message(U) dm_region_hash(U) dm_log(U) dm_mod(U) dm_mem_cache(U) mppVhba(U) usb_storage(U) qla2xxx(U) scsi_transport_fc(U) ahci(U) libata(U) shpchp(U) aacraid(U) mppUpper(U) sg(U) sd_mod(U) scsi_mod(U) ext3(U) jbd(U) uhci_hcd(U) ohci_hcd(U) ehci_hcd(U)
Pid: 12066, comm: ll_ost_500 Tainted: G 2.6.18-128.7.1.el5_lustre.1.8.1.1 #1
RIP: 0010:[<ffffffff88b3bb64>] [<ffffffff88b3bb64>] :ldiskfs:ldiskfs_find_entry+0x244/0x5c0
RSP: 0018:ffff8106656e9610 EFLAGS: 00000202
RAX: 0000000000000000 RBX: 0000000000000007 RCX: 0000000000000004
RDX: ffff81031a0f3000 RSI: ffff81033a504b4f RDI: ffff8103551591fb
RBP: 0000000000000002 R08: ffff810355159ff8 R09: ffff810355159000
R10: ffff810001014f00 R11: 000000004b3cd2fd R12: ffff81033a5d7550
R13: ffffffff80063b4c R14: ffff8106656e96c8 R15: ffffffff80014fae
FS: 00002aeb511bd220(0000) GS:ffff81010c499240(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00000039f5499a50 CR3: 0000000000201000 CR4: 00000000000006e0

Call Trace:
[<ffffffff80089bdd>] dequeue_task+0x18/0x37
[<ffffffff80063098>] thread_return+0x62/0xfe
[<ffffffff88b3df63>] :ldiskfs:ldiskfs_lookup+0x53/0x290
[<ffffffff800366e8>] __lookup_hash+0x10b/0x130
[<ffffffff80063d2a>] .text.lock.mutex+0x5/0x14
[<ffffffff800e2c9b>] lookup_one_len+0x53/0x61
[<ffffffff88ba61ed>] :obdfilter:filter_fid2dentry+0x42d/0x730
[<ffffffff8026f08b>] __down_trylock+0x44/0x4e
[<ffffffff800647c0>] __down_failed_trylock+0x35/0x3a
[<ffffffff88bbff9b>] :obdfilter:filter_lvbo_init+0x3bb/0x68b
[<ffffffff888da2e6>] :ptlrpc:ldlm_resource_get+0x8f6/0xa50
[<ffffffff88b084a0>] :ost:ost_blocking_ast+0x0/0xaa0
[<ffffffff888f9470>] :ptlrpc:ldlm_server_completion_ast+0x0/0x5d0
[<ffffffff888d0eba>] :ptlrpc:ldlm_lock_create+0xba/0x9f0
[<ffffffff889158d1>] :ptlrpc:lustre_swab_buf+0x81/0x170
[<ffffffff888e1adb>] :ptlrpc:target_queue_recovery_request+0x9b/0x1690
[<ffffffff888f3f30>] :ptlrpc:ldlm_server_glimpse_ast+0x0/0x3b0
[<ffffffff888f3f30>] :ptlrpc:ldlm_server_glimpse_ast+0x0/0x3b0
[<ffffffff888f9470>] :ptlrpc:ldlm_server_completion_ast+0x0/0x5d0
[<ffffffff88b084a0>] :ost:ost_blocking_ast+0x0/0xaa0
[<ffffffff888f75a0>] :ptlrpc:ldlm_handle_enqueue+0x670/0x1210
[<ffffffff889145d8>] :ptlrpc:lustre_msg_check_version_v2+0x8/0x20
[<ffffffff88b107b3>] :ost:ost_handle+0x54b3/0x5a70
[<ffffffff800d73cd>] free_block+0x126/0x143
[<ffffffff88758305>] :lnet:lnet_match_blocked_msg+0x375/0x390
[<ffffffff88919f05>] :ptlrpc:lustre_msg_get_conn_cnt+0x35/0xf0
[<ffffffff80089d8d>] enqueue_task+0x41/0x56
[<ffffffff8891ec1d>] :ptlrpc:ptlrpc_check_req+0x1d/0x110
[<ffffffff88921357>] :ptlrpc:ptlrpc_server_handle_request+0xa97/0x1170
[<ffffffff80063098>] thread_return+0x62/0xfe
[<ffffffff8003d382>] lock_timer_base+0x1b/0x3c
[<ffffffff8001c6fa>] __mod_timer+0xb0/0xbe
[<ffffffff88924e08>] :ptlrpc:ptlrpc_main+0x1218/0x13e0
[<ffffffff8008a3f3>] default_wake_function+0x0/0xe
[<ffffffff800b48f2>] audit_syscall_exit+0x33c/0x357
[<ffffffff8005dfb1>] child_rip+0xa/0x11
[<ffffffff88923bf0>] :ptlrpc:ptlrpc_main+0x0/0x13e0
[<ffffffff8005dfa7>] child_rip+0x0/0x11

On the MDS I see this in dmesg:

BUG: soft lockup - CPU#1 stuck for 10s! [ll_evictor:11659]
CPU 1:
Modules linked in: mds(U) fsfilt_ldiskfs(U) mgs(U) mgc(U) ldiskfs(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ptlrpc(U) obdclass(U) lvfs(U) nfs(U) lockd(U) fscache(U) nfs_acl(U) ksocklnd(U) ko2iblnd(U) lnet(U) libcfs(U) raid10(U) crc16(U) raid1(U) autofs4(U) ipmi_devintf(U) ipmi_si(U) ipmi_msghandler(U) sunrpc(U) ip_conntrack_netbios_ns(U) ipt_REJECT(U) xt_state(U) ip_conntrack(U) nfnetlink(U) iptable_filter(U) ip_tables(U) ip6t_REJECT(U) xt_tcpudp(U) ip6table_filter(U) ip6_tables(U) x_tables(U) cpufreq_ondemand(U) acpi_cpufreq(U) freq_table(U) rdma_ucm(U) ib_sdp(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U) ipoib_helper(U) ib_cm(U) ib_sa(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) ib_uverbs(U) ib_umad(U) mlx4_ib(U) ib_mthca(U) ib_mad(U) ib_core(U) dm_mirror(U) dm_multipath(U) scsi_dh(U) video(U) hwmon(U) backlight(U) sbs(U) i2c_ec(U) button(U) battery(U) asus_acpi(U) acpi_memhotplug(U) ac(U) parport_pc(U) lp(U) parport(U) joydev(U) sr_mod(U) cdrom(U) mlx4_core(U) igb(U) i2c_i801(U) i2c_core(U) pcspkr(U) dm_raid45(U) dm_message(U) dm_region_hash(U) dm_log(U) dm_mod(U) dm_mem_cache(U) mppVhba(U) usb_storage(U) ahci(U) libata(U) mptsas(U) mptscsih(U) scsi_transport_sas(U) mptbase(U) shpchp(U) aacraid(U) mppUpper(U) sg(U) sd_mod(U) scsi_mod(U) ext3(U) jbd(U) uhci_hcd(U) ohci_hcd(U) ehci_hcd(U)
Pid: 11659, comm: ll_evictor Tainted: G 2.6.18-128.7.1.el5_lustre.1.8.1.1 #1
RIP: 0010:[<ffffffff80064aee>] [<ffffffff80064aee>] _write_lock+0x7/0xf
RSP: 0018:ffff81034fcefc78 EFLAGS: 00000246
RAX: 000000000000ffff RBX: 000000000000a09f RCX: 00000000000d0117
RDX: 0000000000000195 RSI: ffffffff802fae80 RDI: ffffc20011f029fc
RBP: 0000000000000286 R08: ffff81000100e8e0 R09: 0000000000000000
R10: ffff8106070f5200 R11: 0000000000000150 R12: 0000000000000286
R13: ffff81034fcefc20 R14: ffff8106070f5258 R15: ffff8106070f5200
FS: 00002af849256220(0000) GS:ffff81010c4994c0(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 000000386ba41900 CR3: 0000000000201000 CR4: 00000000000006e0

Call Trace:
[<ffffffff88830ba7>] :obdclass:lustre_hash_for_each_empty+0x237/0x2b0
[<ffffffff88837ae8>] :obdclass:class_disconnect+0x398/0x420
[<ffffffff88bc75e1>] :mds:mds_disconnect+0x121/0xe40
[<ffffffff8014b87a>] snprintf+0x44/0x4c
[<ffffffff88833994>] :obdclass:class_fail_export+0x384/0x4c0
[<ffffffff88904238>] :ptlrpc:ping_evictor_main+0x4f8/0x7e0
[<ffffffff8008a3f3>] default_wake_function+0x0/0xe
[<ffffffff800b48f2>] audit_syscall_exit+0x33c/0x357
[<ffffffff8005dfb1>] child_rip+0xa/0x11
[<ffffffff88903d40>] :ptlrpc:ping_evictor_main+0x0/0x7e0
[<ffffffff8005dfa7>] child_rip+0x0/0x11

Can anyone shed some light on this?

Thanks
Erik
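When a log fills up with repeated soft-lockup reports like the ones above, it can help to collapse them into a per-thread count before reading the full dumps. The following is a minimal Python sketch, not part of the original post; the function name `summarize_lockups` and the `SAMPLE` list are illustrative, and the regular expression assumes the message layout shown in this thread.

```python
import re
from collections import Counter

# Matches the kernel message body used in this thread, e.g.
#   BUG: soft lockup - CPU#2 stuck for 10s! [ll_ost_500:12066]
# The syslog prefix (date, host, "kernel:") is optional because search() is used.
LOCKUP_RE = re.compile(
    r"soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^:\]]+):(?P<pid>\d+)\]"
)


def summarize_lockups(lines):
    """Count soft-lockup reports per (cpu, thread name, pid)."""
    counts = Counter()
    for line in lines:
        m = LOCKUP_RE.search(line)
        if m:
            counts[(int(m["cpu"]), m["comm"], int(m["pid"]))] += 1
    return counts


# A few lines copied from the report above, used only as sample input.
SAMPLE = [
    "Dec 31 12:12:12 oss-0-0 kernel: BUG: soft lockup - CPU#2 stuck for 10s! [ll_ost_500:12066]",
    "Dec 31 12:12:22 oss-0-0 kernel: BUG: soft lockup - CPU#4 stuck for 10s! [ll_ost_26:10467]",
    "Dec 31 12:12:22 oss-0-0 kernel: BUG: soft lockup - CPU#2 stuck for 10s! [ll_ost_500:12066]",
]

if __name__ == "__main__":
    for (cpu, comm, pid), n in sorted(summarize_lockups(SAMPLE).items()):
        print(f"CPU#{cpu} {comm} (pid {pid}): {n} report(s)")
```

In practice the input would be the lines of the system log rather than `SAMPLE`. Seeing the same two service threads repeat every ten seconds, as here, suggests they are stuck in one place rather than many threads stalling briefly.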
My best guess is that there's something still not right on-disk. This is my $.02, but I'd suggest running another fsck on that OST if you haven't already.

On Dec 31, 2009, at 12:20 PM, Erik Froese wrote:
> Yesterday we had an OST fail. I had to e2fsck it to fix it (although there was some corruption).

_______________________________________________
Lustre-discuss mailing list
Lustre-discuss at lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
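One way to check whether an on-disk problem is plausible is to look at where each stuck thread was executing, which the `RIP:` line of the dump records. The sketch below is an illustration added here, not something from the thread; `rip_symbol` is a hypothetical helper, and its regular expression assumes the RHEL 5 era dump format quoted in the original message. On the OSS dump it yields the `ldiskfs` module and `ldiskfs_find_entry`, reached through `ldiskfs_lookup` in the call trace, so the thread appears to have been stuck in a directory lookup on the OST's backing filesystem.

```python
import re

# Matches lines such as:
#   RIP: 0010:[<ffffffff88b3bb64>] [<ffffffff88b3bb64>] :ldiskfs:ldiskfs_find_entry+0x244/0x5c0
#   RIP: 0010:[<ffffffff80064aee>] [<ffffffff80064aee>] _write_lock+0x7/0xf
# The ":module:" prefix is present only for symbols that live in a loadable module.
RIP_RE = re.compile(r"RIP:.*?\]\s+(?::(?P<module>\w+):)?(?P<symbol>[\w.]+)\+0x")


def rip_symbol(dump_text):
    """Return (module, symbol) from the first RIP line, or None if absent.

    module is None when the symbol belongs to the core kernel.
    """
    m = RIP_RE.search(dump_text)
    if m is None:
        return None
    return (m["module"], m["symbol"])


# RIP lines copied from the two dumps in the original message.
OSS_RIP = ("RIP: 0010:[<ffffffff88b3bb64>] [<ffffffff88b3bb64>] "
           ":ldiskfs:ldiskfs_find_entry+0x244/0x5c0")
MDS_RIP = "RIP: 0010:[<ffffffff80064aee>] [<ffffffff80064aee>] _write_lock+0x7/0xf"

if __name__ == "__main__":
    print("OSS:", rip_symbol(OSS_RIP))
    print("MDS:", rip_symbol(MDS_RIP))
```

The MDS thread, by contrast, was in `_write_lock` while the evictor was disconnecting a client export, which looks more like a side effect of the stalled OSS than a separate fault. That reading is an interpretation of the traces, not something confirmed in the thread.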
I think you're right. I'm re-fscking it right now.

Erik

On Thu, Dec 31, 2009 at 1:29 PM, Aaron Knister <aaron.knister at gmail.com> wrote:
> My best guess is that there's something still not right on-disk. This is my
> $.02, but I'd suggest running another fsck on that OST if you haven't
> already.
After the e2fsck I was able to mount the OST cleanly and the CPU lockups went away.

Thanks!
Erik

On Fri, Jan 1, 2010 at 7:27 PM, Erik Froese <erik.froese at gmail.com> wrote:
> I think you're right. I'm re-fscking it right now.
> Erik
>> >
>> > Thanks
>> > Erik
>> >
>> > _______________________________________________
>> > Lustre-discuss mailing list
>> > Lustre-discuss at lists.lustre.org
>> > http://lists.lustre.org/mailman/listinfo/lustre-discuss