We are having a problem with an MDS server (which also has 1 OST on the box).

When the server boots up, we notice there is an ll_mdt process running at 100%
and we keep waiting close to 10-15 minutes. We only have 8 clients. (I assume
this is the normal recovery process.) However, if I manually mount the MDT
without any recovery, everything is fine:

mount -t lustre /dev/foo -o abort_recov /mnt/lustre

BUT the server crashes again after 18-24 hours. I am trying to get to the
bottom of this crash. I am not sure what's causing the problem, and hopefully
I am just doing something foolish. There are 2 OSTs connecting to this MDS.

MDS server version: Redhat 5.1-1.2, running 2.6.18-92.1.17.el5_lustre.1.6.7smp

cat /proc/fs/lustre/version
lustre: 1.6.7
kernel: patchless_client
build:  1.6.7-19691231170000-PRISTINE-.cache.build.BUILD.lustre-kernel-2.6.18.lustre.linux-2.6.18-92.1.17.el5_lustre.1.6.7smp

client# lfs check mds
lfs002-MDT0000-mdc-ffff81102ac40000 active.
lfs002-MDT0000-mdc-ffff810fd264bc00 active.

client# lfs check osts
lfs002-OST0000-osc-ffff81102ac40000 active.
lfs002-OST0001-osc-ffff81102ac40000 active.
lfs002-OST0000-osc-ffff810fd264bc00 active.
lfs002-OST0001-osc-ffff810fd264bc00 active.
lfs002-OST0002-osc-ffff810fd264bc00 active.
lfs002-OST0003-osc-ffff810fd264bc00 active.
lfs002-OST0004-osc-ffff810fd264bc00 active.
lfs002-OST0005-osc-ffff810fd264bc00 active.

mds# lctl dl
  0 UP mgs MGS MGS 25
  1 UP mgc MGC141.128.90.153@tcp b6d875c0-6b30-5a2d-92d3-600ef3324c50 5
  2 UP mdt MDS MDS_uuid 3
  3 UP lov lfs002-mdtlov lfs002-mdtlov_UUID 4
  4 UP mds lfs002-MDT0000 lfs002-MDT0000_UUID 21
  5 UP osc lfs002-OST0000-osc lfs002-mdtlov_UUID 5
  6 UP osc lfs002-OST0001-osc lfs002-mdtlov_UUID 5
  7 UP ost OSS OSS_uuid 3
  8 UP obdfilter lfs002-OST0001 lfs002-OST0001_UUID 23

The clients are running Redhat 5.2 with kernel 2.6.18-92.1.10.el5:

cat /proc/fs/lustre/version
lustre: 1.6.6
kernel: patchless
build:  1.6.6-19691231190000-PRISTINE-.usr.src.linux-2.6.18-92.1.10.el5

This is what shows up in the MDS syslog (protected_host_01) when the problem hits:

Mar 12 10:11:02 protected_host_01 kernel: Pid: 10375, comm: ll_mdt_10 Tainted: G      2.6.18-92.1.17.el5_lustre.1.6.7smp #1
Mar 12 10:11:02 protected_host_01 kernel: RIP: 0010:[<ffffffff888ed8df>]  [<ffffffff888ed8df>] :ldiskfs:do_split+0x3ef/0x560
Mar 12 10:11:02 protected_host_01 kernel: RSP: 0018:ffff8103d2a5f460  EFLAGS: 00000216
Mar 12 10:11:02 protected_host_01 kernel: RAX: 0000000000000000 RBX: 0000000000000080 RCX: 0000000000000000
Mar 12 10:11:02 protected_host_01 kernel: RDX: 0000000000000080 RSI: ffff8103cd52177c RDI: ffff8103cd52176c
Mar 12 10:11:02 protected_host_01 kernel: RBP: ffffffff8000b071 R08: ffff8103cd5216ec R09: 00000000010a0014
Mar 12 10:11:02 protected_host_01 kernel: R10: 00007a6700000008 R11: 00007a672e767363 R12: 000000000064dc69
Mar 12 10:11:02 protected_host_01 kernel: R13: ffffffff80019496 R14: ffff81040ed0f4c0 R15: 0000000000000000
Mar 12 10:11:02 protected_host_01 kernel: FS:  00002b7545c3b220(0000) GS:ffff81042fea79c0(0000) knlGS:0000000000000000
Mar 12 10:11:02 protected_host_01 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Mar 12 10:11:02 protected_host_01 kernel: CR2: 0000003d222c5cb0 CR3: 0000000000201000 CR4: 00000000000006e0
Mar 12 10:11:02 protected_host_01 kernel:
Mar 12 10:11:02 protected_host_01 kernel: Call Trace:
Mar 12 10:11:02 protected_host_01 kernel:  [<ffffffff888ee3b5>] :ldiskfs:ldiskfs_add_entry+0x4f5/0x980
Mar 12 10:11:02 protected_host_01 kernel:  [<ffffffff88034f74>] :jbd:journal_dirty_metadata+0x1b5/0x1e3
Mar 12 10:11:02 protected_host_01 kernel:  [<ffffffff889a6840>] :mds:mds_get_parent_child_locked+0x750/0x8e0
Mar 12 10:11:02 protected_host_01 kernel:  [<ffffffff888eee56>] :ldiskfs:ldiskfs_add_nondir+0x26/0x90
Mar 12 10:11:02 protected_host_01 kernel:  [<ffffffff888ef776>] :ldiskfs:ldiskfs_create+0xf6/0x140
Mar 12 10:11:02 protected_host_01 kernel:  [<ffffffff8896f412>] :fsfilt_ldiskfs:fsfilt_ldiskfs_start+0x562/0x630
Mar 12 10:11:02 protected_host_01 kernel:  [<ffffffff8003a075>] vfs_create+0xe6/0x158
Mar 12 10:11:02 protected_host_01 kernel:  [<ffffffff889c7140>] :mds:mds_open+0x14b0/0x317e
Mar 12 10:11:02 protected_host_01 kernel:  [<ffffffff8002e15a>] __wake_up+0x38/0x4f
Mar 12 10:11:02 protected_host_01 kernel:  [<ffffffff8876c241>] :ksocklnd:ksocknal_queue_tx_locked+0x4f1/0x550
Mar 12 10:11:02 protected_host_01 kernel:  [<ffffffff8876d47f>] :ksocklnd:ksocknal_launch_packet+0x2df/0x3d0
Mar 12 10:11:02 protected_host_01 kernel:  [<ffffffff889a1f49>] :mds:mds_reint_rec+0x1d9/0x2b0
Mar 12 10:11:02 protected_host_01 kernel:  [<ffffffff889cad82>] :mds:mds_open_unpack+0x312/0x430
Mar 12 10:11:02 protected_host_01 kernel:  [<ffffffff88994d4a>] :mds:mds_reint+0x35a/0x420
Mar 12 10:11:02 protected_host_01 kernel:  [<ffffffff889934db>] :mds:fixup_handle_for_resent_req+0x25b/0x2c0
Mar 12 10:11:02 protected_host_01 kernel:  [<ffffffff88998dfc>] :mds:mds_intent_policy+0x48c/0xc30
Mar 12 10:11:02 protected_host_01 kernel:  [<ffffffff886ab526>] :ptlrpc:ldlm_resource_putref+0x1b6/0x3a0
Mar 12 10:11:02 protected_host_01 kernel:  [<ffffffff886a8d18>] :ptlrpc:ldlm_lock_enqueue+0x188/0x990
Mar 12 10:11:02 protected_host_01 kernel:  [<ffffffff886c36ff>] :ptlrpc:ldlm_export_lock_get+0x6f/0xe0
Mar 12 10:11:02 protected_host_01 kernel:  [<ffffffff8862c688>] :obdclass:lustre_hash_add+0x208/0x2d0
Mar 12 10:11:02 protected_host_01 kernel:  [<ffffffff886cc2a0>] :ptlrpc:ldlm_server_blocking_ast+0x0/0x833
Mar 12 10:11:02 protected_host_01 kernel:  [<ffffffff886ca3f9>] :ptlrpc:ldlm_handle_enqueue+0xc09/0x1200
Mar 12 10:11:02 protected_host_01 kernel:  [<ffffffff8899d615>] :mds:mds_handle+0x4075/0x4d30
Mar 12 10:11:02 protected_host_01 kernel:  [<ffffffff800d40d5>] cache_flusharray+0x2f/0xa3
Mar 12 10:11:02 protected_host_01 kernel:  [<ffffffff80143809>] __next_cpu+0x19/0x28
Mar 12 10:11:02 protected_host_01 kernel:  [<ffffffff80143809>] __next_cpu+0x19/0x28
Mar 12 10:11:02 protected_host_01 kernel:  [<ffffffff800898e3>] find_busiest_group+0x20d/0x621
Mar 12 10:11:02 protected_host_01 kernel:  [<ffffffff886e65a5>] :ptlrpc:lustre_msg_get_conn_cnt+0x35/0xf0
Mar 12 10:11:02 protected_host_01 kernel:  [<ffffffff886eecfa>] :ptlrpc:ptlrpc_server_request_get+0x6a/0x150
Mar 12 10:11:02 protected_host_01 kernel:  [<ffffffff886f0b7d>] :ptlrpc:ptlrpc_check_req+0x1d/0x110
Mar 12 10:11:02 protected_host_01 kernel:  [<ffffffff886f3103>] :ptlrpc:ptlrpc_server_handle_request+0xa93/0x1150
Mar 12 10:11:02 protected_host_01 kernel:  [<ffffffff80062f4b>] thread_return+0x0/0xdf
Mar 12 10:11:02 protected_host_01 kernel:  [<ffffffff8006d8a2>] do_gettimeofday+0x40/0x8f
Mar 12 10:11:02 protected_host_01 kernel:  [<ffffffff885967c6>] :libcfs:lcw_update_time+0x16/0x100
Mar 12 10:11:02 protected_host_01 kernel:  [<ffffffff800891f6>] __wake_up_common+0x3e/0x68
Mar 12 10:11:02 protected_host_01 kernel:  [<ffffffff886f65f8>] :ptlrpc:ptlrpc_main+0x1218/0x13e0
Mar 12 10:11:02 protected_host_01 kernel:  [<ffffffff8008abb9>] default_wake_function+0x0/0xe
Mar 12 10:11:02 protected_host_01 kernel:  [<ffffffff800b4382>] audit_syscall_exit+0x31b/0x336
Mar 12 10:11:02 protected_host_01 kernel:  [<ffffffff8005dfb1>] child_rip+0xa/0x11
Mar 12 10:11:02 protected_host_01 kernel:  [<ffffffff886f53e0>] :ptlrpc:ptlrpc_main+0x0/0x13e0
Mar 12 10:11:02 protected_host_01 kernel:  [<ffffffff8005dfa7>] child_rip+0x0/0x11
Mar 12 10:11:02 protected_host_01 kernel:
Mar 12 10:17:06 protected_host_01 kernel: BUG: soft lockup - CPU#6 stuck for 10s! [ll_mdt_10:10375]
Mar 12 10:17:06 protected_host_01 kernel: CPU 6:
Mar 12 10:17:06 protected_host_01 kernel: Modules linked in: obdfilter(U) ost(U) mds(U) fsfilt_ldiskfs(U) mgs(U) mgc(U) ldiskfs(U) crc16(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) mptctl(U) ipmi_devintf(U) ipmi_si(U) ipmi_msghandler(U) nfsd(U) exportfs(U) auth_rpcgss(U) nfs(U) lockd(U) fscache(U) nfs_acl(U) autofs4(U) sunrpc(U) bonding(U) dm_round_robin(U) dm_multipath(U) video(U) sbs(U) backlight(U) i2c_ec(U) i2c_core(U) button(U) battery(U) asus_acpi(U) acpi_memhotplug(U) ac(U) parport_pc(U) lp(U) parport(U) sg(U) pata_acpi(U) lpfc(U) ide_cd(U) bnx2(U) e1000e(U) cdrom(U) shpchp(U) scsi_transport_fc(U) hpwdt(U) i5000_edac(U) edac_mc(U) pcspkr(U) serio_raw(U) dm_snapshot(U) dm_zero(U) dm_mirror(U) dm_mod(U) usb_storage(U) ata_piix(U) sata_nv(U) libata(U) mptsas(U) scsi_transport_sas(U) mptspi(U) mptscsih(U) scsi_transport_spi(U) mptbase(U) cciss(U) sd_mod(U) scsi_mod(U) ext3(U) jbd(U) ehci_hcd(U) ohci_hcd(U) uhci_hcd(U)
Mar 12 10:17:06 protected_host_01 kernel: Pid: 10375, comm: ll_mdt_10 Tainted: G      2.6.18-92.1.17.el5_lustre.1.6.7smp #1
Mar 12 10:17:06 protected_host_01 kernel: RIP: 0010:[<ffffffff888ed8f0>]  [<ffffffff888ed8f0>] :ldiskfs:do_split+0x400/0x560
Mar 12 10:17:06 protected_host_01 kernel: RSP: 0018:ffff8103d2a5f460  EFLAGS: 00000246
Mar 12 10:17:06 protected_host_01 kernel: RAX: 0000000000000000 RBX: 0000000000000080 RCX: 0000000000000000
Mar 12 10:17:06 protected_host_01 kernel: RDX: 0000000000000080 RSI: ffff8103cd52177c RDI: ffff8103cd52176c
Mar 12 10:17:06 protected_host_01 kernel: RBP: ffffffff8000b071 R08: ffff8103cd5216ec R09: 00000000010a0014
Mar 12 10:17:06 protected_host_01 kernel: R10: 00007a6700000008 R11: 00007a672e767363 R12: 000000000064dc69
Mar 12 10:17:06 protected_host_01 kernel: R13: ffffffff80019496 R14: ffff81040ed0f4c0 R15: 0000000000000000
Mar 12 10:17:06 protected_host_01 kernel: FS:  00002b7545c3b220(0000) GS:ffff81042fea79c0(0000) knlGS:0000000000000000
Mar 12 10:17:06 protected_host_01 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Mar 12 10:17:06 protected_host_01 kernel: CR2: 0000003d222c5cb0 CR3: 0000000000201000 CR4: 00000000000006e0
Mar 12 10:17:06 protected_host_01 kernel:
Mar 12 10:17:06 protected_host_01 kernel: Call Trace:
Mar 12 10:17:06 protected_host_01 kernel:  [<ffffffff888ee3b5>] :ldiskfs:ldiskfs_add_entry+0x4f5/0x980
Mar 12 10:17:06 protected_host_01 kernel:  [<ffffffff88034f74>] :jbd:journal_dirty_metadata+0x1b5/0x1e3
Mar 12 10:17:06 protected_host_01 kernel:  [<ffffffff889a6840>] :mds:mds_get_parent_child_locked+0x750/0x8e0
Mar 12 10:17:06 protected_host_01 kernel:  [<ffffffff888eee56>] :ldiskfs:ldiskfs_add_nondir+0x26/0x90
Mar 12 10:17:06 protected_host_01 kernel:  [<ffffffff888ef776>] :ldiskfs:ldiskfs_create+0xf6/0x140
Mar 12 10:17:06 protected_host_01 kernel:  [<ffffffff8896f412>] :fsfilt_ldiskfs:fsfilt_ldiskfs_start+0x562/0x630
Mar 12 10:17:06 protected_host_01 kernel:  [<ffffffff8003a075>] vfs_create+0xe6/0x158
Mar 12 10:17:06 protected_host_01 kernel:  [<ffffffff889c7140>] :mds:mds_open+0x14b0/0x317e
Mar 12 10:17:06 protected_host_01 kernel:  [<ffffffff8002e15a>] __wake_up+0x38/0x4f
Mar 12 10:17:06 protected_host_01 kernel:  [<ffffffff8876c241>] :ksocklnd:ksocknal_queue_tx_locked+0x4f1/0x550
Mar 12 10:17:06 protected_host_01 kernel:  [<ffffffff8876d47f>] :ksocklnd:ksocknal_launch_packet+0x2df/0x3d0
Mar 12 10:17:06 protected_host_01 kernel:  [<ffffffff889a1f49>] :mds:mds_reint_rec+0x1d9/0x2b0
Mar 12 10:17:06 protected_host_01 kernel:  [<ffffffff889cad82>] :mds:mds_open_unpack+0x312/0x430
Mar 12 10:17:06 protected_host_01 kernel:  [<ffffffff88994d4a>] :mds:mds_reint+0x35a/0x420
Mar 12 10:17:06 protected_host_01 kernel:  [<ffffffff889934db>] :mds:fixup_handle_for_resent_req+0x25b/0x2c0
Mar 12 10:17:06 protected_host_01 kernel:  [<ffffffff88998dfc>] :mds:mds_intent_policy+0x48c/0xc30
Mar 12 10:17:06 protected_host_01 kernel:  [<ffffffff886ab526>] :ptlrpc:ldlm_resource_putref+0x1b6/0x3a0
Mar 12 10:17:06 protected_host_01 kernel:  [<ffffffff886a8d18>] :ptlrpc:ldlm_lock_enqueue+0x188/0x990
Mar 12 10:17:06 protected_host_01 kernel:  [<ffffffff886c36ff>] :ptlrpc:ldlm_export_lock_get+0x6f/0xe0
Mar 12 10:17:06 protected_host_01 kernel:  [<ffffffff8862c688>] :obdclass:lustre_hash_add+0x208/0x2d0
Mar 12 10:17:06 protected_host_01 kernel:  [<ffffffff886cc2a0>] :ptlrpc:ldlm_server_blocking_ast+0x0/0x833
Mar 12 10:17:06 protected_host_01 kernel:  [<ffffffff886ca3f9>] :ptlrpc:ldlm_handle_enqueue+0xc09/0x1200
Mar 12 10:17:06 protected_host_01 kernel:  [<ffffffff8899d615>] :mds:mds_handle+0x4075/0x4d30
Mar 12 10:17:06 protected_host_01 kernel:  [<ffffffff800d40d5>] cache_flusharray+0x2f/0xa3
Mar 12 10:17:06 protected_host_01 kernel:  [<ffffffff80143809>] __next_cpu+0x19/0x28
Mar 12 10:17:06 protected_host_01 kernel:  [<ffffffff80143809>] __next_cpu+0x19/0x28
Mar 12 10:17:06 protected_host_01 kernel:  [<ffffffff800898e3>] find_busiest_group+0x20d/0x621
Mar 12 10:17:06 protected_host_01 kernel:  [<ffffffff886e65a5>] :ptlrpc:lustre_msg_get_conn_cnt+0x35/0xf0
Mar 12 10:17:06 protected_host_01 kernel:  [<ffffffff886eecfa>] :ptlrpc:ptlrpc_server_request_get+0x6a/0x150
Mar 12 10:17:06 protected_host_01 kernel:  [<ffffffff886f0b7d>] :ptlrpc:ptlrpc_check_req+0x1d/0x110
Mar 12 10:17:06 protected_host_01 kernel:  [<ffffffff886f3103>] :ptlrpc:ptlrpc_server_handle_request+0xa93/0x1150
Mar 12 10:17:06 protected_host_01 kernel:  [<ffffffff80062f4b>] thread_return+0x0/0xdf
Mar 12 10:17:06 protected_host_01 kernel:  [<ffffffff8006d8a2>] do_gettimeofday+0x40/0x8f
Mar 12 10:17:06 protected_host_01 kernel:  [<ffffffff885967c6>] :libcfs:lcw_update_time+0x16/0x100
Mar 12 10:17:06 protected_host_01 kernel:  [<ffffffff800891f6>] __wake_up_common+0x3e/0x68
Mar 12 10:17:06 protected_host_01 kernel:  [<ffffffff886f65f8>] :ptlrpc:ptlrpc_main+0x1218/0x13e0
Mar 12 10:17:06 protected_host_01 kernel:  [<ffffffff8008abb9>] default_wake_function+0x0/0xe
Mar 12 10:17:06 protected_host_01 kernel:  [<ffffffff800b4382>] audit_syscall_exit+0x31b/0x336
Mar 12 10:17:06 protected_host_01 kernel:  [<ffffffff8005dfb1>] child_rip+0xa/0x11
Mar 12 10:17:06 protected_host_01 kernel:  [<ffffffff886f53e0>] :ptlrpc:ptlrpc_main+0x0/0x13e0
Mar 12 10:17:06 protected_host_01 kernel:  [<ffffffff8005dfa7>] child_rip+0x0/0x11

Any thoughts?

TIA
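A quick way to tell whether that 10-15 minute wait really is recovery is to
watch the MDT's recovery state from /proc while the ll_mdt thread is busy.
This is only an illustrative sketch, assuming the standard 1.6 proc layout
and the lfs002 fsname shown above:

  mds# cat /proc/fs/lustre/mds/lfs002-MDT0000/recovery_status
  # shows the recovery state and client counts for the MDT
  mds# top -b -n 1 | grep ll_mdt
  # confirms which ll_mdt_* service thread is actually burning the CPU

If recovery_status already reports the recovery as complete while the thread
keeps spinning, the wait is probably not ordinary recovery.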
On Saturday 14 March 2009, Mag Gam wrote:
> We are having a problem with an MDS server (which also has 1 OST on the
> box).
>
> When the server boots up, we notice there is an ll_mdt process running
> at 100% and we keep waiting close to 10-15 mins. We only have 8
> clients. (I assume this is the normal recovery process.) However, if I
> manually mount the MDT without any recovery, everything is fine.

Hmm, I have seen that with 1.6.4.3 and NFS exports. But that should be fixed
in 1.6.5. Although I'm not sure, since we switched our NFS exports to unfs3
after the problem came up.

> Mar 12 10:11:02 protected_host_01 kernel: Pid: 10375, comm: ll_mdt_10
> Tainted: G      2.6.18-92.1.17.el5_lustre.1.6.7smp #1
> Mar 12 10:11:02 protected_host_01 kernel: RIP: 0010:[<ffffffff888ed8df>]
> [<ffffffff888ed8df>] :ldiskfs:do_split+0x3ef/0x560
> Mar 12 10:11:02 protected_host_01 kernel: RSP: 0018:ffff8103d2a5f460
> EFLAGS: 00000216
> Mar 12 10:11:02 protected_host_01 kernel: RAX: 0000000000000000 RBX:
> 0000000000000080 RCX: 0000000000000000
> Mar 12 10:11:02 protected_host_01 kernel: RDX: 0000000000000080 RSI:
> ffff8103cd52177c RDI: ffff8103cd52176c

Any chance you can send traces with line wrap disabled? With line wrapping
it is quite hard to read.

Cheers,
Bernd
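Whether the knfsd case Bernd describes applies can be checked on any node
that might be re-exporting a Lustre client mount over NFS; the commands
below are only a sketch and the paths are examples:

  client# grep lustre /proc/mounts     # where Lustre is mounted on this node
  client# exportfs -v                  # what the kernel nfsd currently exports
  client# cat /etc/exports             # what is configured for export
  client# service nfs status           # whether knfsd is running at all

If none of the exports point at a Lustre mount point, that particular bug is
unlikely to be involved.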
Hey Bernd:

Thanks for the reply.

Interesting. We are using NFS too. Is there something in particular we need
to do, like "enable port 988 in /etc/modules.conf", which I think I am
already doing?

> Any chance you can send traces with line wrap disabled? With line wrapping
> it is quite hard to read.

Of course! I even posted a bug report with the /tmp/lustre.log attached:
https://bugzilla.lustre.org/show_bug.cgi?id=18802

Let me know if you need anything else.

TIA

On Sat, Mar 14, 2009 at 7:35 AM, Bernd Schubert
<bernd.schubert at fastmail.fm> wrote:
> Hmm, I have seen that with 1.6.4.3 and NFS exports. But that should be fixed
> in 1.6.5. Although I'm not sure, since we switched our NFS exports to unfs3
> after the problem came up.
>
> Any chance you can send traces with line wrap disabled? With line wrapping
> it is quite hard to read.
>
> Cheers,
> Bernd
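For reference, port 988 is simply the default LNET acceptor port, so there is
usually nothing to "enable" beyond the normal LNET options line in the module
configuration (typically /etc/modprobe.conf on RHEL 5). A typical entry might
look like the following; the network type and interface name are placeholders
for whatever the site actually uses:

  options lnet networks=tcp0(eth0)
  # accept_port only needs to be set if you want something other than 988:
  # options lnet accept_port=988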
This happened again :-(

Basically, there is a process called "ll_mdt30" which is taking up 100% of
the CPU. I am not sure what it's doing, but I can't even reboot the system;
I have to hard reboot.

Also, I checked my other OSTs and MDS and I don't have anything special for
NFS in /etc/modules.conf.

On Sat, Mar 14, 2009 at 8:35 AM, Mag Gam <magawake at gmail.com> wrote:
> Hey Bernd:
>
> Thanks for the reply.
>
> Interesting. We are using NFS too. Is there something in particular we need
> to do, like "enable port 988 in /etc/modules.conf", which I think I am
> already doing?
>
>> Any chance you can send traces with line wrap disabled? With line wrapping
>> it is quite hard to read.
>
> Of course! I even posted a bug report with the /tmp/lustre.log attached:
> https://bugzilla.lustre.org/show_bug.cgi?id=18802
>
> Let me know if you need anything else.
>
> TIA
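Before the next hard reboot it may be worth capturing what the spinning
thread is doing, so there is something to attach to the bugzilla ticket.
A rough sketch (the output file name is arbitrary):

  mds# echo t > /proc/sysrq-trigger      # dump all kernel task stacks to syslog
  mds# lctl dk /tmp/lustre-debug.log     # dump the Lustre kernel debug buffer

Both of those could be attached to bug 18802.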
Hello Mag,

sorry for my late reply. I think there is a misunderstanding. The bug I'm
talking about only applies if you export Lustre via knfsd. It does not matter
whether you run any other NFS services on your MDS/OSS system. But if you do
export Lustre over NFS using the kernel nfs daemon, try disabling that.

Cheers,
Bernd

On Sunday 15 March 2009, Mag Gam wrote:
> This happened again :-(
>
> Basically, there is a process called "ll_mdt30" which is taking up 100% of
> the CPU. I am not sure what it's doing, but I can't even reboot the system;
> I have to hard reboot.
>
> Also, I checked my other OSTs and MDS and I don't have anything special for
> NFS in /etc/modules.conf.
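If a node does turn out to re-export the Lustre mount through knfsd,
disabling that export would look roughly like this; the export path is a
placeholder:

  # remove the Lustre entry from /etc/exports, then either unexport everything:
  exportfs -ua
  # or stop the kernel nfs server outright if nothing else needs it:
  service nfs stop
  chkconfig nfs off     # keep it from coming back at boot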
Hey Bernd:

Thanks for the response.

I have a bigger problem now. My ll_mdt is always at 100%, even if I mount my
MDS with -o abort_recov. I am not sure what to do to get my filesystem back
on track. Any ideas? I am getting kind of desperate now :-(

On Sun, Mar 15, 2009 at 9:22 AM, Bernd Schubert
<bs_lists at aakef.fastmail.fm> wrote:
> Hello Mag,
>
> sorry for my late reply. I think there is a misunderstanding. The bug I'm
> talking about only applies if you export Lustre via knfsd. It does not matter
> whether you run any other NFS services on your MDS/OSS system. But if you do
> export Lustre over NFS using the kernel nfs daemon, try disabling that.
>
> Cheers,
> Bernd
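Both soft-lockup traces spin inside :ldiskfs:do_split, i.e. while splitting a
directory index on the MDT, so one option beyond what was suggested in the
thread, offered only as a hedged sketch, is a read-only filesystem check of
the unmounted MDT with the ldiskfs-aware e2fsprogs. The device name is a
placeholder:

  mds# e2fsck -fn /dev/foo   # read-only pass; reports problems, changes nothing
  mds# e2fsck -f /dev/foo    # interactive repair pass, only if -n reported errors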