Hendelman, Rob
2008-Dec-18 19:47 UTC
[Lustre-discuss] Watchdog triggered for pid 5383: it was inactive for 100s (+ stack trace)
Is this something to be concerned about? We have quite a few of these. This is on our mgs/mds box.

Our mgs/mds aren't in one filesystem (separate spindles with separate spindles for journals as well), but are on the same box.

Thanks,
Robert

*** Lustre: 0:0:(linux-debug.c:168:libcfs_debug_dumpstack()) showing stack for process 5383
ll_mdt_13 D ffff81052a1b2d40 0 5383 1 5384 5382 (L-TLB)
 ffff810528c256c0 0000000000000046 ffff81000100c400 0000000000000282
 000000000000000a ffff81052dac1100 ffffffff802dcae0 001ca31dee07f389
 0000000000007c00 ffff81052dac12e8 5a5a5a5a00000000 ffff81052f764770
Call Trace:
 [<ffffffff800610a0>] wait_for_completion+0x74/0x9d
 [<ffffffff80088431>] default_wake_function+0x0/0xe
 [<ffffffff80097e35>] call_usermodehelper_keys+0xea/0xff
 [<ffffffff80097e4a>] __call_usermodehelper+0x0/0x4f
 [<ffffffff884065af>] :lvfs:upcall_cache_get_entry+0x5bf/0xa50
 [<ffffffff8009b3c6>] autoremove_wake_function+0x9/0x2e
 [<ffffffff800868b0>] __wake_up_common+0x3e/0x68
 [<ffffffff8851b36b>] :ptlrpc:lustre_msg_string+0x18b/0x2f0
 [<ffffffff887587f5>] :mds:mds_init_ucred+0x95/0xd0
 [<ffffffff8872826f>] :mds:mds_getattr_lock+0x34f/0xc70
 [<ffffffff8858f1c4>] :ksocklnd:ksocknal_alloc_tx+0x1c4/0x270
 [<ffffffff8851d00b>] :ptlrpc:lustre_pack_reply+0x9ab/0xab0
 [<ffffffff88729161>] :mds:mds_intent_policy+0x5d1/0xbe0
 [<ffffffff88424ca7>] :lnet:lnet_prep_send+0x67/0xb0
 [<ffffffff884e9776>] :ptlrpc:ldlm_resource_putref+0x1b6/0x3b0
 [<ffffffff884e6183>] :ptlrpc:ldlm_lock_enqueue+0xf3/0x5c0
 [<ffffffff884e3bbd>] :ptlrpc:ldlm_lock_create+0x98d/0x9c0
 [<ffffffff88506610>] :ptlrpc:ldlm_server_completion_ast+0x0/0x570
 [<ffffffff88502d50>] :ptlrpc:ldlm_handle_enqueue+0xd90/0x1410
 [<ffffffff88506b80>] :ptlrpc:ldlm_server_blocking_ast+0x0/0x690
 [<ffffffff88732c6d>] :mds:mds_handle+0x46dd/0x58ff
 [<ffffffff88481c82>] :obdclass:class_handle2object+0xd2/0x160
 [<ffffffff8851d230>] :ptlrpc:lustre_swab_ptlrpc_body+0x0/0x90
 [<ffffffff8851ade5>] :ptlrpc:lustre_swab_buf+0xc5/0xf0
 [<ffffffff88522a3b>] :ptlrpc:ptlrpc_server_handle_request+0xb0b/0x1270
 [<ffffffff80060f29>] thread_return+0x0/0xeb
 [<ffffffff8006b6c9>] do_gettimeofday+0x50/0x92
 [<ffffffff883db056>] :libcfs:lcw_update_time+0x16/0x100
 [<ffffffff8003ce86>] lock_timer_base+0x1b/0x3c
 [<ffffffff8852547c>] :ptlrpc:ptlrpc_main+0x7dc/0x950
 [<ffffffff80088431>] default_wake_function+0x0/0xe
 [<ffffffff8005bfb1>] child_rip+0xa/0x11
 [<ffffffff88524ca0>] :ptlrpc:ptlrpc_main+0x0/0x950
 [<ffffffff8005bfa7>] child_rip+0x0/0x11
Andreas Dilger
2008-Dec-22 08:15 UTC
[Lustre-discuss] Watchdog triggered for pid 5383: it was inactive for 100s (+ stack trace)
On Dec 18, 2008 13:47 -0600, Hendelman, Rob wrote:
> Is this something to be concerned about? We have quite a few of these. This is on our mgs/mds box.
>
> Our mgs/mds aren't in one filesystem (separate spindles with separate spindles for journals as well), but are on the same box.
>
> Lustre: 0:0:(linux-debug.c:168:libcfs_debug_dumpstack()) showing stack for
> process 5383

You missed some important messages above this that explain why this was hit.

> Call Trace:
> [<ffffffff80097e35>] call_usermodehelper_keys+0xea/0xff
> [<ffffffff80097e4a>] __call_usermodehelper+0x0/0x4f
> [<ffffffff884065af>] :lvfs:upcall_cache_get_entry+0x5bf/0xa50

This implies that you are using something like LDAP for users/groups on the MDS, and it can't reply in a timely manner (e.g. within several seconds). You can tune this to put less load on your LDAP server by increasing /proc/fs/lustre/mds/myth-MDT0000/group_expire_interval (number of seconds to refresh a user->group mapping, default 600s).

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
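For anyone who wants to script the tuning suggested in the reply above, here is a minimal sketch in Python. It rests on several assumptions: the tunable is taken to read and write as a plain integer number of seconds, the helper names `read_interval` and `raise_interval` are made up for illustration, and "myth-MDT0000" is only the example target name from the reply, so the MDT name on your own MDS will differ. The value 1800 in the usage note is an arbitrary example, not a recommendation.

```python
from pathlib import Path

# Path quoted in the reply above. "myth-MDT0000" is that poster's example
# target name; substitute the MDT name found on your own MDS.
EXAMPLE_PATH = "/proc/fs/lustre/mds/myth-MDT0000/group_expire_interval"


def read_interval(path):
    """Return the value (in seconds) currently stored in the tunable file.

    Assumes the file contains a single integer, possibly followed by a newline.
    """
    return int(Path(path).read_text().split()[0])


def raise_interval(path, new_seconds):
    """Write new_seconds to the tunable only if it exceeds the current value.

    Returns the value in effect afterwards, so the caller can log it.
    """
    current = read_interval(path)
    if new_seconds <= current:
        # Never lower the interval: a lower value means more frequent lookups.
        return current
    Path(path).write_text(f"{new_seconds}\n")
    return read_interval(path)
```

Run as root on the MDS, a call such as `raise_interval(EXAMPLE_PATH, 1800)` would lengthen the refresh interval from the 600s default mentioned above to 30 minutes. Because the helper takes the path as a parameter, it can be tried against an ordinary file first.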