stumped
2008-Apr-30 14:53 UTC
[Lustre-discuss] How long should an MDS failover event block
Greetings,

I'm using 1.6.4.3 and have 2 MDS servers with heartbeat. I have 2 OSSes
(each with one OST). A client mounts the filesystem with

  mount -t lustre mds01@tcp0:mds02@tcp0:/i3_lfs3 /mnt/lustre/i3_lfs3

and everything is great. I then run this on the client:

  while true; do echo "----------- $(date)"; touch $(date +%s); sleep 2; done

I then turn off the power on mds01, and the output of the above script is:

  ----------- Wed Apr 30 08:31:29 MDT 2008
  ----------- Wed Apr 30 08:31:31 MDT 2008
  ----------- Wed Apr 30 08:34:53 MDT 2008
  ----------- Wed Apr 30 08:38:15 MDT 2008
  ----------- Wed Apr 30 08:38:17 MDT 2008

In other words, my client blocks for almost 7 minutes. Is this what I should
expect? Can I reasonably shrink this time? Since some of the time is spent with
the MDT being "recovered" - as shown by

  cat /proc/fs/lustre/mds/i3_lfs3-MDT0000/recovery_status

can I expect this time to grow as the filesystem holds more data (eventually
it'll be about 700TB)?

Also, the MDS gives up the stack trace below; it doesn't look good; should I be
worried? As I said, after 7 minutes or so everything seems to be working
(though I'm not really doing much with the filesystem yet).

Thanks,
JR

Apr 30 08:36:33 mds02 kernel: Lustre: 30256:0:(ldlm_lib.c:747:target_handle_connect()) i3_lfs3-MDT0000: refuse reconnection from b1d47bed-6eca-7ccf-45e8-c39f930b361f@10.200.20.63@tcp to 0xffff810203c6a000; still busy with 3 active RPCs
Apr 30 08:36:33 mds02 kernel: LustreError: 30256:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error (-16) req@ffff810227188e00 x19317/t0 o38->b1d47bed-6eca-7ccf-45e8-c39f930b361f@NET_0x200000ac8143f_UUID:-1 lens 304/200 ref 0 fl Interpret:/0/0 rc -16/0
Apr 30 08:37:00 mds02 kernel: Lustre: 0:0:(watchdog.c:130:lcw_cb()) Watchdog triggered for pid 30253: it was inactive for 100s
Apr 30 08:37:00 mds02 kernel: Lustre: 0:0:(linux-debug.c:168:libcfs_debug_dumpstack()) showing stack for process 30253
Apr 30 08:37:00 mds02 kernel: ll_mdt_12     S ffff810001071780     0 30253      1  30254 30252 (L-TLB)
Apr 30 08:37:00 mds02 kernel:  ffff810201c5b798 0000000000000046 ffff810201c5b7b8 000000000000000a
Apr 30 08:37:00 mds02 kernel:  ffff8102024559d8 ffff810202455790 ffff8102278ad040 00003b4ad4efe62e
Apr 30 08:37:00 mds02 kernel:  00000000000068eb 0000000500000000
Apr 30 08:37:00 mds02 kernel: Call Trace: <ffffffff8013acbf>{lock_timer_base+27} <ffffffff884f0f70>{:ptlrpc:ldlm_expired_completion_wait+0}
Apr 30 08:37:00 mds02 kernel:        <ffffffff802d91a1>{schedule_timeout+138} <ffffffff8013b1d8>{process_timeout+0}
Apr 30 08:37:00 mds02 kernel:        <ffffffff884f23cd>{:ptlrpc:ldlm_completion_ast+861}
Apr 30 08:37:00 mds02 kernel:        <ffffffff884da629>{:ptlrpc:ldlm_lock_enqueue+1369} <ffffffff8012ae07>{default_wake_function+0}
Apr 30 08:37:00 mds02 kernel:        <ffffffff884d6e8a>{:ptlrpc:ldlm_lock_addref_internal_nolock+58}
Apr 30 08:37:00 mds02 kernel:        <ffffffff884eed90>{:ptlrpc:ldlm_blocking_ast+0} <ffffffff884f0004>{:ptlrpc:ldlm_cli_enqueue_local+1108}
Apr 30 08:37:00 mds02 kernel:        <ffffffff8874ce27>{:mds:mds_fid2locked_dentry+471} <ffffffff884f2070>{:ptlrpc:ldlm_completion_ast+0}
Apr 30 08:37:00 mds02 kernel:        <ffffffff8874d5e7>{:mds:mds_getattr_lock+1783} <ffffffff887588a1>{:mds:mds_intent_policy+1489}
Apr 30 08:37:00 mds02 kernel:        <ffffffff884dd7d6>{:ptlrpc:ldlm_resource_putref+438}
Apr 30 08:37:00 mds02 kernel:        <ffffffff884da1c3>{:ptlrpc:ldlm_lock_enqueue+243} <ffffffff884d7bfd>{:ptlrpc:ldlm_lock_create+2477}
Apr 30 08:37:00 mds02 kernel:        <ffffffff8850f1e5>{:ptlrpc:lustre_swab_buf+197} <ffffffff884fa8c0>{:ptlrpc:ldlm_server_completion_ast+0}
Apr 30 08:37:00 mds02 kernel:        <ffffffff884f6ff7>{:ptlrpc:ldlm_handle_enqueue+3495}
Apr 30 08:37:00 mds02 kernel:        <ffffffff884fae30>{:ptlrpc:ldlm_server_blocking_ast+0}
Apr 30 08:37:00 mds02 kernel:        <ffffffff88753fad>{:mds:mds_handle+18141} <ffffffff88479d92>{:obdclass:class_handle2object+210}
Apr 30 08:37:00 mds02 kernel:        <ffffffff88511660>{:ptlrpc:lustre_swab_ptlrpc_body+0}
Apr 30 08:37:00 mds02 kernel:        <ffffffff8850f1e5>{:ptlrpc:lustre_swab_buf+197} <ffffffff8013adf9>{__mod_timer+173}
Apr 30 08:37:00 mds02 kernel:        <ffffffff88519327>{:ptlrpc:ptlrpc_main+4903} <ffffffff8012ae07>{default_wake_function+0}
Apr 30 08:37:00 mds02 kernel:        <ffffffff8010bb8a>{child_rip+8} <ffffffff88518000>{:ptlrpc:ptlrpc_main+0}
Apr 30 08:37:00 mds02 kernel:        <ffffffff8010bb82>{child_rip+0}
Apr 30 08:37:00 mds02 kernel: LustreError: 30253:0:(ldlm_request.c:64:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1209566120, 100s ago); not entering recovery in server code, just going back to sleep ns: mds-i3_lfs3-MDT0000_UUID lock: ffff8102002cda80/0xdd61b8a90a25a009 lrc: 3/1,0 mode: --/CR res: 1096876033/1115803706 bits 0x3 rrc: 4 type: IBT flags: 4004000 remote: 0x0 expref: -99 pid 30253
Apr 30 08:37:00 mds02 kernel: LustreError: dumping log to /tmp/lustre-log.1209566220.30253
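[Editor's sketch] A minimal way to watch the recovery window from the backup MDS while the client is blocked; the recovery_status path is the one quoted in the post above, and the 5-second polling interval is an arbitrary choice:

  # Run on mds02 after powering off mds01; adjust the MDT name to your filesystem.
  while true; do
      date
      cat /proc/fs/lustre/mds/i3_lfs3-MDT0000/recovery_status 2>/dev/null
      sleep 5
  done

The status file shows the recovery state and how many clients have reconnected, which helps separate "waiting for failover" time from "waiting for recovery to complete" time.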
Andreas Dilger
2008-May-02 22:14 UTC
[Lustre-discuss] How long should an MDS failover event block
On Apr 30, 2008 07:53 -0700, stumped wrote:
> I'm using 1.6.4.3 and have 2 MDS servers with heartbeat.
> I have 2 OSSes (each with one OST). A client mounts the filesystem with
>
>   mount -t lustre mds01@tcp0:mds02@tcp0:/i3_lfs3 /mnt/lustre/i3_lfs3
>
> and everything is great.
>
> I then run this on the client:
>
>   while true; do echo "----------- $(date)"; touch $(date +%s); sleep 2; done
>
> So I turn off the power on mds01 and the output of the above script is:
>
>   ----------- Wed Apr 30 08:31:29 MDT 2008
>   ----------- Wed Apr 30 08:31:31 MDT 2008
>   ----------- Wed Apr 30 08:34:53 MDT 2008
>   ----------- Wed Apr 30 08:38:15 MDT 2008
>   ----------- Wed Apr 30 08:38:17 MDT 2008
>
> In other words, my client blocks for almost 7 minutes.
>
> Is this what I should expect?

Yes, this is pretty reasonable. We are working to reduce this delay in 1.8
with a feature called "adaptive timeouts" that will tune the Lustre recovery
timeout as a function of how busy the servers are. Currently the default
timeout is 100s, and for large clusters (say 500 or more clients) the
recommended timeout is 300s, because under load a server is in essence
undergoing a denial-of-service attack from the clients, and each client might
get 1/1000th of a server's idle performance, which is sometimes hard to
distinguish from server failure.

> Can I reasonably shrink this time? Since some of the time is spent
> with the MDT being "recovered" - as shown by
>   cat /proc/fs/lustre/mds/i3_lfs3-MDT0000/recovery_status
> can I expect this time to grow as the filesystem holds more data
> (eventually it'll be about 700TB)?

The recovery time is not a function of the amount of data on the filesystem.

> Also, the MDS gives up the stack trace below; it doesn't look good; should I
> be worried? As I said, after 7 minutes or so everything seems to be
> working (though I'm not really doing much with the filesystem yet).
>
> Apr 30 08:36:33 mds02 kernel: Lustre: 30256:0:(ldlm_lib.c:747:target_handle_connect()) i3_lfs3-MDT0000: refuse reconnection from b1d47bed-6eca-7ccf-45e8-c39f930b361f@10.200.20.63@tcp to 0xffff810203c6a000; still busy with 3 active RPCs
> Apr 30 08:36:33 mds02 kernel: LustreError: 30256:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error (-16) req@ffff810227188e00 x19317/t0 o38->b1d47bed-6eca-7ccf-45e8-c39f930b361f@NET_0x200000ac8143f_UUID:-1 lens 304/200 ref 0 fl Interpret:/0/0 rc -16/0
> Apr 30 08:37:00 mds02 kernel: Lustre: 0:0:(watchdog.c:130:lcw_cb()) Watchdog triggered for pid 30253: it was inactive for 100s
> Apr 30 08:37:00 mds02 kernel: LustreError: 30253:0:(ldlm_request.c:64:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1209566120, 100s ago); not entering recovery in server code, just going back to sleep ns: mds-i3_lfs3-MDT0000_UUID lock: ffff8102002cda80/0xdd61b8a90a25a009 lrc: 3/1,0 mode: --/CR res: 1096876033/1115803706 bits 0x3 rrc: 4 type: IBT flags: 4004000 remote: 0x0 expref: -99 pid 30253
> Apr 30 08:37:00 mds02 kernel: LustreError: dumping log to /tmp/lustre-log.1209566220.30253

Probably not ideal, but not fatal. There should be a message after this which
reports that the thread became responsive again after 120s or similar...

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
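[Editor's sketch] As a rough illustration of the timeout tuning Andreas describes, the global obd timeout can be raised as below. This assumes the 1.6-era tunables; the /proc path and the conf_param syntax should be verified against the manual for your release:

  # Temporary, per-node change on every server and client (lost at reboot):
  echo 300 > /proc/sys/lustre/timeout

  # Persistent, filesystem-wide change, run on the MGS node:
  lctl conf_param i3_lfs3.sys.timeout=300

Note the trade-off: a larger timeout reduces spurious evictions on a busy but healthy server, while lengthening the worst-case wait when a server really has died.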