Dear all,

Recently we found a problem on our OSS nodes: some service threads may hang when the server is under heavy I/O load. When this happens, some clients are evicted or refused by the OSTs, with error messages like the following:

Server side:

May 30 11:06:31 boss07 kernel: Lustre: Service thread pid 8011 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
May 30 11:06:31 boss07 kernel: Lustre: Skipped 1 previous similar message
May 30 11:06:31 boss07 kernel: Pid: 8011, comm: ll_ost_71
May 30 11:06:31 boss07 kernel:
May 30 11:06:31 boss07 kernel: Call Trace:
May 30 11:06:31 boss07 kernel: [<ffffffff886f5d0e>] start_this_handle+0x301/0x3cb [jbd2]
May 30 11:06:31 boss07 kernel: [<ffffffff800a09ca>] autoremove_wake_function+0x0/0x2e
May 30 11:06:31 boss07 kernel: [<ffffffff886f5e83>] jbd2_journal_start+0xab/0xdf [jbd2]
May 30 11:06:31 boss07 kernel: [<ffffffff888ce9b2>] fsfilt_ldiskfs_start+0x4c2/0x590 [fsfilt_ldiskfs]
May 30 11:06:31 boss07 kernel: [<ffffffff88920551>] filter_version_get_check+0x91/0x2a0 [obdfilter]
May 30 11:06:31 boss07 kernel: [<ffffffff80036cf4>] __lookup_hash+0x61/0x12f
May 30 11:06:31 boss07 kernel: [<ffffffff8893108d>] filter_setattr_internal+0x90d/0x1de0 [obdfilter]
May 30 11:06:31 boss07 kernel: [<ffffffff800e859b>] lookup_one_len+0x53/0x61
May 30 11:06:31 boss07 kernel: [<ffffffff88925452>] filter_fid2dentry+0x512/0x740 [obdfilter]
May 30 11:06:31 boss07 kernel: [<ffffffff88924e27>] filter_fmd_get+0x2b7/0x320 [obdfilter]
May 30 11:06:31 boss07 kernel: [<ffffffff8003027b>] __up_write+0x27/0xf2
May 30 11:06:31 boss07 kernel: [<ffffffff88932721>] filter_setattr+0x1c1/0x3b0 [obdfilter]
May 30 11:06:31 boss07 kernel: [<ffffffff8882677a>] lustre_pack_reply_flags+0x86a/0x950 [ptlrpc]
May 30 11:06:31 boss07 kernel: [<ffffffff8881e658>] ptlrpc_send_reply+0x5c8/0x5e0 [ptlrpc]
May 30 11:06:31 boss07 kernel: [<ffffffff88822b05>] lustre_msg_get_version+0x35/0xf0 [ptlrpc]
May 30 11:06:31 boss07 kernel: [<ffffffff888b0abb>] ost_handle+0x25db/0x55b0 [ost]
May 30 11:06:31 boss07 kernel: [<ffffffff80150d56>] __next_cpu+0x19/0x28
May 30 11:06:31 boss07 kernel: [<ffffffff800767ae>] smp_send_reschedule+0x4e/0x53
May 30 11:06:31 boss07 kernel: [<ffffffff8883215a>] ptlrpc_server_handle_request+0x97a/0xdf0 [ptlrpc]
May 30 11:06:31 boss07 kernel: [<ffffffff888328a8>] ptlrpc_wait_event+0x2d8/0x310 [ptlrpc]
May 30 11:06:31 boss07 kernel: [<ffffffff8008b3bd>] __wake_up_common+0x3e/0x68
May 30 11:06:31 boss07 kernel: [<ffffffff88833817>] ptlrpc_main+0xf37/0x10f0 [ptlrpc]
May 30 11:06:31 boss07 kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
May 30 11:06:31 boss07 kernel: [<ffffffff888328e0>] ptlrpc_main+0x0/0x10f0 [ptlrpc]
May 30 11:06:31 boss07 kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11
May 30 11:06:31 boss07 kernel:
May 30 11:06:31 boss07 kernel: LustreError: dumping log to /tmp/lustre-log.1338347191.8011

Client side:

May 30 09:58:36 ccopt kernel: LustreError: 11-0: an error occurred while communicating with 192.168.50.123@tcp. The ost_connect operation failed with -16

While this error is present, commands such as "ls", "df", "vi" and "touch" hang, which prevents us from doing anything in the file system. I think an ost_connect failure should report an error to users instead of leaving interactive commands stuck.

Could someone give us some advice or suggestions to solve this problem?

Thank you very much in advance.
Best Regards
Qiulan Huang
2012-05-30
===================================================================
Computing Center, the Institute of High Energy Physics, China
Huang, Qiulan              Tel: (+86) 10 8823 6010-105
P.O. Box 918-7             Fax: (+86) 10 8823 6839
Beijing 100049 P.R. China  Email: huangql at ihep.ac.cn
===================================================================
Adrian Ulrich
2012-May-30 05:51 UTC
[Lustre-discuss] [wc-discuss] The ost_connect operation failed with -16
Hello,

> May 30 09:58:36 ccopt kernel: LustreError: 11-0: an error occurred while communicating with 192.168.50.123@tcp. The ost_connect operation failed with -16

Error -16 stands for -EBUSY.

> While this error is present, commands such as "ls", "df", "vi" and "touch" hang, which prevents us from doing anything in the file system.

That's to be expected in such a situation. I suppose that 'lfs check servers' returned 'temporarily unavailable' for some OSTs?

> I think an ost_connect failure should report an error to users instead of leaving interactive commands stuck.

No: users shouldn't get an error in such a situation. The filesystem will just hang until the situation has recovered (i.e. the client was able to re-connect to the OST).

Regards,
Adrian
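P.S. For quick triage from a client, something along these lines usually tells you which targets are unreachable (a minimal sketch; the mount point /lustre is an example for your site):

  # Per-target health check; unreachable OSTs report "temporarily unavailable"
  lfs check servers

  # Per-OST space report; a target that is down is typically shown as inactive
  lfs df -h /lustre

  # Low-level device list and status as seen by this client
  lctl dl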
HUANG Qiulan
2012-May-30 10:47 UTC
[Lustre-discuss] [SPAM] Re: [wc-discuss] The ost_connect operation failed with -16
Hi Adrian,

> -----Original Message-----
> From: "Adrian Ulrich" <adrian at blinkenlights.ch>
> Sent: Wednesday, May 30, 2012
> To: huangql <huangql at ihep.ac.cn>
> Cc: lustre-discuss <lustre-discuss at lists.lustre.org>, wc-discuss <wc-discuss at whamcloud.com>
> Subject: [SPAM] Re: [wc-discuss] The ost_connect operation failed with -16
>
> Hello,
>
> > May 30 09:58:36 ccopt kernel: LustreError: 11-0: an error occurred while communicating with 192.168.50.123@tcp. The ost_connect operation failed with -16
>
> Error -16 stands for -EBUSY.
>
> > While this error is present, commands such as "ls", "df", "vi" and "touch" hang, which prevents us from doing anything in the file system.
>
> That's to be expected in such a situation. I suppose that 'lfs check servers' returned 'temporarily unavailable' for some OSTs?

Yes, we do get the 'temporarily unavailable' message on the client, and it can take minutes or even hours before it reconnects to the OST. During that time users cannot do anything interactive in the file system, which they find unacceptable. Do you have any other suggestions for solving this problem?

> > I think an ost_connect failure should report an error to users instead of leaving interactive commands stuck.
>
> No: users shouldn't get an error in such a situation. The filesystem will just hang until the situation has recovered (i.e. the client was able to re-connect to the OST).
>
> Regards,
> Adrian
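P.S. In the meantime we just watch the logs for the reconnection, roughly like this (a sketch; the syslog path assumes a RHEL-style setup, and the parameter name assumes Lustre 1.8-era naming):

  # On the client: watch for Lustre reconnection/eviction messages
  tail -f /var/log/messages | grep --line-buffered Lustre

  # On the OSS: check whether targets are in recovery and how far along they are
  lctl get_param obdfilter.*.recovery_status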
Liang Zhen
2012-May-30 11:04 UTC
[Lustre-discuss] [wc-discuss] The ost_connect operation failed with -16
Hi, I think you might hit this: http://jira.whamcloud.com/browse/LU-952 ; you can find the patch in that ticket.

Regards
Liang

On May 30, 2012, at 11:21 AM, huangql wrote:

> Dear all,
>
> Recently we found a problem on our OSS nodes: some service threads may hang when the server is under heavy I/O load. When this happens, some clients are evicted or refused by the OSTs, with error messages like the following:
>
> [server-side stack trace and client-side error snipped; quoted in full above]
>
> Could someone give us some advice or suggestions to solve this problem?
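P.S. To check whether a given build already carries the fix, compare the running version against the ticket's fix versions (a sketch; package names vary by distribution):

  # Lustre version as reported by the loaded modules
  lctl get_param version

  # Installed Lustre packages on RPM-based systems
  rpm -qa | grep -i lustre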
Niu
2012-May-31 09:23 UTC
[Lustre-discuss] [SPAM] Re: [wc-discuss] The ost_connect operation failed with -16
Hi, Qiulan

LU-952 is about a deadlock issue. Was quota enabled? You could try to disable quota and see if the problem goes away (see the command sketch at the end of this message).

Thanks
- Niu

> Hi Zhen,
>
> Many thanks for your prompt reply. I have disabled the writethrough_cache and read_cache to see whether that helps, but threads still hang under heavy I/O:
>
> May 31 08:38:43 boss33 kernel: LustreError: dumping log to /tmp/lustre-log.1338424722.5303
> May 31 08:38:43 boss33 kernel: Pid: 5262, comm: ll_ost_io_48
> May 31 08:38:43 boss33 kernel:
> May 31 08:38:43 boss33 kernel: Call Trace:
> May 31 08:38:43 boss33 kernel: [<ffffffff887bd451>] ksocknal_queue_tx_locked+0x451/0x490 [ksocklnd]
> May 31 08:38:43 boss33 kernel: [<ffffffff800646ac>] __down_read+0x7a/0x92
> May 31 08:38:43 boss33 kernel: [<ffffffff889843df>] ldiskfs_get_blocks+0x5f/0x2e0 [ldiskfs]
> May 31 08:38:43 boss33 kernel: [<ffffffff889851a0>] ldiskfs_get_block+0xc0/0x120 [ldiskfs]
> May 31 08:38:43 boss33 kernel: [<ffffffff88981f60>] ldiskfs_bmap+0x0/0xf0 [ldiskfs]
> May 31 08:38:43 boss33 kernel: [<ffffffff80033615>] generic_block_bmap+0x37/0x41
> May 31 08:38:43 boss33 kernel: [<ffffffff800341ad>] mapping_tagged+0x3c/0x47
> May 31 08:38:43 boss33 kernel: [<ffffffff88981f88>] ldiskfs_bmap+0x28/0xf0 [ldiskfs]
> May 31 08:38:43 boss33 kernel: [<ffffffff88981f60>] ldiskfs_bmap+0x0/0xf0 [ldiskfs]
> May 31 08:38:43 boss33 kernel: [<ffffffff88a4a288>] filter_commitrw_write+0x398/0x2be0 [obdfilter]
> May 31 08:38:43 boss33 kernel: [<ffffffff889e6e5c>] ost_checksum_bulk+0x30c/0x5b0 [ost]
> May 31 08:38:43 boss33 kernel: [<ffffffff889e6c38>] ost_checksum_bulk+0xe8/0x5b0 [ost]
> May 31 08:38:43 boss33 kernel: [<ffffffff889edcf9>] ost_brw_write+0x1c99/0x2480 [ost]
> May 31 08:38:43 boss33 kernel: [<ffffffff8872e658>] ptlrpc_send_reply+0x5c8/0x5e0 [ptlrpc]
> May 31 08:38:43 boss33 kernel: [<ffffffff886f98b0>] target_committed_to_req+0x40/0x120 [ptlrpc]
> May 31 08:38:43 boss33 kernel: [<ffffffff8008cf93>] default_wake_function+0x0/0xe
> May 31 08:38:43 boss33 kernel: [<ffffffff88732bc8>] lustre_msg_check_version_v2+0x8/0x20 [ptlrpc]
> May 31 08:38:43 boss33 kernel: [<ffffffff889f108e>] ost_handle+0x2bae/0x55b0 [ost]
> May 31 08:38:43 boss33 kernel: [<ffffffff80150d56>] __next_cpu+0x19/0x28
> May 31 08:38:43 boss33 kernel: [<ffffffff800767ae>] smp_send_reschedule+0x4e/0x53
> May 31 08:38:43 boss33 kernel: [<ffffffff8874215a>] ptlrpc_server_handle_request+0x97a/0xdf0 [ptlrpc]
> May 31 08:38:43 boss33 kernel: [<ffffffff887428a8>] ptlrpc_wait_event+0x2d8/0x310 [ptlrpc]
> May 31 08:38:43 boss33 kernel: [<ffffffff8008b3bd>] __wake_up_common+0x3e/0x68
> May 31 08:38:43 boss33 kernel: [<ffffffff88743817>] ptlrpc_main+0xf37/0x10f0 [ptlrpc]
> May 31 08:38:43 boss33 kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
> May 31 08:38:43 boss33 kernel: [<ffffffff887428e0>] ptlrpc_main+0x0/0x10f0 [ptlrpc]
> May 31 08:38:43 boss33 kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11
> May 31 08:38:43 boss33 kernel:
> May 31 08:38:43 boss33 kernel: LustreError: dumping log to /tmp/lustre-log.1338424722.5262
> May 31 08:38:43 boss33 kernel: Lustre: bes3fs-OST0064: slow journal start 48s due to heavy IO load
> May 31 08:38:43 boss33 kernel: Lustre: Skipped 1 previous similar message
> May 31 08:38:43 boss33 kernel: Lustre: bes3fs-OST0064: slow brw_start 48s due to heavy IO load
> May 31 08:38:43 boss33 kernel: Lustre: Skipped 1 previous similar message
> May 31 08:38:43 boss33 kernel: Lustre: bes3fs-OST0064: slow journal start 187s due to heavy IO load
> May 31 08:38:43 boss33 kernel: Lustre: bes3fs-OST0064: slow brw_start 187s due to heavy IO load
>
> I have not applied the patch yet because all the servers are online. Do you know how to deal with this without affecting users?
>
> Thank you very much.
>
> Cheers,
> Qiulan
> ===================================================================
> Computing Center, the Institute of High Energy Physics, China
> Huang, Qiulan              Tel: (+86) 10 8823 6010-105
> P.O. Box 918-7             Fax: (+86) 10 8823 6839
> Beijing 100049 P.R. China  Email: huangql at ihep.ac.cn
> ===================================================================
> 2012-05-31
>
> ------------------------------------------------------------------------
> huangql
> ------------------------------------------------------------------------
> *From:* Liang Zhen
> *Sent:* 2012-05-30 19:12:15
> *To:* huangql
> *Cc:* lustre-discuss; wc-discuss
> *Subject:* [SPAM] Re: [wc-discuss] The ost_connect operation failed with -16
>
> Hi, I think you might hit this: http://jira.whamcloud.com/browse/LU-952 ; you can find the patch in that ticket.
>
> Regards
> Liang
>
> On May 30, 2012, at 11:21 AM, huangql wrote:
>
>> Dear all,
>> [original problem report and stack trace snipped; quoted in full earlier in the thread]
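P.S. Concretely, the two mitigations discussed above map to commands along these lines (a sketch assuming Lustre 1.8-era tool and parameter names; run lctl on the OSS and lfs on a client, and verify against your version's documentation):

  # On each OSS: disable the OSS read cache and writethrough cache
  lctl set_param obdfilter.*.read_cache_enable=0
  lctl set_param obdfilter.*.writethrough_cache_enable=0

  # On a client: turn off user/group quota enforcement for the filesystem
  # (the mount point /lustre is an example)
  lfs quotaoff -ug /lustre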