David Noriega
2012-Feb-01 18:57 UTC
[Lustre-discuss] Thread might be hung, Heavy IO Load messages
As of late I've been seeing a lot of these messages:

Lustre: Service thread pid 22974 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Pid: 22974, comm: ll_ost_io_233
Call Trace:
[<ffffffff8006e1db>] do_gettimeofday+0x40/0x90
[<ffffffff8001546f>] sync_buffer+0x0/0x3f
[<ffffffff800637ea>] io_schedule+0x3f/0x67
[<ffffffff800154aa>] sync_buffer+0x3b/0x3f
[<ffffffff80063a16>] __wait_on_bit+0x40/0x6e
[<ffffffff8001546f>] sync_buffer+0x0/0x3f
[<ffffffff80063ab0>] out_of_line_wait_on_bit+0x6c/0x78
[<ffffffff800a0ae0>] wake_bit_function+0x0/0x23
[<ffffffff88945bc8>] bh_submit_read+0x58/0x70 [ldiskfs]
[<ffffffff88945ef8>] read_block_bitmap+0xc8/0x1c0 [ldiskfs]
[<ffffffff88968101>] ldiskfs_mb_free_blocks+0x191/0x5d0 [ldiskfs]
[<ffffffff8894a0d1>] ldiskfs_mark_iloc_dirty+0x411/0x480 [ldiskfs]
[<ffffffff88030d09>] do_get_write_access+0x4f9/0x530 [jbd]
[<ffffffff80007691>] find_get_page+0x21/0x51
[<ffffffff80010c54>] __find_get_block_slow+0x2f/0xf7
[<ffffffff88946d4d>] ldiskfs_free_blocks+0x8d/0xe0 [ldiskfs]
[<ffffffff8895ee66>] ldiskfs_ext_remove_space+0x3a6/0x740 [ldiskfs]
[<ffffffff88960401>] ldiskfs_ext_truncate+0x161/0x1f0 [ldiskfs]
[<ffffffff8894c881>] ldiskfs_truncate+0xc1/0x610 [ldiskfs]
[<ffffffff800cd34f>] unmap_mapping_range+0x59/0x204
[<ffffffff8894a0d1>] ldiskfs_mark_iloc_dirty+0x411/0x480 [ldiskfs]
[<ffffffff800cdd9d>] vmtruncate+0xa2/0xc9
[<ffffffff800417a6>] inode_setattr+0x22/0x104
[<ffffffff8894df3b>] ldiskfs_setattr+0x1eb/0x270 [ldiskfs]
[<ffffffff889ca037>] fsfilt_ldiskfs_setattr+0x1a7/0x250 [fsfilt_ldiskfs]
[<ffffffff889e5551>] filter_version_get_check+0x91/0x2a0 [obdfilter]
[<ffffffff800645ab>] __down_write_nested+0x12/0x92
[<ffffffff885b1378>] cfs_alloc+0x68/0xc0 [libcfs]
[<ffffffff889f2bfb>] filter_destroy+0xd9b/0x1fb0 [obdfilter]
[<ffffffff886f0bc0>] ldlm_blocking_ast+0x0/0x2a0 [ptlrpc]
[<ffffffff886f42a0>] ldlm_completion_ast+0x0/0x880 [ptlrpc]
[<ffffffff88703549>] ldlm_srv_pool_recalc+0x79/0x220 [ptlrpc]
[<ffffffff88719924>] lustre_msg_add_version+0x34/0x110 [ptlrpc]
[<ffffffff8871c62a>] lustre_pack_reply_flags+0x86a/0x950 [ptlrpc]
[<ffffffff886dba4c>] ldlm_resource_putref+0x34c/0x3c0 [ptlrpc]
[<ffffffff886d68d2>] ldlm_lock_put+0x372/0x3d0 [ptlrpc]
[<ffffffff8871c739>] lustre_pack_reply+0x29/0xb0 [ptlrpc]
[<ffffffff889a4050>] ost_destroy+0x660/0x790 [ost]
[<ffffffff88718a78>] lustre_msg_check_version_v2+0x8/0x20 [ptlrpc]
[<ffffffff887188c5>] lustre_msg_get_opc+0x35/0xf0 [ptlrpc]
[<ffffffff889ada26>] ost_handle+0x1556/0x55b0 [ost]
[<ffffffff800dbcaa>] free_block+0x126/0x143
[<ffffffff800dbeec>] __drain_alien_cache+0x51/0x66
[<ffffffff88725c37>] ptlrpc_server_handle_request+0xaa7/0x1150 [ptlrpc]
[<ffffffff800470f3>] try_to_wake_up+0x472/0x484
[<ffffffff80062ff8>] thread_return+0x62/0xfe
[<ffffffff8008b4a5>] __wake_up_common+0x3e/0x68
[<ffffffff88729698>] ptlrpc_main+0x1258/0x1420 [ptlrpc]
[<ffffffff8008d07b>] default_wake_function+0x0/0xe
[<ffffffff800b7a9c>] audit_syscall_exit+0x336/0x362
[<ffffffff8005dfb1>] child_rip+0xa/0x11
[<ffffffff88728440>] ptlrpc_main+0x0/0x1420 [ptlrpc]
[<ffffffff8005dfa7>] child_rip+0x0/0x11

or

Pid: 13507, comm: ll_ost_io_205

LustreError: dumping log to /tmp/lustre-log.1328117954.22974
Lustre: lustre-OST0001: slow journal start 34s due to heavy IO load
Lustre: Skipped 1 previous similar message
Lustre: lustre-OST0001: slow brw_start 34s due to heavy IO load
Lustre: Skipped 1 previous similar message
Lustre: lustre-OST0001: slow journal start 36s due to heavy IO load
Lustre: Skipped 6 previous similar messages
Lustre: lustre-OST0001: slow journal start 37s due to heavy IO load
Lustre: Skipped 1 previous similar message
Lustre: lustre-OST0001: slow commitrw commit 37s due to heavy IO load
Lustre: lustre-OST0001: slow i_mutex 34s due to heavy IO load
Lustre: Skipped 1 previous similar message
Lustre: lustre-OST0001: slow i_mutex 34s due to heavy IO load
Lustre: lustre-OST0000: slow setattr 44s due to heavy IO load
Lustre: lustre-OST0000: slow setattr 44s due to heavy IO load
Lustre: lustre-OST0001: slow setattr 106s due to heavy IO load
Lustre: Skipped 2 previous similar messages
Lustre: lustre-OST0001: slow setattr 106s due to heavy IO load

On the MDS I see the following:

LustreError: 5255:0:(lov_request.c:692:lov_update_create_set()) error creating fid 0x234ca921 sub-object on OST idx 3/4: rc = -107
LustreError: 5255:0:(lov_request.c:692:lov_update_create_set()) error creating fid 0xd03d1 sub-object on OST idx 3/4: rc = -107
LustreError: 5255:0:(lov_request.c:692:lov_update_create_set()) error creating fid 0x14e2441 sub-object on OST idx 3/4: rc = -107
LustreError: 5255:0:(lov_request.c:692:lov_update_create_set()) error creating fid 0xc1861 sub-object on OST idx 3/4: rc = -107
LustreError: 5255:0:(lov_request.c:692:lov_update_create_set()) error creating fid 0xd0581 sub-object on OST idx 3/4: rc = -107
LustreError: 5255:0:(lov_request.c:692:lov_update_create_set()) error creating fid 0xd0272 sub-object on OST idx 3/4: rc = -107
LustreError: 5255:0:(lov_request.c:692:lov_update_create_set()) error creating fid 0xc8042 sub-object on OST idx 3/4: rc = -107
LustreError: 5255:0:(lov_request.c:692:lov_update_create_set()) error creating fid 0xc805c sub-object on OST idx 3/4: rc = -107
LustreError: 5255:0:(lov_request.c:692:lov_update_create_set()) error creating fid 0xc805c sub-object on OST idx 3/4: rc = -107
LustreError: 5255:0:(lov_request.c:692:lov_update_create_set()) error creating fid 0xd024d sub-object on OST idx 3/4: rc = -107
LustreError: 5255:0:(lov_request.c:692:lov_update_create_set()) error creating fid 0xd0311 sub-object on OST idx 3/4: rc = -107
LustreError: 5255:0:(lov_request.c:692:lov_update_create_set()) error creating fid 0xd0316 sub-object on OST idx 3/4: rc = -107
LustreError: 5255:0:(lov_request.c:692:lov_update_create_set()) error creating fid 0xd032a sub-object on OST idx 3/4: rc = -107
LustreError: 5255:0:(lov_request.c:692:lov_update_create_set()) error creating fid 0xd032b sub-object on OST idx 3/4: rc = -107
LustreError: 5255:0:(lov_request.c:692:lov_update_create_set()) error creating fid 0xc8fd6 sub-object on OST idx 3/4: rc = -107
LustreError: 5255:0:(lov_request.c:692:lov_update_create_set()) error creating fid 0xc1847 sub-object on OST idx 3/4: rc = -107
LustreError: 5255:0:(lov_request.c:692:lov_update_create_set()) error creating fid 0x234ca903 sub-object on OST idx 3/4: rc = -107
LustreError: 5255:0:(lov_request.c:692:lov_update_create_set()) error creating fid 0x234ca904 sub-object on OST idx 3/4: rc = -107

Lustre: Service thread pid 25775 was inactive for 222.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Lustre: Skipped 3 previous similar messages
Pid: 25775, comm: ll_mdt_117
Call Trace:
[<ffffffff800638ab>] schedule_timeout+0x8a/0xad
[<ffffffff80097d9f>] process_timeout+0x0/0x5
[<ffffffff88809c75>] osc_create+0xc75/0x13d0 [osc]
[<ffffffff8008d07b>] default_wake_function+0x0/0xe
[<ffffffff888b7cbb>] qos_remedy_create+0x45b/0x570 [lov]
[<ffffffff888b1be3>] lov_fini_create_set+0x243/0x11e0 [lov]
[<ffffffff888a5982>] lov_create+0x1552/0x1860 [lov]
[<ffffffff888a65b6>] lov_iocontrol+0x926/0xf0f [lov]
[<ffffffff8008d07b>] default_wake_function+0x0/0xe
[<ffffffff88a6140a>] mds_finish_open+0x1fea/0x43e0 [mds]
[<ffffffff88030d09>] do_get_write_access+0x4f9/0x530 [jbd]
[<ffffffff889740d1>] ldiskfs_mark_iloc_dirty+0x411/0x480 [ldiskfs]
[<ffffffff88974796>] ldiskfs_mark_inode_dirty+0x136/0x160 [ldiskfs]
[<ffffffff889740d1>] ldiskfs_mark_iloc_dirty+0x411/0x480 [ldiskfs]
[<ffffffff88a6845e>] mds_open+0x2cce/0x35f8 [mds]
[<ffffffff887cddbf>] ksocknal_find_conn_locked+0xcf/0x1f0 [ksocklnd]
[<ffffffff887cfef5>] ksocknal_alloc_tx+0x1f5/0x2a0 [ksocklnd]
[<ffffffff88a3ef89>] mds_reint_rec+0x1d9/0x2b0 [mds]
[<ffffffff88a6ac72>] mds_open_unpack+0x312/0x430 [mds]
[<ffffffff88a31e7a>] mds_reint+0x35a/0x420 [mds]
[<ffffffff88a30d8a>] fixup_handle_for_resent_req+0x5a/0x2c0 [mds]
[<ffffffff88a3bbfc>] mds_intent_policy+0x4ac/0xc80 [mds]
[<ffffffff887058b6>] ldlm_resource_putref+0x1b6/0x3c0 [ptlrpc]
[<ffffffff88702eb6>] ldlm_lock_enqueue+0x186/0xb20 [ptlrpc]
[<ffffffff886ff7fd>] ldlm_lock_create+0x9bd/0x9f0 [ptlrpc]
[<ffffffff88727720>] ldlm_server_blocking_ast+0x0/0x83d [ptlrpc]
[<ffffffff88724849>] ldlm_handle_enqueue+0xbf9/0x1210 [ptlrpc]
[<ffffffff88a3ab20>] mds_handle+0x4130/0x4d60 [mds]
[<ffffffff88633be5>] lnet_match_blocked_msg+0x375/0x390 [lnet]
[<ffffffff88748705>] lustre_msg_get_conn_cnt+0x35/0xf0 [ptlrpc]
[<ffffffff8874fc37>] ptlrpc_server_handle_request+0xaa7/0x1150 [ptlrpc]
[<ffffffff800470f3>] try_to_wake_up+0x472/0x484
[<ffffffff8008b4a5>] __wake_up_common+0x3e/0x68
[<ffffffff88753698>] ptlrpc_main+0x1258/0x1420 [ptlrpc]
[<ffffffff8008d07b>] default_wake_function+0x0/0xe
[<ffffffff800b7a9c>] audit_syscall_exit+0x336/0x362
[<ffffffff8005dfb1>] child_rip+0xa/0x11
[<ffffffff88752440>] ptlrpc_main+0x0/0x1420 [ptlrpc]
[<ffffffff8005dfa7>] child_rip+0x0/0x11

LustreError: dumping log to /tmp/lustre-log.1328058813.25775
Lustre: Service thread pid 25775 completed after 223.60s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
Lustre: Skipped 3 previous similar messages

What do these messages mean?

--
David Noriega
System Administrator
Computational Biology Initiative
High Performance Computing Center
University of Texas at San Antonio
One UTSA Circle
San Antonio, TX 78249
Office: BSE 3.112
Phone: 210-458-7100
http://www.cbi.utsa.edu
Carlos Thomaz
2012-Feb-01 19:04 UTC
[Lustre-discuss] Thread might be hung, Heavy IO Load messages
Hi David,

You may be facing the same issue discussed in previous threads, which is the one regarding zone_reclaim_mode. Take a look at the earlier thread where Kevin and I replied to Vijesh Ek.

If you don't have access to the previous emails, look at your kernel setting for zone reclaim:

cat /proc/sys/vm/zone_reclaim_mode

It should be set to 0.

Also, look at the number of Lustre OSS service threads. It may be set too high...

Rgds.
Carlos.

--
Carlos Thomaz | HPC Systems Architect
Mobile: +1 (303) 519-0578
cthomaz at ddn.com | Skype ID: carlosthomaz
DataDirect Networks, Inc.
9960 Federal Dr., Ste 100 Colorado Springs, CO 80921
ddn.com <http://www.ddn.com/> | Twitter: @ddn_limitless <http://twitter.com/ddn_limitless> | 1.800.TERABYTE

On 2/1/12 11:57 AM, "David Noriega" <tsk133 at my.utsa.edu> wrote:

> indicates the system was overloaded (too many service threads, or
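For example, a minimal check-and-set on each client and server (standard Linux sysctl usage; the persistence step assumes your nodes read /etc/sysctl.conf at boot):

    # check the current value
    cat /proc/sys/vm/zone_reclaim_mode

    # set it to 0 at runtime
    sysctl -w vm.zone_reclaim_mode=0

    # keep it across reboots
    echo "vm.zone_reclaim_mode = 0" >> /etc/sysctl.conf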
Charles Taylor
2012-Feb-01 19:27 UTC
[Lustre-discuss] Thread might be hung, Heavy IO Load messages
You may also want to check and, if necessary, limit the lru_size on your clients. I believe there are guidelines in the ops manual. We have ~750 clients and limit ours to 600 per OST. That, combined with setting zone_reclaim_mode=0, should make a big difference.

Regards,

Charlie Taylor
UF HPC Center

On Feb 1, 2012, at 2:04 PM, Carlos Thomaz wrote:

> Hi David,
>
> You may be facing the same issue discussed in previous threads, which is
> the one regarding zone_reclaim_mode.
> [...]

Charles A. Taylor, Ph.D.
Associate Director,
UF HPC Center
(352) 392-4036
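A minimal sketch of that kind of cap, run on each client (the parameter path follows the LDLM tuning section of the 1.8 manual; the glob and the value 600 are only illustrations of Charlie's numbers, not a recommendation for your site):

    # limit the lock LRU for the OST (osc) namespaces on this client
    lctl set_param ldlm.namespaces.*osc*.lru_size=600

    # as I understand it, setting it back to 0 re-enables dynamic LRU sizing
    lctl set_param ldlm.namespaces.*osc*.lru_size=0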
David Noriega
2012-Feb-01 21:11 UTC
[Lustre-discuss] Thread might be hung, Heavy IO Load messages
zone_reclaim_mode is 0 on all clients/servers.

When changing the number of service threads or the lru_size, can these be done on the fly, or do they require a reboot of either client or server?

For my two OSSs, cat /proc/fs/lustre/ost/OSS/ost_io/threads_started gives about 300 (300 and 359), so I'm thinking try half of that and see how it goes?

Also, checking lru_size, I get different numbers from the clients (cat /proc/fs/lustre/ldlm/namespaces/*/lru_size):

Client              MDT0   OST0     OST1    OST2    OST3    MGC
head node           0      22       22      22      22      400    (only a few users logged in)
busy node           1      501      504     503     505     400    (fully loaded with jobs)
samba/nfs server    4      440070   44370   44348   26282   1600

So my understanding is that lru_size is set to auto by default, thus the varying values, but setting it manually is effectively setting a max value? Also, what does it mean to have a lower value (especially in the case of the samba/nfs server)?

On Wed, Feb 1, 2012 at 1:27 PM, Charles Taylor <taylor at hpc.ufl.edu> wrote:
> You may also want to check and, if necessary, limit the lru_size on your clients.
> We have ~750 clients and limit ours to 600 per OST.
> [...]

--
David Noriega
System Administrator
Computational Biology Initiative
High Performance Computing Center
University of Texas at San Antonio
One UTSA Circle
San Antonio, TX 78249
Office: BSE 3.112
Phone: 210-458-7100
http://www.cbi.utsa.edu
Carlos Thomaz
2012-Feb-02 00:33 UTC
[Lustre-discuss] Thread might be hung, Heavy IO Load messages
David,

The OSS service thread count is a function of your RAM size and CPUs. It's difficult to say what a good upper limit would be without knowing the size of your OSS, # of clients, storage back-end, and workload. But the good thing is you can give it a try on the fly via the lctl set_param command.

Assuming you are running Lustre 1.8, here is a good explanation of how to do it:

http://wiki.lustre.org/manual/LustreManual18_HTML/LustreProc.html#50651263_87260

Some remarks:
- Reducing the number of OSS threads may impact performance, depending on your workload.
- Unfortunately, I guess you will need to try and see what happens. I would go for 128 and analyze the behavior of your OSSs (via log files) while also keeping an eye on your workload. It seems to me that 300 is a bit too high (but again, I don't know what you have on your storage back-end or OSS configuration).

I can't tell you much about the lru_size, but as far as I understand the values are dynamic and there's not much to do other than clear the least-recently-used queue or disable LRU sizing. I can't help much on this other than pointing you to the explanation for it (see 31.2.11):

http://wiki.lustre.org/manual/LustreManual20_HTML/LustreProc.html

Regards,
Carlos

--
Carlos Thomaz | HPC Systems Architect
Mobile: +1 (303) 519-0578
cthomaz at ddn.com | Skype ID: carlosthomaz
DataDirect Networks, Inc.
9960 Federal Dr., Ste 100 Colorado Springs, CO 80921
ddn.com <http://www.ddn.com/> | Twitter: @ddn_limitless <http://twitter.com/ddn_limitless> | 1.800.TERABYTE

On 2/1/12 2:11 PM, "David Noriega" <tsk133 at my.utsa.edu> wrote:

> zone_reclaim_mode is 0 on all clients/servers.
>
> When changing the number of service threads or the lru_size, can these be
> done on the fly, or do they require a reboot of either client or server?
> [...]
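As a concrete sketch of that on-the-fly change (the parameter names mirror the /proc path David quoted, i.e. /proc/fs/lustre/ost/OSS/ost_io/*; 128 is only the trial value suggested above):

    # on each OSS: see how many I/O threads exist vs. the allowed maximum
    lctl get_param ost.OSS.ost_io.threads_started ost.OSS.ost_io.threads_max

    # cap the I/O service threads at runtime
    lctl set_param ost.OSS.ost_io.threads_max=128

Note that, as far as I know, lowering threads_max does not stop threads that have already been started; the started count only comes back down after the OSS is restarted.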
David Noriega
2012-Feb-02 15:54 UTC
[Lustre-discuss] Thread might be hung, Heavy IO Load messages
We have two OSSs, each with two quad-core AMD Opterons and 8GB of RAM, and two OSTs each (4.4T and 3.5T). Backend storage is a pair of Sun StorageTek 2540s connected with 8Gb fiber.

What about tweaking max_dirty_mb on the client side?

On Wed, Feb 1, 2012 at 6:33 PM, Carlos Thomaz <cthomaz at ddn.com> wrote:
> I would go for 128 and analyze the behavior of your OSSs (via log files)
> [...]

--
David Noriega
System Administrator
Computational Biology Initiative
High Performance Computing Center
University of Texas at San Antonio
One UTSA Circle
San Antonio, TX 78249
Office: BSE 3.112
Phone: 210-458-7100
http://www.cbi.utsa.edu
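For what it's worth, max_dirty_mb can at least be inspected and changed on the fly on a client (osc.*.max_dirty_mb is the per-OSC dirty cache limit; the 32 below is purely illustrative, not a recommendation):

    # per-OSC dirty cache limit, in MB
    lctl get_param osc.*.max_dirty_mb

    # example of lowering it
    lctl set_param osc.*.max_dirty_mb=32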
David Noriega
2012-Feb-02 16:05 UTC
[Lustre-discuss] Thread might be hung, Heavy IO Load messages
On a side note, what about increasing the MDS service threads? Checking that, it's running at its max of 128.

On Thu, Feb 2, 2012 at 9:54 AM, David Noriega <tsk133 at my.utsa.edu> wrote:
> We have two OSSs, each with two quad-core AMD Opterons and 8GB of RAM,
> and two OSTs each (4.4T and 3.5T). Backend storage is a pair of Sun
> StorageTek 2540s connected with 8Gb fiber.
>
> What about tweaking max_dirty_mb on the client side?
> [...]

--
David Noriega
System Administrator
Computational Biology Initiative
High Performance Computing Center
University of Texas at San Antonio
One UTSA Circle
San Antonio, TX 78249
Office: BSE 3.112
Phone: 210-458-7100
http://www.cbi.utsa.edu
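If it turns out more than 128 is needed, my understanding (worth double-checking against the 1.8 manual section Carlos linked) is that the MDS thread count is set with a module option rather than on the fly, e.g. in /etc/modprobe.conf on the MDS:

    # assumed option name and illustrative value; verify in the manual before using
    options mds mds_num_threads=256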
Andreas Dilger
2012-Feb-02 18:07 UTC
[Lustre-discuss] Thread might be hung, Heavy IO Load messages
On 2012-02-02, at 8:54 AM, David Noriega wrote:
> We have two OSSs, each with two quad-core AMD Opterons and 8GB of RAM,
> and two OSTs each (4.4T and 3.5T). Backend storage is a pair of Sun
> StorageTek 2540s connected with 8Gb fiber.

Running 32-64 threads per OST is the optimum number, based on previous experience.

> What about tweaking max_dirty_mb on the client side?

Probably unrelated.

> On Wed, Feb 1, 2012 at 6:33 PM, Carlos Thomaz <cthomaz at ddn.com> wrote:
> [...]

Cheers, Andreas
--
Andreas Dilger                       Whamcloud, Inc.
Principal Engineer                   http://www.whamcloud.com/
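Applying that guideline to this setup would look roughly like the sketch below (the runtime parameter matches the proc path quoted earlier in the thread; the modprobe.conf option name is an assumption based on the 1.8 manual, so verify it there):

    # 2 OSTs per OSS at 32-64 threads each => roughly 64-128 threads per OSS
    lctl set_param ost.OSS.ost_io.threads_max=128

    # assumed persistent form, in /etc/modprobe.conf on each OSS:
    # options ost oss_num_threads=128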
David Noriega
2012-Feb-03 00:05 UTC
[Lustre-discuss] Thread might be hung, Heavy IO Load messages
I found the thread "Luster clients getting evicted", as I've also seen the "ost_connect operation failed with -16" message, and there they recommend increasing the timeout. That was for 1.6, though, and as I've read, 1.8 has a different timeout system. Reading that, would increasing at_min (currently 0) or at_max (currently 600) be best?

On Thu, Feb 2, 2012 at 12:07 PM, Andreas Dilger <adilger at whamcloud.com> wrote:
> On 2012-02-02, at 8:54 AM, David Noriega wrote:
>> We have two OSSs, each with two quad-core AMD Opterons and 8GB of RAM,
>> and two OSTs each (4.4T and 3.5T). Backend storage is a pair of Sun
>> StorageTek 2540s connected with 8Gb fiber.
>
> Running 32-64 threads per OST is the optimum number, based on previous
> experience.
>
>> What about tweaking max_dirty_mb on the client side?
>
> Probably unrelated.
> [...]
>
> Cheers, Andreas
> --
> Andreas Dilger                       Whamcloud, Inc.
> Principal Engineer                   http://www.whamcloud.com/

--
David Noriega
System Administrator
Computational Biology Initiative
High Performance Computing Center
University of Texas at San Antonio
One UTSA Circle
San Antonio, TX 78249
Office: BSE 3.112
Phone: 210-458-7100
http://www.cbi.utsa.edu
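For reference, a sketch of where those knobs live on 1.8 (the /proc/sys/lustre paths are what I would expect for the adaptive-timeout settings; verify them on your build, and the 40 below is purely an example value, not advice):

    # current adaptive-timeout settings (clients and servers)
    grep . /proc/sys/lustre/at_min /proc/sys/lustre/at_max /proc/sys/lustre/at_history

    # example of raising the floor at runtime
    echo 40 > /proc/sys/lustre/at_min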
Carlos Thomaz
2012-Feb-03 01:37 UTC
[Lustre-discuss] Thread might be hung, Heavy IO Load messages
I can't comment much on this (I don't have much experience tuning it), but Lustre 1.8 has a completely different timeout architecture (adaptive timeouts). I suggest you take a deep look at it first:

--
Carlos Thomaz | HPC Systems Architect
Mobile: +1 (303) 519-0578
cthomaz at ddn.com | Skype ID: carlosthomaz
DataDirect Networks, Inc.
9960 Federal Dr., Ste 100 Colorado Springs, CO 80921
ddn.com <http://www.ddn.com/> | Twitter: @ddn_limitless <http://twitter.com/ddn_limitless> | 1.800.TERABYTE

On 2/2/12 5:05 PM, "David Noriega" <tsk133 at my.utsa.edu> wrote:

> I found the thread "Luster clients getting evicted", as I've also seen
> the "ost_connect operation failed with -16" message, and there they
> recommend increasing the timeout. That was for 1.6, though, and as I've
> read, 1.8 has a different timeout system. Reading that, would increasing
> at_min (currently 0) or at_max (currently 600) be best?
> [...]
Carlos Thomaz
2012-Feb-03 01:38 UTC
[Lustre-discuss] Thread might be hung, Heavy IO Load messages
Oops... Take a look first at:

http://wiki.lustre.org/index.php/Architecture_-_Adaptive_Timeouts_-_Use_Cases

And google for "adaptive timeouts".

Carlos.

--
Carlos Thomaz | HPC Systems Architect
Mobile: +1 (303) 519-0578
cthomaz at ddn.com | Skype ID: carlosthomaz
DataDirect Networks, Inc.
9960 Federal Dr., Ste 100 Colorado Springs, CO 80921
ddn.com <http://www.ddn.com/> | Twitter: @ddn_limitless <http://twitter.com/ddn_limitless> | 1.800.TERABYTE

On 2/2/12 6:37 PM, "Carlos Thomaz" <cthomaz at ddn.com> wrote:

> I can't comment much on this (I don't have much experience tuning it), but
> Lustre 1.8 has a completely different timeout architecture (adaptive
> timeouts). I suggest you take a deep look at it first:
> [...]
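One way to watch adaptive timeouts in action, assuming the per-import and per-service "timeouts" files are present on your 1.8 build (they should be, but treat the exact parameter names as an assumption to verify):

    # on a client: current RPC service-time estimates per import
    lctl get_param osc.*.timeouts

    # on an OSS: per-service estimates
    lctl get_param ost.OSS.*.timeouts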