Mike Hanby
2011-Aug-15 21:58 UTC
[Lustre-discuss] Question about setting max service threads
Howdy,

Our OSS servers are logging quite a few "heavy IO load" messages, combined with system load (via 'uptime') being reported in the 100s to several 100s range:

Lustre: lustre-OST0004: slow commitrw commit 191s due to heavy IO load
Aug 15 13:00:38 lustre-oss-0-2 kernel: Lustre: Service thread pid 17651 completed after 236.04s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
Lustre: Skipped 1 previous similar message
Lustre: lustre-OST0004: slow commitrw commit 191s due to heavy IO load
Lustre: Service thread pid 16436 completed after 210.17s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).

I'd like to test setting ost_io.threads_max to values lower than 512.

Question 1: Will this command survive a reboot: "lctl set_param ost.OSS.ost_io.threads_max=256", or do I need to also run "lctl conf_param ost.OSS.ost_io.threads_max=256"?

Question 2: Since Lustre "does not reduce the number of service threads in use", is there any way I can force the extra running service threads to exit, or is a reboot of the OSS servers the only clean way?

Thanks,

Mike
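For reference, the current state of the ost_io thread pool can be inspected with lctl get_param before any limit is changed. A minimal sketch, assuming the threads_started and threads_min entries exist alongside threads_max in this proc tree (verify the exact names on your OSS build):

  # Current cap on OST I/O service threads (512 by default here)
  lctl get_param ost.OSS.ost_io.threads_max

  # Threads actually started so far, and the lower bound (assumed parameter names)
  lctl get_param ost.OSS.ost_io.threads_started
  lctl get_param ost.OSS.ost_io.threads_min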
Mike Hanby
2011-Aug-15 22:12 UTC
[Lustre-discuss] Question about setting max service threads
Oh, I forgot to mention we are running Lustre 1.8.6 (Whamcloud build for CentOS x86_64).
Andreas Dilger
2011-Aug-15 22:36 UTC
[Lustre-discuss] Question about setting max service threads
On 2011-08-15, at 3:58 PM, Mike Hanby wrote:
> Our OSS servers are logging quite a few "heavy IO load" combined with system load (via 'uptime') being reported in the 100s to several 100s range.
>
> I'd like to test setting the ost_io.threads_max to values lower than 512.
>
> Question 1: Will this command survive a reboot "lctl set_param ost.OSS.ost_io.threads_max=256"

This is only a temporary setting.

> or do I need to also run "lctl conf_param ost.OSS.ost_io.threads_max=256"?

The conf_param syntax is (unfortunately) slightly different than the set_param syntax.  You can also set this in /etc/modprobe.d/lustre.conf:

options ost oss_num_threads=256
options mds mds_num_threads=256

> Question 2: Since Lustre "does not reduce the number of service threads in use", is there any way I can force the extra running service threads to exit, or is a reboot of the OSS servers the only clean way?

I had written a patch to do this, but it wasn't landed yet.  Currently the only way to limit the thread count is to set this before the number of running threads has exceeded the maximum thread count.

Cheers, Andreas
--
Andreas Dilger
Principal Engineer
Whamcloud, Inc.
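Putting the two answers together, a minimal sketch of how the limit might be applied both immediately and persistently, assuming the module options above are read at module load time on 1.8.x (whether oss_num_threads maps exactly onto ost_io.threads_max is worth confirming against the 1.8 manual):

  # Temporary: effective immediately, lost on reboot
  lctl set_param ost.OSS.ost_io.threads_max=256

  # Persistent: /etc/modprobe.d/lustre.conf, applied the next time the
  # ost/mds modules are loaded (i.e. after a module reload or reboot)
  options ost oss_num_threads=256
  options mds mds_num_threads=256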
Kevin Van Maren
2011-Aug-16 02:56 UTC
[Lustre-discuss] Question about setting max service threads
Andreas answered the question asked, and did an excellent job. But to answer the unasked question (will reducing the thread count really fix the problem?): this is often NOT caused by mere disk overload from too many service threads.

For example, one recent issue was tracked down to free space allocation times being quite large, due to free space bitmaps needing to be read from disk. It has also been common for memory allocations to be the major time sink, as with Lustre 1.8 the service threads no longer reuse the buffer and have to allocate new memory on every request (NUMA zoned allocations were especially problematic; apparently the "best" pages to free have a tendency of being found on the "wrong" NUMA node, so it took a lot of time/work to free up space on the local NUMA node to allow the allocation to succeed).

Bug 23826 had patches to track service times better, which will help you see how much of an issue this really is. See also Bug 22516, which strives to normalize server threads per OST, rather than per server. Bug 22886 discusses issues with the elevator taking 1MB IOs and converting them into "odd" sizes, which depending on the array could also have an impact on IO. Bug 23805 has some additional rambling along this line as well.

Kevin
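Along those lines, it may be worth confirming where the time is actually going before (or in addition to) lowering the thread count. A rough sketch, assuming the usual 1.8.x proc entries are present (parameter names should be verified on the running OSS):

  # Per-OST I/O size and latency histograms on the disk side
  lctl get_param obdfilter.*.brw_stats

  # Request statistics for the ost_io service itself
  lctl get_param ost.OSS.ost_io.stats

  # How many I/O service threads have actually been started (assumed name)
  lctl get_param ost.OSS.ost_io.threads_started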