We recently set up a Lustre configuration: 1 MDS and 4 OSSs. Everything is running fine except that the first OSS is experiencing very high CPU load. The first OSS is running a CPU load in the high 50s, while the other 3 OSSs are steady at around 8. Everything is configured the same across all of the OSSs.

The default striping is set up as:

    stripe_count:  1  stripe_offset:  -1

Red Hat 5 64-bit:

    kernel-2.6.18-194.3.1.el5_lustre.1.8.4
    kernel-devel-2.6.18-194.3.1.el5_lustre.1.8.4
    lustre-ldiskfs-3.1.3-2.6.18_194.3.1.el5_lustre.1.8.4
    lustre-1.8.4-2.6.18_194.3.1.el5_lustre.1.8.4
    lustre-modules-1.8.4-2.6.18_194.3.1.el5_lustre.1.8.4

Is there anything I can check on the problem OSS to rectify this issue?

Thank you in advance

Rocky
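For readers following along, defaults like these are usually reported by lfs getstripe on the filesystem root; a minimal sketch, assuming the client mount point /san that appears later in the thread:

    # On a client: show the filesystem-wide default striping
    lfs getstripe -d /san
    # Expected output along the lines of:
    #   stripe_count: 1  stripe_size: ...  stripe_offset: -1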
Hello,

Normally when stripe_offset is set to -1, the MDS will do load/space balancing automatically. What is your usage pattern on the filesystem? It sounds like your applications are doing extensive I/O on that particular OSS.

To find out why the load on that OSS is so high, please:

- find out which processes are hogging the CPUs, using top(1).
- get the stripe info of your in-use files to see whether most of them reside on the same OSS.

If the files in use are not distributed among the OSS servers, or your file usage pattern is bound to one OSS, you may want to consider tuning stripe_count/stripe_size.
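To make those checks concrete, a minimal sketch using the /san mount point from later in the thread; the file path is a placeholder, and lustre-OST0000_UUID stands in for whichever OSTs the busy OSS actually serves:

    # On the busy OSS: see which processes/threads are using the CPU
    top -b -n 1 | head -40

    # On a client: show which OST (obdidx) an in-use file is striped to
    lfs getstripe /san/path/to/active/file

    # On a client: list files that have objects on a particular OST,
    # e.g. one of the OSTs served by the busy OSS
    lfs find --obd lustre-OST0000_UUID /san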
The data used on the filesystem is pretty transient. Files are created and then moved off to other locations not on the Lustre system.

I did look at top and do not find one specific process that is hogging the CPU; it looks pretty much the same across each OSS. Quite a few of these is about all there is:

    ll_ost_io_01

The striping seems to be going across all of the OSTs correctly:

    UUID                  bytes     Used  Available  Use%  Mounted on
    lustre-MDT0000_UUID  726.2G     1.7G     683.0G    0%  /san[MDT:0]
    lustre-OST0000_UUID    2.4T   742.2G       1.5T   30%  /san[OST:0]
    lustre-OST0001_UUID    2.4T   696.6G       1.6T   28%  /san[OST:1]
    lustre-OST0002_UUID    2.4T   729.9G       1.5T   30%  /san[OST:2]
    lustre-OST0003_UUID    2.4T   736.1G       1.5T   30%  /san[OST:3]
    lustre-OST0004_UUID    2.4T   757.1G       1.5T   31%  /san[OST:4]
    lustre-OST0005_UUID    2.4T   784.7G       1.5T   32%  /san[OST:5]
    lustre-OST0006_UUID    2.4T   898.8G       1.4T   37%  /san[OST:6]
    lustre-OST0007_UUID    2.4T   762.2G       1.5T   31%  /san[OST:7]

    filesystem summary:   18.9T     6.0T      12.0T   31%  /san

Thanks again.

Rocky
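Balanced space usage does not necessarily mean balanced request load, so it may also be worth comparing the per-OST I/O counters on each OSS. A rough sketch, assuming the Lustre 1.8 obdfilter parameter names; verify the exact paths against /proc/fs/lustre on your servers:

    # Per-OST request/byte counters on this OSS
    lctl get_param obdfilter.*.stats

    # Clear the counters, sample during a busy period, and re-read
    # (writing to the stats file resets it)
    lctl set_param obdfilter.*.stats=clear
    sleep 60
    lctl get_param obdfilter.*.stats | grep -E 'read_bytes|write_bytes'

    # Per-OST I/O size histograms, handy for spotting small or
    # fragmented I/O that drives up CPU use
    lctl get_param obdfilter.*.brw_stats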
Reducing oss_num_threads in modprobe.conf may help.
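For reference, the OSS service thread count is controlled by the ost module's oss_num_threads parameter; a sketch of the modprobe.conf entry, with 256 as an example value rather than a recommendation from this thread:

    # /etc/modprobe.conf on each OSS
    # Limit the number of OSS service threads created at module load time.
    # 256 is an example value; size it to the server's CPUs, RAM and
    # back-end disk bandwidth.
    options ost oss_num_threads=256

A modprobe.conf change only takes effect the next time the Lustre modules are loaded, so it requires unmounting the OSTs and reloading the modules (or rebooting the OSS).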
All of our OSSs are configured the same. However, on the first OSS, which is experiencing the high CPU load, the number of I/O threads started is much higher than on the other OSS servers.

The server that is experiencing high CPU:

    ost.OSS.ost_io.threads_started=175

The other OSS servers:

    ost.OSS.ost_io.threads_started=128

All of the OSS servers are configured like this:

    ost.OSS.ost_io.threads_min=128
    ost.OSS.ost_io.threads_max=512

Any direction/information to resolve this issue is greatly appreciated. If any other info is needed please let me know.

Rocky
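For what it's worth, those tunables can be read and capped at runtime with lctl; a sketch, with 256 purely as an example value. As an assumption to verify, lowering threads_max does not appear to stop I/O threads that are already running, so threads_started may not drop until the OSTs on that server are restarted:

    # On the busy OSS: current I/O service thread counts
    lctl get_param ost.OSS.ost_io.threads_started
    lctl get_param ost.OSS.ost_io.threads_min ost.OSS.ost_io.threads_max

    # Cap the maximum at runtime (256 is an example value only)
    lctl set_param ost.OSS.ost_io.threads_max=256

Pairing this with the "options ost oss_num_threads=..." modprobe.conf entry mentioned earlier in the thread would keep the cap across module reloads.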