Grigory Shamov
2012-Dec-06 19:06 UTC
[Lustre-discuss] noatime or atime_diff for Lustre 1.8.7?
Hi,

On our cluster, when there is load on the Lustre FS, at some point it slows down precipitously, and there are very many "slow IO" and "slow setattr" messages on the OSS servers:

======
[2988758.408968] Lustre: scratch-OST0004: slow i_mutex 51s due to heavy IO load
[2988758.408974] Lustre: Skipped 276 previous similar messages
[2988760.309388] Lustre: scratch-OST0004: slow setattr 50s due to heavy IO load
[2988822.617865] Lustre: scratch-OST0004: slow setattr 62s due to heavy IO load
[2988822.689819] Lustre: scratch-OST0004: slow journal start 48s due to heavy IO load
[2988822.690627] Lustre: scratch-OST0004: slow journal start 56s due to heavy IO load
[2988823.125410] Lustre: scratch-OST0004: slow parent lock 55s due to heavy IO load
[2988823.125419] Lustre: Skipped 1 previous similar message
[2988823.125432] Lustre: scratch-OST0004: slow preprw_write setup 55s due to heavy IO load
[2988856.236914] Lustre: scratch-OST0004: slow direct_io 33s due to heavy IO load
[2988856.236922] Lustre: Skipped 323 previous similar messages
[2988892.543942] Lustre: scratch-OST0004: slow i_mutex 48s due to heavy IO load
[2988892.543950] Lustre: Skipped 280 previous similar messages
[2988892.545310] Lustre: scratch-OST0004: slow setattr 55s due to heavy IO load
[2988892.547328] Lustre: scratch-OST0004: slow parent lock 42s due to heavy IO load
[2988892.547334] Lustre: Skipped 4 previous similar messages
[2988958.306720] Lustre: scratch-OST0004: slow setattr 52s due to heavy IO load
[2988958.306724] Lustre: Skipped 1 previous similar message
[2988958.310818] Lustre: scratch-OST0004: slow parent lock 59s due to heavy IO load
[2989040.406738] Lustre: scratch-OST0004: slow setattr 50s due to heavy IO load
========

I wonder whether mounting it on the clients with "noatime" and/or changing atime_diff would help get rid of these Lustre slowdowns? Right now /proc/fs/lustre/mds/scratch-MDT0000/atime_diff on our MDS server is 60.

I've tried to Google it first, and apparently "noatime" is not supported for 1.8, and changing atime_diff is the preferred way?

Could you please advise which way is better/possible, and how does one change atime_diff? Will it help? Does it require, say, a client remount, etc.?

Any ideas and advice would be greatly appreciated! Thank you very much in advance.

--
Grigory Shamov
HPC Analyst, Westgrid/Compute Canada
E2-588 EITC Building, University of Manitoba
(204) 474-9625
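The mechanical part of the question (how atime_diff is changed) maps directly onto the proc path quoted above; a minimal sketch, with 600 seconds as a purely illustrative value and the exact behaviour to be verified against a 1.8.7 system:

    # read the current setting (same value as the proc file above)
    lctl get_param mds.scratch-MDT0000.atime_diff

    # raise it; writing to the proc file directly is equivalent
    lctl set_param mds.scratch-MDT0000.atime_diff=600
    echo 600 > /proc/fs/lustre/mds/scratch-MDT0000/atime_diff

Since this is a server-side MDS tunable, no client remount should be needed, though that is an assumption to confirm against the 1.8 manual.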
Colin Faber
2012-Dec-06 19:28 UTC
[Lustre-discuss] noatime or atime_diff for Lustre 1.8.7?
Hi,

The messages indicate overloaded backend storage. You could try that; another option may be to statically set the maximum number of threads on the OSS, which should reduce load on the system and push the backlog out to your clients (hopefully).

-cf

On 12/06/2012 12:06 PM, Grigory Shamov wrote:
> I wonder if mounting it on clients with "noatime" and/or changing the atime_diff would help get rid of these Lustre slowdowns? Right now we have: /proc/fs/lustre/mds/scratch-MDT0000/atime_diff on our MDS server is 60.
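A sketch of the "statically set the maximum number of threads" suggestion, assuming the usual Lustre 1.8 tunables; the value 160 is the one reported later in this thread, and the module-option form only takes effect when the ost module is loaded:

    # runtime cap on OSS I/O service threads (threads already started are not stopped)
    lctl set_param ost.OSS.ost_io.threads_max=160

    # persistent form, in /etc/modprobe.d/lustre.conf (or modprobe.conf) on the OSS
    options ost oss_num_threads=160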
Dilger, Andreas
2012-Dec-06 19:41 UTC
[Lustre-discuss] noatime or atime_diff for Lustre 1.8.7?
On 12/6/12 12:06 PM, "Grigory Shamov" <gas5x at yahoo.com> wrote:
> I wonder if mounting it on clients with "noatime" and/or changing the
> atime_diff would help to get rid of these Lustre slowdowns? Right now we
> have: /proc/fs/lustre/mds/scratch-MDT0000/atime_diff on our MDS server
> is 60.

No atime updates are ever written to disk on the OSTs, and at most only once every 10 minutes on the MDT. This is very likely due to small IO from the client or similar. Check "lctl get_param obdfilter.*.brw_stats" to see what kind of IO pattern the clients are sending.
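Since the console messages single out one target, the same histogram can be pulled for just that OST; a minimal sketch using the target name from the log above:

    # per-target I/O size histogram for the OST named in the "slow ..." messages
    lctl get_param obdfilter.scratch-OST0004.brw_stats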
Grigory Shamov
2012-Dec-06 19:45 UTC
[Lustre-discuss] noatime or atime_diff for Lustre 1.8.7?
Dear Colin,

Thanks for the reply!

We reduced the number of OST threads earlier, from the original DDN setting of 256 to 160. It looks like that made things better, but the problem still persists. Reducing the number of OST threads to a number smaller than the number of clients seems to cause problems too..

Also, do you know whether having the OSS servers in an active-active failover configuration affects Lustre performance? Could it be that it forces a sync on all I/O, or something of this sort?

--
Grigory Shamov

--- On Thu, 12/6/12, Colin Faber <colin_faber at xyratex.com> wrote:
> The messages indicate overloaded backend storage. You could try this,
> another option may be to statically set the maximum number of threads on
> the OSS, this should reduce load to the system and push the backlogs to
> your clients (hopefully)
Grigory Shamov
2012-Dec-06 19:58 UTC
[Lustre-discuss] noatime or atime_diff for Lustre 1.8.7?
Dear Andreas,

Thank you for the reply!

So, on one of our OSS servers the load is now 160. According to collectl, only one OST does most of the work. (We don't do striping on this FS, unless users do it manually on their subdirectories.)

I've run the obdfilter stats, and for disk I/O size I get:

disk I/O size          ios   % cum % |       ios   % cum %
      4K:        282890357  34    34 |  22425884  44    44
      8K:         18651648   2    36 |    503635   0    45
     16K:         31817375   3    40 |   1415935   2    48
     32K:         47552890   5    46 |    308395   0    48
     64K:         61437915   7    53 |    248666   0    49
    128K:         72863407   8    62 |    520857   1    50
    256K:         26320421   3    65 |   1144803   2    52
    512K:         15805554   1    67 |   1703988   3    55
      1M:        264536729  32   100 |  22336867  44   100

Am I looking at the right table? So, does it mean that we have small 4K I/O, which is 34% for reads and 44% for writes, and that this is the cause of the problem?

--
Grigory Shamov

--- On Thu, 12/6/12, Dilger, Andreas <andreas.dilger at intel.com> wrote:
> No atime updates are ever written to disk on the OSTs, and at most only
> once every 10 minutes on the MDT. This is very likely due to small IO
> from the client or similar. Check "lctl get_param obdfilter.*.brw_stats"
> to see what kind of IO pattern the clients are sending.
Colin Faber
2012-Dec-06 20:01 UTC
[Lustre-discuss] noatime or atime_diff for Lustre 1.8.7?
Hi Grigory,

The active-active failover configuration should make no difference here unless you're running block-level replication between hosts (outside the scope of Lustre).

What tuning do you currently have in place? Also, what kind of client workload are you seeing (large or small file I/O)?

-cf

On 12/06/2012 12:45 PM, Grigory Shamov wrote:
> We reduced the number of OST threads earlier, from the original DDN setting of 256 to 160. It looks like that made things better, but the problem still persists.
> Also, do you know whether having the OSS servers in an active-active failover configuration affects Lustre performance? Could it be that it forces a sync on all I/O, or something of this sort?
Mohr Jr, Richard Frank (Rick Mohr)
2012-Dec-07 18:49 UTC
[Lustre-discuss] noatime or atime_diff for Lustre 1.8.7?
On Dec 6, 2012, at 2:58 PM, Grigory Shamov wrote:
> So, on one of our OSS servers the load is now 160. According to collectl, only one OST does most of the work. (We don't do striping on this FS, unless users do it manually on their subdirectories.)

This sounds similar to situations we see every now and then. The load on the OSS server climbs until it is roughly equal to the number of OSS threads (which sounds like your case with load=oss_threads=160), but only a single OST is performing any significant IO. This seems to arise when parallel jobs access the same file which has stripe_count=1. The OSS is bombarded with so many requests to a single OST that they backlog and tie up all the OSS threads. At that point, all IO to the OSS slows to a crawl no matter which OST on the OSS is being used. This becomes problematic because even a modest-sized job can effectively DOS an OSS server.

When you encounter these problems, is the IO to the affected OST primarily one-way (ie - mostly reads or mostly writes)? In our cases, we tend to see this when parallel jobs are reading from a common file. There are a couple of things that I have found that help:

1) Increase the file striping a lot. This helps spread the load over more OSTs. We have had success with striping even relatively small files (~10 GB) over 100+ OSTs. Not only does it reduce load on the OSS, but it usually speeds up the application significantly.

2) Make sure caching is enabled on the OSS. For us, this seems to help mostly when lots of processes are reading the same file.

Not sure if your situation is exactly like what I have seen, but maybe some of that info can help a bit.

--
Rick Mohr
Senior HPC System Administrator
National Institute for Computational Sciences
http://www.nics.tennessee.edu
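A sketch of suggestion 1), with the directory and stripe counts purely illustrative; on 1.8 the new striping applies only to files created after the setstripe, so it is typically set on the directory holding the shared input before the files are written:

    # stripe new files in this directory across all available OSTs
    lfs setstripe -c -1 /scratch/project/shared_inputs

    # or pick an explicit stripe count
    lfs setstripe -c 8 /scratch/project/shared_inputs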
> 2) Make sure caching is enabled on the oss.

How do you check/enable this? Is it not enabled by default?

Cheers,
Mark

----- Original Message -----
From: "Mohr Jr, Richard Frank (Rick Mohr)" <rmohr at utk.edu>
To: "Grigory Shamov" <gas5x at yahoo.com>
Cc: lustre-discuss at lists.lustre.org
Sent: Saturday, 8 December, 2012 5:19:31 AM
Subject: Re: [Lustre-discuss] noatime or atime_diff for Lustre 1.8.7?
Spyro Polymiadis
2012-Dec-09 10:29 UTC
[Lustre-discuss] noatime or atime_diff for Lustre 1.8.7?
Maybe from here?
https://fs.hlrs.de/projects/craydoc/docs/books/S-0010-31/html-S-0010-31/z1112312952ebishop.html

3.3.6 Disabling OSS Read Cache and Writethrough Cache

Lustre uses the Linux page cache to provide read-only caching of data on object storage servers (OSS). This strategy reduces disk access time caused by repeated reads from an OST.

OSS read cache is enabled by default, but you can disable it by setting /proc parameters. For example, invoke the following on the OSS:

nid00008:~ # lctl set_param obdfilter.*.read_cache_enable 0

Writethrough cache can also be disabled. This prevents file writes from ending up in the read cache. To disable writethrough cache, invoke the following on the OSS:

nid00008:~ # lctl set_param obdfilter.*.writethrough_cache_enable 0

----- Original Message -----
From: "Mark Day" <mark.day at rsp.com.au>
To: "Mohr Jr, Richard Frank (Rick Mohr)" <rmohr at utk.edu>
Cc: lustre-discuss at lists.lustre.org
Sent: Saturday, 8 December, 2012 10:52:28 AM
Subject: Re: [Lustre-discuss] noatime or atime_diff for Lustre 1.8.7?

> > 2) Make sure caching is enabled on the oss.
>
> How do you check/enable this? Is it not enabled by default?
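Conversely, the same parameters answer the check/enable question; a minimal sketch for the OSS, where 1 means enabled (the default, per the document quoted above):

    # check the current settings
    lctl get_param obdfilter.*.read_cache_enable obdfilter.*.writethrough_cache_enable

    # turn them back on if they have been disabled
    lctl set_param obdfilter.*.read_cache_enable=1
    lctl set_param obdfilter.*.writethrough_cache_enable=1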
Grigory Shamov
2012-Dec-10 16:43 UTC
[Lustre-discuss] noatime or atime_diff for Lustre 1.8.7?
Dear Richard,

Thank you very much for the reply (somehow my email filter ate it, so I knew of it only from Mark's quotation).

Yes, it seems that your analysis explains our situation. We do see it when there is predominantly "reading" activity on one of the OSTs. Actually, the volume of reading was small, which is why we couldn't even locate an application that does it. It can then be explained as really a blocking situation, not a throughput problem.

The way our system is configured, the number of OSTs is small (13). We have zero load on the MDS, and stripe count 1. The system is running and about 60% full; I wonder what would be the best strategy to change the striping now. I understand that if I just change the stripe count on the Lustre root dir, it will affect only newly created files/directories. Should I copy the users' files, stripe their directories, and then copy the data back? That sounds somewhat dangerous, especially if the users do some unusual things with symlinks..

--
Grigory Shamov
HPC Analyst, Westgrid/Compute Canada
E2-588 EITC Building, University of Manitoba
(204) 474-9625

--- On Fri, 12/7/12, Mark Day <mark.day at rsp.com.au> wrote:
> > 2) Make sure caching is enabled on the oss.
>
> How do you check/enable this? Is it not enabled by default?
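One way to restripe existing data without copying whole user trees out and back is the file-by-file copy-and-rename the message alludes to; a minimal sketch with placeholder paths and stripe counts (symlinks and files in active use need separate care):

    # widen the default striping on the directory (affects newly created files only)
    lfs setstripe -c -1 /scratch/project/shared_inputs

    # restripe one existing hot file by copying it under the new default
    # and renaming the copy over the original
    cp /scratch/project/shared_inputs/big_input.dat /scratch/project/shared_inputs/big_input.dat.restripe
    mv /scratch/project/shared_inputs/big_input.dat.restripe /scratch/project/shared_inputs/big_input.dat

Later 1.8.x releases also ship an lfs_migrate helper script that automates this copy-and-rename; whether it is present in a given installation is worth checking before relying on it.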
Mohr Jr, Richard Frank (Rick Mohr)
2012-Dec-10 16:59 UTC
[Lustre-discuss] noatime or atime_diff for Lustre 1.8.7?
On Dec 10, 2012, at 11:43 AM, Grigory Shamov wrote:
> I wonder what would be the best strategy to change the striping now. I understand that if I just change the stripe count on the Lustre root dir, it will affect only newly created files/directories. Should I copy the users' files, stripe their directories, and then copy the data back? That sounds somewhat dangerous, especially if the users do some unusual things with symlinks..

In our case, we approached the problem with user training. We didn't make any changes to the file system itself. In general, these users had small to medium files, so we wanted them to use stripe count 1 for their files. The primary exception was the shared input files that were hit hard by multiple readers. We contacted the users, explained the situation, and gave them pointers on using "lfs setstripe". Once they started running jobs and seeing 30+% speedups, they were happy to manage the striping for their files.

I don't know your situation, so I can't say if that approach would work for you. My experience has been that no matter what you choose for the default stripe count, someone will create files that don't work well with the default. So there always seems to be some need for user education.

--
Rick Mohr
Senior HPC System Administrator
National Institute for Computational Sciences
http://www.nics.tennessee.edu
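A sketch of the kind of "lfs setstripe" pointers described above, with names and counts purely illustrative; since striping is fixed at file creation, it is set on the directory (or on an empty file) before the data is written:

    # stripe everything created in this directory across all OSTs
    lfs setstripe -c -1 input_data/

    # or pre-create one wide-striped file before filling it (13 OSTs in this thread)
    lfs setstripe -c 13 input_data/mesh.bin

    # confirm the layout
    lfs getstripe input_data/mesh.bin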