Hello Folks,

So I'm trying to get a theoretical understanding of stripe offsets in Lustre. As I understand it, a default offset of 0 results in all writes beginning at OSS0-OST0. With a default stripe count of 4, doesn't this lead to massive hotspots on OSS0-OST[0-3] (unless *all* writes are consistently large)?

With our setup, we have 4 OSTs per OSS (well, the last OSS has 3, but that's not important right now). This would appear, in theory, to put OSS0 in a very hot situation.

That said, I wonder how efficient a solution setting the stripe offset of the root of the file system to -1 ("random") would be for this theoretical situation (given my understanding of striping under Lustre).

In reality, we have quite a varied workload on our file systems, with codes ranging from bio to astrophysics and, as such, writes ranging from very small to very large. Any real-world experience with these situations? Are there strange inefficiencies or administrative difficulties that should be known prior to enabling "random" offsets? Any info would be greatly appreciated.
----------------
John White
High Performance Computing Services (HPCS)
(510) 486-7307
One Cyclotron Rd, MS: 50B-3209C
Lawrence Berkeley National Lab
Berkeley, CA 94720
John White wrote:
> So I'm trying to get a theoretical understanding of stripe offsets in Lustre. As I understand it, a default offset of 0 results in all writes beginning at OSS0-OST0. With a default stripe count of 4, doesn't this lead to massive hotspots on OSS0-OST[0-3] (unless *all* writes are consistently large)?
>
> With our setup, we have 4 OSTs per OSS (well, the last OSS has 3, but that's not important right now). This would appear, in theory, to put OSS0 in a very hot situation.
>
> That said, I wonder how efficient a solution setting the stripe offset of the root of the file system to -1 ("random") would be for this theoretical situation (given my understanding of striping under Lustre).
>
> In reality, we have quite a varied workload on our file systems, with codes ranging from bio to astrophysics and, as such, writes ranging from very small to very large. Any real-world experience with these situations? Are there strange inefficiencies or administrative difficulties that should be known prior to enabling "random" offsets? Any info would be greatly appreciated.
> ----------------
> John White

Hi John,

AFAIK, -1 is the default, so objects are allocated randomly, avoiding the hot-spot situation you describe. Setting the offset to anything other than random is probably a bad idea.

Setting the stripe count is a more complicated question. We set the filesystem not to stripe by default, but then set striping on datasets that we know are going to be hot. Many bio apps like to read a common data set in parallel, and striping really helps performance there.

The only problem we have historically seen has been OSTs becoming unbalanced over time. If an OST fills up, users get sporadic "filesystem full" errors even though df shows free space. We found that this was typically due to code being left in debug mode and writing out multi-TByte log files with no striping, which led to single OSTs filling up. Quotas are your friend here. Lustre 1.6 and later will switch to a weighted allocator if the OSTs start to become unbalanced (see section 24.4.4 in the 1.6 manual). Making the OSTs as large as possible helps too, so that the average file size << OST size.

Ultimately you need to spend time educating your users about the pros and cons of striping, especially if you have a very mixed application set. Having good stats is important too; we have historical load graphs for the OSSs (ganglia) and for our OSTs (rrdtool data collection straight from the disk controllers). That helps us to identify times when users have got things wrong and a single OST/OSS is being hammered due to incorrect striping.

Cheers,

Guy

--
Dr Guy Coates, Informatics System Group
The Wellcome Trust Sanger Institute, Hinxton, Cambridge, CB10 1HH, UK
Tel: +44 (0)1223 834244 ex 6925
Fax: +44 (0)1223 496802

--
The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.
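For reference, the pattern Guy describes (no striping as the filesystem default, wide striping only on known-hot directories) is applied per directory with lfs setstripe; new files then inherit the directory's layout. This is only a rough sketch: the mount point and directory names are made up, and the flag spellings follow the 1.8 lfs man page, so check them against your installed version. The start OST index is deliberately left alone so the MDS keeps balancing placement.

    # Filesystem-wide default, set on the root: one stripe per file
    lfs setstripe -c 1 /mnt/lustre

    # Hot, widely-read reference dataset: stripe across all OSTs (-c -1)
    lfs setstripe -c -1 /mnt/lustre/shared/refdata

    # Verify what new files created under that directory will inherit
    lfs getstripe /mnt/lustre/shared/refdata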
On 2009-11-24, at 12:17, John White wrote:
> So I'm trying to get a theoretical understanding of stripe offsets in Lustre. As I understand it, the default offset set to 0 results in all writes beginning at OSS0-OST0. With a default stripe of 4, doesn't this lead to massive hotspots on OSS0-OST[0-3] (unless *all* writes are consistently large)?

As previously mentioned, the default is NOT to always start files with OST0, but rather to use "round-robin with precession" (not random, as is commonly mentioned), so that the OST used for stripe 0 of each file is evenly distributed among OSTs, regardless of the stripe count.

> With our setup, we have 4 OSTs per OSS (well, the last OSS has 3, but that's not important right now). This would appear, in theory, to put OSS0 in a very hot situation.
>
> That said, I wonder how efficient a solution setting the stripe offset of the root of the file system to -1 ("random") is to solving this theoretical situation (given my understanding of striping under Lustre).

Well, that is already the default, unless it has been changed at some time in the past by someone at your site. We generally recommend against ever changing the starting index of files, since there are rarely good reasons to do so. The man page writes:

    A start-ost of -1 allows the MDS to choose the starting
    index and it is strongly recommended, as this allows
    space and load balancing to be done by the MDS as needed.

> In reality, we have a quite varied workload on our file systems with codes ranging from bio to astrophys and, as such, writes ranging from very small to very large. Any real-world experience with these situations? Are there strange inefficiencies or administrative difficulties that should be known previous to enabling "random" offsets? Any info would be greatly appreciated.

It isn't random, specifically to avoid the case of non-uniform distribution when many clients are creating files at one time. With random stripe-0 OST selection, it is inevitable that some OSTs get one or two more objects and some OSTs get one or two fewer objects, and this can cause dramatic performance impacts.

For example, if the average is 2 objects per OST, but some OSTs get 4 objects and others get no objects, then the application may see an aggregate performance drop of 50% or more with random object distribution. With round-robin distribution, every OST will get 2 objects (assuming objects / OSTs is a whole number).

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
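The round-robin behaviour Andreas describes is easy to observe directly. A quick, purely illustrative check (/mnt/lustre is a placeholder mount point):

    # Create a few single-stripe files and look at which OST received stripe 0 of each
    for i in 0 1 2 3 4 5 6 7; do touch /mnt/lustre/rr-test.$i; done
    lfs getstripe /mnt/lustre/rr-test.*

    # The "obdidx" of the first (only) stripe should walk across the OSTs
    # rather than repeatedly landing on OST0. Remove the test files afterwards:
    rm /mnt/lustre/rr-test.*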
We are running Lustre 1.6, amounting to 12TB of space. We use a stripe offset and stripe count of '-1' and a stripe size of 2MB. Data on the filesystem comprises very small to very large files. Some days back we observed write failures on the fs in spite of having 1.2TB of space available (as given by df). The problem was that 2 of the OSTs were 100% full. So can we conclude that, more often than not, those 2 OSTs were chosen as the start offset for files that were smaller in size (<= 2MB)?

On Wed, Nov 25, 2009 at 2:59 AM, Andreas Dilger <adilger at sun.com> wrote:
> On 2009-11-24, at 12:17, John White wrote:
> > So I'm trying to get a theoretical understanding of stripe offsets in Lustre. As I understand it, the default offset set to 0 results in all writes beginning at OSS0-OST0. With a default stripe of 4, doesn't this lead to massive hotspots on OSS0-OST[0-3] (unless *all* writes are consistently large)?
>
> As previously mentioned, the default is NOT to always start files with OST0, but rather to use "round-robin with precession" (not random, as is commonly mentioned), so that the OST used for stripe 0 of each file is evenly distributed among OSTs, regardless of the stripe count.
>
> > With our setup, we have 4 OSTs per OSS (well, the last OSS has 3, but that's not important right now). This would appear, in theory, to put OSS0 in a very hot situation.
> >
> > That said, I wonder how efficient a solution setting the stripe offset of the root of the file system to -1 ("random") is to solving this theoretical situation (given my understanding of striping under Lustre).
>
> Well, that is already the default, unless it has been changed at some time in the past by someone at your site. We generally recommend against ever changing the starting index of files, since there are rarely good reasons to do so. The man page writes:
>
>     A start-ost of -1 allows the MDS to choose the starting
>     index and it is strongly recommended, as this allows
>     space and load balancing to be done by the MDS as needed.
>
> > In reality, we have a quite varied workload on our file systems with codes ranging from bio to astrophys and, as such, writes ranging from very small to very large. Any real-world experience with these situations? Are there strange inefficiencies or administrative difficulties that should be known previous to enabling "random" offsets? Any info would be greatly appreciated.
>
> It isn't random, specifically to avoid the case of non-uniform distribution when many clients are creating files at one time. With random stripe-0 OST selection, it is inevitable that some OSTs get one or two more objects and some OSTs get one or two fewer objects, and this can cause dramatic performance impacts.
>
> For example, if the average is 2 objects per OST, but some OSTs get 4 objects and others get no objects, then the application may see an aggregate performance drop of 50% or more with random object distribution. With round-robin distribution, every OST will get 2 objects (assuming objects / OSTs is a whole number).
>
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
--
Regards,
Rishi Pathak
National PARAM Supercomputing Facility
Center for Development of Advanced Computing (C-DAC)
Pune University Campus, Ganesh Khind Road
Pune, Maharastra
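Before concluding that the allocator favoured those two OSTs, it is worth looking at what is actually occupying them; a handful of very large or unstriped files is the usual culprit. A sketch of the commands involved (the mount point and OST UUID below are placeholders, and the exact spelling of the lfs find option should be checked against the "handling full OSTs" section of the manual for your release):

    # Per-OST space usage, rather than the aggregate that plain df reports
    lfs df -h /mnt/lustre

    # List files that have objects on a particular full OST, e.g. OST index 7
    lfs find --obd lustre-OST0007_UUID /mnt/lustre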
On 2009-11-25, at 04:12, rishi pathak wrote:
> We are running Lustre 1.6, amounting to 12TB of space. We use a stripe offset and stripe count of '-1' and a stripe size of 2MB.

Using stripe_count = -1 means "always stripe over all OSTs".

> Data on the filesystem comprises very small to very large files. Some days back we observed write failures on the fs in spite of having 1.2TB of space available (as given by df). The problem was that 2 of the OSTs were 100% full.

That is because you are requesting all files to be stored on all OSTs. If the OSTs are not the same size this can cause such problems. Also, if the OSTs are "small" then it is more likely that one or another will be filled before the others. As a general rule, it is better to have OSTs as large as possible, rather than having more small OSTs.

> So can we conclude that, more often than not, those 2 OSTs were chosen as the start offset for files that were smaller in size (<= 2MB)?

At the time the files are created, Lustre cannot know what size they will be. When there is unbalanced space usage like this, it usually means that a small number of very large files are causing the OSTs to be filled.

There is also ongoing work to improve the allocator so that it will do a better job of continually balancing space usage across all OSTs, rather than only starting to rebalance when the space usage is very different between OSTs.
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
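Following up on the stripe_count = -1 point: with a workload that is mostly small files, a narrower filesystem default (and explicit wide striping only where it is needed) keeps small files off most OSTs and makes it much harder for any single OST to fill. A hedged sketch only; the paths are placeholders and the flag spellings follow the 1.8 lfs man page:

    # Change the filesystem default: one stripe per file, 2MB stripe size,
    # leaving the start OST at its default of -1 so the MDS balances placement
    lfs setstripe -c 1 -s 2M /mnt/lustre

    # Directory where the genuinely large outputs go: stripe over all OSTs
    lfs setstripe -c -1 -s 2M /mnt/lustre/large-output

    # Note: existing files keep their old layout; only newly created files
    # pick up the new defaults (restriping a file requires copying it).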
[ ... ]

jwhite> In reality, we have a quite varied workload on our file
jwhite> systems with codes ranging from bio to astrophys and, as
jwhite> such, writes ranging from very small to very large.

Good luck with that -- there is no file system design that can cope *well* with all of that. Of course every vendor will tell you otherwise about their product :-). Also, there is no filesystem (as in: file system instance, actual storage pool) that can cope well with all of that either.

jwhite> Any real-world experience with these situations? Are
jwhite> there strange inefficiencies or administrative
jwhite> difficulties that should be known [ ... ]

Well, following on from the previous point, my recommendation would be to consider using different tools for different purposes, or at the very least different (storage) pools for different purposes.

Lustre is a good candidate for the "silver bullet" sell because it is indeed pretty decent all-round, and there isn't much else (perhaps GlusterFS) in the same class, as NFS and SMB and OpenAFS all have greater limitations in some respects (especially on Linux). It also has an underappreciated aspect: it is pretty decent even as a non-cluster filesystem (e.g. one MDT and some OSTs on a single server), because it is based on something close to 'ext4', which is ok-ish, and because its network protocol is fairly good, with arguably a better design than the NFS or SMB protocols and, quite importantly, a better client implementation than the Linux NFS or SMB clients.

But even if using Lustre for many sorts of things, I would not use a single giant storage pool anyhow. Sure, "management" consultants push the sweet drug of consolidation and virtualization, but taken to an extreme it creates a lot of problems, as a single immense storage pool becomes rigid and hard to manage. However, in many cases "management" can easily be sold on the "silver bullet" theory that a single tool and pool can support every requirement and the so-called "reality principle" need not apply.

If your case is not like that, one possible change would be to use Lustre for a lot of different applications, but with several different storage pools, each with rather different storage backends and tuning.
Andreas Dilger wrote:
> Well, that is already the default, unless it has been changed at some time in the past by someone at your site. We generally recommend against ever changing the starting index of files, since there are rarely good reasons to do so. The man page writes:
>
>     A start-ost of -1 allows the MDS to choose the starting
>     index and it is strongly recommended, as this allows
>     space and load balancing to be done by the MDS as needed.

The Lustre Manual should be updated to use that wording. It still says "random":

http://manual.lustre.org/manual/LustreManual18_HTML/StripingAndIOOptions.html#50532485_78664

Also, it lists the default stripe-count as 1 when it should be 2.

Also, he might want to be aware that round-robin is only used if no two OSTs are imbalanced by more than 20%. Otherwise, the weighted allocator kicks in:

http://manual.lustre.org/manual/LustreManual18_HTML/StripingAndIOOptions.html#50532485_pgfId-1293986

We haven't had time to look into it very closely yet, but we have been getting complaints from users that seem to be a result of the weighted allocator. It appears not to be uncommon for OSTs to get more than 20% out of balance on our systems, so the weighted allocator is in use fairly frequently.

The users are complaining of reduced filesystem bandwidth, and we suspect the weighted allocator. It results in the users' files being quite unevenly distributed among the OSTs. Obviously, this is done purposely, with files more likely to be created on OSTs that have more free space. But it also results in an unbalanced distribution of files, and therefore poor bandwidth.

We would probably prefer a simpler algorithm: possibly just stop creating new files on any OST that is 20% more full, and round-robin over the remaining OSTs.

Like I said, we haven't had time to look into it too closely, so we don't have a bug open yet. But it is something to keep in mind.

Chris
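For anyone wanting to check whether their MDS has crossed into weighted-allocator territory, the manual section Chris links documents a couple of tunables on the MDS's lov device. The parameter names below are from memory of the 1.8 documentation, so verify them on your own system before relying on this (lctl get_param/set_param need 1.6.5 or later, and these should be run on the MDS):

    # Imbalance threshold (percent) at which round-robin gives way to the weighted allocator
    lctl get_param lov.*.qos_threshold_rr

    # Weighting of free space versus round-robin once the weighted allocator is active
    lctl get_param lov.*.qos_prio_free

    # Raising the threshold keeps plain round-robin in use longer, e.g.:
    lctl set_param lov.*.qos_threshold_rr=40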
On 2009-12-29, at 13:51, Christopher J. Morrone wrote:
> Andreas Dilger wrote:
>> Well, that is already the default, unless it has been changed at some time in the past by someone at your site. We generally recommend against ever changing the starting index of files, since there are rarely good reasons to do so. The man page writes:
>>
>>     A start-ost of -1 allows the MDS to choose the starting
>>     index and it is strongly recommended, as this allows
>>     space and load balancing to be done by the MDS as needed.
>
> The Lustre Manual should be updated to use that wording. It still says "random":
>
> http://manual.lustre.org/manual/LustreManual18_HTML/StripingAndIOOptions.html#50532485_78664
>
> Also, it lists the default stripe-count as 1 when it should be 2.

AFAIK, the default stripe count is still 1. Is it possible you've changed this default locally?

> Also, he might want to be aware that round-robin is only used if no two OSTs are imbalanced by more than 20%. Otherwise, the weighted allocator kicks in:
>
> http://manual.lustre.org/manual/LustreManual18_HTML/StripingAndIOOptions.html#50532485_pgfId-1293986

Right.

> We haven't had time to look into it very closely yet, but we have been getting complaints from users that seem to be a result of the weighted allocator. It appears not to be uncommon for OSTs to get more than 20% out of balance on our systems, so the weighted allocator is in use fairly frequently.
>
> The users are complaining of reduced filesystem bandwidth, and we suspect the weighted allocator. It results in the users' files being quite unevenly distributed among the OSTs. Obviously, this is done purposely, with files more likely to be created on OSTs that have more free space. But it also results in an unbalanced distribution of files, and therefore poor bandwidth.
>
> We would probably prefer a simpler algorithm: possibly just stop creating new files on any OST that is 20% more full, and round-robin over the remaining OSTs.
>
> Like I said, we haven't had time to look into it too closely, so we don't have a bug open yet. But it is something to keep in mind.

There is bug 18547 open to track the development of an improved QOS-RR allocator. The goal is to always use round-robin allocation, but selectively skip OSTs that are too full. There would no longer be a "QOS mode" per se; it would always be active, hopefully avoiding imbalances gently as soon as they appear, rather than letting them get too far out of balance.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.