Hello,

With Lustre 1.4, we had to make sure the OSTs were added to the XML in
round-robin order. What about Lustre 1.6? Do we still need to take care
of such a thing? Should we do the first start of the OSTs in round-robin
order?

Best regards,

Patrice Bouchand

--
Patrice BOUCHAND
patrice.bouchand@ext.bull.net
Bull Echirolles B1-430
tel : +4 76 29 75 23
Patrice Bouchand wrote:
> With Lustre 1.4, we had to take care the OSTs were added in a
> round-robin order in the XML. What about Lustre 1.6?

Excellent question. Happily, the answer is "we do it automatically for
you".

There is QOS code in 1.6 to select OSTs based on location (which OSS)
and size considerations (free space). Emptier OSTs are selected for
stripes preferentially, and stripes are preferentially spread out
between OSSs (to increase network bandwidth utilization). When OSTs
have approximately the same free space (within 20%), an automatic
round-robin allocator alternates stripes between OSTs on different
OSSs. Here are some example round-robin stripe orders:

  3:     AAA
  3x3:   ABABAB
  3x4:   BBABABA
  3x5:   BBABBABA
  3x5x1: BBABABABC
  3x5x2: BABABCBABC
  4x6x2: BABABCBABABC

(The same letter represents the different OSTs on a single OSS.)

When OSTs are beyond this free-space uniformity, a weighting algorithm
is used to influence OST ordering based on size and location. There is
a user tunable in /proc/.../lov/qos_prio_free which can be increased to
put more weighting on free space. When set to 255, location is no
longer used in the stripe ordering calculations (i.e. it is based
entirely on free space). Note that these are weightings for a random
algorithm, so it will not necessarily strictly choose the "emptiest"
OST every time, but on average it will fill the emptier OSTs faster.
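[Editor's note: the balancing idea above — interleave OSTs so that an
OSS with more OSTs appears proportionally more often, while spreading
consecutive stripes across OSSs — can be sketched with a smooth
weighted round-robin. This is NOT the actual Lustre 1.6 allocator, and
the OSS/OST names are made up for illustration; it only demonstrates
the interleaving idea, and its exact output order may differ from the
patterns listed above.]

```python
from collections import deque

def stripe_order(oss_osts):
    """One full round of OST picks, each OST used exactly once.

    oss_osts: dict mapping an OSS name to the list of its OST names.
    Each OSS is weighted by its OST count (smooth weighted
    round-robin), and OSTs within an OSS are taken in rotation.
    """
    weights = {oss: len(osts) for oss, osts in oss_osts.items()}
    total = sum(weights.values())
    current = {oss: 0 for oss in oss_osts}
    rotations = {oss: deque(osts) for oss, osts in oss_osts.items()}
    order = []
    for _ in range(total):
        # bump every OSS by its weight, then pick the highest
        for oss in current:
            current[oss] += weights[oss]
        pick = max(current, key=current.get)
        current[pick] -= total
        # round-robin among the chosen OSS's own OSTs
        ost = rotations[pick].popleft()
        rotations[pick].append(ost)
        order.append(ost)
    return order

# the "3x5" case above: one OSS with 3 OSTs, one with 5
order = stripe_order({"A": ["a0", "a1", "a2"],
                      "B": ["b0", "b1", "b2", "b3", "b4"]})
```

Over one full round, the OSS with 5 OSTs is picked 5 times and the OSS
with 3 OSTs 3 times, with the picks interleaved rather than bunched.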
Thanks, this is good news.

What is the purpose of the --index option of mkfs.lustre? (I tried to
use it but with no success.)

> Excellent question. Happily, the answer is "we do it automatically for
> you".
> [...]
>
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss@clusterfs.com
> https://mail.clusterfs.com/mailman/listinfo/lustre-discuss
You can use --index=X to force an OST to a particular index within the
LOV. Because of the QOS code, however, there's not much point in doing
that. It's more helpful in tunefs.lustre when upgrading an old 1.4
filesystem to 1.6, in the case where 1.6 can't detect the old index.

Patrice Bouchand wrote:
> What is the purpose of --index option of mkfs.lustre ? (I tried to
> use it but with no sucess)
> [...]
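[Editor's note: for concreteness, here is what the --index usage
described above might look like. The filesystem name, MGS NID, and
device path are placeholders, not values from this thread; consult the
Lustre 1.6 documentation for the exact options on your version.]

```shell
# Format an OST and pin it to LOV index 2:
mkfs.lustre --ost --fsname=testfs --mgsnode=mgs@tcp0 --index=2 /dev/sdb

# When upgrading a 1.4 OST whose index 1.6 cannot auto-detect,
# set it explicitly on the existing device:
tunefs.lustre --index=2 /dev/sdb
```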
> Excellent question. Happily, the answer is "we do it
> automatically for you".

Ahhh!!! But what if I don't want it done for me? What if I really want
to fully stripe like I used to? For many of my performance runs I very
carefully laid down the stripes 8 wide, and started the next file at
8*filenumber. Will lstripe still obey me?

The --index flag works well in my case; I made use of it to stride my
OST IDs so the OSTs on a specific OSS were not hit when I striped in
that manner. At mount time the MGS just believed that's how I wanted
them organized. I do miss setting a UUID for each OST that has some
identifying info in it; if you stripe with indexes then you can just do
some math to tell you where the disk is, i.e. num % num_ost is the host
number, and int(num / num_ost) is the disk number, and such.

So really the question becomes: you wrote all this code to do it for
me, so do you have a good way I can specify my own policy, or is it all
hard-coded?

Evan
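[Editor's note: the index arithmetic Evan describes can be written out
explicitly. This is a hypothetical helper, not a Lustre utility; it
assumes indices were assigned OSS-first, i.e. OST index = oss +
num_oss * disk, where num_oss (his "num_ost") is the number of
servers.]

```python
def locate_ost(index, num_oss):
    """Map a strided OST index back to (OSS number, disk number).

    Assumes OST indices were assigned round-robin across num_oss
    servers, so the server is the remainder and the disk slot is
    the quotient.
    """
    return index % num_oss, index // num_oss

# with 4 OSSs, OST index 7 is disk 1 on OSS 3
print(locate_ost(7, 4))   # (3, 1)
```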
Felix, Evan J wrote:
> So really the question becomes, you wrote all this code to do it for
> me, do you have a good way I can specify my own policy, or is it all
> hard coded.

Unhappily, we do it all automatically for you.

Peter, do we need to think about adding a striping policy tunable?

  auto | index | space

(where auto is the current optimized policy, index is strict OST index
order, and space is the current policy weighted toward free space)
On Jul 25, 2006 10:23 -0700, Nathaniel Rutman wrote:
> Unhappily, we do it all automatically for you.
> Peter, do we need to think about adding a striping policy tunable?
> auto | index | space (where auto is the current optimized policy,
> index is strict ost index order, and space is the current policy
> weighted toward free space)

Nathan, if the OST indices are already specified "optimally", and we
are not in "QOS" mode balancing uneven space allocation, then it may
just be that the OSTs will be laid out in the same order that they
were originally specified. That would depend on the order in which the
OSSes are walked, and how the OSTs are pulled out of the per-OSS lists
when generating the "round-robin" access pattern.

Evan, that said, if your job allocates 8 stripes at a time, it
_should_ be that they will be spread evenly across all of the
available OSTs without explicitly specifying a striping pattern.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
I think that we probably do need something like that. Probably not a
huge issue? Perhaps we can do this per pool.

- peter -

> Unhappily, we do it all automatically for you.
> Peter, do we need to think about adding a striping policy tunable?
> auto | index | space (where auto is the current optimized policy,
> index is strict ost index order, and space is the current policy
> weighted toward free space)