Hello,

With Lustre 1.4, we had to make sure the OSTs were added to the XML in
round-robin order. What about Lustre 1.6? Do we still need to take care
of such a thing? Should we do the first start of the OSTs in round-robin
order?

Best regards,

Patrice Bouchand

--
Patrice BOUCHAND
patrice.bouchand@ext.bull.net
Bull Echirolles B1-430
tel : +4 76 29 75 23
Patrice Bouchand wrote:
> With Lustre 1.4, we had to take care the OSTs were added in a
> round-robin order in the XML. What about Lustre 1.6?

Excellent question. Happily, the answer is "we do it automatically for
you".

There is QOS code in 1.6 to select OSTs based on location (which OSS)
and size considerations (free space). Emptier OSTs are selected for
stripes preferentially, and stripes are preferentially spread out
between OSSs (to increase network bandwidth utilization). When OSTs
have approximately the same free space (within 20%), an automatic
round-robin allocator alternates stripes between OSTs on different
OSSs. Here are some example round-robin stripe orders:

  3:     AAA
  3x3:   ABABAB
  3x4:   BBABABA
  3x5:   BBABBABA
  3x5x1: BBABABABC
  3x5x2: BABABCBABC
  4x6x2: BABABCBABABC

(The same letter represents the different OSTs on a single OSS.)

When OSTs are beyond this free-space uniformity, a weighting algorithm
is used to influence OST ordering based on size and location. There is
a user tunable in /proc/.../lov/qos_prio_free which can be increased to
put more weighting on free space. When set to 255, location is no
longer used in the stripe ordering calculations (i.e. it is based
entirely on free space). Note that these are weightings for a random
algorithm, so it will not necessarily strictly choose the "emptiest"
OST every time, but on average it will fill the emptier OSTs faster.
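[Editor's note: the balancing idea above — interleave OSTs so that an
OSS with more OSTs appears proportionally more often, while spreading
consecutive stripes across OSSs — can be sketched with a smooth
weighted round-robin. This is NOT the actual Lustre 1.6 allocator, and
the OSS/OST names are made up for illustration; it only demonstrates
the interleaving idea, and its exact output order may differ from the
patterns listed above.]

```python
from collections import deque

def stripe_order(oss_osts):
    """One full round of OST picks, each OST used exactly once.

    oss_osts: dict mapping an OSS name to the list of its OST names.
    Each OSS is weighted by its OST count (smooth weighted
    round-robin), and OSTs within an OSS are taken in rotation.
    """
    weights = {oss: len(osts) for oss, osts in oss_osts.items()}
    total = sum(weights.values())
    current = {oss: 0 for oss in oss_osts}
    rotations = {oss: deque(osts) for oss, osts in oss_osts.items()}
    order = []
    for _ in range(total):
        # bump every OSS by its weight, then pick the highest
        for oss in current:
            current[oss] += weights[oss]
        pick = max(current, key=current.get)
        current[pick] -= total
        # round-robin among the chosen OSS's own OSTs
        ost = rotations[pick].popleft()
        rotations[pick].append(ost)
        order.append(ost)
    return order

# the "3x5" case above: one OSS with 3 OSTs, one with 5
order = stripe_order({"A": ["a0", "a1", "a2"],
                      "B": ["b0", "b1", "b2", "b3", "b4"]})
```

Over one full round, the OSS with 5 OSTs is picked 5 times and the OSS
with 3 OSTs 3 times, with the picks interleaved rather than bunched.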
Thanks, this is good news.

What is the purpose of the --index option of mkfs.lustre? (I tried to
use it but with no success.)

> Excellent question. Happily, the answer is "we do it automatically for
> you".
> [...]
>
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss@clusterfs.com
> https://mail.clusterfs.com/mailman/listinfo/lustre-discuss
You can use --index=X to force an OST to a particular index within the
LOV. Because of the QOS code, however, there's not much point in doing
that. It's more helpful in tunefs.lustre when upgrading an old 1.4
filesystem to 1.6, in the case where 1.6 can't detect the old index.

Patrice Bouchand wrote:
> What is the purpose of --index option of mkfs.lustre ? (I tried to
> use it but with no sucess)
> [...]
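[Editor's note: for concreteness, here is what the --index usage
described above might look like. The filesystem name, MGS NID, and
device path are placeholders, not values from this thread; consult the
Lustre 1.6 documentation for the exact options on your version.]

```shell
# Format an OST and pin it to LOV index 2:
mkfs.lustre --ost --fsname=testfs --mgsnode=mgs@tcp0 --index=2 /dev/sdb

# When upgrading a 1.4 OST whose index 1.6 cannot auto-detect,
# set it explicitly on the existing device:
tunefs.lustre --index=2 /dev/sdb
```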
> Excellent question. Happily, the answer is "we do it
> automatically for you".

Ahhh!!! But what if I don't want it done for me? What if I really want
to fully stripe like I used to? For many of my performance runs I very
carefully laid down the stripes 8 wide, and started the next file at
8*filenumber. Will lstripe still obey me?

The --index flag works well in my case; I made use of it to stride my
OST IDs so the OSTs on a specific OSS were not hit when I striped in
that manner. At mount time the MGS just believed that's how I wanted
them organized. I do miss setting a UUID for each OST that has some
identifying info in it; if you stripe with indexes then you can just do
some math to tell you where the disk is, i.e. num % num_ost is the host
number, and int(num / num_ost) is the disk number, and such.

So really the question becomes: you wrote all this code to do it for
me, so do you have a good way I can specify my own policy, or is it all
hard-coded?

Evan
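[Editor's note: the index arithmetic Evan describes can be written out
explicitly. This is a hypothetical helper, not a Lustre utility; it
assumes indices were assigned OSS-first, i.e. OST index = oss +
num_oss * disk, where num_oss (his "num_ost") is the number of
servers.]

```python
def locate_ost(index, num_oss):
    """Map a strided OST index back to (OSS number, disk number).

    Assumes OST indices were assigned round-robin across num_oss
    servers, so the server is the remainder and the disk slot is
    the quotient.
    """
    return index % num_oss, index // num_oss

# with 4 OSSs, OST index 7 is disk 1 on OSS 3
print(locate_ost(7, 4))   # (3, 1)
```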
Felix, Evan J wrote:
> So really the question becomes, you wrote all this code to do it for
> me, do you have a good way I can specify my own policy, or is it all
> hard coded.

Unhappily, we do it all automatically for you.

Peter, do we need to think about adding a striping policy tunable?

  auto | index | space

(where auto is the current optimized policy, index is strict OST index
order, and space is the current policy weighted toward free space)
On Jul 25, 2006 10:23 -0700, Nathaniel Rutman wrote:
> Unhappily, we do it all automatically for you.
> Peter, do we need to think about adding a striping policy tunable?
> auto | index | space (where auto is the current optimized policy,
> index is strict ost index order, and space is the current policy
> weighted toward free space)

Nathan, if the OST indices are already specified "optimally", and we
are not in "QOS" mode balancing uneven space allocation, then it may
just be that the OSTs will be laid out in the same order that they
were originally specified. That would depend on the order in which the
OSSes are walked, and how the OSTs are pulled out of the per-OSS lists
when generating the "round-robin" access pattern.

Evan, that said, if your job allocates 8 stripes at a time, it
_should_ be that they will be spread evenly across all of the
available OSTs without explicitly specifying a striping pattern.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
I think that we probably do need something like that. Probably not a
huge issue? Perhaps we can do this per pool.

- peter -

> Unhappily, we do it all automatically for you.
> Peter, do we need to think about adding a striping policy tunable?
> auto | index | space (where auto is the current optimized policy,
> index is strict ost index order, and space is the current policy
> weighted toward free space)