Hi all,

sorry if this question has been answered before.

What is the optimal 'strategy' for assigning OSTs to OSS nodes?

-a- Assign OSTs to the OSSs round-robin
-b- Assign OSTs in consecutive order (as long as the backend storage provides enough capacity for IOPS and bandwidth)
-c- Something 'in between' the 'extremes' of -a- and -b-

E.g.:

-a-  OSS_1   OSS_2   OSS_3
     |_      |_      |_
     OST_1   OST_2   OST_3
     OST_4   OST_5   OST_6
     OST_7   OST_8   OST_9

-b-  OSS_1   OSS_2   OSS_3
     |_      |_      |_
     OST_1   OST_4   OST_7
     OST_2   OST_5   OST_8
     OST_3   OST_6   OST_9

I thought -a- would be best for both task-local I/O (each task writes to its own file) and single-file I/O (all tasks write to a single file), since it resembles the RAID-0 approach used for disk I/O (and Sun created our first FS this way).
Has anyone made a systematic investigation of which approach is best, or does anyone have an educated opinion?
Many thanks in advance.
BR

-Frank Heckes
Forschungszentrum Juelich GmbH, 52425 Juelich
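For readers who want to experiment with the two schemes, here is a minimal Python sketch of the OST-to-OSS mappings drawn above. The function names and the 1-based numbering are illustrative assumptions taken from the diagram; nothing here is a Lustre API.

```python
def round_robin_layout(num_oss, num_ost):
    """Scheme -a-: deal OST indices out to the OSS nodes one at a time."""
    layout = {oss: [] for oss in range(1, num_oss + 1)}
    for ost in range(1, num_ost + 1):
        layout[(ost - 1) % num_oss + 1].append(ost)
    return layout


def consecutive_layout(num_oss, num_ost):
    """Scheme -b-: give each OSS a contiguous block of OST indices."""
    per_oss = -(-num_ost // num_oss)  # ceiling division
    layout = {oss: [] for oss in range(1, num_oss + 1)}
    for ost in range(1, num_ost + 1):
        layout[(ost - 1) // per_oss + 1].append(ost)
    return layout


if __name__ == "__main__":
    print(round_robin_layout(3, 9))
    print(consecutive_layout(3, 9))
```

For 3 OSS nodes and 9 OSTs, the first call puts OSTs 1, 4, 7 on OSS_1 (scheme -a-) and the second puts OSTs 1, 2, 3 on OSS_1 (scheme -b-), matching the two diagrams.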
Michael Barnes
2011-Mar-31 14:54 UTC
[Lustre-discuss] Optimal strategy for OST distribution
Frank,

File striping and allocation are essentially randomized across OSTs, so from Lustre's point of view there is no difference between a and b. AFAIK, Lustre does try to do some balancing based on available space and possibly other simple heuristics, but the ordering of the OSTs does not affect this decision-making process.

From a management point of view, b is much simpler to manage, and if you add more storage to your system, you just keep adding the OSTs in sequence.

-mb

--
Michael Barnes
Thomas Jefferson National Accelerator Facility
Scientific Computing Group
12000 Jefferson Ave.
Newport News, VA 23606
(757) 269-7634
Wojciech Turek
2011-Mar-31 15:04 UTC
[Lustre-discuss] Optimal strategy for OST distribution
I agree with Michael: keep it simple so it won't become unmanageable when you grow your system to tens or hundreds of OSTs.

From Lustre's point of view, it does not matter which OSS mounts which OST as long as the OSTs are evenly balanced across the OSSs. Lustre objects are placed on the OSTs using a load-balancing algorithm that is based on the OSTs' available space. You can change that default behaviour using OST pools.

Cheers

Wojciech

--
Wojciech Turek
Senior System Architect
High Performance Computing Service
University of Cambridge
Email: wjt27 at cam.ac.uk
Tel: (+)44 1223 763517
Kevin Van Maren
2011-Mar-31 15:06 UTC
[Lustre-discuss] Optimal strategy for OST distribution
It used to be that multi-stripe files were created with sequential OST indexes. It also used to be that OST indexes were sequentially assigned to newly-created files.
As Lustre now adds greater randomization, the strategy for assigning OSTs to OSS nodes (and storage hardware, which often limits the aggregate performance of multiple OSTs) is less important.

While I have normally gone with "a", "b" can make it easier to remember where OSTs are located, and also keep a uniform convention if the storage system is later grown.

Kevin
Ashley Pittman
2011-Mar-31 15:24 UTC
[Lustre-discuss] Optimal strategy for OST distribution
On 31 Mar 2011, at 15:06, Heckes, Frank wrote:

> -a- Assign OST via round-robin to the OSS
> -b- Assign in consecutive order (as long as the backend storage provides
> enough capacity for iops and bandwidth)

At DDN our default config is -b-, although we have done -a- at customers' request. I don't believe it makes a huge difference either way. Because of the dual controllers and dual links in a DDN system, we alternate LUNs over controllers (and hence IB ports); we do this rather than alternating OSTs, as it's a lot easier to visually verify that -b- is correct than -a-. At least one DDN reseller does it the other way, however.

Ashley.
Jeremy Filizetti
2011-Mar-31 20:59 UTC
[Lustre-discuss] Optimal strategy for OST distribution
Is this a feature implemented after 1.8.5? In the past, default striping without an offset resulted in sequential stripe allocation according to client device order for a striped file. Basically, the order in which OSTs were mounted after the last --writeconf is the order in which the targets are added to the client llog and allocated.

It's probably not a big deal with lots of clients, but it is for a small number of clients doing large sequential IO or working over the WAN. So regardless of an A or B configuration, a file with a stripe count of 3 could end up issuing IO to a single OSS instead of using round-robin between the socket/queue pairs to each OSS.

Jeremy

On Thu, Mar 31, 2011 at 11:06 AM, Kevin Van Maren wrote:

> It used to be that multi-stripe files were created with sequential OST
> indexes. It also used to be that OST indexes were sequentially assigned
> to newly-created files.
> As Lustre now adds greater randomization, the strategy for assigning
> OSTs to OSS nodes (and storage hardware, which often limits the
> aggregate performance of multiple OSTs) is less important.
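The concern about sequential allocation can be illustrated with a small Python sketch. It assumes a hypothetical 3-OSS, 9-OST system and models "client device order" as a plain list; the helper names are invented for illustration and do not correspond to anything in the Lustre code base.

```python
def sequential_stripes(device_order, start, stripe_count):
    """Pick stripe_count consecutive targets from the client device order."""
    n = len(device_order)
    return [device_order[(start + i) % n] for i in range(stripe_count)]


def oss_used(stripes, ost_to_oss):
    """Return the set of OSS nodes that serve the chosen stripes."""
    return {ost_to_oss[ost] for ost in stripes}


# Hypothetical system: OSTs 1-3 on OSS 1, 4-6 on OSS 2, 7-9 on OSS 3.
ost_to_oss = {ost: (ost - 1) // 3 + 1 for ost in range(1, 10)}

# Case 1: targets were registered one OSS at a time.
by_oss = list(range(1, 10))
# Case 2: targets were registered alternating across the OSS nodes.
interleaved = [1, 4, 7, 2, 5, 8, 3, 6, 9]

if __name__ == "__main__":
    print(oss_used(sequential_stripes(by_oss, 0, 3), ost_to_oss))
    print(oss_used(sequential_stripes(interleaved, 0, 3), ost_to_oss))
```

In the first case a 3-stripe file lands entirely on one OSS; in the second it touches all three. Under purely sequential allocation, what matters is the registration order, not whether the index-to-OSS mapping follows scheme A or B.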
No, the algorithm is not purely random; it is weighted on QOS, space and a few other things. When a stripe is chosen on one OSS, we add a penalty to the other OSTs on that OSS to prevent IO bunching on one OSS.

cliffw

On Thu, Mar 31, 2011 at 1:59 PM, Jeremy Filizetti wrote:

> Is this a feature implemented after 1.8.5? In the past default striping
> without an offset resulted in sequential stripe allocation according to
> client device order for a striped file.

--
cliffw
Support Guy
WhamCloud, Inc.
www.whamcloud.com
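To make the penalty idea concrete, here is a deliberately simplified Python sketch. It is not Lustre's allocator: the real one is described above as weighted (not deterministic), whereas this version greedily takes the highest weight each time so the effect is easy to see. Using free space as the only weight and a penalty factor of 0.5 are both assumptions made for illustration.

```python
def choose_stripes(free_space, ost_to_oss, stripe_count, penalty=0.5):
    """Greedy pick by weight, penalising OSTs that share an OSS with a pick."""
    weights = dict(free_space)  # working copy; weight starts as free space
    chosen = []
    for _ in range(min(stripe_count, len(weights))):
        # Highest weight wins; ties are broken by the lowest OST index.
        ost = max(weights, key=lambda o: (weights[o], -o))
        chosen.append(ost)
        del weights[ost]
        # Down-weight the remaining OSTs served by the same OSS.
        for other in weights:
            if ost_to_oss[other] == ost_to_oss[ost]:
                weights[other] *= penalty
    return chosen


# Hypothetical system: OSTs 1 and 2 on OSS 1, OSTs 3 and 4 on OSS 2.
ost_to_oss = {1: 1, 2: 1, 3: 2, 4: 2}
free_space = {1: 100, 2: 90, 3: 80, 4: 70}

if __name__ == "__main__":
    print(choose_stripes(free_space, ost_to_oss, 2))
    print(choose_stripes(free_space, ost_to_oss, 2, penalty=1.0))
```

With the penalty, the two stripes go to OSTs 1 and 3 (different OSS nodes); with `penalty=1.0`, i.e. no penalty, they go to OSTs 1 and 2 on the same OSS.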
Andreas Dilger
2011-Apr-01 05:54 UTC
[Lustre-discuss] Optimal strategy for OST distribution
Actually, the MDS will still assign OST indices in round-robin order unless the free space is more than 20% imbalanced. However, it will internally do an OSS-first ordering of the OSTs to ensure maximum spreading of load across the OSS nodes. For details see lustre/lov/lov_qos.c.

Cheers, Andreas

On 2011-03-31, at 5:06 AM, Kevin Van Maren wrote:

> It used to be that multi-stripe files were created with sequential OST
> indexes. It also used to be that OST indexes were sequentially assigned
> to newly-created files.
> As Lustre now adds greater randomization, the strategy for assigning
> OSTs to OSS nodes (and storage hardware, which often limits the
> aggregate performance of multiple OSTs) is less important.
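As a rough illustration of the two ideas in this reply, the Python sketch below interleaves OSTs "OSS-first" and checks a free-space imbalance threshold. It is not a transcription of lustre/lov/lov_qos.c: the function names are invented, and defining "20% imbalanced" as the relative gap between the fullest and emptiest OST is an assumption.

```python
from itertools import zip_longest


def oss_first_order(layout):
    """Interleave OSTs so consecutive entries sit on different OSS nodes."""
    rounds = zip_longest(*(layout[oss] for oss in sorted(layout)))
    return [ost for rnd in rounds for ost in rnd if ost is not None]


def is_imbalanced(free_space, threshold=0.20):
    """True if the relative gap between max and min free space exceeds threshold."""
    lo, hi = min(free_space.values()), max(free_space.values())
    return hi > 0 and (hi - lo) / hi > threshold


# The two layouts from the original question (OSS number -> OST indices).
layout_a = {1: [1, 4, 7], 2: [2, 5, 8], 3: [3, 6, 9]}
layout_b = {1: [1, 2, 3], 2: [4, 5, 6], 3: [7, 8, 9]}

if __name__ == "__main__":
    print(oss_first_order(layout_a))
    print(oss_first_order(layout_b))
    print(is_imbalanced({1: 100, 2: 90, 3: 85}))
```

Under this ordering, any three consecutive entries fall on three different OSS nodes for both layouts, which is consistent with the point that the choice between -a- and -b- matters little for load spreading.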