emoly.liu
2009-Jan-13 16:16 UTC
[Lustre-discuss] [ROMIO Req #940] [Fwd: Re: [Lustre-devel] a new Lustre ADIO driver]
Hi rob, Thanks for reviewing the patch so carefully. Sorry for this delay reply. Robert Latham wrote:> On Fri, Nov 28, 2008 at 01:33:39PM +0800, emoly.liu wrote: > >> Hello, >> >> Per Rob''s comment, I send the Lustre ADIO patch here. FYI, the brief >> introduction is given below. >> Any advice is welcome. >> > > Hi LiuYing > > I finally had a chance to look at your patch. Sorry it took me so > long. I''m glad to see the Lustre folks taking MPI-IO more seriously. > > First off, the patch looks pretty good. It''s clear why the approaches > you''re taking give you the improvements in MPI-IO performance. > Naturally, I''ve got some comments. > > Your patch modifies or creates 11 files. I''ll go through it file by > file and tell you what I think. I''ve also copied Wei-keng Liao from > Northwestern. He has done a lot of work tuning ROMIO for lustre and > he might have some comments or suggestions, as he has done some of > these things in a slightly different way. Weikuan, I presume you > worked with LiuYing on this patch? If not, any comments from you? > > OK, here we go! > > ad_lustre/Makefile.in > - just fine. building new objects > > ad_lustre/README > - I''m happy to have this, and you might want to add even more content >yes> ad_lustre/ad_lustre.c > - Just a small change to call the two new lustre-specific strided and > strided-collective routines > > ad_lustre/ad_lustre.h > - bring in lustre structs/constants if "lustre_user.h" not getting > brought in: is this really necessary? ROMIO won''t try to build > ad_lustre unless ''lustre'' is part of "--with-file-systems" at > configure time--with-file-systems=lustre only tells it will build lustre adio, but it doesn''t show where is the lustre_user.h, which is needed by ad_open.c (for setting stripe stuff). So if there is no lustre_user.h in /user/include (no lustre-devel rpm in your system), hmm, maybe we should only give error msg at this moment, instead of adding this if.> ad_lustre/ad_lustre_aggregate.c > - Lots of changes. I see you started from adio/common/ad_aggregate.c > > - ADIOI_LUSTRE_Docollect: interesting policy to avoid collective I/O > if any piece bigger than a big_req hint. > - You removed a lot of comments. why? >Some comments are not suitable to lustre adio driver and some were probably removed during the codes cleanup. I will restore the latter.> - here again, you should refer to ROMIO hint struct instead of calling > MPI_Info functions >Sorry, I am not sure what ROMIO hint structure you mean. I assume it is ADIOI_Hints_struct. I can add lustre hints to that struct, but they are lustre related, not common. Is that ok? If I misunderstand, please advise.> - maybe we need, as wei-keng suggests, a better separation so you > don''t need to duplicate Calc_my_req just so you can call the > lustre-sepecific Calc_aggregator. Could a hint do this for you? >yes, that would be better if there is a separation.> - same for Calc_others_req. I see you want to change ROMIO policy, > and that makes sense for you, but too bad we don''t have a better way > than for you to copy huge chunks of romio code only to make small > modifications. >yes, in Calc_others_req we try to reduce MPI_Alltoall operations to get some performance improvement.> ad_lustre/ad_lustre_hints.c > - new hints: please prefix lustre-specific with ''lustre'' or > ''romio_lustre'' Yes, we are inconsistent about this, so you probably > learned this bad habbit from us. Please do as I say, not as I do > (or have done in the past :> ) >sure, I will. :-)> - Also, note that for all the ROMIO hints, we add to an internal-to > ROMIO hint structure. It''s a bit faster to read from these > "natural" hints instead of going through the MPI_Info interface and > parsing strings. >mentioned above> ad_lustre/ad_lustre_open.c > - looks fine. I''m a little concerned about rank 0 setting > stripe/stride with ioctl while everyone sits in barrier. ADIO_Open > will call lustre_open with COMM_SELF in create case, but that > doesn''t help since you need to be able to adjust on open. What > happens when you adjust an already-created file? >You''re right. We shouldn''t set lustre striping info only according to O_CREAT flag because fd will return a wrong value for an existing file. I will fix it. Thanks.> ad_lustre/ad_lustre_rwcontig.c > - just new copyright... but nothing changed? >Sorry, I will remove it.> ad_lustre/ad_lustre_wrcoll.c > > - this is more like a fork of adio/common/ad_write_str.c and not a > copy. Watch out for changes we''ve made in adio/common. I can make > an honest effort to change ad_lustre when we change adio/common, but > we don''t have a lustre test system. > - missing some 64-bit Aint casting >I will add it.> - took out most of the comments from adio/common/ad_write_coll.c -- why? >I will restore some.> - missing the "new MPE" logging anthony put into rest of ROMIO >I had a look at MPE events defined in adioi.h. Could you tell me what else MPE logs you hope to see?> - yet another implementation of "rounding domains to stripe size". > - interesting: disable data sieving in two-phase >In our tests, data sieving showed bad collective write performance for some kinds of workloads. To avoid this, we define ds_in_coll hint, distinguished from the one in independent I/O.> - optimization: instead of MPI_Info_get, use ROMIO hint struct. >fd->hints> - missing wei-keng''s reworked type processing >Sorry, I checked the codes but I didn''t find anything missed commented by wei-keng. Could you tell me what type processing is missed in this file?> > > ad_lustre/ad_lustre_wrstr.c > - some formating changes > - also missing some 64-bit Aint casts >I will add it.> common/ad_write_coll.c > - make ADIOI_Heap_merge "public" to ROMIO. That''s fine: sure beats copying it >:-)> So, for the next steps, could you at the very least do the following: > - change the hint names > - use the fd->hints structure instead of MPI Info routines when you > are strictly internal to ROMIO. Continue to populate the users > MPI_INFO. >OK, I will refine the patch per your comments and send it to you later.> Other than that, I''ll commit this to our trunk. I will need Sun and > Oak Ridge''s help to keep the ad_lustre driver up to date. > > Ideally, we would have most of the ad_lustre functionality as "tunable > knobs" in the common/ code. That''s a matter of software engineering > we just haven''t tackled yet, so a separate ad_lustre is just fine for > now. > > Thanks for the patch! >If I misunderstand or miss anything, please let me know. Thank you so much! LiuYing> ==rob > > >> -------- Original Message -------- >> Subject: Re: [Lustre-discuss] [Lustre-devel] a new Lustre ADIO driver >> Date: Wed, 12 Nov 2008 11:35:48 -0600 >> From: Robert Latham <robl at mcs.anl.gov> >> To: emoly.liu <Emoly.Liu at Sun.COM> >> CC: lustre-discuss <lustre-discuss at lists.lustre.org>, >> lustre-devel at lists.lustre.org >> References: <49110B1A.5030201 at sun.com> <49114151.10908 at sun.com> >> >> >> >> On Wed, Nov 05, 2008 at 02:46:41PM +0800, emoly.liu wrote: >> >>> Hi, >>> >>> Here is some description and performance results for the new Lustre >>> ADIO driver. You might be interested. >>> >>> The main purpose of the new lustre adio driver is to collect the small >>> size I/O requests into fewer number of larger ones according to >>> Lustre''s striping feature with low overhead automatically. We also >>> provide some hints for performance tuning. >>> >>> In the following graph, the recent FLASH I/O benchmarking result shows >>> that the new driver performs much better than the old one, especially >>> for a large-scale system. >>> In this test, FLASH I/O blocksize is 4*4*4 that means each 3 clients >>> write 40KB, 40.5KB and 42KB respectively. The I/O operation iteration >>> (nvar) is set to 1024. >>> >> Hi LiuYing >> >> Since this patch is for ROMIO and not really for Lustre, could you >> send this to romio-maint at mcs.anl.gov ? It will reach more of the >> ROMIO developers that way. >> >> Thanks >> ==rob >> >> >-- Best regards, LiuYing System Software Engineer, Lustre Group Sun Microsystems ( China ) Co. Limited -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.lustre.org/pipermail/lustre-discuss/attachments/20090114/2505f635/attachment-0001.html
Robert Latham
2009-Jan-13 19:52 UTC
[Lustre-discuss] [ROMIO Req #940] [Fwd: Re: [Lustre-devel] a new Lustre ADIO driver]
Hi LiuYing. Thanks for giving my comments such consideration. I''ve only responded to a few points below inline: On Wed, Jan 14, 2009 at 12:16:34AM +0800, emoly.liu wrote:>> ad_lustre/ad_lustre.h >> - bring in lustre structs/constants if "lustre_user.h" not getting >> brought in: is this really necessary? ROMIO won''t try to build >> ad_lustre unless ''lustre'' is part of "--with-file-systems" at >> configure time> --with-file-systems=lustre only tells it will build lustre adio, but > it doesn''t show where is the lustre_user.h, which is needed by > ad_open.c (for setting stripe stuff). So if there is no > lustre_user.h in /user/include (no lustre-devel rpm in your system), > hmm, maybe we should only give error msg at this moment, instead of > adding this if.This is sort of what we do for PVFS and PVFS2: if the user explicitly asks for pvfs or pvfs2, and ROMIO configure.in cannot find the necessary header files, we AC_ERROR. I''d like to see the same behavior for lustre too.>> ad_lustre/ad_lustre_aggregate.c - Lots of changes. I see you started >> from adio/common/ad_aggregate.c >> >> - ADIOI_LUSTRE_Docollect: interesting policy to avoid collective I/O >> if any piece bigger than a big_req hint. >> - You removed a lot of comments. why? >> > Some comments are not suitable to lustre adio driver and some were > probably removed during the codes cleanup. I will restore the latter.I can certainly appreciate the need to remove misleading comments. All the ROMIO maintainers will tell you, though, that this code is tricky and we look at it infrequently enough that it''s really pretty hard to over-comment it.>> - here again, you should refer to ROMIO hint struct instead of calling >> MPI_Info functions >> > Sorry, I am not sure what ROMIO hint structure you mean. I assume it is > ADIOI_Hints_struct. I can add lustre hints to that struct, but they are > lustre related, not common. Is that ok? > If I misunderstand, please advise.Sure. I''ll be happy to explain further. You''re on the right track. In adioi.h we have ADIOI_Hints_struct. This struct contains many fs-independent values (cb_buffer_size, striping_unit), but it also contains a union for fs-specific values. Right now, that union only has values for pvfs and pvfs2, but we should add another member for lustre. Here''s an example of ROMIO using the hints in this structure (adio/ad_pvfs2/ad_pvfs2_read.c) if (fd->hints->fs_hints.pvfs2.listio_read == ADIOI_HINT_ENABLE) { ret = ADIOI_PVFS2_ReadStridedListIO(fd, buf, count, datatype, file_ptr_type, offset, status, error_code); return; } No need to extract the "romio_pvfs2_listio" keyval from the MPI_INFO member, no need to strcasecmp the info value. Those steps were all done at MPI_FILE_OPEN or MPI_FILE_SET_VIEW time. It''s a small optimization, and what you do now is correct, but making use of ADIOI_Hints_struct will make your code more like the other drivers.>> - maybe we need, as wei-keng suggests, a better separation so you >> don''t need to duplicate Calc_my_req just so you can call the >> lustre-sepecific Calc_aggregator. Could a hint do this for you? >> > yes, that would be better if there is a separation.OK, glad we''re on the same page. I don''t think you need to do this bit of software engineering for your patch. It''s more a note to ourselves: if we ever redesign ROMIO, we''ll have to be sure to make this part of it.>> - missing the "new MPE" logging anthony put into rest of ROMIO >> > I had a look at MPE events defined in adioi.h. Could you tell me what > else MPE logs you hope to see?Take a look at some of the other files that have #ifdef ADIOI_MPE_LOGGING lines. (Unfortunately, there are still some legacy MPE lines in the code, too. I''ll have to delete them) The events in adioi.h are exactly what I''d like to see. I see some lustre files already have some of them. Here''s an example of a modern call (do it like this): MPE_Log_event( ADIOI_MPE_open_a, 0, NULL ); and here''s an ancient call (don''t do this): MPE_Log_event(9, 0, "start close"); In the end, though, this is for developers to see what''s going on inside ROMIO with JUMPSHOT. it''s not critical, though you might find it helpful when you are looking for answers to performance questions.>> - missing wei-keng''s reworked type processing > Sorry, I checked the codes but I didn''t find anything missed > commented by wei-keng. Could you tell me what type processing is > missed in this file?This one is a bit hard to describe, because the initial work went into revision 958, but then we did some testing and refining for several more revisions before settling on a final version. It''s ok if you don''t incorporate this change, but if you''re motivated, svn revision 1001 makes some changes to ad_read_coll which are demonstrative of the overall changes also needed in ad_write_coll, ad_write_str, and ad_read_str. In ad_lustre, you have one file for "io", so you have already reduced your work by half. Good.> If I misunderstand or miss anything, please let me know. > Thank you so much!Thank you for the good discussion. I''m looking forward to seeing the next revision. ==rob -- Rob Latham Mathematics and Computer Science Division A215 0178 EA2D B059 8CDF Argonne National Lab, IL USA B29D F333 664A 4280 315B
Robert Latham
2009-Feb-23 23:09 UTC
[Lustre-discuss] [ROMIO Req #940] [Fwd: Re: [Lustre-devel] a new Lustre ADIO driver]
Hi. Just wanted to see how things are going with this patch. I think the MPICH2 folks are looking to start the release process in the next few weeks, and I''d like to get this patchset into that distribution. Thanks ==rob On Tue, Jan 13, 2009 at 01:52:59PM -0600, Robert Latham wrote:> Hi LiuYing. Thanks for giving my comments such consideration. I''ve > only responded to a few points below inline: > > On Wed, Jan 14, 2009 at 12:16:34AM +0800, emoly.liu wrote: > >> ad_lustre/ad_lustre.h > >> - bring in lustre structs/constants if "lustre_user.h" not getting > >> brought in: is this really necessary? ROMIO won''t try to build > >> ad_lustre unless ''lustre'' is part of "--with-file-systems" at > >> configure time > > > --with-file-systems=lustre only tells it will build lustre adio, but > > it doesn''t show where is the lustre_user.h, which is needed by > > ad_open.c (for setting stripe stuff). So if there is no > > lustre_user.h in /user/include (no lustre-devel rpm in your system), > > hmm, maybe we should only give error msg at this moment, instead of > > adding this if. > > This is sort of what we do for PVFS and PVFS2: if the user explicitly > asks for pvfs or pvfs2, and ROMIO configure.in cannot find the > necessary header files, we AC_ERROR. I''d like to see the same > behavior for lustre too. > > >> ad_lustre/ad_lustre_aggregate.c - Lots of changes. I see you started > >> from adio/common/ad_aggregate.c > >> > >> - ADIOI_LUSTRE_Docollect: interesting policy to avoid collective I/O > >> if any piece bigger than a big_req hint. > >> - You removed a lot of comments. why? > >> > > Some comments are not suitable to lustre adio driver and some were > > probably removed during the codes cleanup. I will restore the latter. > > I can certainly appreciate the need to remove misleading comments. > All the ROMIO maintainers will tell you, though, that this code is > tricky and we look at it infrequently enough that it''s really pretty > hard to over-comment it. > > >> - here again, you should refer to ROMIO hint struct instead of calling > >> MPI_Info functions > >> > > Sorry, I am not sure what ROMIO hint structure you mean. I assume it is > > ADIOI_Hints_struct. I can add lustre hints to that struct, but they are > > lustre related, not common. Is that ok? > > If I misunderstand, please advise. > > Sure. I''ll be happy to explain further. > > You''re on the right track. In adioi.h we have ADIOI_Hints_struct. > This struct contains many fs-independent values (cb_buffer_size, > striping_unit), but it also contains a union for fs-specific values. > Right now, that union only has values for pvfs and pvfs2, but we > should add another member for lustre. > > Here''s an example of ROMIO using the hints in this structure > (adio/ad_pvfs2/ad_pvfs2_read.c) > > if (fd->hints->fs_hints.pvfs2.listio_read == ADIOI_HINT_ENABLE) { > ret = ADIOI_PVFS2_ReadStridedListIO(fd, buf, count, datatype, > file_ptr_type, offset, status, error_code); > return; > } > > No need to extract the "romio_pvfs2_listio" keyval from the MPI_INFO > member, no need to strcasecmp the info value. Those steps were all > done at MPI_FILE_OPEN or MPI_FILE_SET_VIEW time. It''s a small > optimization, and what you do now is correct, but making use of > ADIOI_Hints_struct will make your code more like the other drivers. > > >> - maybe we need, as wei-keng suggests, a better separation so you > >> don''t need to duplicate Calc_my_req just so you can call the > >> lustre-sepecific Calc_aggregator. Could a hint do this for you? > >> > > yes, that would be better if there is a separation. > > OK, glad we''re on the same page. I don''t think you need to do this > bit of software engineering for your patch. It''s more a note to > ourselves: if we ever redesign ROMIO, we''ll have to be sure to make > this part of it. > > >> - missing the "new MPE" logging anthony put into rest of ROMIO > >> > > I had a look at MPE events defined in adioi.h. Could you tell me what > > else MPE logs you hope to see? > > Take a look at some of the other files that have > #ifdef ADIOI_MPE_LOGGING > lines. > (Unfortunately, there are still some legacy MPE lines in the code, > too. I''ll have to delete them) > > The events in adioi.h are exactly what I''d like to see. I see some > lustre files already have some of them. > > Here''s an example of a modern call (do it like this): > MPE_Log_event( ADIOI_MPE_open_a, 0, NULL ); > > and here''s an ancient call (don''t do this): > MPE_Log_event(9, 0, "start close"); > > In the end, though, this is for developers to see what''s going on > inside ROMIO with JUMPSHOT. it''s not critical, though you might find > it helpful when you are looking for answers to performance questions. > > >> - missing wei-keng''s reworked type processing > > Sorry, I checked the codes but I didn''t find anything missed > > commented by wei-keng. Could you tell me what type processing is > > missed in this file? > > This one is a bit hard to describe, because the initial work went into > revision 958, but then we did some testing and refining for several > more revisions before settling on a final version. It''s ok if you > don''t incorporate this change, but if you''re motivated, svn revision > 1001 makes some changes to ad_read_coll which are demonstrative of the > overall changes also needed in ad_write_coll, ad_write_str, and > ad_read_str. > > In ad_lustre, you have one file for "io", so you have already reduced > your work by half. Good. > > > If I misunderstand or miss anything, please let me know. > > Thank you so much! > > Thank you for the good discussion. I''m looking forward to seeing the > next revision. > > ==rob >-- Rob Latham Mathematics and Computer Science Division A215 0178 EA2D B059 8CDF Argonne National Lab, IL USA B29D F333 664A 4280 315B
emoly.liu
2009-Mar-02 03:27 UTC
[Lustre-discuss] [ROMIO Req #940] [Fwd: Re: [Lustre-devel] a new Lustre ADIO driver]
Hi Robert, Here is the new lustre adio driver patch. I fixed the following problem per our discussion: * change the hints name o from xxx to romio_lustre_xxx * use the fd->hints structure instead of MPI Info routines o define a struct for lustre in ADIOI_Hints_struct in adioi.h and replace the old MPI_Info_get calls with the new romio hints * check lustre/lustre_user.h header file in configure instead of giving the lustre structs/constants o define AC_CHECK_HEADERS in romio/configure.in. If the header file doesn''t exist, report AC_MSG_ERROR * restore mis-removed comments * new MPE logging o add MPE logging for read/write in ad_lustre_rwconfig.c * fix the issue in ad_lustre_open.c I tested the new driver on less than 10 nodes with IOR benchmark. Please check and if you have any questions, please let me know. Thanks, LiuYing Robert Latham wrote:> Hi. Just wanted to see how things are going with this patch. I think > the MPICH2 folks are looking to start the release process in the next > few weeks, and I''d like to get this patchset into that distribution. > > Thanks > ==rob > > On Tue, Jan 13, 2009 at 01:52:59PM -0600, Robert Latham wrote: > >> Hi LiuYing. Thanks for giving my comments such consideration. I''ve >> only responded to a few points below inline: >> >> On Wed, Jan 14, 2009 at 12:16:34AM +0800, emoly.liu wrote: >> >>>> ad_lustre/ad_lustre.h >>>> - bring in lustre structs/constants if "lustre_user.h" not getting >>>> brought in: is this really necessary? ROMIO won''t try to build >>>> ad_lustre unless ''lustre'' is part of "--with-file-systems" at >>>> configure time >>>> >>> --with-file-systems=lustre only tells it will build lustre adio, but >>> it doesn''t show where is the lustre_user.h, which is needed by >>> ad_open.c (for setting stripe stuff). So if there is no >>> lustre_user.h in /user/include (no lustre-devel rpm in your system), >>> hmm, maybe we should only give error msg at this moment, instead of >>> adding this if. >>> >> This is sort of what we do for PVFS and PVFS2: if the user explicitly >> asks for pvfs or pvfs2, and ROMIO configure.in cannot find the >> necessary header files, we AC_ERROR. I''d like to see the same >> behavior for lustre too. >> >> >>>> ad_lustre/ad_lustre_aggregate.c - Lots of changes. I see you started >>>> from adio/common/ad_aggregate.c >>>> >>>> - ADIOI_LUSTRE_Docollect: interesting policy to avoid collective I/O >>>> if any piece bigger than a big_req hint. >>>> - You removed a lot of comments. why? >>>> >>>> >>> Some comments are not suitable to lustre adio driver and some were >>> probably removed during the codes cleanup. I will restore the latter. >>> >> I can certainly appreciate the need to remove misleading comments. >> All the ROMIO maintainers will tell you, though, that this code is >> tricky and we look at it infrequently enough that it''s really pretty >> hard to over-comment it. >> >> >>>> - here again, you should refer to ROMIO hint struct instead of calling >>>> MPI_Info functions >>>> >>>> >>> Sorry, I am not sure what ROMIO hint structure you mean. I assume it is >>> ADIOI_Hints_struct. I can add lustre hints to that struct, but they are >>> lustre related, not common. Is that ok? >>> If I misunderstand, please advise. >>> >> Sure. I''ll be happy to explain further. >> >> You''re on the right track. In adioi.h we have ADIOI_Hints_struct. >> This struct contains many fs-independent values (cb_buffer_size, >> striping_unit), but it also contains a union for fs-specific values. >> Right now, that union only has values for pvfs and pvfs2, but we >> should add another member for lustre. >> >> Here''s an example of ROMIO using the hints in this structure >> (adio/ad_pvfs2/ad_pvfs2_read.c) >> >> if (fd->hints->fs_hints.pvfs2.listio_read == ADIOI_HINT_ENABLE) { >> ret = ADIOI_PVFS2_ReadStridedListIO(fd, buf, count, datatype, >> file_ptr_type, offset, status, error_code); >> return; >> } >> >> No need to extract the "romio_pvfs2_listio" keyval from the MPI_INFO >> member, no need to strcasecmp the info value. Those steps were all >> done at MPI_FILE_OPEN or MPI_FILE_SET_VIEW time. It''s a small >> optimization, and what you do now is correct, but making use of >> ADIOI_Hints_struct will make your code more like the other drivers. >> >> >>>> - maybe we need, as wei-keng suggests, a better separation so you >>>> don''t need to duplicate Calc_my_req just so you can call the >>>> lustre-sepecific Calc_aggregator. Could a hint do this for you? >>>> >>>> >>> yes, that would be better if there is a separation. >>> >> OK, glad we''re on the same page. I don''t think you need to do this >> bit of software engineering for your patch. It''s more a note to >> ourselves: if we ever redesign ROMIO, we''ll have to be sure to make >> this part of it. >> >> >>>> - missing the "new MPE" logging anthony put into rest of ROMIO >>>> >>>> >>> I had a look at MPE events defined in adioi.h. Could you tell me what >>> else MPE logs you hope to see? >>> >> Take a look at some of the other files that have >> #ifdef ADIOI_MPE_LOGGING >> lines. >> (Unfortunately, there are still some legacy MPE lines in the code, >> too. I''ll have to delete them) >> >> The events in adioi.h are exactly what I''d like to see. I see some >> lustre files already have some of them. >> >> Here''s an example of a modern call (do it like this): >> MPE_Log_event( ADIOI_MPE_open_a, 0, NULL ); >> >> and here''s an ancient call (don''t do this): >> MPE_Log_event(9, 0, "start close"); >> >> In the end, though, this is for developers to see what''s going on >> inside ROMIO with JUMPSHOT. it''s not critical, though you might find >> it helpful when you are looking for answers to performance questions. >> >> >>>> - missing wei-keng''s reworked type processing >>>> >>> Sorry, I checked the codes but I didn''t find anything missed >>> commented by wei-keng. Could you tell me what type processing is >>> missed in this file? >>> >> This one is a bit hard to describe, because the initial work went into >> revision 958, but then we did some testing and refining for several >> more revisions before settling on a final version. It''s ok if you >> don''t incorporate this change, but if you''re motivated, svn revision >> 1001 makes some changes to ad_read_coll which are demonstrative of the >> overall changes also needed in ad_write_coll, ad_write_str, and >> ad_read_str. >> >> In ad_lustre, you have one file for "io", so you have already reduced >> your work by half. Good. >> >> >>> If I misunderstand or miss anything, please let me know. >>> Thank you so much! >>> >> Thank you for the good discussion. I''m looking forward to seeing the >> next revision. >> >> ==rob >> >> > >-- Best regards, LiuYing System Software Engineer, Lustre Group Sun Microsystems ( China ) Co. Limited -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.lustre.org/pipermail/lustre-discuss/attachments/20090302/40e83d0d/attachment-0001.html -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: adio_driver_mpich2-1.0.7_v6.patch Url: http://lists.lustre.org/pipermail/lustre-discuss/attachments/20090302/40e83d0d/attachment-0001.ksh
David Knaak
2009-Mar-02 04:46 UTC
[Lustre-discuss] [ROMIO Req #940] [Fwd: Re: [Lustre-devel] a new Lustre ADIO driver]
On Mon, Mar 02, 2009 at 11:27:29AM +0800, emoly.liu wrote:> Hi Robert, > > Here is the new lustre adio driver patch. I fixed the following problem > per our discussion:[snip] LiuYing, Thanks for copying me in this patch. I will look at it and also do some more testing with it. I''ll let you know if I have comments or questions. David Knaak
Robert Latham
2009-Mar-13 15:46 UTC
[Lustre-discuss] [ROMIO Req #940] [Fwd: Re: [Lustre-devel] a new Lustre ADIO driver]
Hi LiuYing and everyone else. Thanks for the improvements to this patch set. I looked over them quickly and it looks OK to me. I have committed this patch as revision 4055 in our trunk, and it will be part of the next MPICH2 beta release (which is likely to happen in the next week or so). I have two favors to ask of any other MPI-IO on Lustre users. First favor: please try this out on as many parallel workloads and as many Lustre deployments as possible. ROMIO has other file systems we can''t test at ANL: we rely on contributors to maintain those drivers. You might be able to just replace the romio/ directory in your MPICH2-1.0.8 tarball with an SVN checkout svn co https://svn.mcs.anl.gov/repos/mpi/mpich2/trunk/src/mpi/romio I know this patch was against MPICH2-1.0.7, but it applied cleanly against 1.0.8, due largely to how self-contained everything is -- most changed files were in the ad_lustre directory. These lustre patches (first from Weikuan, second from Sun) are sizeable hunks of code, and hard to review properly. From here out, I hope that you will not be shy about sending further patches back "upstream" to us. Second favor: documentation... Can you send me a brief summary of the new hints? romio_lustre_CO romio_lustre_bigsize romio_lustre_ds_in_coll romio_lustre_contig_data romio_lustre_samesize romio_lustre_start_iodevice I know I for one am looking forward to improved MPI-IO performance on Lustre. Thank you for the contribution. ==rob On Mon, Mar 02, 2009 at 11:27:29AM +0800, emoly.liu wrote:> Here is the new lustre adio driver patch. I fixed the following problem > per our discussion: > > * change the hints name > o from xxx to romio_lustre_xxx > > * use the fd->hints structure instead of MPI Info routines > o define a struct for lustre in ADIOI_Hints_struct in adioi.h > and replace the old MPI_Info_get calls with the new romio hints > > * check lustre/lustre_user.h header file in configure instead of > giving the lustre structs/constants > o define AC_CHECK_HEADERS in romio/configure.in. If the header > file doesn''t exist, report AC_MSG_ERROR > > * restore mis-removed comments > > * new MPE logging > o add MPE logging for read/write in ad_lustre_rwconfig.c > > * fix the issue in ad_lustre_open.c > > > I tested the new driver on less than 10 nodes with IOR benchmark. > > Please check and if you have any questions, please let me know. >-- Rob Latham Mathematics and Computer Science Division A215 0178 EA2D B059 8CDF Argonne National Lab, IL USA B29D F333 664A 4280 315B
emoly.liu
2009-Mar-16 07:41 UTC
[Lustre-discuss] [ROMIO Req #940] [Fwd: Re: [Lustre-devel] a new Lustre ADIO driver]
Robert Latham wrote:> Hi LiuYing and everyone else. > > Thanks for the improvements to this patch set. I looked over them > quickly and it looks OK to me. I have committed this patch as > revision 4055 in our trunk, and it will be part of the next MPICH2 > beta release (which is likely to happen in the next week or so). >Thank you !> I have two favors to ask of any other MPI-IO on Lustre users. > > First favor: please try this out on as many parallel workloads and as > many Lustre deployments as possible. ROMIO has other file systems we > can''t test at ANL: we rely on contributors to maintain those drivers. > > You might be able to just replace the romio/ directory in your > MPICH2-1.0.8 tarball with an SVN checkout > > svn co https://svn.mcs.anl.gov/repos/mpi/mpich2/trunk/src/mpi/romio > > I know this patch was against MPICH2-1.0.7, but it applied cleanly > against 1.0.8, due largely to how self-contained everything is -- most > changed files were in the ad_lustre directory. > > These lustre patches (first from Weikuan, second from Sun) are > sizeable hunks of code, and hard to review properly. From here out, > I hope that you will not be shy about sending further patches back > "upstream" to us. > > Second favor: documentation... Can you send me a brief summary of the > new hints? >Sure> romio_lustre_CO >In stripe-contiguous IO pattern, each OST will be accessed by a group of IO clients. CO means *C*lient/*O*ST ratio, the max. number of IO clients for each OST. CO=1 by default.> romio_lustre_bigsize >We won''t do collective I/O if this hint is set and the IO request size is bigger than this value. That''s because when the request size is big, the collective communication overhead increases and the benefits from collective I/O becomes limited.> romio_lustre_ds_in_coll >Collective IO will apply read-modify-write to deal with non-contiguous data by default. However, it will introduce some overhead(IO operation and locking). In our tests, data sieving showed bad collective write performance for some kinds of workloads. So, to avoid this, we define ds_in_coll hint to disable RMW in collective I/O, distinguished from the one in independent I/O.> romio_lustre_contig_data > romio_lustre_samesize >They are two hints to tell the driver whether the request data are contiguous and whether each request IO has the same size. If they are both "yes", we can optimize ADIOI_LUSTRE_Calc_others_req() by removing MPI_Alltoall(). Because each process can easily calculate the pairs of offset and length for each request without collective communication. BTW, currently only when they are both positive, the optimization can work. In the future, probably some efforts will be made to other conditions.> romio_lustre_start_iodevice >In Lustre, we use 3 values to describe striping information. They are -stripe_size: Number of bytes on each OST -stripe_count: Number of OSTs to stripe over (0 default, -1 all) -start_ost: OST index of first stripe (-1 filesystem default) In ROMIO, stripe_size(striping_factor) and stripe_count(striping_unit) have been defined in ADIOI_Hints_struct, but Lustre still needs another one for start_ost. So I use romio_lustre_start_iodevice for it. The similar name was ever used in Weikuan''s patch.> I know I for one am looking forward to improved MPI-IO performance > on Lustre. Thank you for the contribution. >Thank you too. -LiuYing> ==rob > > On Mon, Mar 02, 2009 at 11:27:29AM +0800, emoly.liu wrote: > >> Here is the new lustre adio driver patch. I fixed the following problem >> per our discussion: >> >> * change the hints name >> o from xxx to romio_lustre_xxx >> >> * use the fd->hints structure instead of MPI Info routines >> o define a struct for lustre in ADIOI_Hints_struct in adioi.h >> and replace the old MPI_Info_get calls with the new romio hints >> >> * check lustre/lustre_user.h header file in configure instead of >> giving the lustre structs/constants >> o define AC_CHECK_HEADERS in romio/configure.in. If the header >> file doesn''t exist, report AC_MSG_ERROR >> >> * restore mis-removed comments >> >> * new MPE logging >> o add MPE logging for read/write in ad_lustre_rwconfig.c >> >> * fix the issue in ad_lustre_open.c >> >> >> I tested the new driver on less than 10 nodes with IOR benchmark. >> >> Please check and if you have any questions, please let me know. >> >> > >-- Best regards, LiuYing System Software Engineer, Lustre Group Sun Microsystems ( China ) Co. Limited -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.lustre.org/pipermail/lustre-discuss/attachments/20090316/b8ba2f66/attachment.html
Robert Latham
2009-Mar-18 19:36 UTC
[Lustre-discuss] [ROMIO Req #940] [Fwd: Re: [Lustre-devel] a new Lustre ADIO driver]
On Mon, Mar 16, 2009 at 03:41:47PM +0800, emoly.liu wrote:> Robert Latham wrote: >> Second favor: documentation... Can you send me a brief summary of >> the new hints? > SureThanks for the documentation. These explanations are good, but now we''ve found a few problems. The naming issues are rather minor, but some of your hints aren''t compliant with the MPI-IO spec, unfortunately.>> romio_lustre_CO >> > In stripe-contiguous IO pattern, each OST will be accessed by a group of > IO clients. CO means *C*lient/*O*ST ratio, the max. number of IO clients > for each OST. > CO=1 by default.To make it more clear, how about calling it "romio_lustre_co_ratio" ?>> romio_lustre_bigsize >> > We won''t do collective I/O if this hint is set and the IO request size > is bigger than this value. That''s because when the request size is big, > the collective communication overhead increases and the benefits from > collective I/O becomes limited.Instead of ''bigzise'' how about "romio_lustre_coll_highwater" or "romio_lustre_coll_threshold"?>> romio_lustre_contig_data >> romio_lustre_samesize >> > They are two hints to tell the driver whether the request data are > contiguous and whether each request IO has the same size. If they > are both "yes", we can optimize ADIOI_LUSTRE_Calc_others_req() by > removing MPI_Alltoall(). Because each process can easily calculate > the pairs of offset and length for each request without collective > communication. BTW, currently only when they are both positive, the > optimization can work. In the future, probably some efforts will be > made to other conditions.OK, here''s the one with the major problem. RobR reminds me that MPI-IO requires hints to be optional and cannot cause incorrect behavior. A user supplying these hints and then giving you data that is noncontiguous or not of the same size would cause incorrect behavior, so these aren''t appropriate. Is there a way you can check what the caller is doing? caller can lie to you via hints, but ROMIO still has to give the right answer. RobR thought maybe MPI_Allreduce or something along those lines before the MPI_Alltoall would let you check. Your other hints make a lot of intuitive sense to me. Is this one a big win, though? If MPI_Alltoall is giving you a big headache, then maybe there is a more fundamental problem with the MPI implementation? Thanks ==rob -- Rob Latham Mathematics and Computer Science Division A215 0178 EA2D B059 8CDF Argonne National Lab, IL USA B29D F333 664A 4280 315B
emoly.liu
2009-Mar-19 06:27 UTC
[Lustre-discuss] [ROMIO Req #940] [Fwd: Re: [Lustre-devel] a new Lustre ADIO driver]
Hi rob, Robert Latham wrote:> On Mon, Mar 16, 2009 at 03:41:47PM +0800, emoly.liu wrote: > > > Thanks for the documentation. These explanations are good, but now > we''ve found a few problems. The naming issues are rather minor, but > some of your hints aren''t compliant with the MPI-IO spec, > unfortunately. > > >>> romio_lustre_CO >>> >>> >> In stripe-contiguous IO pattern, each OST will be accessed by a group of >> IO clients. CO means *C*lient/*O*ST ratio, the max. number of IO clients >> for each OST. >> CO=1 by default. >> > > To make it more clear, how about calling it "romio_lustre_co_ratio" ? >OK.> >>> romio_lustre_bigsize >>> >>> >> We won''t do collective I/O if this hint is set and the IO request size >> is bigger than this value. That''s because when the request size is big, >> the collective communication overhead increases and the benefits from >> collective I/O becomes limited. >> > > Instead of ''bigzise'' how about "romio_lustre_coll_highwater" or > "romio_lustre_coll_threshold"? >Both of them make sense to me.> >>> romio_lustre_contig_data >>> romio_lustre_samesize >>> >>> >> They are two hints to tell the driver whether the request data are >> contiguous and whether each request IO has the same size. If they >> are both "yes", we can optimize ADIOI_LUSTRE_Calc_others_req() by >> removing MPI_Alltoall(). Because each process can easily calculate >> the pairs of offset and length for each request without collective >> communication. BTW, currently only when they are both positive, the >> optimization can work. In the future, probably some efforts will be >> made to other conditions. >> > > OK, here''s the one with the major problem. RobR reminds me that > MPI-IO requires hints to be optional and cannot cause incorrect > behavior. A user supplying these hints and then giving you data that > is noncontiguous or not of the same size would cause incorrect > behavior, so these aren''t appropriate. > > Is there a way you can check what the caller is doing? caller can lie > to you via hints, but ROMIO still has to give the right answer. RobR > thought maybe MPI_Allreduce or something along those lines before the > MPI_Alltoall would let you check. >Hmm, it is indeed a problem, although we did get benefits from them in our previous tests. I will check it. But currently, is it possible to make mention of the risk with some words, just like "Don''t set these two hints, until you know exactly what you are doing" ? If it is still inappropriate, I will remove them in this version, then submit another patch once I figure out how to check it with low overhead. How about you ?> Your other hints make a lot of intuitive sense to me. Is this one a > big win, though? If MPI_Alltoall is giving you a big headache, then > maybe there is a more fundamental problem with the MPI implementation? >Thanks for your and RobR''s careful review. Your comments are very helpful. Some problems still need more investigation. B.R. -LiuYing> Thanks > ==rob > >-- Best regards, LiuYing System Software Engineer, Lustre Group Sun Microsystems ( China ) Co. Limited -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.lustre.org/pipermail/lustre-discuss/attachments/20090319/7b6c78c5/attachment-0001.html
Rob Ross
2009-Mar-19 07:07 UTC
[Lustre-discuss] [ROMIO Req #940] [Fwd: Re: [Lustre-devel] a new Lustre ADIO driver]
Hi LiuYing, Unfortunately, the group here is committed to our interpretation of the standard as being that the user passing a hint parameter that is misleading to the implementation cannot cause *incorrect* behavior (i.e. change the semantics of the call). An option for determining contiguity is to pass messages during file_set_view time; if the file view is contiguous, then the access is contiguous. Since file_set_view is a collective call, you have an opportunity to do this message passing. I''m not sure how you''re avoiding any communication, because the application processes can still be performing I/O at arbitrary offsets. Perhaps knowing that the access in file is contiguous, however, can be used to reduce the overall communication at I/O time anyway? Can you further explain how these hints worked? Maybe we can come up with an alternative together. Thanks, Rob On Mar 19, 2009, at 1:27 AM, emoly.liu wrote:> Hi rob, > > Robert Latham wrote: >> >> On Mon, Mar 16, 2009 at 03:41:47PM +0800, emoly.liu wrote: >>>> romio_lustre_contig_data >>>> romio_lustre_samesize >>> They are two hints to tell the driver whether the request data are >>> contiguous and whether each request IO has the same size. If they >>> are both "yes", we can optimize ADIOI_LUSTRE_Calc_others_req() by >>> removing MPI_Alltoall(). Because each process can easily calculate >>> the pairs of offset and length for each request without collective >>> communication. BTW, currently only when they are both positive, the >>> optimization can work. In the future, probably some efforts will be >>> made to other conditions. >>> >> OK, here''s the one with the major problem. RobR reminds me that >> MPI-IO requires hints to be optional and cannot cause incorrect >> behavior. A user supplying these hints and then giving you data that >> is noncontiguous or not of the same size would cause incorrect >> behavior, so these aren''t appropriate. >> >> Is there a way you can check what the caller is doing? caller can >> lie >> to you via hints, but ROMIO still has to give the right answer. RobR >> thought maybe MPI_Allreduce or something along those lines before the >> MPI_Alltoall would let you check. >> > Hmm, it is indeed a problem, although we did get benefits from them > in our previous tests. > > I will check it. But currently, is it possible to make mention of > the risk with some words, just like "Don''t set these two hints, > until you know exactly what you are doing" ? > If it is still inappropriate, I will remove them in this version, > then submit another patch once I figure out how to check it with low > overhead.
emoly.liu
2009-Mar-19 10:30 UTC
[Lustre-discuss] [ROMIO Req #940] [Fwd: Re: [Lustre-devel] a new Lustre ADIO driver]
Hi RobR, Rob Ross wrote:> Hi LiuYing, > > Unfortunately, the group here is committed to our interpretation of > the standard as being that the user passing a hint parameter that is > misleading to the implementation cannot cause *incorrect* behavior > (i.e. change the semantics of the call). > > An option for determining contiguity is to pass messages during > file_set_view time; if the file view is contiguous, then the access is > contiguous. Since file_set_view is a collective call, you have an > opportunity to do this message passing. > > I''m not sure how you''re avoiding any communication, because the > application processes can still be performing I/O at arbitrary > offsets. Perhaps knowing that the access in file is contiguous, > however, can be used to reduce the overall communication at I/O time > anyway? Can you further explain how these hints worked? Maybe we can > come up with an alternative together.We don''t mean to avoid any communication. We just do some optimization only when the communication overhead has a big impact on the system performance. When we tested the lustre adio driver, we found MPI_Alltoall cost much time. That is why we try to avoid it. There are two MPI_Alltoall calls in the original codes. One is called by ADIOI_W_Exchange_data() and the other by ADIOI_Calc_others_req(). * ADIOI_W_Exchange_data(): In this function, the original ADIO driver uses MPI_Alltoall() to exchange recv/send size among the processes. However, since Lustre ADIO driver reorganizes the requests into stripe-contiguous I/O pattern, recv/send size for each process can be calculated in the beginning of ADIOI_LUSTRE_Exch_and_write(). So we don''t need MPI_Alltoall() to exchange offset and length any more. * ADIOI_Calc_others_req(): This is what we are discussing. In this function, the original ADIO driver uses MPI_Alltoall() to exchange the access information among the processes. It''s necessary, but there is still a little space to improve. o if the request data are contiguous, the access information (offset and length) can be calculated by other simpler communication type(i.e. MPI_Allreduce). o Further, if the request size is same, the access information can be calculated directly without any communication. As you said, if the hints settings are inconsistent with the data given by the user, it will cause incorrect behavior and break the semantics of the call. So I agree this problem must be fixed. I don''t remember exactly how much the hints help us, but I think only when the request data are contiguous and have same size, we can get the real benefit. So, in other words, if checking contiguity+size will introduce new overhead, I prefer to remove the hints and use MPI_Alltoall cleanly. Any idea ? Thanks, -LiuYing> > Thanks, > > Rob > > On Mar 19, 2009, at 1:27 AM, emoly.liu wrote: > >> Hi rob, >> >> Robert Latham wrote: >>> >>> On Mon, Mar 16, 2009 at 03:41:47PM +0800, emoly.liu wrote: >>>>> romio_lustre_contig_data >>>>> romio_lustre_samesize >>>> They are two hints to tell the driver whether the request data are >>>> contiguous and whether each request IO has the same size. If they >>>> are both "yes", we can optimize ADIOI_LUSTRE_Calc_others_req() by >>>> removing MPI_Alltoall(). Because each process can easily calculate >>>> the pairs of offset and length for each request without collective >>>> communication. BTW, currently only when they are both positive, the >>>> optimization can work. In the future, probably some efforts will be >>>> made to other conditions. >>>> >>> OK, here''s the one with the major problem. RobR reminds me that >>> MPI-IO requires hints to be optional and cannot cause incorrect >>> behavior. A user supplying these hints and then giving you data that >>> is noncontiguous or not of the same size would cause incorrect >>> behavior, so these aren''t appropriate. >>> >>> Is there a way you can check what the caller is doing? caller can lie >>> to you via hints, but ROMIO still has to give the right answer. RobR >>> thought maybe MPI_Allreduce or something along those lines before the >>> MPI_Alltoall would let you check. >>> >> Hmm, it is indeed a problem, although we did get benefits from them >> in our previous tests. >> >> I will check it. But currently, is it possible to make mention of the >> risk with some words, just like "Don''t set these two hints, until you >> know exactly what you are doing" ? >> If it is still inappropriate, I will remove them in this version, >> then submit another patch once I figure out how to check it with low >> overhead. > >-- Best regards, LiuYing System Software Engineer, Lustre Group Sun Microsystems ( China ) Co. Limited -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.lustre.org/pipermail/lustre-discuss/attachments/20090319/71bd2509/attachment.html
emoly.liu
2009-Apr-24 08:39 UTC
[Lustre-discuss] [ROMIO Req #940] [Fwd: Re: [Lustre-devel] a new Lustre ADIO driver]
Hi Rob, Here is the new patch for lustre adio driver, based on MPICH2-1.0.7. Per our discussion, I did the following changes: 1) rename the hints - romio_lustre_CO -> romio_lustre_co_ratio - romio_lustre_bigsize -> romio_lustre_coll_threshold 2) remove the two confusing hints I removed "contig_data" and "samesize", then use ADIOI_Calc_others_req() instead of ADIOI_LUSTRE_Calc_others_req(). I have tested the patch in a small scale environment. Please check and let me know if you have any questions. Thanks, LiuYing Robert Latham wrote:> On Mon, Mar 16, 2009 at 03:41:47PM +0800, emoly.liu wrote: > > Thanks for the documentation. These explanations are good, but now > we''ve found a few problems. The naming issues are rather minor, but > some of your hints aren''t compliant with the MPI-IO spec, > unfortunately. > > >>> romio_lustre_CO >>> >>> >> In stripe-contiguous IO pattern, each OST will be accessed by a group of >> IO clients. CO means *C*lient/*O*ST ratio, the max. number of IO clients >> for each OST. >> CO=1 by default. >> > > To make it more clear, how about calling it "romio_lustre_co_ratio" ? > > >>> romio_lustre_bigsize >>> >>> >> We won''t do collective I/O if this hint is set and the IO request size >> is bigger than this value. That''s because when the request size is big, >> the collective communication overhead increases and the benefits from >> collective I/O becomes limited. >> > > Instead of ''bigzise'' how about "romio_lustre_coll_highwater" or > "romio_lustre_coll_threshold"? > > >>> romio_lustre_contig_data >>> romio_lustre_samesize >>> >>> >> They are two hints to tell the driver whether the request data are >> contiguous and whether each request IO has the same size. If they >> are both "yes", we can optimize ADIOI_LUSTRE_Calc_others_req() by >> removing MPI_Alltoall(). Because each process can easily calculate >> the pairs of offset and length for each request without collective >> communication. BTW, currently only when they are both positive, the >> optimization can work. In the future, probably some efforts will be >> made to other conditions. >> > > OK, here''s the one with the major problem. RobR reminds me that > MPI-IO requires hints to be optional and cannot cause incorrect > behavior. A user supplying these hints and then giving you data that > is noncontiguous or not of the same size would cause incorrect > behavior, so these aren''t appropriate. > > Is there a way you can check what the caller is doing? caller can lie > to you via hints, but ROMIO still has to give the right answer. RobR > thought maybe MPI_Allreduce or something along those lines before the > MPI_Alltoall would let you check. > > Your other hints make a lot of intuitive sense to me. Is this one a > big win, though? If MPI_Alltoall is giving you a big headache, then > maybe there is a more fundamental problem with the MPI implementation? > > Thanks > ==rob > >-- Best regards, LiuYing System Software Engineer, Lustre Group Sun Microsystems ( China ) Co. Limited -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.lustre.org/pipermail/lustre-discuss/attachments/20090424/83742654/attachment-0001.html -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: adio_driver_mpich2-1.0.7_v7.patch Url: http://lists.lustre.org/pipermail/lustre-discuss/attachments/20090424/83742654/attachment-0001.ksh
emoly.liu
2009-May-05 07:56 UTC
[Lustre-discuss] [ROMIO Req #940] [Fwd: Re: [Lustre-devel] a new Lustre ADIO driver]
Hi Rob, I have fixed the bug found by wei-keng liao in ad_lustre_aggregate.c. Here is the new patch(v8). If you have already applied patch_v7, you can use the small one(aggregate.patch) in the attachment directly. Thanks, LiuYing emoly.liu wrote:> Hi Wei-keng, > > Wei-keng Liao wrote: >> Hi, LiuYing, >> >> I am checking ad_lustre_aggregate.c and having a problem to understand >> the calculation of avail_cb_nodes in ADIOI_LUSTRE_Get_striping_info(). >> In line 62, I suspect this if condition may not be right: >> if (avail_cb_nodes == nprocs_for_coll) { > Sorry, this condition is not right, it should be > if (avail_cb_nodes < stripe_count * CO) { > > I will update the patch soon. Thanks for your careful review. > > -LiuYing >> >> For example, when nprocs_for_coll=10, stripe_count=10, and CO=1 >> the mapping of OSTs to the I/O clients is one-to-one. Before >> line 62, avail_cb_nodes is set to 10, but the line 62 if statement >> changes avail_cb_nodes to 5. Is this what you intend to do? >> >> If you would like to keep the one I/O client per OST, shouldn''t >> avail_cb_nodes be 10 in this example? >> >> Wei-keng >> >> > >-- Best regards, LiuYing System Software Engineer, Lustre Group Sun Microsystems ( China ) Co. Limited -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: adio_driver_mpich2-1.0.7_v8.patch Url: http://lists.lustre.org/pipermail/lustre-discuss/attachments/20090505/8a22ba11/attachment-0002.ksh -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: aggregate.patch Url: http://lists.lustre.org/pipermail/lustre-discuss/attachments/20090505/8a22ba11/attachment-0003.ksh
Rob Latham
2009-May-08 21:55 UTC
[Lustre-discuss] [ROMIO Req #940] a new Lustre ADIO driver]
On Tue, May 05, 2009 at 03:56:11PM +0800, emoly.liu wrote:> I have fixed the bug found by wei-keng liao in ad_lustre_aggregate.c. > Here is the new patch(v8). If you have already applied patch_v7, you can > use the small one(aggregate.patch) in the attachment directly.Hi LiuYing. Thanks for incorporating our suggestions and sending the updated patch. I''ve applied these updates to our SVN tree, so they''re ready to go in the upcoming MPICH2 release candidate. Martin Audet recently volunteered to help us test out the Lustre driver on his system, so I have added him to the CC list. Thank you in advance, Martin. The ROMIO tree can be checked out via svn co https://svn.mcs.anl.gov/repos/mpi/mpich2/trunk/src/mpi/romio You will probably need to build the whole MPICH2 tree, which can be checked out and built via the instructions at http://wiki.mcs.anl.gov/mpich2/index.php/Getting_And_Building_MPICH2 It would be a big help for us if anyone with access to a Lustre file system could run a few tests and make sure I didn''t get anything wrong applying these patches against SVN trunk. We have a week to shake out any bugs before the release. Thanks to all. ==rob -- Rob Latham Mathematics and Computer Science Division Argonne National Lab, IL USA
Weikuan Yu
2009-May-09 02:33 UTC
[Lustre-discuss] [ROMIO Req #940] a new Lustre ADIO driver]
Hi, RobL, I tried the compilation. Looks like the SVN code has some inconsistencies in variable names. ad_lustre_hints.c: In function `ADIOI_LUSTRE_SetInfo'': ad_lustre_hints.c:34: error: structure has no member named `co_ratio'' ad_lustre_hints.c:36: error: structure has no member named `coll_threshold'' ad_lustre_hints.c:154: error: structure has no member named `co_ratio'' ad_lustre_hints.c:172: error: structure has no member named `coll_threshold'' make[2]: *** [ad_lustre_hints.o] Error 1 make[2]: Leaving directory `/home/wyu/pmodels/mpich2/src/mpi/romio/adio/ad_lustre'' Make failed in directory adio/ad_lustre -- Weikuan Yu, +1 (334) 844-6330 http://www.eng.auburn.edu/~wkyu/> From: Rob Latham <robl at mcs.anl.gov> > Date: Fri, 8 May 2009 16:55:28 -0500 > To: "emoly.liu" <Emoly.Liu at Sun.COM> > Cc: Wei-keng Liao <wkliao at ece.northwestern.edu>, <romio-maint at mcs.anl.gov>, > "Tom.Wang" <Tom.Wang at Sun.COM>, David Knaak <knaak at cray.com>, Weikuan Yu > <wkyu at auburn.edu>, lustre-discuss <lustre-discuss at lists.lustre.org>, > <Martin.Audet at imi.cnrc-nrc.gc.ca> > Subject: Re: [Lustre-discuss] [ROMIO Req #940] a new Lustre ADIO driver] > > > On Tue, May 05, 2009 at 03:56:11PM +0800, emoly.liu wrote: >> I have fixed the bug found by wei-keng liao in ad_lustre_aggregate.c. >> Here is the new patch(v8). If you have already applied patch_v7, you can >> use the small one(aggregate.patch) in the attachment directly. > > Hi LiuYing. Thanks for incorporating our suggestions and sending the > updated patch. I''ve applied these updates to our SVN tree, so > they''re ready to go in the upcoming MPICH2 release candidate. > > Martin Audet recently volunteered to help us test out the Lustre > driver on his system, so I have added him to the CC list. Thank you > in advance, Martin. > > The ROMIO tree can be checked out via > svn co https://svn.mcs.anl.gov/repos/mpi/mpich2/trunk/src/mpi/romio > > You will probably need to build the whole MPICH2 tree, which can be > checked out and built via the instructions at > http://wiki.mcs.anl.gov/mpich2/index.php/Getting_And_Building_MPICH2 > > It would be a big help for us if anyone with access to a Lustre file > system could run a few tests and make sure I didn''t get anything wrong > applying these patches against SVN trunk. We have a week to shake out > any bugs before the release. > > Thanks to all. > ==rob > > -- > Rob Latham > Mathematics and Computer Science Division > Argonne National Lab, IL USA
Rob Latham
2009-May-09 16:18 UTC
[Lustre-discuss] [ROMIO Req #940] a new Lustre ADIO driver]
On Fri, May 08, 2009 at 09:33:20PM -0500, Weikuan Yu wrote:> Hi, RobL, > > I tried the compilation. Looks like the SVN code has some inconsistencies in > variable names.Thanks. I forgot to check one change in. Do try now: revision 4480 should fix this. ==rob -- Rob Latham Mathematics and Computer Science Division Argonne National Lab, IL USA
Rajeev Thakur
2009-May-11 03:56 UTC
[Lustre-discuss] [ROMIO Req #940] a new Lustre ADIO driver]
Weikuan, I can''t read the attachment you sent. It has all kinds of strange characters. Rajeev> -----Original Message----- > From: Weikuan Yu [mailto:wkyu at auburn.edu] > Sent: Saturday, May 09, 2009 3:54 PM > To: Rob Latham > Cc: emoly.liu; Wei-keng Liao; romio-maint at mcs.anl.gov; > Tom.Wang; David Knaak; lustre-discuss; Martin.Audet at imi.cnrc-nrc.gc.ca > Subject: Re: [Lustre-discuss] [ROMIO Req #940] a new Lustre > ADIO driver] > > Yes, it passed compilation. But there are many errors > reported from runtest, quite a number of them are only from > ad_lustre driver. Attached is an output tarball (named as > .txt though). It contains the output files from running > romio/runtests with ad_lustre and ad_ufs drivers separately. > > -- > Weikuan Yu, +1 (334) 844-6330 > http://www.eng.auburn.edu/~wkyu/ > > > > From: Rob Latham <robl at mcs.anl.gov> > > Date: Sat, 9 May 2009 11:18:47 -0500 > > To: Weikuan Yu <wkyu at auburn.edu> > > Cc: "emoly.liu" <Emoly.Liu at Sun.COM>, Wei-keng Liao > > <wkliao at ece.northwestern.edu>, <romio-maint at mcs.anl.gov>, "Tom.Wang" > > <Tom.Wang at Sun.COM>, David Knaak <knaak at cray.com>, lustre-discuss > > <lustre-discuss at lists.lustre.org>, <Martin.Audet at imi.cnrc-nrc.gc.ca> > > Subject: Re: [Lustre-discuss] [ROMIO Req #940] a new Lustre > ADIO driver] > > > > On Fri, May 08, 2009 at 09:33:20PM -0500, Weikuan Yu wrote: > >> Hi, RobL, > >> > >> I tried the compilation. Looks like the SVN code has some > inconsistencies in > >> variable names. > > > > Thanks. I forgot to check one change in. Do try now: revision 4480 > > should fix this. > > > > ==rob > > > > -- > > Rob Latham > > Mathematics and Computer Science Division > > Argonne National Lab, IL USA > >
Weikuan Yu
2009-May-11 11:52 UTC
[Lustre-discuss] [ROMIO Req #940] a new Lustre ADIO driver]
Hi, Rajeev, Yeah, that file was a tarball named as .txt. It needs to be untar''ed to get two output files. coll_test.c, noncontig_coll.c, noncontig_coll2.c and others are failing. # tar zxf runtests.output.txt -- Weikuan Yu, +1 (334) 844-6330 http://www.eng.auburn.edu/~wkyu/> From: Rajeev Thakur <thakur at mcs.anl.gov> > Date: Sun, 10 May 2009 22:56:43 -0500 > To: Weikuan Yu <wkyu at auburn.edu>, ''Rob Latham'' <robl at mcs.anl.gov> > Cc: "''emoly.liu''" <Emoly.Liu at Sun.COM>, ''Wei-keng Liao'' > <wkliao at ece.northwestern.edu>, <romio-maint at mcs.anl.gov>, "''Tom.Wang''" > <Tom.Wang at Sun.COM>, ''David Knaak'' <knaak at cray.com>, ''lustre-discuss'' > <lustre-discuss at lists.lustre.org>, <Martin.Audet at imi.cnrc-nrc.gc.ca> > Subject: RE: [Lustre-discuss] [ROMIO Req #940] a new Lustre ADIO driver] > > Weikuan, > I can''t read the attachment you sent. It has all kinds of strange > characters. > > Rajeev > > >> -----Original Message----- >> From: Weikuan Yu [mailto:wkyu at auburn.edu] >> Sent: Saturday, May 09, 2009 3:54 PM >> To: Rob Latham >> Cc: emoly.liu; Wei-keng Liao; romio-maint at mcs.anl.gov; >> Tom.Wang; David Knaak; lustre-discuss; Martin.Audet at imi.cnrc-nrc.gc.ca >> Subject: Re: [Lustre-discuss] [ROMIO Req #940] a new Lustre >> ADIO driver] >> >> Yes, it passed compilation. But there are many errors >> reported from runtest, quite a number of them are only from >> ad_lustre driver. Attached is an output tarball (named as >> .txt though). It contains the output files from running >> romio/runtests with ad_lustre and ad_ufs drivers separately. >> >> -- >> Weikuan Yu, +1 (334) 844-6330 >> http://www.eng.auburn.edu/~wkyu/ >> >> >>> From: Rob Latham <robl at mcs.anl.gov> >>> Date: Sat, 9 May 2009 11:18:47 -0500 >>> To: Weikuan Yu <wkyu at auburn.edu> >>> Cc: "emoly.liu" <Emoly.Liu at Sun.COM>, Wei-keng Liao >>> <wkliao at ece.northwestern.edu>, <romio-maint at mcs.anl.gov>, "Tom.Wang" >>> <Tom.Wang at Sun.COM>, David Knaak <knaak at cray.com>, lustre-discuss >>> <lustre-discuss at lists.lustre.org>, <Martin.Audet at imi.cnrc-nrc.gc.ca> >>> Subject: Re: [Lustre-discuss] [ROMIO Req #940] a new Lustre >> ADIO driver] >>> >>> On Fri, May 08, 2009 at 09:33:20PM -0500, Weikuan Yu wrote: >>>> Hi, RobL, >>>> >>>> I tried the compilation. Looks like the SVN code has some >> inconsistencies in >>>> variable names. >>> >>> Thanks. I forgot to check one change in. Do try now: revision 4480 >>> should fix this. >>> >>> ==rob >>> >>> -- >>> Rob Latham >>> Mathematics and Computer Science Division >>> Argonne National Lab, IL USA >> >> > >
Rob Latham
2009-May-11 14:28 UTC
[Lustre-discuss] [ROMIO Req #940] a new Lustre ADIO driver]
On Sat, May 09, 2009 at 03:53:40PM -0500, Weikuan Yu wrote:> Yes, it passed compilation. But there are many errors reported from runtest, > quite a number of them are only from ad_lustre driver. Attached is an output > tarball (named as .txt though). It contains the output files from running > romio/runtests with ad_lustre and ad_ufs drivers separately.Thanks for sending both the UFS and Lustre tests. The atomicity, i_noncontig, noncontig, shared_fp, and ordered_fp failures look like fcntl locks just don''t work on the cray? Both UFS and Lustre say "ADIOI_Set_lock:: Function not implemented". That''s going to cause some problems not just for atomic mode, but also for data sieving writes, so I guess it''s good the failure is a loud MPI_ABORT and not a quiet corruption of data So, the real challenges are coll_test, noncontig_coll, hindexed, aggregation1, aggregation2, split_coll... basically, collective I/O is messed up. hindexed does a collective write followed by an independent read and that''s failing, so we should explore the collective write path first and make sure that''s working. ==rob -- Rob Latham Mathematics and Computer Science Division Argonne National Lab, IL USA
Weikuan Yu
2009-May-11 14:39 UTC
[Lustre-discuss] [ROMIO Req #940] a new Lustre ADIO driver]
> From: Rob Latham <robl at mcs.anl.gov> > Date: Mon, 11 May 2009 09:28:12 -0500 > To: Weikuan Yu <wkyu at auburn.edu> > Cc: "emoly.liu" <Emoly.Liu at Sun.COM>, Wei-keng Liao > <wkliao at ece.northwestern.edu>, <romio-maint at mcs.anl.gov>, "Tom.Wang" > <Tom.Wang at Sun.COM>, David Knaak <knaak at cray.com>, lustre-discuss > <lustre-discuss at lists.lustre.org>, <Martin.Audet at imi.cnrc-nrc.gc.ca> > Subject: Re: [Lustre-discuss] [ROMIO Req #940] a new Lustre ADIO driver] > > So, the real challenges are coll_test, noncontig_coll, hindexed, > aggregation1, aggregation2, split_coll... basically, collective I/O is > messed up.I agree. That''s why I posted all error output, along with the ones from ufs. -- Weikuan Yu, +1 (334) 844-6330 http://www.eng.auburn.edu/~wkyu/
David Knaak
2009-May-12 07:08 UTC
[Lustre-discuss] [ROMIO Req #940] a new Lustre ADIO driver]
On Mon, May 11, 2009 at 09:28:12AM -0500, Rob Latham wrote:> On Sat, May 09, 2009 at 03:53:40PM -0500, Weikuan Yu wrote: > > Yes, it passed compilation. But there are many errors reported from runtest, > > quite a number of them are only from ad_lustre driver. Attached is an output > > tarball (named as .txt though). It contains the output files from running > > romio/runtests with ad_lustre and ad_ufs drivers separately. > > Thanks for sending both the UFS and Lustre tests. The atomicity, > i_noncontig, noncontig, shared_fp, and ordered_fp failures look like fcntl > locks just don''t work on the cray? Both UFS and Lustre say > "ADIOI_Set_lock:: Function not implemented". > > That''s going to cause some problems not just for atomic mode, but also > for data sieving writes, so I guess it''s good the failure is a loud > MPI_ABORT and not a quiet corruption of dataWeikuan, To get file locking to work properly for Cray XT, the Lustre file system must be mounted with the "flock" option the compute nodes. This is the recommended option for all Cray systems. Did you build and run these tests on a Cray XT system? If so, which one?> So, the real challenges are coll_test, noncontig_coll, hindexed, > aggregation1, aggregation2, split_coll... basically, collective I/O is > messed up. > > hindexed does a collective write followed by an independent read and > that''s failing, so we should explore the collective write path first > and make sure that''s working.Rob, I''m looking at the collective buffering errors. David
David Knaak
2009-May-12 09:10 UTC
[Lustre-discuss] [ROMIO Req #940] a new Lustre ADIO driver]
> On Mon, May 11, 2009 at 09:28:12AM -0500, Rob Latham wrote: > > So, the real challenges are coll_test, noncontig_coll, hindexed, > > aggregation1, aggregation2, split_coll... basically, collective I/O is > > messed up. > > > > hindexed does a collective write followed by an independent read and > > that''s failing, so we should explore the collective write path first > > and make sure that''s working.On Tue, May 12, 2009 at 02:08:31AM -0500, David Knaak wrote:> I''m looking at the collective buffering errors.Rob, With my version of the Lustre stripe-aligned collective buffering (which merges some of LiuYing''s code with mine): coll_test passes (2 PEs as required) noncontig_coll passes (2 PEs as required) hindexed passes (4 PEs as required) aggregation1 passes (up to 60 PEs, that''s as high as I''m going tonight) aggregation2 passes (up to 60 PEs, that''s as high as I''m going tonight) split_coll passes (up to 60 PEs, that''s as high as I''m going tonight) The one collective buffering test that fails is noncontig_coll2. I''ll look at that more closely. One other test that fails is shared_fp (60 PEs) but it also fails with collective buffering disabled, and besides, it doesn''t make any collective I/O calls. I haven''t looked closely yet at the test. As I said last week, I have not yet had time to build the full Lustre ADIO and therefore can''t at this point make any statement about it. David
Rob Latham
2009-May-12 15:21 UTC
[Lustre-discuss] [ROMIO Req #940] a new Lustre ADIO driver]
On Tue, May 12, 2009 at 04:10:44AM -0500, David Knaak wrote:> With my version of the Lustre stripe-aligned collective buffering > (which merges some of LiuYing''s code with mine): > > coll_test passes (2 PEs as required) > noncontig_coll passes (2 PEs as required) > hindexed passes (4 PEs as required) > aggregation1 passes (up to 60 PEs, that''s as high as I''m going tonight) > aggregation2 passes (up to 60 PEs, that''s as high as I''m going tonight) > split_coll passes (up to 60 PEs, that''s as high as I''m going tonight)Great! Thanks for the additional information.> The one collective buffering test that fails is noncontig_coll2. I''ll > look at that more closely.noncontig_coll2 is a tricky one in that it re-orders the I/O aggregators. The hint "cb_config_list" lets a user explicity specify which MPI processors should be i/o aggregators. There is no reason to expect the user to construct that list in rank-order, so ROMIO should be able to handle any permutation of nodes. If any part of the code makes an implicit assumption about the order of ranks, nonconitg_coll2 will probably give it a headache. I added that test back in december 2002... 7 years later I still remember how much of a pain it was to track down the fix.> One other test that fails is shared_fp (60 PEs) but it also fails > with collective buffering disabled, and besides, it doesn''t make any > collective I/O calls. I haven''t looked closely yet at the test.shared_fp will do fcntl locks to coordinate writes to a hidden file. This hidden file contains one value: the value of the shared file pointer. I don''t know what would be particularly tricky about that test, but at least we can rule out the two-phase code. ==rob -- Rob Latham Mathematics and Computer Science Division Argonne National Lab, IL USA
Rob Latham
2009-Jun-01 22:25 UTC
[Lustre-discuss] [ROMIO Req #940] a new Lustre ADIO driver]
On Mon, May 11, 2009 at 09:28:12AM -0500, Rob Latham wrote:> So, the real challenges are coll_test, noncontig_coll, hindexed, > aggregation1, aggregation2, split_coll... basically, collective I/O is > messed up.Hi. I haven''t had a chance to debug this. How about any of you? The MPICH2 folks would like to release 1.1 tomorrow. I propose disabling the auto-detection of Lustre until we can fix this. I don''t want anybody upgrading to MPICH2-1.1 on a lustre system and getting corruption with collective i/o. At the same time I also don''t want to back out all the lustre changes, though, since I''m sure we are close. For testing and debugging, we can explicitly exercise the Lustre path by prefixing the file name with ''lustre:'' If we can find the fix, we can incorporate it into the follow-on patch-release, roughly scheduled for end of summer. ==rob -- Rob Latham Mathematics and Computer Science Division Argonne National Lab, IL USA
David Knaak
2009-Jun-02 16:11 UTC
[Lustre-discuss] [ROMIO Req #940] a new Lustre ADIO driver]
Rob, As I mentioned earlier, other projects have made it difficult for me to devote much time to this. Yesterday I did build the full 1.1rc1 version of ROMIO in the Cray XT MPICH2 build. I then ran one of the test cases that had previously failed with my partially merged version of the earlier Lustre code before I fixed it. (This fix is in the Cray release 3.2.0 that we released in April.) The test failed yesterday in the same way with the vanilla 1.1rc1 ROMIO. The simplest test that I reduced it to is an IOR run with the following combinations: * clients = 7 * cb_nodes = 4 * striping_factor = 4 * striping_unit = 1048576 * romio_cb_write = enable * Command: ./IOR -a MPIIO -cCg -w -W -b 8m -t 8m -o file On the validation step (-W), IOR detects incorrect data. The basic problem is in "ADIOI_LUSTRE_W_Exchange_data" when "buf_idx" is updated: buf_idx[i] += send_size[i]; When the work loads are nice and even on each call to "ADIOI_LUSTRE_W_Exchange_data", this works but in a case like 7 clients and 4 aggregators, this simple updating of "buf_idx" for the next call is not correct. My fix is to save all the "buf_idx" values in the "ADIOI_Access my_req" structure and use the saved values rather than trying to recompute then in ADIOI_LUSTRE_W_Exchange_data. So in "ADIOI_LUSTRE_Calc_my_req", a "buf_idx" array is allocated and set similar to how "offsets" and "lens" are allocated and set: my_req[i].offsets = (ADIO_Offset *) ADIOI_Malloc(count_my_req_per_proc[i] * sizeof(ADIO_Offset)); my_req[i].lens = (int *) ADIOI_Malloc(count_my_req_per_proc[i] * sizeof(int)); my_req[i].buf_idx = (int *) ADIOI_Malloc(count_my_req_per_proc[i] * sizeof(int)); and my_req[proc].offsets[l] = off; my_req[proc].lens[l] = (int) avail_len; my_req[proc].buf_idx[l] = curr_idx; Then "ADIOI_LUSTRE_Exch_and_write" always has the correct "buf_idx" value when it calls "ADIOI_LUSTRE_W_Exchange_data". It won''t be quick and easy for me to give you a clean patch but I can send you my version of the routines. It is my plan to fully compare, test, and merge code as time allows this summer. David On Mon, Jun 01, 2009 at 05:25:04PM -0500, Rob Latham wrote:> On Mon, May 11, 2009 at 09:28:12AM -0500, Rob Latham wrote: > > So, the real challenges are coll_test, noncontig_coll, hindexed, > > aggregation1, aggregation2, split_coll... basically, collective I/O is > > messed up. > > Hi. I haven''t had a chance to debug this. How about any of you? > The MPICH2 folks would like to release 1.1 tomorrow. > > I propose disabling the auto-detection of Lustre until we can fix > this. I don''t want anybody upgrading to MPICH2-1.1 on a lustre system > and getting corruption with collective i/o. > > At the same time I also don''t want to back out all the lustre changes, > though, since I''m sure we are close. For testing and debugging, we > can explicitly exercise the Lustre path by prefixing the file name > with ''lustre:'' > > If we can find the fix, we can incorporate it into the follow-on > patch-release, roughly scheduled for end of summer. > > ==rob > > -- > Rob Latham > Mathematics and Computer Science Division > Argonne National Lab, IL USA--
Rob Latham
2009-Jun-02 16:57 UTC
[Lustre-discuss] [ROMIO Req #940] a new Lustre ADIO driver]
On Tue, Jun 02, 2009 at 11:11:25AM -0500, David Knaak wrote:> It won''t be quick and easy for me to give you a clean patch but I can > send you my version of the routines. It is my plan to fully compare, > test, and merge code as time allows this summer.No worries. Thank you for the details. I knew you had a handle on the problem :> ==rob -- Rob Latham Mathematics and Computer Science Division Argonne National Lab, IL USA