pascal.deveze at bull.net
2009-May-27 08:15 UTC
[Lustre-discuss] [ROMIO Req #940] a new Lustre ADIO driver]
Hi, I downloaded the last version of the Lustre ADIO driver on mpich2-1.1rc1. I saw that the parametre len_list_ptr of function ADIOI_Calc_my_off_len() has been modified in include/adioi.h: int **len_list_ptr (32 bits) has been changed to ADIO_Offset **len_list_ptr (64 bits) So, I propose to change in ad_lustre_aggregate;c, line 122: void ADIOI_LUSTRE_Calc_my_req(ADIO_File fd, ADIO_Offset *offset_list, int *len_list, int contig_access_count, <==== here is the change int *striping_info, int nprocs, int *count_my_req_procs_ptr, int **count_my_req_per_proc_ptr, ADIOI_Access **my_req_ptr, int **buf_idx_ptr) And in ad_lustre_wrcoll.c, line 84 : ADIO_Offset *offset_list = NULL, *st_offsets = NULL, *end_offsets NULL, *len_list = NULL; int *buf_idx = NULL, *striping_info = NULL; instead of: ADIO_Offset *offset_list = NULL, *st_offsets = NULL, *end_offsets NULL; int *buf_idx = NULL, *len_list = NULL, *striping_info = NULL; Without these changes, len_list[i] (where i is 1,3,5,7 ...) in the routine ADIOI_LUSTRE_Calc_my_req are null. I will now pass some tests and I hope make contributions. Regards, Pascal DEVEZE R&D HPC Bull France
emoly.liu
2009-Jun-03 03:19 UTC
[Lustre-discuss] [ROMIO Req #940] a new Lustre ADIO driver]
Hi all, Sorry for my late reply. These days I''m quite busy with something. I have read through the problems reported by all of you. Thanks for your kindly reports and fix. I check my codes and run some tests on MPICH2-1.0.7, but I can''t reproduce all of the problems. Then I saw the email by pescal. Thanks for pescal''s shedding light on me. Our lustre adio driver is developed based on MPICH2-1.0.7, where some variables were defined with 32-bit. But in the new MPICH2 release, some of them are changed into 64-bit, as pescal mentioned. I think it is one of the factors which caused the failure. I will check our driver and new romio codes thoroughly, do some tests as you suggested, then post a patch for MPICH2-1.1 later. Thanks, LiuYing pascal.deveze at bull.net wrote:> Hi, > > I downloaded the last version of the Lustre ADIO driver on mpich2-1.1rc1. > > I saw that the parametre len_list_ptr of function ADIOI_Calc_my_off_len() > has been modified in include/adioi.h: > int **len_list_ptr (32 bits) has been changed to > ADIO_Offset **len_list_ptr (64 bits) > > So, I propose to change in ad_lustre_aggregate;c, line 122: > void ADIOI_LUSTRE_Calc_my_req(ADIO_File fd, ADIO_Offset *offset_list, > int *len_list, int contig_access_count, > <==== here is the change > int *striping_info, int nprocs, > int *count_my_req_procs_ptr, > int **count_my_req_per_proc_ptr, > ADIOI_Access **my_req_ptr, > int **buf_idx_ptr) > > > And in ad_lustre_wrcoll.c, line 84 : > > ADIO_Offset *offset_list = NULL, *st_offsets = NULL, *end_offsets > NULL, *len_list = NULL; > int *buf_idx = NULL, *striping_info = NULL; > instead of: > ADIO_Offset *offset_list = NULL, *st_offsets = NULL, *end_offsets > NULL; > int *buf_idx = NULL, *len_list = NULL, *striping_info = NULL; > > Without these changes, len_list[i] (where i is 1,3,5,7 ...) in the routine > ADIOI_LUSTRE_Calc_my_req > are null. > > I will now pass some tests and I hope make contributions. > > Regards, > > > Pascal DEVEZE > R&D HPC > Bull France > > > _______________________________________________ > Lustre-discuss mailing list > Lustre-discuss at lists.lustre.org > http://lists.lustre.org/mailman/listinfo/lustre-discuss >-- Best regards, LiuYing System Software Engineer, Lustre Group Sun Microsystems ( China ) Co. Limited
Rob Latham
2009-Jun-03 21:36 UTC
[Lustre-discuss] [ROMIO Req #940] a new Lustre ADIO driver]
On Wed, May 27, 2009 at 10:15:29AM +0200, pascal.deveze at bull.net wrote:> Without these changes, len_list[i] (where i is 1,3,5,7 ...) in the routine > ADIOI_LUSTRE_Calc_my_req > are null. > > I will now pass some tests and I hope make contributions.Hi Pascal. Did these tests fix things for you? You''re right: one of the bigger changes from 1.0 to 1.1 was an effort to make ROMIO more 64-bit clean. If you go to the ''romio/test'' directory, there''s a script ''runtests'' that will run through a good number of MPI-IO tests: ''runtests -fname=lustre:/path/to/lustre'' If you run that, let me know how it goes. ==rob -- Rob Latham Mathematics and Computer Science Division Argonne National Lab, IL USA
pascal.deveze at bull.net
2009-Jun-12 15:35 UTC
[Lustre-discuss] [ROMIO Req #940] a new Lustre ADIO driver]
Rob, I''ve made 4 modifications on the new Lustre ADIO driver. After that, all the romio tests succeed. The application "coll_test" also passes on 2 nodes (4 processes - 2OST) with the dimension of the 3 d array varying from 1 to 300. The version I use comes from mpich2-1.1. Hereafter is a quick explanation on my four modifications: 1) The calculation of avail_cb_nodes in ADIOI_LUSTRE_Get_stripping_info() For example, when nprocs_for_coll=2, stripe_count=2 and CO=1 , avail_cb_nodes is set to 1. The value 2 should be used instead. I propose to change the lines by: /* avail_cb_nodes should divide stripe_count exactly */ while (stripe_count % avail_cb_nodes) { avail_cb_nodes--; } 2) The parameter len_list_ptr has been modified in include/adioi.h, so I propose to change : int **len_list_ptr; to ADIO_Offset **len_list_ptr; in ad_lustre_aggregate.c and ad_lustre_wrcoll.c 3) Use of buf_idx[ ] in ADIOI_LUSTRE_W_Exchange_data I had a lot of troubles with the table of pointers buf_idx[ ]: "coll_test" with different dimensions of the 3d array detected a lot of errors. I solved the problems by using only one pointer: buf_idx[0] initialized by 0 in "ADIOI_LUSTRE_Calc_my_req" and used/modified in "ADIOI_LUSTRE_W_Exchange_data": if (send_size[i]) { MPI_Isend(((char *) buf) + buf_idx[0], send_size[i], MPI_BYTE, i, myrank + i + 100 * iter, fd->comm, send_req + j); j++; buf_idx[0]+=send_size[i]; } Of course, a single variable may be used to do that, but this implies more changes. This solution works fine with coll_test (used with a wide range of dimensions of the 3d array, on 4 nodes and 2 OST). Maybe this solution is too simple, and I missed something. What is your opinion ? 4) Macro ADIOI_BUFFERED_WRITE_WITHOUT_READ in ad_lustre_wrstr.c (line 213) This macro causes an abort in the tests "i_concontig" and "noncontig". In this macro, ADIO_WriteContig was called a lot of time with writebuf_len egal to 0. What is wanted is to copy the data (by memcpy) in the writebuf as much as possible to fill a stripe and then to call ADIO_WriteContig. As far as I understand it, this was not the case. I replaced this macro by a call to memcpy(writebuf + req_off - writebuf_off, (char *)buf + user_off, req_len); This modification works only if it is possible to copy all the user data in the write buffer (writebuf). (i.e. if bufsize <= writebuf_len) If bufsize is higher, a loop has to be introduced. What do you think about this change ? You will find attached my modification vs mpich2-1.1. Your questions and comments about them are welcome. Regards, Pascal DEVEZE R&D HPC Bull France (See attached file: Modifications-vs-1.1) -------------- next part -------------- A non-text attachment was scrubbed... Name: Modifications-vs-1.1 Type: application/octet-stream Size: 3813 bytes Desc: not available Url : http://lists.lustre.org/pipermail/lustre-discuss/attachments/20090612/8c35b9e7/attachment.obj
Rob Latham
2009-Jul-02 15:25 UTC
[Lustre-discuss] [ROMIO Req #940] a new Lustre ADIO driver]
Hello! Sorry for not following up with you sooner. I''ll tell you what I think of these approaches from a ROMIO perspective, and maybe the Cray and Lustre folks can provide their perspective and clear up any questions we''ve got.. On Fri, Jun 12, 2009 at 05:35:17PM +0200, pascal.deveze at bull.net wrote:> I''ve made 4 modifications on the new Lustre ADIO driver. After that, all > the romio tests succeed. > > The application "coll_test" also passes on 2 nodes (4 processes - 2OST) > with the dimension of the 3 d array varying from 1 to 300. > > The version I use comes from mpich2-1.1. > > Hereafter is a quick explanation on my four modifications: > > 1) The calculation of avail_cb_nodes in ADIOI_LUSTRE_Get_striping_info() > For example, when nprocs_for_coll=2, stripe_count=2 and CO=1 , > avail_cb_nodes is set to 1. > The value 2 should be used instead. > > I propose to change the lines by: > > /* avail_cb_nodes should divide stripe_count exactly */ > while (stripe_count % avail_cb_nodes) { > avail_cb_nodes--; > }CO is a ratio of how many compute nodes should participate in collective I/O, right? I''ve kind of forgotten what this code is attempting to do: I mean I can see that it''s setting up the stripe_size, stripe_count, and avail_cb_nodes for the subsequent call to ADIOI_LUSTRE_Calc_my_req(), but I forget why there is both a CO and a CO_max.> 2) The parameter len_list_ptr has been modified in include/adioi.h, so I > propose to change : > int **len_list_ptr; > to > ADIO_Offset **len_list_ptr; > > in ad_lustre_aggregate.c and ad_lustre_wrcoll.cNo argument here. That''s clearly the right thing to do. I''ve committed that change. Take a look at revision 4889 and let me know if I missed anything> 3) Use of buf_idx[ ] in ADIOI_LUSTRE_W_Exchange_data > > I had a lot of troubles with the table of pointers buf_idx[ ]: > "coll_test" with different dimensions of the 3d array detected a lot > of errors. > > I solved the problems by using only one pointer: buf_idx[0] initialized > by 0 in > "ADIOI_LUSTRE_Calc_my_req" and used/modified in > "ADIOI_LUSTRE_W_Exchange_data": > if (send_size[i]) { > MPI_Isend(((char *) buf) + buf_idx[0], send_size[i], > MPI_BYTE, i, myrank + i + 100 * iter, fd->comm, > send_req + j); > j++; > buf_idx[0]+=send_size[i]; > } > Of course, a single variable may be used to do that, but this implies > more changes. > This solution works fine with coll_test (used with a wide range of > dimensions of the 3d array, on 4 nodes and 2 OST). Maybe this solution > is too simple, > and I missed something. What is your opinion ?I''m a little nervous about this one, especially if the cb_config_list hint was given. Could I get a second opinion on this one from anyone?> 4) Macro ADIOI_BUFFERED_WRITE_WITHOUT_READ in ad_lustre_wrstr.c (line 213) > This macro causes an abort in the tests "i_concontig" and "noncontig". > > In this macro, ADIO_WriteContig was called a lot of time with > writebuf_len egal to 0. > What is wanted is to copy the data (by memcpy) in the writebuf as much > as possible > to fill a stripe and then to call ADIO_WriteContig. As far as I > understand it, this was not > the case. > > I replaced this macro by a call to > > memcpy(writebuf + req_off - writebuf_off, (char *)buf + user_off, > req_len); > > This modification works only if it is possible to copy all the user data > in the write buffer (writebuf). (i.e. if bufsize <= writebuf_len) > If bufsize is higher, a loop has to be introduced. > > What do you think about this change ?That''s delicate code to say the least ... Can we just check for writebuf_len equal to 0 and short circuit things? The macro is kind of hairy, but it already handles the bufsize > writebuf_len case and has seen a decade of testing and bugfixes.> You will find attached my modification vs mpich2-1.1. Your questions and > comments about them are welcome.Thanks much for including the patch. that''s a huge help when having this kind of conversation. In the future, can you send it in "unified diff" format (-u)? Also, you made four kinds of changes. I know how much a pain it is, but is it possible next time to split up the patch into four pieces, one for each topic? Thanks for your help with the Lustre driver, and for fielding my questions! ==rob -- Rob Latham Mathematics and Computer Science Division Argonne National Lab, IL USA