Hi,

I'm trying to get the MPI-IO/ROMIO shipped with OpenMPI and MVAPICH2
working with our Lustre 1.8 filesystem. Looking back at the list archives,
3 different solutions have been offered:

1) Disable "data sieving"         (change default library behaviour)
2) Mount Lustre with "localflock" (flock consistent only within a node)
3) Mount Lustre with "flock"      (flock consistent across cluster)

However, it is not entirely clear which of these was considered the
"best". Could anyone who is using MPI-IO on Lustre comment which they
picked, please?

I *think* the May 2008 list archive indicates I should be using (3), but
I'd feel a whole lot better about it if I knew I wasn't alone :)

Cheers,

Mark
-- 
-----------------------------------------------------------------
Mark Dixon                       Email    : m.c.dixon at leeds.ac.uk
HPC/Grid Systems Support         Tel (int): 35429
Information Systems Services     Tel (ext): +44(0)113 343 5429
University of Leeds, LS2 9JT, UK
-----------------------------------------------------------------
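For reference, options (2) and (3) above are Lustre client mount options,
set either on the mount command line or in /etc/fstab. A rough sketch of
the corresponding mount commands (the MGS address, filesystem name and
mount point are made up):

  # option 3: flock coherent across the whole cluster
  mount -t lustre mgs01@tcp0:/scratch /mnt/scratch -o flock

  # option 2: flock coherent only within each client node
  mount -t lustre mgs01@tcp0:/scratch /mnt/scratch -o localflock

If neither option is given, Lustre's default is "noflock", in which case
flock/fcntl lock requests from applications (including ROMIO) simply fail.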
Mark Dixon wrote:
> I'm trying to get the MPI-IO/ROMIO shipped with OpenMPI and MVAPICH2
> working with our Lustre 1.8 filesystem. Looking back at the list archives,
> 3 different solutions have been offered:
>
> 1) Disable "data sieving"         (change default library behaviour)
> 2) Mount Lustre with "localflock" (flock consistent only within a node)
> 3) Mount Lustre with "flock"      (flock consistent across cluster)
>
> However, it is not entirely clear which of these was considered the
> "best". Could anyone who is using MPI-IO on Lustre comment which they
> picked, please?

FWIW, we've been using MPICH2's MPI-IO/ROMIO/ADIO with Lustre (v 1.8) for
several months now, and it's been working reliably. We do mount the Lustre
filesystem with "flock"; at one time I thought it necessary, but I don't
recall if I verified that after the initial problems with MPI-IO were
resolved. Only a recent MPICH2 will have a working MPI-IO/ROMIO/ADIO for
Lustre; perhaps the code would work with OpenMPI and MVAPICH2 as well.

-- 
Martin
we use "localflock" in order to work with MPI-IO. "flock" may consume more addtional resource than "localflock". On Mon, Nov 1, 2010 at 10:35 PM, Mark Dixon <m.c.dixon at leeds.ac.uk> wrote:> Hi, > > I''m trying to get the MPI-IO/ROMIO shipped with OpenMPI and MVAPICH2 > working with our Lustre 1.8 filesystem. Looking back at the list archives, > 3 different solutions have been offered: > > 1) Disable "data sieving" ? ? ? ? (change default library behaviour) > 2) Mount Lustre with "localflock" (flock consistent only within a node) > 3) Mount Lustre with "flock" ? ? ?(flock consistent across cluster) > > However, it is not entirely clear which of these was considered the > "best". Could anyone who is using MPI-IO on Lustre comment which they > picked, please? > > I *think* the May 2008 list archive indicates I should be using (3), but > I''d feel a whole lot better about it if I knew I wasn''t alone :) > > Cheers, > > Mark > -- > ----------------------------------------------------------------- > Mark Dixon ? ? ? ? ? ? ? ? ? ? ? Email ? ?: m.c.dixon at leeds.ac.uk > HPC/Grid Systems Support ? ? ? ? Tel (int): 35429 > Information Systems Services ? ? Tel (ext): +44(0)113 343 5429 > University of Leeds, LS2 9JT, UK > ----------------------------------------------------------------- > _______________________________________________ > Lustre-discuss mailing list > Lustre-discuss at lists.lustre.org > http://lists.lustre.org/mailman/listinfo/lustre-discuss >
On Mon, 1 Nov 2010, Martin Pokorny wrote:
...
> FWIW, we've been using MPICH2's MPI-IO/ROMIO/ADIO with Lustre (v 1.8)
> for several months now, and it's been working reliably. We do mount the
> Lustre filesystem with "flock"; at one time I thought it necessary, but
> I don't recall if I verified that after the initial problems with MPI-IO
> were resolved. Only a recent MPICH2 will have a working
> MPI-IO/ROMIO/ADIO for Lustre; perhaps the code would work with OpenMPI
> and MVAPICH2 as well.
...

Is MPICH2 where ROMIO is developed these days? I found it pretty difficult
to work out where the public face of its development was...

On Tue, 2 Nov 2010, Larry wrote:
> We use "localflock" in order to work with MPI-IO. "flock" may consume
> more resources than "localflock".
...

Thanks for posting, both of you - much appreciated :)

Cripes, a tie (sort of). I think the following supports the "flock"
option, but I would appreciate anyone putting in a counter-opinion:

1) Is it a naive view that, if ROMIO asks for a flock, it needs it? And
that, if it doesn't need it on Lustre, eventually ROMIO will be developed
to stop asking for it?

2) A message in the list archive says that Cray recommend "flock" for
their clusters, and it sounds like they use an enhanced version of ROMIO
in their MPT product.

The ADIO driver for Lustre certainly looks like it's still being actively
worked on to reach maturity, and the various MPI implementations still
need time to incorporate those changes.

I note that the main MPI releases we use on our cluster (OpenMPI 1.4 and
MVAPICH2 1.4 - we're a year behind) have V04 of ad_lustre, but MVAPICH2 is
now closest to MPICH2 as it has since moved to V05. Looks like I need to
do a software refresh... and recommend that our MPI-IO users use MVAPICH2
for the time being.

Thanks,

Mark
-- 
-----------------------------------------------------------------
Mark Dixon                       Email    : m.c.dixon at leeds.ac.uk
HPC/Grid Systems Support         Tel (int): 35429
Information Systems Services     Tel (ext): +44(0)113 343 5429
University of Leeds, LS2 9JT, UK
-----------------------------------------------------------------
On Wed, Nov 03, 2010 at 10:18:51AM +0000, Mark Dixon wrote:
...
> Is MPICH2 where ROMIO is developed these days? I found it pretty
> difficult to work out where the public face of its development was...

Hello! I'm "the ROMIO guy". MPICH2 always contains the latest ROMIO. We'll
try to sync up with folks when major improvements happen. The community
has really come through over the last year with a good Lustre driver for
ROMIO, and now I'm encouraging other projects using ROMIO to sync up with
us. OpenMPI is still running a fairly old version of ROMIO, though. Pascal
Deveze has done all the work of syncing, but is waiting for an OpenMPI
developer to say "ok, this looks fine" and commit it.

> 1) Is it a naive view that, if ROMIO asks for a flock, it needs it? And
> that, if it doesn't need it on Lustre, eventually ROMIO will be
> developed to stop asking for it?

ROMIO uses these fcntl locks in one place on Lustre: the noncontiguous
write path uses an optimization called "data sieving", which is a good
optimization except that it involves a read-modify-write step. If two
processes simultaneously read-modify-write the same region, who wins? We
guard against this with an fcntl lock - or by disabling data sieving
writes.

> 2) A message in the list archive says that Cray recommend "flock" for
> their clusters, and it sounds like they use an enhanced version of
> ROMIO in their MPT product.

Cray-MPI version 3.2 or newer has a different (but good) Lustre driver for
ROMIO. I defer to their advice on their systems.

> The ADIO driver for Lustre certainly looks like it's still being
> actively worked on to reach maturity, and the various MPI
> implementations still need time to incorporate those changes.
>
> I note that the main MPI releases we use on our cluster (OpenMPI 1.4 and
> MVAPICH2 1.4 - we're a year behind) have V04 of ad_lustre, but MVAPICH2
> is now closest to MPICH2 as it has since moved to V05. Looks like I need
> to do a software refresh... and recommend that our MPI-IO users use
> MVAPICH2 for the time being.

You can take the ROMIO distribution from the recent MPICH2-1.3.0 release
and build that against an existing MPI library. You have to get the link
order right, but if you are dedicated to using an old MPI version for
other reasons, linking in a new ROMIO version might be the way to go.

==rob

-- 
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA
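As a concrete illustration of the "disabling data sieving writes"
alternative Rob mentions (option 1 in the original list), ROMIO exposes
this through per-file hints. A minimal sketch, assuming the standard ROMIO
hint name "romio_ds_write"; the file name here is made up and error
checking is omitted:

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Info info;
    MPI_File fh;

    MPI_Init(&argc, &argv);

    /* Ask ROMIO not to use data sieving on the write path, so the
     * independent noncontiguous write code never needs the fcntl lock. */
    MPI_Info_create(&info);
    MPI_Info_set(info, "romio_ds_write", "disable");

    MPI_File_open(MPI_COMM_WORLD, "output.dat",
                  MPI_MODE_CREATE | MPI_MODE_RDWR, info, &fh);

    /* ... independent or collective MPI_File_write* calls go here ... */

    MPI_File_close(&fh);
    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}

The trade-off is that noncontiguous independent writes then turn into many
small writes, so this is usually the least attractive of the three options
for performance.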
On Fri, 5 Nov 2010, Rob Latham wrote:
...
> Hello! I'm "the ROMIO guy".

Hi Rob, thanks for replying - good to know the background and that OpenMPI
isn't being left behind!

...
> ROMIO uses these fcntl locks in one place on Lustre: the noncontiguous
> write path uses an optimization called "data sieving", which is a good
> optimization except that it involves a read-modify-write step. If two
> processes simultaneously read-modify-write the same region, who wins?
> We guard against this with an fcntl lock - or by disabling data sieving
> writes.

Assuming I do not disable data sieving, which of the following options
will most likely give me correct behaviour?

1) Enable Lustre's cluster-wide coherent fcntl locks.

2) Cheat, and enable Lustre's (cheaper) fcntl locks that are only coherent
on an individual computer, on the assumption that Lustre's own internal
locking mechanisms will "do the right thing".

...
> You can take the ROMIO distribution from the recent MPICH2-1.3.0 release
> and build that against an existing MPI library. You have to get the link
> order right, but if you are dedicated to using an old MPI version for
> other reasons, linking in a new ROMIO version might be the way to go.
...

Happily, I'm due to update my MPIs soon :)

Thanks again,

Mark
-- 
-----------------------------------------------------------------
Mark Dixon                       Email    : m.c.dixon at leeds.ac.uk
HPC/Grid Systems Support         Tel (int): 35429
Information Systems Services     Tel (ext): +44(0)113 343 5429
University of Leeds, LS2 9JT, UK
-----------------------------------------------------------------
On Fri, Nov 05, 2010 at 05:11:47PM +0000, Mark Dixon wrote:
> Assuming I do not disable data sieving, which of the following options
> will most likely give me correct behaviour?
>
> 1) Enable Lustre's cluster-wide coherent fcntl locks.

I think you're going to have to go with this approach.

> 2) Cheat, and enable Lustre's (cheaper) fcntl locks that are only
> coherent on an individual computer, on the assumption that Lustre's own
> internal locking mechanisms will "do the right thing".

Let's consider two processes A and B that are doing non-contiguous writes:
A writes to the even bytes and B writes to the odd bytes. Data sieving
means that both processes will read a big chunk of the file, modify the
local copy of the buffer and then write it back. OK, so Lustre will ensure
that either all of A's write or all of B's write shows up, but that still
means one process stomps all over the other.

So you need something to serialize the read-modify-write.

Note that if you use *collective* I/O then much of this is no longer a
problem: because it's collective, the processes can use MPI to coordinate
and eliminate this false-sharing problem.

Note also that if you never ever use a shared file, then I guess you can
use local locks. But if you are always doing one file per processor
instead of a few shared files, you will end up with, at the very least, a
mess of files.

==rob

-- 
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA
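To make the collective-I/O suggestion concrete, here is a minimal sketch
of Rob's even/odd-byte example written with MPI_File_write_all to a single
shared file; the file name and buffer size are made up, and error checking
is omitted:

#include <mpi.h>
#include <string.h>

/* Sketch only: each rank collectively writes every nprocs-th byte of a
 * shared file, starting at offset = rank. Run with 2 ranks, rank 0 writes
 * the even bytes and rank 1 the odd bytes - the interleaved pattern
 * described above, but done collectively so the ranks coordinate instead
 * of racing through data sieving. */
int main(int argc, char **argv)
{
    MPI_File fh;
    MPI_Datatype filetype;
    int rank, nprocs;
    enum { COUNT = 1024 };            /* bytes written per rank; made up */
    char buf[COUNT];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    memset(buf, 'A' + (rank % 26), sizeof(buf));

    /* File view: this rank owns bytes rank, rank+nprocs, rank+2*nprocs, ... */
    MPI_Type_vector(COUNT, 1, nprocs, MPI_BYTE, &filetype);
    MPI_Type_commit(&filetype);

    MPI_File_open(MPI_COMM_WORLD, "interleaved.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_File_set_view(fh, (MPI_Offset)rank, MPI_BYTE, filetype,
                      "native", MPI_INFO_NULL);

    /* Collective write: all ranks call this together, so the MPI-IO layer
     * can merge the interleaved requests into large contiguous writes. */
    MPI_File_write_all(fh, buf, COUNT, MPI_BYTE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Type_free(&filetype);
    MPI_Finalize();
    return 0;
}

Because every rank participates in the same call, the library can exchange
the access patterns and rearrange the data, so there is no read-modify-write
race and no dependence on fcntl locks for correctness.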
On Fri, 5 Nov 2010, Rob Latham wrote:
...
> Let's consider two processes A and B that are doing non-contiguous
> writes: A writes to the even bytes and B writes to the odd bytes. Data
> sieving means that both processes will read a big chunk of the file,
> modify the local copy of the buffer and then write it back. OK, so
> Lustre will ensure that either all of A's write or all of B's write
> shows up, but that still means one process stomps all over the other.
>
> So you need something to serialize the read-modify-write.
>
> Note that if you use *collective* I/O then much of this is no longer a
> problem: because it's collective, the processes can use MPI to
> coordinate and eliminate this false-sharing problem.
>
> Note also that if you never ever use a shared file, then I guess you can
> use local locks. But if you are always doing one file per processor
> instead of a few shared files, you will end up with, at the very least,
> a mess of files.
...

Brilliant, very informative - thanks Rob.

Hopefully this will help other people in my situation; this question has
cropped up a few times on this list and never seems to reach a consensus,
so different people have chosen different options.

Best wishes,

Mark
-- 
-----------------------------------------------------------------
Mark Dixon                       Email    : m.c.dixon at leeds.ac.uk
HPC/Grid Systems Support         Tel (int): 35429
Information Systems Services     Tel (ext): +44(0)113 343 5429
University of Leeds, LS2 9JT, UK
-----------------------------------------------------------------