Hi,

I'm trying to get the MPI-IO/ROMIO shipped with OpenMPI and MVAPICH2
working with our Lustre 1.8 filesystem. Looking back at the list archives,
3 different solutions have been offered:

1) Disable "data sieving"         (change default library behaviour)
2) Mount Lustre with "localflock" (flock consistent only within a node)
3) Mount Lustre with "flock"      (flock consistent across cluster)

However, it is not entirely clear which of these was considered the
"best". Could anyone who is using MPI-IO on Lustre comment which they
picked, please?

I *think* the May 2008 list archive indicates I should be using (3), but
I'd feel a whole lot better about it if I knew I wasn't alone :)

Cheers,

Mark
-- 
-----------------------------------------------------------------
Mark Dixon                       Email    : m.c.dixon at leeds.ac.uk
HPC/Grid Systems Support         Tel (int): 35429
Information Systems Services     Tel (ext): +44(0)113 343 5429
University of Leeds, LS2 9JT, UK
-----------------------------------------------------------------
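For reference, options (2) and (3) above are Lustre client mount options,
set either on the mount command line or in /etc/fstab. A rough sketch of
the corresponding mount commands (the MGS address, filesystem name and
mount point are made up):

  # option 3: flock coherent across the whole cluster
  mount -t lustre mgs01@tcp0:/scratch /mnt/scratch -o flock

  # option 2: flock coherent only within each client node
  mount -t lustre mgs01@tcp0:/scratch /mnt/scratch -o localflock

If neither option is given, Lustre's default is "noflock", in which case
flock/fcntl lock requests from applications (including ROMIO) simply fail.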
Mark Dixon wrote:
> I'm trying to get the MPI-IO/ROMIO shipped with OpenMPI and MVAPICH2
> working with our Lustre 1.8 filesystem. Looking back at the list archives,
> 3 different solutions have been offered:
>
> 1) Disable "data sieving"         (change default library behaviour)
> 2) Mount Lustre with "localflock" (flock consistent only within a node)
> 3) Mount Lustre with "flock"      (flock consistent across cluster)
>
> However, it is not entirely clear which of these was considered the
> "best". Could anyone who is using MPI-IO on Lustre comment which they
> picked, please?

FWIW, we've been using MPICH2's MPI-IO/ROMIO/ADIO with Lustre (v 1.8) for
several months now, and it's been working reliably. We do mount the Lustre
filesystem with "flock"; at one time I thought it necessary, but I don't
recall if I verified that after the initial problems with MPI-IO were
resolved. Only a recent MPICH2 will have a working MPI-IO/ROMIO/ADIO for
Lustre; perhaps the code would work with OpenMPI and MVAPICH2 as well.

-- 
Martin
we use "localflock" in order to work with MPI-IO. "flock" may consume more addtional resource than "localflock". On Mon, Nov 1, 2010 at 10:35 PM, Mark Dixon <m.c.dixon at leeds.ac.uk> wrote:> Hi, > > I''m trying to get the MPI-IO/ROMIO shipped with OpenMPI and MVAPICH2 > working with our Lustre 1.8 filesystem. Looking back at the list archives, > 3 different solutions have been offered: > > 1) Disable "data sieving" ? ? ? ? (change default library behaviour) > 2) Mount Lustre with "localflock" (flock consistent only within a node) > 3) Mount Lustre with "flock" ? ? ?(flock consistent across cluster) > > However, it is not entirely clear which of these was considered the > "best". Could anyone who is using MPI-IO on Lustre comment which they > picked, please? > > I *think* the May 2008 list archive indicates I should be using (3), but > I''d feel a whole lot better about it if I knew I wasn''t alone :) > > Cheers, > > Mark > -- > ----------------------------------------------------------------- > Mark Dixon ? ? ? ? ? ? ? ? ? ? ? Email ? ?: m.c.dixon at leeds.ac.uk > HPC/Grid Systems Support ? ? ? ? Tel (int): 35429 > Information Systems Services ? ? Tel (ext): +44(0)113 343 5429 > University of Leeds, LS2 9JT, UK > ----------------------------------------------------------------- > _______________________________________________ > Lustre-discuss mailing list > Lustre-discuss at lists.lustre.org > http://lists.lustre.org/mailman/listinfo/lustre-discuss >
On Mon, 1 Nov 2010, Martin Pokorny wrote:
...
> FWIW, we've been using MPICH2's MPI-IO/ROMIO/ADIO with Lustre (v 1.8)
> for several months now, and it's been working reliably. We do mount the
> Lustre filesystem with "flock"; at one time I thought it necessary, but
> I don't recall if I verified that after the initial problems with MPI-IO
> were resolved. Only a recent MPICH2 will have a working
> MPI-IO/ROMIO/ADIO for Lustre; perhaps the code would work with OpenMPI
> and MVAPICH2 as well.
...

Is MPICH2 where ROMIO is developed these days? I found it pretty difficult
to work out where the public face of its development was...

On Tue, 2 Nov 2010, Larry wrote:
> We use "localflock" in order to work with MPI-IO. "flock" may consume
> more resources than "localflock".
...

Thanks for posting, both of you - much appreciated :)

Cripes, a tie (sort of). I think the following supports the "flock"
option, but I would appreciate anyone putting in a counter-opinion:

1) Is it a naive view that, if ROMIO asks for a flock, it needs it? And
that, if it doesn't need it on Lustre, eventually ROMIO will be developed
to stop asking for it?

2) A message in the list archive says that Cray recommend "flock" for
their clusters, and it sounds like they use an enhanced version of ROMIO
in their MPT product.

The ADIO driver for Lustre certainly looks like it's still being actively
worked on to reach maturity, and the various MPI implementations still
need time to incorporate those changes.

I note that the main MPI releases we use on our cluster (OpenMPI 1.4 and
MVAPICH2 1.4 - we're a year behind) have V04 of ad_lustre, but MVAPICH2 is
now closest to MPICH2 as it has since moved to V05. Looks like I need to
do a software refresh... and recommend that our MPI-IO users use MVAPICH2
for the time being.

Thanks,

Mark
-- 
-----------------------------------------------------------------
Mark Dixon                       Email    : m.c.dixon at leeds.ac.uk
HPC/Grid Systems Support         Tel (int): 35429
Information Systems Services     Tel (ext): +44(0)113 343 5429
University of Leeds, LS2 9JT, UK
-----------------------------------------------------------------
On Wed, Nov 03, 2010 at 10:18:51AM +0000, Mark Dixon wrote:
...
> Is MPICH2 where ROMIO is developed these days? I found it pretty
> difficult to work out where the public face of its development was...

Hello! I'm "the ROMIO guy". MPICH2 always contains the latest ROMIO. We'll
try to sync up with folks when major improvements happen. The community
has really come through over the last year with a good Lustre driver for
ROMIO, and now I'm encouraging other projects using ROMIO to sync up with
us. OpenMPI is still running a fairly old version of ROMIO, though. Pascal
Deveze has done all the work of syncing, but is waiting for an OpenMPI
developer to say "ok, this looks fine" and commit it.

> 1) Is it a naive view that, if ROMIO asks for a flock, it needs it? And
> that, if it doesn't need it on Lustre, eventually ROMIO will be
> developed to stop asking for it?

ROMIO uses these fcntl locks in one place on Lustre: the noncontiguous
write path uses an optimization called "data sieving", which is a good
optimization except that it involves a read-modify-write step. If two
processes simultaneously read-modify-write the same region, who wins? We
guard against this with an fcntl lock - or by disabling data sieving
writes.

> 2) A message in the list archive says that Cray recommend "flock" for
> their clusters, and it sounds like they use an enhanced version of
> ROMIO in their MPT product.

Cray-MPI version 3.2 or newer has a different (but good) Lustre driver for
ROMIO. I defer to their advice on their systems.

> The ADIO driver for Lustre certainly looks like it's still being
> actively worked on to reach maturity, and the various MPI
> implementations still need time to incorporate those changes.
>
> I note that the main MPI releases we use on our cluster (OpenMPI 1.4 and
> MVAPICH2 1.4 - we're a year behind) have V04 of ad_lustre, but MVAPICH2
> is now closest to MPICH2 as it has since moved to V05. Looks like I need
> to do a software refresh... and recommend that our MPI-IO users use
> MVAPICH2 for the time being.

You can take the ROMIO distribution from the recent MPICH2-1.3.0 release
and build that against an existing MPI library. You have to get the link
order right, but if you are dedicated to using an old MPI version for
other reasons, linking in a new ROMIO version might be the way to go.

==rob

-- 
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA
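As a concrete illustration of the "disabling data sieving writes"
alternative Rob mentions (option 1 in the original list), ROMIO exposes
this through per-file hints. A minimal sketch, assuming the standard ROMIO
hint name "romio_ds_write"; the file name here is made up and error
checking is omitted:

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Info info;
    MPI_File fh;

    MPI_Init(&argc, &argv);

    /* Ask ROMIO not to use data sieving on the write path, so the
     * independent noncontiguous write code never needs the fcntl lock. */
    MPI_Info_create(&info);
    MPI_Info_set(info, "romio_ds_write", "disable");

    MPI_File_open(MPI_COMM_WORLD, "output.dat",
                  MPI_MODE_CREATE | MPI_MODE_RDWR, info, &fh);

    /* ... independent or collective MPI_File_write* calls go here ... */

    MPI_File_close(&fh);
    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}

The trade-off is that noncontiguous independent writes then turn into many
small writes, so this is usually the least attractive of the three options
for performance.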
On Fri, 5 Nov 2010, Rob Latham wrote:
...
> Hello! I'm "the ROMIO guy".

Hi Rob, thanks for replying - good to know the background and that OpenMPI
isn't being left behind!

...
> ROMIO uses these fcntl locks in one place on Lustre: the noncontiguous
> write path uses an optimization called "data sieving", which is a good
> optimization except that it involves a read-modify-write step. If two
> processes simultaneously read-modify-write the same region, who wins?
> We guard against this with an fcntl lock - or by disabling data sieving
> writes.

Assuming I do not disable data sieving, which of the following options
will most likely give me correct behaviour?

1) Enable Lustre's cluster-wide coherent fcntl locks.

2) Cheat, and enable Lustre's (cheaper) fcntl locks that are only coherent
on an individual computer, on the assumption that Lustre's own internal
locking mechanisms will "do the right thing".

...
> You can take the ROMIO distribution from the recent MPICH2-1.3.0 release
> and build that against an existing MPI library. You have to get the link
> order right, but if you are dedicated to using an old MPI version for
> other reasons, linking in a new ROMIO version might be the way to go.
...

Happily, I'm due to update my MPIs soon :)

Thanks again,

Mark
-- 
-----------------------------------------------------------------
Mark Dixon                       Email    : m.c.dixon at leeds.ac.uk
HPC/Grid Systems Support         Tel (int): 35429
Information Systems Services     Tel (ext): +44(0)113 343 5429
University of Leeds, LS2 9JT, UK
-----------------------------------------------------------------
On Fri, Nov 05, 2010 at 05:11:47PM +0000, Mark Dixon wrote:
> Assuming I do not disable data sieving, which of the following options
> will most likely give me correct behaviour?
>
> 1) Enable Lustre's cluster-wide coherent fcntl locks.

I think you're going to have to go with this approach.

> 2) Cheat, and enable Lustre's (cheaper) fcntl locks that are only
> coherent on an individual computer, on the assumption that Lustre's own
> internal locking mechanisms will "do the right thing".

Let's consider two processes A and B that are doing non-contiguous writes:
A writes to the even bytes and B writes to the odd bytes. Data sieving
means that both processes will read a big chunk of the file, modify the
local copy of the buffer and then write it back. OK, so Lustre will ensure
that either all of A's write or all of B's write shows up, but that still
means one process stomps all over the other.

So you need something to serialize the read-modify-write.

Note that if you use *collective* I/O then much of this is no longer a
problem: because it's collective, the processes can use MPI to coordinate
and eliminate this false-sharing problem.

Note also that if you never ever use a shared file, then I guess you can
use local locks. But if you are always doing one file per processor
instead of a few shared files, you will end up with, at the very least, a
mess of files.

==rob

-- 
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA
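To make the collective-I/O suggestion concrete, here is a minimal sketch
of Rob's even/odd-byte example written with MPI_File_write_all to a single
shared file; the file name and buffer size are made up, and error checking
is omitted:

#include <mpi.h>
#include <string.h>

/* Sketch only: each rank collectively writes every nprocs-th byte of a
 * shared file, starting at offset = rank. Run with 2 ranks, rank 0 writes
 * the even bytes and rank 1 the odd bytes - the interleaved pattern
 * described above, but done collectively so the ranks coordinate instead
 * of racing through data sieving. */
int main(int argc, char **argv)
{
    MPI_File fh;
    MPI_Datatype filetype;
    int rank, nprocs;
    enum { COUNT = 1024 };            /* bytes written per rank; made up */
    char buf[COUNT];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    memset(buf, 'A' + (rank % 26), sizeof(buf));

    /* File view: this rank owns bytes rank, rank+nprocs, rank+2*nprocs, ... */
    MPI_Type_vector(COUNT, 1, nprocs, MPI_BYTE, &filetype);
    MPI_Type_commit(&filetype);

    MPI_File_open(MPI_COMM_WORLD, "interleaved.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_File_set_view(fh, (MPI_Offset)rank, MPI_BYTE, filetype,
                      "native", MPI_INFO_NULL);

    /* Collective write: all ranks call this together, so the MPI-IO layer
     * can merge the interleaved requests into large contiguous writes. */
    MPI_File_write_all(fh, buf, COUNT, MPI_BYTE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Type_free(&filetype);
    MPI_Finalize();
    return 0;
}

Because every rank participates in the same call, the library can exchange
the access patterns and rearrange the data, so there is no read-modify-write
race and no dependence on fcntl locks for correctness.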
On Fri, 5 Nov 2010, Rob Latham wrote:
...
> Let's consider two processes A and B that are doing non-contiguous
> writes: A writes to the even bytes and B writes to the odd bytes. Data
> sieving means that both processes will read a big chunk of the file,
> modify the local copy of the buffer and then write it back. OK, so
> Lustre will ensure that either all of A's write or all of B's write
> shows up, but that still means one process stomps all over the other.
>
> So you need something to serialize the read-modify-write.
>
> Note that if you use *collective* I/O then much of this is no longer a
> problem: because it's collective, the processes can use MPI to
> coordinate and eliminate this false-sharing problem.
>
> Note also that if you never ever use a shared file, then I guess you can
> use local locks. But if you are always doing one file per processor
> instead of a few shared files, you will end up with, at the very least,
> a mess of files.
...

Brilliant, very informative - thanks Rob.

Hopefully this will help other people in my situation; this question has
cropped up a few times on this list and never seems to reach a consensus,
so different people have chosen different options.

Best wishes,

Mark
-- 
-----------------------------------------------------------------
Mark Dixon                       Email    : m.c.dixon at leeds.ac.uk
HPC/Grid Systems Support         Tel (int): 35429
Information Systems Services     Tel (ext): +44(0)113 343 5429
University of Leeds, LS2 9JT, UK
-----------------------------------------------------------------