Howdy Isaac,

Nice to meet you.  As Eric suggested I am also cc:ing Nick Henke,
since he might find this an interesting discussion.  For all you
lustre-devel dwellers out there, feel free to chime in.

I have been running a few tests on the Franklin Cray XT at NERSC and
also on Jaguar (Cray XT at ORNL) and on Jacquard (Opteron/Infiniband
w/GPFS at NERSC).  You can see a lot of what I have done here:
http://www.nersc.gov/~uselton/ipm-io.html
In particular, this link shows something of interest:
http://www.nersc.gov/~uselton/frank_jag/

These tests use MADbench, which has a somewhat unusual I/O pattern.  It
implements an out-of-core solution to a series of very large matrix
operations.  The third row of graphs gives an idea of the aggregate I/O
emerging from the application over the course of the run.  It has a
pattern of writes, then reads and writes, then reads.  Each of the I/O
spikes is from every task writing or reading a single 300 MB buffer.
The last row of graphs gives a sense of the task-by-task behavior.

The "frank_jag" page shows data collected during 4 tests with 256 tasks
(4 tasks per node on 64 nodes).  The target is a single file striped
across all OSTs of the Lustre file system.  Two tests are on Franklin
and two on Jaguar.  Each machine runs a test using the POSIX I/O
interface and another using the MPI-I/O interface.  In the third column
the Franklin MPI-I/O test has extremely long delays in the reads in the
middle phase, but not during the other reads or any of the writes.  This
does not happen for POSIX, nor does it happen for Jaguar using MPI-I/O.
The results shown are entirely reproducible and not due to interference
from other jobs on the system.  The only difference between the Franklin
and Jaguar configurations is that Jaguar has 144 OSTs on 72 OSSs instead
of 80 OSTs on 20 OSSs.

Eric put the notion in my head that we may be looking at a contention
issue in the Sea-Star network.  Since the I/O is being necked down to
20 OSSs in the case of Franklin, this seems plausible.  If you guys
have a moment to consider the subject, I'd like to think about:

a) Why would contention introduce the catastrophic delays rather than
just slow things down generally and more or less evenly?  Is there some
form of back-off in the protocol(s) that could occasionally get kicked
up to tens of seconds?

b) Why is the contention introduced only in the MPI-I/O test and not in
the POSIX test?  Does the MPI-I/O from Cray's xt-mpt/3.1.0 divert I/O to
a subset of nodes so that all the I/O is going through a smaller section
of the torus?

If I have been too terse in this note feel free to ask questions and
I'll try to add more detail.

Cheers,
Andrew
On Mon, Mar 16, 2009 at 01:41:40PM -0700, Andrew C. Uselton wrote:
> Howdy Isaac,
> Nice to meet you.  As Eric suggested I am also cc:ing Nick Henke,
> since he might find this an interesting discussion.  For all you
> lustre-devel dwellers out there, feel free to chime in.

Hi Andrew.  Yes, there is no way to avoid me...  I don't have too much
information about Lustre but I can tell you a bit about Madbench and
MPI-IO.

> b) Why is the contention introduced only in the MPI-I/O test and not in
> the POSIX test?  Does the MPI-I/O from Cray's xt-mpt/3.1.0 divert I/O to
> a subset of nodes so that all the I/O is going through a smaller section
> of the torus?

Cray's MPI-IO is old enough that it's doing "generic unix" file system
operations.  (I've committed the optimized Lustre driver, but it will
take some time for it to end up on a Cray.)

Madbench is doing independent I/O, though, so optimized or no, there
is no "aggregation" -- it's a shame, too, as it sounds like aggregation
would at least rule out your contention theory.

You've essentially written this up on your website already, but for the
wider lustre-devel audience, the MPI-IO in Madbench is dead simple:

    MPI_File_seek
    MPI_File_read or MPI_File_write (or the nonblocking versions)
    MPI_Barrier

This is *almost* an exact correspondence to the POSIX case:

    fseeko64
    fread or fwrite
    fclose

Did you see the difference?  I know you did, because you wrote
http://www.nersc.gov/~uselton/sf-mpi.html

How big is an individual Madbench I/O operation for you?  We ran some
I/O tests with Madbench on our BlueGene that showed about 20 MB per
operation -- large enough that I'd be surprised if the libc buffering
was having much effect.

So, off the top of my head I don't have too many ideas from an MPI-IO
perspective.  Your graphs suggest irregular performance on Franklin
for both reads and writes
(http://www.nersc.gov/~uselton/frank_jag/20090215183709/rate.png), so
that kind of rules out interference from the lock manager.

To me, your contention idea is still in play.

==rob

-- 
Rob Latham
Mathematics and Computer Science Division    A215 0178 EA2D B059 8CDF
Argonne National Lab, IL USA                 B29D F333 664A 4280 315B
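[A minimal, self-contained sketch of the independent (non-collective)
access pattern listed above.  The file name, transfer size, and offsets
are placeholders for illustration, not MADbench's actual parameters.]

    /* Independent MPI-IO, MADbench-style: each rank seeks to its own
     * offset in a shared file, does a blocking write, then synchronizes.
     * No collective calls, so no two-phase aggregation takes place. */
    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_File fh;
        int rank;
        const int chunk = 8 * 1024 * 1024;  /* placeholder; the runs here use ~300 MB */
        char *buf;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        buf = malloc(chunk);

        MPI_File_open(MPI_COMM_WORLD, "shared_file",   /* hypothetical name */
                      MPI_MODE_CREATE | MPI_MODE_RDWR, MPI_INFO_NULL, &fh);

        MPI_File_seek(fh, (MPI_Offset)rank * chunk, MPI_SEEK_SET);
        MPI_File_write(fh, buf, chunk, MPI_BYTE, MPI_STATUS_IGNORE);
        MPI_Barrier(MPI_COMM_WORLD);    /* the barrier that follows each I/O step */

        MPI_File_close(&fh);
        free(buf);
        MPI_Finalize();
        return 0;
    }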
Robert Latham wrote:
> On Mon, Mar 16, 2009 at 01:41:40PM -0700, Andrew C. Uselton wrote:
>> Howdy Isaac,
...
> Hi Andrew.  Yes, there is no way to avoid me...  I don't have too much
> information about Lustre but I can tell you a bit about Madbench and
> MPI-IO.

Glad to hear from you :)
...
> Cray's MPI-IO is old enough that it's doing "generic unix" file system
> operations.  (I've committed the optimized Lustre driver, but it will
> take some time for it to end up on a Cray.)

I am looking over David Knaak's shoulder even as we speak (electron?).

> Madbench is doing independent I/O, though, so optimized or no, there
> is no "aggregation" -- it's a shame, too, as it sounds like aggregation
> would at least rule out your contention theory.

When you say "independent" you mean it isn't using MPI "collective" I/O,
yes?  That is true, just making sure I understand your comment.

> How big is an individual Madbench I/O operation for you?  We ran some

I usually run MADbench "as large as possible".  That ends up with the
target buffer for I/O in the 300 MB range.

> So, off the top of my head I don't have too many ideas from an MPI-IO
> perspective.  Your graphs suggest irregular performance on Franklin
> for both reads and writes
> (http://www.nersc.gov/~uselton/frank_jag/20090215183709/rate.png), so
> that kind of rules out interference from the lock manager.

There is some variability in the writes (and reads in other tests), but
the MPI-I/O middle-phase reads seem to be a special case.  Those delays
are an order of magnitude higher and do not seem to correspond to any
I/O activity.  That's why I'm hoping for a protocol backoff induced by
congestion.

Also note that in that phase, and only in that phase, each node has been
given 1.2 GB to send to the file and is immediately asked to read that
much back in from a different offset.  I've looked quite carefully and
none of the I/O is outside its locked range as established in the first
"writes" phase, so there should be no lock traffic during this phase.
In this middle phase there may be extra resource contention in kernel
space on each node, so an alternative might be a low-probability
near-deadlock on those resources, where writes are still being drained
but reads are already demanding attention.

> To me, your contention idea is still in play.
>
> ==rob

I think I forgot to mention: NERSC is soon planning to extend the
Franklin I/O resources so they look a lot more like Jaguar's.  When they
do we'll be able to "do the experiment", in that if the delay disappears
that argues for contention in the torus getting to the OSSs, or in the
OSSs themselves.  I'm still stumped for why it would only happen in the
MPI-I/O case, though.

Cheers,
Andrew
On Mon, Mar 16, 2009 at 01:41:40PM -0700, Andrew C. Uselton wrote:
> Howdy Isaac,
> Nice to meet you.  As Eric suggested I am also cc:ing Nick Henke,
> since he might find this an interesting discussion.  For all you
> lustre-devel dwellers out there, feel free to chime in.

Hello Andrew, please see my comments inline.

> ......
> The "frank_jag" page shows data collected during 4 tests with 256 tasks
> (4 tasks per node on 64 nodes).  The target is a single file striped
> across all OSTs of the Lustre file system.  Two tests are on Franklin
> and two on Jaguar.  Each machine runs a test using the POSIX I/O
> interface and another using the MPI-I/O interface.  In the third column
> the Franklin MPI-I/O test has extremely long delays in the reads in the
> middle phase, but not during the other reads or any of the writes.  This

I've got zero knowledge of MPI-IO.  Could you please elaborate a bit on
how these "delays in the reads" are measured and what "the middle phase"
is?

> does not happen for POSIX, nor does it happen for Jaguar using MPI-I/O.
> The results shown are entirely reproducible and not due to interference
> from other jobs on the system.  The only difference between the Franklin
> and Jaguar configurations is that Jaguar has 144 OSTs on 72 OSSs instead
> of 80 OSTs on 20 OSSs.

Not sure about Franklin, but on Jaguar, depending on the file system in
use, the OSSs could reside on either the Sea-Star network or an IB
network (accessed via lnet routers).  I think it might be worthwhile to
double check which server network had been used.

> Eric put the notion in my head that we may be looking at a contention
> issue in the Sea-Star network.  Since the I/O is being necked down to
> 20 OSSs in the case of Franklin, this seems plausible.  If you guys
> have a moment to consider the subject, I'd like to think about:
> a) Why would contention introduce the catastrophic delays rather than
> just slow things down generally and more or less evenly?  Is there some
> form of back-off in the protocol(s) that could occasionally get kicked
> up to tens of seconds?

It involves many layers:

1. At the Lustre/PTLRPC layer, there is a limit on the number of
   in-flight RPCs to a server.  This is end-to-end, and the limit could
   change at runtime.

2. At the lnet/lnd layer, for ptllnd and o2iblnd, there's a credit-based
   mechanism to prevent a sending node from overrunning buffers at the
   remote end.  This is not end-to-end, and the number of pre-granted
   credits doesn't change at runtime.

3. Cray Portals and the Sea-Star network run beneath lnet/ptllnd, and
   I'd think that there could also be some similar mechanisms there.

Thanks,
Isaac
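[For reference, the per-target limit in item 1 can be inspected on a
client.  A small sketch follows, assuming the usual
/proc/fs/lustre/osc/<target>/max_rpcs_in_flight layout of this era; the
exact path may differ by Lustre version, and in practice one would just
run "lctl get_param osc.*.max_rpcs_in_flight".]

    /* Print max_rpcs_in_flight for every OSC on a Lustre client.
     * Assumes the /proc/fs/lustre/osc/<target>/ layout. */
    #include <glob.h>
    #include <stdio.h>

    int main(void)
    {
        glob_t g;
        size_t i;

        if (glob("/proc/fs/lustre/osc/*/max_rpcs_in_flight", 0, NULL, &g) != 0) {
            fprintf(stderr, "no OSC entries found (not a Lustre client?)\n");
            return 1;
        }
        for (i = 0; i < g.gl_pathc; i++) {
            FILE *f = fopen(g.gl_pathv[i], "r");
            int limit;
            if (f && fscanf(f, "%d", &limit) == 1)
                printf("%s: %d\n", g.gl_pathv[i], limit);
            if (f)
                fclose(f);
        }
        globfree(&g);
        return 0;
    }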
On Mon, Mar 16, 2009 at 01:41:40PM -0700, Andrew C. Uselton wrote:
> ......
> The "frank_jag" page shows data collected during 4 tests with 256 tasks
> (4 tasks per node on 64 nodes).  The target is a single file striped
> across all OSTs of the Lustre file system.  Two tests are on Franklin
> and two on Jaguar.  Each machine runs a test using the POSIX I/O
> interface and another using the MPI-I/O interface.  In the third column
> the Franklin MPI-I/O test has extremely long delays in the reads in the
> middle phase, but not during the other reads or any of the writes.  This
> does not happen for POSIX, nor does it happen for Jaguar using MPI-I/O.
> The results shown are entirely reproducible and not due to interference
> from other jobs on the system.  The only difference between the Franklin
> and Jaguar configurations is that Jaguar has 144 OSTs on 72 OSSs instead
> of 80 OSTs on 20 OSSs.

I just happened to have a talk with an ORNL colleague and was told that,
compared with the other Cray XT system, it's relatively easier to hit
congestion in the Sea-Star network on Jaguar, where the servers are less
distributed with regard to the network topology.  So I wonder whether
there could be a similar difference between Franklin and Jaguar?

On the other hand, were the POSIX test and the MPI-IO test on Franklin
run over the same set of client nodes?

Thanks,
Isaac
Isaac Huang wrote:
> On Mon, Mar 16, 2009 at 01:41:40PM -0700, Andrew C. Uselton wrote:
>> Howdy Isaac,
...
> Hello Andrew, please see my comments inline.
>
>> ......
>> The "frank_jag" page shows data collected during 4 tests with 256 tasks
>> (4 tasks per node on 64 nodes).  The target is a single file striped
>> across all OSTs of the Lustre file system.  Two tests are on Franklin
>> and two on Jaguar.  Each machine runs a test using the POSIX I/O
>> interface and another using the MPI-I/O interface.  In the third column
>> the Franklin MPI-I/O test has extremely long delays in the reads in the
>> middle phase, but not during the other reads or any of the writes.  This
>
> I've got zero knowledge of MPI-IO.  Could you please elaborate a bit on
> how these "delays in the reads" are measured and what "the middle phase"
> is?

All discussion is related to the figures in:
http://www.nersc.gov/~uselton/frank_jag/

The application in question is MADbench.  I can send a reference or two
if you want detail on how MADbench works.  In short, it is an MPI
application that solves a very large matrix problem with an out-of-core
algorithm.  That is, it works on a matrix problem that fills all the
memory on all the nodes, 64 nodes/256 tasks in this case.  It must write
out intermediate results and then read them back in.  As such, every
task must execute a write of 300 MB at each step in "phase 1".  In our
example phase 1 has eight steps, so eight 300 MB writes from each of
256 tasks.  In "phase 2", each of the eight matrices must be read in
turn, a result calculated, and the result written out:

    for (i = 0; i < 8; i++) { read(300 MB); compute(); write(300 MB); }

In "phase 3" the eight results are again read back in and a final value
calculated.  The reads in the middle phase are the ones that take a long
time when using an MPI-I/O interface and a single-file I/O model.  If
you follow along in the graphs you should be able to pick out the above
actions and see where the slow reads are.

The data for identifying this behavior comes from augmenting the
application with the "Integrated Performance Monitoring" library (IPM).
That tool provides an event trace across the whole application of
library call, result, and timing information.  With that one may
reconstruct the trace graphs seen on the web page.  Other interesting
manipulations of that data also appear, for instance a histogram of
frequency of occurrence versus bandwidth exhibited by individual I/Os.

> Not sure about Franklin, but on Jaguar, depending on the file system in
> use, the OSSs could reside on either the Sea-Star network or an IB
> network (accessed via lnet routers).  I think it might be worthwhile to
> double check which server network had been used.

I was using /lustre/scr144 on Jaguar.  I believe that is SeaStar.

> It involves many layers:
> 1. At the Lustre/PTLRPC layer, there is a limit on the number of
>    in-flight RPCs to a server.  This is end-to-end, and the limit could
>    change at runtime.

The amount of I/O (1.2 GB per node, per step) is large enough that I'd
assume we hit steady state in the RPC mechanism.  Most of the time all
available system "cache" is full and RPCs are being issued as quickly as
they can be completed.

> 2. At the lnet/lnd layer, for ptllnd and o2iblnd, there's a credit-based
>    mechanism to prevent a sending node from overrunning buffers at the
>    remote end.  This is not end-to-end, and the number of pre-granted
>    credits doesn't change at runtime.

I am only vaguely familiar with the credit mechanism.  That would be
relevant for the writes, yes?  Is it possible to exhaust the available
credits and get blocked trying to clear "cache", such that the reads
(which got started after) can't complete until the writes are drained
from "cache"?  That would certainly address why the delays only occur in
the read, write, read, write ... (middle) phase.

> 3. Cray Portals and the Sea-Star network run beneath lnet/ptllnd, and
>    I'd think that there could also be some similar mechanisms there.

Yes, I'm shopping for an understanding of how things can get bogged down
this way, and why it only appears to happen for MPI-I/O, not POSIX.

> Thanks,
> Isaac

Your follow-up note about congestion is consistent with Eric's comment.
It may be that the cross-section bandwidth to the region with the OSSs
is not high enough to forestall congestion.  This could be worse on
Franklin (20 OSSs) than on Jaguar (72 OSSs), even if Jaguar does have a
problem with it.

Cheers,
Andrew
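[The phase structure described in the message above boils down to the
following schematic.  Sizes, offsets, and the file name are placeholders
(the real runs use ~300 MB per task and eight matrices), and the compute
steps are elided.]

    /* Schematic of the MADbench out-of-core I/O pattern: phase 1 writes
     * the intermediate matrices, phase 2 reads each back and writes a
     * result, phase 3 reads the results back in. */
    #include <mpi.h>
    #include <stdlib.h>

    #define NSTEPS 8
    #define CHUNK  (8 * 1024 * 1024)    /* placeholder; ~300 MB in the real runs */

    static MPI_Offset off(int rank, int nranks, int step)
    {
        /* each (task, step) pair owns a disjoint region of the shared file */
        return ((MPI_Offset)step * nranks + rank) * CHUNK;
    }

    int main(int argc, char **argv)
    {
        MPI_File fh;
        int rank, nranks, i;
        char *buf = malloc(CHUNK);

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        MPI_File_open(MPI_COMM_WORLD, "madbench_scratch",   /* hypothetical name */
                      MPI_MODE_CREATE | MPI_MODE_RDWR, MPI_INFO_NULL, &fh);

        /* phase 1: write out the intermediate matrices */
        for (i = 0; i < NSTEPS; i++) {
            MPI_File_seek(fh, off(rank, nranks, i), MPI_SEEK_SET);
            MPI_File_write(fh, buf, CHUNK, MPI_BYTE, MPI_STATUS_IGNORE);
            MPI_Barrier(MPI_COMM_WORLD);
        }

        /* phase 2: read each matrix back, "compute", write a result;
         * this is the phase where the long read delays show up on Franklin */
        for (i = 0; i < NSTEPS; i++) {
            MPI_File_seek(fh, off(rank, nranks, i), MPI_SEEK_SET);
            MPI_File_read(fh, buf, CHUNK, MPI_BYTE, MPI_STATUS_IGNORE);
            /* compute() elided */
            MPI_File_seek(fh, off(rank, nranks, NSTEPS + i), MPI_SEEK_SET);
            MPI_File_write(fh, buf, CHUNK, MPI_BYTE, MPI_STATUS_IGNORE);
            MPI_Barrier(MPI_COMM_WORLD);
        }

        /* phase 3: read the results back in for the final calculation */
        for (i = 0; i < NSTEPS; i++) {
            MPI_File_seek(fh, off(rank, nranks, NSTEPS + i), MPI_SEEK_SET);
            MPI_File_read(fh, buf, CHUNK, MPI_BYTE, MPI_STATUS_IGNORE);
            MPI_Barrier(MPI_COMM_WORLD);
        }

        MPI_File_close(&fh);
        free(buf);
        MPI_Finalize();
        return 0;
    }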