Howdy Isaac,

Nice to meet you.  As Eric suggested I am also cc:ing Nick Henke,
since he might find this an interesting discussion.  For all you
lustre-devel dwellers out there, feel free to chime in.

I have been running a few tests on the Franklin Cray XT at NERSC and
also on Jaguar (Cray XT at ORNL) and on Jacquard (Opteron/Infiniband
w/GPFS at NERSC).  You can see a lot of what I have done here:
http://www.nersc.gov/~uselton/ipm-io.html
In particular, this link shows something of interest:
http://www.nersc.gov/~uselton/frank_jag/

These tests use MADbench, which has a somewhat unusual I/O pattern.  It
implements an out-of-core solution to a series of very large matrix
operations.  The third row of graphs gives an idea of the aggregate I/O
emerging from the application over the course of the run.  It has a
pattern of writes, then reads and writes, then reads.  Each of the I/O
spikes is from every task writing or reading a single 300 MB buffer.
The last row of graphs gives a sense of the task-by-task behavior.

The "frank_jag" page shows data collected during 4 tests with 256 tasks
(4 tasks per node on 64 nodes).  The target is a single file striped
across all OSTs of the Lustre file system.  Two tests are on Franklin
and two on Jaguar.  Each machine runs a test using the POSIX I/O
interface and another using the MPI-I/O interface.  In the third column
the Franklin MPI-I/O test has extremely long delays in the reads in the
middle phase, but not during the other reads or any of the writes.  This
does not happen for POSIX, nor does it happen for Jaguar using MPI-I/O.
The results shown are entirely reproducible and not due to interference
from other jobs on the system.  The only difference between the Franklin
and Jaguar configurations is that Jaguar has 144 OSTs on 72 OSSs instead
of 80 OSTs on 20 OSSs.

Eric put the notion in my head that we may be looking at a contention
issue in the Sea-Star network.  Since the I/O is being necked down to
20 OSSs in the case of Franklin, this seems plausible.  If you guys
have a moment to consider the subject, I'd like to think about:

a) Why would contention introduce the catastrophic delays rather than
just slow things down generally and more or less evenly?  Is there some
form of back-off in the protocol(s) that could occasionally get kicked
up to tens of seconds?

b) Why is the contention introduced only in the MPI-I/O test and not in
the POSIX test?  Does the MPI-I/O from Cray's xt-mpt/3.1.0 divert I/O to
a subset of nodes so that all the I/O is going through a smaller section
of the torus?

If I have been too terse in this note feel free to ask questions and
I'll try to add more detail.

Cheers,
Andrew
On Mon, Mar 16, 2009 at 01:41:40PM -0700, Andrew C. Uselton wrote:
> Howdy Isaac,
> Nice to meet you.  As Eric suggested I am also cc:ing Nick Henke,
> since he might find this an interesting discussion.  For all you
> lustre-devel dwellers out there, feel free to chime in.

Hi Andrew.  Yes, there is no way to avoid me...  I don't have too much
information about Lustre but I can tell you a bit about Madbench and
MPI-IO.

> b) Why is the contention introduced only in the MPI-I/O test and not in
> the POSIX test?  Does the MPI-I/O from Cray's xt-mpt/3.1.0 divert I/O to
> a subset of nodes so that all the I/O is going through a smaller section
> of the torus?

Cray's MPI-IO is old enough that it's doing "generic unix" file system
operations.  (I've committed the optimized Lustre driver, but it will
take some time for it to end up on a Cray.)

Madbench is doing independent I/O, though, so optimized or no, there
is no "aggregation" -- it's a shame, too, as it sounds like aggregation
would at least rule out your contention theory.

You've essentially written this up on your website already, but for the
wider lustre-devel audience, the MPI-IO in Madbench is dead simple:

    MPI_File_seek
    MPI_File_read or MPI_File_write (or the nonblocking versions)
    MPI_Barrier

This is *almost* an exact correspondence to the POSIX case:

    fseeko64
    fread or fwrite
    fclose

Did you see the difference?  I know you did, because you wrote
http://www.nersc.gov/~uselton/sf-mpi.html

How big is an individual Madbench I/O operation for you?  We ran some
I/O tests with Madbench on our BlueGene that showed about 20 MB per
operation -- large enough that I'd be surprised if the libc buffering
was having much effect.

So, off the top of my head I don't have too many ideas from an MPI-IO
perspective.  Your graphs suggest irregular performance on Franklin
for both reads and writes
(http://www.nersc.gov/~uselton/frank_jag/20090215183709/rate.png), so
that kind of rules out interference from the lock manager.

To me, your contention idea is still in play.

==rob

-- 
Rob Latham
Mathematics and Computer Science Division    A215 0178 EA2D B059 8CDF
Argonne National Lab, IL USA                 B29D F333 664A 4280 315B
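[A minimal, self-contained sketch of the independent (non-collective)
access pattern listed above.  The file name, transfer size, and offsets
are placeholders for illustration, not MADbench's actual parameters.]

    /* Independent MPI-IO, MADbench-style: each rank seeks to its own
     * offset in a shared file, does a blocking write, then synchronizes.
     * No collective calls, so no two-phase aggregation takes place. */
    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_File fh;
        int rank;
        const int chunk = 8 * 1024 * 1024;  /* placeholder; the runs here use ~300 MB */
        char *buf;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        buf = malloc(chunk);

        MPI_File_open(MPI_COMM_WORLD, "shared_file",   /* hypothetical name */
                      MPI_MODE_CREATE | MPI_MODE_RDWR, MPI_INFO_NULL, &fh);

        MPI_File_seek(fh, (MPI_Offset)rank * chunk, MPI_SEEK_SET);
        MPI_File_write(fh, buf, chunk, MPI_BYTE, MPI_STATUS_IGNORE);
        MPI_Barrier(MPI_COMM_WORLD);    /* the barrier that follows each I/O step */

        MPI_File_close(&fh);
        free(buf);
        MPI_Finalize();
        return 0;
    }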
Robert Latham wrote:
> On Mon, Mar 16, 2009 at 01:41:40PM -0700, Andrew C. Uselton wrote:
>> Howdy Isaac,
...
> Hi Andrew.  Yes, there is no way to avoid me...  I don't have too much
> information about Lustre but I can tell you a bit about Madbench and
> MPI-IO.

Glad to hear from you :)
...
> Cray's MPI-IO is old enough that it's doing "generic unix" file system
> operations.  (I've committed the optimized Lustre driver, but it will
> take some time for it to end up on a Cray.)

I am looking over David Knaak's shoulder even as we speak (electron?).

> Madbench is doing independent I/O, though, so optimized or no, there
> is no "aggregation" -- it's a shame, too, as it sounds like aggregation
> would at least rule out your contention theory.

When you say "independent" you mean it isn't using MPI "collective" I/O,
yes?  That is true, just making sure I understand your comment.

> How big is an individual Madbench I/O operation for you?  We ran some

I usually run MADbench "as large as possible".  That ends up with the
target buffer for I/O in the 300 MB range.

> So, off the top of my head I don't have too many ideas from an MPI-IO
> perspective.  Your graphs suggest irregular performance on Franklin
> for both reads and writes
> (http://www.nersc.gov/~uselton/frank_jag/20090215183709/rate.png), so
> that kind of rules out interference from the lock manager.

There is some variability in the writes (and reads in other tests), but
the MPI-I/O middle-phase reads seem to be a special case.  Those delays
are an order of magnitude higher and do not seem to correspond to any
I/O activity.  That's why I'm hoping for a protocol backoff induced by
congestion.

Also note that in that phase, and only in that phase, each node has been
given 1.2 GB to send to the file and is immediately asked to read that
much back in from a different offset.  I've looked quite carefully and
none of the I/O is outside its locked range as established in the first
"writes" phase, so there should be no lock traffic during this phase.
In this middle phase there may be extra resource contention in kernel
space on each node, so an alternative might be a low-probability
near-deadlock on those resources, where writes are still being drained
but reads are already demanding attention.

> To me, your contention idea is still in play.
>
> ==rob

I think I forgot to mention: NERSC is soon planning to extend the
Franklin I/O resources so they look a lot more like Jaguar's.  When they
do we'll be able to "do the experiment", in that if the delay disappears
that argues for contention in the torus getting to the OSSs, or in the
OSSs themselves.  I'm still stumped for why it would only happen in the
MPI-I/O case, though.

Cheers,
Andrew
On Mon, Mar 16, 2009 at 01:41:40PM -0700, Andrew C. Uselton wrote:
> Howdy Isaac,
> Nice to meet you.  As Eric suggested I am also cc:ing Nick Henke,
> since he might find this an interesting discussion.  For all you
> lustre-devel dwellers out there, feel free to chime in.

Hello Andrew, please see my comments inline.

> ......
> The "frank_jag" page shows data collected during 4 tests with 256 tasks
> (4 tasks per node on 64 nodes).  The target is a single file striped
> across all OSTs of the Lustre file system.  Two tests are on Franklin
> and two on Jaguar.  Each machine runs a test using the POSIX I/O
> interface and another using the MPI-I/O interface.  In the third column
> the Franklin MPI-I/O test has extremely long delays in the reads in the
> middle phase, but not during the other reads or any of the writes.  This

I've got zero knowledge of MPI-IO.  Could you please elaborate a bit on
how these "delays in the reads" are measured and what "the middle phase"
is?

> does not happen for POSIX, nor does it happen for Jaguar using MPI-I/O.
> The results shown are entirely reproducible and not due to interference
> from other jobs on the system.  The only difference between the Franklin
> and Jaguar configurations is that Jaguar has 144 OSTs on 72 OSSs instead
> of 80 OSTs on 20 OSSs.

Not sure about Franklin, but on Jaguar, depending on the file system in
use, the OSSs could reside on either the Sea-Star network or an IB
network (accessed via lnet routers).  I think it might be worthwhile to
double check which server network had been used.

> Eric put the notion in my head that we may be looking at a contention
> issue in the Sea-Star network.  Since the I/O is being necked down to
> 20 OSSs in the case of Franklin, this seems plausible.  If you guys
> have a moment to consider the subject, I'd like to think about:
> a) Why would contention introduce the catastrophic delays rather than
> just slow things down generally and more or less evenly?  Is there some
> form of back-off in the protocol(s) that could occasionally get kicked
> up to tens of seconds?

It involves many layers:

1. At the Lustre/PTLRPC layer, there is a limit on the number of
   in-flight RPCs to a server.  This is end-to-end, and the limit could
   change at runtime.

2. At the lnet/lnd layer, for ptllnd and o2iblnd, there's a credit-based
   mechanism to prevent a sending node from overrunning buffers at the
   remote end.  This is not end-to-end, and the number of pre-granted
   credits doesn't change at runtime.

3. Cray Portals and the Sea-Star network run beneath lnet/ptllnd, and
   I'd think that there could also be some similar mechanisms there.

Thanks,
Isaac
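[For reference, the per-target limit in item 1 can be inspected on a
client.  A small sketch follows, assuming the usual
/proc/fs/lustre/osc/<target>/max_rpcs_in_flight layout of this era; the
exact path may differ by Lustre version, and in practice one would just
run "lctl get_param osc.*.max_rpcs_in_flight".]

    /* Print max_rpcs_in_flight for every OSC on a Lustre client.
     * Assumes the /proc/fs/lustre/osc/<target>/ layout. */
    #include <glob.h>
    #include <stdio.h>

    int main(void)
    {
        glob_t g;
        size_t i;

        if (glob("/proc/fs/lustre/osc/*/max_rpcs_in_flight", 0, NULL, &g) != 0) {
            fprintf(stderr, "no OSC entries found (not a Lustre client?)\n");
            return 1;
        }
        for (i = 0; i < g.gl_pathc; i++) {
            FILE *f = fopen(g.gl_pathv[i], "r");
            int limit;
            if (f && fscanf(f, "%d", &limit) == 1)
                printf("%s: %d\n", g.gl_pathv[i], limit);
            if (f)
                fclose(f);
        }
        globfree(&g);
        return 0;
    }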
On Mon, Mar 16, 2009 at 01:41:40PM -0700, Andrew C. Uselton wrote:
> ......
> The "frank_jag" page shows data collected during 4 tests with 256 tasks
> (4 tasks per node on 64 nodes).  The target is a single file striped
> across all OSTs of the Lustre file system.  Two tests are on Franklin
> and two on Jaguar.  Each machine runs a test using the POSIX I/O
> interface and another using the MPI-I/O interface.  In the third column
> the Franklin MPI-I/O test has extremely long delays in the reads in the
> middle phase, but not during the other reads or any of the writes.  This
> does not happen for POSIX, nor does it happen for Jaguar using MPI-I/O.
> The results shown are entirely reproducible and not due to interference
> from other jobs on the system.  The only difference between the Franklin
> and Jaguar configurations is that Jaguar has 144 OSTs on 72 OSSs instead
> of 80 OSTs on 20 OSSs.

I just happened to have a talk with an ORNL colleague and was told that,
compared with the other Cray XT system, it's relatively easier to hit
congestion in the Sea-Star network on Jaguar, where the servers are less
distributed with regard to the network topology.  So I wonder whether
there could be a similar difference between Franklin and Jaguar?

On the other hand, were the POSIX test and the MPI-IO test on Franklin
run over the same set of client nodes?

Thanks,
Isaac
Isaac Huang wrote:
> On Mon, Mar 16, 2009 at 01:41:40PM -0700, Andrew C. Uselton wrote:
>> Howdy Isaac,
...
> Hello Andrew, please see my comments inline.
>
>> ......
>> The "frank_jag" page shows data collected during 4 tests with 256 tasks
>> (4 tasks per node on 64 nodes).  The target is a single file striped
>> across all OSTs of the Lustre file system.  Two tests are on Franklin
>> and two on Jaguar.  Each machine runs a test using the POSIX I/O
>> interface and another using the MPI-I/O interface.  In the third column
>> the Franklin MPI-I/O test has extremely long delays in the reads in the
>> middle phase, but not during the other reads or any of the writes.  This
>
> I've got zero knowledge of MPI-IO.  Could you please elaborate a bit on
> how these "delays in the reads" are measured and what "the middle phase"
> is?

All discussion is related to the figures in:
http://www.nersc.gov/~uselton/frank_jag/

The application in question is MADbench.  I can send a reference or two
if you want detail on how MADbench works.  In short, it is an MPI
application that solves a very large matrix problem with an out-of-core
algorithm.  That is, it works on a matrix problem that fills all the
memory on all the nodes, 64 nodes/256 tasks in this case.  It must write
out intermediate results and then read them back in.  As such, every
task must execute a write of 300 MB at each step in "phase 1".  In our
example phase 1 has eight steps, so eight 300 MB writes from each of
256 tasks.  In "phase 2", each of the eight matrices must be read in
turn, a result calculated, and the result written out:

    for (i = 0; i < 8; i++) { read(300 MB); compute(); write(300 MB); }

In "phase 3" the eight results are again read back in and a final value
calculated.  The reads in the middle phase are the ones that take a long
time when using an MPI-I/O interface and a single-file I/O model.  If
you follow along in the graphs you should be able to pick out the above
actions and see where the slow reads are.

The data for identifying this behavior comes from augmenting the
application with the "Integrated Performance Monitoring" library (IPM).
That tool provides an event trace across the whole application of
library call, result, and timing information.  With that one may
reconstruct the trace graphs seen on the web page.  Other interesting
manipulations of that data also appear, for instance a histogram of
frequency of occurrence versus bandwidth exhibited by individual I/Os.

> Not sure about Franklin, but on Jaguar, depending on the file system in
> use, the OSSs could reside on either the Sea-Star network or an IB
> network (accessed via lnet routers).  I think it might be worthwhile to
> double check which server network had been used.

I was using /lustre/scr144 on Jaguar.  I believe that is SeaStar.

> It involves many layers:
> 1. At the Lustre/PTLRPC layer, there is a limit on the number of
>    in-flight RPCs to a server.  This is end-to-end, and the limit could
>    change at runtime.

The amount of I/O (1.2 GB per node, per step) is large enough that I'd
assume we hit steady state in the RPC mechanism.  Most of the time all
available system "cache" is full and RPCs are being issued as quickly as
they can be completed.

> 2. At the lnet/lnd layer, for ptllnd and o2iblnd, there's a credit-based
>    mechanism to prevent a sending node from overrunning buffers at the
>    remote end.  This is not end-to-end, and the number of pre-granted
>    credits doesn't change at runtime.

I am only vaguely familiar with the credit mechanism.  That would be
relevant for the writes, yes?  Is it possible to exhaust the available
credits and get blocked trying to clear "cache", such that the reads
(which got started after) can't complete until the writes are drained
from "cache"?  That would certainly address why the delays only occur in
the read, write, read, write ... (middle) phase.

> 3. Cray Portals and the Sea-Star network run beneath lnet/ptllnd, and
>    I'd think that there could also be some similar mechanisms there.

Yes, I'm shopping for an understanding of how things can get bogged down
this way, and why it only appears to happen for MPI-I/O, not POSIX.

> Thanks,
> Isaac

Your follow-up note about congestion is consistent with Eric's comment.
It may be that the cross-section bandwidth to the region with the OSSs
is not high enough to forestall congestion.  This could be worse on
Franklin (20 OSSs) than on Jaguar (72 OSSs), even if Jaguar does have a
problem with it.

Cheers,
Andrew
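[The phase structure described in the message above boils down to the
following schematic.  Sizes, offsets, and the file name are placeholders
(the real runs use ~300 MB per task and eight matrices), and the compute
steps are elided.]

    /* Schematic of the MADbench out-of-core I/O pattern: phase 1 writes
     * the intermediate matrices, phase 2 reads each back and writes a
     * result, phase 3 reads the results back in. */
    #include <mpi.h>
    #include <stdlib.h>

    #define NSTEPS 8
    #define CHUNK  (8 * 1024 * 1024)    /* placeholder; ~300 MB in the real runs */

    static MPI_Offset off(int rank, int nranks, int step)
    {
        /* each (task, step) pair owns a disjoint region of the shared file */
        return ((MPI_Offset)step * nranks + rank) * CHUNK;
    }

    int main(int argc, char **argv)
    {
        MPI_File fh;
        int rank, nranks, i;
        char *buf = malloc(CHUNK);

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        MPI_File_open(MPI_COMM_WORLD, "madbench_scratch",   /* hypothetical name */
                      MPI_MODE_CREATE | MPI_MODE_RDWR, MPI_INFO_NULL, &fh);

        /* phase 1: write out the intermediate matrices */
        for (i = 0; i < NSTEPS; i++) {
            MPI_File_seek(fh, off(rank, nranks, i), MPI_SEEK_SET);
            MPI_File_write(fh, buf, CHUNK, MPI_BYTE, MPI_STATUS_IGNORE);
            MPI_Barrier(MPI_COMM_WORLD);
        }

        /* phase 2: read each matrix back, "compute", write a result;
         * this is the phase where the long read delays show up on Franklin */
        for (i = 0; i < NSTEPS; i++) {
            MPI_File_seek(fh, off(rank, nranks, i), MPI_SEEK_SET);
            MPI_File_read(fh, buf, CHUNK, MPI_BYTE, MPI_STATUS_IGNORE);
            /* compute() elided */
            MPI_File_seek(fh, off(rank, nranks, NSTEPS + i), MPI_SEEK_SET);
            MPI_File_write(fh, buf, CHUNK, MPI_BYTE, MPI_STATUS_IGNORE);
            MPI_Barrier(MPI_COMM_WORLD);
        }

        /* phase 3: read the results back in for the final calculation */
        for (i = 0; i < NSTEPS; i++) {
            MPI_File_seek(fh, off(rank, nranks, NSTEPS + i), MPI_SEEK_SET);
            MPI_File_read(fh, buf, CHUNK, MPI_BYTE, MPI_STATUS_IGNORE);
            MPI_Barrier(MPI_COMM_WORLD);
        }

        MPI_File_close(&fh);
        free(buf);
        MPI_Finalize();
        return 0;
    }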