Hello!

It came to my attention that the seastar network does not implement message priorities for various reasons. I really think there is a very valid case for priorities of some sort to allow MPI and other latency-critical traffic to go in front of bulk IO traffic on the wire.

Consider this test I was running the other day on Jaguar. The application writes 250M of data from every core with a plain write() system call; the write() syscall returns very fast (less than 0.5 sec == 400+Mb/sec app-perceived bandwidth) because the data just goes to the memory cache to be flushed later. Then I do 2 barriers one by one with nothing in between. If I run it at sufficient scale (say 1200 cores), the first barrier takes 4.5 seconds to complete and the second one 1.5 seconds, all due to MPI RPCs being stuck behind huge bulk data requests on the clients, presumably (I do not have any other good explanations at least). This makes for a lot of wasted time in applications that would like to use the buffering capabilities provided by the OS.

Do you think something like this could be organized, if not for the current revision then at least for the next version?

Bye,
Oleg
I wonder if that scenario may have some bearing on the results I've mentioned at:

http://www.nersc.gov/~uselton/frank_jag/

It would be interesting to step through the logic if anyone is interested in doing so. The web page itself is terse, so feel free to bug me for details if you have not seen this before.

Cheers,
Andrew

Oleg Drokin wrote:
> It came to my attention that the seastar network does not implement message priorities for various reasons. [...]
Oleg Drokin wrote:
> Hello!
>
> It came to my attention that the seastar network does not implement message priorities for various reasons. I really think there is a very valid case for priorities of some sort to allow MPI and other latency-critical traffic to go in front of bulk IO traffic on the wire.

In the ptllnd, the bulk traffic is set up via short messages, so if the barrier is sent right after the write() returns, it really isn't backed up behind the bulk data.

> Consider this test I was running the other day on Jaguar. The application writes 250M of data from every core with a plain write() system call; the write() syscall returns very fast (less than 0.5 sec == 400+Mb/sec app-perceived bandwidth) because the data just goes to the memory cache to be flushed later. Then I do 2 barriers one by one with nothing in between. If I run it at sufficient scale (say 1200 cores), the first barrier takes 4.5 seconds to complete and the second one 1.5 seconds, all due to MPI RPCs being stuck behind huge bulk data requests on the clients, presumably (I do not have any other good explanations at least). This makes for a lot of wasted time in applications that would like to use the buffering capabilities provided by the OS.

This sounds much more like barrier jitter than backup. The network is capable of servicing the 250M in < .15s. It would be my guess that some of the writes() are taking longer than others and this is causing the barrier to be delayed.

A few questions:
- how many OSS/OSTs are you writing to?
- can you post the MPI app you are using to do this?

The application folks @ ORNL should be able to help you use Craypat or Apprentice to get some runtime data on this app to find where the time is going. Until we have hard data, I don't think we can blame the network.

Cheers,
Nic
On Tue, 2009-03-31 at 22:43 -0600, Oleg Drokin wrote:

> Hello!
>
> It came to my attention that the seastar network does not implement message priorities for various reasons.

That is incorrect. The seastar network does implement at least one priority scheme, based on age. It's not something an application can play with, if I remember right.

> I really think there is a very valid case for priorities of some sort to allow MPI and other latency-critical traffic to go in front of bulk IO traffic on the wire.

That would be very difficult to implement without making starvation scenarios trivial.

> Consider this test I was running the other day on Jaguar. The application writes 250M of data from every core with a plain write() system call [...] This makes for a lot of wasted time in applications that would like to use the buffering capabilities provided by the OS.

I strongly suspect OS jitter, probably related to FS activity, is a much more likely explanation for the above. If just one node has the process/rank suspended then it can't service the barrier; all will wait until it can.

Jitter gets a bad rap. Usually for good reason. However, in this case, it doesn't seem something to worry overly much about as it will cease. Your test says the 1st barrier after the write completes in 4.5 sec and the 2nd in 1.5 sec. That seems to imply the jitter is settling pretty rapidly. Jitter is really only bad when it is chronic.

To me, you are worrying way too much about the situation immediately after a write. Checkpoints are relatively rare, with long periods between. Why worry about something that's only going to affect a very small portion of the overall job? As long as the jitter dissipates in a short time, things will work out fine.

Maybe you could convince yourself of the efficacy of write-back caching in this scenario by altering the app to do an fsync() after the write phase on the node but before the barrier? If the app can get back to computing, even with the jitter-disrupted barrier, faster than it could by waiting for the outstanding dirty buffers to be flushed, then it's a net win to just live with the jitter, no?

--Lee
Hello!

On Apr 1, 2009, at 8:55 AM, Nic Henke wrote:

>> It came to my attention that the seastar network does not implement message priorities for various reasons. I really think there is a very valid case for priorities of some sort to allow MPI and other latency-critical traffic to go in front of bulk IO traffic on the wire.
> In the ptllnd, the bulk traffic is set up via short messages, so if the barrier is sent right after the write() returns, it really isn't backed up behind the bulk data.

Yes, it is. Lustre starts to send RPCs as soon as 1M (+16) pages of data per RPC become available for sending. So by the time the write() syscall for 250M returns, I already potentially have 16 (stripe count) * 4 (core count) * 8 (rpcs in flight) MB in flight from this particular node (since chances are the OSTs already accepted the transfers if there are free threads).

> This sounds much more like barrier jitter than backup. The network is capable of servicing the 250M in < .15s. It would be my guess that some of the writes() are taking longer than others and this is causing the barrier to be delayed.

No. I time each individual write separately. I know all writes start at the same time (there is a barrier before them), and I know that each write finishes in approx 0.5 sec as well.

> A few questions:
> - how many OSS/OSTs are you writing to?

Up to 16 * 4 from a single node.

> - can you post the MPI app you are using to do this?

Sure. Attached. (With example output.)

> The application folks @ ORNL should be able to help you use Craypat or Apprentice to get some runtime data on this app to find where the time is going. Until we have hard data, I don't think we can blame the network.

Interesting idea. Please note that if I run the code at a scale of 4, the barrier is instant. As I scale up the node count, the barrier time begins to rise.

In the output you can see I run the code twice in a row. This is done to make sure the grant is primed in case it was not, to take the entire amount of data into the cache (otherwise in some runs some individual writes take significant time to complete, invalidating the test). Another thing of note is that, since I did not want to take any chances, the working files are precreated externally so that no files share any ost for a single node, and the app itself just opens the files, not creates them.

Bye,
Oleg

Attachments (scrubbed by the list):
- writespeed-big.c: http://lists.lustre.org/pipermail/lustre-devel/attachments/20090401/adcae1f6/attachment-0003.obj
- writespeed_big.o555745: http://lists.lustre.org/pipermail/lustre-devel/attachments/20090401/adcae1f6/attachment-0004.obj
- writespeed_big.pbs: http://lists.lustre.org/pipermail/lustre-devel/attachments/20090401/adcae1f6/attachment-0005.obj
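Since the attachment itself was scrubbed, here is a minimal sketch of the kind of test being described: each rank writes ~250M to its own precreated file with a single write(), then times two back-to-back barriers. It is an approximation, not the actual writespeed-big.c; the output path is hypothetical, and the optional fsync() toggle is there only to make the comparison Lee suggested earlier easy to run.

    /* Rough approximation of the test in this thread; NOT the original
     * writespeed-big.c.  Each rank writes ~250 MB with one write() to a
     * precreated file, then times two back-to-back MPI_Barrier() calls.
     * Passing any argument inserts an fsync() before the first barrier. */
    #include <fcntl.h>
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define NBYTES (250UL * 1024 * 1024)

    int main(int argc, char **argv)
    {
        int rank, do_fsync = (argc > 1);
        char path[256];
        double t0, t_write, t_bar1, t_bar2;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        char *buf = malloc(NBYTES);
        if (!buf) MPI_Abort(MPI_COMM_WORLD, 1);
        memset(buf, 0xab, NBYTES);

        /* Files are precreated externally, so open() does no create. */
        snprintf(path, sizeof(path), "out/file.%d", rank);   /* hypothetical path */
        int fd = open(path, O_WRONLY);
        if (fd < 0) { perror("open"); MPI_Abort(MPI_COMM_WORLD, 1); }

        MPI_Barrier(MPI_COMM_WORLD);            /* synchronize the start of the write */

        t0 = MPI_Wtime();
        if (write(fd, buf, NBYTES) != (ssize_t)NBYTES)
            fprintf(stderr, "%d: short or failed write\n", rank);
        t_write = MPI_Wtime() - t0;

        if (do_fsync)
            fsync(fd);                          /* optional: flush before the barrier */

        t0 = MPI_Wtime();
        MPI_Barrier(MPI_COMM_WORLD);            /* 1st barrier, right after write() returns */
        t_bar1 = MPI_Wtime() - t0;

        t0 = MPI_Wtime();
        MPI_Barrier(MPI_COMM_WORLD);            /* 2nd barrier, nothing in between */
        t_bar2 = MPI_Wtime() - t0;

        printf("%d: write %.6f  barrier1 %.6f  barrier2 %.6f sec\n",
               rank, t_write, t_bar1, t_bar2);

        close(fd);
        free(buf);
        MPI_Finalize();
        return 0;
    }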
Hello!

On Apr 1, 2009, at 10:26 AM, Lee Ward wrote:

>> It came to my attention that the seastar network does not implement message priorities for various reasons.
> That is incorrect. The seastar network does implement at least one priority scheme, based on age. It's not something an application can play with, if I remember right.

Well, then it's as good as none for our purposes, I think?

> I strongly suspect OS jitter, probably related to FS activity, is a much more likely explanation for the above. If just one node has the process/rank suspended then it can't service the barrier; all will wait until it can.

That's of course right and possible too. Though given how nothing else is running on the nodes, I would think it is somewhat irrelevant, since there is nothing else to give resources to. The Lustre processing of the outgoing queue is pretty fast in itself at this phase. Do you think it would be useful if I just ran 1 thread per node, so there would be 3 empty cores to absorb all the jitter there might be?

> Jitter gets a bad rap. Usually for good reason. However, in this case, it doesn't seem something to worry overly much about as it will cease. Your test says the 1st barrier after the write completes in 4.5 sec and the 2nd in 1.5 sec. That seems to imply the jitter is settling pretty rapidly. Jitter is really only bad when it is chronic.

Well, 4.5 * 1200 = 1.5 hours of completely wasted cpu time for my specific job. So I thought it would be a good idea to get to the root of it. We hear many arguments here at the lab that "what good is the buffered io for me when my app performance is degraded if I don't do sync? I'll just do the sync and be over with it". Of course I believe there is still benefit to not doing the sync, but that's just me.

> To me, you are worrying way too much about the situation immediately after a write. Checkpoints are relatively rare, with long periods between. Why worry about something that's only going to affect a very small portion of the overall job? As long as the jitter dissipates in a short time, things will work out fine.

I worry about it specifically because users tend to do sync after the write, and that wastes a lot of time. So as a result I want as much of the data as possible to enter the cache and then trickle out all by itself, and I want users not to see any bad effects (or otherwise to show them that there are still benefits).

> Maybe you could convince yourself of the efficacy of write-back caching in this scenario by altering the app to do an fsync() after the write phase on the node but before the barrier? If the app can get back to computing, even with the jitter-disrupted barrier, faster than it could by waiting for the outstanding dirty buffers to be flushed, then it's a net win to just live with the jitter, no?

I do not need to convince myself. It's the app programmers that are fixated on "oh, look, my program is slower after the write if I do not do sync, I must do sync!"

Bye,
Oleg
On Wed, 2009-04-01 at 09:14 -0600, Oleg Drokin wrote:

> Hello!
>
> On Apr 1, 2009, at 10:26 AM, Lee Ward wrote:
>>> It came to my attention that the seastar network does not implement message priorities for various reasons.
>> That is incorrect. The seastar network does implement at least one priority scheme, based on age. It's not something an application can play with, if I remember right.
>
> Well, then it's as good as none for our purposes, I think?

Other than that traffic moves (only very roughly) in a fair manner and that packets from different nodes can arrive out of order, I guess.

I think my point was that there is already a priority scheme in the Seastar. Are there additional bits related to priority that you might use, also?

>> I strongly suspect OS jitter, probably related to FS activity, is a much more likely explanation for the above. If just one node has the process/rank suspended then it can't service the barrier; all will wait until it can.
>
> That's of course right and possible too. Though given how nothing else is running on the nodes, I would think it is somewhat irrelevant, since there is nothing else to give resources to.

How and where memory is used on two nodes is different. How, where, and when scheduling occurs on two nodes is different. Any two nodes, even running the same app with barrier synchronization, perform things at different times outside of the barriers; they very quickly desynchronize in the presence of jitter.

> The Lustre processing of the outgoing queue is pretty fast in itself at this phase. Do you think it would be useful if I just ran 1 thread per node, so there would be 3 empty cores to absorb all the jitter there might be?

You will still get jitter. I would hope less, though, so it wouldn't hurt to try to leave at least one idle core. We've toyed with the idea of leaving a core idle for IO and other background processing in the past. The idea was a non-starter with our apps folks, though. Maybe the ORNL folks will feel differently?

>> Jitter gets a bad rap. Usually for good reason. However, in this case, it doesn't seem something to worry overly much about as it will cease. Your test says the 1st barrier after the write completes in 4.5 sec and the 2nd in 1.5 sec. That seems to imply the jitter is settling pretty rapidly. Jitter is really only bad when it is chronic.
>
> Well, 4.5 * 1200 = 1.5 hours of completely wasted cpu time for my specific job.

That 1200 is the number of checkpoints? If so, I agree. If it's the number of nodes, I do not.

> So I thought it would be a good idea to get to the root of it. We hear many arguments here at the lab that "what good is the buffered io for me when my app performance is degraded if I don't do sync? I'll just do the sync and be over with it". Of course I believe there is still benefit to not doing the sync, but that's just me.

If the time to settle the jitter is on the order of 10 seconds but it takes 15 seconds to sync, it would be better to live with the jitter, no? I suggested an experiment to make this comparison. Why argue with them? Just do the experiment and you can know which strategy is better.

>> To me, you are worrying way too much about the situation immediately after a write. Checkpoints are relatively rare, with long periods between. Why worry about something that's only going to affect a very small portion of the overall job? As long as the jitter dissipates in a short time, things will work out fine.
>
> I worry about it specifically because users tend to do sync after the write, and that wastes a lot of time. So as a result I want as much of the data as possible to enter the cache and then trickle out all by itself, and I want users not to see any bad effects (or otherwise to show them that there are still benefits).

Users tend to do sync for more reasons than making the IO deterministic. They should be doing it so that they can have some faith that the last checkpoint is actually persistent when interrupted.

However, they should do the sync right before they enter the IO phase, in order to also get the benefits of write-back caching. Not after the IO phase. In the event of an interrupt, this forces them to throw away an in-progress checkpoint and the last one before that, to be safe, but the one before the last should be good.

The apps could also be more reasonable about their checkpoints, I've noticed. Often, for us anyway, the machine just behaves. If the app began by assuming the machine was unreliable, then as it ran for longer and longer periods it could (I argue should) allow the period between checkpoints to grow. If the idea is to make progress, as I'm told, then on a well-behaved machine far fewer checkpoints are required. Most apps, though, just use a fixed period and waste a lot of time doing their checkpoints when the machine is being nice to them.

>> Maybe you could convince yourself of the efficacy of write-back caching in this scenario by altering the app to do an fsync() after the write phase on the node but before the barrier? If the app can get back to computing, even with the jitter-disrupted barrier, faster than it could by waiting for the outstanding dirty buffers to be flushed, then it's a net win to just live with the jitter, no?
>
> I do not need to convince myself. It's the app programmers that are fixated on "oh, look, my program is slower after the write if I do not do sync, I must do sync!"

Try the experiment. Show them the data. They are, in theory, reasoning people, right?

In some cases, your app programmers will be unfortunately correct. An app that uses so much memory that the system cannot buffer the entire write will incur at least some issues while doing IO; some of the IO must move synchronously, and that amount will differ from node to node. This will have the effect of magnifying this post-IO jitter they are so worried about. It is also why I wrote in the original requirements for Lustre that if write-back caching is employed there must be a way to turn it off.

If they aren't sizing their app for the node's physical memory, though, I would think that the experiment should show that write-back caching is a win.

--Lee
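A minimal sketch of the growing-checkpoint-interval idea Lee describes above. The doubling policy and the constants are illustrative choices, not anything prescribed in the thread.

    /* Illustrative only: grow the interval between checkpoints as the run
     * survives longer, on the theory that a machine that has behaved for a
     * while will probably keep behaving.  Doubling with a cap is one simple
     * policy; the constants are arbitrary. */
    #include <stdio.h>

    #define MIN_STEPS_BETWEEN_CKPT   100
    #define MAX_STEPS_BETWEEN_CKPT  6400
    #define TOTAL_STEPS            20000

    static void compute_step(int step)  { (void)step; /* application work */ }
    static void write_checkpoint(int s) { printf("checkpoint at step %d\n", s); }

    int main(void)
    {
        int interval = MIN_STEPS_BETWEEN_CKPT;   /* start pessimistic */
        int next_ckpt = interval;

        for (int step = 0; step < TOTAL_STEPS; step++) {
            compute_step(step);
            if (step == next_ckpt) {
                write_checkpoint(step);
                /* The machine has behaved this long; risk a longer interval. */
                if (interval < MAX_STEPS_BETWEEN_CKPT)
                    interval *= 2;
                next_ckpt = step + interval;
            }
        }
        return 0;
    }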
Lee,

I completely agree with your comments on measurement. I'd really, really like to see some.

Cheers,
Eric

> -----Original Message-----
> From: lustre-devel-bounces at lists.lustre.org [mailto:lustre-devel-bounces at lists.lustre.org] On Behalf Of Lee Ward
> Sent: 01 April 2009 4:58 PM
> To: Oleg Drokin
> Cc: Lustre Development Mailing List
> Subject: Re: [Lustre-devel] SeaStar message priority
> [...]
Hello!

On Apr 1, 2009, at 11:58 AM, Lee Ward wrote:

> I think my point was that there is already a priority scheme in the Seastar. Are there additional bits related to priority that you might use, also?

But if we cannot use it, there is none. Like we want mpi rpcs to go out first, to some degree.

>>> I strongly suspect OS jitter, probably related to FS activity, is a much more likely explanation for the above. If just one node has the process/rank suspended then it can't service the barrier; all will wait until it can.
>> That's of course right and possible too. Though given how nothing else is running on the nodes, I would think it is somewhat irrelevant, since there is nothing else to give resources to.
> How and where memory is used on two nodes is different. How, where,

That's irrelevant.

> and when scheduling occurs on two nodes is different. Any two nodes, even running the same app with barrier synchronization, perform things at different times outside of the barriers; they very quickly desynchronize in the presence of jitter.

But since the only thing I have in my app inside the barriers is the write call, there is not much way to desynchronize.

>> The Lustre processing of the outgoing queue is pretty fast in itself at this phase. Do you think it would be useful if I just ran 1 thread per node, so there would be 3 empty cores to absorb all the jitter there might be?
> You will still get jitter. I would hope less, though, so it wouldn't hurt to try to leave at least one idle core. We've toyed with the idea of leaving a core idle for IO and other background processing in the past. The idea was a non-starter with our apps folks, though. Maybe the ORNL folks will feel differently?

No, I do not think they would like the idea of forfeiting 1/4 of their CPU just so io is better, if the jitter is due to the cpu being occupied with io and apps stalled due to this (though I have a hard time believing an app would not be given a cpu for 4.5 seconds, even though there are potentially 4 idle cpus, or even 3; remember the other cores are also idle waiting on a barrier).

>>> Jitter gets a bad rap. Usually for good reason. However, in this case, it doesn't seem something to worry overly much about as it will cease. Your test says the 1st barrier after the write completes in 4.5 sec and the 2nd in 1.5 sec. That seems to imply the jitter is settling pretty rapidly. Jitter is really only bad when it is chronic.
>> Well, 4.5 * 1200 = 1.5 hours of completely wasted cpu time for my specific job.
> That 1200 is the number of checkpoints? If so, I agree. If it's the number of nodes, I do not.

1200 is the number of cores waiting on a barrier. Every core spends 4.5 seconds, so the total wasted single-cpu core time is 1.5 hours. And the more often this happens, the worse.

>> So I thought it would be a good idea to get to the root of it. We hear many arguments here at the lab that "what good is the buffered io for me when my app performance is degraded if I don't do sync? I'll just do the sync and be over with it". Of course I believe there is still benefit to not doing the sync, but that's just me.
> If the time to settle the jitter is on the order of 10 seconds but it takes 15 seconds to sync, it would be better to live with the jitter, no? I suggested an experiment to make this comparison. Why argue with them? Just do the experiment and you can know which strategy is better.

I know which one is better. I did the experiment. (Though I have no realistic way to measure when the "jitter" settles out.)

>>> To me, you are worrying way too much about the situation immediately after a write. Checkpoints are relatively rare, with long periods between. Why worry about something that's only going to affect a very small portion of the overall job? As long as the jitter dissipates in a short time, things will work out fine.
>> I worry about it specifically because users tend to do sync after the write, and that wastes a lot of time. So as a result I want as much of the data as possible to enter the cache and then trickle out all by itself, and I want users not to see any bad effects (or otherwise to show them that there are still benefits).
> Users tend to do sync for more reasons than making the IO deterministic. They should be doing it so that they can have some faith that the last checkpoint is actually persistent when interrupted.

For that they only need to do fsync before their next checkpoint, to make sure that the previous one completed.

> However, they should do the sync right before they enter the IO phase, in order to also get the benefits of write-back caching. Not after the IO phase. In the event of an interrupt, this forces them to throw away an in-progress checkpoint and the last one before that, to be safe, but the one before the last should be good.

Right. Yet they do some microbenchmark and decide it is a bad idea. Besides, reducing jitter, or whatever is the cause of the delays, would still be useful.

> In some cases, your app programmers will be unfortunately correct. An app that uses so much memory that the system cannot buffer the entire write will incur at least some issues while doing IO; some of the IO must move synchronously, and that amount will differ from node to node. This will have the effect of magnifying this post-IO jitter they are so worried about. It is also why I wrote in the original requirements for

Why would it? There still is potentially a benefit for the available cache size.

> Lustre that if write-back caching is employed there must be a way to turn it off.

There are around 3 ways to do that that I am aware of.

Bye,
Oleg
On Wed, 2009-04-01 at 10:35 -0600, Oleg Drokin wrote:

> Hello!
>
> On Apr 1, 2009, at 11:58 AM, Lee Ward wrote:
>> I think my point was that there is already a priority scheme in the Seastar. Are there additional bits related to priority that you might use, also?
>
> But if we cannot use it, there is none. Like we want mpi rpcs to go out first, to some degree.

If you don't want to follow up, I'm ok with that. It's up to you.

I understood what you want. There are at least two things I can imagine that would better the situation without trying to leverage something in the network itself:

1) Partition the adapter CAM so that there is always room to accommodate a user-space receive.
2) Prioritize injection to favor sends originating from user-space.

One or both of these might already be implemented. I don't know.

>>>> I strongly suspect OS jitter, probably related to FS activity, is a much more likely explanation for the above. If just one node has the process/rank suspended then it can't service the barrier; all will wait until it can.
>>> That's of course right and possible too. Though given how nothing else is running on the nodes, I would think it is somewhat irrelevant, since there is nothing else to give resources to.
>> How and where memory is used on two nodes is different. How, where,
>
> That's irrelevant.
>
>> and when scheduling occurs on two nodes is different. Any two nodes, even running the same app with barrier synchronization, perform things at different times outside of the barriers; they very quickly desynchronize in the presence of jitter.
>
> But since the only thing I have in my app inside the barriers is the write call, there is not much way to desynchronize.

Modify your test to report the length of time each node spent in the barrier (not just rank 0, as it is written now) immediately after the write call, then? If you are correct, they will all be roughly the same. If they have desynchronized, most will have very long wait times but at least one will be relatively short.

>>> The Lustre processing of the outgoing queue is pretty fast in itself at this phase. Do you think it would be useful if I just ran 1 thread per node, so there would be 3 empty cores to absorb all the jitter there might be?
>> You will still get jitter. I would hope less, though, so it wouldn't hurt to try to leave at least one idle core. We've toyed with the idea of leaving a core idle for IO and other background processing in the past. The idea was a non-starter with our apps folks, though. Maybe the ORNL folks will feel differently?
>
> No, I do not think they would like the idea of forfeiting 1/4 of their CPU just so io is better, if the jitter is due to the cpu being occupied with io and apps stalled due to this (though I have a hard time believing an app would not be given a cpu for 4.5 seconds, even though there are potentially 4 idle cpus, or even 3; remember the other cores are also idle waiting on a barrier).

Oh, I'm sure they're getting the CPU. They just won't come out of the barrier until all have processed the operation. The rates at which the nodes reach the barrier will be different. The rates at which they proceed through will be different. The only invariant after a barrier is that all the involved ranks *have* reached that point. Nothing about when that happened is stated or implied.

>>>> Jitter gets a bad rap. Usually for good reason. However, in this case, it doesn't seem something to worry overly much about as it will cease. Your test says the 1st barrier after the write completes in 4.5 sec and the 2nd in 1.5 sec. That seems to imply the jitter is settling pretty rapidly. Jitter is really only bad when it is chronic.
>>> Well, 4.5 * 1200 = 1.5 hours of completely wasted cpu time for my specific job.
>> That 1200 is the number of checkpoints? If so, I agree. If it's the number of nodes, I do not.
>
> 1200 is the number of cores waiting on a barrier. Every core spends 4.5 seconds, so the total wasted single-cpu core time is 1.5 hours.

It doesn't work that way. The barrier operation is implemented as a collective on the Cray. What you are missing in the math above is that every core waited during the *same* 4.5 second period. Total wasted time is only 4.5 seconds then.

> And the more often this happens, the worse.
>
>>> So I thought it would be a good idea to get to the root of it. We hear many arguments here at the lab that "what good is the buffered io for me when my app performance is degraded if I don't do sync? I'll just do the sync and be over with it". Of course I believe there is still benefit to not doing the sync, but that's just me.
>> If the time to settle the jitter is on the order of 10 seconds but it takes 15 seconds to sync, it would be better to live with the jitter, no? I suggested an experiment to make this comparison. Why argue with them? Just do the experiment and you can know which strategy is better.
>
> I know which one is better. I did the experiment. (Though I have no realistic way to measure when the "jitter" settles out.)

Which was better then? By how much? Were you just measuring a barrier, or do those numbers still work out when the app uses the network heavily after doing its writes?

>>>> To me, you are worrying way too much about the situation immediately after a write. Checkpoints are relatively rare, with long periods between. Why worry about something that's only going to affect a very small portion of the overall job? As long as the jitter dissipates in a short time, things will work out fine.
>>> I worry about it specifically because users tend to do sync after the write, and that wastes a lot of time. So as a result I want as much of the data as possible to enter the cache and then trickle out all by itself, and I want users not to see any bad effects (or otherwise to show them that there are still benefits).
>> Users tend to do sync for more reasons than making the IO deterministic. They should be doing it so that they can have some faith that the last checkpoint is actually persistent when interrupted.
>
> For that they only need to do fsync before their next checkpoint, to make sure that the previous one completed.
>
>> However, they should do the sync right before they enter the IO phase, in order to also get the benefits of write-back caching. Not after the IO phase. In the event of an interrupt, this forces them to throw away an in-progress checkpoint and the last one before that, to be safe, but the one before the last should be good.
>
> Right. Yet they do some microbenchmark and decide it is a bad idea. Besides, reducing jitter, or whatever is the cause of the delays, would still be useful.

You're making a wonderful argument for Catamount :)

>> In some cases, your app programmers will be unfortunately correct. An app that uses so much memory that the system cannot buffer the entire write will incur at least some issues while doing IO; some of the IO must move synchronously, and that amount will differ from node to node. This will have the effect of magnifying this post-IO jitter they are so worried about. It is also why I wrote in the original requirements for
>
> Why would it? There still is potentially a benefit for the available cache size.

In a fitted application, there is no useful amount of memory left over for the cache. Using it, then, is just unnecessary overhead.

As I said, there's a very real possibility your app programmers are correct. It goes beyond memory. Any resource under intense pressure due to contention offers the possibility that it can take longer to perform its requests independently than to serialize them. For instance, if an app does not use all of memory then there is plenty of room for Lustre to cache. Since these apps presumably are going to communicate after the IO phase (why else the barrier after the IO?), they will contend heavily with the Lustre client for the network interface, and that interface does not deal well with such a situation on the Cray. I can easily believe it would take longer for the app to get back to computing because of the asynchronous network traffic from the write-back than it would to just force the IO phase to complete, via fsync, and, after, do what it needs to do to get back to work. If, instead, an app does use all of the memory, then it's blocked for a long time in the IO calls waiting for a free buffer, before the sync. If/when that happens, the fsync is nearly a no-op as most of the dirty data have already been written. Were I an app programmer, I could easily come to the conclusion that the fsync is either useful or does not hurt.

The only cooperative app I can think of that seems to be able to win universally is the one structured to:

    for (;;) {
        barrier
        fsync
        checkpoint
        for (n = 0; n < TIME_STEPS_TWEEN_CHECKPOINT; n++) {
            compute
            communicate
        }
    }

I don't know any that work that way though :(

>> Lustre that if write-back caching is employed there must be a way to turn it off.
>
> There are around 3 ways to do that that I am aware of.

That's nice. It was a requirement, after all. ;)

--Lee
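A runnable rendering of the loop Lee sketches above, assuming MPI and plain POSIX I/O: the fsync() covers the previous iteration's checkpoint, so its write-back overlaps the compute phase. The checkpoint contents, file naming and iteration counts are placeholders.

    /* Sync the *previous* checkpoint right before writing the next one,
     * so the write-back of the last checkpoint overlaps the compute
     * phase.  The compute/communicate bodies and the file naming are
     * stand-ins, not anything from the original thread. */
    #include <fcntl.h>
    #include <mpi.h>
    #include <stdio.h>
    #include <unistd.h>

    #define TIME_STEPS_TWEEN_CHECKPOINT 1000

    static void compute(void)     { /* application work    */ }
    static void communicate(void) { /* halo exchange, etc. */ }

    int main(int argc, char **argv)
    {
        int rank, fd = -1;
        char path[64];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        for (int gen = 0; gen < 10; gen++) {         /* bounded run for the sketch */
            MPI_Barrier(MPI_COMM_WORLD);             /* barrier */
            if (fd >= 0) {                           /* fsync the previous checkpoint */
                fsync(fd);
                close(fd);
            }
            snprintf(path, sizeof(path), "ckpt.%d.%d", gen, rank);
            fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
            if (fd >= 0) {
                double state = gen;                  /* stand-in checkpoint data */
                if (write(fd, &state, sizeof(state)) != (ssize_t)sizeof(state))
                    perror("write");                 /* checkpoint, left in cache */
            }
            for (int n = 0; n < TIME_STEPS_TWEEN_CHECKPOINT; n++) {
                compute();
                communicate();
            }
        }
        if (fd >= 0) { fsync(fd); close(fd); }       /* make the final one persistent */
        MPI_Finalize();
        return 0;
    }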
Oleg Drokin wrote:
> Hello!
>
> On Apr 1, 2009, at 11:58 AM, Lee Ward wrote:
>> I think my point was that there is already a priority scheme in the Seastar. Are there additional bits related to priority that you might use, also?
>
> But if we cannot use it, there is none. Like we want mpi rpcs to go out first, to some degree.

If we have to deal with ordering, we are already sunk. The Lustre RPCs will go out and affect MPI latency to some degree, introducing jitter into the calls and affecting application performance.

> But since the only thing I have in my app inside the barriers is the write call, there is not much way to desynchronize.

Incorrect. You are running your app on all 4 CPUs on the node at the same time Lustre is sending RPCs. The kernel threads will get scheduled and run, pushing your app to the side and desynchronizing the barrier for the app as a whole.

> No, I do not think they would like the idea of forfeiting 1/4 of their CPU just so io is better, if the jitter is due to the cpu being occupied with io and apps stalled due to this (though I have a hard time believing an app would not be given a cpu for 4.5 seconds, even though there are potentially 4 idle cpus, or even 3; remember the other cores are also idle waiting on a barrier).

This gets easier to swallow in the future with 12-core and larger nodes; 1/12 is much easier to sacrifice.

What we really need to "prove" is where the delay is occurring. The MPI_Barrier messages are 0-byte sends, effectively turning them into Portals headers, and these are sent and processed very fast. In fact, the total amount of data being sent is _much_ less than the NIC is capable of. A rough estimate for 2 nodes talking to each other is 1700 MB/s and 50K lnet pings/s.

One thing to try is changing your aprun to use fewer CPUs per node: aprun -n 1200 -N [1,2,3] -cc 1-3. The -cc 1-3 will keep it off cpu 0, a known location for some IRQs and other servicing. You should also try to capture compute-node stats like cpu usage, # of threads active during the barrier, etc. to help narrow down where the time is going.

Nic
Hello!

On Apr 1, 2009, at 3:15 PM, Nicholas Henke wrote:

>> But since the only thing I have in my app inside the barriers is the write call, there is not much way to desynchronize.
> Incorrect. You are running your app on all 4 CPUs on the node at the same time Lustre is sending RPCs. The kernel threads will get scheduled and run, pushing your app to the side and desynchronizing the barrier for the app as a whole.

But I am measuring each write and I see that none of them significantly exceeds 0.5 seconds. Let it be 0.1 seconds difference. So then 4.5 seconds - 0.1 seconds for the write speed difference = 4.4 seconds.

> What we really need to "prove" is where the delay is occurring. The MPI_Barrier messages are 0-byte sends, effectively turning them into Portals headers, and these are sent and processed very fast. In fact, the total amount of data being sent is _much_ less than the NIC is capable of. A rough estimate for 2 nodes talking to each other is 1700 MB/s and 50K lnet pings/s.

Yes. I understand this point.

> One thing to try is changing your aprun to use fewer CPUs per node: aprun -n 1200 -N [1,2,3] -cc 1-3.

I just ran with 1 cpu per node, 1200 threads, leaving 3 cpus/node for the kernel and whatnot. The actual write syscall return time decreased, but the barrier time did not, even though we know that less data is in flight at any given time now (due to only 16 osts accessed per node, not 16*4). So something is going on, but I do not think we can blindly attribute it to just "ah, the kernel ate your cpu for important things, pushing the data".

0: barrier after write time: 4.528383 sec
0: barrier after write 2 time: 4.043252 sec

The pre-write barrier took only 0.096675 sec (to rule out general network congestion).

Bye,
Oleg
Hello!

On Apr 1, 2009, at 3:13 PM, Lee Ward wrote:

>> But if we cannot use it, there is none. Like we want mpi rpcs to go out first, to some degree.
> If you don't want to follow up, I'm ok with that. It's up to you. I understood what you want. There are at least two things I can imagine that would better the situation without trying to leverage something in the network itself:
> 1) Partition the adapter CAM so that there is always room to accommodate a user-space receive.

I cannot really comment on this option.

> 2) Prioritize injection to favor sends originating from user-space.

This is what I am speaking about, actually, perhaps not being able to explain myself clearly. Except perhaps just "user-space" is too generic, and a bit more fine-grained controls would be more beneficial.

> One or both of these might already be implemented. I don't know.

The second option does not look like it is implemented.

>>> and when scheduling occurs on two nodes is different. Any two nodes, even running the same app with barrier synchronization, perform things at different times outside of the barriers; they very quickly desynchronize in the presence of jitter.
>> But since the only thing I have in my app inside the barriers is the write call, there is not much way to desynchronize.
> Modify your test to report the length of time each node spent in the barrier (not just rank 0, as it is written now) immediately after the write call, then? If you are correct, they will all be roughly the same. If they have desynchronized, most will have very long wait times but at least one will be relatively short.

That's a fair point. I just scheduled the run.

> Oh, I'm sure they're getting the CPU. They just won't come out of the barrier until all have processed the operation. The rates at which the nodes reach the barrier will be different. The rates at which they

I believe the rates at which they come to the barrier are approximately the same. I do time the write system call, and the barrier is next to it. And the write system call has relatively small variability in time, so we can assume that all barriers start within 0.1 seconds of each other.

> proceed through will be different. The only invariant after a barrier is that all the involved ranks *have* reached that point. Nothing about when that happened is stated or implied.

Ok, I did not realize that, though that makes sense. I believe in my test the problem is on the sending side, i.e. the bottleneck does not let all nodes report that the point was reached by every thread. But as soon as all nodes have gathered, whatever control node sends the messages (which of course could be delayed in the queue if it is also doing the io; hm, I wonder what node coordinates this (set of nodes?), rank 0?), and once injected they should be processed instantly, since we do not have any significant incoming traffic on the nodes. Don't take my word for it, of course; the test is already scheduled and I'll share the results.

>>>>> Jitter gets a bad rap. Usually for good reason. However, in this case, it doesn't seem something to worry overly much about as it will cease. Your test says the 1st barrier after the write completes in 4.5 sec and the 2nd in 1.5 sec. That seems to imply the jitter is settling pretty rapidly. Jitter is really only bad when it is chronic.
>>>> Well, 4.5 * 1200 = 1.5 hours of completely wasted cpu time for my specific job.
>>> That 1200 is the number of checkpoints? If so, I agree. If it's the number of nodes, I do not.
>> 1200 is the number of cores waiting on a barrier. Every core spends 4.5 seconds, so the total wasted single-cpu core time is 1.5 hours.
> It doesn't work that way. The barrier operation is implemented as a collective on the Cray. What you are missing in the math above is that every core waited during the *same* 4.5 second period. Total wasted time is only 4.5 seconds then.

I have a feeling we are speaking about different subjects here. You are speaking about wall-clock time. I am speaking about total cpu-cycles wasted across all nodes.

>>> If the time to settle the jitter is on the order of 10 seconds but it takes 15 seconds to sync, it would be better to live with the jitter, no? I suggested an experiment to make this comparison. Why argue with them? Just do the experiment and you can know which strategy is better.
>> I know which one is better. I did the experiment. (Though I have no realistic way to measure when the "jitter" settles out.)
> Which was better then? By how much? Were you just measuring a barrier, or do those numbers still work out when the app uses the network heavily after doing its writes?

Unfortunately I do not have any real applications instrumented, so my barrier is a substitute for "network-heavy app activity". I started with it because the app programmers I spoke with complained about how their network latency is affected if they do buffered writes. The fsync takes upward of 10 seconds, depending on other load in the system, I guess. I have no easy way to measure the jitter. I do not think the writeout with or without fsync would take significantly different time, because the underlying io paths don't change, but that's non-scientific. Unfortunately, just doing a write, timing the fsync, then doing the write again, waiting the same amount of time the fsync took, and then doing another fsync to see if it returns instantly would not be conclusive, since Lustre only eagerly writes 1M chunks, and vm pressure only ensures data older than 30 seconds is pushed out. And that is before taking into account the possible variability of the load on the OSTs over time (since I cannot have the entire Jaguar all to myself).

>>> However, they should do the sync right before they enter the IO phase, in order to also get the benefits of write-back caching. Not after the IO phase. In the event of an interrupt, this forces them to throw away an in-progress checkpoint and the last one before that, to be safe, but the one before the last should be good.
>>
>> Right. Yet they do some microbenchmark and decide it is a bad idea. Besides, reducing jitter, or whatever is the cause of the delays, would still be useful.
> You're making a wonderful argument for Catamount :)

Actually, catamount definitely has its strong points, but there are drawbacks as well. With Linux it's just another set of benefits and drawbacks.

>>> In some cases, your app programmers will be unfortunately correct. An app that uses so much memory that the system cannot buffer the entire write will incur at least some issues while doing IO; some of the IO must move synchronously, and that amount will differ from node to node. This will have the effect of magnifying this post-IO jitter they are so worried about. It is also why I wrote in the original requirements for
>> Why would it? There still is potentially a benefit for the available cache size.
> In a fitted application, there is no useful amount of memory left over for the cache. Using it, then, is just unnecessary overhead.

Right. In this case it is even better to do non-caching io (directio style) to reduce the memory copy overhead as well.

> As I said, there's a very real possibility your app programmers are correct. It goes beyond memory. Any resource under intense pressure due to contention offers the possibility that it can take longer to perform its requests independently than to serialize them. For instance, if an app does not use all of memory then there is plenty of room for Lustre to cache. Since these apps presumably are going to communicate after the IO phase (why else the barrier after the IO?), they will contend heavily with the Lustre client for the network interface, and that interface does not deal well with such a situation on the Cray. I can easily believe it would take longer for the app to get back to computing because of the asynchronous network traffic from the write-back than it would to just force the IO phase to complete, via fsync, and, after, do what it needs to do to get back to work. If, instead, an app does use all of the memory, then it's blocked for a long time in the IO calls waiting for a free buffer, before the sync. If/when that happens, the fsync is nearly a no-op as most of the dirty data have already been written.

This is all very true. Currently I am only focusing on applications that do leave enough space for the fs cache, since that's where the possible benefit is and there is no drawback for applications that don't do cached io. (And this is the case for the app programmers I spoke with.)

> The only cooperative app I can think of that seems to be able to win universally is the one structured to:
>
>     for (;;) {
>         barrier
>         fsync
>         checkpoint
>         for (n = 0; n < TIME_STEPS_TWEEN_CHECKPOINT; n++) {
>             compute
>             communicate
>         }
>     }
>
> I don't know any that work that way though :(

We here at ORNL are trying hard to convince app programmers that this is indeed beneficial. Unfortunately it is not all that clear-cut; the machine itself behaving differently every time, due to all the different workloads going on, is part of the problem too. Of course we stand in our own way as well: with the default 32Mb limit of dirty cache per osc, in order to get a meaningful cache size we need to stripe the files over way too many OSTs, and as a result the overall IO performance is degraded compared to just outputting to a single OST from every job, due to the reduced randomness of that IO pattern.

Bye,
Oleg
Hello!

On Apr 1, 2009, at 4:17 PM, Oleg Drokin wrote:

>>>> and when scheduling occurs on two nodes is different. Any two nodes, even running the same app with barrier synchronization, perform things at different times outside of the barriers; they very quickly desynchronize in the presence of jitter.
>>> But since the only thing I have in my app inside the barriers is the write call, there is not much way to desynchronize.
>> Modify your test to report the length of time each node spent in the barrier (not just rank 0, as it is written now) immediately after the write call, then? If you are correct, they will all be roughly the same. If they have desynchronized, most will have very long wait times but at least one will be relatively short.
> That's a fair point. I just scheduled the run.

Ok. The results are in. I scheduled 2 runs: one at 4 threads/node and one at 1 thread/node.

For the 4 threads/node case, the 1st barrier took anywhere from 1.497 sec to 3.025 sec, with rank 0 reporting 1.627 sec. The second barrier took 0.916 to 2.758 seconds, with rank 0 reporting 1.992 sec. For barrier 2 I can actually clearly observe that threads terminate in groups of 4 with very close times, and the ranks suggest those nids are on the same nodes. On the 1st barrier this trend is much less visible, though.

In the 1 thread/node case, the fastest 1st barrier was 7.515 seconds and the slowest was 10.176. For the 2nd barrier, the fastest was 0.085 and the slowest 2.756, which is pretty close to the difference between the fastest and slowest 1st barrier. Since the amount of data written per node in this case is 4x smaller, I guess we just flushed all the data to the disk before the 1st barrier finished, and the difference in waiting was due to the differences in start times.

As you can see, the numbers tend to jump around, but there are still relatively big delays due to something other than just threads getting out of sync.

Bye,
Oleg
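For reference, a minimal sketch of the per-rank barrier timing behind the numbers above: every rank times its own MPI_Barrier() and the extremes are reduced to rank 0. The reporting format is illustrative; the actual test may simply print one line per rank.

    /* Each rank times its own barrier; min/max across ranks are reduced
     * to rank 0 after the fact.  Names and output format are illustrative. */
    #include <mpi.h>
    #include <stdio.h>

    static void timed_barrier(const char *label)
    {
        double t0 = MPI_Wtime();
        MPI_Barrier(MPI_COMM_WORLD);
        double dt = MPI_Wtime() - t0;

        double min_dt, max_dt;
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Reduce(&dt, &min_dt, 1, MPI_DOUBLE, MPI_MIN, 0, MPI_COMM_WORLD);
        MPI_Reduce(&dt, &max_dt, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("%s: fastest %.3f sec, slowest %.3f sec\n", label, min_dt, max_dt);
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        /* ... the write phase would go here ... */
        timed_barrier("barrier after write");
        timed_barrier("barrier after write 2");
        MPI_Finalize();
        return 0;
    }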
On Wed, 2009-04-01 at 20:46 -0600, Oleg Drokin wrote:

> [...]
>
> As you can see, the numbers tend to jump around, but there are still relatively big delays due to something other than just threads getting out of sync.

Agreed. It's something more than simple jitter.

From everything you have described, the nodes are otherwise idle. The only other thing I can think of, then, would be one or more Lustre client threads injecting traffic into the network, which is where you started.

A useful test might be to grab the MPI ping-pong from the test suite and modify it to slow it down a bit. Say 4 times a second? Augment it to report the ping-pong time and a time stamp. Augment your existing test to report time stamps for the beginning of the write call. Launch one, each, of these on your set of nodes; i.e., each node has both your write test and the ping-pong running at the same time. This presumes you can launch two mpi jobs onto your set of nodes. If not, come up with an equivalent that is supported? If the ping-pong latency goes way up at the write calls you can claim a correlation. Not definitive, as correlation does not equal cause, but it is pretty strong.

If there is correlation, it means Cray has kind of messed up the portals implementation. The portals implementation would be attempting to send *everything* in order. All portals needs is for traffic to go in order per nid and pid pair. An implementation is free to mix in unrelated traffic, and should, to prevent one process from starving others.

An idea... Does the Lustre service side restrict the number of simultaneous get operations it issues? I don't just mean to a particular client, but to all from a single server, be it OST or MDS. If not, consider it. If there are too many outstanding receives, an arriving message may miss the corresponding CAM entry due to a flush. What happens after that can't be pretty. At one time, it caused the client to resend. Does it still? If so, and resends are occurring, the affected clients have their bandwidth reduced by more than 50% for the affected operations. Since there is a barrier operation stuck behind it, well...

Mr. Booth has suggested that the portals client might offer to send less data per transfer. This would allow latency-sensitive sends to reach the front of the queue more quickly. It would also, I think, lower overall throughput. It's an idea worth considering, but it is a case of two evils. Can this be mitigated by peeking at the portals send queue in some way? If Lustre can identify outbound traffic in the queue that it didn't present, then it could respond as Mr. Booth has suggested, or back off on the rate at which it presents traffic, or both even? Initial latencies would be unchanged but would get better as the app did more communication, especially if it used the one-sided calls and overlapped them.

I'm sorry, if it's contention for the adapter I don't see a workaround without changing Lustre, or Cray changing the driver to more fairly service the independent streams. In any case, right now, your apps guys' suspicions probably have merit if it is indeed contention on the network adapter. They may really be better off forcing the IO to complete before moving to the next phase if that phase involves the network. How sad.

You do need to do the test, though, before you try to "fix" anything. Right now, it's only supposition that contention for the network adapter is the evil here.

--Lee
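A minimal sketch of the slowed-down ping-pong probe Lee suggests, assuming it runs as a second MPI job alongside the write test and that wall-clock timestamps from both jobs are correlated afterwards. The pairing of ranks, the probe rate and the run length are illustrative choices.

    /* Paired ranks (0<->1, 2<->3, ...) exchange a 1-byte message roughly
     * 4 times a second; the even rank prints the round-trip time with a
     * wall-clock timestamp so spikes can be lined up against the write
     * test's timestamps.  All constants here are illustrative. */
    #include <mpi.h>
    #include <stdio.h>
    #include <sys/time.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        char byte = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int peer = (rank % 2 == 0) ? rank + 1 : rank - 1;
        if (peer >= size) { MPI_Finalize(); return 0; }  /* odd rank count: last rank idles */

        for (int i = 0; i < 2400; i++) {                 /* roughly 10 minutes at 4 probes/sec */
            double t0 = MPI_Wtime();
            if (rank % 2 == 0) {
                MPI_Send(&byte, 1, MPI_CHAR, peer, 0, MPI_COMM_WORLD);
                MPI_Recv(&byte, 1, MPI_CHAR, peer, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else {
                MPI_Recv(&byte, 1, MPI_CHAR, peer, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(&byte, 1, MPI_CHAR, peer, 0, MPI_COMM_WORLD);
            }
            double rtt = MPI_Wtime() - t0;
            if (rank % 2 == 0) {
                struct timeval tv;
                gettimeofday(&tv, NULL);                 /* wall clock, for cross-job correlation */
                printf("%d: t=%ld.%06ld rtt=%.6f\n",
                       rank, (long)tv.tv_sec, (long)tv.tv_usec, rtt);
            }
            usleep(250000);                              /* ~4 probes per second */
        }
        MPI_Finalize();
        return 0;
    }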