Hello!

It came to my attention that the seastar network does not implement message priorities for various reasons. I really think there is a very valid case for priorities of some sort to allow MPI and other latency-critical traffic to go in front of bulk IO traffic on the wire.

Consider this test I was running the other day on Jaguar. The application writes 250M of data from every core with a plain write() system call; the write() syscall returns very fast (less than 0.5 sec == 400+Mb/sec app-perceived bandwidth) because the data just goes to the memory cache to be flushed later. Then I do 2 barriers one by one with nothing in between. If I run it at sufficient scale (say 1200 cores), the first barrier takes 4.5 seconds to complete and the second one 1.5 seconds, all due to MPI RPCs being stuck behind huge bulk data requests on the clients, presumably (I do not have any other good explanations at least). This makes for a lot of wasted time in applications that would like to use the buffering capabilities provided by the OS.

Do you think something like this could be organized, if not for the current revision then at least for the next version?

Bye,
Oleg
I wonder if that scenario may have some bearing on the results I've mentioned at:

http://www.nersc.gov/~uselton/frank_jag/

It would be interesting to step through the logic if anyone is interested in doing so. The web page itself is terse, so feel free to bug me for details if you have not seen this before.

Cheers,
Andrew

Oleg Drokin wrote:
> It came to my attention that the seastar network does not implement message priorities for various reasons. [...]
Oleg Drokin wrote:
> Hello!
>
> It came to my attention that the seastar network does not implement message priorities for various reasons. I really think there is a very valid case for priorities of some sort to allow MPI and other latency-critical traffic to go in front of bulk IO traffic on the wire.

In the ptllnd, the bulk traffic is set up via short messages, so if the barrier is sent right after the write() returns, it really isn't backed up behind the bulk data.

> Consider this test I was running the other day on Jaguar. The application writes 250M of data from every core with a plain write() system call; the write() syscall returns very fast (less than 0.5 sec == 400+Mb/sec app-perceived bandwidth) because the data just goes to the memory cache to be flushed later. Then I do 2 barriers one by one with nothing in between. If I run it at sufficient scale (say 1200 cores), the first barrier takes 4.5 seconds to complete and the second one 1.5 seconds, all due to MPI RPCs being stuck behind huge bulk data requests on the clients, presumably (I do not have any other good explanations at least). This makes for a lot of wasted time in applications that would like to use the buffering capabilities provided by the OS.

This sounds much more like barrier jitter than backup. The network is capable of servicing the 250M in < .15s. It would be my guess that some of the writes() are taking longer than others and this is causing the barrier to be delayed.

A few questions:
- how many OSS/OSTs are you writing to?
- can you post the MPI app you are using to do this?

The application folks @ ORNL should be able to help you use Craypat or Apprentice to get some runtime data on this app to find where the time is going. Until we have hard data, I don't think we can blame the network.

Cheers,
Nic
On Tue, 2009-03-31 at 22:43 -0600, Oleg Drokin wrote:

> Hello!
>
> It came to my attention that the seastar network does not implement message priorities for various reasons.

That is incorrect. The seastar network does implement at least one priority scheme, based on age. It's not something an application can play with, if I remember right.

> I really think there is a very valid case for priorities of some sort to allow MPI and other latency-critical traffic to go in front of bulk IO traffic on the wire.

That would be very difficult to implement without making starvation scenarios trivial.

> Consider this test I was running the other day on Jaguar. The application writes 250M of data from every core with a plain write() system call [...] This makes for a lot of wasted time in applications that would like to use the buffering capabilities provided by the OS.

I strongly suspect OS jitter, probably related to FS activity, is a much more likely explanation for the above. If just one node has the process/rank suspended then it can't service the barrier; all will wait until it can.

Jitter gets a bad rap. Usually for good reason. However, in this case, it doesn't seem something to worry overly much about as it will cease. Your test says the 1st barrier after the write completes in 4.5 sec and the 2nd in 1.5 sec. That seems to imply the jitter is settling pretty rapidly. Jitter is really only bad when it is chronic.

To me, you are worrying way too much about the situation immediately after a write. Checkpoints are relatively rare, with long periods between. Why worry about something that's only going to affect a very small portion of the overall job? As long as the jitter dissipates in a short time, things will work out fine.

Maybe you could convince yourself of the efficacy of write-back caching in this scenario by altering the app to do an fsync() after the write phase on the node but before the barrier? If the app can get back to computing, even with the jitter-disrupted barrier, faster than it could by waiting for the outstanding dirty buffers to be flushed, then it's a net win to just live with the jitter, no?

--Lee
Hello!

On Apr 1, 2009, at 8:55 AM, Nic Henke wrote:

>> It came to my attention that the seastar network does not implement message priorities for various reasons. I really think there is a very valid case for priorities of some sort to allow MPI and other latency-critical traffic to go in front of bulk IO traffic on the wire.
> In the ptllnd, the bulk traffic is set up via short messages, so if the barrier is sent right after the write() returns, it really isn't backed up behind the bulk data.

Yes, it is. Lustre starts to send RPCs as soon as 1M (+16) pages of data per RPC become available for sending. So by the time the write() syscall for 250M returns, I already potentially have 16 (stripe count) * 4 (core count) * 8 (rpcs in flight) MB in flight from this particular node (since chances are the OSTs already accepted the transfers if there are free threads).

> This sounds much more like barrier jitter than backup. The network is capable of servicing the 250M in < .15s. It would be my guess that some of the writes() are taking longer than others and this is causing the barrier to be delayed.

No. I time each individual write separately. I know all writes start at the same time (there is a barrier before them), and I know that each write finishes in approx 0.5 sec as well.

> A few questions:
> - how many OSS/OSTs are you writing to?

Up to 16 * 4 from a single node.

> - can you post the MPI app you are using to do this?

Sure. Attached. (With example output.)

> The application folks @ ORNL should be able to help you use Craypat or Apprentice to get some runtime data on this app to find where the time is going. Until we have hard data, I don't think we can blame the network.

Interesting idea. Please note that if I run the code at a scale of 4, the barrier is instant. As I scale up the node count, the barrier time begins to rise.

In the output you can see I run the code twice in a row. This is done to make sure the grant is primed in case it was not, to take the entire amount of data into the cache (otherwise in some runs some individual writes take significant time to complete, invalidating the test). Another thing of note is that, since I did not want to take any chances, the working files are precreated externally so that no files share any ost for a single node, and the app itself just opens the files, not creates them.

Bye,
Oleg

Attachments (scrubbed by the list):
- writespeed-big.c: http://lists.lustre.org/pipermail/lustre-devel/attachments/20090401/adcae1f6/attachment-0003.obj
- writespeed_big.o555745: http://lists.lustre.org/pipermail/lustre-devel/attachments/20090401/adcae1f6/attachment-0004.obj
- writespeed_big.pbs: http://lists.lustre.org/pipermail/lustre-devel/attachments/20090401/adcae1f6/attachment-0005.obj
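Since the attachment itself was scrubbed, here is a minimal sketch of the kind of test being described: each rank writes ~250M to its own precreated file with a single write(), then times two back-to-back barriers. It is an approximation, not the actual writespeed-big.c; the output path is hypothetical, and the optional fsync() toggle is there only to make the comparison Lee suggested earlier easy to run.

    /* Rough approximation of the test in this thread; NOT the original
     * writespeed-big.c.  Each rank writes ~250 MB with one write() to a
     * precreated file, then times two back-to-back MPI_Barrier() calls.
     * Passing any argument inserts an fsync() before the first barrier. */
    #include <fcntl.h>
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define NBYTES (250UL * 1024 * 1024)

    int main(int argc, char **argv)
    {
        int rank, do_fsync = (argc > 1);
        char path[256];
        double t0, t_write, t_bar1, t_bar2;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        char *buf = malloc(NBYTES);
        if (!buf) MPI_Abort(MPI_COMM_WORLD, 1);
        memset(buf, 0xab, NBYTES);

        /* Files are precreated externally, so open() does no create. */
        snprintf(path, sizeof(path), "out/file.%d", rank);   /* hypothetical path */
        int fd = open(path, O_WRONLY);
        if (fd < 0) { perror("open"); MPI_Abort(MPI_COMM_WORLD, 1); }

        MPI_Barrier(MPI_COMM_WORLD);            /* synchronize the start of the write */

        t0 = MPI_Wtime();
        if (write(fd, buf, NBYTES) != (ssize_t)NBYTES)
            fprintf(stderr, "%d: short or failed write\n", rank);
        t_write = MPI_Wtime() - t0;

        if (do_fsync)
            fsync(fd);                          /* optional: flush before the barrier */

        t0 = MPI_Wtime();
        MPI_Barrier(MPI_COMM_WORLD);            /* 1st barrier, right after write() returns */
        t_bar1 = MPI_Wtime() - t0;

        t0 = MPI_Wtime();
        MPI_Barrier(MPI_COMM_WORLD);            /* 2nd barrier, nothing in between */
        t_bar2 = MPI_Wtime() - t0;

        printf("%d: write %.6f  barrier1 %.6f  barrier2 %.6f sec\n",
               rank, t_write, t_bar1, t_bar2);

        close(fd);
        free(buf);
        MPI_Finalize();
        return 0;
    }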
Hello!

On Apr 1, 2009, at 10:26 AM, Lee Ward wrote:

>> It came to my attention that the seastar network does not implement message priorities for various reasons.
> That is incorrect. The seastar network does implement at least one priority scheme, based on age. It's not something an application can play with, if I remember right.

Well, then it's as good as none for our purposes, I think?

> I strongly suspect OS jitter, probably related to FS activity, is a much more likely explanation for the above. If just one node has the process/rank suspended then it can't service the barrier; all will wait until it can.

That's of course right and possible too. Though given how nothing else is running on the nodes, I would think it is somewhat irrelevant, since there is nothing else to give resources to. The Lustre processing of the outgoing queue is pretty fast in itself at this phase. Do you think it would be useful if I just ran 1 thread per node, so there would be 3 empty cores to absorb all the jitter there might be?

> Jitter gets a bad rap. Usually for good reason. However, in this case, it doesn't seem something to worry overly much about as it will cease. Your test says the 1st barrier after the write completes in 4.5 sec and the 2nd in 1.5 sec. That seems to imply the jitter is settling pretty rapidly. Jitter is really only bad when it is chronic.

Well, 4.5 * 1200 = 1.5 hours of completely wasted cpu time for my specific job. So I thought it would be a good idea to get to the root of it. We hear many arguments here at the lab that "what good is the buffered io for me when my app performance is degraded if I don't do sync? I'll just do the sync and be over with it". Of course I believe there is still benefit to not doing the sync, but that's just me.

> To me, you are worrying way too much about the situation immediately after a write. Checkpoints are relatively rare, with long periods between. Why worry about something that's only going to affect a very small portion of the overall job? As long as the jitter dissipates in a short time, things will work out fine.

I worry about it specifically because users tend to do sync after the write, and that wastes a lot of time. So as a result I want as much of the data as possible to enter the cache and then trickle out all by itself, and I want users not to see any bad effects (or otherwise to show them that there are still benefits).

> Maybe you could convince yourself of the efficacy of write-back caching in this scenario by altering the app to do an fsync() after the write phase on the node but before the barrier? If the app can get back to computing, even with the jitter-disrupted barrier, faster than it could by waiting for the outstanding dirty buffers to be flushed, then it's a net win to just live with the jitter, no?

I do not need to convince myself. It's the app programmers that are fixated on "oh, look, my program is slower after the write if I do not do sync, I must do sync!"

Bye,
Oleg
On Wed, 2009-04-01 at 09:14 -0600, Oleg Drokin wrote:

> Hello!
>
> On Apr 1, 2009, at 10:26 AM, Lee Ward wrote:
>>> It came to my attention that the seastar network does not implement message priorities for various reasons.
>> That is incorrect. The seastar network does implement at least one priority scheme, based on age. It's not something an application can play with, if I remember right.
>
> Well, then it's as good as none for our purposes, I think?

Other than that traffic moves (only very roughly) in a fair manner and that packets from different nodes can arrive out of order, I guess.

I think my point was that there is already a priority scheme in the Seastar. Are there additional bits related to priority that you might use, also?

>> I strongly suspect OS jitter, probably related to FS activity, is a much more likely explanation for the above. If just one node has the process/rank suspended then it can't service the barrier; all will wait until it can.
>
> That's of course right and possible too. Though given how nothing else is running on the nodes, I would think it is somewhat irrelevant, since there is nothing else to give resources to.

How and where memory is used on two nodes is different. How, where, and when scheduling occurs on two nodes is different. Any two nodes, even running the same app with barrier synchronization, perform things at different times outside of the barriers; they very quickly desynchronize in the presence of jitter.

> The Lustre processing of the outgoing queue is pretty fast in itself at this phase. Do you think it would be useful if I just ran 1 thread per node, so there would be 3 empty cores to absorb all the jitter there might be?

You will still get jitter. I would hope less, though, so it wouldn't hurt to try to leave at least one idle core. We've toyed with the idea of leaving a core idle for IO and other background processing in the past. The idea was a non-starter with our apps folks, though. Maybe the ORNL folks will feel differently?

>> Jitter gets a bad rap. Usually for good reason. However, in this case, it doesn't seem something to worry overly much about as it will cease. Your test says the 1st barrier after the write completes in 4.5 sec and the 2nd in 1.5 sec. That seems to imply the jitter is settling pretty rapidly. Jitter is really only bad when it is chronic.
>
> Well, 4.5 * 1200 = 1.5 hours of completely wasted cpu time for my specific job.

That 1200 is the number of checkpoints? If so, I agree. If it's the number of nodes, I do not.

> So I thought it would be a good idea to get to the root of it. We hear many arguments here at the lab that "what good is the buffered io for me when my app performance is degraded if I don't do sync? I'll just do the sync and be over with it". Of course I believe there is still benefit to not doing the sync, but that's just me.

If the time to settle the jitter is on the order of 10 seconds but it takes 15 seconds to sync, it would be better to live with the jitter, no? I suggested an experiment to make this comparison. Why argue with them? Just do the experiment and you can know which strategy is better.

>> To me, you are worrying way too much about the situation immediately after a write. Checkpoints are relatively rare, with long periods between. Why worry about something that's only going to affect a very small portion of the overall job? As long as the jitter dissipates in a short time, things will work out fine.
>
> I worry about it specifically because users tend to do sync after the write, and that wastes a lot of time. So as a result I want as much of the data as possible to enter the cache and then trickle out all by itself, and I want users not to see any bad effects (or otherwise to show them that there are still benefits).

Users tend to do sync for more reasons than making the IO deterministic. They should be doing it so that they can have some faith that the last checkpoint is actually persistent when interrupted.

However, they should do the sync right before they enter the IO phase, in order to also get the benefits of write-back caching. Not after the IO phase. In the event of an interrupt, this forces them to throw away an in-progress checkpoint and the last one before that, to be safe, but the one before the last should be good.

The apps could also be more reasonable about their checkpoints, I've noticed. Often, for us anyway, the machine just behaves. If the app began by assuming the machine was unreliable, then as it ran for longer and longer periods it could (I argue should) allow the period between checkpoints to grow. If the idea is to make progress, as I'm told, then on a well-behaved machine far fewer checkpoints are required. Most apps, though, just use a fixed period and waste a lot of time doing their checkpoints when the machine is being nice to them.

>> Maybe you could convince yourself of the efficacy of write-back caching in this scenario by altering the app to do an fsync() after the write phase on the node but before the barrier? If the app can get back to computing, even with the jitter-disrupted barrier, faster than it could by waiting for the outstanding dirty buffers to be flushed, then it's a net win to just live with the jitter, no?
>
> I do not need to convince myself. It's the app programmers that are fixated on "oh, look, my program is slower after the write if I do not do sync, I must do sync!"

Try the experiment. Show them the data. They are, in theory, reasoning people, right?

In some cases, your app programmers will be unfortunately correct. An app that uses so much memory that the system cannot buffer the entire write will incur at least some issues while doing IO; some of the IO must move synchronously, and that amount will differ from node to node. This will have the effect of magnifying this post-IO jitter they are so worried about. It is also why I wrote in the original requirements for Lustre that if write-back caching is employed there must be a way to turn it off.

If they aren't sizing their app for the node's physical memory, though, I would think that the experiment should show that write-back caching is a win.

--Lee
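A minimal sketch of the growing-checkpoint-interval idea Lee describes above. The doubling policy and the constants are illustrative choices, not anything prescribed in the thread.

    /* Illustrative only: grow the interval between checkpoints as the run
     * survives longer, on the theory that a machine that has behaved for a
     * while will probably keep behaving.  Doubling with a cap is one simple
     * policy; the constants are arbitrary. */
    #include <stdio.h>

    #define MIN_STEPS_BETWEEN_CKPT   100
    #define MAX_STEPS_BETWEEN_CKPT  6400
    #define TOTAL_STEPS            20000

    static void compute_step(int step)  { (void)step; /* application work */ }
    static void write_checkpoint(int s) { printf("checkpoint at step %d\n", s); }

    int main(void)
    {
        int interval = MIN_STEPS_BETWEEN_CKPT;   /* start pessimistic */
        int next_ckpt = interval;

        for (int step = 0; step < TOTAL_STEPS; step++) {
            compute_step(step);
            if (step == next_ckpt) {
                write_checkpoint(step);
                /* The machine has behaved this long; risk a longer interval. */
                if (interval < MAX_STEPS_BETWEEN_CKPT)
                    interval *= 2;
                next_ckpt = step + interval;
            }
        }
        return 0;
    }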
Lee,

I completely agree with your comments on measurement. I'd really, really like to see some.

Cheers,
Eric

> -----Original Message-----
> From: lustre-devel-bounces at lists.lustre.org [mailto:lustre-devel-bounces at lists.lustre.org] On Behalf Of Lee Ward
> Sent: 01 April 2009 4:58 PM
> To: Oleg Drokin
> Cc: Lustre Development Mailing List
> Subject: Re: [Lustre-devel] SeaStar message priority
> [...]
Hello!

On Apr 1, 2009, at 11:58 AM, Lee Ward wrote:

> I think my point was that there is already a priority scheme in the Seastar. Are there additional bits related to priority that you might use, also?

But if we cannot use it, there is none. Like we want mpi rpcs to go out first, to some degree.

>>> I strongly suspect OS jitter, probably related to FS activity, is a much more likely explanation for the above. If just one node has the process/rank suspended then it can't service the barrier; all will wait until it can.
>> That's of course right and possible too. Though given how nothing else is running on the nodes, I would think it is somewhat irrelevant, since there is nothing else to give resources to.
> How and where memory is used on two nodes is different. How, where,

That's irrelevant.

> and when scheduling occurs on two nodes is different. Any two nodes, even running the same app with barrier synchronization, perform things at different times outside of the barriers; they very quickly desynchronize in the presence of jitter.

But since the only thing I have in my app inside the barriers is the write call, there is not much way to desynchronize.

>> The Lustre processing of the outgoing queue is pretty fast in itself at this phase. Do you think it would be useful if I just ran 1 thread per node, so there would be 3 empty cores to absorb all the jitter there might be?
> You will still get jitter. I would hope less, though, so it wouldn't hurt to try to leave at least one idle core. We've toyed with the idea of leaving a core idle for IO and other background processing in the past. The idea was a non-starter with our apps folks, though. Maybe the ORNL folks will feel differently?

No, I do not think they would like the idea of forfeiting 1/4 of their CPU just so io is better, if the jitter is due to the cpu being occupied with io and apps stalled due to this (though I have a hard time believing an app would not be given a cpu for 4.5 seconds, even though there are potentially 4 idle cpus, or even 3; remember the other cores are also idle waiting on a barrier).

>>> Jitter gets a bad rap. Usually for good reason. However, in this case, it doesn't seem something to worry overly much about as it will cease. Your test says the 1st barrier after the write completes in 4.5 sec and the 2nd in 1.5 sec. That seems to imply the jitter is settling pretty rapidly. Jitter is really only bad when it is chronic.
>> Well, 4.5 * 1200 = 1.5 hours of completely wasted cpu time for my specific job.
> That 1200 is the number of checkpoints? If so, I agree. If it's the number of nodes, I do not.

1200 is the number of cores waiting on a barrier. Every core spends 4.5 seconds, so the total wasted single-cpu core time is 1.5 hours. And the more often this happens, the worse.

>> So I thought it would be a good idea to get to the root of it. We hear many arguments here at the lab that "what good is the buffered io for me when my app performance is degraded if I don't do sync? I'll just do the sync and be over with it". Of course I believe there is still benefit to not doing the sync, but that's just me.
> If the time to settle the jitter is on the order of 10 seconds but it takes 15 seconds to sync, it would be better to live with the jitter, no? I suggested an experiment to make this comparison. Why argue with them? Just do the experiment and you can know which strategy is better.

I know which one is better. I did the experiment. (Though I have no realistic way to measure when the "jitter" settles out.)

>>> To me, you are worrying way too much about the situation immediately after a write. Checkpoints are relatively rare, with long periods between. Why worry about something that's only going to affect a very small portion of the overall job? As long as the jitter dissipates in a short time, things will work out fine.
>> I worry about it specifically because users tend to do sync after the write, and that wastes a lot of time. So as a result I want as much of the data as possible to enter the cache and then trickle out all by itself, and I want users not to see any bad effects (or otherwise to show them that there are still benefits).
> Users tend to do sync for more reasons than making the IO deterministic. They should be doing it so that they can have some faith that the last checkpoint is actually persistent when interrupted.

For that they only need to do fsync before their next checkpoint, to make sure that the previous one completed.

> However, they should do the sync right before they enter the IO phase, in order to also get the benefits of write-back caching. Not after the IO phase. In the event of an interrupt, this forces them to throw away an in-progress checkpoint and the last one before that, to be safe, but the one before the last should be good.

Right. Yet they do some microbenchmark and decide it is a bad idea. Besides, reducing jitter, or whatever is the cause of the delays, would still be useful.

> In some cases, your app programmers will be unfortunately correct. An app that uses so much memory that the system cannot buffer the entire write will incur at least some issues while doing IO; some of the IO must move synchronously, and that amount will differ from node to node. This will have the effect of magnifying this post-IO jitter they are so worried about. It is also why I wrote in the original requirements for

Why would it? There still is potentially a benefit for the available cache size.

> Lustre that if write-back caching is employed there must be a way to turn it off.

There are around 3 ways to do that that I am aware of.

Bye,
Oleg
On Wed, 2009-04-01 at 10:35 -0600, Oleg Drokin wrote:

> Hello!
>
> On Apr 1, 2009, at 11:58 AM, Lee Ward wrote:
>> I think my point was that there is already a priority scheme in the Seastar. Are there additional bits related to priority that you might use, also?
>
> But if we cannot use it, there is none. Like we want mpi rpcs to go out first, to some degree.

If you don't want to follow up, I'm ok with that. It's up to you.

I understood what you want. There are at least two things I can imagine that would better the situation without trying to leverage something in the network itself:

1) Partition the adapter CAM so that there is always room to accommodate a user-space receive.
2) Prioritize injection to favor sends originating from user-space.

One or both of these might already be implemented. I don't know.

>>>> I strongly suspect OS jitter, probably related to FS activity, is a much more likely explanation for the above. If just one node has the process/rank suspended then it can't service the barrier; all will wait until it can.
>>> That's of course right and possible too. Though given how nothing else is running on the nodes, I would think it is somewhat irrelevant, since there is nothing else to give resources to.
>> How and where memory is used on two nodes is different. How, where,
>
> That's irrelevant.
>
>> and when scheduling occurs on two nodes is different. Any two nodes, even running the same app with barrier synchronization, perform things at different times outside of the barriers; they very quickly desynchronize in the presence of jitter.
>
> But since the only thing I have in my app inside the barriers is the write call, there is not much way to desynchronize.

Modify your test to report the length of time each node spent in the barrier (not just rank 0, as it is written now) immediately after the write call, then? If you are correct, they will all be roughly the same. If they have desynchronized, most will have very long wait times but at least one will be relatively short.

>>> The Lustre processing of the outgoing queue is pretty fast in itself at this phase. Do you think it would be useful if I just ran 1 thread per node, so there would be 3 empty cores to absorb all the jitter there might be?
>> You will still get jitter. I would hope less, though, so it wouldn't hurt to try to leave at least one idle core. We've toyed with the idea of leaving a core idle for IO and other background processing in the past. The idea was a non-starter with our apps folks, though. Maybe the ORNL folks will feel differently?
>
> No, I do not think they would like the idea of forfeiting 1/4 of their CPU just so io is better, if the jitter is due to the cpu being occupied with io and apps stalled due to this (though I have a hard time believing an app would not be given a cpu for 4.5 seconds, even though there are potentially 4 idle cpus, or even 3; remember the other cores are also idle waiting on a barrier).

Oh, I'm sure they're getting the CPU. They just won't come out of the barrier until all have processed the operation. The rates at which the nodes reach the barrier will be different. The rates at which they proceed through will be different. The only invariant after a barrier is that all the involved ranks *have* reached that point. Nothing about when that happened is stated or implied.

>>>> Jitter gets a bad rap. Usually for good reason. However, in this case, it doesn't seem something to worry overly much about as it will cease. Your test says the 1st barrier after the write completes in 4.5 sec and the 2nd in 1.5 sec. That seems to imply the jitter is settling pretty rapidly. Jitter is really only bad when it is chronic.
>>> Well, 4.5 * 1200 = 1.5 hours of completely wasted cpu time for my specific job.
>> That 1200 is the number of checkpoints? If so, I agree. If it's the number of nodes, I do not.
>
> 1200 is the number of cores waiting on a barrier. Every core spends 4.5 seconds, so the total wasted single-cpu core time is 1.5 hours.

It doesn't work that way. The barrier operation is implemented as a collective on the Cray. What you are missing in the math above is that every core waited during the *same* 4.5 second period. Total wasted time is only 4.5 seconds then.

> And the more often this happens, the worse.
>
>>> So I thought it would be a good idea to get to the root of it. We hear many arguments here at the lab that "what good is the buffered io for me when my app performance is degraded if I don't do sync? I'll just do the sync and be over with it". Of course I believe there is still benefit to not doing the sync, but that's just me.
>> If the time to settle the jitter is on the order of 10 seconds but it takes 15 seconds to sync, it would be better to live with the jitter, no? I suggested an experiment to make this comparison. Why argue with them? Just do the experiment and you can know which strategy is better.
>
> I know which one is better. I did the experiment. (Though I have no realistic way to measure when the "jitter" settles out.)

Which was better then? By how much? Were you just measuring a barrier, or do those numbers still work out when the app uses the network heavily after doing its writes?

>>>> To me, you are worrying way too much about the situation immediately after a write. Checkpoints are relatively rare, with long periods between. Why worry about something that's only going to affect a very small portion of the overall job? As long as the jitter dissipates in a short time, things will work out fine.
>>> I worry about it specifically because users tend to do sync after the write, and that wastes a lot of time. So as a result I want as much of the data as possible to enter the cache and then trickle out all by itself, and I want users not to see any bad effects (or otherwise to show them that there are still benefits).
>> Users tend to do sync for more reasons than making the IO deterministic. They should be doing it so that they can have some faith that the last checkpoint is actually persistent when interrupted.
>
> For that they only need to do fsync before their next checkpoint, to make sure that the previous one completed.
>
>> However, they should do the sync right before they enter the IO phase, in order to also get the benefits of write-back caching. Not after the IO phase. In the event of an interrupt, this forces them to throw away an in-progress checkpoint and the last one before that, to be safe, but the one before the last should be good.
>
> Right. Yet they do some microbenchmark and decide it is a bad idea. Besides, reducing jitter, or whatever is the cause of the delays, would still be useful.

You're making a wonderful argument for Catamount :)

>> In some cases, your app programmers will be unfortunately correct. An app that uses so much memory that the system cannot buffer the entire write will incur at least some issues while doing IO; some of the IO must move synchronously, and that amount will differ from node to node. This will have the effect of magnifying this post-IO jitter they are so worried about. It is also why I wrote in the original requirements for
>
> Why would it? There still is potentially a benefit for the available cache size.

In a fitted application, there is no useful amount of memory left over for the cache. Using it, then, is just unnecessary overhead.

As I said, there's a very real possibility your app programmers are correct. It goes beyond memory. Any resource under intense pressure due to contention offers the possibility that it can take longer to perform its requests independently than to serialize them. For instance, if an app does not use all of memory then there is plenty of room for Lustre to cache. Since these apps presumably are going to communicate after the IO phase (why else the barrier after the IO?), they will contend heavily with the Lustre client for the network interface, and that interface does not deal well with such a situation on the Cray. I can easily believe it would take longer for the app to get back to computing because of the asynchronous network traffic from the write-back than it would to just force the IO phase to complete, via fsync, and, after, do what it needs to do to get back to work. If, instead, an app does use all of the memory, then it's blocked for a long time in the IO calls waiting for a free buffer, before the sync. If/when that happens, the fsync is nearly a no-op as most of the dirty data have already been written. Were I an app programmer, I could easily come to the conclusion that the fsync is either useful or does not hurt.

The only cooperative app I can think of that seems to be able to win universally is the one structured to:

    for (;;) {
        barrier
        fsync
        checkpoint
        for (n = 0; n < TIME_STEPS_TWEEN_CHECKPOINT; n++) {
            compute
            communicate
        }
    }

I don't know any that work that way though :(

>> Lustre that if write-back caching is employed there must be a way to turn it off.
>
> There are around 3 ways to do that that I am aware of.

That's nice. It was a requirement, after all. ;)

--Lee
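A runnable rendering of the loop Lee sketches above, assuming MPI and plain POSIX I/O: the fsync() covers the previous iteration's checkpoint, so its write-back overlaps the compute phase. The checkpoint contents, file naming and iteration counts are placeholders.

    /* Sync the *previous* checkpoint right before writing the next one,
     * so the write-back of the last checkpoint overlaps the compute
     * phase.  The compute/communicate bodies and the file naming are
     * stand-ins, not anything from the original thread. */
    #include <fcntl.h>
    #include <mpi.h>
    #include <stdio.h>
    #include <unistd.h>

    #define TIME_STEPS_TWEEN_CHECKPOINT 1000

    static void compute(void)     { /* application work    */ }
    static void communicate(void) { /* halo exchange, etc. */ }

    int main(int argc, char **argv)
    {
        int rank, fd = -1;
        char path[64];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        for (int gen = 0; gen < 10; gen++) {         /* bounded run for the sketch */
            MPI_Barrier(MPI_COMM_WORLD);             /* barrier */
            if (fd >= 0) {                           /* fsync the previous checkpoint */
                fsync(fd);
                close(fd);
            }
            snprintf(path, sizeof(path), "ckpt.%d.%d", gen, rank);
            fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
            if (fd >= 0) {
                double state = gen;                  /* stand-in checkpoint data */
                if (write(fd, &state, sizeof(state)) != (ssize_t)sizeof(state))
                    perror("write");                 /* checkpoint, left in cache */
            }
            for (int n = 0; n < TIME_STEPS_TWEEN_CHECKPOINT; n++) {
                compute();
                communicate();
            }
        }
        if (fd >= 0) { fsync(fd); close(fd); }       /* make the final one persistent */
        MPI_Finalize();
        return 0;
    }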
Oleg Drokin wrote:
> Hello!
>
> On Apr 1, 2009, at 11:58 AM, Lee Ward wrote:
>> I think my point was that there is already a priority scheme in the Seastar. Are there additional bits related to priority that you might use, also?
>
> But if we cannot use it, there is none. Like we want mpi rpcs to go out first, to some degree.

If we have to deal with ordering, we are already sunk. The Lustre RPCs will go out and affect MPI latency to some degree, introducing jitter into the calls and affecting application performance.

> But since the only thing I have in my app inside the barriers is the write call, there is not much way to desynchronize.

Incorrect. You are running your app on all 4 CPUs on the node at the same time Lustre is sending RPCs. The kernel threads will get scheduled and run, pushing your app to the side and desynchronizing the barrier for the app as a whole.

> No, I do not think they would like the idea of forfeiting 1/4 of their CPU just so io is better, if the jitter is due to the cpu being occupied with io and apps stalled due to this (though I have a hard time believing an app would not be given a cpu for 4.5 seconds, even though there are potentially 4 idle cpus, or even 3; remember the other cores are also idle waiting on a barrier).

This gets easier to swallow in the future with 12-core and larger nodes; 1/12 is much easier to sacrifice.

What we really need to "prove" is where the delay is occurring. The MPI_Barrier messages are 0-byte sends, effectively turning them into Portals headers, and these are sent and processed very fast. In fact, the total amount of data being sent is _much_ less than the NIC is capable of. A rough estimate for 2 nodes talking to each other is 1700 MB/s and 50K lnet pings/s.

One thing to try is changing your aprun to use fewer CPUs per node: aprun -n 1200 -N [1,2,3] -cc 1-3. The -cc 1-3 will keep it off cpu 0, a known location for some IRQs and other servicing. You should also try to capture compute-node stats like cpu usage, # of threads active during the barrier, etc. to help narrow down where the time is going.

Nic
Hello!

On Apr 1, 2009, at 3:15 PM, Nicholas Henke wrote:

>> But since the only thing I have in my app inside the barriers is the write call, there is not much way to desynchronize.
> Incorrect. You are running your app on all 4 CPUs on the node at the same time Lustre is sending RPCs. The kernel threads will get scheduled and run, pushing your app to the side and desynchronizing the barrier for the app as a whole.

But I am measuring each write and I see that none of them significantly exceeds 0.5 seconds. Let it be 0.1 seconds difference. So then 4.5 seconds - 0.1 seconds for the write speed difference = 4.4 seconds.

> What we really need to "prove" is where the delay is occurring. The MPI_Barrier messages are 0-byte sends, effectively turning them into Portals headers, and these are sent and processed very fast. In fact, the total amount of data being sent is _much_ less than the NIC is capable of. A rough estimate for 2 nodes talking to each other is 1700 MB/s and 50K lnet pings/s.

Yes. I understand this point.

> One thing to try is changing your aprun to use fewer CPUs per node: aprun -n 1200 -N [1,2,3] -cc 1-3.

I just ran with 1 cpu per node, 1200 threads, leaving 3 cpus/node for the kernel and whatnot. The actual write syscall return time decreased, but the barrier time did not, even though we know that less data is in flight at any given time now (due to only 16 osts accessed per node, not 16*4). So something is going on, but I do not think we can blindly attribute it to just "ah, the kernel ate your cpu for important things, pushing the data".

0: barrier after write time: 4.528383 sec
0: barrier after write 2 time: 4.043252 sec

The pre-write barrier took only 0.096675 sec (to rule out general network congestion).

Bye,
Oleg
Hello!

On Apr 1, 2009, at 3:13 PM, Lee Ward wrote:

>> But if we cannot use it, there is none. Like we want mpi rpcs to go out first, to some degree.
> If you don't want to follow up, I'm ok with that. It's up to you. I understood what you want. There are at least two things I can imagine that would better the situation without trying to leverage something in the network itself:
> 1) Partition the adapter CAM so that there is always room to accommodate a user-space receive.

I cannot really comment on this option.

> 2) Prioritize injection to favor sends originating from user-space.

This is what I am speaking about, actually, perhaps not being able to explain myself clearly. Except perhaps just "user-space" is too generic, and a bit more fine-grained controls would be more beneficial.

> One or both of these might already be implemented. I don't know.

The second option does not look like it is implemented.

>>> and when scheduling occurs on two nodes is different. Any two nodes, even running the same app with barrier synchronization, perform things at different times outside of the barriers; they very quickly desynchronize in the presence of jitter.
>> But since the only thing I have in my app inside the barriers is the write call, there is not much way to desynchronize.
> Modify your test to report the length of time each node spent in the barrier (not just rank 0, as it is written now) immediately after the write call, then? If you are correct, they will all be roughly the same. If they have desynchronized, most will have very long wait times but at least one will be relatively short.

That's a fair point. I just scheduled the run.

> Oh, I'm sure they're getting the CPU. They just won't come out of the barrier until all have processed the operation. The rates at which the nodes reach the barrier will be different. The rates at which they

I believe the rates at which they come to the barrier are approximately the same. I do time the write system call, and the barrier is next to it. And the write system call has relatively small variability in time, so we can assume that all barriers start within 0.1 seconds of each other.

> proceed through will be different. The only invariant after a barrier is that all the involved ranks *have* reached that point. Nothing about when that happened is stated or implied.

Ok, I did not realize that, though that makes sense. I believe in my test the problem is on the sending side, i.e. the bottleneck does not let all nodes report that the point was reached by every thread. But as soon as all nodes have gathered, whatever control node sends the messages (which of course could be delayed in the queue if it is also doing the io; hm, I wonder what node coordinates this (set of nodes?), rank 0?), and once injected they should be processed instantly, since we do not have any significant incoming traffic on the nodes. Don't take my word for it, of course; the test is already scheduled and I'll share the results.

>>>>> Jitter gets a bad rap. Usually for good reason. However, in this case, it doesn't seem something to worry overly much about as it will cease. Your test says the 1st barrier after the write completes in 4.5 sec and the 2nd in 1.5 sec. That seems to imply the jitter is settling pretty rapidly. Jitter is really only bad when it is chronic.
>>>> Well, 4.5 * 1200 = 1.5 hours of completely wasted cpu time for my specific job.
>>> That 1200 is the number of checkpoints? If so, I agree. If it's the number of nodes, I do not.
>> 1200 is the number of cores waiting on a barrier. Every core spends 4.5 seconds, so the total wasted single-cpu core time is 1.5 hours.
> It doesn't work that way. The barrier operation is implemented as a collective on the Cray. What you are missing in the math above is that every core waited during the *same* 4.5 second period. Total wasted time is only 4.5 seconds then.

I have a feeling we are speaking about different subjects here. You are speaking about wall-clock time. I am speaking about total cpu-cycles wasted across all nodes.

>>> If the time to settle the jitter is on the order of 10 seconds but it takes 15 seconds to sync, it would be better to live with the jitter, no? I suggested an experiment to make this comparison. Why argue with them? Just do the experiment and you can know which strategy is better.
>> I know which one is better. I did the experiment. (Though I have no realistic way to measure when the "jitter" settles out.)
> Which was better then? By how much? Were you just measuring a barrier, or do those numbers still work out when the app uses the network heavily after doing its writes?

Unfortunately I do not have any real applications instrumented, so my barrier is a substitute for "network-heavy app activity". I started with it because the app programmers I spoke with complained about how their network latency is affected if they do buffered writes. The fsync takes upward of 10 seconds, depending on other load in the system, I guess. I have no easy way to measure the jitter. I do not think the writeout with or without fsync would take significantly different time, because the underlying io paths don't change, but that's non-scientific. Unfortunately, just doing a write, timing the fsync, then doing the write again, waiting the same amount of time the fsync took, and then doing another fsync to see if it returns instantly would not be conclusive, since Lustre only eagerly writes 1M chunks, and vm pressure only ensures data older than 30 seconds is pushed out. And that is before taking into account the possible variability of the load on the OSTs over time (since I cannot have the entire Jaguar all to myself).

>>> However, they should do the sync right before they enter the IO phase, in order to also get the benefits of write-back caching. Not after the IO phase. In the event of an interrupt, this forces them to throw away an in-progress checkpoint and the last one before that, to be safe, but the one before the last should be good.
>>
>> Right. Yet they do some microbenchmark and decide it is a bad idea. Besides, reducing jitter, or whatever is the cause of the delays, would still be useful.
> You're making a wonderful argument for Catamount :)

Actually, catamount definitely has its strong points, but there are drawbacks as well. With Linux it's just another set of benefits and drawbacks.

>>> In some cases, your app programmers will be unfortunately correct. An app that uses so much memory that the system cannot buffer the entire write will incur at least some issues while doing IO; some of the IO must move synchronously, and that amount will differ from node to node. This will have the effect of magnifying this post-IO jitter they are so worried about. It is also why I wrote in the original requirements for
>> Why would it? There still is potentially a benefit for the available cache size.
> In a fitted application, there is no useful amount of memory left over for the cache. Using it, then, is just unnecessary overhead.

Right. In this case it is even better to do non-caching io (directio style) to reduce the memory copy overhead as well.

> As I said, there's a very real possibility your app programmers are correct. It goes beyond memory. Any resource under intense pressure due to contention offers the possibility that it can take longer to perform its requests independently than to serialize them. For instance, if an app does not use all of memory then there is plenty of room for Lustre to cache. Since these apps presumably are going to communicate after the IO phase (why else the barrier after the IO?), they will contend heavily with the Lustre client for the network interface, and that interface does not deal well with such a situation on the Cray. I can easily believe it would take longer for the app to get back to computing because of the asynchronous network traffic from the write-back than it would to just force the IO phase to complete, via fsync, and, after, do what it needs to do to get back to work. If, instead, an app does use all of the memory, then it's blocked for a long time in the IO calls waiting for a free buffer, before the sync. If/when that happens, the fsync is nearly a no-op as most of the dirty data have already been written.

This is all very true. Currently I am only focusing on applications that do leave enough space for the fs cache, since that's where the possible benefit is and there is no drawback for applications that don't do cached io. (And this is the case for the app programmers I spoke with.)

> The only cooperative app I can think of that seems to be able to win universally is the one structured to:
>
>     for (;;) {
>         barrier
>         fsync
>         checkpoint
>         for (n = 0; n < TIME_STEPS_TWEEN_CHECKPOINT; n++) {
>             compute
>             communicate
>         }
>     }
>
> I don't know any that work that way though :(

We here at ORNL are trying hard to convince app programmers that this is indeed beneficial. Unfortunately it is not all that clear-cut; the machine itself behaving differently every time, due to all the different workloads going on, is part of the problem too. Of course we stand in our own way as well: with the default 32Mb limit of dirty cache per osc, in order to get a meaningful cache size we need to stripe the files over way too many OSTs, and as a result the overall IO performance is degraded compared to just outputting to a single OST from every job, due to the reduced randomness of that IO pattern.

Bye,
Oleg
Hello!

On Apr 1, 2009, at 4:17 PM, Oleg Drokin wrote:

>>>> and when scheduling occurs on two nodes is different. Any two nodes, even running the same app with barrier synchronization, perform things at different times outside of the barriers; they very quickly desynchronize in the presence of jitter.
>>> But since the only thing I have in my app inside the barriers is the write call, there is not much way to desynchronize.
>> Modify your test to report the length of time each node spent in the barrier (not just rank 0, as it is written now) immediately after the write call, then? If you are correct, they will all be roughly the same. If they have desynchronized, most will have very long wait times but at least one will be relatively short.
> That's a fair point. I just scheduled the run.

Ok. The results are in. I scheduled 2 runs: one at 4 threads/node and one at 1 thread/node.

For the 4 threads/node case, the 1st barrier took anywhere from 1.497 sec to 3.025 sec, with rank 0 reporting 1.627 sec. The second barrier took 0.916 to 2.758 seconds, with rank 0 reporting 1.992 sec. For barrier 2 I can actually clearly observe that threads terminate in groups of 4 with very close times, and the ranks suggest those nids are on the same nodes. On the 1st barrier this trend is much less visible, though.

In the 1 thread/node case, the fastest 1st barrier was 7.515 seconds and the slowest was 10.176. For the 2nd barrier, the fastest was 0.085 and the slowest 2.756, which is pretty close to the difference between the fastest and slowest 1st barrier. Since the amount of data written per node in this case is 4x smaller, I guess we just flushed all the data to the disk before the 1st barrier finished, and the difference in waiting was due to the differences in start times.

As you can see, the numbers tend to jump around, but there are still relatively big delays due to something other than just threads getting out of sync.

Bye,
Oleg
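For reference, a minimal sketch of the per-rank barrier timing behind the numbers above: every rank times its own MPI_Barrier() and the extremes are reduced to rank 0. The reporting format is illustrative; the actual test may simply print one line per rank.

    /* Each rank times its own barrier; min/max across ranks are reduced
     * to rank 0 after the fact.  Names and output format are illustrative. */
    #include <mpi.h>
    #include <stdio.h>

    static void timed_barrier(const char *label)
    {
        double t0 = MPI_Wtime();
        MPI_Barrier(MPI_COMM_WORLD);
        double dt = MPI_Wtime() - t0;

        double min_dt, max_dt;
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Reduce(&dt, &min_dt, 1, MPI_DOUBLE, MPI_MIN, 0, MPI_COMM_WORLD);
        MPI_Reduce(&dt, &max_dt, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("%s: fastest %.3f sec, slowest %.3f sec\n", label, min_dt, max_dt);
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        /* ... the write phase would go here ... */
        timed_barrier("barrier after write");
        timed_barrier("barrier after write 2");
        MPI_Finalize();
        return 0;
    }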
On Wed, 2009-04-01 at 20:46 -0600, Oleg Drokin wrote:

> [...]
>
> As you can see, the numbers tend to jump around, but there are still relatively big delays due to something other than just threads getting out of sync.

Agreed. It's something more than simple jitter.

From everything you have described, the nodes are otherwise idle. The only other thing I can think of, then, would be one or more Lustre client threads injecting traffic into the network, which is where you started.

A useful test might be to grab the MPI ping-pong from the test suite and modify it to slow it down a bit. Say 4 times a second? Augment it to report the ping-pong time and a time stamp. Augment your existing test to report time stamps for the beginning of the write call. Launch one, each, of these on your set of nodes; i.e., each node has both your write test and the ping-pong running at the same time. This presumes you can launch two mpi jobs onto your set of nodes. If not, come up with an equivalent that is supported? If the ping-pong latency goes way up at the write calls you can claim a correlation. Not definitive, as correlation does not equal cause, but it is pretty strong.

If there is correlation, it means Cray has kind of messed up the portals implementation. The portals implementation would be attempting to send *everything* in order. All portals needs is for traffic to go in order per nid and pid pair. An implementation is free to mix in unrelated traffic, and should, to prevent one process from starving others.

An idea... Does the Lustre service side restrict the number of simultaneous get operations it issues? I don't just mean to a particular client, but to all from a single server, be it OST or MDS. If not, consider it. If there are too many outstanding receives, an arriving message may miss the corresponding CAM entry due to a flush. What happens after that can't be pretty. At one time, it caused the client to resend. Does it still? If so, and resends are occurring, the affected clients have their bandwidth reduced by more than 50% for the affected operations. Since there is a barrier operation stuck behind it, well...

Mr. Booth has suggested that the portals client might offer to send less data per transfer. This would allow latency-sensitive sends to reach the front of the queue more quickly. It would also, I think, lower overall throughput. It's an idea worth considering, but it is a case of two evils. Can this be mitigated by peeking at the portals send queue in some way? If Lustre can identify outbound traffic in the queue that it didn't present, then it could respond as Mr. Booth has suggested, or back off on the rate at which it presents traffic, or both even? Initial latencies would be unchanged but would get better as the app did more communication, especially if it used the one-sided calls and overlapped them.

I'm sorry, if it's contention for the adapter I don't see a workaround without changing Lustre, or Cray changing the driver to more fairly service the independent streams. In any case, right now, your apps guys' suspicions probably have merit if it is indeed contention on the network adapter. They may really be better off forcing the IO to complete before moving to the next phase if that phase involves the network. How sad.

You do need to do the test, though, before you try to "fix" anything. Right now, it's only supposition that contention for the network adapter is the evil here.

--Lee
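A minimal sketch of the slowed-down ping-pong probe Lee suggests, assuming it runs as a second MPI job alongside the write test and that wall-clock timestamps from both jobs are correlated afterwards. The pairing of ranks, the probe rate and the run length are illustrative choices.

    /* Paired ranks (0<->1, 2<->3, ...) exchange a 1-byte message roughly
     * 4 times a second; the even rank prints the round-trip time with a
     * wall-clock timestamp so spikes can be lined up against the write
     * test's timestamps.  All constants here are illustrative. */
    #include <mpi.h>
    #include <stdio.h>
    #include <sys/time.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        char byte = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int peer = (rank % 2 == 0) ? rank + 1 : rank - 1;
        if (peer >= size) { MPI_Finalize(); return 0; }  /* odd rank count: last rank idles */

        for (int i = 0; i < 2400; i++) {                 /* roughly 10 minutes at 4 probes/sec */
            double t0 = MPI_Wtime();
            if (rank % 2 == 0) {
                MPI_Send(&byte, 1, MPI_CHAR, peer, 0, MPI_COMM_WORLD);
                MPI_Recv(&byte, 1, MPI_CHAR, peer, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else {
                MPI_Recv(&byte, 1, MPI_CHAR, peer, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(&byte, 1, MPI_CHAR, peer, 0, MPI_COMM_WORLD);
            }
            double rtt = MPI_Wtime() - t0;
            if (rank % 2 == 0) {
                struct timeval tv;
                gettimeofday(&tv, NULL);                 /* wall clock, for cross-job correlation */
                printf("%d: t=%ld.%06ld rtt=%.6f\n",
                       rank, (long)tv.tv_sec, (long)tv.tv_usec, rtt);
            }
            usleep(250000);                              /* ~4 probes per second */
        }
        MPI_Finalize();
        return 0;
    }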