Eric Barton
2009-Mar-16 12:56 UTC
[Lustre-devel] Oleg/Mike Work on Apps Metrics - FW: Mike Booth week ending 2009.03.15
Mike,

Yes, it would be fun to discuss - but I'm probably not going to be
available for a discussion like that for a week or two. BTW, I'm cc-ing
lustre-devel since this is of general interest.

I _do_ agree that for some apps, if there was sufficient memory on the
app node to buffer the local component of a checkpoint and let it
"dribble" out to disk, that would achieve better utilization of the
compute resource. However, parallel apps can be very sensitive to
"noise" on the network they're using for inter-process communication -
i.e. the checkpoint data either has to be written all the way to disk,
or at least buffered somewhere so that moving it to disk will not
interfere with the app's own communications. This latter idea is the
basis for the "flash cache" concept. Actually, I think it's worth
exploring the economics of it in more detail.

The variables are the aggregate network bandwidth into the distributed
checkpoint cache, which determines the checkpoint time, and the
aggregate path-minimum bandwidth (i.e. the lesser of network and disk
bandwidth) from the cache to disk, which determines how soon the cache
can be ready for the next checkpoint. The cache could be dedicated
nodes and storage (e.g. flash), or additional storage on the OSSes, or
any combination of the two. And the interesting relationship is how
compute cluster utilisation varies with the cost of the server and
cache subsystems.

--
Cheers,
    Eric

> -----Original Message-----
> From: Michael.Booth at Sun.COM [mailto:Michael.Booth at Sun.COM]
> Sent: 16 March 2009 3:06 AM
> To: Eric Barton
> Subject: Re: Oleg/Mike Work on Apps Metrics - FW: Mike Booth week ending 2009.03.15
>
> Eric,
>
> This is too bad. I should run the test on my laptop and see if I get
> the same behavior.
>
> The huge bandwidth requirement (30+ GB/s) that I see for
> checkpoint-style I/O is driven by bursts that last about 1/10 of the
> time of the following computation. There is no need to ensure that
> everything is on disk before resuming computation. If the system
> cleared out the cache while the computation proceeded, the next write
> would again go to cache at memory speed, since the previously cleaned
> pages could be reused for the next write. The bandwidth required to
> achieve what appears to be memory-speed I/O could, in this case, be
> about 3 GB/s.
>
> There are middleware schemes being developed to do asynchronous I/O on
> "other" nodes - transferring the checkpoint data out to those nodes so
> that they write it all out. To me this is the middleware working at
> odds with what the system software should naturally do for the
> application.
>
> I think it is safe to say that only a minority of scientific
> applications write data out and quickly read it back the way a typical
> Linux application, such as a web browser, does. That type of I/O is
> usually limited to codes whose working set is larger than the sum of
> the nodes' memory, which is rarer and rarer these days.
>
> I believe that making this work for these codes is a win in three ways:
>
> One: it reduces the need for high burst-rate I/O to disk for many
> programs, while giving the application the perception of much faster
> I/O.
>
> Two: it helps to reduce the impact of filesystem performance
> variability.
>
> Three: overall, not having the filesystem hit with huge bursts of I/O
> from tens of thousands of cores at seemingly random times could reduce
> the variability of the complete file system.
>
> Should we discuss on the phone, with Oleg?
>
> Thanks, this is fun,
>
> Mike
>
> Michael Booth
> michael.booth at sun.com
> mobile 512-289-3805
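A back-of-envelope sketch of the bandwidth argument in Mike's message
above. The 30 GB/s burst rate and the 1:10 burst-to-compute ratio are
his figures; the absolute burst and compute durations are assumptions
chosen only to illustrate the arithmetic.

/* ckpt_bandwidth.c - if a checkpoint burst lasts about 1/10 of the
 * following compute phase, buffered data only has to drain at roughly
 * 1/10 of the burst rate for the next checkpoint to again land in
 * clean cache at memory speed.
 */
#include <stdio.h>

int main(void)
{
    double burst_bw_gbs   = 30.0;   /* apparent (memory-speed) write rate */
    double ckpt_time_s    = 60.0;   /* assumed checkpoint burst duration  */
    double compute_time_s = 600.0;  /* ~10x the burst, per the message    */

    double ckpt_size_gb = burst_bw_gbs * ckpt_time_s;
    /* drain the buffered checkpoint while the app computes: */
    double required_drain_gbs = ckpt_size_gb / compute_time_s;

    printf("checkpoint size: %.0f GB\n", ckpt_size_gb);
    printf("sustained bandwidth needed to drain it: %.1f GB/s\n",
           required_drain_gbs);     /* ~3 GB/s, matching the message */
    return 0;
}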
Oleg Drokin
2009-Mar-18 20:31 UTC
[Lustre-devel] Oleg/Mike Work on Apps Metrics - FW: Mike Booth week ending 2009.03.15
Hello!

On Mar 16, 2009, at 8:56 AM, Eric Barton wrote:
> I _do_ agree that for some apps, if there was sufficient memory on the
> app node to buffer the local component of a checkpoint and let it
> "dribble" out to disk, that would achieve better utilization of the
> compute resource. However parallel apps can be very sensitive to
> "noise" on the network they're using for inter-process communication -
> i.e. the checkpoint data either has to be written all the way to disk,
> or at least buffered somewhere so that moving it to disk will not
> interfere with the app's own communications.
> This latter idea is the basis for the "flash cache" concept.
> Actually, I think it's worth exploring the economics of it in more
> detail.

This turns out to be a very true assertion. We (I) do see a huge delay
in e.g. MPI barriers done immediately after a write.

> The variables are the aggregate network bandwidth into the distributed
> checkpoint cache, which determines the checkpoint time, and the
> aggregate path-minimum bandwidth (i.e. the lesser of network and disk
> bandwidth) from the cache to disk, which determines how soon the cache
> can be ready for the next checkpoint. The cache could be dedicated
> nodes and storage (e.g. flash), or additional storage on the OSSes, or
> any combination of the two. And the interesting relationship is how
> compute cluster utilisation varies with the cost of the server and
> cache subsystems.

The thing is, if we can just flush data out of the cache at a moment
when there is no network-latency-critical activity on the app side
(somehow signaled by the app), why would we need the flash storage at
all? We can write nice sequential chunks to normal disks just as fast,
I presume. It is random I/O patterns that make flash shine.

Bye,
    Oleg
Andreas Dilger
2009-Mar-31 18:51 UTC
[Lustre-devel] Oleg/Mike Work on Apps Metrics - FW: Mike Booth week ending 2009.03.15
On Mar 18, 2009 16:31 -0400, Oleg Drokin wrote:
> On Mar 16, 2009, at 8:56 AM, Eric Barton wrote:
> > I _do_ agree that for some apps, if there was sufficient memory on
> > the app node to buffer the local component of a checkpoint and let
> > it "dribble" out to disk, that would achieve better utilization of
> > the compute resource. However parallel apps can be very sensitive to
> > "noise" on the network they're using for inter-process communication
> > - i.e. the checkpoint data either has to be written all the way to
> > disk, or at least buffered somewhere so that moving it to disk will
> > not interfere with the app's own communications.
> > This latter idea is the basis for the "flash cache" concept.
> > Actually, I think it's worth exploring the economics of it in more
> > detail.
>
> This turns out to be a very true assertion. We (I) do see a huge delay
> in e.g. MPI barriers done immediately after a write.

While this is true, I still believe that the amount of delay seen by
the application cannot possibly be worse than waiting for all of the
IO to complete. Also, the question is whether you are measuring the
FIRST MPI barrier after the write, vs e.g. the SECOND MPI barrier
after the write? Since Lustre is currently aggressively flushing the
write cache, the first MPI barrier is essentially waiting for all of
the IO to complete, which is of course very slow. The real item of
interest is how long the SECOND MPI barrier takes, which reflects the
overhead of the Lustre IO on the network performance.

It is impossible that Lustre IO completely saturates the entire
cross-sectional bandwidth of the system OR the client CPUs, so getting
some amount of computation for "free" during IO is still better than
waiting for the IO to complete.

For example, say we have 1000 processes each doing a 1GB write to their
own file, and the aggregate IO bandwidth is 10GB/s. The IO will take
about 100s to write if (as we currently do) we limit the amount of dirty
data on each client to avoid interfering with the application, and no
computation can be done during this time. If we allowed clients to
cache that 1GB of IO, it might take only 1s to complete the "write" and
then 99s to flush the IO to the OSTs.

If each compute timestep takes 0.1s during IO vs 0.01s without IO, you
would get 990 timesteps during the write flush in the second case,
before the cache was cleared, vs. none in the first case. I suspect
that the overhead of the MPI communication on the Lustre IO is small,
since the IO will be limited by the OST network and disk bandwidth,
which is generally a small fraction of the cross-sectional bandwidth.

This could be tested fairly easily with a real application that is
doing computation between IO, instead of a benchmark that is only doing
IO or only sleeping between IO, simply by increasing the per-OSC write
cache limit from 32MB to e.g. 1GB in the above case (or 2GB to avoid
the case where 2 processes on the same node are writing to the same
OST). Then, measure the time taken for the application to do, say, 1M
timesteps and 100 checkpoints with the 32MB and the 2GB write cache
sizes.

> > The variables are the aggregate network bandwidth into the
> > distributed checkpoint cache, which determines the checkpoint time,
> > and the aggregate path-minimum bandwidth (i.e. the lesser of network
> > and disk bandwidth) from the cache to disk, which determines how
> > soon the cache can be ready for the next checkpoint. The cache could
> > be dedicated nodes and storage (e.g. flash), or additional storage
> > on the OSSes, or any combination of the two. And the interesting
> > relationship is how compute cluster utilisation varies with the cost
> > of the server and cache subsystems.
>
> The thing is, if we can just flush data out of the cache at a moment
> when there is no network-latency-critical activity on the app side
> (somehow signaled by the app), why would we need the flash storage at
> all? We can write nice sequential chunks to normal disks just as fast,
> I presume.

Actually, an idea I had for clusters that are running many jobs at once
was essentially having the apps poll for IO capacity when doing a
checkpoint, so that they can avoid contending with other jobs that are
doing checkpoints at the same time. That way, an app might skip an
occasional checkpoint if the filesystem is busy, and instead compute
until the filesystem is less busy.

This would be equivalent to the client node being able to cache all of
the write data and flush it out in the background, so long as the time
to flush a single checkpoint never took longer than the time between
checkpoints.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
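A quick check of the arithmetic in Andreas's example; all figures are
the ones quoted in his message, and the per-OSC write cache limit he
refers to is presumably the osc max_dirty_mb tunable (an assumption on
my part).

/* write_cache_model.c - the numbers from the example above.
 * Everything here is taken from the message; nothing is measured.
 */
#include <stdio.h>

int main(void)
{
    double total_write_gb   = 1000 * 1.0;  /* 1000 processes x 1GB each   */
    double aggregate_gbs    = 10.0;        /* aggregate IO bandwidth      */
    double flush_time_s     = total_write_gb / aggregate_gbs;   /* ~100s  */

    double cached_write_s   = 1.0;         /* time to "write" into cache  */
    double background_s     = flush_time_s - cached_write_s;    /* ~99s   */

    double step_during_io_s = 0.1;         /* timestep cost while flushing */
    double steps_gained     = background_s / step_during_io_s;  /* ~990   */

    printf("flush takes %.0fs; with client-side caching the app gets\n"
           "about %.0f timesteps done during the background flush,\n"
           "versus none while blocked on the write.\n",
           flush_time_s, steps_gained);
    return 0;
}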
Oleg Drokin
2009-Mar-31 20:58 UTC
[Lustre-devel] Oleg/Mike Work on Apps Metrics - FW: Mike Booth week ending 2009.03.15
Hello!

On Mar 31, 2009, at 2:51 PM, Andreas Dilger wrote:
>>> This latter idea is the basis for the "flash cache" concept.
>>> Actually, I think it's worth exploring the economics of it in more
>>> detail.
>>
>> This turns out to be a very true assertion. We (I) do see a huge
>> delay in e.g. MPI barriers done immediately after a write.
>
> While this is true, I still believe that the amount of delay seen by
> the application cannot possibly be worse than waiting for all of the
> IO to complete. Also, the question is whether you are measuring the

Absolutely.

> FIRST MPI barrier after the write, vs e.g. the SECOND MPI barrier
> after the write? Since Lustre is currently aggressively flushing the
> write cache, the first MPI barrier is essentially waiting for all of
> the IO to complete, which is of course very slow. The real item of
> interest is how long the SECOND MPI barrier takes, which reflects the
> overhead of the Lustre IO on the network performance.

The second MPI barrier takes 1.5 seconds for me.

> It is impossible that Lustre IO completely saturates the entire
> cross-sectional bandwidth of the system OR the client CPUs, so getting
> some amount of computation for "free" during IO is still better than
> waiting for the IO to complete.

No arguments about that from me; I have been advocating this same thing
from the very beginning.

Bye,
    Oleg
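A minimal MPI sketch of the measurement being discussed - timing the
first and second MPI_Barrier immediately after a large buffered write.
The file path, per-rank write size, and the use of plain buffered
writes with no fsync are assumptions for illustration, not the harness
Oleg actually used.

/* barrier_after_write.c - each rank writes a large buffered chunk to
 * its own file, then times the FIRST and SECOND MPI_Barrier after the
 * write.  The first barrier absorbs the client's aggressive cache
 * flush; the second shows the residual interference of background IO.
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define CHUNK   (1 << 20)           /* 1MB per fwrite()            */
#define NCHUNKS 1024                /* 1GB per rank - assumption   */

int main(int argc, char **argv)
{
    int rank;
    char path[256], *buf;
    FILE *fp;
    double t0, t_first, t_second;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    buf = malloc(CHUNK);
    memset(buf, rank & 0xff, CHUNK);
    snprintf(path, sizeof(path), "/mnt/lustre/ckpt.%d", rank); /* assumed mount */

    fp = fopen(path, "w");
    for (int i = 0; i < NCHUNKS; i++)
        fwrite(buf, 1, CHUNK, fp);
    fclose(fp);                     /* buffered write; no explicit fsync */

    t0 = MPI_Wtime();
    MPI_Barrier(MPI_COMM_WORLD);
    t_first = MPI_Wtime() - t0;

    t0 = MPI_Wtime();
    MPI_Barrier(MPI_COMM_WORLD);
    t_second = MPI_Wtime() - t0;

    if (rank == 0)
        printf("first barrier %.3fs, second barrier %.3fs\n",
               t_first, t_second);

    free(buf);
    MPI_Finalize();
    return 0;
}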
di wang
2009-Apr-01 03:35 UTC
[Lustre-devel] Oleg/Mike Work on Apps Metrics - FW: Mike Booth week ending 2009.03.15
Hello,

Andreas Dilger wrote:
> If each compute timestep takes 0.1s during IO vs 0.01s without IO, you
> would get 990 timesteps during the write flush in the second case,
> before the cache was cleared, vs. none in the first case. I suspect
> that the overhead of the MPI communication on the Lustre IO is small,
> since the IO will be limited by the OST network and disk bandwidth,
> which is generally a small fraction of the cross-sectional bandwidth.
>
> This could be tested fairly easily with a real application that is
> doing computation between IO, instead of a benchmark that is only
> doing IO or only sleeping between IO, simply by increasing the per-OSC
> write cache limit from 32MB to e.g. 1GB in the above case (or 2GB to
> avoid the case where 2 processes on the same node are writing to the
> same OST). Then, measure the time taken for the application to do,
> say, 1M timesteps and 100 checkpoints with the 32MB and the 2GB write
> cache sizes.

Can we implement AIO here? For example, the AIO buffer could be treated
differently from other dirty buffers, i.e. not pushed aggressively to
the server. It seems that with buffered writes the user has to deal
with the filesystem buffer-cache behaviour in his application; I am not
sure that is good for them, and we may not even expose these features
to the application.

Thanks
WangDi
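For illustration, a minimal sketch of the POSIX AIO interface WangDi is
referring to - submit the checkpoint buffer with aio_write() and keep
computing while it drains. The path and sizes are assumptions, and (as
noted later in this thread) buffered AIO on Linux may still block at
submission.

/* aio_ckpt.c - sketch only: submit a checkpoint buffer asynchronously
 * and overlap computation with the flush.
 */
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define CKPT_SIZE (64 << 20)        /* 64MB - assumption */

static void compute_timestep(void) { /* application work would go here */ }

int main(void)
{
    char *buf = malloc(CKPT_SIZE);
    struct aiocb cb;
    int fd = open("/mnt/lustre/ckpt.aio", O_WRONLY | O_CREAT, 0644);

    memset(buf, 0xab, CKPT_SIZE);   /* pretend this is checkpoint state */
    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;
    cb.aio_buf    = buf;
    cb.aio_nbytes = CKPT_SIZE;
    cb.aio_offset = 0;

    if (aio_write(&cb) != 0) {      /* submit; nominally returns at once */
        perror("aio_write");
        return 1;
    }

    while (aio_error(&cb) == EINPROGRESS)
        compute_timestep();         /* overlap computation with the flush */

    printf("aio_write completed %zd bytes\n", aio_return(&cb));
    close(fd);
    free(buf);
    return 0;
}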
Michael Booth
2009-Apr-01 03:55 UTC
[Lustre-devel] Oleg/Mike Work on Apps Metrics - FW: Mike Booth week ending 2009.03.15
On Mar 31, 2009, at 11:35 PM, di wang wrote:
> Andreas Dilger wrote:
>> If each compute timestep takes 0.1s during IO vs 0.01s without IO,
>> you would get 990 timesteps during the write flush in the second
>> case, before the cache was cleared, vs. none in the first case. I
>> suspect that the overhead of the MPI communication on the Lustre IO
>> is small, since the IO will be limited by the OST network and disk
>> bandwidth, which is generally a small fraction of the
>> cross-sectional bandwidth.
>>
>> This could be tested fairly easily with a real application that is
>> doing computation between IO, instead of a benchmark that is only
>> doing IO or only sleeping between IO, simply by increasing the
>> per-OSC write cache limit from 32MB to e.g. 1GB in the above case
>> (or 2GB to avoid the case where 2 processes on the same node are
>> writing to the same OST). Then, measure the time taken for the
>> application to do, say, 1M timesteps and 100 checkpoints with the
>> 32MB and the 2GB write cache sizes.
>
> Can we implement AIO here? For example, the AIO buffer could be
> treated differently from other dirty buffers, i.e. not pushed
> aggressively to the server. It seems that with buffered writes the
> user has to deal with the filesystem buffer-cache behaviour in his
> application; I am not sure that is good for them, and we may not even
> expose these features to the application.
>
> Thanks
> WangDi

(My opinion.) The large size of the I/O requests put onto the SeaStar
by the Lustre client gives them an artificially high priority. Barriers
are just a few bytes; the I/Os from the client are in megabytes.
SeaStar has no priorities in its queue, but the amount of time it takes
to clear a megabyte-sized request results in a priority that has
thousands of times more impact on the hardware than the small
synchronization requests of many collectives. I am wondering whether
the interference of I/O with computation is more an artifact of message
size and bursts than of congestion or routing inefficiencies in
SeaStar.

If there are hundreds of megabytes of requests queued up on the
network, and there is no way to push a barrier or other small MPI
request up the queue, it is bound to create a disruption.

To borrow the elevator metaphor from Eric: if all the elevators are
queued up from 8:00 to 9:00 delivering office supplies on carts that
occupy the entire elevator, maybe the carts should be smaller, and
limited to a few per elevator trip.

Mike Booth
Oleg Drokin
2009-Apr-01 04:34 UTC
[Lustre-devel] Oleg/Mike Work on Apps Metrics - FW: Mike Booth week ending 2009.03.15
Hello!

On Mar 31, 2009, at 11:55 PM, Michael Booth wrote:
> (My opinion.) The large size of the I/O requests put onto the SeaStar
> by the Lustre client gives them an artificially high priority.
> Barriers are just a few bytes; the I/Os from the client are in
> megabytes. SeaStar has no priorities in its queue, but the amount of
> time it takes to clear a megabyte-sized request results in a priority
> that has thousands of times more impact on the hardware than the
> small synchronization requests of many collectives. I am wondering
> whether the interference of I/O with computation is more an artifact
> of message size and bursts than of congestion or routing
> inefficiencies in SeaStar.
>
> If there are hundreds of megabytes of requests queued up on the
> network, and there is no way to push a barrier or other small MPI
> request up the queue, it is bound to create a disruption.
>
> To borrow the elevator metaphor from Eric: if all the elevators are
> queued up from 8:00 to 9:00 delivering office supplies on carts that
> occupy the entire elevator, maybe the carts should be smaller, and
> limited to a few per elevator trip.

As we discussed in the past, just sending small I/O messages is going
to uncover all kinds of slowdowns all the way back to the disk storage,
and the collateral damage would be to other tasks that do need fast I/O
and do send big chunks of data.

Bye,
    Oleg
Eric Barton
2009-Apr-01 05:01 UTC
[Lustre-devel] Oleg/Mike Work on Apps Metrics - FW: Mike Booth week ending 2009.03.15
I'd really like to see measurements that confirm that allowing the
checkpoint I/O to overlap the application's compute phase really
delivers the estimated benefits. My concern is that Lustre and
application communications will interfere to the detriment of both and
end up being less efficient overall.

Mike, can you find 2 apps, one which is communications-intensive and
another that is only CPU-bound immediately after a checkpoint, and
measure them?

Cheers,
    Eric

> -----Original Message-----
> From: Andreas.Dilger at Sun.COM [mailto:Andreas.Dilger at Sun.COM] On Behalf Of Andreas Dilger
> Sent: 31 March 2009 7:51 PM
> To: Oleg Drokin
> Cc: Eric Barton; lustre-devel at lists.lustre.org; Michael.Booth at Sun.COM
> Subject: Re: [Lustre-devel] Oleg/Mike Work on Apps Metrics - FW: Mike Booth week ending 2009.03.15
Mike Booth
2009-Apr-01 05:08 UTC
[Lustre-devel] Oleg/Mike Work on Apps Metrics - FW: Mike Booth week ending 2009.03.15
Yes. But I do need to finish the assignment that Scott Klasky has for
me first.

Mike

Mike Booth
512 289-3805
mobile 512 692-9602

On Apr 1, 2009, at 1:01 AM, Eric Barton <eeb at sun.com> wrote:
> I'd really like to see measurements that confirm that allowing the
> checkpoint I/O to overlap the application's compute phase really
> delivers the estimated benefits. My concern is that Lustre and
> application communications will interfere to the detriment of both
> and end up being less efficient overall.
>
> Mike, can you find 2 apps, one which is communications-intensive and
> another that is only CPU-bound immediately after a checkpoint, and
> measure them?
>
> Cheers,
>     Eric
Michael Booth
2009-Apr-01 11:41 UTC
[Lustre-devel] Oleg/Mike Work on Apps Metrics - FW: Mike Booth week ending 2009.03.15
On Apr 1, 2009, at 12:34 AM, Oleg Drokin wrote:
> Hello!
>
> On Mar 31, 2009, at 11:55 PM, Michael Booth wrote:
>> (My opinion.) The large size of the I/O requests put onto the
>> SeaStar by the Lustre client gives them an artificially high
>> priority. Barriers are just a few bytes; the I/Os from the client
>> are in megabytes. SeaStar has no priorities in its queue, but the
>> amount of time it takes to clear a megabyte-sized request results in
>> a priority that has thousands of times more impact on the hardware
>> than the small synchronization requests of many collectives. I am
>> wondering whether the interference of I/O with computation is more
>> an artifact of message size and bursts than of congestion or routing
>> inefficiencies in SeaStar.
>>
>> If there are hundreds of megabytes of requests queued up on the
>> network, and there is no way to push a barrier or other small MPI
>> request up the queue, it is bound to create a disruption.
>>
>> To borrow the elevator metaphor from Eric: if all the elevators are
>> queued up from 8:00 to 9:00 delivering office supplies on carts that
>> occupy the entire elevator, maybe the carts should be smaller, and
>> limited to a few per elevator trip.
>
> As we discussed in the past, just sending small I/O messages is going
> to uncover all kinds of slowdowns all the way back to the disk
> storage, and the collateral damage would be to other tasks that do
> need fast I/O and do send big chunks of data.
>
> Bye,
>     Oleg

Don't take my explanation as a general suggestion for all I/O. It is a
suggestion for I/O taking place at times when high responsiveness is
needed for MPI. How to know when that need is high is another issue.

To over-extend the metaphor: can the office-supply carts be small, and
the number allowed on the elevator be limited, from 8:00 to 9:00 am
when the office people traffic needs high responsiveness?

It is clear to the application when it is doing synchronous I/O and
does not care so much about MPI response, and when it is in a stage
where collective response is important. For example: the barrier after
an fsync definitely wants the highest responsiveness for I/O requests.
After that barrier completes, MPI would likely need the highest
responsiveness, until the next synchronous user I/O call (not counting
printf's). It could even be an explicit call from the application to
the Lustre client to switch the priority between states.

Mike
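A purely hypothetical sketch of the explicit priority-switch call Mike
suggests. Neither hint_io_priority() nor the IO_PRIO_* states exist in
Lustre; they only stand in for the idea of the application telling the
client which phase it is in.

/* io_prio_hint.c - hypothetical sketch only; the hint function is a
 * stub for an interface that does not exist.
 */
#include <stdio.h>

enum io_prio { IO_PRIO_BULK, IO_PRIO_LATENCY };

/* In a real implementation this might become an ioctl or library call
 * on the Lustre client; here it only prints the application's intent. */
static void hint_io_priority(enum io_prio p)
{
    printf("app hint: %s\n",
           p == IO_PRIO_BULK ? "bulk I/O phase" : "latency-critical phase");
}

static void write_checkpoint(void) { /* write() + fsync() would go here */ }
static void exchange_halos(void)   { /* MPI collectives would go here   */ }

int main(void)
{
    /* Checkpoint phase: I/O response matters, MPI noise is tolerable. */
    hint_io_priority(IO_PRIO_BULK);
    write_checkpoint();

    /* Compute phase: collectives must not be choked by background flush. */
    hint_io_priority(IO_PRIO_LATENCY);
    for (int step = 0; step < 100; step++)
        exchange_halos();
    return 0;
}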
Andreas Dilger
2009-Apr-02 22:43 UTC
[Lustre-devel] Oleg/Mike Work on Apps Metrics - FW: Mike Booth week ending 2009.03.15
On Mar 31, 2009 23:55 -0400, Michael Booth wrote:
> On Mar 31, 2009, at 11:35 PM, di wang wrote:
>> Andreas Dilger wrote:
>>> If each compute timestep takes 0.1s during IO vs 0.01s without IO,
>>> you would get 990 timesteps during the write flush in the second
>>> case, before the cache was cleared, vs. none in the first case. I
>>> suspect that the overhead of the MPI communication on the Lustre IO
>>> is small, since the IO will be limited by the OST network and disk
>>> bandwidth, which is generally a small fraction of the
>>> cross-sectional bandwidth.
>>>
>>> This could be tested fairly easily with a real application that is
>>> doing computation between IO, instead of a benchmark that is only
>>> doing IO or only sleeping between IO, simply by increasing the
>>> per-OSC write cache limit from 32MB to e.g. 1GB in the above case
>>> (or 2GB to avoid the case where 2 processes on the same node are
>>> writing to the same OST). Then, measure the time taken for the
>>> application to do, say, 1M timesteps and 100 checkpoints with the
>>> 32MB and the 2GB write cache sizes.
>>
>> Can we implement AIO here? For example, the AIO buffer could be
>> treated differently from other dirty buffers, i.e. not pushed
>> aggressively to the server. It seems that with buffered writes the
>> user has to deal with the filesystem buffer-cache behaviour in his
>> application; I am not sure that is good for them, and we may not
>> even expose these features to the application.

I'm not sure what you mean. Implementing AIO is _more_ complex for the
application, and in essence the current IO is mostly async except when
the client hits the max dirty limit. The client will still flush the
dirty data in the background (despite Michael's experiment), it just
takes the VM some time to catch up.

Linux VM /proc tunables can be tweaked on the client to have it be more
aggressive about pushing out dirty data. I suspect they are currently
tuned for desktop workloads more than IO-intensive workloads.

On Mar 31, 2009 23:55 -0400, Michael Booth wrote:
> The large size of the I/O requests put onto the SeaStar by the Lustre
> client gives them an artificially high priority. Barriers are just a
> few bytes; the I/Os from the client are in megabytes. SeaStar has no
> priorities in its queue, but the amount of time it takes to clear a
> megabyte-sized request results in a priority that has thousands of
> times more impact on the hardware than the small synchronization
> requests of many collectives. I am wondering whether the interference
> of I/O with computation is more an artifact of message size and
> bursts than of congestion or routing inefficiencies in SeaStar.
>
> If there are hundreds of megabytes of requests queued up on the
> network, and there is no way to push a barrier or other small MPI
> request up the queue, it is bound to create a disruption.

Note that the Lustre IO REQUESTS are not very large in themselves
(under 512 bytes for a 1MB write); it is the bulk transfer that is
large. The LND code could definitely cooperate with the network
hardware to ensure that small requests get a decent share of the
network bandwidth, and Lustre itself would also benefit from this
(allowing e.g. lock requests to bypass the bulk IO traffic), but
whether the network hardware can do this in any manner is a separate
question.

> To borrow the elevator metaphor from Eric: if all the elevators are
> queued up from 8:00 to 9:00 delivering office supplies on carts that
> occupy the entire elevator, maybe the carts should be smaller, and
> limited to a few per elevator trip.

A better analogy would be that the elevators occasionally have a mail
cart with tens or hundreds of requests for office supplies, and this
can easily share the elevator with other workers. Having a separate
freight elevator to handle the supplies themselves is one way to do it;
having the elevators alternate people and supplies is another; but
cutting desks into small pieces so they can share space with people is
not an option.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
Michael Booth
2009-Apr-03 18:27 UTC
[Lustre-devel] Oleg/Mike Work on Apps Metrics - FW: Mike Booth week ending 2009.03.15
On Apr 2, 2009, at 6:43 PM, Andreas Dilger wrote:
> I'm not sure what you mean. Implementing AIO is _more_ complex for the
> application, and in essence the current IO is mostly async except when
> the client hits the max dirty limit. The client will still flush the
> dirty data in the background (despite Michael's experiment), it just
> takes the VM some time to catch up.

Even after an fsync?

The experiment was to see whether the dirty cache is in fact being
voided as designed:

    for (writesize = small; writesize < hundreds of megabytes; writesize += increment) {
        for (sixty iterations) {
            write   <---- writesize
            timer1
            fsync
            timer2
        }
    }

Run again, but sleep before timing the fsync:

    for (writesize = small; writesize < hundreds of megabytes; writesize += increment) {
        for (sixty iterations) {
            write   <---- writesize
            sleep(1 second)
            timer1
            fsync
            timer2
        }
    }

For each iteration, take the best time for each fsync, and plot the
speedup of the second routine's fsync over the non-slept fsync.

The results:

[attachment scrubbed: pastedGraphic.pdf, 26637 bytes -
http://lists.lustre.org/pipermail/lustre-devel/attachments/20090403/39d8a9fc/attachment-0002.pdf]

To me this is not behavior that is consistent with the design. Are
there tests for writes to assure that the cache is behaving as
designed?

Another way to view this is to calculate the amount of data that is
moved during the 1 second of sleep:

[attachment scrubbed: pastedGraphic.pdf, 27283 bytes -
http://lists.lustre.org/pipermail/lustre-devel/attachments/20090403/39d8a9fc/attachment-0003.pdf]

This seems to show a "bug" in how much dirty data is shipped off to
disk. Again, I don't think the cache is clearing the way it was
designed to.

> Linux VM /proc tunables can be tweaked on the client to have it be
> more aggressive about pushing out dirty data. I suspect they are
> currently tuned for desktop workloads more than IO-intensive
> workloads.

I would not characterize these as I/O intensive; they are large bulk
sequential writes. The cache design assumes good performance when most
of what is written is also read back right away - yes, that is good
for desktops, but it really gets in the way of checkpoint/restart and
data-dumping scientific applications.

> On Mar 31, 2009 23:55 -0400, Michael Booth wrote:
>> The large size of the I/O requests put onto the SeaStar by the
>> Lustre client gives them an artificially high priority. Barriers are
>> just a few bytes; the I/Os from the client are in megabytes. SeaStar
>> has no priorities in its queue, but the amount of time it takes to
>> clear a megabyte-sized request results in a priority that has
>> thousands of times more impact on the hardware than the small
>> synchronization requests of many collectives. I am wondering whether
>> the interference of I/O with computation is more an artifact of
>> message size and bursts than of congestion or routing inefficiencies
>> in SeaStar.
>>
>> If there are hundreds of megabytes of requests queued up on the
>> network, and there is no way to push a barrier or other small MPI
>> request up the queue, it is bound to create a disruption.
>
> Note that the Lustre IO REQUESTS are not very large in themselves
> (under 512 bytes for a 1MB write); it is the bulk transfer that is
> large. The LND code could definitely cooperate with the network
> hardware to ensure that small requests get a decent share of the
> network bandwidth, and Lustre itself would also benefit from this
> (allowing e.g. lock requests to bypass the bulk IO traffic), but
> whether the network hardware can do this in any manner is a separate
> question.

Some codes see an 80% slowdown after writes due to collectives that are
choked out by background I/O.

>> To borrow the elevator metaphor from Eric: if all the elevators are
>> queued up from 8:00 to 9:00 delivering office supplies on carts that
>> occupy the entire elevator, maybe the carts should be smaller, and
>> limited to a few per elevator trip.
>
> A better analogy would be that the elevators occasionally have a mail
> cart with tens or hundreds of requests for office supplies, and this
> can easily share the elevator with other workers. Having a separate
> freight elevator to handle the supplies themselves is one way to do
> it; having the elevators alternate people and supplies is another;
> but cutting desks into small pieces so they can share space with
> people is not an option.

I don't want to cut the desks up; I just want to be sure that small
collective messages are not slowed down by orders of magnitude by what
should be a background process.

> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.

Mike
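A minimal C sketch of the fsync timing experiment described above; the
mount point, maximum write size, and iteration counts are placeholders
rather than the original harness.

/* fsync_timing.c - write a chunk, optionally sleep 1s, then time how
 * long fsync() takes.  Short writes and error handling are ignored for
 * brevity.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

static double now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(int argc, char **argv)
{
    int do_sleep = (argc > 1);          /* any argument enables the 1s sleep */
    size_t max = 256UL << 20;           /* up to 256MB per write - placeholder */
    char *buf = malloc(max);
    int fd = open("/mnt/lustre/fsync_test", O_WRONLY | O_CREAT | O_TRUNC, 0644);

    memset(buf, 0x5a, max);
    for (size_t writesize = 1 << 20; writesize <= max; writesize *= 2) {
        double best = 1e9;
        for (int i = 0; i < 60; i++) {
            write(fd, buf, writesize);
            if (do_sleep)
                sleep(1);               /* give background writeback a chance */
            double t1 = now();
            fsync(fd);                  /* time how much dirty data remains */
            double t2 = now();
            if (t2 - t1 < best)
                best = t2 - t1;
        }
        printf("%zu bytes: best fsync %.4fs%s\n",
               writesize, best, do_sleep ? " (with 1s sleep)" : "");
    }
    close(fd);
    free(buf);
    return 0;
}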
di wang
2009-Apr-06 22:12 UTC
[Lustre-devel] Oleg/Mike Work on Apps Metrics - FW: Mike Booth week ending 2009.03.15
Hello,

Andreas Dilger wrote:
> I'm not sure what you mean. Implementing AIO is _more_ complex for the
> application, and in essence the current IO is mostly async except when
> the client hits the max dirty limit. The client will still flush the
> dirty data in the background (despite Michael's experiment), it just
> takes the VM some time to catch up.
>
> Linux VM /proc tunables can be tweaked on the client to have it be
> more aggressive about pushing out dirty data. I suspect they are
> currently tuned for desktop workloads more than IO-intensive
> workloads.

I am not sure the current IO is "async" enough, since it still includes
some synchronous processing - for example locks, reads for partial
pages, and other stack overhead in commit_write - and sometimes that
overhead cannot be ignored. For example, even without hitting the max
dirty limit, you can get quite different write times for writing the
same data. I guess part of the reason is that the VM is simply "out of
control".

With AIO:

1) The application can skip this synchronous processing, for example by
creating a daemon to do that routine work.

2) We can control the page writes (sent to the OST) ourselves, instead
of relying on the VM.

3) These AIO pages would not need to comply with the max dirty limit in
the submit_io (AIO) process, so the user application gets "real"
memory-speed writing.

Yes, AIO will indeed be more complex for the application, but not that
much, IMHO.

Thanks
Wangdi
Andreas Dilger
2009-Apr-07 07:54 UTC
[Lustre-devel] Oleg/Mike Work on Apps Metrics - FW: Mike Booth week ending 2009.03.15
On Apr 06, 2009 18:12 -0400, di wang wrote:
> I am not sure the current IO is "async" enough, since it still
> includes some synchronous processing - for example locks, reads for
> partial pages, and other stack overhead in commit_write - and
> sometimes that overhead cannot be ignored. For example, even without
> hitting the max dirty limit, you can get quite different write times
> for writing the same data. I guess part of the reason is that the VM
> is simply "out of control".
>
> With AIO:
>
> 1) The application can skip this synchronous processing, for example
> by creating a daemon to do that routine work.
>
> 2) We can control the page writes (sent to the OST) ourselves, instead
> of relying on the VM.
>
> 3) These AIO pages would not need to comply with the max dirty limit
> in the submit_io (AIO) process, so the user application gets "real"
> memory-speed writing.
>
> Yes, AIO will indeed be more complex for the application, but not
> that much, IMHO.

There was just a discussion about Linux AIO today (from the former
Linux AIO maintainer, as it is currently unmaintained), and his
statement was that AIO is not really async at all: it will block, for
local filesystems, on any file allocation or any buffered IO. Only
preallocated files with O_DIRECT even have a chance at AIO, and even
that will not always work.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
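A minimal sketch of the preconditions Andreas mentions - a preallocated
file opened with O_DIRECT and a suitably aligned buffer - under which
AIO at least has a chance of being asynchronous. The alignment, size,
and path are assumptions, and whether this actually avoids blocking on
any given filesystem is exactly the open question above.

/* direct_aio.c - sketch of preallocation + O_DIRECT + aligned buffer
 * before submitting an asynchronous write.
 */
#define _GNU_SOURCE
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define ALIGN  4096                 /* assumed page/filesystem alignment */
#define NBYTES (16 << 20)           /* 16MB - assumption */

int main(void)
{
    void *buf;
    struct aiocb cb;
    int fd = open("/mnt/lustre/direct_ckpt", O_WRONLY | O_CREAT | O_DIRECT, 0644);

    posix_fallocate(fd, 0, NBYTES); /* preallocate: no block allocation at IO time */
    posix_memalign(&buf, ALIGN, NBYTES);
    memset(buf, 0, NBYTES);

    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;
    cb.aio_buf    = buf;
    cb.aio_nbytes = NBYTES;         /* must be a multiple of ALIGN for O_DIRECT */
    cb.aio_offset = 0;

    if (aio_write(&cb) != 0) {
        perror("aio_write");
        return 1;
    }
    while (aio_error(&cb) == EINPROGRESS)
        usleep(1000);               /* application work could go here instead */
    printf("wrote %zd bytes via O_DIRECT AIO\n", aio_return(&cb));

    close(fd);
    free(buf);
    return 0;
}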