Eric Barton
2009-Mar-16 12:56 UTC
[Lustre-devel] Oleg/Mike Work on Apps Metrics - FW: Mike Booth week ending 2009.03.15
Mike,

Yes, it would be fun to discuss - but I'm probably not going to be
available for a discussion like that for a week or two. BTW, I'm cc-ing
lustre-devel since this is of general interest.

I _do_ agree that for some apps, if there was sufficient memory on the
app node to buffer the local component of a checkpoint and let it
"dribble" out to disk, that would achieve better utilization of the
compute resource. However, parallel apps can be very sensitive to
"noise" on the network they're using for inter-process communication -
i.e. the checkpoint data either has to be written all the way to disk,
or at least buffered somewhere so that moving it to disk will not
interfere with the app's own communications. This latter idea is the
basis for the "flash cache" concept. Actually, I think it's worth
exploring the economics of it in more detail.

The variables are the aggregate network bandwidth into the distributed
checkpoint cache, which determines the checkpoint time, and the
aggregate path-minimum bandwidth (i.e. the lesser of network and disk
bandwidth) from the cache to disk, which determines how soon the cache
can be ready for the next checkpoint. The cache could be dedicated
nodes and storage (e.g. flash), or additional storage on the OSSes, or
any combination of the two. And the interesting relationship is how
compute cluster utilisation varies with the cost of the server and
cache subsystems.

--
Cheers,
    Eric

> -----Original Message-----
> From: Michael.Booth at Sun.COM [mailto:Michael.Booth at Sun.COM]
> Sent: 16 March 2009 3:06 AM
> To: Eric Barton
> Subject: Re: Oleg/Mike Work on Apps Metrics - FW: Mike Booth week ending 2009.03.15
>
> Eric,
>
> This is too bad. I should run the test on my laptop and see if I get
> the same behavior.
>
> The huge bandwidth requirement (30+ GB/s) that I see for
> checkpoint-style I/O is driven by bursts that last about 1/10 of the
> time of the following computation. There is no need to ensure that
> everything is on disk before resuming computation. If the system
> cleared out the cache while the computation proceeded, the next write
> would again go to cache at memory speed, since the previously cleaned
> pages could be reused for the next write. The bandwidth required to
> achieve what appears to be memory-speed I/O could, in this case, be
> about 3 GB/s.
>
> There are middleware schemes being developed to do asynchronous I/O on
> "other" nodes - transferring the checkpoint data out to those nodes so
> that they write it all out. To me this is the middleware working at
> odds with what the system software should naturally do for the
> application.
>
> I think it is safe to say that only a minority of scientific
> applications write data out and quickly read it back the way a typical
> Linux application, such as a web browser, does. That type of I/O is
> usually limited to codes whose working set is larger than the sum of
> the nodes' memory, which is rarer and rarer these days.
>
> I believe that making this work for these codes is a win in three ways:
>
> One: it reduces the need for high burst-rate I/O to disk for many
> programs, while giving the application the perception of much faster
> I/O.
>
> Two: it helps to reduce the impact of filesystem performance
> variability.
>
> Three: overall, not having the filesystem hit with huge bursts of I/O
> from tens of thousands of cores at seemingly random times could reduce
> the variability of the complete file system.
>
> Should we discuss on the phone, with Oleg?
>
> Thanks, this is fun,
>
> Mike
>
> Michael Booth
> michael.booth at sun.com
> mobile 512-289-3805
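A back-of-envelope sketch of the bandwidth argument in Mike's message
above. The 30 GB/s burst rate and the 1:10 burst-to-compute ratio are
his figures; the absolute burst and compute durations are assumptions
chosen only to illustrate the arithmetic.

/* ckpt_bandwidth.c - if a checkpoint burst lasts about 1/10 of the
 * following compute phase, buffered data only has to drain at roughly
 * 1/10 of the burst rate for the next checkpoint to again land in
 * clean cache at memory speed.
 */
#include <stdio.h>

int main(void)
{
    double burst_bw_gbs   = 30.0;   /* apparent (memory-speed) write rate */
    double ckpt_time_s    = 60.0;   /* assumed checkpoint burst duration  */
    double compute_time_s = 600.0;  /* ~10x the burst, per the message    */

    double ckpt_size_gb = burst_bw_gbs * ckpt_time_s;
    /* drain the buffered checkpoint while the app computes: */
    double required_drain_gbs = ckpt_size_gb / compute_time_s;

    printf("checkpoint size: %.0f GB\n", ckpt_size_gb);
    printf("sustained bandwidth needed to drain it: %.1f GB/s\n",
           required_drain_gbs);     /* ~3 GB/s, matching the message */
    return 0;
}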
Oleg Drokin
2009-Mar-18 20:31 UTC
[Lustre-devel] Oleg/Mike Work on Apps Metrics - FW: Mike Booth week ending 2009.03.15
Hello!

On Mar 16, 2009, at 8:56 AM, Eric Barton wrote:
> I _do_ agree that for some apps, if there was sufficient memory on the
> app node to buffer the local component of a checkpoint and let it
> "dribble" out to disk, that would achieve better utilization of the
> compute resource. However parallel apps can be very sensitive to
> "noise" on the network they're using for inter-process communication -
> i.e. the checkpoint data either has to be written all the way to disk,
> or at least buffered somewhere so that moving it to disk will not
> interfere with the app's own communications.
> This latter idea is the basis for the "flash cache" concept.
> Actually, I think it's worth exploring the economics of it in more
> detail.

This turns out to be a very true assertion. We (I) do see a huge delay
in e.g. MPI barriers done immediately after a write.

> The variables are the aggregate network bandwidth into the distributed
> checkpoint cache, which determines the checkpoint time, and the
> aggregate path-minimum bandwidth (i.e. the lesser of network and disk
> bandwidth) from the cache to disk, which determines how soon the cache
> can be ready for the next checkpoint. The cache could be dedicated
> nodes and storage (e.g. flash), or additional storage on the OSSes, or
> any combination of the two. And the interesting relationship is how
> compute cluster utilisation varies with the cost of the server and
> cache subsystems.

The thing is, if we can just flush data out of the cache at a moment
when there is no network-latency-critical activity on the app side
(somehow signaled by the app), why would we need the flash storage at
all? We can write nice sequential chunks to normal disks just as fast,
I presume. It is random I/O patterns that make flash shine.

Bye,
    Oleg
Andreas Dilger
2009-Mar-31 18:51 UTC
[Lustre-devel] Oleg/Mike Work on Apps Metrics - FW: Mike Booth week ending 2009.03.15
On Mar 18, 2009 16:31 -0400, Oleg Drokin wrote:
> On Mar 16, 2009, at 8:56 AM, Eric Barton wrote:
> > I _do_ agree that for some apps, if there was sufficient memory on
> > the app node to buffer the local component of a checkpoint and let
> > it "dribble" out to disk, that would achieve better utilization of
> > the compute resource. However parallel apps can be very sensitive to
> > "noise" on the network they're using for inter-process communication
> > - i.e. the checkpoint data either has to be written all the way to
> > disk, or at least buffered somewhere so that moving it to disk will
> > not interfere with the app's own communications.
> > This latter idea is the basis for the "flash cache" concept.
> > Actually, I think it's worth exploring the economics of it in more
> > detail.
>
> This turns out to be a very true assertion. We (I) do see a huge delay
> in e.g. MPI barriers done immediately after a write.

While this is true, I still believe that the amount of delay seen by
the application cannot possibly be worse than waiting for all of the
IO to complete. Also, the question is whether you are measuring the
FIRST MPI barrier after the write, vs e.g. the SECOND MPI barrier
after the write? Since Lustre is currently aggressively flushing the
write cache, the first MPI barrier is essentially waiting for all of
the IO to complete, which is of course very slow. The real item of
interest is how long the SECOND MPI barrier takes, which reflects the
overhead of the Lustre IO on the network performance.

It is impossible that Lustre IO completely saturates the entire
cross-sectional bandwidth of the system OR the client CPUs, so getting
some amount of computation for "free" during IO is still better than
waiting for the IO to complete.

For example, say we have 1000 processes each doing a 1GB write to their
own file, and the aggregate IO bandwidth is 10GB/s. The IO will take
about 100s to write if (as we currently do) we limit the amount of dirty
data on each client to avoid interfering with the application, and no
computation can be done during this time. If we allowed clients to
cache that 1GB of IO, it might take only 1s to complete the "write" and
then 99s to flush the IO to the OSTs.

If each compute timestep takes 0.1s during IO vs 0.01s without IO, you
would get 990 timesteps during the write flush in the second case,
before the cache was cleared, vs. none in the first case. I suspect
that the overhead of the MPI communication on the Lustre IO is small,
since the IO will be limited by the OST network and disk bandwidth,
which is generally a small fraction of the cross-sectional bandwidth.

This could be tested fairly easily with a real application that is
doing computation between IO, instead of a benchmark that is only doing
IO or only sleeping between IO, simply by increasing the per-OSC write
cache limit from 32MB to e.g. 1GB in the above case (or 2GB to avoid
the case where 2 processes on the same node are writing to the same
OST). Then, measure the time taken for the application to do, say, 1M
timesteps and 100 checkpoints with the 32MB and the 2GB write cache
sizes.

> > The variables are the aggregate network bandwidth into the
> > distributed checkpoint cache, which determines the checkpoint time,
> > and the aggregate path-minimum bandwidth (i.e. the lesser of network
> > and disk bandwidth) from the cache to disk, which determines how
> > soon the cache can be ready for the next checkpoint. The cache could
> > be dedicated nodes and storage (e.g. flash), or additional storage
> > on the OSSes, or any combination of the two. And the interesting
> > relationship is how compute cluster utilisation varies with the cost
> > of the server and cache subsystems.
>
> The thing is, if we can just flush data out of the cache at a moment
> when there is no network-latency-critical activity on the app side
> (somehow signaled by the app), why would we need the flash storage at
> all? We can write nice sequential chunks to normal disks just as fast,
> I presume.

Actually, an idea I had for clusters that are running many jobs at once
was essentially having the apps poll for IO capacity when doing a
checkpoint, so that they can avoid contending with other jobs that are
doing checkpoints at the same time. That way, an app might skip an
occasional checkpoint if the filesystem is busy, and instead compute
until the filesystem is less busy.

This would be equivalent to the client node being able to cache all of
the write data and flush it out in the background, so long as the time
to flush a single checkpoint never took longer than the time between
checkpoints.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
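A quick check of the arithmetic in Andreas's example; all figures are
the ones quoted in his message, and the per-OSC write cache limit he
refers to is presumably the osc max_dirty_mb tunable (an assumption on
my part).

/* write_cache_model.c - the numbers from the example above.
 * Everything here is taken from the message; nothing is measured.
 */
#include <stdio.h>

int main(void)
{
    double total_write_gb   = 1000 * 1.0;  /* 1000 processes x 1GB each   */
    double aggregate_gbs    = 10.0;        /* aggregate IO bandwidth      */
    double flush_time_s     = total_write_gb / aggregate_gbs;   /* ~100s  */

    double cached_write_s   = 1.0;         /* time to "write" into cache  */
    double background_s     = flush_time_s - cached_write_s;    /* ~99s   */

    double step_during_io_s = 0.1;         /* timestep cost while flushing */
    double steps_gained     = background_s / step_during_io_s;  /* ~990   */

    printf("flush takes %.0fs; with client-side caching the app gets\n"
           "about %.0f timesteps done during the background flush,\n"
           "versus none while blocked on the write.\n",
           flush_time_s, steps_gained);
    return 0;
}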
Oleg Drokin
2009-Mar-31 20:58 UTC
[Lustre-devel] Oleg/Mike Work on Apps Metrics - FW: Mike Booth week ending 2009.03.15
Hello!

On Mar 31, 2009, at 2:51 PM, Andreas Dilger wrote:
>>> This latter idea is the basis for the "flash cache" concept.
>>> Actually, I think it's worth exploring the economics of it in more
>>> detail.
>>
>> This turns out to be a very true assertion. We (I) do see a huge
>> delay in e.g. MPI barriers done immediately after a write.
>
> While this is true, I still believe that the amount of delay seen by
> the application cannot possibly be worse than waiting for all of the
> IO to complete. Also, the question is whether you are measuring the

Absolutely.

> FIRST MPI barrier after the write, vs e.g. the SECOND MPI barrier
> after the write? Since Lustre is currently aggressively flushing the
> write cache, the first MPI barrier is essentially waiting for all of
> the IO to complete, which is of course very slow. The real item of
> interest is how long the SECOND MPI barrier takes, which reflects the
> overhead of the Lustre IO on the network performance.

The second MPI barrier takes 1.5 seconds for me.

> It is impossible that Lustre IO completely saturates the entire
> cross-sectional bandwidth of the system OR the client CPUs, so getting
> some amount of computation for "free" during IO is still better than
> waiting for the IO to complete.

No arguments about that from me; I have been advocating this same thing
from the very beginning.

Bye,
    Oleg
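A minimal MPI sketch of the measurement being discussed - timing the
first and second MPI_Barrier immediately after a large buffered write.
The file path, per-rank write size, and the use of plain buffered
writes with no fsync are assumptions for illustration, not the harness
Oleg actually used.

/* barrier_after_write.c - each rank writes a large buffered chunk to
 * its own file, then times the FIRST and SECOND MPI_Barrier after the
 * write.  The first barrier absorbs the client's aggressive cache
 * flush; the second shows the residual interference of background IO.
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define CHUNK   (1 << 20)           /* 1MB per fwrite()            */
#define NCHUNKS 1024                /* 1GB per rank - assumption   */

int main(int argc, char **argv)
{
    int rank;
    char path[256], *buf;
    FILE *fp;
    double t0, t_first, t_second;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    buf = malloc(CHUNK);
    memset(buf, rank & 0xff, CHUNK);
    snprintf(path, sizeof(path), "/mnt/lustre/ckpt.%d", rank); /* assumed mount */

    fp = fopen(path, "w");
    for (int i = 0; i < NCHUNKS; i++)
        fwrite(buf, 1, CHUNK, fp);
    fclose(fp);                     /* buffered write; no explicit fsync */

    t0 = MPI_Wtime();
    MPI_Barrier(MPI_COMM_WORLD);
    t_first = MPI_Wtime() - t0;

    t0 = MPI_Wtime();
    MPI_Barrier(MPI_COMM_WORLD);
    t_second = MPI_Wtime() - t0;

    if (rank == 0)
        printf("first barrier %.3fs, second barrier %.3fs\n",
               t_first, t_second);

    free(buf);
    MPI_Finalize();
    return 0;
}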
di wang
2009-Apr-01 03:35 UTC
[Lustre-devel] Oleg/Mike Work on Apps Metrics - FW: Mike Booth week ending 2009.03.15
Hello,

Andreas Dilger wrote:
> If each compute timestep takes 0.1s during IO vs 0.01s without IO, you
> would get 990 timesteps during the write flush in the second case,
> before the cache was cleared, vs. none in the first case. I suspect
> that the overhead of the MPI communication on the Lustre IO is small,
> since the IO will be limited by the OST network and disk bandwidth,
> which is generally a small fraction of the cross-sectional bandwidth.
>
> This could be tested fairly easily with a real application that is
> doing computation between IO, instead of a benchmark that is only
> doing IO or only sleeping between IO, simply by increasing the per-OSC
> write cache limit from 32MB to e.g. 1GB in the above case (or 2GB to
> avoid the case where 2 processes on the same node are writing to the
> same OST). Then, measure the time taken for the application to do,
> say, 1M timesteps and 100 checkpoints with the 32MB and the 2GB write
> cache sizes.

Can we implement AIO here? For example, the AIO buffer could be treated
differently from other dirty buffers, i.e. not pushed aggressively to
the server. It seems that with buffered writes the user has to deal
with the filesystem buffer-cache behaviour in his application; I am not
sure that is good for them, and we may not even expose these features
to the application.

Thanks
WangDi
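For illustration, a minimal sketch of the POSIX AIO interface WangDi is
referring to - submit the checkpoint buffer with aio_write() and keep
computing while it drains. The path and sizes are assumptions, and (as
noted later in this thread) buffered AIO on Linux may still block at
submission.

/* aio_ckpt.c - sketch only: submit a checkpoint buffer asynchronously
 * and overlap computation with the flush.
 */
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define CKPT_SIZE (64 << 20)        /* 64MB - assumption */

static void compute_timestep(void) { /* application work would go here */ }

int main(void)
{
    char *buf = malloc(CKPT_SIZE);
    struct aiocb cb;
    int fd = open("/mnt/lustre/ckpt.aio", O_WRONLY | O_CREAT, 0644);

    memset(buf, 0xab, CKPT_SIZE);   /* pretend this is checkpoint state */
    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;
    cb.aio_buf    = buf;
    cb.aio_nbytes = CKPT_SIZE;
    cb.aio_offset = 0;

    if (aio_write(&cb) != 0) {      /* submit; nominally returns at once */
        perror("aio_write");
        return 1;
    }

    while (aio_error(&cb) == EINPROGRESS)
        compute_timestep();         /* overlap computation with the flush */

    printf("aio_write completed %zd bytes\n", aio_return(&cb));
    close(fd);
    free(buf);
    return 0;
}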
Michael Booth
2009-Apr-01 03:55 UTC
[Lustre-devel] Oleg/Mike Work on Apps Metrics - FW: Mike Booth week ending 2009.03.15
On Mar 31, 2009, at 11:35 PM, di wang wrote:
> Andreas Dilger wrote:
>> If each compute timestep takes 0.1s during IO vs 0.01s without IO,
>> you would get 990 timesteps during the write flush in the second
>> case, before the cache was cleared, vs. none in the first case. I
>> suspect that the overhead of the MPI communication on the Lustre IO
>> is small, since the IO will be limited by the OST network and disk
>> bandwidth, which is generally a small fraction of the
>> cross-sectional bandwidth.
>>
>> This could be tested fairly easily with a real application that is
>> doing computation between IO, instead of a benchmark that is only
>> doing IO or only sleeping between IO, simply by increasing the
>> per-OSC write cache limit from 32MB to e.g. 1GB in the above case
>> (or 2GB to avoid the case where 2 processes on the same node are
>> writing to the same OST). Then, measure the time taken for the
>> application to do, say, 1M timesteps and 100 checkpoints with the
>> 32MB and the 2GB write cache sizes.
>
> Can we implement AIO here? For example, the AIO buffer could be
> treated differently from other dirty buffers, i.e. not pushed
> aggressively to the server. It seems that with buffered writes the
> user has to deal with the filesystem buffer-cache behaviour in his
> application; I am not sure that is good for them, and we may not even
> expose these features to the application.
>
> Thanks
> WangDi

(My opinion.) The large size of the I/O requests put onto the SeaStar
by the Lustre client gives them an artificially high priority. Barriers
are just a few bytes; the I/Os from the client are in megabytes.
SeaStar has no priorities in its queue, but the amount of time it takes
to clear a megabyte-sized request results in a priority that has
thousands of times more impact on the hardware than the small
synchronization requests of many collectives. I am wondering whether
the interference of I/O with computation is more an artifact of message
size and bursts than of congestion or routing inefficiencies in
SeaStar.

If there are hundreds of megabytes of requests queued up on the
network, and there is no way to push a barrier or other small MPI
request up the queue, it is bound to create a disruption.

To borrow the elevator metaphor from Eric: if all the elevators are
queued up from 8:00 to 9:00 delivering office supplies on carts that
occupy the entire elevator, maybe the carts should be smaller, and
limited to a few per elevator trip.

Mike Booth
Oleg Drokin
2009-Apr-01 04:34 UTC
[Lustre-devel] Oleg/Mike Work on Apps Metrics - FW: Mike Booth week ending 2009.03.15
Hello!

On Mar 31, 2009, at 11:55 PM, Michael Booth wrote:
> (My opinion.) The large size of the I/O requests put onto the SeaStar
> by the Lustre client gives them an artificially high priority.
> Barriers are just a few bytes; the I/Os from the client are in
> megabytes. SeaStar has no priorities in its queue, but the amount of
> time it takes to clear a megabyte-sized request results in a priority
> that has thousands of times more impact on the hardware than the
> small synchronization requests of many collectives. I am wondering
> whether the interference of I/O with computation is more an artifact
> of message size and bursts than of congestion or routing
> inefficiencies in SeaStar.
>
> If there are hundreds of megabytes of requests queued up on the
> network, and there is no way to push a barrier or other small MPI
> request up the queue, it is bound to create a disruption.
>
> To borrow the elevator metaphor from Eric: if all the elevators are
> queued up from 8:00 to 9:00 delivering office supplies on carts that
> occupy the entire elevator, maybe the carts should be smaller, and
> limited to a few per elevator trip.

As we discussed in the past, just sending small I/O messages is going
to uncover all kinds of slowdowns all the way back to the disk storage,
and the collateral damage would be to other tasks that do need fast I/O
and do send big chunks of data.

Bye,
    Oleg
Eric Barton
2009-Apr-01 05:01 UTC
[Lustre-devel] Oleg/Mike Work on Apps Metrics - FW: Mike Booth week ending 2009.03.15
I'd really like to see measurements that confirm that allowing the
checkpoint I/O to overlap the application's compute phase really
delivers the estimated benefits. My concern is that Lustre and
application communications will interfere to the detriment of both and
end up being less efficient overall.

Mike, can you find 2 apps, one which is communications-intensive and
another that is only CPU-bound immediately after a checkpoint, and
measure them?

Cheers,
    Eric

> -----Original Message-----
> From: Andreas.Dilger at Sun.COM [mailto:Andreas.Dilger at Sun.COM] On Behalf Of Andreas Dilger
> Sent: 31 March 2009 7:51 PM
> To: Oleg Drokin
> Cc: Eric Barton; lustre-devel at lists.lustre.org; Michael.Booth at Sun.COM
> Subject: Re: [Lustre-devel] Oleg/Mike Work on Apps Metrics - FW: Mike Booth week ending 2009.03.15
Mike Booth
2009-Apr-01 05:08 UTC
[Lustre-devel] Oleg/Mike Work on Apps Metrics - FW: Mike Booth week ending 2009.03.15
Yes. But I do need to finish the assignment that Scott Klasky has for
me first.

Mike

Mike Booth
512 289-3805
mobile 512 692-9602

On Apr 1, 2009, at 1:01 AM, Eric Barton <eeb at sun.com> wrote:
> I'd really like to see measurements that confirm that allowing the
> checkpoint I/O to overlap the application's compute phase really
> delivers the estimated benefits. My concern is that Lustre and
> application communications will interfere to the detriment of both
> and end up being less efficient overall.
>
> Mike, can you find 2 apps, one which is communications-intensive and
> another that is only CPU-bound immediately after a checkpoint, and
> measure them?
>
> Cheers,
>     Eric
Michael Booth
2009-Apr-01 11:41 UTC
[Lustre-devel] Oleg/Mike Work on Apps Metrics - FW: Mike Booth week ending 2009.03.15
On Apr 1, 2009, at 12:34 AM, Oleg Drokin wrote:
> Hello!
>
> On Mar 31, 2009, at 11:55 PM, Michael Booth wrote:
>> (My opinion.) The large size of the I/O requests put onto the
>> SeaStar by the Lustre client gives them an artificially high
>> priority. Barriers are just a few bytes; the I/Os from the client
>> are in megabytes. SeaStar has no priorities in its queue, but the
>> amount of time it takes to clear a megabyte-sized request results in
>> a priority that has thousands of times more impact on the hardware
>> than the small synchronization requests of many collectives. I am
>> wondering whether the interference of I/O with computation is more
>> an artifact of message size and bursts than of congestion or routing
>> inefficiencies in SeaStar.
>>
>> If there are hundreds of megabytes of requests queued up on the
>> network, and there is no way to push a barrier or other small MPI
>> request up the queue, it is bound to create a disruption.
>>
>> To borrow the elevator metaphor from Eric: if all the elevators are
>> queued up from 8:00 to 9:00 delivering office supplies on carts that
>> occupy the entire elevator, maybe the carts should be smaller, and
>> limited to a few per elevator trip.
>
> As we discussed in the past, just sending small I/O messages is going
> to uncover all kinds of slowdowns all the way back to the disk
> storage, and the collateral damage would be to other tasks that do
> need fast I/O and do send big chunks of data.
>
> Bye,
>     Oleg

Don't take my explanation as a general suggestion for all I/O. It is a
suggestion for I/O taking place at times when high responsiveness is
needed for MPI. How to know when that need is high is another issue.

To over-extend the metaphor: can the office-supply carts be small, and
the number allowed on the elevator be limited, from 8:00 to 9:00 am
when the office people traffic needs high responsiveness?

It is clear to the application when it is doing synchronous I/O and
does not care so much about MPI response, and when it is in a stage
where collective response is important. For example: the barrier after
an fsync definitely wants the highest responsiveness for I/O requests.
After that barrier completes, MPI would likely need the highest
responsiveness, until the next synchronous user I/O call (not counting
printf's). It could even be an explicit call from the application to
the Lustre client to switch the priority between states.

Mike
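A purely hypothetical sketch of the explicit priority-switch call Mike
suggests. Neither hint_io_priority() nor the IO_PRIO_* states exist in
Lustre; they only stand in for the idea of the application telling the
client which phase it is in.

/* io_prio_hint.c - hypothetical sketch only; the hint function is a
 * stub for an interface that does not exist.
 */
#include <stdio.h>

enum io_prio { IO_PRIO_BULK, IO_PRIO_LATENCY };

/* In a real implementation this might become an ioctl or library call
 * on the Lustre client; here it only prints the application's intent. */
static void hint_io_priority(enum io_prio p)
{
    printf("app hint: %s\n",
           p == IO_PRIO_BULK ? "bulk I/O phase" : "latency-critical phase");
}

static void write_checkpoint(void) { /* write() + fsync() would go here */ }
static void exchange_halos(void)   { /* MPI collectives would go here   */ }

int main(void)
{
    /* Checkpoint phase: I/O response matters, MPI noise is tolerable. */
    hint_io_priority(IO_PRIO_BULK);
    write_checkpoint();

    /* Compute phase: collectives must not be choked by background flush. */
    hint_io_priority(IO_PRIO_LATENCY);
    for (int step = 0; step < 100; step++)
        exchange_halos();
    return 0;
}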
Andreas Dilger
2009-Apr-02 22:43 UTC
[Lustre-devel] Oleg/Mike Work on Apps Metrics - FW: Mike Booth week ending 2009.03.15
On Mar 31, 2009 23:55 -0400, Michael Booth wrote:
> On Mar 31, 2009, at 11:35 PM, di wang wrote:
>> Andreas Dilger wrote:
>>> If each compute timestep takes 0.1s during IO vs 0.01s without IO,
>>> you would get 990 timesteps during the write flush in the second
>>> case, before the cache was cleared, vs. none in the first case. I
>>> suspect that the overhead of the MPI communication on the Lustre IO
>>> is small, since the IO will be limited by the OST network and disk
>>> bandwidth, which is generally a small fraction of the
>>> cross-sectional bandwidth.
>>>
>>> This could be tested fairly easily with a real application that is
>>> doing computation between IO, instead of a benchmark that is only
>>> doing IO or only sleeping between IO, simply by increasing the
>>> per-OSC write cache limit from 32MB to e.g. 1GB in the above case
>>> (or 2GB to avoid the case where 2 processes on the same node are
>>> writing to the same OST). Then, measure the time taken for the
>>> application to do, say, 1M timesteps and 100 checkpoints with the
>>> 32MB and the 2GB write cache sizes.
>>
>> Can we implement AIO here? For example, the AIO buffer could be
>> treated differently from other dirty buffers, i.e. not pushed
>> aggressively to the server. It seems that with buffered writes the
>> user has to deal with the filesystem buffer-cache behaviour in his
>> application; I am not sure that is good for them, and we may not
>> even expose these features to the application.

I'm not sure what you mean. Implementing AIO is _more_ complex for the
application, and in essence the current IO is mostly async except when
the client hits the max dirty limit. The client will still flush the
dirty data in the background (despite Michael's experiment), it just
takes the VM some time to catch up.

Linux VM /proc tunables can be tweaked on the client to have it be more
aggressive about pushing out dirty data. I suspect they are currently
tuned for desktop workloads more than IO-intensive workloads.

On Mar 31, 2009 23:55 -0400, Michael Booth wrote:
> The large size of the I/O requests put onto the SeaStar by the Lustre
> client gives them an artificially high priority. Barriers are just a
> few bytes; the I/Os from the client are in megabytes. SeaStar has no
> priorities in its queue, but the amount of time it takes to clear a
> megabyte-sized request results in a priority that has thousands of
> times more impact on the hardware than the small synchronization
> requests of many collectives. I am wondering whether the interference
> of I/O with computation is more an artifact of message size and
> bursts than of congestion or routing inefficiencies in SeaStar.
>
> If there are hundreds of megabytes of requests queued up on the
> network, and there is no way to push a barrier or other small MPI
> request up the queue, it is bound to create a disruption.

Note that the Lustre IO REQUESTS are not very large in themselves
(under 512 bytes for a 1MB write); it is the bulk transfer that is
large. The LND code could definitely cooperate with the network
hardware to ensure that small requests get a decent share of the
network bandwidth, and Lustre itself would also benefit from this
(allowing e.g. lock requests to bypass the bulk IO traffic), but
whether the network hardware can do this in any manner is a separate
question.

> To borrow the elevator metaphor from Eric: if all the elevators are
> queued up from 8:00 to 9:00 delivering office supplies on carts that
> occupy the entire elevator, maybe the carts should be smaller, and
> limited to a few per elevator trip.

A better analogy would be that the elevators occasionally have a mail
cart with tens or hundreds of requests for office supplies, and this
can easily share the elevator with other workers. Having a separate
freight elevator to handle the supplies themselves is one way to do it;
having the elevators alternate people and supplies is another; but
cutting desks into small pieces so they can share space with people is
not an option.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
Michael Booth
2009-Apr-03 18:27 UTC
[Lustre-devel] Oleg/Mike Work on Apps Metrics - FW: Mike Booth week ending 2009.03.15
On Apr 2, 2009, at 6:43 PM, Andreas Dilger wrote:
> I'm not sure what you mean. Implementing AIO is _more_ complex for the
> application, and in essence the current IO is mostly async except when
> the client hits the max dirty limit. The client will still flush the
> dirty data in the background (despite Michael's experiment), it just
> takes the VM some time to catch up.

Even after an fsync?

The experiment was to see whether the dirty cache is in fact being
voided as designed:

    for (writesize = small; writesize < hundreds of megabytes; writesize += increment) {
        for (sixty iterations) {
            write   <---- writesize
            timer1
            fsync
            timer2
        }
    }

Run again, but sleep before timing the fsync:

    for (writesize = small; writesize < hundreds of megabytes; writesize += increment) {
        for (sixty iterations) {
            write   <---- writesize
            sleep(1 second)
            timer1
            fsync
            timer2
        }
    }

For each iteration, take the best time for each fsync, and plot the
speedup of the second routine's fsync over the non-slept fsync.

The results:

[attachment scrubbed: pastedGraphic.pdf, 26637 bytes -
http://lists.lustre.org/pipermail/lustre-devel/attachments/20090403/39d8a9fc/attachment-0002.pdf]

To me this is not behavior that is consistent with the design. Are
there tests for writes to assure that the cache is behaving as
designed?

Another way to view this is to calculate the amount of data that is
moved during the 1 second of sleep:

[attachment scrubbed: pastedGraphic.pdf, 27283 bytes -
http://lists.lustre.org/pipermail/lustre-devel/attachments/20090403/39d8a9fc/attachment-0003.pdf]

This seems to show a "bug" in how much dirty data is shipped off to
disk. Again, I don't think the cache is clearing the way it was
designed to.

> Linux VM /proc tunables can be tweaked on the client to have it be
> more aggressive about pushing out dirty data. I suspect they are
> currently tuned for desktop workloads more than IO-intensive
> workloads.

I would not characterize these as I/O intensive; they are large bulk
sequential writes. The cache design assumes good performance when most
of what is written is also read back right away - yes, that is good
for desktops, but it really gets in the way of checkpoint/restart and
data-dumping scientific applications.

> On Mar 31, 2009 23:55 -0400, Michael Booth wrote:
>> The large size of the I/O requests put onto the SeaStar by the
>> Lustre client gives them an artificially high priority. Barriers are
>> just a few bytes; the I/Os from the client are in megabytes. SeaStar
>> has no priorities in its queue, but the amount of time it takes to
>> clear a megabyte-sized request results in a priority that has
>> thousands of times more impact on the hardware than the small
>> synchronization requests of many collectives. I am wondering whether
>> the interference of I/O with computation is more an artifact of
>> message size and bursts than of congestion or routing inefficiencies
>> in SeaStar.
>>
>> If there are hundreds of megabytes of requests queued up on the
>> network, and there is no way to push a barrier or other small MPI
>> request up the queue, it is bound to create a disruption.
>
> Note that the Lustre IO REQUESTS are not very large in themselves
> (under 512 bytes for a 1MB write); it is the bulk transfer that is
> large. The LND code could definitely cooperate with the network
> hardware to ensure that small requests get a decent share of the
> network bandwidth, and Lustre itself would also benefit from this
> (allowing e.g. lock requests to bypass the bulk IO traffic), but
> whether the network hardware can do this in any manner is a separate
> question.

Some codes see an 80% slowdown after writes due to collectives that are
choked out by background I/O.

>> To borrow the elevator metaphor from Eric: if all the elevators are
>> queued up from 8:00 to 9:00 delivering office supplies on carts that
>> occupy the entire elevator, maybe the carts should be smaller, and
>> limited to a few per elevator trip.
>
> A better analogy would be that the elevators occasionally have a mail
> cart with tens or hundreds of requests for office supplies, and this
> can easily share the elevator with other workers. Having a separate
> freight elevator to handle the supplies themselves is one way to do
> it; having the elevators alternate people and supplies is another;
> but cutting desks into small pieces so they can share space with
> people is not an option.

I don't want to cut the desks up; I just want to be sure that small
collective messages are not slowed down by orders of magnitude by what
should be a background process.

> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.

Mike
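A minimal C sketch of the fsync timing experiment described above; the
mount point, maximum write size, and iteration counts are placeholders
rather than the original harness.

/* fsync_timing.c - write a chunk, optionally sleep 1s, then time how
 * long fsync() takes.  Short writes and error handling are ignored for
 * brevity.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

static double now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(int argc, char **argv)
{
    int do_sleep = (argc > 1);          /* any argument enables the 1s sleep */
    size_t max = 256UL << 20;           /* up to 256MB per write - placeholder */
    char *buf = malloc(max);
    int fd = open("/mnt/lustre/fsync_test", O_WRONLY | O_CREAT | O_TRUNC, 0644);

    memset(buf, 0x5a, max);
    for (size_t writesize = 1 << 20; writesize <= max; writesize *= 2) {
        double best = 1e9;
        for (int i = 0; i < 60; i++) {
            write(fd, buf, writesize);
            if (do_sleep)
                sleep(1);               /* give background writeback a chance */
            double t1 = now();
            fsync(fd);                  /* time how much dirty data remains */
            double t2 = now();
            if (t2 - t1 < best)
                best = t2 - t1;
        }
        printf("%zu bytes: best fsync %.4fs%s\n",
               writesize, best, do_sleep ? " (with 1s sleep)" : "");
    }
    close(fd);
    free(buf);
    return 0;
}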
di wang
2009-Apr-06 22:12 UTC
[Lustre-devel] Oleg/Mike Work on Apps Metrics - FW: Mike Booth week ending 2009.03.15
Hello,

Andreas Dilger wrote:
> I'm not sure what you mean. Implementing AIO is _more_ complex for the
> application, and in essence the current IO is mostly async except when
> the client hits the max dirty limit. The client will still flush the
> dirty data in the background (despite Michael's experiment), it just
> takes the VM some time to catch up.
>
> Linux VM /proc tunables can be tweaked on the client to have it be
> more aggressive about pushing out dirty data. I suspect they are
> currently tuned for desktop workloads more than IO-intensive
> workloads.

I am not sure the current IO is "async" enough, since it still includes
some synchronous processing - for example locks, reads for partial
pages, and other stack overhead in commit_write - and sometimes that
overhead cannot be ignored. For example, even without hitting the max
dirty limit, you can get quite different write times for writing the
same data. I guess part of the reason is that the VM is simply "out of
control".

With AIO:

1) The application can skip this synchronous processing, for example by
creating a daemon to do that routine work.

2) We can control the page writes (sent to the OST) ourselves, instead
of relying on the VM.

3) These AIO pages would not need to comply with the max dirty limit in
the submit_io (AIO) process, so the user application gets "real"
memory-speed writing.

Yes, AIO will indeed be more complex for the application, but not that
much, IMHO.

Thanks
Wangdi
Andreas Dilger
2009-Apr-07 07:54 UTC
[Lustre-devel] Oleg/Mike Work on Apps Metrics - FW: Mike Booth week ending 2009.03.15
On Apr 06, 2009 18:12 -0400, di wang wrote:
> I am not sure the current IO is "async" enough, since it still
> includes some synchronous processing - for example locks, reads for
> partial pages, and other stack overhead in commit_write - and
> sometimes that overhead cannot be ignored. For example, even without
> hitting the max dirty limit, you can get quite different write times
> for writing the same data. I guess part of the reason is that the VM
> is simply "out of control".
>
> With AIO:
>
> 1) The application can skip this synchronous processing, for example
> by creating a daemon to do that routine work.
>
> 2) We can control the page writes (sent to the OST) ourselves, instead
> of relying on the VM.
>
> 3) These AIO pages would not need to comply with the max dirty limit
> in the submit_io (AIO) process, so the user application gets "real"
> memory-speed writing.
>
> Yes, AIO will indeed be more complex for the application, but not
> that much, IMHO.

There was just a discussion about Linux AIO today (from the former
Linux AIO maintainer, as it is currently unmaintained), and his
statement was that AIO is not really async at all: it will block, for
local filesystems, on any file allocation or any buffered IO. Only
preallocated files with O_DIRECT even have a chance at AIO, and even
that will not always work.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
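A minimal sketch of the preconditions Andreas mentions - a preallocated
file opened with O_DIRECT and a suitably aligned buffer - under which
AIO at least has a chance of being asynchronous. The alignment, size,
and path are assumptions, and whether this actually avoids blocking on
any given filesystem is exactly the open question above.

/* direct_aio.c - sketch of preallocation + O_DIRECT + aligned buffer
 * before submitting an asynchronous write.
 */
#define _GNU_SOURCE
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define ALIGN  4096                 /* assumed page/filesystem alignment */
#define NBYTES (16 << 20)           /* 16MB - assumption */

int main(void)
{
    void *buf;
    struct aiocb cb;
    int fd = open("/mnt/lustre/direct_ckpt", O_WRONLY | O_CREAT | O_DIRECT, 0644);

    posix_fallocate(fd, 0, NBYTES); /* preallocate: no block allocation at IO time */
    posix_memalign(&buf, ALIGN, NBYTES);
    memset(buf, 0, NBYTES);

    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;
    cb.aio_buf    = buf;
    cb.aio_nbytes = NBYTES;         /* must be a multiple of ALIGN for O_DIRECT */
    cb.aio_offset = 0;

    if (aio_write(&cb) != 0) {
        perror("aio_write");
        return 1;
    }
    while (aio_error(&cb) == EINPROGRESS)
        usleep(1000);               /* application work could go here instead */
    printf("wrote %zd bytes via O_DIRECT AIO\n", aio_return(&cb));

    close(fd);
    free(buf);
    return 0;
}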