Oleg Drokin
2009-Mar-04 17:14 UTC
[Lustre-devel] Fast checkpoints in Lustre today, at essentially zero cost.
Hello!

I was just having a discussion with Mike Booth about how application programmers are willing to sacrifice the robustness of their checkpoint files, and to fall back to the previous version if the last one did not make it to disk, if only the checkpointing itself could be done quickly (even if not persistently). I also remembered how we discussed using ramdisks on the compute nodes and then having some background agent copy the data to Lustre later on, while the application is happily computing again.

It suddenly occurred to me (duh!) that we can get all of that for free with the buffer cache right now. We are artificially limiting our dirty memory to 32MB per OSC on every node (which is probably way too low today anyway), but if we lift that limit significantly (or remove it altogether), such checkpointing applications (and I verified that many of them use 10-20% of RAM for checkpointing) would benefit tremendously (as long as they do not fsync at the end of the checkpoint, of course).

There is currently a way to achieve the same goal, but it requires selecting a very suboptimal stripe pattern: striping the file across as many OSTs as possible with a small stripe size to maximize the allowed dirty cache in use, at the expense of a lot of seeking on the OSTs, since every client would then be writing to every OST.

The old justification for the small dirty memory limit had to do with lock timeouts and such, but now that we have lock prolonging and indefinite waiting for locks on the clients, there is no reason to keep limiting ourselves, I think.

I plan to speak with ORNL later today to conduct an experiment (if they agree) and lift the dirty memory limit on some of the filesystems often used for scratch files, to see what the effect would be. I expect it to be very positive myself. At the same time, Mike is doing a test right now with some real-world applications plus the above-mentioned suboptimal striping pattern to see the effects as well.

I think this lets us take advantage of the huge amount of otherwise wasted memory on compute nodes for caching, and benefit many applications with this checkpointing model (essentially superseding the old "flash-cache" idea at a fraction of the cost and effort).

Bye,
    Oleg
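[For reference, a minimal command-line sketch of the two approaches described above, assuming 1.6/1.8-era tunables and lfs options; the path and the values are illustrative only:

    # check the current per-OSC dirty cache limit (32MB per OSC by default)
    client# lctl get_param osc.*.max_dirty_mb

    # striping workaround: spread a checkpoint directory across all OSTs
    # with a small stripe size, to multiply the total usable dirty cache
    client# lfs setstripe -c -1 -s 1m /lustre/scratch/checkpoints

    # direct approach: raise the per-OSC dirty limit itself (e.g. to 256MB)
    client# lctl set_param osc.*.max_dirty_mb=256
]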
Andreas Dilger
2009-Mar-05 20:00 UTC
[Lustre-devel] Fast checkpoints in Lustre today, at essentially zero cost.
On Mar 04, 2009 12:14 -0500, Oleg Drokin wrote:
> It suddenly occurred to me (duh!) that we can get all of that for
> free with the buffer cache right now. We are artificially limiting our
> dirty memory to 32MB per OSC on every node (which is probably way too
> low today anyway), but if we lift that limit significantly (or remove
> it altogether), such checkpointing applications (and I verified that
> many of them use 10-20% of RAM for checkpointing) would benefit
> tremendously (as long as they do not fsync at the end of the
> checkpoint, of course).
>
> There is currently a way to achieve the same goal, but it requires
> selecting a very suboptimal stripe pattern: striping the file across
> as many OSTs as possible with a small stripe size to maximize the
> allowed dirty cache in use, at the expense of a lot of seeking on the
> OSTs, since every client would then be writing to every OST.

We don't need to go to the sub-optimal striping to get this result, as that causes not only lots of seeking on the OSTs, but also requires the clients to get locks on every OST. Instead it is possible today to just increase this limit to be much larger via /proc tunings on the client for testing (assume 1/2 of RAM is large enough):

    client# lctl set_param osc.*.max_dirty_mb=$((ramsize / 2))

or the cache limit can be increased permanently on all of the clients via conf_param (sorry, syntax may not be 100% correct):

    mgs# lctl --device ${mgsdevno} conf_param fsname.osc.max_dirty_mb=$((ramsize / 2))

> The old justification for the small dirty memory limit had to do with
> lock timeouts and such, but now that we have lock prolonging and
> indefinite waiting for locks on the clients, there is no reason to
> keep limiting ourselves, I think.

You are probably correct. At times we have discussed using the client grant to manage the amount of dirty data that clients can have, so that we don't get 5TB of dirty data on the clients for a single 100MB/s OST before trying to flush the data. You may be right that with lock extension and glimpse ASTs we may just be better off allowing the clients to fill the cache as needed.

One possible downside is that when multiple clients are writing to the same file, if the first client to get the lock (a full [0-EOF] lock) dumps a huge amount of dirty data under it, none of the other clients will even be able to get a lock and start writing until the first client is finished. I think this shows up in single-shared-file (SSF) IOR testing today, where a 4MB write chunk size is slower than 1MB, because the clients need to flush 4MB of data before their lock can be revoked and split, instead of just 1MB. Having lock conversion allow the client to shrink or split the lock would avoid this contention.

> At the same time, Mike is doing a test right now with some real-world
> applications plus the above-mentioned suboptimal striping pattern to
> see the effects as well.
>
> I think this lets us take advantage of the huge amount of otherwise
> wasted memory on compute nodes for caching, and benefit many
> applications with this checkpointing model (essentially superseding
> the old "flash-cache" idea at a fraction of the cost and effort).

Sure, as long as apps are not impacted by the increased memory usage, since client nodes generally do not have swap, nor would we want to swap out the application to cache the checkpoint data.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
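[For testing, the effect of raising the limit can be watched per OSC while a checkpoint is being written; a sketch, assuming the usual osc proc entries are available on this release (names may vary slightly):

    # dirty data each OSC is currently caching, against its configured limit
    client# lctl get_param osc.*.cur_dirty_bytes osc.*.max_dirty_mb

    # write grant the OSTs have currently extended to this client,
    # relevant to the grant-based flow control mentioned above
    client# lctl get_param osc.*.cur_grant_bytes
]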
Oleg Drokin
2009-Mar-05 20:35 UTC
[Lustre-devel] Fast checkpoints in Lustre today, at essentially zero cost.
Hello!

On Mar 5, 2009, at 3:00 PM, Andreas Dilger wrote:
> We don't need to go to the sub-optimal striping to get this result,
> as that causes not only lots of seeking on the OSTs, but also requires
> the clients to get locks on every OST. Instead it is possible today
> to just increase this limit to be much larger via /proc tunings on
> the client for testing (assume 1/2 of RAM is large enough):
>
>     client# lctl set_param osc.*.max_dirty_mb=$((ramsize / 2))

Of course! But I am speaking of a situation like, say, ORNL, where users cannot control this setting directly.

> One possible downside is that when multiple clients are writing to the
> same file, if the first client to get the lock (a full [0-EOF] lock)
> dumps a huge amount of dirty data under it, none of the other clients
> will even be able to get a lock and start writing until the first
> client is finished.

This is an unlikely case, since as soon as the lock is being cancelled it can no longer be rematched.

> I think this shows up in single-shared-file (SSF) IOR testing today,
> where a 4MB write chunk size is slower than 1MB, because the clients
> need to flush 4MB of data before their lock can be revoked and split,
> instead of just 1MB. Having lock conversion allow the client to shrink
> or split the lock would avoid this contention.

I would think the reason is different, since a 4MB memory copy is essentially very small, and chances are the client was able to issue the write syscall many times per lock acquisition. But this is pure speculation on my side.

Bye,
    Oleg
Andreas Dilger
2009-Mar-05 21:16 UTC
[Lustre-devel] Fast checkpoints in Lustre today, at essentially zero cost.
On Mar 05, 2009 15:35 -0500, Oleg Drokin wrote:
> On Mar 5, 2009, at 3:00 PM, Andreas Dilger wrote:
>> We don't need to go to the sub-optimal striping to get this result,
>> as that causes not only lots of seeking on the OSTs, but also requires
>> the clients to get locks on every OST. Instead it is possible today
>> to just increase this limit to be much larger via /proc tunings on
>> the client for testing (assume 1/2 of RAM is large enough):
>>
>>     client# lctl set_param osc.*.max_dirty_mb=$((ramsize / 2))
>
> Of course! But I am speaking of a situation like, say, ORNL, where
> users cannot control this setting directly.

But if this is just testing, then we may as well avoid the extra overhead and apples-to-oranges comparison of single-stripe vs. many-stripe files, and allow single-stripe files to cache a lot of data.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.