Oleg Drokin
2009-Mar-04 17:14 UTC
[Lustre-devel] Fast checkpoints in Lustre today, at essentially zero cost.
Hello!

I was just having a discussion with Mike Booth about how application programmers are willing to sacrifice the robustness of their checkpoint files, and to fall back to the previous version if the last one did not make it to disk, if only the checkpointing itself could be done quickly (even if not persistently). I also remembered how we discussed using ramdisks on the compute nodes and then having some background agent copy the data to Lustre later on, while the application is happily computing again.

It suddenly occurred to me (duh!) that we can get all of that for free with the buffer cache right now. We are artificially limiting our dirty memory to 32MB per OSC on every node (which is probably way too low today anyway), but if we lift that limit significantly (or remove it altogether), such checkpointing applications (and I verified that many of them use 10-20% of RAM for checkpointing) would benefit tremendously (as long as they do not fsync at the end of the checkpoint, of course).

There is currently a way to achieve the same goal, but it requires selecting a very suboptimal stripe pattern: striping the file across as many OSTs as possible with a small stripe size to maximize the allowed dirty cache in use, at the expense of a lot of seeking on the OSTs, since every client would then be writing to every OST.

The old justification for the small dirty memory limit had to do with lock timeouts and such, but now that we have lock prolonging and indefinite waiting for locks on the clients, there is no reason to keep limiting ourselves, I think.

I plan to speak with ORNL later today to conduct an experiment (if they agree) and lift the dirty memory limit on some of the filesystems often used for scratch files, to see what the effect would be. I expect it to be very positive myself. At the same time, Mike is doing a test right now with some real-world applications plus the above-mentioned suboptimal striping pattern to see the effects as well.

I think this lets us take advantage of the huge amount of otherwise wasted memory on compute nodes for caching, and benefit many applications with this checkpointing model (essentially superseding the old "flash-cache" idea at a fraction of the cost and effort).

Bye,
    Oleg
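[For reference, a minimal command-line sketch of the two approaches described above, assuming 1.6/1.8-era tunables and lfs options; the path and the values are illustrative only:

    # check the current per-OSC dirty cache limit (32MB per OSC by default)
    client# lctl get_param osc.*.max_dirty_mb

    # striping workaround: spread a checkpoint directory across all OSTs
    # with a small stripe size, to multiply the total usable dirty cache
    client# lfs setstripe -c -1 -s 1m /lustre/scratch/checkpoints

    # direct approach: raise the per-OSC dirty limit itself (e.g. to 256MB)
    client# lctl set_param osc.*.max_dirty_mb=256
]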
Andreas Dilger
2009-Mar-05 20:00 UTC
[Lustre-devel] Fast checkpoints in Lustre today, at essentially zero cost.
On Mar 04, 2009 12:14 -0500, Oleg Drokin wrote:
> It suddenly occurred to me (duh!) that we can get all of that for
> free with the buffer cache right now. We are artificially limiting our
> dirty memory to 32MB per OSC on every node (which is probably way too
> low today anyway), but if we lift that limit significantly (or remove
> it altogether), such checkpointing applications (and I verified that
> many of them use 10-20% of RAM for checkpointing) would benefit
> tremendously (as long as they do not fsync at the end of the
> checkpoint, of course).
>
> There is currently a way to achieve the same goal, but it requires
> selecting a very suboptimal stripe pattern: striping the file across
> as many OSTs as possible with a small stripe size to maximize the
> allowed dirty cache in use, at the expense of a lot of seeking on the
> OSTs, since every client would then be writing to every OST.

We don't need to go to the sub-optimal striping to get this result, as that causes not only lots of seeking on the OSTs, but also requires the clients to get locks on every OST. Instead it is possible today to just increase this limit to be much larger via /proc tunings on the client for testing (assume 1/2 of RAM is large enough):

    client# lctl set_param osc.*.max_dirty_mb=$((ramsize / 2))

or the cache limit can be increased permanently on all of the clients via conf_param (sorry, syntax may not be 100% correct):

    mgs# lctl --device ${mgsdevno} conf_param fsname.osc.max_dirty_mb=$((ramsize / 2))

> The old justification for the small dirty memory limit had to do with
> lock timeouts and such, but now that we have lock prolonging and
> indefinite waiting for locks on the clients, there is no reason to
> keep limiting ourselves, I think.

You are probably correct. At times we have discussed using the client grant to manage the amount of dirty data that clients can have, so that we don't get 5TB of dirty data on the clients for a single 100MB/s OST before trying to flush the data. You may be right that with lock extension and glimpse ASTs we may just be better off allowing the clients to fill the cache as needed.

One possible downside is that when multiple clients are writing to the same file, if the first client to get the lock (a full [0-EOF] lock) dumps a huge amount of dirty data under it, none of the other clients will even be able to get a lock and start writing until the first client is finished. I think this shows up in single-shared-file (SSF) IOR testing today, where a 4MB write chunk size is slower than 1MB, because the clients need to flush 4MB of data before their lock can be revoked and split, instead of just 1MB. Having lock conversion allow the client to shrink or split the lock would avoid this contention.

> At the same time, Mike is doing a test right now with some real-world
> applications plus the above-mentioned suboptimal striping pattern to
> see the effects as well.
>
> I think this lets us take advantage of the huge amount of otherwise
> wasted memory on compute nodes for caching, and benefit many
> applications with this checkpointing model (essentially superseding
> the old "flash-cache" idea at a fraction of the cost and effort).

Sure, as long as apps are not impacted by the increased memory usage, since client nodes generally do not have swap, nor would we want to swap out the application to cache the checkpoint data.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
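[For testing, the effect of raising the limit can be watched per OSC while a checkpoint is being written; a sketch, assuming the usual osc proc entries are available on this release (names may vary slightly):

    # dirty data each OSC is currently caching, against its configured limit
    client# lctl get_param osc.*.cur_dirty_bytes osc.*.max_dirty_mb

    # write grant the OSTs have currently extended to this client,
    # relevant to the grant-based flow control mentioned above
    client# lctl get_param osc.*.cur_grant_bytes
]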
Oleg Drokin
2009-Mar-05 20:35 UTC
[Lustre-devel] Fast checkpoints in Lustre today, at essentially zero cost.
Hello!

On Mar 5, 2009, at 3:00 PM, Andreas Dilger wrote:
> We don't need to go to the sub-optimal striping to get this result,
> as that causes not only lots of seeking on the OSTs, but also requires
> the clients to get locks on every OST. Instead it is possible today
> to just increase this limit to be much larger via /proc tunings on
> the client for testing (assume 1/2 of RAM is large enough):
>
>     client# lctl set_param osc.*.max_dirty_mb=$((ramsize / 2))

Of course! But I am speaking of a situation like, say, ORNL, where users cannot control this setting directly.

> One possible downside is that when multiple clients are writing to the
> same file, if the first client to get the lock (a full [0-EOF] lock)
> dumps a huge amount of dirty data under it, none of the other clients
> will even be able to get a lock and start writing until the first
> client is finished.

This is an unlikely case, since as soon as the lock is being cancelled it can no longer be rematched.

> I think this shows up in single-shared-file (SSF) IOR testing today,
> where a 4MB write chunk size is slower than 1MB, because the clients
> need to flush 4MB of data before their lock can be revoked and split,
> instead of just 1MB. Having lock conversion allow the client to shrink
> or split the lock would avoid this contention.

I would think the reason is different, since a 4MB memory copy is essentially very small, and chances are the client was able to issue the write syscall many times per lock acquisition. But this is pure speculation on my side.

Bye,
    Oleg
Andreas Dilger
2009-Mar-05 21:16 UTC
[Lustre-devel] Fast checkpoints in Lustre today, at essentially zero cost.
On Mar 05, 2009 15:35 -0500, Oleg Drokin wrote:
> On Mar 5, 2009, at 3:00 PM, Andreas Dilger wrote:
>> We don't need to go to the sub-optimal striping to get this result,
>> as that causes not only lots of seeking on the OSTs, but also requires
>> the clients to get locks on every OST. Instead it is possible today
>> to just increase this limit to be much larger via /proc tunings on
>> the client for testing (assume 1/2 of RAM is large enough):
>>
>>     client# lctl set_param osc.*.max_dirty_mb=$((ramsize / 2))
>
> Of course! But I am speaking of a situation like, say, ORNL, where
> users cannot control this setting directly.

But if this is just testing, then we may as well avoid the extra overhead and apples-to-oranges comparison of single-stripe vs. many-stripe files, and allow single-stripe files to cache a lot of data.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.