Graham, Simon
2007-Feb-06 15:43 UTC
[Xen-devel] Reducing impact of save/restore/dump on Dom0
Currently, save, restore and dump all use cached I/O in Dom0 to write/read the file containing the memory image of the DomU; when the memory assigned to the DomU is greater than free memory in Dom0, this leads to severe memory thrashing and generally the Dom0 performance goes into the toilet.

The 'classic' answer to avoiding this when writing very large files is, of course, to use non-cached I/O to manipulate the files; this introduces the restriction that reads and writes have to be at sector-aligned offsets and be an integral number of sectors in length.

I've been working on a prototype for using O_DIRECT for the dump case; this one is actually easier to do because the on-disk format of a dump file already has the memory pages aligned to sector boundaries, so only the dump header has to be handled specially. Attached is a patch to unstable for this code, which makes the following changes to tools/libxc/xc_core.c:

1. xc_domain_dumpcore_via_callback is modified to ensure that the buffer containing page data to be dumped is page aligned (using mmap instead of malloc).
2. xc_domain_dumpcore is modified to open the file O_DIRECT and to buffer writes issued by xc_domain_dumpcore_via_callback as needed.

I'd welcome comments on this prior to submitting it for inclusion. I know that there is a lot of discussion around O_DIRECT being a bad idea in general and that the POSIX fadvise() call should provide the hints to tell Linux that a file's cache doesn't need to be maintained, but from looking at the 2.6.16 & 2.6.18 sources I believe this call is not fully implemented. I also think that this is a case where you never want to use the file cache at all.
Solving this problem for save/restore is more tricky because the on-disk/wire save format does not force the page data to be page aligned. My proposal would be to page align each batch of pages written, leaving pad between the batch header and the pages themselves, but I realize that a change in on-disk/wire format is a potential compatibility problem.

Simon

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
John Levon
2007-Feb-06 16:23 UTC
Re: [Xen-devel] Reducing impact of save/restore/dump on Dom0
On Tue, Feb 06, 2007 at 10:43:41AM -0500, Graham, Simon wrote:

> Index: trunk/tools/libxc/xc_core.c
> ===================================================================
> --- trunk/tools/libxc/xc_core.c (revision 9948)
> +++ trunk/tools/libxc/xc_core.c (working copy)
> @@ -1,7 +1,13 @@
>  #include "xg_private.h"
>  #include <stdlib.h>
>  #include <unistd.h>
> +#include <fcntl.h>
> +#ifndef O_DIRECT
> +#define O_DIRECT 040000
> +#endif

O_DIRECT doesn't exist on Solaris. The closest equivalent is directio(fd, DIRECTIO_ON). Either way you shouldn't be defining it yourself? And you should expect this to be able to fail, on Solaris at least.

> + dump_mem_start = mmap(0, PAGE_SIZE*DUMP_INCREMENT,
> +                       PROT_READ|PROT_WRITE,
> +                       MAP_PRIVATE|MAP_ANONYMOUS, 0, 0);

Please use the more portable MAP_ANON.

regards
john
Graham, Simon
2007-Feb-06 17:25 UTC
RE: [Xen-devel] Reducing impact of save/restore/dump on Dom0
Thanks for the comments:

> +#ifndef O_DIRECT
> +#define O_DIRECT 040000
> +#endif
>
> O_DIRECT doesn't exist on Solaris. The closest equivalent is
> directio(fd, DIRECTIO_ON). Either way you shouldn't be defining it
> yourself?

Well, I would agree with you except that it ends up not being defined when I build on RHEL4/U2! It seems to me that O_DIRECT is something that folks really don't want me to be able to use! Anyone know of a generic mechanism that works everywhere for enabling this? If not, I guess I'll have to add #ifdef's... yuck!

What I can do is fix the problem on Linux and make it behave the same old way on everything else (that I can't build/test for); other people can then add platform-specific implementations as needed...

> And you should expect this to be able to fail, on Solaris at least.
>
> + dump_mem_start = mmap(0, PAGE_SIZE*DUMP_INCREMENT,
> +                       PROT_READ|PROT_WRITE,
> +                       MAP_PRIVATE|MAP_ANONYMOUS, 0, 0);
>
> Please use the more portable MAP_ANON.

Will do.

Simon
Iustin Pop
2007-Feb-06 17:36 UTC
Re: [Xen-devel] Reducing impact of save/restore/dump on Dom0
On Tue, Feb 06, 2007 at 10:43:41AM -0500, Graham, Simon wrote:

> Currently, save, restore and dump all use cached I/O in Dom0 to
> write/read the file containing the memory image of the DomU - when the
> memory assigned to the DomU is greater than free memory in Dom0, this
> leads to severe memory thrashing and generally the Dom0 performance
> goes into the toilet.

On Linux, there is another way besides O_DIRECT: it's possible to reduce the memory used for cached writing using the sysctl vm.dirty_ratio; I've used this to reduce the impact of heavy cached writes on dom0 with good results. Maybe it helps for save/dump also...

Regards,
Iustin
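For anyone wanting to try this, the knob also lives in /proc; a quick sketch of inspecting and (as root) lowering it — the value 10 here is just Iustin's example, not a recommendation:

```shell
# Read the current dirty-page ratio (a percentage of total memory;
# 40 is a common 2.6-era default).
cat /proc/sys/vm/dirty_ratio

# To lower it (requires root; lasts until reboot unless persisted
# in /etc/sysctl.conf):
#   sysctl -w vm.dirty_ratio=10
```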
Graham, Simon
2007-Feb-06 17:46 UTC
RE: [Xen-devel] Reducing impact of save/restore/dump on Dom0
> On Linux, there is another way besides O_DIRECT: it's possible to
> reduce the memory used for cached writing using the sysctl
> vm.dirty_ratio; I've used this to reduce the impact of heavy cached
> writes on dom0 with good results.
>
> Maybe it helps for save/dump also...

Oh interesting -- I shall look into this.

I just took a quick peek and it is set to 40% in Dom0; I do see free memory go to zero during a dump (and save/restore), plus I see the waitIO% go to near 100% and Dom0 becomes somewhat unresponsive; specifically, what we see is that domain boot fails during this time because of a XenBus timeout waiting for the hotplug scripts to finish adding the VBD in Dom0.

I still feel that dump/save/restore files really don't belong in the system cache at all since they just pollute the cache for no good reason.

Thanks,
Simon
Iustin Pop
2007-Feb-06 18:35 UTC
Re: [Xen-devel] Reducing impact of save/restore/dump on Dom0
On Tue, Feb 06, 2007 at 12:46:52PM -0500, Graham, Simon wrote:

> Oh interesting -- I shall look into this.
>
> I just took a quick peek and it is set to 40% in Dom0; I do see free
> memory go to zero during a dump (and save/restore) plus I see the
> waitIO% go to near 100% and Dom0 becomes somewhat unresponsive;
> specifically what we see is that domain boot fails during this time
> because of a XenBus timeout waiting for the hotplug scripts to finish
> adding the VBD in Dom0.

Yes, that's more or less expected. I've used 10% (you can't go below 5%, a hardcoded limit in the kernel) and then, for a 512MB dom0, only ~25MB of cache will be used. I would hardly say that 25MB is too much.

> I still feel that dump/save/restore files really don't belong in the
> system cache at all since they just pollute the cache for no good
> reason.

Then there is also posix_fadvise, which is more or less what you need to use in case you worry about your cache. I haven't used it, but I've heard from people that using fadvise with POSIX_FADV_DONTNEED+fsync or O_SYNC in batches can reduce your cache usage.

Just a few thoughts, as these don't change the way you do writes, as opposed to O_DIRECT.

Regards,
Iustin
Graham, Simon
2007-Feb-06 19:21 UTC
RE: [Xen-devel] Reducing impact of save/restore/dump on Dom0
> Yes, that's more or less expected. I've used 10% (you can't go below
> 5%, a hardcoded limit in the kernel) and then, for a 512MB dom0, only
> ~25MB of cache will be used. I would hardly say that 25MB is too much.

Well, I think it means only 25MB of dirty cache is allowed before writes become synchronous; you will still use all of memory for the cache, it will just be cleaned earlier and therefore available for reuse, plus the penalty for the write moves to the writer rather than to everyone else... Still, I agree it's worth experimenting with (and I intend to).

> > I still feel that dump/save/restore files really don't belong in the
> > system cache at all since they just pollute the cache for no good
> > reason.
>
> Then there is also posix_fadvise, which is more or less what you need
> to use in case you worry about your cache. I haven't used it, but I've
> heard from people that using fadvise with POSIX_FADV_DONTNEED+fsync or
> O_SYNC in batches can reduce your cache usage.
>
> Just a few thoughts, as these don't change the way you do writes, as
> opposed to O_DIRECT.

I certainly would prefer this too; I hadn't considered using POSIX_FADV_DONTNEED/fsync in the loop...

Thanks for the suggestions,
Simon
Ian Pratt
2007-Feb-06 23:00 UTC
RE: [Xen-devel] Reducing impact of save/restore/dump on Dom0
> Solving this problem for save/restore is more tricky because the
> on-disk/wire save format does not force the page data to be page
> aligned -- my proposal would be to page align each batch of pages
> written, leaving pad between the batch header and the pages themselves
> but I realize that a change in on-disk/wire format is a potential
> compatibility problem.

The on-disk format is due to change before 3.0.5 anyhow. Page aligning the data pages is certainly something I'd like to see happen. The easiest way of doing this is to add some padding to align things at the start of the page write-out loop, then in the PV case, make sure that the page-type batches are padded to page size. (I'd reduce the max batch size slightly so that in the normal case the batch fits nicely in 1 page, or 2 for 64b.)

As for making the IO bypass the buffer cache, I'm not sure what the best way to do this is. There are some occasions where we want the restore image to be in the buffer cache (e.g. as used by the fault injection testing for fast domain restart) but I agree that it's not helpful in the normal case. My first inclination would be O_DIRECT, but there may be a better way.

Ian
Jeremy Fitzhardinge
2007-Feb-06 23:20 UTC
Re: [Xen-devel] Reducing impact of save/restore/dump on Dom0
Ian Pratt wrote:

> As for making the IO bypass the buffer cache, I'm not sure what the
> best way to do this is. There are some occasions where we want the
> restore image to be in the buffer cache (e.g. as used by the fault
> injection testing for fast domain restart) but I agree that its not
> helpful in the normal case. My first inclination would be O_DIRECT,
> but there may be a better way.

O_DIRECT is strongly deprecated. fadvise(..., FADV_DONTNEED, ...) is the preferred interface.

J
Rusty Russell
2007-Feb-07 06:40 UTC
Re: [Xen-devel] Reducing impact of save/restore/dump on Dom0
On Tue, 2007-02-06 at 15:20 -0800, Jeremy Fitzhardinge wrote:

> O_DIRECT is strongly deprecated. fadvise(..., FADV_DONTNEED, ...) is
> the preferred interface.

Really? I know raw was deprecated in favour of O_DIRECT. But you'd have to call FADV_DONTNEED after every read AFAICT...

Rusty.
Jeremy Fitzhardinge
2007-Feb-07 07:41 UTC
Re: [Xen-devel] Reducing impact of save/restore/dump on Dom0
Rusty Russell wrote:

> On Tue, 2007-02-06 at 15:20 -0800, Jeremy Fitzhardinge wrote:
>
> > O_DIRECT is strongly deprecated. fadvise(..., FADV_DONTNEED, ...) is
> > the preferred interface.
>
> Really? I know raw was deprecated in favour of O_DIRECT. But you'd
> have to call FADV_DONTNEED after every read AFAICT...

Linus was having a bit of a rant about it last week. O_DIRECT works for straightforward use, but it's a bit of a crapshoot if you mix it with pagecached operations.

J
Andi Kleen
2007-Feb-07 12:11 UTC
[Xen-devel] Re: Reducing impact of save/restore/dump on Dom0
"Graham, Simon" <Simon.Graham@stratus.com> writes:

> Currently, save, restore and dump all use cached I/O in Dom0 to
> write/read the file containing the memory image of the DomU - when the
> memory assigned to the DomU is greater than free memory in Dom0, this
> leads to severe memory thrashing and generally the Dom0 performance
> goes into the toilet.
>
> The 'classic' answer to avoiding this when writing very large files
> is, of course, to use non-cached I/O to manipulate the files -

Otherwise you can just use madvise()/fadvise() to tell the kernel to drop the old data [the latter might need a fairly recent kernel to work]. It has the advantage that it doesn't need many other changes.

-Andi
Ian Pratt
2007-Feb-07 12:52 UTC
RE: [Xen-devel] Re: Reducing impact of save/restore/dump on Dom0
> > Currently, save, restore and dump all use cached I/O in Dom0 to
> > write/read the file containing the memory image of the DomU - when
> > the memory assigned to the DomU is greater than free memory in Dom0,
> > this leads to severe memory thrashing and generally the Dom0
> > performance goes into the toilet.
> >
> > The 'classic' answer to avoiding this when writing very large files
> > is, of course, to use non-cached I/O to manipulate the files -
>
> Otherwise you can just use madvise()/fadvise() to tell the kernel
> to drop the old data [the latter might need a fairly recent kernel
> to work]
>
> It has the advantage that it doesn't need much other changes.

It's pretty easy for us to arrange that everything is page aligned. If there was a measurable performance advantage using O_DIRECT rather than madvise/fadvise it probably makes sense to use it -- I can't see O_DIRECT going away any time soon.

Ian
Graham, Simon
2007-Feb-07 12:56 UTC
RE: [Xen-devel] Reducing impact of save/restore/dump on Dom0
> -----Original Message-----
> From: Jeremy Fitzhardinge [mailto:jeremy@goop.org]
>
> Ian Pratt wrote:
> > As for making the IO bypass the buffer cache, I'm not sure what the
> > best way to do this is. There are some occasions where we want the
> > restore image to be in the buffer cache (e.g. as used by the fault
> > injection testing for fast domain restart) but I agree that its not
> > helpful in the normal case. My first inclination would be O_DIRECT,
> > but there may be a better way.
>
> O_DIRECT is strongly deprecated. fadvise(..., FADV_DONTNEED, ...) is
> the preferred interface.

I'm currently experimenting with using fsync/fadvise; will post results shortly. If this works well, then it's not essential to change the on-disk format, although I think the performance will be better if it is changed.

I do find it a little annoying that in Linux the routine is actually called posix_fadvise64 rather than fadvise64, but I can obviously work round that.

Simon
Andi Kleen
2007-Feb-07 12:58 UTC
Re: [Xen-devel] Re: Reducing impact of save/restore/dump on Dom0
> It's pretty easy for us to arrange that everything is page aligned. If
> there was a measurable performance advantage using O_DIRECT rather
> than madvise/fadvise it probably makes sense to use it -- I can't see
> O_DIRECT going away any time soon.

O_DIRECT won't do any write-behind, so unless you do very aggressive IO (large IO requests, threads or aio so you do useful work during the disk write latency) it will likely be slower. Similarly for reads -- it won't do any readahead, which you would need to do yourself.

It's really not a very good idea for most non-database applications.

-Andi
Graham, Simon
2007-Feb-07 13:06 UTC
RE: [Xen-devel] Re: Reducing impact of save/restore/dump on Dom0
> > It''s pretty easy for us to arrange that everything is page aligned. > If > > there was a measurable performance advantage using o_direct rather > than > > madvise/fadvise it probably makes sense to use it -- I can''t see > > o_direct going away any time soon. > > O_DIRECT won''t do any write behind, so unless you do very aggressiveIO> (large IO requests, threads or aio so you do useful work during the > disk write > latency) it will be likely slower. Similar for read -- it won''t do any > readahead which you would need to do by yourself. > > It''s really not a very good idea for most non database applications.Well, the dump/save/restore does do large IO requests for most of the data. Also, it''s a non-performance path - it''s MUCH more important that other things in Dom0 happen quickly (such as performing I/O for other domains) so I would be quite happy with the save/restore/dump process being a little slow in return for not destroying Dom0 performance (which is what happens today). Having said that, I understand that O_DIRECT is deprecated (by Linus at least) and there is also the problem of it not being available on Solaris; hence I am trying out the fsync/fadvise(DON''T_NEED) in the loop after writing/reading a chunk of data. Simon _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel