George Dunlap
2009-Aug-20 10:18 UTC
[Xen-devel] Balloons, crash-dumps, populate-on-demand, and shared zero pages
Paul recently pointed out a side effect of having the balloon driver replace guest p2m memory with empty space: when Windows (and perhaps Linux too) does a crash dump and reaches the pages in the balloon, it takes a page fault, which can cause cascading crashes and prevent any useful information from reaching the dump file.

After thinking about it for a bit, I wondered if it might be better to replace the "populate-on-demand" (PoD) concept with a "shared-zero-populate-on-demand" (zPoD) one. Reads to a PoD page would always map to a read-only shared zero page (or superpage, as the case may be). We can change the balloon driver behavior to fill the p2m entries for the balloon with zPoD entries instead of empty p2m entries.

As a side effect, the balloon driver would no longer need to explicitly fill in the p2m entries with RAM when deflating the balloon. The tools already tell Xen about memory target increases, so Xen can increase the PoD "cache"; the balloon driver would simply need to free memory back to the kernel, and the balloon will be populated on demand by the guest.

Any thoughts?

-George

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
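To make the proposal above concrete, here is a minimal toy model of the zPoD semantics as described in the message. It is only a sketch, not Xen code: the names `ToyP2M`, `PodCacheExhausted`, `balloon_out`, and `raise_target` are hypothetical, and the assumption that a write to a zPoD entry takes one page from the PoD cache is inferred from the description rather than taken from any implementation.

```python
PAGE_SIZE = 4096
ZERO_PAGE = bytes(PAGE_SIZE)  # one shared, immutable page of zeroes


class PodCacheExhausted(Exception):
    """Stands in for Xen killing the guest when the PoD cache is empty."""


class ToyP2M:
    """Hypothetical model of a p2m table whose empty entries are zPoD."""

    def __init__(self, num_pfns, cache_pages):
        # None marks a zPoD entry; a bytearray marks a populated page.
        self.entries = [None] * num_pfns
        self.cache_pages = cache_pages

    def read(self, pfn):
        # A read of a zPoD entry sees the shared zero page and
        # consumes nothing from the cache.
        page = self.entries[pfn]
        return ZERO_PAGE if page is None else bytes(page)

    def write(self, pfn, offset, data):
        if offset + len(data) > PAGE_SIZE:
            raise ValueError("write crosses the page boundary")
        if self.entries[pfn] is None:
            # First write populates the entry from the PoD cache.
            if self.cache_pages == 0:
                raise PodCacheExhausted(f"no backing page for pfn {pfn}")
            self.cache_pages -= 1
            self.entries[pfn] = bytearray(PAGE_SIZE)
        self.entries[pfn][offset:offset + len(data)] = data

    def balloon_out(self, pfn):
        # Inflating the balloon: the page goes back to the host and the
        # entry becomes zPoD instead of an empty p2m entry.
        self.entries[pfn] = None

    def raise_target(self, pages):
        # The toolstack raises the memory target, so the cache grows.
        self.cache_pages += pages
```

In this model a crash dump that only reads ballooned-out pages never faults, and deflating the balloon needs no explicit populate step: the guest simply starts writing to the pages once the cache has been enlarged.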
Steven Smith
2009-Aug-20 10:39 UTC
[Xen-devel] Re: Balloons, crash-dumps, populate-on-demand, and shared zero pages
> Paul recently pointed out that a side-effect of having the balloon
> driver replace guest p2m memory with empty space is that when Windows
> does a crash dump (perhaps Linux too), when it reaches the pages in
> the balloon, it will cause a page fault, which can cause cascading
> crashes and prevent any useful information from reaching the dump
> file.

Well, not quite. During a crash dump, the only thing Windows does with the page is write it out. If you're using PV drivers, that means you create grant references for the ballooned-out PFNs and pass them off to the backend, which tries to map them, fails, and passes an error back to the frontend. If the frontend then passes those errors back to Windows, it'll retry a couple of times, then give up and crash.

It wouldn't be particularly difficult to avoid this by just masking the error from the frontend, claiming to have written the data even though the backend gave us an error. That'd mean you'd have garbage in the dump file for ballooned-out pages, but those pages probably aren't very interesting, and the rest of the dump file would be fine.

This might be relevant for hibernation files, though, because Windows compresses those before writing them out, and hence has to touch them through a virtual address. At the moment, the Citrix drivers deal with this by just blocking hibernation whenever the balloon driver's active. Making ballooned-out pages implicitly all-zeroes would let us turn that back on, which'd be kind of nice. I'm not sure how valuable that actually is in the real world, though: why would you hibernate a VM when you could just vm-suspend it?

> After thinking about it for a bit, I wondered if it might be better to
> replace the "populate-on-demand" concept with a
> "shared-zero-populate-on-demand". Reads to a PoD page would always
> map to a read-only shared zero page (or superpage, as the case may
> be). We can change the balloon driver behavior to fill the p2m
> entries for the balloon with zPoD entries instead of empty p2m entries.
> As a side-effect, the balloon driver no longer would need to
> explicitly fill in the p2m entries with ram when deflating the
> balloon; the tools already tell Xen about memory target increases, so
> it can increase the PoD "cache"; the balloon driver would simply need
> to free memory back to the kernel and the balloon will be populated
> on-demand by the guest.

That would make things marginally easier on the drivers, but it's at the expense of potentially more subtle errors when something goes wrong. At the moment, if the balloon driver tries to deflate the balloon too far, the populate hypercall fails and it's very obvious what's gone wrong, whereas with an implicit re-populate it'll look like everything's working fine for some time afterwards, until the guest touches too many pages and PoD kills it.

Steven.
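The error-masking idea from this message can be sketched roughly as follows. This is an illustration under stated assumptions, not the Citrix PV driver code: `BackendResponse`, `complete_write`, and the `in_crash_dump` flag are hypothetical names, and the status strings stand in for whatever completion codes a real frontend would hand back to Windows.

```python
from dataclasses import dataclass


@dataclass
class BackendResponse:
    """Hypothetical completion record passed back by the backend."""
    request_id: int
    ok: bool


def complete_write(resp: BackendResponse, in_crash_dump: bool) -> str:
    """Return the status the frontend reports to the OS for one write."""
    if resp.ok:
        return "success"
    if in_crash_dump:
        # The backend could not map the granted page (for example, a
        # ballooned-out PFN). Claim success so the dump carries on; the
        # corresponding region of the dump file holds garbage.
        return "success"
    # Outside a crash dump, failures are reported as usual.
    return "error"
```

The masking is limited to the crash-dump path on purpose: during normal operation a failed write still surfaces as an error, so the change would not hide genuine I/O problems.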
Paul Durrant
2009-Aug-20 10:56 UTC
[Xen-devel] Re: Balloons, crash-dumps, populate-on-demand, and shared zero pages
Steven Smith wrote:
> That would make things marginally easier on the drivers, but it's at
> the expense of potentially more subtle errors when something goes
> wrong. At the moment, if the balloon driver tries to deflate the
> balloon too far, the populate hypercall fails and it's very obvious
> what's gone wrong, whereas with an implicit re-populate it'll look
> like everything's working fine for some time afterwards, until the
> guest touches too many pages and PoD kills it.
>

If the balloon driver deflated too far, that would be a bug in the balloon driver. And if Windows doesn't scrub the memory when it's freed, we could do that ourselves, so at least PoD would kill the guest at the right juncture.

Paul

--
Paul Durrant, Software Engineer
Citrix Systems (R&D) Ltd.
First Floor, Building 101
Cambridge Science Park
Milton Road
Cambridge CB4 0FY
United Kingdom
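The timing difference discussed in this exchange can be shown with a small sketch. It assumes, as the zPoD proposal implies, that writing to an unpopulated page takes one page from the PoD cache, so scrubbing on deflate touches every freed page immediately. `PodCache`, `GuestKilled`, and `deflate` are hypothetical names used only for illustration.

```python
class GuestKilled(Exception):
    """Stands in for PoD killing the guest when the cache is empty."""


class PodCache:
    def __init__(self, pages):
        self.pages = pages

    def populate_one(self):
        # Back one guest page from the cache, or kill the guest.
        if self.pages == 0:
            raise GuestKilled("PoD cache exhausted")
        self.pages -= 1


def deflate(cache, num_pages, scrub):
    """Free num_pages to the guest kernel; return how many stay unbacked."""
    if scrub:
        # Writing zeroes touches each page, so any over-deflation
        # fails here, during the deflate itself.
        for _ in range(num_pages):
            cache.populate_one()
        return 0
    # Without scrubbing nothing is touched yet; a shortfall stays
    # latent until the guest later writes to the pages.
    return num_pages
```

With a cache of two pages, deflating by three with `scrub=True` raises `GuestKilled` during the deflate, whereas `scrub=False` returns normally and leaves the shortfall to show up later.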
George Dunlap
2009-Aug-20 12:49 UTC
Re: [Xen-devel] Re: Balloons, crash-dumps, populate-on-demand, and shared zero pages
On Thu, Aug 20, 2009 at 11:56 AM, Paul Durrant <paul.durrant@citrix.com> wrote:
> If the balloon driver deflated too far, that would be a bug in the balloon
> driver, and if Windows doesn't scrub the memory when it's freed we could do
> that ourselves so at least PoD would kill the guest at the right juncture.

But is it easier to scrub memory manually before freeing, or just make a hypercall telling Xen to put zeroed pages there?

I think Steven's right: we may be introducing subtle latent bugs, and overall it doesn't sound like the benefit is worth the extra complexity.

Steven, if we can make the PV drivers do something sensible with respect to failed writes during a crash, that might be the best solution.

-George