George,

in the hope that you might have some insight, or might be remembering that something like this was reported before (and ideally fixed), I'll try to describe a problem a customer of ours reported. Unfortunately this is with Xen 4.0.x (plus numerous backports), and it is not known whether the same issue exists on 4.1.x or -unstable.

For a domain with maxmem=16000M and memory=3200M, what gets logged is

(XEN) p2m_pod_demand_populate: Out of populate-on-demand memory! tot_pages 480 pod_entries 221184
(XEN) domain_crash called from p2m.c:1150
(XEN) Domain 3 reported crashed by domain 0 on cpu#6:
(XEN) p2m_pod_demand_populate: Out of populate-on-demand memory! tot_pages 480 pod_entries 221184
(XEN) domain_crash called from p2m.c:1150

Translated to hex, the numbers are 1e0 and 36000. The latter one varies across the (rather infrequent) cases where this happens (but was always a multiple of 1000 - see below), and instant retries to create the affected domain did always succeed so far (i.e. the failure is definitely not because of a lack of free memory).

Given that the memory= target wasn't reached yet, I would conclude that this happens in the middle of (4.0.x file name used here) tools/libxc/xc_hvm_build.c:setup_guest()'s main physmap population code. However, the way I read the code there, I would think that the sequence of population should be (using hex GFNs) 0...9f, c0...7ff, 800...fff, 1000...17ff, etc. That, however, appears to be inconsistent with the logged numbers above - tot_pages should always be at least 7e0 (the low 2Mb less the VGA hole), especially when pod_entries is divisible by 800 (the increment by which large-page population happens).

As a result of this apparent inconsistency I can't really conclude anything from the logged numbers.

The main question, irrespective of any numbers, of course is: How would p2m_pod_demand_populate() be invoked at all during this early phase of domain construction? Nothing should be touching any of the memory... If this nevertheless is possible (even if just for a single page), then perhaps the tools ought to make sure the pages put into the low 2Mb get actually zeroed, so the PoD code has a chance to find victim pages.

Thanks for any thoughts or pointers,
Jan
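For reference, the population order described above would behave roughly like the sketch below. This is only an illustration of that reading of setup_guest(), not an excerpt of the libxc code: populate_range() is a made-up stand-in for the real xc_domain_populate_physmap_exact() calls, and the page count assumes memory=3200M.

    /* Illustration only: the population order as read from setup_guest()
     * (4.0.x), with populate_range() standing in for the real
     * xc_domain_populate_physmap_exact() calls. */
    #include <stdio.h>

    static void populate_range(unsigned long first_gfn, unsigned long count,
                               unsigned int order)
    {
        printf("populate gfn %#lx..%#lx (order %u)\n",
               first_gfn, first_gfn + count - 1, order);
    }

    int main(void)
    {
        const unsigned long vga_hole_start = 0xa0, vga_hole_end = 0xc0;
        const unsigned long nr_pages = 0xc8000;  /* memory=3200M in 4k pages */
        unsigned long gfn;

        populate_range(0, vga_hole_start, 0);                  /* 0x000..0x09f */
        populate_range(vga_hole_end, 0x800 - vga_hole_end, 0); /* 0x0c0..0x7ff */
        for ( gfn = 0x800; gfn < nr_pages; gfn += 0x800 )      /* 2Mb chunks   */
            populate_range(gfn, 0x800, 9);
        return 0;
    }

With this order, by the time any 2Mb-aligned chunk is reached, the low 2Mb (less the VGA hole) has already been populated with 4k pages - hence the expectation that tot_pages should be at least 7e0.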
Jan Beulich
2012-Jul-26 15:37 UTC
Re: PoD code killing domain before it really gets started
>>> On 26.07.12 at 16:41, "Jan Beulich" <JBeulich@suse.com> wrote:
> The main question, irrespective of any numbers, of course is: How would p2m_pod_demand_populate() be invoked at all during this early phase of domain construction? Nothing should be touching any of the memory... If this nevertheless is possible (even if just for a single page), then perhaps the tools ought to make sure the pages put into the low 2Mb get actually zeroed, so the PoD code has a chance to find victim pages.

One more point of inconsistency: According to xend.log, the device model had already been launched at the point of the guest's death, yet the two operations (call trees ending in pyxc_hvm_build() and ImageHandler.createDomain(), the log message of which is present in the logs) are both rooted in XendDomainInfo._initDomain(), and are hence sequential aiui (i.e. the physmap population should already have finished). Of course that's unless xend happily continues acting on a crashed guest (which would explain why there are two instances of the PoD related messages in the hypervisor log).

So I'm getting all the more confused the deeper I look into this.

Jan
George Dunlap
2012-Jul-26 16:14 UTC
Re: PoD code killing domain before it really gets started
On 26/07/12 15:41, Jan Beulich wrote:
> George,
>
> in the hope that you might have some insight, or might be remembering that something like this was reported before (and ideally fixed), I'll try to describe a problem a customer of ours reported. Unfortunately this is with Xen 4.0.x (plus numerous backports), and it is not known whether the same issue exists on 4.1.x or -unstable.
>
> For a domain with maxmem=16000M and memory=3200M, what gets logged is
>
> (XEN) p2m_pod_demand_populate: Out of populate-on-demand memory! tot_pages 480 pod_entries 221184
> (XEN) domain_crash called from p2m.c:1150
> (XEN) Domain 3 reported crashed by domain 0 on cpu#6:
> (XEN) p2m_pod_demand_populate: Out of populate-on-demand memory! tot_pages 480 pod_entries 221184
> (XEN) domain_crash called from p2m.c:1150
>
> Translated to hex, the numbers are 1e0 and 36000. The latter one varies across the (rather infrequent) cases where this happens (but was always a multiple of 1000 - see below), and instant retries to create the affected domain did always succeed so far (i.e. the failure is definitely not because of a lack of free memory).
>
> Given that the memory= target wasn't reached yet, I would conclude that this happens in the middle of (4.0.x file name used here) tools/libxc/xc_hvm_build.c:setup_guest()'s main physmap population code. However, the way I read the code there, I would think that the sequence of population should be (using hex GFNs) 0...9f, c0...7ff, 800...fff, 1000...17ff, etc. That, however, appears to be inconsistent with the logged numbers above - tot_pages should always be at least 7e0 (the low 2Mb less the VGA hole), especially when pod_entries is divisible by 800 (the increment by which large-page population happens).
>
> As a result of this apparent inconsistency I can't really conclude anything from the logged numbers.
>
> The main question, irrespective of any numbers, of course is: How would p2m_pod_demand_populate() be invoked at all during this early phase of domain construction? Nothing should be touching any of the memory... If this nevertheless is possible (even if just for a single page), then perhaps the tools ought to make sure the pages put into the low 2Mb get actually zeroed, so the PoD code has a chance to find victim pages.

Yes, this is a very strange circumstance: p2m_pod_demand_populate() shouldn't happen until at least one PoD entry has been created, and that shouldn't happen until after c0...7ff have been populated with 4k pages.

Although, it does look as though when populating 4k pages, the code doesn't actually look to see if the allocation succeeded or not... oh wait, no, it actually checks rc as a condition of the while() loop -- but that is then clobbered by the xc_domain_set_pod_target() call. But surely if the 4k allocation failed, the set_target() call should fail as well? And in any case, there shouldn't yet be any PoD entries to cause a demand-populate.

We probably should change "if(pod_mode)" to "if(rc == 0 && pod_mode)" or something like that, just to be sure. I'll spin up a patch.

I think what I would try to do is to add a stack trace to the demand_populate() failure path, so you can see where the call came from; i.e., if it came from a guest access, or from someone in dom0 writing to some of the memory. I'd also add a printk to set_pod_target(), so you can see if it was actually called and what it was set to.

-George
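A minimal sketch of that guard, assuming the xen-unstable libxc prototypes (4.0.x still used a plain int handle); set_pod_target_checked() is a made-up wrapper name used purely for illustration, not the actual patch:

    #include <xenctrl.h>

    /* Sketch: don't let the PoD target call clobber a failure from the
     * preceding physmap population loop ("rc" is that loop's result). */
    static int set_pod_target_checked(xc_interface *xch, uint32_t dom,
                                      int rc, int pod_mode,
                                      uint64_t pod_pages)
    {
        /* was: if ( pod_mode ) */
        if ( rc == 0 && pod_mode )
            rc = xc_domain_set_pod_target(xch, dom, pod_pages,
                                          NULL, NULL, NULL);
        return rc;
    }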
Jan Beulich
2012-Jul-27 06:45 UTC
Re: PoD code killing domain before it really gets started
>>> On 26.07.12 at 18:14, George Dunlap <george.dunlap@eu.citrix.com> wrote:
> I think what I would try to do is to add a stack trace to the demand_populate() failure path, so you can see where the call came from; i.e., if it came from a guest access, or from someone in dom0 writing to some of the memory. I'd also add a printk to set_pod_target(), so you can see if it was actually called and what it was set to.

Yes, that sounds very similar to the plans I had in case you don't recall any respective fixes or have any other immediate idea about the problem. Albeit guest accesses can be excluded here afaict, given that the guest didn't even get started yet. But indeed, with nothing really fitting together, we really need to make sure...

Meanwhile I also got confirmation that this can happen straight after boot, which makes it all the more questionable - almost all pages should hold zeros only, and hence the PoD code should be able to find victim pages among the populated ones, yet it obviously doesn't.

Jan
Jan Beulich
2012-Aug-06 13:57 UTC
Re: PoD code killing domain before it really gets started
>>> On 26.07.12 at 18:14, George Dunlap <george.dunlap@eu.citrix.com> wrote:
> Yes, this is a very strange circumstance: p2m_pod_demand_populate() shouldn't happen until at least one PoD entry has been created, and that shouldn't happen until after c0...7ff have been populated with 4k pages.

So meanwhile I was told that this is very likely caused by an access originating in Dom0. Just a few minutes ago I also got hold of call stacks (as was already seen with the original messages, it produces two instances per bad access):

...
(XEN) gpmpod(1, 8000, 9) -> 0 [dom0]
(XEN) gpmpod(1, 8200, 9) -> 0 [dom0]

[coming from

    printk("gpmpod(%d, %lx, %u) -> %d [dom%d]\n",
           d->domain_id, gfn, order, rc, current->domain->domain_id);

at the end of guest_physmap_mark_populate_on_demand()]

(XEN) p2m_pod_demand_populate: Dom1 out of PoD memory! (tot=1e0 ents=8200 dom0)

[altered message at the failure point in p2m_pod_demand_populate():

    - printk("%s: Out of populate-on-demand memory! tot_pages %" PRIu32 " pod_entries %" PRIi32 "\n",
    -        __func__, d->tot_pages, p2md->pod.entry_count);
    + printk("%s: Dom%d out of PoD memory! (tot=%"PRIx32" ents=%"PRIx32" dom%d)\n",
    +        __func__, d->domain_id, d->tot_pages, p2md->pod.entry_count, current->domain->domain_id);
    + WARN_ON(1);
]

(XEN) Xen WARN at p2m.c:1155
(XEN) ----[ Xen-4.0.3_21548_04a-0.9.1  x86_64  debug=n  Tainted:    C ]----
(XEN) CPU:    2
(XEN) RIP:    e008:[<ffff82c4801cbf86>] p2m_pod_demand_populate+0x836/0xab0
...
(XEN) Xen call trace:
(XEN)    [<ffff82c4801cbf86>] p2m_pod_demand_populate+0x836/0xab0
(XEN)    [<ffff82c4801676b1>] get_page_and_type_from_pagenr+0x91/0x100
(XEN)    [<ffff82c4801f02d4>] ept_pod_check_and_populate+0x104/0x1a0
(XEN)    [<ffff82c4801f0482>] ept_get_entry+0x112/0x230
(XEN)    [<ffff82c48016be98>] do_mmu_update+0x16d8/0x1930
(XEN)    [<ffff82c4801f8c51>] do_iret+0xc1/0x1a0
(XEN)    [<ffff82c4801f4189>] syscall_enter+0xa9/0xae
(XEN)
(XEN) domain_crash called from p2m.c:1156
(XEN) Domain 1 reported crashed by domain 0 on cpu#2:
(XEN) p2m_pod_demand_populate: Dom1 out of PoD memory! (tot=1e0 ents=8200 dom0)
(XEN) Xen WARN at p2m.c:1155
(XEN) ----[ Xen-4.0.3_21548_04a-0.9.1  x86_64  debug=n  Tainted:    C ]----
(XEN) CPU:    2
(XEN) RIP:    e008:[<ffff82c4801cbf86>] p2m_pod_demand_populate+0x836/0xab0
...
(XEN) Xen call trace:
(XEN)    [<ffff82c4801cbf86>] p2m_pod_demand_populate+0x836/0xab0
(XEN)    [<ffff82c480108733>] send_guest_global_virq+0x93/0xe0
(XEN)    [<ffff82c4801cbfb2>] p2m_pod_demand_populate+0x862/0xab0
(XEN)    [<ffff82c4801f02d4>] ept_pod_check_and_populate+0x104/0x1a0
(XEN)    [<ffff82c4801f0482>] ept_get_entry+0x112/0x230
(XEN)    [<ffff82c48016890b>] mod_l1_entry+0x47b/0x650
(XEN)    [<ffff82c4801f0482>] ept_get_entry+0x112/0x230
(XEN)    [<ffff82c48016b21a>] do_mmu_update+0xa5a/0x1930
(XEN)    [<ffff82c4801f8c51>] do_iret+0xc1/0x1a0
(XEN)    [<ffff82c4801f4189>] syscall_enter+0xa9/0xae
(XEN)
(XEN) domain_crash called from p2m.c:1156

This clarifies at least why there are two events (and, despite the code having changed quite a bit, this appears to still be the case for -unstable): case MMU_NORMAL_PT_UPDATE, sub-case PGT_l1_page_table, calls (in -unstable terms) get_page_from_gfn() but ignores the return value (which ought to be NULL here), and only partially inspects the returned type. As the type matches none of the ones looked for, it happily proceeds into mod_l1_entry(), which then calls get_page_from_gfn() again.

> Although, it does look as though when populating 4k pages, the code doesn't actually look to see if the allocation succeeded or not... oh wait, no, it actually checks rc as a condition of the while() loop -- but that is then clobbered by the xc_domain_set_pod_target() call. But surely if the 4k allocation failed, the set_target() call should fail as well? And in any case, there shouldn't yet be any PoD entries to cause a demand-populate.
>
> We probably should change "if(pod_mode)" to "if(rc == 0 && pod_mode)" or something like that, just to be sure. I'll spin up a patch.

I had also included this adjustment in the debugging patch, but this clearly isn't related to the problem. The domain indeed has 0x1e0 pages allocated, and a huge (still growing) number of PoD entries. And apparently this fails so rarely because it's pretty unlikely that there's not a single clear page that the PoD code can select as victim, plus the Dom0 space code likely also only infrequently happens to kick in at the wrong time.

So in the end it presumably boils down to deciding whether such an out-of-band Dom0 access is valid to be done (and I think it is). If it is, then xc_hvm_build_x86.c:setup_guest() should make sure any actually allocated pages (those coming from the calls to xc_domain_populate_physmap_exact()) get cleared when pod_mode is set. Otoh, as pointed out in a yet unanswered mail (see http://lists.xen.org/archives/html/xen-devel/2012-07/msg01331.html), these allocations could/should - when pod_mode is set - similarly be done with XENMEMF_populate_on_demand set. In such a case, _any_ Dom0 access to guest memory prior to the call to xc_domain_set_pod_target() would kill the domain, as there is not even a single page to be looked at as a possible victim.

Consequently, I would think that the guest shouldn't be killed unconditionally when a PoD operation didn't succeed - in particular not when the access was from a foreign (i.e. the management) domain.

Jan
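A rough sketch of the clearing idea mentioned above: clear_populated_range() is a hypothetical helper (the real setup_guest() change would batch mappings rather than go page by page), but the mapping and scrubbing primitives used are the ordinary libxc ones.

    #include <string.h>
    #include <sys/mman.h>
    #include <xenctrl.h>

    /* Hypothetical helper: scrub the pages that actually got allocated by
     * xc_domain_populate_physmap_exact() when pod_mode is set, so the PoD
     * emergency sweep can find zeroed victim pages.  One page at a time
     * only for brevity. */
    static int clear_populated_range(xc_interface *xch, uint32_t dom,
                                     unsigned long first_gfn,
                                     unsigned long nr_pages)
    {
        unsigned long i;

        for ( i = 0; i < nr_pages; i++ )
        {
            void *p = xc_map_foreign_range(xch, dom, XC_PAGE_SIZE,
                                           PROT_WRITE, first_gfn + i);
            if ( p == NULL )
                return -1;
            memset(p, 0, XC_PAGE_SIZE);
            munmap(p, XC_PAGE_SIZE);
        }
        return 0;
    }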
Jan Beulich
2012-Aug-06 14:12 UTC
Re: PoD code killing domain before it really gets started
>>> On 06.08.12 at 15:57, "Jan Beulich" <JBeulich@suse.com> wrote:
> The domain indeed has 0x1e0 pages allocated, and a huge (still growing) number of PoD entries. And apparently this fails so rarely because it's pretty unlikely that there's not a single clear page that the PoD code can select as victim, plus the Dom0 space code likely also only infrequently happens to kick in at the wrong time.

Just realized that of course it's also suspicious that there shouldn't be any clear page among those 480 - Dom0 scrubs its pages as it balloons them out (but I think ballooning isn't even in use there), Xen scrubs the free pages on boot, yet this reportedly has happened also for the very first domain ever created after boot. Or does the PoD code not touch the low 2Mb for some reason?

Jan
George Dunlap
2012-Aug-06 16:03 UTC
Re: PoD code killing domain before it really gets started
On Mon, Aug 6, 2012 at 3:12 PM, Jan Beulich <JBeulich@suse.com> wrote:
>>>> On 06.08.12 at 15:57, "Jan Beulich" <JBeulich@suse.com> wrote:
>> The domain indeed has 0x1e0 pages allocated, and a huge (still growing) number of PoD entries. And apparently this fails so rarely because it's pretty unlikely that there's not a single clear page that the PoD code can select as victim, plus the Dom0 space code likely also only infrequently happens to kick in at the wrong time.
>
> Just realized that of course it's also suspicious that there shouldn't be any clear page among those 480 - Dom0 scrubs its pages as it balloons them out (but I think ballooning isn't even in use there), Xen scrubs the free pages on boot, yet this reportedly has happened also for the very first domain ever created after boot. Or does the PoD code not touch the low 2Mb for some reason?

Hmm -- the sweep code has some fairly complicated heuristics. Ah -- I bet this is it: the algorithm implicitly assumes that the first sweep will happen after the first demand-fault. It's designed to start at the last demand-faulted gpfn (tracked by p2m->pod.max_guest) and go downwards. When it reaches 0, it stops its sweep (?!), and resets to max_guest on the next entry. But if max_guest is 0, this means it will basically never sweep at all.

I guess there are two problems with that:
* As you've seen, apparently dom0 may access these pages before any faults happen.
* If it happens that reclaim_single is below the only zeroed page, the guest will crash even when there is reclaim-able memory available.

A few ways we could fix this:
1. Remove dom0 accesses (what on earth could be looking at a not-yet-created VM?)
2. Allocate the PoD cache before populating the p2m table
3. Make it so that some accesses fail w/o crashing the guest? I don't see how that's really practical.
4. Change the sweep routine so that the lower 2MiB gets swept

#2 would require us to use all PoD entries when building the p2m table, thus addressing the mail you mentioned from 25 July*. Given that you don't want #1, it seems like #2 is the best option.

No matter what we do, the sweep routine for 4.2 should be re-written to search all of memory at least once (maybe with a timeout for watchdogs), since it's only called in an actual emergency.

Let me take a look...

-George

* Sorry for not responding to that one; I must have missed it in my return-from-travelling e-mail sweep. If you CC me next time I'll be sure to get it.
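To make the heuristic concrete, the sweep starting point behaves roughly like the following simplified model - reclaim_single and max_guest mirror the fields discussed above; this is not the actual p2m code.

    /* Simplified model of the emergency-sweep starting point: the sweep
     * walks downwards from reclaim_single and gives up at 0, restarting
     * from max_guest (the last demand-faulted gpfn) next time.  With
     * max_guest still 0 - no guest fault yet - nothing ever gets swept. */
    static unsigned long next_sweep_start(unsigned long reclaim_single,
                                          unsigned long max_guest)
    {
        if ( reclaim_single == 0 )      /* previous sweep ran down to 0... */
            reclaim_single = max_guest; /* ...so restart from the top      */
        return reclaim_single;          /* still 0 if max_guest is 0       */
    }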
Jan Beulich
2012-Aug-07 07:34 UTC
Re: PoD code killing domain before it really gets started
>>> On 06.08.12 at 18:03, George Dunlap <George.Dunlap@eu.citrix.com> wrote:
> I guess there are two problems with that:
> * As you've seen, apparently dom0 may access these pages before any faults happen.
> * If it happens that reclaim_single is below the only zeroed page, the guest will crash even when there is reclaim-able memory available.
>
> A few ways we could fix this:
> 1. Remove dom0 accesses (what on earth could be looking at a not-yet-created VM?)

I'm told it's a monitoring daemon, and yes, they are intending to adjust it to first query the GFN's type (and don't do the access when it's not populated yet). But wait, I didn't check the code when I recommended this - XEN_DOMCTL_getpageframeinfo{2,3} also call get_page_from_gfn() with P2M_ALLOC, so would also trigger the PoD code (in -unstable at least) - Tim, was that really a correct adjustment in 25355:974ad81bb68b? It looks to be a 1:1 translation, but is that really necessary? If one wanted to find out whether a page is PoD to avoid getting it populated, how would that be done from outside the hypervisor? Would we need XEN_DOMCTL_getpageframeinfo4 for this?

> 2. Allocate the PoD cache before populating the p2m table
> 3. Make it so that some accesses fail w/o crashing the guest? I don't see how that's really practical.

What's wrong with telling control tools that a certain page is unpopulated (from which they will be able to imply that it's all clear from the guest's pov)? Even outside of the current problem, I would think that's more efficient than allocating the page. Of course, the control tools need to be able to cope with that. And it may also be necessary to distinguish between read and read/write mappings being established (and for r/w ones the option of populating at access time rather than at creation time would need to be explored).

> 4. Change the sweep routine so that the lower 2MiB gets swept
>
> #2 would require us to use all PoD entries when building the p2m table, thus addressing the mail you mentioned from 25 July*. Given that you don't want #1, it seems like #2 is the best option.
>
> No matter what we do, the sweep routine for 4.2 should be re-written to search all of memory at least once (maybe with a timeout for watchdogs), since it's only called in an actual emergency.
>
> Let me take a look...

Thanks!

Jan
Tim Deegan
2012-Aug-07 10:00 UTC
Re: PoD code killing domain before it really gets started
Hi,

At 08:34 +0100 on 07 Aug (1344328495), Jan Beulich wrote:
> >>> On 06.08.12 at 18:03, George Dunlap <George.Dunlap@eu.citrix.com> wrote:
> > I guess there are two problems with that:
> > * As you've seen, apparently dom0 may access these pages before any faults happen.
> > * If it happens that reclaim_single is below the only zeroed page, the guest will crash even when there is reclaim-able memory available.
> >
> > A few ways we could fix this:
> > 1. Remove dom0 accesses (what on earth could be looking at a not-yet-created VM?)
>
> I'm told it's a monitoring daemon, and yes, they are intending to adjust it to first query the GFN's type (and don't do the access when it's not populated yet). But wait, I didn't check the code when I recommended this - XEN_DOMCTL_getpageframeinfo{2,3} also call get_page_from_gfn() with P2M_ALLOC, so would also trigger the PoD code (in -unstable at least) - Tim, was that really a correct adjustment in 25355:974ad81bb68b? It looks to be a 1:1 translation, but is that really necessary?

AFAICT 25355:974ad81bb68b doesn't change anything. Back in 4.1-testing the lookup was done with gmfn_to_mfn(), which boils down to a lookup with p2m_alloc.

> If one wanted to find out whether a page is PoD to avoid getting it populated, how would that be done from outside the hypervisor? Would we need XEN_DOMCTL_getpageframeinfo4 for this?

We'd certainly need _some_ change to the hypercall interface, as there's no XEN_DOMCTL_PFINFO_ rune for 'PoD', and presumably you'd want to know the difference between PoD and not-present.

> > 2. Allocate the PoD cache before populating the p2m table

Any reason not to do this?

Tim.
George Dunlap
2012-Aug-07 10:20 UTC
Re: PoD code killing domain before it really gets started
On Tue, Aug 7, 2012 at 8:34 AM, Jan Beulich <JBeulich@suse.com> wrote:
>>>> On 06.08.12 at 18:03, George Dunlap <George.Dunlap@eu.citrix.com> wrote:
>> I guess there are two problems with that:
>> * As you've seen, apparently dom0 may access these pages before any faults happen.
>> * If it happens that reclaim_single is below the only zeroed page, the guest will crash even when there is reclaim-able memory available.
>>
>> A few ways we could fix this:
>> 1. Remove dom0 accesses (what on earth could be looking at a not-yet-created VM?)
>
> I'm told it's a monitoring daemon, and yes, they are intending to adjust it to first query the GFN's type (and don't do the access when it's not populated yet). But wait, I didn't check the code when I recommended this - XEN_DOMCTL_getpageframeinfo{2,3} also call get_page_from_gfn() with P2M_ALLOC, so would also trigger the PoD code (in -unstable at least) - Tim, was that really a correct adjustment in 25355:974ad81bb68b? It looks to be a 1:1 translation, but is that really necessary? If one wanted to find out whether a page is PoD to avoid getting it populated, how would that be done from outside the hypervisor? Would we need XEN_DOMCTL_getpageframeinfo4 for this?
>
>> 2. Allocate the PoD cache before populating the p2m table
>> 3. Make it so that some accesses fail w/o crashing the guest? I don't see how that's really practical.
>
> What's wrong with telling control tools that a certain page is unpopulated (from which they will be able to imply that it's all clear from the guest's pov)?

Because in the general case it's wrong. The only time crashing the guest is *not* the right thing to do is in the case we have at hand, where PoD pages are accessed before the PoD memory is allocated.

Probably the quickest fix would be if there was a simple way for the monitoring daemon to filter out domains that aren't completely built yet -- maybe by looking at something in xenstore?

But the current state of things does seem unnecessarily fragile; I think if it can be done, allocating PoD memory before writing PoD entries is probably a good thing to do anyway.

-George
George Dunlap
2012-Aug-07 10:32 UTC
Re: PoD code killing domain before it really gets started
On Tue, Aug 7, 2012 at 11:00 AM, Tim Deegan <tim@xen.org> wrote:
>> > 2. Allocate the PoD cache before populating the p2m table
>
> Any reason not to do this?

The only reason I can think of is that it makes it more fiddly if we *didn't* want some entries to be PoD. For example, if it is the case that the first 2MiB shouldn't be PoD, we'd have to subtract 2MiB from the target when doing the allocation. Doing it afterwards means not having to worry about it.

I can't think of any reason why the first 2MiB couldn't be PoD, however; so I think this is probably the best route to take.

-George
Jan Beulich
2012-Aug-07 11:03 UTC
Re: PoD code killing domain before it really gets started
>>> On 07.08.12 at 12:00, Tim Deegan <tim@xen.org> wrote:
>> > 2. Allocate the PoD cache before populating the p2m table
>
> Any reason not to do this?

Don't know. I was (maybe wrongly) implying that the order things got done in was the way it is for a reason. That's why I had added tools folks to Cc.

In any case I have asked the reporter to test a respective change.

Jan
Jan Beulich
2012-Aug-07 11:05 UTC
Re: PoD code killing domain before it really gets started
>>> On 07.08.12 at 12:20, George Dunlap <George.Dunlap@eu.citrix.com> wrote:
> Probably the quickest fix would be if there was a simple way for the monitoring daemon to filter out domains that aren't completely built yet -- maybe by looking at something in xenstore?

Given that the getpageframeinfo thing that I thought could be used for this doesn't work, I'd be curious as to what (xenstore or other) things to look at we could suggest to them. I personally have no idea...

Jan
Jan Beulich
2012-Aug-07 12:17 UTC
Re: PoD code killing domain before it really gets started
>>> On 06.08.12 at 18:03, George Dunlap <George.Dunlap@eu.citrix.com> wrote:
> 2. Allocate the PoD cache before populating the p2m table

So this doesn't work, the call simply has no effect (and never reaches p2m_pod_set_cache_target()). Apparently because of

    /* P == B: Nothing to do. */
    if ( p2md->pod.entry_count == 0 )
        goto out;

in p2m_pod_set_mem_target(). Now I'm not sure about the proper adjustment here: Entirely dropping the conditional is certainly wrong. Would

    if ( p2md->pod.entry_count == 0 && d->tot_pages > 0 )
        goto out;

be okay?

But then later in that function we also have

    /* B < T': Set the cache size equal to # of outstanding entries,
     * let the balloon driver fill in the rest. */
    if ( pod_target > p2md->pod.entry_count )
        pod_target = p2md->pod.entry_count;

which in the case at hand would set pod_target to 0, and the whole operation would again not have any effect afaict. So maybe this was the reason to do this operation _after_ the normal address space population?

Jan
George Dunlap
2012-Aug-07 13:13 UTC
Re: PoD code killing domain before it really gets started
On Tue, Aug 7, 2012 at 1:17 PM, Jan Beulich <JBeulich@suse.com> wrote:
>>>> On 06.08.12 at 18:03, George Dunlap <George.Dunlap@eu.citrix.com> wrote:
>> 2. Allocate the PoD cache before populating the p2m table
>
> So this doesn't work, the call simply has no effect (and never reaches p2m_pod_set_cache_target()). Apparently because of
>
>     /* P == B: Nothing to do. */
>     if ( p2md->pod.entry_count == 0 )
>         goto out;
>
> in p2m_pod_set_mem_target(). Now I'm not sure about the proper adjustment here: Entirely dropping the conditional is certainly wrong. Would
>
>     if ( p2md->pod.entry_count == 0 && d->tot_pages > 0 )
>         goto out;
>
> be okay?
>
> But then later in that function we also have
>
>     /* B < T': Set the cache size equal to # of outstanding entries,
>      * let the balloon driver fill in the rest. */
>     if ( pod_target > p2md->pod.entry_count )
>         pod_target = p2md->pod.entry_count;
>
> which in the case at hand would set pod_target to 0, and the whole operation would again not have any effect afaict. So maybe this was the reason to do this operation _after_ the normal address space population?

Snap -- forgot about that.

The main thing is for set_mem_target() to be simple for the toolstack -- it's just supposed to say how much memory it wants the guest to use, and Xen is supposed to figure out how much memory the PoD cache needs. The interface is that the toolstack is just supposed to call set_mem_target() after each time it changes the balloon target. The idea was to be robust against the user setting arbitrary new targets before the balloon driver had reached the old target. So the problem with allowing (pod_target > entry_count) is that that's the condition that happens when you are ballooning up.

Maybe the best thing to do is to introduce a specific call to initialize the PoD cache that would ignore entry_count?

-George
Jan Beulich
2012-Aug-07 13:29 UTC
Re: PoD code killing domain before it really gets started
>>> On 07.08.12 at 15:13, George Dunlap <George.Dunlap@eu.citrix.com> wrote:
> On Tue, Aug 7, 2012 at 1:17 PM, Jan Beulich <JBeulich@suse.com> wrote:
>>>>> On 06.08.12 at 18:03, George Dunlap <George.Dunlap@eu.citrix.com> wrote:
>>> 2. Allocate the PoD cache before populating the p2m table
>>
>> So this doesn't work, the call simply has no effect (and never reaches p2m_pod_set_cache_target()). Apparently because of
>>
>>     /* P == B: Nothing to do. */
>>     if ( p2md->pod.entry_count == 0 )
>>         goto out;
>>
>> in p2m_pod_set_mem_target(). Now I'm not sure about the proper adjustment here: Entirely dropping the conditional is certainly wrong. Would
>>
>>     if ( p2md->pod.entry_count == 0 && d->tot_pages > 0 )
>>         goto out;
>>
>> be okay?
>>
>> But then later in that function we also have
>>
>>     /* B < T': Set the cache size equal to # of outstanding entries,
>>      * let the balloon driver fill in the rest. */
>>     if ( pod_target > p2md->pod.entry_count )
>>         pod_target = p2md->pod.entry_count;
>>
>> which in the case at hand would set pod_target to 0, and the whole operation would again not have any effect afaict. So maybe this was the reason to do this operation _after_ the normal address space population?
>
> Snap -- forgot about that.
>
> The main thing is for set_mem_target() to be simple for the toolstack -- it's just supposed to say how much memory it wants the guest to use, and Xen is supposed to figure out how much memory the PoD cache needs. The interface is that the toolstack is just supposed to call set_mem_target() after each time it changes the balloon target. The idea was to be robust against the user setting arbitrary new targets before the balloon driver had reached the old target. So the problem with allowing (pod_target > entry_count) is that that's the condition that happens when you are ballooning up.
>
> Maybe the best thing to do is to introduce a specific call to initialize the PoD cache that would ignore entry_count?

Hmm, that would look more like a hack to me.

How about doing the initial check as suggested earlier

    if ( p2md->pod.entry_count == 0 && d->tot_pages > 0 )
        goto out;

and the latter check in a similar way

    if ( pod_target > p2md->pod.entry_count && d->tot_pages > 0 )
        pod_target = p2md->pod.entry_count;

(which would still take care of any ballooning activity)? Or are there any other traps to fall into?

Jan
Andres Lagar-Cavilla
2012-Aug-07 14:40 UTC
Re: PoD code killing domain before it really gets started
>>>> On 06.08.12 at 18:03, George Dunlap <George.Dunlap@eu.citrix.com> wrote:
>> I guess there are two problems with that:
>> * As you've seen, apparently dom0 may access these pages before any faults happen.
>> * If it happens that reclaim_single is below the only zeroed page, the guest will crash even when there is reclaim-able memory available.
>>
>> A few ways we could fix this:
>> 1. Remove dom0 accesses (what on earth could be looking at a not-yet-created VM?)
>
> I'm told it's a monitoring daemon, and yes, they are intending to adjust it to first query the GFN's type (and don't do the access when it's not populated yet). But wait, I didn't check the code when I recommended this - XEN_DOMCTL_getpageframeinfo{2,3} also call get_page_from_gfn() with P2M_ALLOC, so would also trigger the PoD code (in -unstable at least) - Tim, was that really a correct adjustment in 25355:974ad81bb68b? It looks to be a 1:1 translation, but is that really necessary? If one wanted to find out whether a page is PoD to avoid getting it populated, how would that be done from outside the hypervisor? Would we need XEN_DOMCTL_getpageframeinfo4 for this?
>
>> 2. Allocate the PoD cache before populating the p2m table
>> 3. Make it so that some accesses fail w/o crashing the guest? I don't see how that's really practical.
>
> What's wrong with telling control tools that a certain page is unpopulated (from which they will be able to imply that it's all clear from the guest's pov)? Even outside of the current problem, I would think that's more efficient than allocating the page. Of course, the control tools need to be able to cope with that. And it may also be necessary to distinguish between read and read/write mappings being established (and for r/w ones the option of populating at access time rather than at creation time would need to be explored).

I wouldn't be opposed to some form of getpageframeinfo4. It's not just PoD we are talking about here. Is the page paged out? Is the page shared?

Right now we have global per-domain queries (domaininfo), or individual gfn debug memctls. A batched interface with richer information would be a blessing for debugging or diagnosis purposes.

The first order of business is exposing the type. Do we really want to expose the whole range of p2m_* types or just "really useful" ones like is_shared, is_pod, is_paged, is_normal? An argument for the former is that the mem event interface already pumps the p2m_* type up the stack.

The other useful bit of information I can think of is exposing the shared ref count.

My two cents,
Andres

>
>> 4. Change the sweep routine so that the lower 2MiB gets swept
>>
>> #2 would require us to use all PoD entries when building the p2m table, thus addressing the mail you mentioned from 25 July*. Given that you don't want #1, it seems like #2 is the best option.
>>
>> No matter what we do, the sweep routine for 4.2 should be re-written to search all of memory at least once (maybe with a timeout for watchdogs), since it's only called in an actual emergency.
>>
>> Let me take a look...
>
> Thanks!
>
> Jan
George Dunlap
2012-Aug-07 15:04 UTC
Re: PoD code killing domain before it really gets started
On 07/08/12 15:40, Andres Lagar-Cavilla wrote:
>>>>> On 06.08.12 at 18:03, George Dunlap <George.Dunlap@eu.citrix.com> wrote:
>>> I guess there are two problems with that:
>>> * As you've seen, apparently dom0 may access these pages before any faults happen.
>>> * If it happens that reclaim_single is below the only zeroed page, the guest will crash even when there is reclaim-able memory available.
>>>
>>> A few ways we could fix this:
>>> 1. Remove dom0 accesses (what on earth could be looking at a not-yet-created VM?)
>> I'm told it's a monitoring daemon, and yes, they are intending to adjust it to first query the GFN's type (and don't do the access when it's not populated yet). But wait, I didn't check the code when I recommended this - XEN_DOMCTL_getpageframeinfo{2,3} also call get_page_from_gfn() with P2M_ALLOC, so would also trigger the PoD code (in -unstable at least) - Tim, was that really a correct adjustment in 25355:974ad81bb68b? It looks to be a 1:1 translation, but is that really necessary? If one wanted to find out whether a page is PoD to avoid getting it populated, how would that be done from outside the hypervisor? Would we need XEN_DOMCTL_getpageframeinfo4 for this?
>>
>>> 2. Allocate the PoD cache before populating the p2m table
>>> 3. Make it so that some accesses fail w/o crashing the guest? I don't see how that's really practical.
>> What's wrong with telling control tools that a certain page is unpopulated (from which they will be able to imply that it's all clear from the guest's pov)? Even outside of the current problem, I would think that's more efficient than allocating the page. Of course, the control tools need to be able to cope with that. And it may also be necessary to distinguish between read and read/write mappings being established (and for r/w ones the option of populating at access time rather than at creation time would need to be explored).
> I wouldn't be opposed to some form of getpageframeinfo4. It's not just PoD we are talking about here. Is the page paged out? Is the page shared?
>
> Right now we have global per-domain queries (domaininfo), or individual gfn debug memctls. A batched interface with richer information would be a blessing for debugging or diagnosis purposes.
>
> The first order of business is exposing the type. Do we really want to expose the whole range of p2m_* types or just "really useful" ones like is_shared, is_pod, is_paged, is_normal? An argument for the former is that the mem event interface already pumps the p2m_* type up the stack.
>
> The other useful bit of information I can think of is exposing the shared ref count.

I think that, just like the gfn_to_mfn() interface, we need an "I care about the details" interface and an "I don't care about the details" interface. If a page isn't present, or needs to be un-shared, or is PoD and not currently available, then maybe dom0 callers trying to map that page should get something like -EAGAIN? Just something that indicates, "This page isn't here at the moment, but may be here soon." What do you think?

-George
George Dunlap
2012-Aug-07 15:08 UTC
Re: PoD code killing domain before it really gets started
On 07/08/12 14:29, Jan Beulich wrote:
>>>> On 07.08.12 at 15:13, George Dunlap <George.Dunlap@eu.citrix.com> wrote:
>> On Tue, Aug 7, 2012 at 1:17 PM, Jan Beulich <JBeulich@suse.com> wrote:
>>>>>> On 06.08.12 at 18:03, George Dunlap <George.Dunlap@eu.citrix.com> wrote:
>>>> 2. Allocate the PoD cache before populating the p2m table
>>> So this doesn't work, the call simply has no effect (and never reaches p2m_pod_set_cache_target()). Apparently because of
>>>
>>>     /* P == B: Nothing to do. */
>>>     if ( p2md->pod.entry_count == 0 )
>>>         goto out;
>>>
>>> in p2m_pod_set_mem_target(). Now I'm not sure about the proper adjustment here: Entirely dropping the conditional is certainly wrong. Would
>>>
>>>     if ( p2md->pod.entry_count == 0 && d->tot_pages > 0 )
>>>         goto out;
>>>
>>> be okay?
>>>
>>> But then later in that function we also have
>>>
>>>     /* B < T': Set the cache size equal to # of outstanding entries,
>>>      * let the balloon driver fill in the rest. */
>>>     if ( pod_target > p2md->pod.entry_count )
>>>         pod_target = p2md->pod.entry_count;
>>>
>>> which in the case at hand would set pod_target to 0, and the whole operation would again not have any effect afaict. So maybe this was the reason to do this operation _after_ the normal address space population?
>> Snap -- forgot about that.
>>
>> The main thing is for set_mem_target() to be simple for the toolstack -- it's just supposed to say how much memory it wants the guest to use, and Xen is supposed to figure out how much memory the PoD cache needs. The interface is that the toolstack is just supposed to call set_mem_target() after each time it changes the balloon target. The idea was to be robust against the user setting arbitrary new targets before the balloon driver had reached the old target. So the problem with allowing (pod_target > entry_count) is that that's the condition that happens when you are ballooning up.
>>
>> Maybe the best thing to do is to introduce a specific call to initialize the PoD cache that would ignore entry_count?
> Hmm, that would look more like a hack to me.
>
> How about doing the initial check as suggested earlier
>
>     if ( p2md->pod.entry_count == 0 && d->tot_pages > 0 )
>         goto out;
>
> and the latter check in a similar way
>
>     if ( pod_target > p2md->pod.entry_count && d->tot_pages > 0 )
>         pod_target = p2md->pod.entry_count;
>
> (which would still take care of any ballooning activity)? Or are there any other traps to fall into?

The "d->tot_pages > 0" seems more like a hack to me. :-) What's hackish about having an interface like this?
* allocate_pod_mem()
* for() { populate_pod_mem() }
* [Boot VM]
* set_pod_target()

Right now set_pod_mem() is used both for initial allocation and for adjustments. But it seems like there's good reason to make a distinction.

-George
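Spelled out as a sketch, that ordering could look like the following. allocate_pod_mem() and populate_pod_mem() are the proposed, not yet existing, calls from the list above; only xc_domain_set_pod_target() exists today, and the superpage constant is defined here just for illustration.

    #include <xenctrl.h>

    /* Proposed (non-existent) calls, declared only to make the ordering
     * concrete: */
    int allocate_pod_mem(xc_interface *xch, uint32_t dom,
                         unsigned long cache_pages);
    int populate_pod_mem(xc_interface *xch, uint32_t dom,
                         unsigned long first_gfn, unsigned long nr_pages);

    #define SUPERPAGE_NR_PFNS 512UL   /* 2MiB worth of 4k pages */

    static int build_with_pod(xc_interface *xch, uint32_t dom,
                              unsigned long target_pages,
                              unsigned long max_pages)
    {
        unsigned long gfn;
        int rc = allocate_pod_mem(xch, dom, target_pages);  /* cache first */

        for ( gfn = 0; rc == 0 && gfn < max_pages; gfn += SUPERPAGE_NR_PFNS )
            rc = populate_pod_mem(xch, dom, gfn, SUPERPAGE_NR_PFNS);

        /* Boot the VM; later balloon-target changes keep going through
         * xc_domain_set_pod_target() as today. */
        return rc;
    }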
Jan Beulich
2012-Aug-07 15:36 UTC
Re: PoD code killing domain before it really gets started
>>> On 07.08.12 at 17:08, George Dunlap <george.dunlap@eu.citrix.com> wrote:
> On 07/08/12 14:29, Jan Beulich wrote:
>>>>> On 07.08.12 at 15:13, George Dunlap <George.Dunlap@eu.citrix.com> wrote:
>>> On Tue, Aug 7, 2012 at 1:17 PM, Jan Beulich <JBeulich@suse.com> wrote:
>>>>>>> On 06.08.12 at 18:03, George Dunlap <George.Dunlap@eu.citrix.com> wrote:
>>>>> 2. Allocate the PoD cache before populating the p2m table
>>>> So this doesn't work, the call simply has no effect (and never reaches p2m_pod_set_cache_target()). Apparently because of
>>>>
>>>>     /* P == B: Nothing to do. */
>>>>     if ( p2md->pod.entry_count == 0 )
>>>>         goto out;
>>>>
>>>> in p2m_pod_set_mem_target(). Now I'm not sure about the proper adjustment here: Entirely dropping the conditional is certainly wrong. Would
>>>>
>>>>     if ( p2md->pod.entry_count == 0 && d->tot_pages > 0 )
>>>>         goto out;
>>>>
>>>> be okay?
>>>>
>>>> But then later in that function we also have
>>>>
>>>>     /* B < T': Set the cache size equal to # of outstanding entries,
>>>>      * let the balloon driver fill in the rest. */
>>>>     if ( pod_target > p2md->pod.entry_count )
>>>>         pod_target = p2md->pod.entry_count;
>>>>
>>>> which in the case at hand would set pod_target to 0, and the whole operation would again not have any effect afaict. So maybe this was the reason to do this operation _after_ the normal address space population?
>>> Snap -- forgot about that.
>>>
>>> The main thing is for set_mem_target() to be simple for the toolstack -- it's just supposed to say how much memory it wants the guest to use, and Xen is supposed to figure out how much memory the PoD cache needs. The interface is that the toolstack is just supposed to call set_mem_target() after each time it changes the balloon target. The idea was to be robust against the user setting arbitrary new targets before the balloon driver had reached the old target. So the problem with allowing (pod_target > entry_count) is that that's the condition that happens when you are ballooning up.
>>>
>>> Maybe the best thing to do is to introduce a specific call to initialize the PoD cache that would ignore entry_count?
>> Hmm, that would look more like a hack to me.
>>
>> How about doing the initial check as suggested earlier
>>
>>     if ( p2md->pod.entry_count == 0 && d->tot_pages > 0 )
>>         goto out;
>>
>> and the latter check in a similar way
>>
>>     if ( pod_target > p2md->pod.entry_count && d->tot_pages > 0 )
>>         pod_target = p2md->pod.entry_count;
>>
>> (which would still take care of any ballooning activity)? Or are there any other traps to fall into?
> The "d->tot_pages > 0" seems more like a hack to me. :-) What's hackish about having an interface like this?
> * allocate_pod_mem()
> * for() { populate_pod_mem() }
> * [Boot VM]
> * set_pod_target()

Mostly the fact that it's an almost identical interface to what we already have. After all, the two !alloc checks would go at exactly the place where I had put the d->tot_pages > 0. Plus the new interface likely ought to check that d->tot_pages is zero. In the end you're proposing a full new interface for something that can be done with two small adjustments.

> Right now set_pod_mem() is used both for initial allocation and for adjustments. But it seems like there's good reason to make a distinction.

Yes. But that distinction can well be implicit.

In any case, as for testing this out, my approach would be far easier (even if it looks like a hack to you) - is there anything wrong with it (i.e. is my assumption about where the !alloc checks would have to go wrong)? If not, I'd prefer to have that simpler thing tested, and then we can still settle on the more involved change you would like to see.

Jan
Andres Lagar-Cavilla
2012-Aug-07 15:36 UTC
Re: PoD code killing domain before it really gets started
> On 07/08/12 15:40, Andres Lagar-Cavilla wrote:
>>>>>> On 06.08.12 at 18:03, George Dunlap <George.Dunlap@eu.citrix.com> wrote:
>>>> I guess there are two problems with that:
>>>> * As you've seen, apparently dom0 may access these pages before any faults happen.
>>>> * If it happens that reclaim_single is below the only zeroed page, the guest will crash even when there is reclaim-able memory available.
>>>>
>>>> A few ways we could fix this:
>>>> 1. Remove dom0 accesses (what on earth could be looking at a not-yet-created VM?)
>>> I'm told it's a monitoring daemon, and yes, they are intending to adjust it to first query the GFN's type (and don't do the access when it's not populated yet). But wait, I didn't check the code when I recommended this - XEN_DOMCTL_getpageframeinfo{2,3} also call get_page_from_gfn() with P2M_ALLOC, so would also trigger the PoD code (in -unstable at least) - Tim, was that really a correct adjustment in 25355:974ad81bb68b? It looks to be a 1:1 translation, but is that really necessary? If one wanted to find out whether a page is PoD to avoid getting it populated, how would that be done from outside the hypervisor? Would we need XEN_DOMCTL_getpageframeinfo4 for this?
>>>
>>>> 2. Allocate the PoD cache before populating the p2m table
>>>> 3. Make it so that some accesses fail w/o crashing the guest? I don't see how that's really practical.
>>> What's wrong with telling control tools that a certain page is unpopulated (from which they will be able to imply that it's all clear from the guest's pov)? Even outside of the current problem, I would think that's more efficient than allocating the page. Of course, the control tools need to be able to cope with that. And it may also be necessary to distinguish between read and read/write mappings being established (and for r/w ones the option of populating at access time rather than at creation time would need to be explored).
>> I wouldn't be opposed to some form of getpageframeinfo4. It's not just PoD we are talking about here. Is the page paged out? Is the page shared?
>>
>> Right now we have global per-domain queries (domaininfo), or individual gfn debug memctls. A batched interface with richer information would be a blessing for debugging or diagnosis purposes.
>>
>> The first order of business is exposing the type. Do we really want to expose the whole range of p2m_* types or just "really useful" ones like is_shared, is_pod, is_paged, is_normal? An argument for the former is that the mem event interface already pumps the p2m_* type up the stack.
>>
>> The other useful bit of information I can think of is exposing the shared ref count.
> I think that, just like the gfn_to_mfn() interface, we need an "I care about the details" interface and an "I don't care about the details" interface. If a page isn't present, or needs to be un-shared, or is PoD and not currently available, then maybe dom0 callers trying to map that page should get something like -EAGAIN? Just something that indicates, "This page isn't here at the moment, but may be here soon." What do you think?

Sure. Right now you get -ENOENT if dom0 tries to mmap a foreign frame that is paged out. Kernel-level backends get the same with grant mappings. As a side-effect, the hypervisor has triggered the page in, so one of the next retries should succeed.

My opinion is that you should expand the use of -ENOENT for this newly-minted "delayed PoD" case. Iiuc, PoD would just succeed or crash the guest prior to this conversation. That way, no new code is needed in the upper layers (neither libxc nor kernel backends) to deal with delayed PoD allocations. The retry loops that paging needs are already in place, and PoD leverages them.

Sharing can get -ENOMEM when breaking shares. Much like paging, the hypervisor will have kicked the appropriate helper, so a retry with an expectation of success could be put in place. (Never mind that there are no in-tree helpers atm: what to do, balloon dom0 to release more mem?) I can look into making it uniform with the above cases.

As for the detailed interface, getpageframeinfo4 sounds like a good idea. Otherwise, you need a new privcmd mmap interface in the kernel ... snore ;)

Andres
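The dom0-side retry pattern being referred to, as a sketch - assuming the mapping call fails with errno set to ENOENT while the hypervisor pages the frame in (or, under this proposal, satisfies a delayed PoD allocation); the retry bound and delay are arbitrary illustration values.

    #include <errno.h>
    #include <sys/mman.h>
    #include <unistd.h>
    #include <xenctrl.h>

    /* Retry a foreign mapping that can transiently fail with ENOENT while
     * the frame is being paged in (or, per the proposal above, while a
     * delayed PoD allocation is being satisfied). */
    static void *map_gfn_retry(xc_interface *xch, uint32_t dom,
                               unsigned long gfn)
    {
        int tries;

        for ( tries = 0; tries < 100; tries++ )
        {
            void *p = xc_map_foreign_range(xch, dom, XC_PAGE_SIZE,
                                           PROT_READ, gfn);
            if ( p != NULL || errno != ENOENT )
                return p;     /* mapped, or a non-transient failure */
            usleep(1000);     /* give the helper/hypervisor time to act */
        }
        return NULL;
    }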
Jan Beulich
2012-Aug-09 08:37 UTC
Re: PoD code killing domain before it really gets started
>>> On 07.08.12 at 17:08, George Dunlap <george.dunlap@eu.citrix.com> wrote:
> On 07/08/12 14:29, Jan Beulich wrote:
>> Hmm, that would look more like a hack to me.
>>
>> How about doing the initial check as suggested earlier
>>
>>     if ( p2md->pod.entry_count == 0 && d->tot_pages > 0 )
>>         goto out;
>>
>> and the latter check in a similar way
>>
>>     if ( pod_target > p2md->pod.entry_count && d->tot_pages > 0 )
>>         pod_target = p2md->pod.entry_count;
>>
>> (which would still take care of any ballooning activity)? Or are there any other traps to fall into?
> The "d->tot_pages > 0" seems more like a hack to me. :-) What's hackish about having an interface like this?
> * allocate_pod_mem()
> * for() { populate_pod_mem() }
> * [Boot VM]
> * set_pod_target()
>
> Right now set_pod_mem() is used both for initial allocation and for adjustments. But it seems like there's good reason to make a distinction.

So that one didn't take hypercall preemption into account. While adjusting this to use "d->tot_pages > p2md->pod.count", which in turn I converted to simply "populated > 0" (moving the calculation of "populated" earlier), I noticed that there's quite a bit of type inconsistency: I first stumbled across p2m_pod_set_mem_target()'s "pod_target" being "unsigned" instead of "unsigned long", then noticed that none of the fields in struct p2m_domain's pod sub-structure are "long", as at least the GFNs ought to be (the counts should be too, not only for consistency, but also because they're signed, and hence overflow earlier). As a consequence, the PoD code itself would also need careful review, as there are a number of questionable uses of plain "int" (and, quite the opposite, two cases of "order" pointlessly being "unsigned long").

Jan