Jan Beulich
2007-Sep-27 09:51 UTC
[Xen-devel] non-zero order allocations in shadow code may prevent live migration
Tim,

after chasing a lot of dead ends on a customer issue (he can't reliably run live migration), I finally concluded that the problem can only be explained by the non-zero order allocations done in shadow code (on x86-64 and x86-32/PAE). However, from a PV-domain live-migration perspective it would seem to me that these order-2 allocations are entirely pointless; there are really just two cases where non-zero order allocations are needed: a guest in 32-bit non-PAE mode (which is either PV on a 32-bit non-PAE hypervisor, in which case no non-zero order allocations are needed at all, or HVM), and shadow_alloc_p2m_pages(). The latter is neither used for live migration nor does it really require non-zero order allocations: its sole caller is shadow_alloc_p2m_page(), which only ever wants to return single pages (i.e. allocating more than one page here acts at best as a shortcut, and I think there's very little to gain from it).

So the bottom line is: sh_set_allocation() really shouldn't need to allocate non-zero order pages except for HVM domains.

As this implies quite a few changes, before going that route I'd like to understand whether I'm mistaken about anything here.

Thanks, Jan

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
Tim Deegan
2007-Sep-27 10:41 UTC
[Xen-devel] Re: non-zero order allocations in shadow code may prevent live migration
Hi,

At 10:51 +0100 on 27 Sep (1190890315), Jan Beulich wrote:
> after a lot of walking dead end routes with a customer issue stating that he
> can't reliably run live migration I finally concluded that the problem can only be
> explained by the non-zero order allocations done in shadow code (on x86-64
> and x86-32/pae).

Gosh. Are you running with almost all memory in use and failing to allocate shadow memory? Have you seen sh_set_allocation return -ENOMEM when there is enough memory on the page allocator's free lists? Xend's ballooning rules have been wrong more than once before, and are what I'd suspect first.

> However, from a PV-domain-live-migration perspective it
> would seem to me that these order 2 allocations are entirely
> pointless;

You're absolutely right. The order-2 allocations are only used for shadowing 32-bit non-PAE guests on PAE or 64-bit Xen, which only happens for HVM guests. (They were originally used for shadowing both kinds of PAE guests as well, but that's done differently now.) shadow_alloc_p2m_pages() only does order-2 allocations to prevent it from fragmenting the shadow pool and undermining the assertion that you can always free up an order-2 block.

> So the bottom line is - sh_set_allocation() really shouldn't need to allocate
> non-zero order pages except for hvm domains.

Yep.

> As this implies quite a few changes, before going that route I'd like to
> understand whether I'm mistaken with anything here.

No, you're quite right. It should be possible to have the shadow allocator work with single pages for PV guests. The shadow_prealloc() call still needs to be able to make sure that four free pages are available so that we can populate an entire walk of the shadow pagetable without needing to reclaim in-use pages. Apart from that I can't think of any other problems right now.

Cheers, Tim.
--
Tim Deegan <Tim.Deegan@xensource.com>, XenSource UK Limited
Registered office c/o EC2Y 5EB, UK; company number 05334508
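The requirement that shadow_prealloc() secure four free pages before a walk is populated can be pictured with a small C sketch. This is a toy model only: it assumes one shadow page per level of a four-level walk, and the names TOY_LEVELS, toy_pages_needed and toy_walk_needs_no_reclaim are invented for the illustration and do not exist in Xen.

```c
#include <stdbool.h>

/* Assumption for the sketch: a full walk touches four levels, and each
 * level that is not yet shadowed costs exactly one single (order-0) page. */
#define TOY_LEVELS 4

/* Number of pages a walk has to allocate: one per level not yet shadowed. */
static unsigned int toy_pages_needed(const bool shadowed[TOY_LEVELS])
{
    unsigned int need = 0;

    for (unsigned int lvl = 0; lvl < TOY_LEVELS; lvl++)
        if (!shadowed[lvl])
            need++;
    return need;
}

/* True when the walk can be populated from free pages alone, i.e. without
 * reclaiming shadow pages that are currently in use. */
static bool toy_walk_needs_no_reclaim(unsigned int free_pages,
                                      const bool shadowed[TOY_LEVELS])
{
    return free_pages >= toy_pages_needed(shadowed);
}
```

In this model, four free single pages are always enough, whatever state the walk starts in, which is why the guarantee does not by itself call for a higher-order block.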
Tim Deegan
2007-Sep-27 10:50 UTC
[Xen-devel] Re: non-zero order allocations in shadow code may prevent live migration
At 10:51 +0100 on 27 Sep (1190890315), Jan Beulich wrote:
> So the bottom line is - sh_set_allocation() really shouldn't need to allocate
> non-zero order pages except for hvm domains.

Actually, if you're having fragmentation issues, won't that stop you from booting HVM domains as well?

Tim.
Jan Beulich
2007-Sep-27 11:19 UTC
[Xen-devel] Re: non-zero order allocations in shadow code may prevent live migration
>Gosh. Are you running with almost all memory in use and failing to
>allocate shadow memory? Have you seen sh_set_allocation return -ENOMEM
>when there is enough memory on the page allocator's free lists? Xend's
>ballooning rules have been wrong more than once before, and are what I'd
>suspect first.

Yes, that's the scenario (64G physical memory, one PV domain started with 4G assigned, live migration back and forth between two hosts, perhaps intertwined with other domain creation activities, plus perhaps ballooning Dom0 back up after domain termination). But it isn't the ballooning rules: I verified there's about 21M free (xm info) before migration starts (or after it failed), and the tools estimate the need correctly:

    DEBUG (balloon:146) Balloon: 21888 KiB free; need 21504; done.

while shadow still fails (some of the messages might be temporary debugging aids of ours):

    (XEN) sh error: set_sh_allocation(): current 0 target 4608
    (XEN) sh error: set_sh_allocation(): failed to allocate shadow pages.
    (XEN) sh error: set_sh_allocation(): current 4212 target 0
    (XEN) sh error: shadow_one_bit_enable(): shadow_one_bit_enable() failed memory allocation
    (XEN) sh error: shadow_log_dirty_enable(): shadow_log_dirty_enable() received (errno = -12) from shadow_one_bit_enabled()

4608 pages are just 18432k, so there is about 3M of fragmented and hence unusable space.

So then I'll go ahead with implementing the described change (I'm actually intending to have shadow_prealloc() take not just an order, but also a count parameter - in a number of places it is being called with SHADOW_MAX_ORDER for no reason other than wanting 3 or 4 single pages).

Jan
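The arithmetic in the message above, and the way free memory can exist without a single usable order-2 block, can be sketched in C. This is a minimal illustration assuming 4 KiB pages; the bitmap representation and the names pages_to_kib and free_blocks_of_order are made up for the example and are not Xen's page allocator.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PAGE_KIB 4u   /* assumption: 4 KiB pages, as on x86 */

/* Convert a page count to KiB. */
static unsigned long pages_to_kib(unsigned long pages)
{
    return pages * PAGE_KIB;
}

/*
 * Count the naturally aligned blocks of 2^order pages that are entirely
 * free. free_map[i] is true when page i is free (toy bitmap only).
 */
static size_t free_blocks_of_order(const bool *free_map, size_t npages,
                                   unsigned int order)
{
    size_t block, count = 0;

    assert(order < 8 * sizeof(size_t));   /* keep the shift well defined */
    block = (size_t)1 << order;

    for (size_t base = 0; base + block <= npages; base += block) {
        bool all_free = true;

        for (size_t i = 0; i < block; i++)
            if (!free_map[base + i])
                all_free = false;
        if (all_free)
            count++;
    }
    return count;
}
```

With these helpers, pages_to_kib(4608) gives 18432, and 21888 - 18432 leaves 3456 KiB, the "about 3M" mentioned above. A bitmap of eight pages in which pages 0 and 4 are in use has six free pages, yet free_blocks_of_order() reports no free order-2 block at all: the situation an order-2-only allocator cannot make use of.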
Jan Beulich
2007-Sep-27 11:28 UTC
[Xen-devel] Re: non-zero order allocations in shadow code may prevent live migration
>>> Tim Deegan <Tim.Deegan@xensource.com> 27.09.07 12:50 >>>
>At 10:51 +0100 on 27 Sep (1190890315), Jan Beulich wrote:
>> So the bottom line is - sh_set_allocation() really shouldn't need to allocate
>> non-zero order pages except for hvm domains.
>
>Actually, if you're having fragmentation issues, won't that stop you
>from booting HVM domains as well?

I'm sure it would, but in the given case the customer just cares about PV. However, HVM would likely see the issue much less frequently if (as I assume, but didn't verify) shadow mode gets enabled before normal domain memory gets allocated, because that would generally allow a much wider pool for picking out the needed higher order pages.

Nevertheless I would think that even for HVM domains the shadow code shouldn't need to make its entire allocation in order-2 chunks, but could limit this to just the amount it really knows it'll need (1 per vCPU as I understand it), then continue with order-1 chunks (not sure about their count, but for forward progress I think you'll need at most 6 {,f}l1_32_shadow pages per vCPU).

Jan
Tim Deegan
2007-Sep-27 12:07 UTC
[Xen-devel] Re: non-zero order allocations in shadow code may prevent live migration
At 12:19 +0100 on 27 Sep (1190895561), Jan Beulich wrote:
> So then I'll go ahead with implementing the described change (I'm actually
> intending to have shadow_prealloc() take not just an order, but also a count
> parameter - in a number of places it is being called with SHADOW_MAX_ORDER
> for no reason other than wanting 3 or 4 single pages).

shadow_prealloc could just as easily take no arguments and always free four pages in the highest order that's in use. There's no real benefit from fine-tuning it as the operations that free shadow memory operate in much bigger increments anyway.

Cheers, Tim.
Tim Deegan
2007-Sep-27 12:13 UTC
[Xen-devel] Re: non-zero order allocations in shadow code may prevent live migration
At 12:28 +0100 on 27 Sep (1190896090), Jan Beulich wrote:
> Nevertheless I would think that
> even for HVM domains the shadow shouldn't need to make its entire
> allocation in order 2 chunks, but could limit this to just the amount it really
> knows it'll need (1 per vCPU as I understand it),

For correctness, yes. At least 1 per process in the working set for anything like reasonable performance.

> then continue with order 1
> chunks (not sure about their count, but for forward progress I think you'll
> need at most 6 {,f}l1_32_shadow pages per vCPU).

ISTR the number came out at something unpleasant like thirteen after all the VAs you might need for a single x86 instruction were accounted for. But then we realised that even a heavily pessimistic lower bound was so far below the threshold of acceptable performance that it wasn't worth being precise. :)

Tim.
Jan Beulich
2007-Sep-27 12:58 UTC
[Xen-devel] Re: non-zero order allocations in shadow code may prevent live migration
>>> Tim Deegan <Tim.Deegan@xensource.com> 27.09.07 14:07 >>>
>At 12:19 +0100 on 27 Sep (1190895561), Jan Beulich wrote:
>> So then I'll go ahead with implementing the described change (I'm actually
>> intending to have shadow_prealloc() take not just an order, but also a count
>> parameter - in a number of places it is being called with SHADOW_MAX_ORDER
>> for no reason other than wanting 3 or 4 single pages).
>
>shadow_prealloc could just as easily take no arguments and always free
>four pages in the highest order that's in use. There's no real benefit
>from fine-tuning it as the operations that free shadow memory operate in
>much bigger increments anyway.

While doing so I realized that this would fit well with the suggested HVM related changes in my other response. However, you seem to indicate that such changes aren't worth it from a performance perspective, so I'm not sure they'd be worth doing.

OTOH, in a long lived production environment fragmentation is likely to become a significant issue, and it is certainly not desirable to have good chances of HVM domain creation starting to fail after a system has been up for a long time (in theory it might be necessary to balloon out of Dom0 three quarters of the installed physical memory plus one quarter of the to be allocated amount in order to guarantee that shadow allocation can succeed). But I also agree that any attempt to change this without changing the parts of the shadow code that depend on non-zero order pages will only reduce the likelihood of running into the problem; it won't eliminate it.

So I guess for a first cut I'll go with your suggestion of removing the 'order' parameter of shadow_prealloc().

Jan
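The "three quarters plus one quarter" figure above can be reproduced with a short worst-case calculation, sketched here in C. The assumption is an adversarial layout: every aligned order-2 block that is not completely free still has three of its four pages free, so those free pages are useless for order-2 requests. The function name worst_case_free_needed is invented for this illustration.

```c
#include <assert.h>

/*
 * Smallest number of free pages that guarantees `want_pages` (a multiple
 * of 4) can be allocated as aligned order-2 blocks out of `total_pages`,
 * however badly the free pages are scattered.
 *
 * Worst case without success: want_blocks - 1 blocks are fully free and
 * every other block has exactly 3 free pages. One free page more than
 * that forces another fully free block to exist.
 */
static unsigned long worst_case_free_needed(unsigned long total_pages,
                                            unsigned long want_pages)
{
    unsigned long total_blocks = total_pages / 4;
    unsigned long want_blocks = want_pages / 4;

    assert(want_blocks >= 1 && want_blocks <= total_blocks);

    return 4 * (want_blocks - 1)
         + 3 * (total_blocks - (want_blocks - 1))
         + 1;
}
```

The expression simplifies to 3 * total_pages / 4 + want_pages / 4. For a 64G host (16777216 pages of 4 KiB) and a 4608-page target, borrowed here from the earlier report purely as example numbers, it yields 12584064 pages, i.e. 48G plus 4.5M that would have to be free before success is guaranteed.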
Jan Beulich
2007-Sep-27 13:21 UTC
[Xen-devel] Re: non-zero order allocations in shadow code may prevent live migration
>>> "Jan Beulich" <jbeulich@novell.com> 27.09.07 14:58 >>>
>So I guess for a
>first cut I'll go with your suggestion of removing the 'order' parameter of
>shadow_prealloc().

An additional consideration: allowing both order and count to be specified in the call to shadow_prealloc() might also help to avoid flushes, since in many cases the need is really just for a number of single pages, not one big one.
Jan
Tim Deegan
2007-Sep-27 13:43 UTC
[Xen-devel] Re: non-zero order allocations in shadow code may prevent live migration
At 14:21 +0100 on 27 Sep (1190902888), Jan Beulich wrote:
> An additional consideration: Allowing to specify order and count in the call
> to shadow_prealloc() might also help to avoid flushes, since in many cases
> the need is really just for a number of single pages, not one big one.

True. Might as well do it that way, then.

Cheers, Tim.
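Why an order-plus-count interface can avoid needless flushes may be easier to see in a toy model. The C sketch below is not the Xen implementation: struct toy_pool, toy_available and toy_prealloc_ok are hypothetical names, the pool is reduced to a count of free blocks per order, and TOY_MAX_ORDER merely mirrors the order-2 maximum discussed in the thread.

```c
#include <stdbool.h>

#define TOY_MAX_ORDER 2   /* assumption: largest block is 4 pages (order 2) */

/* Toy shadow pool: number of free blocks on each per-order free list. */
struct toy_pool {
    unsigned int free_blocks[TOY_MAX_ORDER + 1];
};

/* Blocks of `order` the pool can supply, counting larger blocks that
 * could be split down to the requested size. */
static unsigned int toy_available(const struct toy_pool *p,
                                  unsigned int order)
{
    unsigned int n = 0;

    for (unsigned int o = order; o <= TOY_MAX_ORDER; o++)
        n += p->free_blocks[o] << (o - order);
    return n;
}

/* True when `count` blocks of `order` are available without reclaiming
 * (and hence without flushing) any shadow pages that are in use. */
static bool toy_prealloc_ok(const struct toy_pool *p,
                            unsigned int order, unsigned int count)
{
    return order <= TOY_MAX_ORDER && toy_available(p, order) >= count;
}
```

With four free single pages and nothing larger, a request for (order 0, count 4) succeeds immediately in this model, whereas a request for one order-2 block fails and would trigger reclaim even though the caller only wanted four single pages.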