Khoa Huynh
2005-Sep-13 16:10 UTC
[Xen-devel] [PATCH][VT] Patch to allow VMX domains to be destroyed or shut down cleanly
The problem: A VMX domain cannot be destroyed or shut down completely. After trying to destroy or shut down a VMX domain, the domain data structures still exist, and the domain can still be seen in 'xm list' even though the amount of memory is shown as 0.

The cause: VMX domains use shadow mode (with the refcount, translate, and external flags set), which uses external shadow tables for address translation and manipulates page reference counts differently than the usual page-table-based, non-VMX mode. When we tear down a VMX domain, we disable shadow mode. However, when we disable shadow mode, we do not fix up the shadow page reference counts; this was thought to be OK because the VMX domain is dying anyway. In fact, there is a flag (unsigned int) called shadow_tainted_refcnts to indicate that the shadow page reference counts are "tainted" when the domain is dying. This flag, which is set in shadow_mode_disable(), allows us to ignore the (tainted) page reference counts while handling pages. (If anyone has more insight into this flag, I'd appreciate it.)

As a result, after we release the memory pages belonging to a VMX domain, there are pages with "tainted" ref counts which cannot be released immediately. This leads to an incorrect ref count on the VMX domain itself and prevents the domain's other resources (e.g. hash tables, event channels, grant tables, etc.) from being released. This is why VMX domains cannot currently be destroyed or shut down.

I have looked at scenarios where simple operations are done in Windows XP running in a VMX domain. In these scenarios, anywhere from 2 to 100 pages are still not released when we try to relinquish all memory from the VMX domain (during a destroy or shutdown operation). These pages have tainted shadow reference counts (could these be external references?).
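[Editor's note: a minimal stand-alone C model of the refcounting behaviour described above, not Xen's actual code. The structures and the page_stuck() helper are simplifications invented here for illustration; only the names get_page/put_page/shadow_tainted_refcnts mirror Xen's.]

```c
#include <assert.h>
#include <stdbool.h>

struct page_info { unsigned long count_info; };

struct domain {
    bool shadow_tainted_refcnts;  /* set in shadow_mode_disable() at teardown */
    bool dying;
    int  refcnt;                  /* domain reference count (simplified) */
};

static void get_page(struct page_info *pg, struct domain *d)
{
    pg->count_info++;
    d->refcnt++;      /* simplified: each page reference pins the domain */
}

/* Returns true when the page's count drops to zero and it can be freed. */
static bool put_page(struct page_info *pg, struct domain *d)
{
    pg->count_info--;
    d->refcnt--;
    return pg->count_info == 0;
}

/* After teardown, a tainted page can be left with count_info > 0: the page
 * cannot be freed, so the domain's refcnt never reaches zero and the domain
 * structures linger -- the symptom seen in 'xm list'. */
static bool page_stuck(const struct page_info *pg, const struct domain *d)
{
    return pg->count_info != 0 && d->shadow_tainted_refcnts && d->dying;
}
```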
Since xend reports the amount of memory in MB, it reports the VMX domain's memory as 0 MB after we try to destroy or shut down the domain, but in reality anywhere from 4 KB to 400 KB is still left (at least in the scenarios I have examined).

Proposed fix: In the patch below, I have modified relinquish_memory() to return an indication of whether all pages on the page list were released successfully. If not, I check whether we indeed have tainted shadow ref counts and whether the domain is dying. If all of these conditions are met, the domain's reference count is adjusted to what it should be. Note that the patch does NOT automatically set any ref count to 0 and force the destruction of the domain. Instead, it adjusts the domain's ref count to the value it would have if we did not have tainted ref counts. The rest of the domain's resources should then disappear automatically when they are supposed to.

IMHO, the patch below is _probably_ the simplest, least intrusive, and safest way to fix this problem. It is the simplest because it is a relatively small patch - certainly much smaller than the amount of code that would be required to fix up the shadow reference counts properly (I am not even sure that is realistic - if anyone has ideas on how to do this, I'd appreciate them). It is the least intrusive because all changes are limited to two routines in a single file (xen/arch/x86/domain.c), and both routines deal with relinquishing a domain's memory pages. It is the safest because it only tweaks the domain's reference count when absolutely necessary, and does not tinker with anything else.

I have tested this to some extent with both VMX (Windows XP) and non-VMX domains. So far so good.

I have opened Bug 225 for this problem: http://bugzilla.xensource.com/bugzilla/show_bug.cgi?id=225

Any suggestions, comments, etc. are welcome. Thanks in advance.
--- ./xen/arch/x86/domain.c.org 2005-08-23 21:01:09.000000000 -0500
+++ ./xen/arch/x86/domain.c     2005-09-12 22:22:05.000000000 -0500
@@ -975,7 +975,7 @@
 #define vmx_relinquish_resources(_v) ((void)0)
 #endif
 
-static void relinquish_memory(struct domain *d, struct list_head *list)
+static int relinquish_memory(struct domain *d, struct list_head *list)
 {
     struct list_head *ent;
     struct pfn_info  *page;
@@ -1029,15 +1029,17 @@
         ent = ent->next;
         put_page(page);
     }
-
+
     spin_unlock_recursive(&d->page_alloc_lock);
+
+    return (list == list->next);
 }
 
 void domain_relinquish_resources(struct domain *d)
 {
     struct vcpu *v;
     unsigned long pfn;
-
+
     BUG_ON(!cpus_empty(d->cpumask));
 
     physdev_destroy_state(d);
@@ -1080,12 +1082,19 @@
     for_each_vcpu(d, v)
         destroy_gdt(v);
 
-    /* Relinquish every page of memory. */
-    relinquish_memory(d, &d->xenpage_list);
-    relinquish_memory(d, &d->page_list);
+    /* Relinquish every page of memory.
+     * If the domain is dying, and if we have tainted shadow reference
+     * counts, adjust the domain's reference count.
+     */
+    if (!relinquish_memory(d, &d->xenpage_list))
+        if (shadow_tainted_refcnts(d) && test_bit(_DOMF_dying, &d->domain_flags))
+            put_domain(d);
+
+    if (!relinquish_memory(d, &d->page_list))
+        if (shadow_tainted_refcnts(d) && test_bit(_DOMF_dying, &d->domain_flags))
+            put_domain(d);
 }
-
 /*
  * Local variables:
  * mode: C

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
Keir Fraser
2005-Sep-13 19:10 UTC
Re: [Xen-devel] [PATCH][VT] Patch to allow VMX domains to be destroyed or shut down cleanly
On 13 Sep 2005, at 17:10, Khoa Huynh wrote:

> I have tested this to some extent with both VMX (Windows XP) and
> non-VMX domains. So far so good.

It's good to understand the problem some more, and I think I like your general approach. However, although your current patch drops the domain refcnts, it leaves the tainted-refcnt pages allocated. Worse, they are left allocated and with a dangling domain pointer. Probably we should hit those page refcnts on the head (i.e., force them to zero). One fly in the ointment is if some refcnts are non-zero because other domains have mappings of them (e.g., the device model in domain0)...

 -- Keir
Khoa Huynh
2005-Sep-13 21:37 UTC
Re: [Xen-devel] [PATCH][VT] Patch to allow VMX domains to be destroyed or shut down cleanly
Keir Fraser wrote:

>> I have tested this to some extent with both VMX (Windows XP) and
>> non-VMX domains. So far so good.
>
> It's good to understand the problem some more, and I think I like your
> general approach. However, although your current patch drops the domain
> refcnts, it leaves the tainted-refcnt pages allocated. Worse, they are
> left allocated and with a dangling domain pointer. Probably we should
> hit those page refcnts on the head (i.e., to zero). One fly in the
> ointment is if some refcnts are non-zero because other domains have
> mappings of them (e.g., device model in domain0)...

Thanks for your comments. The current code *does* drop the page reference counts for those tainted-refcnt pages. Even after being decremented, the ref counts for these pages are still not 0 (most are 1; some are 2, 3, or even 4).

I also tried to scrub those tainted-refcnt pages (i.e., attaching them to the page_scrub list so that, when the time comes after the domain is killed, we can zero-fill them and then free them back to the heap). However, when I did this, the system (domain0) crashed. This led me to believe that some of these tainted-refcnt pages may have external mappings and cannot be freed immediately. I assume that these pages will be freed eventually, but I need to investigate more.

Please let me know if you have any ideas or suggestions. Thanks.

Regards,
Khoa Huynh
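[Editor's note: a stand-alone sketch of the deferred-scrub idea described above - queue a page on a scrub list instead of freeing it immediately, then zero-fill and release it in a later pass. This is a simplified user-space stand-in for Xen's page_scrub mechanism, not the actual hypervisor code; the list structure and function names are invented here.]

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE 4096

struct scrub_ent {
    unsigned char    *page;
    struct scrub_ent *next;
};

static struct scrub_ent *scrub_list;  /* singly-linked list of pages awaiting scrub */

/* Defer freeing: push the page onto the scrub list instead of releasing it. */
static void queue_for_scrub(unsigned char *page)
{
    struct scrub_ent *e = malloc(sizeof(*e));
    e->page = page;
    e->next = scrub_list;
    scrub_list = e;
}

/* Later pass (after the domain is gone): zero-fill each queued page so no
 * stale guest data leaks, then release it. Returns the number scrubbed. */
static int do_page_scrub(void)
{
    int n = 0;
    while (scrub_list) {
        struct scrub_ent *e = scrub_list;
        scrub_list = e->next;
        memset(e->page, 0, PAGE_SIZE);  /* scrub before returning to the heap */
        free(e->page);
        free(e);
        n++;
    }
    return n;
}
```

The crash Khoa describes is consistent with this scheme being unsafe when a queued page is still mapped elsewhere (e.g. by the device model): the scrub pass frees memory another domain is actively using.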
Keir Fraser
2005-Sep-13 22:04 UTC
Re: [Xen-devel] [PATCH][VT] Patch to allow VMX domains to be destroyed or shut down cleanly
On 13 Sep 2005, at 22:37, Khoa Huynh wrote:

> Thanks for your comments. The current code *does* drop the page
> reference counts for those tainted-refcnt pages. Even after
> getting decremented, the ref counts for these pages
> are still not 0 (most are 1, some are 2, 3, or even 4).

I mean forcibly decrement them to zero and free them right there and then. Of course, as you point out, the problem is that some of the pages are mapped in domain0. I'm not sure how we can distinguish tainted refcnts from genuine external references. Perhaps there's a proper way we should be destructing the full shadow pagetables such that the refcnts end up at zero.

 -- Keir
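[Editor's note: a hedged sketch of the forcible-free idea above. The open question in the thread is precisely that we cannot easily tell a tainted refcount from a genuine external mapping; here that distinction is reduced to an assumed flag purely to show the shape of the logic. All names are invented for illustration.]

```c
#include <assert.h>
#include <stdbool.h>

struct page_info {
    unsigned long count_info;
    bool externally_mapped;  /* ASSUMPTION: in real Xen there is no such
                                ready-made flag; detecting this is the hard part */
};

/* Forcibly drop a leftover refcount to zero and count the page as freed,
 * unless it has a genuine external mapping (e.g. the domain0 device model),
 * in which case freeing it would crash the mapper. Returns true if freed. */
static bool force_free_page(struct page_info *pg, int *freed_count)
{
    if (pg->externally_mapped)
        return false;          /* must wait for the external reference to go */
    pg->count_info = 0;        /* "hit the refcnt on the head" */
    (*freed_count)++;
    return true;
}
```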
Khoa Huynh
2005-Sep-19 05:05 UTC
Re: [Xen-devel] [PATCH][VT] Patch to allow VMX domains to be destroyed or shut down cleanly
Keir Fraser wrote:

> I mean forcibly decrement them to zero and free them right there and
> then. Of course, as you point out, the problem is that some of the
> pages are mapped in domain0. I'm not sure how we can distinguish
> tainted refcnts from genuine external references. Perhaps there's a
> proper way we should be destructing the full shadow pagetables such
> that the refcnts end up at zero.

Thanks for your comment. I have done extensive tracing through the domain destruction code in the hypervisor over the last few days. The bottom line: after the domain destruction code in the hypervisor is done, all shadow pages were indeed freed up - even though the shadow_tainted_refcnts flag was set. I now believe the remaining pages are genuinely externally referenced (possibly by the qemu device model still running in domain0). Here are more details on what I have found:

Ideally, when we destroy or shut down a VMX domain, the general page reference counts end up at 0 in shadow mode, so that the pages can be released properly from the domain. I have traced quite a bit of code for different scenarios involving Windows XP running in a VMX domain. I only did simple operations in Windows XP, but I tried to destroy the VMX domain at different times (e.g. during Windows XP boot, during simple operations, after Windows XP had been shut down, etc.).

For non-VMX (Linux) domains, after we relinquish memory in domain_relinquish_resources(), all pages on the domain's page list indeed had a reference count of 0 and were properly freed back to the xen heap - just as we expected.

For VMX (e.g., Windows XP) domains, after we relinquish memory in domain_relinquish_resources(), depending on how much activity had taken place in Windows XP, there were anywhere from 2 to 100 pages remaining just before the domain's structures were freed by the hypervisor. Most of these pages still had a page reference count of 1, and therefore could not be freed from the heap by the hypervisor.
This prevents the rest of the domain's resources from being released, which is why 'xm list' still shows the VMX domains after they were destroyed.

In shadow mode, the following things can be reflected in the page (general) reference counts:

(a) General stuff:
    - page is allocated (PGC_allocated)
    - page is pinned
    - page is pointed to by CR3s
(b) Shadow page tables (l1, l2, hl2, etc.)
(c) Out-of-sync entries
(d) Grant table mappings
(e) External references (not through the grant table)
(f) Monitor page table references (external shadow mode)
(g) Writable PTE predictions
(h) GDTs/LDTs

So I put in a lot of instrumentation and tracing code, and made sure that the above items were taken into account and removed from the page reference counts during the domain destruction sequence in the hypervisor. During this sequence, we disable shadow mode (shadow_mode_disable()) and the shadow_tainted_refcnts flag is set. However, much to my surprise, the page reference counts were properly taken care of in shadow mode, and all shadow pages (including those in l1, l2, and hl2 tables and snapshots) were all freed up.

In particular, here is where each item on the above list was taken into account during the domain destruction sequence in the hypervisor:

(a) General stuff:
    - None of the remaining pages have the PGC_allocated flag set.
    - None of the remaining pages are still pinned.
    - The monitor shadow ref was 0, and all pages pointed to by CR3s were taken care of in free_shadow_pages().
(b) All shadow pages (including those in l1, l2, hl2 tables and snapshots) were freed properly. I implemented counters to track all shadow page promotions/allocations and demotions/deallocations throughout the hypervisor code, and at the end, after we relinquished all domain memory pages, these counters did indeed return to 0 - as expected.
(c) Out-of-sync entries -> handled in free_out_of_sync_state(), called by free_shadow_pages().
(d) Grant table mappings -> the count of active grant table mappings is 0 after the domain destruction sequence in the hypervisor has executed.
(e) External references not mapped via the grant table -> I believe these include the qemu-dm pages which still remain after we relinquish all domain memory pages, as qemu-dm may still be active after a VMX domain has been destroyed.
(f) External monitor page references -> all references from the monitor page table are dropped in vmx_relinquish_resources(), and the monitor table itself is freed in domain_destruct(). In fact, in my code traces, the monitor shadow reference count was 0 after the domain destruction code ran.
(g) Writable PTE predictions -> I didn't see any pages in this category in my code traces, but if there were any, they would be freed in free_shadow_pages().
(h) GDTs/LDTs -> these were destroyed in destroy_gdt() and invalidate_shadow_ldt(), called from domain_relinquish_resources().

Based on the code instrumentation and tracing above, I am pretty confident that the shadow page reference counts are handled properly during the domain destruction sequence in the hypervisor. There is a problem in keeping track of the shadow page count (domain->arch.shadow_page_count), and I will submit a patch to fix this shortly; however, it does not really affect how shadow pages are handled.

Consequently, the pages that still remain after the domain destruction sequence in the hypervisor are externally referenced and may belong to the qemu device model running in domain0. The fact that qemu-dm is still active for some time after a VMX domain has been torn down in the hypervisor is evident from the tools code (python). In fact, if I forcibly free these remaining pages from the xen heap, the system/dom0 crashes.

Am I missing anything? Your comments, suggestions, etc. are welcome! Thanks for reading this rather long email :-)

Khoa H.
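[Editor's note: a trivial stand-alone illustration of the audit-counter technique Khoa describes in point (b) above - pair every shadow-page allocation/promotion with a counter increment and every free/demotion with a decrement, so a nonzero balance at teardown flags leaked shadow pages. The function names are invented for this sketch.]

```c
#include <assert.h>

static long shadow_page_balance;  /* allocations minus frees */

/* Call at every shadow page allocation or promotion site. */
static void shadow_page_alloc(void) { shadow_page_balance++; }

/* Call at every shadow page free or demotion site. */
static void shadow_page_free(void)  { shadow_page_balance--; }

/* After domain teardown this should read 0 if all shadow pages were freed;
 * any remainder localizes the leak to the shadow-page machinery itself. */
static long shadow_audit(void) { return shadow_page_balance; }
```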