Updates:
- Fixed bugs in v14: zombie domains, FreeBSD crash, crash at 4GiB, HVM crash (thank you to Roger Pau Monné for fixes to the last 3)
- Completely eliminated the PV emulation codepath

== RFC ==

We had talked about accepting the patch series as-is once I had the known bugs fixed; but I couldn't help making an attempt at using the HVM IO emulation codepaths so that we could completely eliminate having to use the PV emulation code, in turn eliminating some of the uglier "support" patches required to make the PV emulation code capable of running on a PVH guest. The idea for "admin" pio ranges would be that we would use the vmx hardware to allow the guest direct access, rather than the "re-execute with guest GPRs" trick that PV uses. (This functionality is not implemented by this patch series, so we would need to make sure it was sorted for the dom0 series.)

The result looks somewhat cleaner to me. On the other hand, because string in & out instructions use the full emulation code, it means opening up an extra 6k lines of code to PVH guests, including all the complexity of the ioreq path. (It doesn't actually send ioreqs, but since it shares much of the path, it shares much of the complexity.) Additionally, I'm not sure I've done it entirely correctly: the guest boots and the io instructions it executes seem to be handled correctly, but it may not be exercising the corner cases.

This also means no support for "legacy" forced invalid ops -- only native cpuid is supported in this series.

I have the fixes in another series, if people think it would be better to check in exactly what we had with bug fixes ASAP.

Other "open issues" on the design (which need not stop the series going in) include:

- Whether a completely separate mode is necessary, or whether just having HVM mode with some flags to disable / change certain functionality would be better

- Interface-wise: Right now PVH is special-cased for bringing up CPUs. Is this what we want to do going forward, or would it be better to try to make it more like PV (which was tried before and is hard), or more like HVM (which would involve having emulated APICs, &c &c).

== Summary ==

This patch series is a reworking of a series developed by Mukesh Rathor at Oracle. The entirety of the design and development was done by him; I have only reworked, reorganized, and simplified things in a way that I think makes more sense. The vast majority of the credit for this effort therefore goes to him. This version is labelled v14 because it is based on his most recent series, v11.

Because this is based on his work, I retain the "Signed-off-by" in patches which are based on his code. This is not meant to imply that he supports the modified version, only that he is involved in certifying the origin of the code for copyright purposes.

This patch series is broken down into several broad strokes:
* Miscellaneous fixes or tweaks
* Code motion, so future patches are simpler
* Introduction of the "hvm_container" concept, which will form the basis for sharing codepaths between hvm and pvh
* Start with PVH as an HVM container
* Disable unneeded HVM functionality
* Enable PV functionality
* Disable not-yet-implemented functionality
* Enable toolstack changes required to make PVH guests

This patch series can also be pulled from this git tree:
 git://xenbits.xen.org/people/gdunlap/xen.git out/pvh-v14

The kernel code for PVH guests can be found here:
 git://oss.oracle.com/git/mrathor/linux.git pvh.v9-muk-1
(That repo/branch also contains a config file, pvh-config-file)

Changes in v14 can be found inline; major changes since v13 include:

* Various bug fixes

* Use HVM emulation for IO instructions

* ...thus removing many of the changes required to allow the PV emulation codepath to work for PVH guests

Changes in v13 can be found inline; major changes since v12 include:

* Include Mukesh's toolstack patches (v4)

* Allocate hvm_param struct for PVH domains; remove patch disabling memevents

For those who have been following the series as it develops, here is a summary of the major changes from Mukesh's series (v11->v12):

* Introduction of "has_hvm_container_*()" macros, rather than using "!is_pv_*". The patch which introduces this also does the vast majority of the "heavy lifting" in terms of defining PVH.

* Effort is made to use as much common code as possible. No separate vmcs constructor, no separate vmexit handlers. More of a "start with everything and disable if necessary" approach rather than "start with nothing and enable as needed" approach.

* One exception is arch_set_info_guest(), where a small amount of code duplication meant a lot fewer "if(!is_pvh_domain())"s in awkward places

* I rely on things being disabled at a higher level and passed down. For instance, I no longer explicitly disable rdtsc exiting in construct_vmcs(), since that will happen automatically when we're in NEVER_EMULATE mode (which is currently enforced for PVH). Similarly for nested vmx and things relating to HAP mode.

* I have also done a slightly more extensive audit of is_pv_* and is_hvm_* and tried to do more restrictions.

* I changed the "enable PVH by setting PV + HAP", replacing it instead with a separate flag, just like the HVM case, since it makes sense to plan on using shadow in the future (although it is

Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>
CC: Mukesh Rathor <mukesh.rathor@oracle.com>
CC: Jan Beulich <beulich@suse.com>
CC: Tim Deegan <tim@xen.org>
CC: Keir Fraser <keir@xen.org>
CC: Ian Jackson <ian.jackson@citrix.com>
CC: Ian Campbell <ian.campbell@citrix.com>
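The distinction between "is PV", "is HVM" and "has an HVM container" described in the summary can be illustrated with a small standalone sketch. Everything here is hypothetical (the enum, the struct and the helper names are invented for illustration and are not copied from the series); it only shows how a third guest type can share the HVM-container paths without being classified as plain HVM.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical guest classification; names are illustrative only. */
enum guest_type_sketch { GUEST_PV, GUEST_PVH, GUEST_HVM };

struct domain_sketch {
    enum guest_type_sketch type;
};

/* Code that touches PV-only structures would switch on this. */
static bool is_pv(const struct domain_sketch *d)
{
    return d->type == GUEST_PV;
}

/* Code that only makes sense for a full HVM guest. */
static bool is_hvm(const struct domain_sketch *d)
{
    return d->type == GUEST_HVM;
}

/* Code shared by every guest that runs inside an HVM container. */
static bool has_hvm_container(const struct domain_sketch *d)
{
    return d->type != GUEST_PV;
}
```

With only PV and HVM guest types, is_pv() and !has_hvm_container() coincide; the two predicates start to differ from is_hvm() once a PVH-like type exists.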
George Dunlap
2013-Nov-04 12:14 UTC
[PATCH v14 01/17] Allow vmx_update_debug_state to be called when v!=current
Removing the assert allows the PVH code to call this during vmcs construction in a later patch, making the code more robust by removing duplicate code.

Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>
----
v13: Add vmx_vmcs_{enter,exit}

CC: Mukesh Rathor <mukesh.rathor@oracle.com>
CC: Jan Beulich <jbeulich@suse.com>
CC: Tim Deegan <tim@xen.org>
CC: Keir Fraser <keir@xen.org>
---
 xen/arch/x86/hvm/vmx/vmx.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index 9ca8632..fdb560e 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -1051,8 +1051,6 @@ void vmx_update_debug_state(struct vcpu *v)
 {
     unsigned long mask;
 
-    ASSERT(v == current);
-
     mask = 1u << TRAP_int3;
     if ( !cpu_has_monitor_trap_flag )
         mask |= 1u << TRAP_debug;
@@ -1061,7 +1059,10 @@ void vmx_update_debug_state(struct vcpu *v)
         v->arch.hvm_vmx.exception_bitmap |= mask;
     else
         v->arch.hvm_vmx.exception_bitmap &= ~mask;
+
+    vmx_vmcs_enter(v);
     vmx_update_exception_bitmap(v);
+    vmx_vmcs_exit(v);
 }
 
 static void vmx_update_guest_cr(struct vcpu *v, unsigned int cr)
-- 
1.7.9.5
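A toy model may help show why bracketing the hardware write with an enter/exit pair makes the function safe to call for a vcpu other than the running one. This is only a sketch under simplifying assumptions: all names (vcpu_model, ctx_enter, ctx_exit, write_hw_bitmap) are hypothetical, and the real vmx_vmcs_enter()/vmx_vmcs_exit() have different signatures and do considerably more work than swapping a pointer.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model: hardware writes always hit whichever context is loaded. */
struct vcpu_model {
    unsigned int exception_bitmap;     /* software copy */
    unsigned int hw_exception_bitmap;  /* stands in for the VMCS field */
};

static struct vcpu_model *loaded;      /* context loaded on this CPU */

/* Load v's context and return the one that was loaded before. */
static struct vcpu_model *ctx_enter(struct vcpu_model *v)
{
    struct vcpu_model *prev = loaded;

    loaded = v;
    return prev;
}

/* Restore the previously loaded context. */
static void ctx_exit(struct vcpu_model *prev)
{
    loaded = prev;
}

/* Like a VMWRITE: targets the loaded context, never an argument. */
static void write_hw_bitmap(unsigned int value)
{
    assert(loaded != NULL);
    loaded->hw_exception_bitmap = value;
}

/* Safe for any v, because the write is bracketed by enter/exit. */
static void update_debug_state(struct vcpu_model *v, unsigned int mask,
                               int enable)
{
    struct vcpu_model *prev;

    if ( enable )
        v->exception_bitmap |= mask;
    else
        v->exception_bitmap &= ~mask;

    prev = ctx_enter(v);
    write_hw_bitmap(v->exception_bitmap);
    ctx_exit(prev);
}
```

Without the bracket, the write in this model would land in whichever context happened to be loaded, which is the situation the removed ASSERT(v == current) used to guard against.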
George Dunlap
2013-Nov-04 12:14 UTC
[PATCH v14 02/17] libxc: Move temporary grant table mapping to end of memory
From: Roger Pau Monné <roger.pau@citrix.com>

In order to set up the grant table for HVM guests, libxc needs to map the grant table temporarily. At the moment, it does this by adding the grant page to the HVM guest's p2m table in the MMIO hole (at gfn 0xFFFFE), then mapping that gfn, setting up the table, then unmapping the gfn and removing it from the p2m table.

This breaks with PVH guests with 4G or more of ram, because there is no MMIO hole; so it ends up clobbering a valid RAM p2m entry, then leaving a "hole" when it removes the grant map from the p2m table. Since the guest thinks this is normal ram, when it maps it and tries to access the page, it crashes.

This patch maps the page at max_gfn+1 instead.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
 tools/libxc/xc_dom.h | 3 ---
 tools/libxc/xc_dom_boot.c | 14 ++++++++++++--
 2 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/tools/libxc/xc_dom.h b/tools/libxc/xc_dom.h
index 86e23ee..935b49e 100644
--- a/tools/libxc/xc_dom.h
+++ b/tools/libxc/xc_dom.h
@@ -18,9 +18,6 @@
 
 #define INVALID_P2M_ENTRY ((xen_pfn_t)-1)
 
-/* Scrach PFN for temporary mappings in HVM */
-#define SCRATCH_PFN_GNTTAB 0xFFFFE
-
 /* --- typedefs and structs ---------------------------------------- */
 
 typedef uint64_t xen_vaddr_t;
diff --git a/tools/libxc/xc_dom_boot.c b/tools/libxc/xc_dom_boot.c
index 71e1897..fdfeaf8 100644
--- a/tools/libxc/xc_dom_boot.c
+++ b/tools/libxc/xc_dom_boot.c
@@ -361,17 +361,27 @@ int xc_dom_gnttab_hvm_seed(xc_interface *xch, domid_t domid,
                            domid_t xenstore_domid)
 {
     int rc;
+    xen_pfn_t max_gfn;
     struct xen_add_to_physmap xatp = {
         .domid = domid,
         .space = XENMAPSPACE_grant_table,
         .idx = 0,
-        .gpfn = SCRATCH_PFN_GNTTAB
     };
     struct xen_remove_from_physmap xrfp = {
         .domid = domid,
-        .gpfn = SCRATCH_PFN_GNTTAB
     };
 
+    max_gfn = xc_domain_maximum_gpfn(xch, domid);
+    if ( max_gfn <= 0 ) {
+        xc_dom_panic(xch, XC_INTERNAL_ERROR,
+                     "%s: failed to get max gfn "
+                     "[errno=%d]\n",
+                     __FUNCTION__, errno);
+        return -1;
+    }
+    xatp.gpfn = max_gfn + 1;
+    xrfp.gpfn = max_gfn + 1;
+
     rc = do_memory_op(xch, XENMEM_add_to_physmap, &xatp, sizeof(xatp));
     if ( rc != 0 )
     {
-- 
1.7.9.5

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
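The collision described in the commit message can be checked with a few lines of arithmetic. The sketch below is a simplified model, not libxc code: gfn_model_t and both helpers are hypothetical, and it assumes guest RAM is contiguous from gfn 0 up to max_gfn (that is, no MMIO hole, as in the PVH case described above).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t gfn_model_t;          /* hypothetical stand-in for xen_pfn_t */

/* The fixed scratch gfn that the patch removes (just below 4GiB). */
#define OLD_SCRATCH_GFN 0xFFFFEULL

/*
 * Assuming RAM occupies gfns 0..max_gfn with no hole, a scratch gfn
 * collides with RAM exactly when it is not above max_gfn.
 */
static bool scratch_clobbers_ram(gfn_model_t scratch, gfn_model_t max_gfn)
{
    return scratch <= max_gfn;
}

/* The approach taken by the patch: use the first gfn past the end of RAM. */
static gfn_model_t scratch_gfn_after_ram(gfn_model_t max_gfn)
{
    assert(max_gfn != UINT64_MAX);     /* max_gfn + 1 must not wrap */
    return max_gfn + 1;
}
```

Under these assumptions a 4GiB guest has max_gfn 0xFFFFF, so the old fixed gfn 0xFFFFE lands inside RAM, while max_gfn + 1 (0x100000) cannot collide by construction.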
There are many functions where PVH requires some code in common with HVM. Rearrange some of these functions so that the code is together.

In general, the HVM code that PVH also uses includes:
- cacheattr functionality
- paging
- hvm_funcs
- hvm_assert_evtchn_irq tasklet
- tm_list
- hvm_params

And code that PVH shares with PV but not with HVM:
- updating the domain wallclock
- setting v->is_initialized

There should be no end-to-end changes in behavior.

Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>
Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
---
v14:
 - Remove changes in arch_set_info_guest (more of the code is unified)
 - hvm_funcs.vcpu_initialise() must be called after vlapic_init()
v13:
 - Don't bother calling tasklet_kill in failure path of hvm_vcpu_initialize
 - Allocate hvm_params for PVH domains

CC: Jan Beulich <jbeulich@suse.com>
CC: Tim Deegan <tim@xen.org>
CC: Keir Fraser <keir@xen.org>
---
 xen/arch/x86/hvm/hvm.c | 93 +++++++++++++++++++++++++-----------------------
 1 file changed, 49 insertions(+), 44 deletions(-)

diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index 5f3a94a..87a6f42 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -522,27 +522,27 @@ int hvm_domain_initialise(struct domain *d)
     spin_lock_init(&d->arch.hvm_domain.irq_lock);
     spin_lock_init(&d->arch.hvm_domain.uc_lock);
 
-    INIT_LIST_HEAD(&d->arch.hvm_domain.msixtbl_list);
-    spin_lock_init(&d->arch.hvm_domain.msixtbl_list_lock);
+    hvm_init_cacheattr_region_list(d);
+
+    rc = paging_enable(d, PG_refcounts|PG_translate|PG_external);
+    if ( rc != 0 )
+        goto fail0;
 
     d->arch.hvm_domain.params = xzalloc_array(uint64_t, HVM_NR_PARAMS);
     d->arch.hvm_domain.io_handler = xmalloc(struct hvm_io_handler);
     rc = -ENOMEM;
     if ( !d->arch.hvm_domain.params || !d->arch.hvm_domain.io_handler )
-        goto fail0;
+        goto fail1;
     d->arch.hvm_domain.io_handler->num_slot = 0;
 
+    INIT_LIST_HEAD(&d->arch.hvm_domain.msixtbl_list);
+    spin_lock_init(&d->arch.hvm_domain.msixtbl_list_lock);
+
     hvm_init_guest_time(d);
 
     d->arch.hvm_domain.params[HVM_PARAM_HPET_ENABLED] = 1;
     d->arch.hvm_domain.params[HVM_PARAM_TRIPLE_FAULT_REASON] = SHUTDOWN_reboot;
 
-    hvm_init_cacheattr_region_list(d);
-
-    rc = paging_enable(d, PG_refcounts|PG_translate|PG_external);
-    if ( rc != 0 )
-        goto fail1;
-
     vpic_init(d);
 
     rc = vioapic_init(d);
@@ -569,10 +569,10 @@ int hvm_domain_initialise(struct domain *d)
     stdvga_deinit(d);
     vioapic_deinit(d);
  fail1:
-    hvm_destroy_cacheattr_region_list(d);
- fail0:
     xfree(d->arch.hvm_domain.io_handler);
     xfree(d->arch.hvm_domain.params);
+ fail0:
+    hvm_destroy_cacheattr_region_list(d);
     return rc;
 }
 
@@ -601,11 +601,11 @@ void hvm_domain_relinquish_resources(struct domain *d)
 
 void hvm_domain_destroy(struct domain *d)
 {
+    hvm_destroy_cacheattr_region_list(d);
     hvm_funcs.domain_destroy(d);
     rtc_deinit(d);
     stdvga_deinit(d);
     vioapic_deinit(d);
-    hvm_destroy_cacheattr_region_list(d);
 }
 
 static int hvm_save_tsc_adjust(struct domain *d, hvm_domain_context_t *h)
@@ -1091,24 +1091,47 @@ int hvm_vcpu_initialise(struct vcpu *v)
 {
     int rc;
     struct domain *d = v->domain;
-    domid_t dm_domid = d->arch.hvm_domain.params[HVM_PARAM_DM_DOMAIN];
+    domid_t dm_domid;
 
     hvm_asid_flush_vcpu(v);
 
-    if ( (rc = vlapic_init(v)) != 0 )
+    spin_lock_init(&v->arch.hvm_vcpu.tm_lock);
+    INIT_LIST_HEAD(&v->arch.hvm_vcpu.tm_list);
+
+    rc = hvm_vcpu_cacheattr_init(v); /* teardown: vcpu_cacheattr_destroy */
+    if ( rc != 0 )
         goto fail1;
 
-    if ( (rc = hvm_funcs.vcpu_initialise(v)) != 0 )
+    /* NB: vlapic_init must be called before hvm_funcs.vcpu_initialise */
+    if ( (rc = vlapic_init(v)) != 0 ) /* teardown: vlapic_destroy */
         goto fail2;
 
-    if ( nestedhvm_enabled(d)
-         && (rc = nestedhvm_vcpu_initialise(v)) < 0 )
+    if ( (rc = hvm_funcs.vcpu_initialise(v)) != 0 ) /* teardown: hvm_funcs.vcpu_destroy */
         goto fail3;
 
+    softirq_tasklet_init(
+        &v->arch.hvm_vcpu.assert_evtchn_irq_tasklet,
+        (void(*)(unsigned long))hvm_assert_evtchn_irq,
+        (unsigned long)v);
+
+    v->arch.user_regs.eflags = 2;
+
+    v->arch.hvm_vcpu.inject_trap.vector = -1;
+
+    rc = setup_compat_arg_xlat(v); /* teardown: free_compat_arg_xlat() */
+    if ( rc != 0 )
+        goto fail4;
+
+    if ( nestedhvm_enabled(d)
+         && (rc = nestedhvm_vcpu_initialise(v)) < 0 ) /* teardown: nestedhvm_vcpu_destroy */
+        goto fail5;
+
+    dm_domid = d->arch.hvm_domain.params[HVM_PARAM_DM_DOMAIN];
+
     /* Create ioreq event channel. */
-    rc = alloc_unbound_xen_event_channel(v, dm_domid, NULL);
+    rc = alloc_unbound_xen_event_channel(v, dm_domid, NULL); /* teardown: none */
     if ( rc < 0 )
-        goto fail4;
+        goto fail6;
 
     /* Register ioreq event channel. */
     v->arch.hvm_vcpu.xen_port = rc;
@@ -1116,9 +1139,9 @@ int hvm_vcpu_initialise(struct vcpu *v)
     if ( v->vcpu_id == 0 )
     {
         /* Create bufioreq event channel. */
-        rc = alloc_unbound_xen_event_channel(v, dm_domid, NULL);
+        rc = alloc_unbound_xen_event_channel(v, dm_domid, NULL); /* teardown: none */
         if ( rc < 0 )
-            goto fail4;
+            goto fail6;
         d->arch.hvm_domain.params[HVM_PARAM_BUFIOREQ_EVTCHN] = rc;
     }
 
@@ -1127,26 +1150,6 @@ int hvm_vcpu_initialise(struct vcpu *v)
     get_ioreq(v)->vp_eport = v->arch.hvm_vcpu.xen_port;
     spin_unlock(&d->arch.hvm_domain.ioreq.lock);
 
-    spin_lock_init(&v->arch.hvm_vcpu.tm_lock);
-    INIT_LIST_HEAD(&v->arch.hvm_vcpu.tm_list);
-
-    v->arch.hvm_vcpu.inject_trap.vector = -1;
-
-    rc = setup_compat_arg_xlat(v);
-    if ( rc != 0 )
-        goto fail4;
-
-    rc = hvm_vcpu_cacheattr_init(v);
-    if ( rc != 0 )
-        goto fail5;
-
-    softirq_tasklet_init(
-        &v->arch.hvm_vcpu.assert_evtchn_irq_tasklet,
-        (void(*)(unsigned long))hvm_assert_evtchn_irq,
-        (unsigned long)v);
-
-    v->arch.user_regs.eflags = 2;
-
     if ( v->vcpu_id == 0 )
     {
         /* NB. All these really belong in hvm_domain_initialise(). */
@@ -1164,14 +1167,16 @@ int hvm_vcpu_initialise(struct vcpu *v)
 
     return 0;
 
+ fail6:
+    nestedhvm_vcpu_destroy(v);
  fail5:
     free_compat_arg_xlat(v);
  fail4:
-    nestedhvm_vcpu_destroy(v);
- fail3:
     hvm_funcs.vcpu_destroy(v);
- fail2:
+ fail3:
     vlapic_destroy(v);
+ fail2:
+    hvm_vcpu_cacheattr_destroy(v);
  fail1:
     return rc;
 }
-- 
1.7.9.5
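Because this patch reorders initialisation steps, the fail labels have to be renumbered so that each one still undoes exactly the steps that succeeded before the failure, in reverse order. The sketch below shows that pattern in isolation. It is not Xen code: the step names, the trace buffer and vcpu_init_sketch() are all hypothetical, and only three steps are modelled.

```c
#include <assert.h>
#include <string.h>

static char trace[64];  /* records which steps and teardowns ran */
static int fail_at;     /* step number (1..3) that should fail; 0 = none */

/* Pretend initialisation step: fails on request, otherwise logs its tag. */
static int step(int n, const char *tag)
{
    if ( fail_at == n )
        return -1;
    strcat(trace, tag);
    return 0;
}

/* Pretend teardown: logs the lower-case tag of the step it undoes. */
static void undo(const char *tag)
{
    strcat(trace, tag);
}

static int vcpu_init_sketch(int fail_step)
{
    int rc;

    trace[0] = '\0';
    fail_at = fail_step;

    if ( (rc = step(1, "A")) != 0 )   /* nothing to undo yet */
        goto fail1;
    if ( (rc = step(2, "B")) != 0 )   /* must undo A */
        goto fail2;
    if ( (rc = step(3, "C")) != 0 )   /* must undo B, then A */
        goto fail3;

    return 0;

 fail3:
    undo("b");
 fail2:
    undo("a");
 fail1:
    return rc;
}
```

Moving a step earlier in the sequence, as the patch does for the cacheattr initialisation, means its teardown has to move later in the label chain; otherwise a failure would either leak the step or tear down something that was never set up.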
George Dunlap
2013-Nov-04 12:14 UTC
[PATCH v14 04/17] Introduce pv guest type and has_hvm_container macros
The goal of this patch is to classify conditionals more clearly, as to whether they relate to pv guests, hvm-only guests, or guests with an "hvm container" (which will eventually include PVH).

This patch introduces an enum for guest type, as well as two new macros for switching behavior on and off: is_pv_* and has_hvm_container_*. At the moment is_pv_* <=> !has_hvm_container_*. The purpose of having two is that it seems to me different to take a path because something does *not* have PV structures than to take a path because it *does* have HVM structures, even if the two happen to coincide 100% at the moment. The exact usage is occasionally a bit fuzzy though, and a judgement call just needs to be made on which is clearer.

In general, a switch should use is_pv_* (or !is_pv_*) if the code in question relates directly to a PV guest. Examples include use of pv_vcpu structs or other behavior directly related to PV domains.

hvm_container is more of a fuzzy concept, but in general:

* Most core HVM behavior will be included in this. Behavior not appropriate for PVH mode will be disabled in later patches

* Hypercalls related to HVM guests will *not* be included by default; functionality needed by PVH guests will be enabled in future patches

* The following functionality is not considered part of the HVM container, and PVH will end up behaving like PV by default: Event channel, vtsc offset, code related to emulated timers, nested HVM, emuirq, PoD

* Some features are left to implement for PVH later: vpmu, shadow mode

Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>
Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
---
v13:
 - Readjust where we choose to use has_hvm_container or !is_pv (and vice versa)
 - Do the memset in arch_set_info_guest unconditionally
 - Check for is_pv in do_page_walk

CC: Jan Beulich <jbeulich@suse.com>
CC: Tim Deegan <tim@xen.org>
CC: Keir Fraser <keir@xen.org>
---
 xen/arch/x86/acpi/suspend.c | 2 +-
 xen/arch/x86/cpu/mcheck/vmce.c | 6 ++--
 xen/arch/x86/debug.c | 2 +-
 xen/arch/x86/domain.c | 54 ++++++++++++++++++------------------
 xen/arch/x86/domain_page.c | 10 +++----
 xen/arch/x86/domctl.c | 11 ++++----
 xen/arch/x86/efi/runtime.c | 4 +--
 xen/arch/x86/hvm/vmx/vmcs.c | 4 +--
 xen/arch/x86/mm.c | 6 ++--
 xen/arch/x86/mm/shadow/common.c | 6 ++--
 xen/arch/x86/mm/shadow/multi.c | 7 +++--
 xen/arch/x86/physdev.c | 4 +--
 xen/arch/x86/traps.c | 5 ++--
 xen/arch/x86/x86_64/mm.c | 2 +-
 xen/arch/x86/x86_64/traps.c | 8 +++---
 xen/common/domain.c | 2 +-
 xen/common/grant_table.c | 4 +--
 xen/common/kernel.c | 2 +-
 xen/include/asm-x86/domain.h | 2 +-
 xen/include/asm-x86/event.h | 2 +-
 xen/include/asm-x86/guest_access.h | 12 ++++----
 xen/include/asm-x86/guest_pt.h | 4 +--
 xen/include/xen/sched.h | 14 ++++++++--
 xen/include/xen/tmem_xen.h | 2 +-
 24 files changed, 92 insertions(+), 83 deletions(-)

diff --git a/xen/arch/x86/acpi/suspend.c b/xen/arch/x86/acpi/suspend.c
index 6fdd876..1718930 100644
--- a/xen/arch/x86/acpi/suspend.c
+++ b/xen/arch/x86/acpi/suspend.c
@@ -85,7 +85,7 @@ void restore_rest_processor_state(void)
         BUG();
 
     /* Maybe load the debug registers. */
-    BUG_ON(is_hvm_vcpu(curr));
+    BUG_ON(!is_pv_vcpu(curr));
     if ( !is_idle_vcpu(curr) && curr->arch.debugreg[7] )
     {
         write_debugreg(0, curr->arch.debugreg[0]);
diff --git a/xen/arch/x86/cpu/mcheck/vmce.c b/xen/arch/x86/cpu/mcheck/vmce.c
index af3b491..f6c35db 100644
--- a/xen/arch/x86/cpu/mcheck/vmce.c
+++ b/xen/arch/x86/cpu/mcheck/vmce.c
@@ -83,7 +83,7 @@ int vmce_restore_vcpu(struct vcpu *v, const struct hvm_vmce_vcpu *ctxt)
     {
         dprintk(XENLOG_G_ERR, "%s restore: unsupported MCA capabilities"
                 " %#" PRIx64 " for d%d:v%u (supported: %#Lx)\n",
-                is_hvm_vcpu(v) ? "HVM" : "PV", ctxt->caps,
+                has_hvm_container_vcpu(v) ? "HVM" : "PV", ctxt->caps,
                 v->domain->domain_id, v->vcpu_id,
                 guest_mcg_cap & ~MCG_CAP_COUNT);
         return -EPERM;
@@ -357,7 +357,7 @@ int inject_vmce(struct domain *d, int vcpu)
         if ( vcpu != VMCE_INJECT_BROADCAST && vcpu != v->vcpu_id )
             continue;
 
-        if ( (is_hvm_domain(d) ||
+        if ( (has_hvm_container_domain(d) ||
               guest_has_trap_callback(d, v->vcpu_id, TRAP_machine_check)) &&
              !test_and_set_bool(v->mce_pending) )
         {
@@ -439,7 +439,7 @@ int unmmap_broken_page(struct domain *d, mfn_t mfn, unsigned long gfn)
     if (!mfn_valid(mfn_x(mfn)))
         return -EINVAL;
 
-    if ( !is_hvm_domain(d) || !paging_mode_hap(d) )
+    if ( !has_hvm_container_domain(d) || !paging_mode_hap(d) )
         return -ENOSYS;
 
     rc = -1;
diff --git a/xen/arch/x86/debug.c b/xen/arch/x86/debug.c
index e67473e..3e21ca8 100644
--- a/xen/arch/x86/debug.c
+++ b/xen/arch/x86/debug.c
@@ -158,7 +158,7 @@ dbg_rw_guest_mem(dbgva_t addr, dbgbyte_t *buf, int len, struct domain *dp,
 
         pagecnt = min_t(long, PAGE_SIZE - (addr & ~PAGE_MASK), len);
 
-        mfn = (dp->is_hvm
+        mfn = (has_hvm_container_domain(dp)
                 ? dbg_hvm_va2mfn(addr, dp, toaddr, &gfn)
                 : dbg_pv_va2mfn(addr, dp, pgd3));
         if ( mfn == INVALID_MFN )
diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index b67fcb8..358616c 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -167,7 +167,7 @@ void dump_pageframe_info(struct domain *d)
         spin_unlock(&d->page_alloc_lock);
     }
 
-    if ( is_hvm_domain(d) )
+    if ( has_hvm_container_domain(d) )
         p2m_pod_dump_data(d);
 
     spin_lock(&d->page_alloc_lock);
@@ -385,7 +385,7 @@ int vcpu_initialise(struct vcpu *v)
 
     vmce_init_vcpu(v);
 
-    if ( is_hvm_domain(d) )
+    if ( has_hvm_container_domain(d) )
     {
         rc = hvm_vcpu_initialise(v);
         goto done;
@@ -438,7 +438,7 @@ int vcpu_initialise(struct vcpu *v)
     {
         vcpu_destroy_fpu(v);
 
-        if ( !is_hvm_domain(d) )
+        if ( is_pv_domain(d) )
             xfree(v->arch.pv_vcpu.trap_ctxt);
     }
 
@@ -452,7 +452,7 @@ void vcpu_destroy(struct vcpu *v)
 
     vcpu_destroy_fpu(v);
 
-    if ( is_hvm_vcpu(v) )
+    if ( has_hvm_container_vcpu(v) )
         hvm_vcpu_destroy(v);
     else
         xfree(v->arch.pv_vcpu.trap_ctxt);
@@ -464,7 +464,7 @@ int arch_domain_create(struct domain *d, unsigned int domcr_flags)
     int rc = -ENOMEM;
 
     d->arch.hvm_domain.hap_enabled =
-        is_hvm_domain(d) &&
+        has_hvm_container_domain(d) &&
         hvm_funcs.hap_supported &&
         (domcr_flags & DOMCRF_hap);
     d->arch.hvm_domain.mem_sharing_enabled = 0;
@@ -490,7 +490,7 @@ int arch_domain_create(struct domain *d, unsigned int domcr_flags)
                d->domain_id);
     }
 
-    if ( is_hvm_domain(d) )
+    if ( has_hvm_container_domain(d) )
         rc = create_perdomain_mapping(d, PERDOMAIN_VIRT_START, 0, NULL, NULL);
     else if ( is_idle_domain(d) )
         rc = 0;
@@ -512,7 +512,7 @@ int arch_domain_create(struct domain *d, unsigned int domcr_flags)
     mapcache_domain_init(d);
 
     HYPERVISOR_COMPAT_VIRT_START(d) =
-        is_hvm_domain(d) ? ~0u : __HYPERVISOR_COMPAT_VIRT_START;
+        is_pv_domain(d) ? __HYPERVISOR_COMPAT_VIRT_START : ~0u;
 
     if ( (rc = paging_domain_init(d, domcr_flags)) != 0 )
         goto fail;
@@ -554,7 +554,7 @@ int arch_domain_create(struct domain *d, unsigned int domcr_flags)
             goto fail;
     }
 
-    if ( is_hvm_domain(d) )
+    if ( has_hvm_container_domain(d) )
     {
         if ( (rc = hvm_domain_initialise(d)) != 0 )
         {
@@ -583,14 +583,14 @@ int arch_domain_create(struct domain *d, unsigned int domcr_flags)
     if ( paging_initialised )
         paging_final_teardown(d);
     free_perdomain_mappings(d);
-    if ( !is_hvm_domain(d) )
+    if ( is_pv_domain(d) )
         free_xenheap_page(d->arch.pv_domain.gdt_ldt_l1tab);
     return rc;
 }
 
 void arch_domain_destroy(struct domain *d)
 {
-    if ( is_hvm_domain(d) )
+    if ( has_hvm_container_domain(d) )
         hvm_domain_destroy(d);
     else
         xfree(d->arch.pv_domain.e820);
@@ -602,7 +602,7 @@ void arch_domain_destroy(struct domain *d)
     paging_final_teardown(d);
 
     free_perdomain_mappings(d);
-    if ( !is_hvm_domain(d) )
+    if ( is_pv_domain(d) )
         free_xenheap_page(d->arch.pv_domain.gdt_ldt_l1tab);
 
     free_xenheap_page(d->shared_info);
@@ -653,7 +653,7 @@ int arch_set_info_guest(
 #define c(fld) (compat ? (c.cmp->fld) : (c.nat->fld))
     flags = c(flags);
 
-    if ( !is_hvm_vcpu(v) )
+    if ( is_pv_vcpu(v) )
     {
         if ( !compat )
         {
@@ -698,7 +698,7 @@ int arch_set_info_guest(
     v->fpu_initialised = !!(flags & VGCF_I387_VALID);
 
     v->arch.flags &= ~TF_kernel_mode;
-    if ( (flags & VGCF_in_kernel) || is_hvm_vcpu(v)/*???*/ )
+    if ( (flags & VGCF_in_kernel) || has_hvm_container_vcpu(v)/*???*/ )
         v->arch.flags |= TF_kernel_mode;
 
     v->arch.vgc_flags = flags;
@@ -713,7 +713,7 @@ int arch_set_info_guest(
     if ( !compat )
     {
         memcpy(&v->arch.user_regs, &c.nat->user_regs, sizeof(c.nat->user_regs));
-        if ( !is_hvm_vcpu(v) )
+        if ( is_pv_vcpu(v) )
             memcpy(v->arch.pv_vcpu.trap_ctxt, c.nat->trap_ctxt,
                    sizeof(c.nat->trap_ctxt));
     }
@@ -729,7 +729,7 @@ int arch_set_info_guest(
 
     v->arch.user_regs.eflags |= 2;
 
-    if ( is_hvm_vcpu(v) )
+    if ( has_hvm_container_vcpu(v) )
     {
         hvm_set_info_guest(v);
         goto out;
@@ -959,7 +959,7 @@ int arch_set_info_guest(
 
 int arch_vcpu_reset(struct vcpu *v)
 {
-    if ( !is_hvm_vcpu(v) )
+    if ( is_pv_vcpu(v) )
     {
         destroy_gdt(v);
         return vcpu_destroy_pagetables(v);
@@ -1309,7 +1309,7 @@ static void update_runstate_area(struct vcpu *v)
 
 static inline int need_full_gdt(struct vcpu *v)
 {
-    return (!is_hvm_vcpu(v) && !is_idle_vcpu(v));
+    return (is_pv_vcpu(v) && !is_idle_vcpu(v));
 }
 
 static void __context_switch(void)
@@ -1435,9 +1435,9 @@ void context_switch(struct vcpu *prev, struct vcpu *next)
     {
         __context_switch();
 
-        if ( !is_hvm_vcpu(next) &&
+        if ( is_pv_vcpu(next) &&
              (is_idle_vcpu(prev) ||
-              is_hvm_vcpu(prev) ||
+              has_hvm_container_vcpu(prev) ||
               is_pv_32on64_vcpu(prev) != is_pv_32on64_vcpu(next)) )
         {
             uint64_t efer = read_efer();
@@ -1448,13 +1448,13 @@ void context_switch(struct vcpu *prev, struct vcpu *next)
         /* Re-enable interrupts before restoring state which may fault. */
         local_irq_enable();
 
-        if ( !is_hvm_vcpu(next) )
+        if ( is_pv_vcpu(next) )
         {
             load_LDT(next);
             load_segments(next);
         }
 
-        set_cpuid_faulting(!is_hvm_vcpu(next) &&
+        set_cpuid_faulting(is_pv_vcpu(next) &&
                            (next->domain->domain_id != 0));
     }
 
@@ -1537,7 +1537,7 @@ void hypercall_cancel_continuation(void)
     }
     else
     {
-        if ( !is_hvm_vcpu(current) )
+        if ( is_pv_vcpu(current) )
             regs->eip += 2; /* skip re-execute 'syscall' / 'int $xx' */
         else
             current->arch.hvm_vcpu.hcall_preempted = 0;
@@ -1574,12 +1574,12 @@ unsigned long hypercall_create_continuation(
         regs->eax = op;
 
         /* Ensure the hypercall trap instruction is re-executed. */
-        if ( !is_hvm_vcpu(current) )
+        if ( is_pv_vcpu(current) )
             regs->eip -= 2; /* re-execute 'syscall' / 'int $xx' */
         else
             current->arch.hvm_vcpu.hcall_preempted = 1;
 
-        if ( !is_hvm_vcpu(current) ?
+        if ( is_pv_vcpu(current) ?
              !is_pv_32on64_vcpu(current) :
              (hvm_guest_x86_mode(current) == 8) )
         {
@@ -1851,7 +1851,7 @@ int domain_relinquish_resources(struct domain *d)
                 return ret;
         }
 
-        if ( !is_hvm_domain(d) )
+        if ( is_pv_domain(d) )
         {
             for_each_vcpu ( d, v )
             {
@@ -1924,7 +1924,7 @@ int domain_relinquish_resources(struct domain *d)
         BUG();
     }
 
-    if ( is_hvm_domain(d) )
+    if ( has_hvm_container_domain(d) )
         hvm_domain_relinquish_resources(d);
 
     return 0;
@@ -2008,7 +2008,7 @@ void vcpu_mark_events_pending(struct vcpu *v)
     if ( already_pending )
         return;
 
-    if ( is_hvm_vcpu(v) )
+    if ( has_hvm_container_vcpu(v) )
         hvm_assert_evtchn_irq(v);
     else
         vcpu_kick(v);
diff --git a/xen/arch/x86/domain_page.c b/xen/arch/x86/domain_page.c
index bc18263..3903952 100644
--- a/xen/arch/x86/domain_page.c
+++ b/xen/arch/x86/domain_page.c
@@ -35,7 +35,7 @@ static inline struct vcpu *mapcache_current_vcpu(void)
      * then it means we are running on the idle domain's page table and must
      * therefore use its mapcache.
      */
-    if ( unlikely(pagetable_is_null(v->arch.guest_table)) && !is_hvm_vcpu(v) )
+    if ( unlikely(pagetable_is_null(v->arch.guest_table)) && is_pv_vcpu(v) )
     {
         /* If we really are idling, perform lazy context switch now. */
         if ( (v = idle_vcpu[smp_processor_id()]) == current )
@@ -72,7 +72,7 @@ void *map_domain_page(unsigned long mfn)
 #endif
 
     v = mapcache_current_vcpu();
-    if ( !v || is_hvm_vcpu(v) )
+    if ( !v || !is_pv_vcpu(v) )
         return mfn_to_virt(mfn);
 
     dcache = &v->domain->arch.pv_domain.mapcache;
@@ -177,7 +177,7 @@ void unmap_domain_page(const void *ptr)
     ASSERT(va >= MAPCACHE_VIRT_START && va < MAPCACHE_VIRT_END);
 
     v = mapcache_current_vcpu();
-    ASSERT(v && !is_hvm_vcpu(v));
+    ASSERT(v && is_pv_vcpu(v));
 
     dcache = &v->domain->arch.pv_domain.mapcache;
     ASSERT(dcache->inuse);
@@ -244,7 +244,7 @@ int mapcache_domain_init(struct domain *d)
     struct mapcache_domain *dcache = &d->arch.pv_domain.mapcache;
     unsigned int bitmap_pages;
 
-    if ( is_hvm_domain(d) || is_idle_domain(d) )
+    if ( !is_pv_domain(d) || is_idle_domain(d) )
         return 0;
 
 #ifdef NDEBUG
@@ -275,7 +275,7 @@ int mapcache_vcpu_init(struct vcpu *v)
     unsigned int ents = d->max_vcpus * MAPCACHE_VCPU_ENTRIES;
     unsigned int nr = PFN_UP(BITS_TO_LONGS(ents) * sizeof(long));
 
-    if ( is_hvm_vcpu(v) || !dcache->inuse )
+    if ( !is_pv_vcpu(v) || !dcache->inuse )
         return 0;
 
     if ( ents > dcache->entries )
diff --git a/xen/arch/x86/domctl.c b/xen/arch/x86/domctl.c
index e75918a..9531a16 100644
--- a/xen/arch/x86/domctl.c
+++ b/xen/arch/x86/domctl.c
@@ -800,7 +800,7 @@ long arch_do_domctl(
         if ( domctl->cmd == XEN_DOMCTL_get_ext_vcpucontext )
         {
             evc->size = sizeof(*evc);
-            if ( !is_hvm_domain(d) )
+            if ( is_pv_domain(d) )
             {
                 evc->sysenter_callback_cs =
                     v->arch.pv_vcpu.sysenter_callback_cs;
@@ -833,7 +833,7 @@ long arch_do_domctl(
             ret = -EINVAL;
             if ( evc->size < offsetof(typeof(*evc), vmce) )
                 goto ext_vcpucontext_out;
-            if ( !is_hvm_domain(d) )
+            if ( is_pv_domain(d) )
             {
                 if ( !is_canonical_address(evc->sysenter_callback_eip) ||
                      !is_canonical_address(evc->syscall32_callback_eip) )
@@ -1246,8 +1246,7 @@ void arch_get_info_guest(struct vcpu *v, vcpu_guest_context_u c)
     bool_t compat = is_pv_32on64_domain(v->domain);
 #define c(fld) (!compat ? (c.nat->fld) : (c.cmp->fld))
 
-    if ( is_hvm_vcpu(v) )
-        memset(c.nat, 0, sizeof(*c.nat));
+    memset(c.nat, 0, sizeof(*c.nat));
     memcpy(&c.nat->fpu_ctxt, v->arch.fpu_ctxt, sizeof(c.nat->fpu_ctxt));
     c(flags = v->arch.vgc_flags & ~(VGCF_i387_valid|VGCF_in_kernel));
     if ( v->fpu_initialised )
@@ -1257,7 +1256,7 @@ void arch_get_info_guest(struct vcpu *v, vcpu_guest_context_u c)
     if ( !compat )
     {
         memcpy(&c.nat->user_regs, &v->arch.user_regs, sizeof(c.nat->user_regs));
-        if ( !is_hvm_vcpu(v) )
+        if ( is_pv_vcpu(v) )
             memcpy(c.nat->trap_ctxt, v->arch.pv_vcpu.trap_ctxt,
                    sizeof(c.nat->trap_ctxt));
     }
@@ -1272,7 +1271,7 @@ void arch_get_info_guest(struct vcpu *v, vcpu_guest_context_u c)
     for ( i = 0; i < ARRAY_SIZE(v->arch.debugreg); ++i )
         c(debugreg[i] = v->arch.debugreg[i]);
 
-    if ( is_hvm_vcpu(v) )
+    if ( has_hvm_container_vcpu(v) )
     {
         struct segment_register sreg;
 
diff --git a/xen/arch/x86/efi/runtime.c b/xen/arch/x86/efi/runtime.c
index 37bb535..d7c884b 100644
--- a/xen/arch/x86/efi/runtime.c
+++ b/xen/arch/x86/efi/runtime.c
@@ -52,7 +52,7 @@ unsigned long efi_rs_enter(void)
     /* prevent fixup_page_fault() from doing anything */
     irq_enter();
 
-    if ( !is_hvm_vcpu(current) && !is_idle_vcpu(current) )
+    if ( is_pv_vcpu(current) && !is_idle_vcpu(current) )
     {
         struct desc_ptr gdt_desc = {
             .limit = LAST_RESERVED_GDT_BYTE,
@@ -71,7 +71,7 @@ unsigned long efi_rs_enter(void)
 void efi_rs_leave(unsigned long cr3)
 {
     write_cr3(cr3);
-    if ( !is_hvm_vcpu(current) && !is_idle_vcpu(current) )
+    if ( is_pv_vcpu(current) && !is_idle_vcpu(current) )
     {
         struct desc_ptr gdt_desc = {
             .limit = LAST_RESERVED_GDT_BYTE,
diff --git a/xen/arch/x86/hvm/vmx/vmcs.c b/xen/arch/x86/hvm/vmx/vmcs.c
index 6526504..f2a2857 100644
--- a/xen/arch/x86/hvm/vmx/vmcs.c
+++ b/xen/arch/x86/hvm/vmx/vmcs.c
@@ -650,7 +650,7 @@ void vmx_vmcs_exit(struct vcpu *v)
     {
         /* Don't confuse vmx_do_resume (for @v or @current!) */
         vmx_clear_vmcs(v);
-        if ( is_hvm_vcpu(current) )
+        if ( has_hvm_container_vcpu(current) )
             vmx_load_vmcs(current);
 
         spin_unlock(&v->arch.hvm_vmx.vmcs_lock);
@@ -1479,7 +1479,7 @@ static void vmcs_dump(unsigned char ch)
 
     for_each_domain ( d )
     {
-        if ( !is_hvm_domain(d) )
+        if ( !has_hvm_container_domain(d) )
             continue;
         printk("\n>>> Domain %d <<<\n", d->domain_id);
         for_each_vcpu ( d, v )
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 43aaceb..9621e22 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -181,7 +181,7 @@ static uint32_t base_disallow_mask;
      (rangeset_is_empty((d)->iomem_caps) && \
       rangeset_is_empty((d)->arch.ioport_caps) && \
       !has_arch_pdevs(d) && \
-      !is_hvm_domain(d)) ? \
+      is_pv_domain(d)) ? \
      L1_DISALLOW_MASK : (L1_DISALLOW_MASK & ~PAGE_CACHE_ATTRS))
 
 static void __init init_frametable_chunk(void *start, void *end)
@@ -433,7 +433,7 @@ int page_is_ram_type(unsigned long mfn, unsigned long mem_type)
 
 unsigned long domain_get_maximum_gpfn(struct domain *d)
 {
-    if ( is_hvm_domain(d) )
+    if ( has_hvm_container_domain(d) )
         return p2m_get_hostp2m(d)->max_mapped_pfn;
     /* NB. PV guests specify nr_pfns rather than max_pfn so we adjust here. */
     return (arch_get_max_pfn(d) ?: 1) - 1;
@@ -2381,7 +2381,7 @@ static int __get_page_type(struct page_info *page, unsigned long type,
     {
         /* Special pages should not be accessible from devices. */
         struct domain *d = page_get_owner(page);
-        if ( d && !is_hvm_domain(d) && unlikely(need_iommu(d)) )
+        if ( d && is_pv_domain(d) && unlikely(need_iommu(d)) )
         {
             if ( (x & PGT_type_mask) == PGT_writable_page )
                 iommu_unmap_page(d, mfn_to_gmfn(d, page_to_mfn(page)));
diff --git a/xen/arch/x86/mm/shadow/common.c b/xen/arch/x86/mm/shadow/common.c
index adffa06..0bfa595 100644
--- a/xen/arch/x86/mm/shadow/common.c
+++ b/xen/arch/x86/mm/shadow/common.c
@@ -367,7 +367,7 @@ const struct x86_emulate_ops *shadow_init_emulation(
     sh_ctxt->ctxt.regs = regs;
     sh_ctxt->ctxt.force_writeback = 0;
 
-    if ( !is_hvm_vcpu(v) )
+    if ( is_pv_vcpu(v) )
     {
         sh_ctxt->ctxt.addr_size = sh_ctxt->ctxt.sp_size = BITS_PER_LONG;
         return &pv_shadow_emulator_ops;
@@ -964,7 +964,7 @@ int sh_unsync(struct vcpu *v, mfn_t gmfn)
     if ( pg->shadow_flags &
          ((SHF_page_type_mask & ~SHF_L1_ANY) | SHF_out_of_sync)
          || sh_page_has_multiple_shadows(pg)
-         || !is_hvm_domain(v->domain)
+         || is_pv_domain(v->domain)
          || !v->domain->arch.paging.shadow.oos_active )
         return 0;
 
@@ -2753,7 +2753,7 @@ static void sh_update_paging_modes(struct vcpu *v)
     if ( v->arch.paging.mode )
         v->arch.paging.mode->shadow.detach_old_tables(v);
 
-    if ( !is_hvm_domain(d) )
+    if ( is_pv_domain(d) )
     {
         ///
         /// PV guest
diff --git a/xen/arch/x86/mm/shadow/multi.c b/xen/arch/x86/mm/shadow/multi.c
index 3fed0b6..d3fa25c 100644
--- a/xen/arch/x86/mm/shadow/multi.c
+++ b/xen/arch/x86/mm/shadow/multi.c
@@ -711,8 +711,9 @@ _sh_propagate(struct vcpu *v,
     // PV guests in 64-bit mode use two different page tables for user vs
     // supervisor permissions, making the guest's _PAGE_USER bit irrelevant.
     // It is always shadowed as present...
-    if ( (GUEST_PAGING_LEVELS == 4) && !is_pv_32on64_domain(d)
-         && !is_hvm_domain(d) )
+    if ( (GUEST_PAGING_LEVELS == 4)
+         && !is_pv_32on64_domain(d)
+         && is_pv_domain(d) )
     {
         sflags |= _PAGE_USER;
     }
@@ -3922,7 +3923,7 @@ sh_update_cr3(struct vcpu *v, int do_locking)
 #endif
 
     /* Don't do anything on an uninitialised vcpu */
-    if ( !is_hvm_domain(d) && !v->is_initialised )
+    if ( is_pv_domain(d) && !v->is_initialised )
     {
         ASSERT(v->arch.cr3 == 0);
         return;
diff --git a/xen/arch/x86/physdev.c b/xen/arch/x86/physdev.c
index 4835ed7..dab6213 100644
--- a/xen/arch/x86/physdev.c
+++ b/xen/arch/x86/physdev.c
@@ -310,10 +310,10 @@ ret_t do_physdev_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
             spin_unlock(&v->domain->event_lock);
             break;
         }
-        if ( !is_hvm_domain(v->domain) &&
+        if ( is_pv_domain(v->domain) &&
              v->domain->arch.pv_domain.auto_unmask )
             evtchn_unmask(pirq->evtchn);
-        if ( !is_hvm_domain(v->domain) ||
+        if ( is_pv_domain(v->domain) ||
              domain_pirq_to_irq(v->domain, eoi.irq) > 0 )
             pirq_guest_eoi(pirq);
         if ( is_hvm_domain(v->domain) &&
diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
index 77c200b..edb7a6a 100644
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -120,6 +120,7 @@ static void show_guest_stack(struct vcpu *v, struct cpu_user_regs *regs)
     unsigned long *stack, addr;
     unsigned long mask = STACK_SIZE;
 
+    /* Avoid HVM as we don't know what the stack looks like.
*/ if ( is_hvm_vcpu(v) ) return; @@ -557,7 +558,7 @@ static inline void do_trap( } if ( ((trapnr == TRAP_copro_error) || (trapnr == TRAP_simd_error)) && - is_hvm_vcpu(curr) && curr->arch.hvm_vcpu.fpu_exception_callback ) + has_hvm_container_vcpu(curr) && curr->arch.hvm_vcpu.fpu_exception_callback ) { curr->arch.hvm_vcpu.fpu_exception_callback( curr->arch.hvm_vcpu.fpu_exception_callback_arg, regs); @@ -712,7 +713,7 @@ int cpuid_hypervisor_leaves( uint32_t idx, uint32_t sub_idx, *ebx = 0x40000200; *ecx = 0; /* Features 1 */ *edx = 0; /* Features 2 */ - if ( !is_hvm_vcpu(current) ) + if ( is_pv_vcpu(current) ) *ecx |= XEN_CPUID_FEAT1_MMU_PT_UPDATE_PRESERVE_AD; break; diff --git a/xen/arch/x86/x86_64/mm.c b/xen/arch/x86/x86_64/mm.c index 2bdbad0..4a3b3f1 100644 --- a/xen/arch/x86/x86_64/mm.c +++ b/xen/arch/x86/x86_64/mm.c @@ -73,7 +73,7 @@ void *do_page_walk(struct vcpu *v, unsigned long addr) l2_pgentry_t l2e, *l2t; l1_pgentry_t l1e, *l1t; - if ( is_hvm_vcpu(v) || !is_canonical_address(addr) ) + if ( !is_pv_vcpu(v) || !is_canonical_address(addr) ) return NULL; l4t = map_domain_page(mfn); diff --git a/xen/arch/x86/x86_64/traps.c b/xen/arch/x86/x86_64/traps.c index 0316d7c..f66ab5a 100644 --- a/xen/arch/x86/x86_64/traps.c +++ b/xen/arch/x86/x86_64/traps.c @@ -86,7 +86,7 @@ void show_registers(struct cpu_user_regs *regs) enum context context; struct vcpu *v = current; - if ( is_hvm_vcpu(v) && guest_mode(regs) ) + if ( has_hvm_container_vcpu(v) && guest_mode(regs) ) { struct segment_register sreg; context = CTXT_hvm_guest; @@ -147,8 +147,8 @@ void vcpu_show_registers(const struct vcpu *v) const struct cpu_user_regs *regs = &v->arch.user_regs; unsigned long crs[8]; - /* No need to handle HVM for now. 
*/ - if ( is_hvm_vcpu(v) ) + /* Only handle PV guests for now */ + if ( !is_pv_vcpu(v) ) return; crs[0] = v->arch.pv_vcpu.ctrlreg[0]; @@ -624,7 +624,7 @@ static void hypercall_page_initialise_ring3_kernel(void *hypercall_page) void hypercall_page_initialise(struct domain *d, void *hypercall_page) { memset(hypercall_page, 0xCC, PAGE_SIZE); - if ( is_hvm_domain(d) ) + if ( has_hvm_container_domain(d) ) hvm_hypercall_page_initialise(d, hypercall_page); else if ( !is_pv_32bit_domain(d) ) hypercall_page_initialise_ring3_kernel(hypercall_page); diff --git a/xen/common/domain.c b/xen/common/domain.c index 5999779..995ba63 100644 --- a/xen/common/domain.c +++ b/xen/common/domain.c @@ -238,7 +238,7 @@ struct domain *domain_create( goto fail; if ( domcr_flags & DOMCRF_hvm ) - d->is_hvm = 1; + d->guest_type = guest_type_hvm; if ( domid == 0 ) { diff --git a/xen/common/grant_table.c b/xen/common/grant_table.c index 21c6a14..107b000 100644 --- a/xen/common/grant_table.c +++ b/xen/common/grant_table.c @@ -721,7 +721,7 @@ __gnttab_map_grant_ref( double_gt_lock(lgt, rgt); - if ( !is_hvm_domain(ld) && need_iommu(ld) ) + if ( is_pv_domain(ld) && need_iommu(ld) ) { unsigned int wrc, rdc; int err = 0; @@ -931,7 +931,7 @@ __gnttab_unmap_common( act->pin -= GNTPIN_hstw_inc; } - if ( !is_hvm_domain(ld) && need_iommu(ld) ) + if ( is_pv_domain(ld) && need_iommu(ld) ) { unsigned int wrc, rdc; int err = 0; diff --git a/xen/common/kernel.c b/xen/common/kernel.c index 4ca50c4..97d9050 100644 --- a/xen/common/kernel.c +++ b/xen/common/kernel.c @@ -306,7 +306,7 @@ DO(xen_version)(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg) if ( current->domain == dom0 ) fi.submap |= 1U << XENFEAT_dom0; #ifdef CONFIG_X86 - if ( !is_hvm_vcpu(current) ) + if ( is_pv_vcpu(current) ) fi.submap |= (1U << XENFEAT_mmu_pt_update_preserve_ad) | (1U << XENFEAT_highmem_assist) | (1U << XENFEAT_gnttab_map_avail_bits); diff --git a/xen/include/asm-x86/domain.h b/xen/include/asm-x86/domain.h index e42651e..07e21f7 100644 --- 
a/xen/include/asm-x86/domain.h +++ b/xen/include/asm-x86/domain.h @@ -16,7 +16,7 @@ #define is_pv_32on64_domain(d) (is_pv_32bit_domain(d)) #define is_pv_32on64_vcpu(v) (is_pv_32on64_domain((v)->domain)) -#define is_hvm_pv_evtchn_domain(d) (is_hvm_domain(d) && \ +#define is_hvm_pv_evtchn_domain(d) (has_hvm_container_domain(d) && \ d->arch.hvm_domain.irq.callback_via_type == HVMIRQ_callback_vector) #define is_hvm_pv_evtchn_vcpu(v) (is_hvm_pv_evtchn_domain(v->domain)) diff --git a/xen/include/asm-x86/event.h b/xen/include/asm-x86/event.h index 7edeb5b..a82062e 100644 --- a/xen/include/asm-x86/event.h +++ b/xen/include/asm-x86/event.h @@ -23,7 +23,7 @@ int hvm_local_events_need_delivery(struct vcpu *v); static inline int local_events_need_delivery(void) { struct vcpu *v = current; - return (is_hvm_vcpu(v) ? hvm_local_events_need_delivery(v) : + return (has_hvm_container_vcpu(v) ? hvm_local_events_need_delivery(v) : (vcpu_info(v, evtchn_upcall_pending) && !vcpu_info(v, evtchn_upcall_mask))); } diff --git a/xen/include/asm-x86/guest_access.h b/xen/include/asm-x86/guest_access.h index ca700c9..88edb3f 100644 --- a/xen/include/asm-x86/guest_access.h +++ b/xen/include/asm-x86/guest_access.h @@ -14,27 +14,27 @@ /* Raw access functions: no type checking. */ #define raw_copy_to_guest(dst, src, len) \ - (is_hvm_vcpu(current) ? \ + (has_hvm_container_vcpu(current) ? \ copy_to_user_hvm((dst), (src), (len)) : \ copy_to_user((dst), (src), (len))) #define raw_copy_from_guest(dst, src, len) \ - (is_hvm_vcpu(current) ? \ + (has_hvm_container_vcpu(current) ? \ copy_from_user_hvm((dst), (src), (len)) : \ copy_from_user((dst), (src), (len))) #define raw_clear_guest(dst, len) \ - (is_hvm_vcpu(current) ? \ + (has_hvm_container_vcpu(current) ? \ clear_user_hvm((dst), (len)) : \ clear_user((dst), (len))) #define __raw_copy_to_guest(dst, src, len) \ - (is_hvm_vcpu(current) ? \ + (has_hvm_container_vcpu(current) ? 
\ copy_to_user_hvm((dst), (src), (len)) : \ __copy_to_user((dst), (src), (len))) #define __raw_copy_from_guest(dst, src, len) \ - (is_hvm_vcpu(current) ? \ + (has_hvm_container_vcpu(current) ? \ copy_from_user_hvm((dst), (src), (len)) : \ __copy_from_user((dst), (src), (len))) #define __raw_clear_guest(dst, len) \ - (is_hvm_vcpu(current) ? \ + (has_hvm_container_vcpu(current) ? \ clear_user_hvm((dst), (len)) : \ clear_user((dst), (len))) diff --git a/xen/include/asm-x86/guest_pt.h b/xen/include/asm-x86/guest_pt.h index b62bc6a..d2a8250 100644 --- a/xen/include/asm-x86/guest_pt.h +++ b/xen/include/asm-x86/guest_pt.h @@ -196,7 +196,7 @@ guest_supports_superpages(struct vcpu *v) /* The _PAGE_PSE bit must be honoured in HVM guests, whenever * CR4.PSE is set or the guest is in PAE or long mode. * It''s also used in the dummy PT for vcpus with CR4.PG cleared. */ - return (!is_hvm_vcpu(v) + return (is_pv_vcpu(v) ? opt_allow_superpage : (GUEST_PAGING_LEVELS != 2 || !hvm_paging_enabled(v) @@ -214,7 +214,7 @@ guest_supports_nx(struct vcpu *v) { if ( GUEST_PAGING_LEVELS == 2 || !cpu_has_nx ) return 0; - if ( !is_hvm_vcpu(v) ) + if ( is_pv_vcpu(v) ) return cpu_has_nx; return hvm_nx_enabled(v); } diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h index 25bf637..a16dee4 100644 --- a/xen/include/xen/sched.h +++ b/xen/include/xen/sched.h @@ -258,6 +258,10 @@ struct mem_event_per_domain struct evtchn_port_ops; +enum guest_type { + guest_type_pv, guest_type_hvm +}; + struct domain { domid_t domain_id; @@ -310,8 +314,8 @@ struct domain struct rangeset *iomem_caps; struct rangeset *irq_caps; - /* Is this an HVM guest? */ - bool_t is_hvm; + enum guest_type guest_type; + #ifdef HAS_PASSTHROUGH /* Does this guest need iommu mappings? 
*/ bool_t need_iommu; @@ -770,8 +774,12 @@ void watchdog_domain_destroy(struct domain *d); #define VM_ASSIST(_d,_t) (test_bit((_t), &(_d)->vm_assist)) -#define is_hvm_domain(d) ((d)->is_hvm) +#define is_pv_domain(d) ((d)->guest_type == guest_type_pv) +#define is_pv_vcpu(v) (is_pv_domain((v)->domain)) +#define is_hvm_domain(d) ((d)->guest_type == guest_type_hvm) #define is_hvm_vcpu(v) (is_hvm_domain(v->domain)) +#define has_hvm_container_domain(d) ((d)->guest_type != guest_type_pv) +#define has_hvm_container_vcpu(v) (has_hvm_container_domain((v)->domain)) #define is_pinned_vcpu(v) ((v)->domain->is_pinned || \ cpumask_weight((v)->cpu_affinity) == 1) #ifdef HAS_PASSTHROUGH diff --git a/xen/include/xen/tmem_xen.h b/xen/include/xen/tmem_xen.h index ad1ddd5..9fb7446 100644 --- a/xen/include/xen/tmem_xen.h +++ b/xen/include/xen/tmem_xen.h @@ -442,7 +442,7 @@ typedef XEN_GUEST_HANDLE_PARAM(char) tmem_cli_va_param_t; static inline int tmh_get_tmemop_from_client(tmem_op_t *op, tmem_cli_op_t uops) { #ifdef CONFIG_COMPAT - if ( is_hvm_vcpu(current) ? + if ( has_hvm_container_vcpu(current) ? hvm_guest_x86_mode(current) != 8 : is_pv_32on64_vcpu(current) ) { -- 1.7.9.5
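A standalone sketch of why this patch rewrites !is_hvm_*() as is_pv_*() instead of leaving the negations alone. With only two guest types the two spellings are equivalent; as soon as a third type exists they diverge, and each call site has to say which of the two it really meant. The struct below is a stripped-down stand-in for Xen's struct domain, and guest_type_other is a hypothetical third type added purely for illustration (it is not part of this patch).

```c
#include <stdbool.h>

/* Illustration only: guest_type_other is a hypothetical third type. */
enum guest_type {
    guest_type_pv, guest_type_other, guest_type_hvm
};

/* Stand-in for struct domain; only the field the predicates look at. */
struct domain {
    enum guest_type guest_type;
};

/* True only for classic PV guests. */
static inline bool is_pv_domain(const struct domain *d)
{
    return d->guest_type == guest_type_pv;
}

/* True only for full HVM guests. */
static inline bool is_hvm_domain(const struct domain *d)
{
    return d->guest_type == guest_type_hvm;
}

/* True for anything that is not classic PV. */
static inline bool has_hvm_container_domain(const struct domain *d)
{
    return d->guest_type != guest_type_pv;
}
```

For a domain of the third type, both is_pv_domain() and is_hvm_domain() are false while has_hvm_container_domain() is true, so a leftover "!is_hvm_domain(d)" would silently take the PV path.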
Introduce new PVH guest type, flags to create it, and ways to identify it.

To begin with, it will inherit functionality marked hvm_container.

Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>
Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
v13: Changed if/else in getdomaininfo into a switch statement as requested.

CC: Jan Beulich <jbeulich@suse.com>
CC: Tim Deegan <tim@xen.org>
CC: Keir Fraser <keir@xen.org>
---
 xen/common/domain.c         | 11 +++++++++++
 xen/common/domctl.c         | 24 +++++++++++++++++++++---
 xen/include/public/domctl.h |  8 +++++++-
 xen/include/xen/sched.h     | 11 ++++++++++-
 4 files changed, 49 insertions(+), 5 deletions(-)

diff --git a/xen/common/domain.c b/xen/common/domain.c
index 995ba63..19d96c2 100644
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -239,6 +239,17 @@ struct domain *domain_create(
 
     if ( domcr_flags & DOMCRF_hvm )
         d->guest_type = guest_type_hvm;
+    else if ( domcr_flags & DOMCRF_pvh )
+    {
+        if ( !(domcr_flags & DOMCRF_hap) )
+        {
+            err = -EOPNOTSUPP;
+            printk(XENLOG_INFO "PVH guest must have HAP on\n");
+            goto fail;
+        }
+        d->guest_type = guest_type_pvh;
+        printk("Creating PVH guest d%d\n", d->domain_id);
+    }
 
     if ( domid == 0 )
     {
diff --git a/xen/common/domctl.c b/xen/common/domctl.c
index 870eef1..552669d 100644
--- a/xen/common/domctl.c
+++ b/xen/common/domctl.c
@@ -185,8 +185,17 @@ void getdomaininfo(struct domain *d, struct xen_domctl_getdomaininfo *info)
         (d->debugger_attached ? XEN_DOMINF_debugged : 0) |
         d->shutdown_code << XEN_DOMINF_shutdownshift;
 
-    if ( is_hvm_domain(d) )
+    switch(d->guest_type)
+    {
+    case guest_type_hvm:
         info->flags |= XEN_DOMINF_hvm_guest;
+        break;
+    case guest_type_pvh:
+        info->flags |= XEN_DOMINF_pvh_guest;
+        break;
+    default:
+        break;
+    }
 
     xsm_security_domaininfo(d, info);
 
@@ -412,8 +421,11 @@ long do_domctl(XEN_GUEST_HANDLE_PARAM(xen_domctl_t) u_domctl)
         ret = -EINVAL;
         if ( supervisor_mode_kernel ||
              (op->u.createdomain.flags &
-             ~(XEN_DOMCTL_CDF_hvm_guest | XEN_DOMCTL_CDF_hap |
-               XEN_DOMCTL_CDF_s3_integrity | XEN_DOMCTL_CDF_oos_off)) )
+             ~(XEN_DOMCTL_CDF_hvm_guest
+               | XEN_DOMCTL_CDF_pvh_guest
+               | XEN_DOMCTL_CDF_hap
+               | XEN_DOMCTL_CDF_s3_integrity
+               | XEN_DOMCTL_CDF_oos_off)) )
             break;
 
         dom = op->domain;
@@ -440,9 +452,15 @@ long do_domctl(XEN_GUEST_HANDLE_PARAM(xen_domctl_t) u_domctl)
             rover = dom;
         }
 
+        if ( (op->u.createdomain.flags & XEN_DOMCTL_CDF_hvm_guest)
+             && (op->u.createdomain.flags & XEN_DOMCTL_CDF_pvh_guest) )
+            return -EINVAL;
+
         domcr_flags = 0;
         if ( op->u.createdomain.flags & XEN_DOMCTL_CDF_hvm_guest )
             domcr_flags |= DOMCRF_hvm;
+        if ( op->u.createdomain.flags & XEN_DOMCTL_CDF_pvh_guest )
+            domcr_flags |= DOMCRF_pvh;
         if ( op->u.createdomain.flags & XEN_DOMCTL_CDF_hap )
             domcr_flags |= DOMCRF_hap;
         if ( op->u.createdomain.flags & XEN_DOMCTL_CDF_s3_integrity )
diff --git a/xen/include/public/domctl.h b/xen/include/public/domctl.h
index d4e479f..c79dabe 100644
--- a/xen/include/public/domctl.h
+++ b/xen/include/public/domctl.h
@@ -47,7 +47,7 @@ struct xen_domctl_createdomain {
     /* IN parameters */
     uint32_t ssidref;
     xen_domain_handle_t handle;
- /* Is this an HVM guest (as opposed to a PV guest)? */
+ /* Is this an HVM guest (as opposed to a PVH or PV guest)? */
 #define _XEN_DOMCTL_CDF_hvm_guest 0
 #define XEN_DOMCTL_CDF_hvm_guest (1U<<_XEN_DOMCTL_CDF_hvm_guest)
  /* Use hardware-assisted paging if available? */
@@ -59,6 +59,9 @@ struct xen_domctl_createdomain {
  /* Disable out-of-sync shadow page tables? */
 #define _XEN_DOMCTL_CDF_oos_off 3
 #define XEN_DOMCTL_CDF_oos_off (1U<<_XEN_DOMCTL_CDF_oos_off)
+ /* Is this a PVH guest (as opposed to an HVM or PV guest)? */
+#define _XEN_DOMCTL_CDF_pvh_guest 4
+#define XEN_DOMCTL_CDF_pvh_guest (1U<<_XEN_DOMCTL_CDF_pvh_guest)
     uint32_t flags;
 };
 typedef struct xen_domctl_createdomain xen_domctl_createdomain_t;
@@ -89,6 +92,9 @@ struct xen_domctl_getdomaininfo {
  /* Being debugged. */
 #define _XEN_DOMINF_debugged 6
 #define XEN_DOMINF_debugged (1U<<_XEN_DOMINF_debugged)
+/* domain is PVH */
+#define _XEN_DOMINF_pvh_guest 7
+#define XEN_DOMINF_pvh_guest (1U<<_XEN_DOMINF_pvh_guest)
  /* XEN_DOMINF_shutdown guest-supplied code. */
 #define XEN_DOMINF_shutdownmask 255
 #define XEN_DOMINF_shutdownshift 16
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index a16dee4..ca2bb1a 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -258,8 +258,12 @@ struct mem_event_per_domain
 
 struct evtchn_port_ops;
 
+/*
+ * PVH is a PV guest running in an HVM container. is_hvm_* checks
+ * will be false, but has_hvm_container_* checks will be true.
+ */
 enum guest_type {
-    guest_type_pv, guest_type_hvm
+    guest_type_pv, guest_type_pvh, guest_type_hvm
 };
 
 struct domain
@@ -499,6 +503,9 @@ struct domain *domain_create(
  /* DOMCRF_oos_off: dont use out-of-sync optimization for shadow page tables */
 #define _DOMCRF_oos_off 4
 #define DOMCRF_oos_off (1U<<_DOMCRF_oos_off)
+ /* DOMCRF_pvh: Create PV domain in HVM container. */
+#define _DOMCRF_pvh 5
+#define DOMCRF_pvh (1U<<_DOMCRF_pvh)
 
 /*
  * rcu_lock_domain_by_id() is more efficient than get_domain_by_id().
@@ -776,6 +783,8 @@ void watchdog_domain_destroy(struct domain *d);
 
 #define is_pv_domain(d) ((d)->guest_type == guest_type_pv)
 #define is_pv_vcpu(v) (is_pv_domain((v)->domain))
+#define is_pvh_domain(d) ((d)->guest_type == guest_type_pvh)
+#define is_pvh_vcpu(v) (is_pvh_domain((v)->domain))
 #define is_hvm_domain(d) ((d)->guest_type == guest_type_hvm)
 #define is_hvm_vcpu(v) (is_hvm_domain(v->domain))
 #define has_hvm_container_domain(d) ((d)->guest_type != guest_type_pv)
-- 
1.7.9.5
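The flag checks this patch spreads across do_domctl() and domain_create() can be condensed into one standalone function for reading purposes. This is a sketch, not hypervisor code: classify_guest() and the CDF_* names are invented for the example, the bit positions of the hvm_guest and pvh_guest flags follow the patch, and the bit position of the HAP flag is an assumption.

```c
#include <errno.h>
#include <stdint.h>

enum guest_type {
    guest_type_pv, guest_type_pvh, guest_type_hvm
};

/* Bits 0 and 4 follow the patch; bit 1 for HAP is assumed here. */
#define CDF_hvm_guest (1U << 0)
#define CDF_hap       (1U << 1)
#define CDF_pvh_guest (1U << 4)

/*
 * Hypothetical helper: map creation flags to a guest type.
 * Returns 0 on success and stores the type in *out, or a negative
 * errno value mirroring the two rejections in the patch.
 */
static inline int classify_guest(uint32_t flags, enum guest_type *out)
{
    /* HVM and PVH are mutually exclusive (do_domctl check). */
    if ( (flags & CDF_hvm_guest) && (flags & CDF_pvh_guest) )
        return -EINVAL;

    if ( flags & CDF_hvm_guest )
        *out = guest_type_hvm;
    else if ( flags & CDF_pvh_guest )
    {
        /* PVH requires HAP (domain_create check). */
        if ( !(flags & CDF_hap) )
            return -EOPNOTSUPP;
        *out = guest_type_pvh;
    }
    else
        *out = guest_type_pv;

    return 0;
}
```

Note that in the real code the two rejections happen in different functions and at different times; folding them together here is only for readability.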
George Dunlap
2013-Nov-04 12:14 UTC
[PATCH v14 06/17] pvh: Disable unneeded features of HVM containers
Things kept:
* cacheattr_region lists
* irq-related structures
* paging
* tm_list
* hvm params

Things disabled for now:
* compat xlation

Things disabled:
* Emulated timers and clock sources
* IO/MMIO io requests
* msix tables
* hvm_funcs
* nested HVM
* Fast-path for emulated lapic accesses

Getting rid of the hvm_params struct required a couple other places to
check for its existence before attempting to read the params.

Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>
Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
---
v14:
 - Also free the params struct for pvh domains, since we've allocated it
 - Fail io for pvh VMs further down the stack, as we will be using the
   emulation code before calling into the pv pio handlers
v13:
 - Removed unnecessary comment
 - Allocate params for pvh domains; remove null checks necessary in last patch
 - Add ASSERT(!is_pvh) to handle_pio
CC: Jan Beulich <jbeulich@suse.com>
CC: Tim Deegan <tim@xen.org>
CC: Keir Fraser <keir@xen.org>
---
 xen/arch/x86/hvm/emulate.c  | 11 +++++++++-
 xen/arch/x86/hvm/hvm.c      | 50 +++++++++++++++++++++++++++++++++++++------
 xen/arch/x86/hvm/irq.c      |  3 +++
 xen/arch/x86/hvm/vmx/intr.c |  3 ++-
 4 files changed, 58 insertions(+), 9 deletions(-)

diff --git a/xen/arch/x86/hvm/emulate.c b/xen/arch/x86/hvm/emulate.c
index f39c173..a41eaa1 100644
--- a/xen/arch/x86/hvm/emulate.c
+++ b/xen/arch/x86/hvm/emulate.c
@@ -57,12 +57,21 @@ static int hvmemul_do_io(
     int value_is_ptr = (p_data == NULL);
     struct vcpu *curr = current;
     struct hvm_vcpu_io *vio;
-    ioreq_t *p = get_ioreq(curr);
+    ioreq_t *p;
     unsigned long ram_gfn = paddr_to_pfn(ram_gpa);
     p2m_type_t p2mt;
     struct page_info *ram_page;
     int rc;
 
+    /* PVH doesn't have an ioreq infrastructure */
+    if ( is_pvh_vcpu(curr) )
+    {
+        gdprintk(XENLOG_WARNING, "Unexpected io from PVH guest\n");
+        return X86EMUL_UNHANDLEABLE;
+    }
+
+    p = get_ioreq(curr);
+
     /* Check for paged out page */
     ram_page = get_page_from_gfn(curr->domain, ram_gfn, &p2mt, P2M_UNSHARE);
     if ( p2m_is_paging(p2mt) )
diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index 87a6f42..72ca936 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -301,6 +301,10 @@ u64 hvm_get_guest_tsc_adjust(struct vcpu *v)
 
 void hvm_migrate_timers(struct vcpu *v)
 {
+    /* PVH doesn't use rtc and emulated timers, it uses pvclock mechanism. */
+    if ( is_pvh_vcpu(v) )
+        return;
+
     rtc_migrate_timers(v);
     pt_migrate(v);
 }
@@ -342,10 +346,13 @@ void hvm_do_resume(struct vcpu *v)
 {
     ioreq_t *p;
 
-    pt_restore_timer(v);
-
     check_wakeup_from_wait();
 
+    if ( is_pvh_vcpu(v) )
+        goto check_inject_trap;
+
+    pt_restore_timer(v);
+
     /* NB. Optimised for common case (p->state == STATE_IOREQ_NONE). */
     p = get_ioreq(v);
     while ( p->state != STATE_IOREQ_NONE )
@@ -368,6 +375,7 @@ void hvm_do_resume(struct vcpu *v)
         }
     }
 
+ check_inject_trap:
     /* Inject pending hw/sw trap */
     if ( v->arch.hvm_vcpu.inject_trap.vector != -1 )
     {
@@ -528,10 +536,16 @@ int hvm_domain_initialise(struct domain *d)
     if ( rc != 0 )
         goto fail0;
 
+    rc = -ENOMEM;
     d->arch.hvm_domain.params = xzalloc_array(uint64_t, HVM_NR_PARAMS);
+    if ( !d->arch.hvm_domain.params )
+        goto fail1;
+
+    if ( is_pvh_domain(d) )
+        return 0;
+
     d->arch.hvm_domain.io_handler = xmalloc(struct hvm_io_handler);
-    rc = -ENOMEM;
-    if ( !d->arch.hvm_domain.params || !d->arch.hvm_domain.io_handler )
+    if ( !d->arch.hvm_domain.io_handler )
         goto fail1;
     d->arch.hvm_domain.io_handler->num_slot = 0;
 
@@ -578,6 +592,11 @@ int hvm_domain_initialise(struct domain *d)
 
 void hvm_domain_relinquish_resources(struct domain *d)
 {
+    xfree(d->arch.hvm_domain.params);
+
+    if ( is_pvh_domain(d) )
+        return;
+
     if ( hvm_funcs.nhvm_domain_relinquish_resources )
         hvm_funcs.nhvm_domain_relinquish_resources(d);
 
@@ -596,12 +615,15 @@ void hvm_domain_relinquish_resources(struct domain *d)
     }
 
     xfree(d->arch.hvm_domain.io_handler);
-    xfree(d->arch.hvm_domain.params);
 }
 
 void hvm_domain_destroy(struct domain *d)
 {
     hvm_destroy_cacheattr_region_list(d);
+
+    if ( is_pvh_domain(d) )
+        return;
+
     hvm_funcs.domain_destroy(d);
     rtc_deinit(d);
     stdvga_deinit(d);
@@ -1103,7 +1125,9 @@ int hvm_vcpu_initialise(struct vcpu *v)
         goto fail1;
 
     /* NB: vlapic_init must be called before hvm_funcs.vcpu_initialise */
-    if ( (rc = vlapic_init(v)) != 0 ) /* teardown: vlapic_destroy */
+    if ( is_hvm_vcpu(v) )
+        rc = vlapic_init(v);
+    if ( rc != 0 ) /* teardown: vlapic_destroy */
         goto fail2;
 
     if ( (rc = hvm_funcs.vcpu_initialise(v)) != 0 ) /* teardown: hvm_funcs.vcpu_destroy */
@@ -1118,6 +1142,14 @@ int hvm_vcpu_initialise(struct vcpu *v)
 
     v->arch.hvm_vcpu.inject_trap.vector = -1;
 
+    if ( is_pvh_vcpu(v) )
+    {
+        v->arch.hvm_vcpu.hcall_64bit = 1; /* PVH 32bitfixme. */
+        /* This for hvm_long_mode_enabled(v). */
+        v->arch.hvm_vcpu.guest_efer = EFER_SCE | EFER_LMA | EFER_LME;
+        return 0;
+    }
+
     rc = setup_compat_arg_xlat(v); /* teardown: free_compat_arg_xlat() */
     if ( rc != 0 )
         goto fail4;
@@ -1189,7 +1221,10 @@ void hvm_vcpu_destroy(struct vcpu *v)
 
     tasklet_kill(&v->arch.hvm_vcpu.assert_evtchn_irq_tasklet);
     hvm_vcpu_cacheattr_destroy(v);
-    vlapic_destroy(v);
+
+    if ( is_hvm_vcpu(v) )
+        vlapic_destroy(v);
+
     hvm_funcs.vcpu_destroy(v);
 
     /* Event channel is already freed by evtchn_destroy(). */
@@ -1390,6 +1425,7 @@ int hvm_hap_nested_page_fault(paddr_t gpa,
     /* For the benefit of 32-bit WinXP (& older Windows) on AMD CPUs,
      * a fast path for LAPIC accesses, skipping the p2m lookup. */
     if ( !nestedhvm_vcpu_in_guestmode(v)
+         && is_hvm_vcpu(v)
          && gfn == PFN_DOWN(vlapic_base_address(vcpu_vlapic(v))) )
     {
         if ( !handle_mmio() )
diff --git a/xen/arch/x86/hvm/irq.c b/xen/arch/x86/hvm/irq.c
index 6a6fb68..677fbcd 100644
--- a/xen/arch/x86/hvm/irq.c
+++ b/xen/arch/x86/hvm/irq.c
@@ -405,6 +405,9 @@ struct hvm_intack hvm_vcpu_has_pending_irq(struct vcpu *v)
          && vcpu_info(v, evtchn_upcall_pending) )
         return hvm_intack_vector(plat->irq.callback_via.vector);
 
+    if ( is_pvh_vcpu(v) )
+        return hvm_intack_none;
+
     if ( vlapic_accept_pic_intr(v) && plat->vpic[0].int_output )
         return hvm_intack_pic(0);
 
diff --git a/xen/arch/x86/hvm/vmx/intr.c b/xen/arch/x86/hvm/vmx/intr.c
index 1942e31..7757910 100644
--- a/xen/arch/x86/hvm/vmx/intr.c
+++ b/xen/arch/x86/hvm/vmx/intr.c
@@ -236,7 +236,8 @@ void vmx_intr_assist(void)
     }
 
     /* Crank the handle on interrupt state. */
-    pt_vector = pt_update_irq(v);
+    if ( is_hvm_vcpu(v) )
+        pt_vector = pt_update_irq(v);
 
     do {
         unsigned long intr_info;
-- 
1.7.9.5
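The allocation pattern in hvm_domain_initialise() and hvm_domain_relinquish_resources() above (shared state first, then an early return for PVH, then HVM-only state) can be shown in isolation. The sketch below uses plain malloc/free and invented names (struct dom, dom_initialise, dom_relinquish, NR_PARAMS, IO_HANDLER_SIZE); it only models the ordering, not the real Xen structures.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define NR_PARAMS       8   /* hypothetical; stands in for HVM_NR_PARAMS */
#define IO_HANDLER_SIZE 64  /* hypothetical size of the HVM-only state */

/* Hypothetical stand-in for the relevant parts of a domain. */
struct dom {
    bool is_pvh;
    uint64_t *params;   /* needed by both HVM and PVH */
    void *io_handler;   /* HVM only */
};

static inline int dom_initialise(struct dom *d)
{
    d->io_handler = NULL;

    /* Shared state is allocated before the PVH early return. */
    d->params = calloc(NR_PARAMS, sizeof(*d->params));
    if ( !d->params )
        return -ENOMEM;

    if ( d->is_pvh )
        return 0;

    /* HVM-only state comes after it. */
    d->io_handler = malloc(IO_HANDLER_SIZE);
    if ( !d->io_handler )
    {
        free(d->params);
        d->params = NULL;
        return -ENOMEM;
    }
    return 0;
}

static inline void dom_relinquish(struct dom *d)
{
    /* Mirror image: free shared state first, then bail out for PVH. */
    free(d->params);
    d->params = NULL;

    if ( d->is_pvh )
        return;

    free(d->io_handler);
    d->io_handler = NULL;
}
```

Keeping the teardown order symmetric with the setup order is what lets the v14 change ("also free the params struct for pvh domains") be a two-line move rather than a new code path.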
Changes:
* Enforce HAP mode for now
* Disable exits related to virtual interrupts or emulated APICs
* Disable changing paging mode
 - "unrestricted guest" (i.e., real mode for EPT) disabled
 - write guest EFER disabled
* Start in 64-bit mode
* Force TSC mode to be "none"
* Paging mode update to happen in arch_set_info_guest

Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>
Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
---
v14:
 - Mask out bits of cr4 that the guest is not allowed to set
v13:
 - Fix up default cr0 settings
 - Get rid of some unnecessary PVH-related changes
 - Return EOPNOTSUPP instead of ENOSYS if hardware features are not present
 - Remove an unnecessary variable from pvh_check_requirements
CC: Jan Beulich <jbeulich@suse.com>
CC: Tim Deegan <tim@xen.org>
CC: Keir Fraser <keir@xen.org>
---
 xen/arch/x86/hvm/vmx/vmcs.c | 132 +++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 128 insertions(+), 4 deletions(-)

diff --git a/xen/arch/x86/hvm/vmx/vmcs.c b/xen/arch/x86/hvm/vmx/vmcs.c
index f2a2857..ba05ebb 100644
--- a/xen/arch/x86/hvm/vmx/vmcs.c
+++ b/xen/arch/x86/hvm/vmx/vmcs.c
@@ -28,6 +28,7 @@
 #include <asm/msr.h>
 #include <asm/xstate.h>
 #include <asm/hvm/hvm.h>
+#include <asm/hvm/nestedhvm.h>
 #include <asm/hvm/io.h>
 #include <asm/hvm/support.h>
 #include <asm/hvm/vmx/vmx.h>
@@ -841,6 +842,60 @@ void virtual_vmcs_vmwrite(void *vvmcs, u32 vmcs_encoding, u64 val)
     virtual_vmcs_exit(vvmcs);
 }
 
+static int pvh_check_requirements(struct vcpu *v)
+{
+    u64 required;
+
+    /* Check for required hardware features */
+    if ( !cpu_has_vmx_ept )
+    {
+        printk(XENLOG_G_INFO "PVH: CPU does not have EPT support\n");
+        return -EOPNOTSUPP;
+    }
+    if ( !cpu_has_vmx_pat )
+    {
+        printk(XENLOG_G_INFO "PVH: CPU does not have PAT support\n");
+        return -EOPNOTSUPP;
+    }
+    if ( !cpu_has_vmx_msr_bitmap )
+    {
+        printk(XENLOG_G_INFO "PVH: CPU does not have msr bitmap\n");
+        return -EOPNOTSUPP;
+    }
+    if ( !cpu_has_vmx_secondary_exec_control )
+    {
+        printk(XENLOG_G_INFO "CPU Secondary exec is required to run PVH\n");
+        return -EOPNOTSUPP;
+    }
+    required = X86_CR4_PAE | X86_CR4_VMXE | X86_CR4_OSFXSR;
+    if ( (real_cr4_to_pv_guest_cr4(mmu_cr4_features) & required) != required )
+    {
+        printk(XENLOG_G_INFO "PVH: required CR4 features not available:%lx\n",
+               required);
+        return -EOPNOTSUPP;
+    }
+
+    /* Check for required configuration options */
+    if ( !paging_mode_hap(v->domain) )
+    {
+        printk(XENLOG_G_INFO "HAP is required for PVH guest.\n");
+        return -EINVAL;
+    }
+    /*
+     * If rdtsc exiting is turned on and it goes thru emulate_privileged_op,
+     * then pv_vcpu.ctrlreg must be added to the pvh struct.
+     */
+    if ( v->domain->arch.vtsc )
+    {
+        printk(XENLOG_G_INFO
+               "At present PVH only supports the default timer mode\n");
+        return -EINVAL;
+    }
+
+
+    return 0;
+}
+
 static int construct_vmcs(struct vcpu *v)
 {
     struct domain *d = v->domain;
@@ -849,6 +904,13 @@ static int construct_vmcs(struct vcpu *v)
     u32 vmexit_ctl = vmx_vmexit_control;
     u32 vmentry_ctl = vmx_vmentry_control;
 
+    if ( is_pvh_domain(d) )
+    {
+        int rc = pvh_check_requirements(v);
+        if ( rc )
+            return rc;
+    }
+
     vmx_vmcs_enter(v);
 
     /* VMCS controls. */
@@ -887,7 +949,32 @@ static int construct_vmcs(struct vcpu *v)
     /* Do not enable Monitor Trap Flag unless start single step debug */
     v->arch.hvm_vmx.exec_control &= ~CPU_BASED_MONITOR_TRAP_FLAG;
 
+    if ( is_pvh_domain(d) )
+    {
+        /* Disable virtual apics, TPR */
+        v->arch.hvm_vmx.secondary_exec_control &=
+            ~(SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES
+              | SECONDARY_EXEC_APIC_REGISTER_VIRT
+              | SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY);
+        v->arch.hvm_vmx.exec_control &= ~CPU_BASED_TPR_SHADOW;
+
+        /* Disable wbinvd (only necessary for MMIO),
+         * unrestricted guest (real mode for EPT) */
+        v->arch.hvm_vmx.secondary_exec_control &=
+            ~(SECONDARY_EXEC_UNRESTRICTED_GUEST
+              | SECONDARY_EXEC_WBINVD_EXITING);
+
+        /* Start in 64-bit mode.
+         * PVH 32bitfixme. */
+        vmentry_ctl |= VM_ENTRY_IA32E_MODE; /* GUEST_EFER.LME/LMA ignored */
+
+        ASSERT(v->arch.hvm_vmx.exec_control & CPU_BASED_ACTIVATE_SECONDARY_CONTROLS);
+        ASSERT(v->arch.hvm_vmx.exec_control & CPU_BASED_ACTIVATE_MSR_BITMAP);
+        ASSERT(!(v->arch.hvm_vmx.exec_control & CPU_BASED_RDTSC_EXITING));
+    }
+
     vmx_update_cpu_exec_control(v);
+
     __vmwrite(VM_EXIT_CONTROLS, vmexit_ctl);
     __vmwrite(VM_ENTRY_CONTROLS, vmentry_ctl);
 
@@ -923,6 +1010,17 @@ static int construct_vmcs(struct vcpu *v)
         vmx_disable_intercept_for_msr(v, MSR_IA32_SYSENTER_EIP, MSR_TYPE_R | MSR_TYPE_W);
         if ( cpu_has_vmx_pat && paging_mode_hap(d) )
             vmx_disable_intercept_for_msr(v, MSR_IA32_CR_PAT, MSR_TYPE_R | MSR_TYPE_W);
+        if ( is_pvh_domain(d) )
+            vmx_disable_intercept_for_msr(v, MSR_SHADOW_GS_BASE, MSR_TYPE_R | MSR_TYPE_W);
+
+        /*
+         * PVH: We don't disable intercepts for MSRs: MSR_STAR, MSR_LSTAR,
+         * MSR_CSTAR, and MSR_SYSCALL_MASK because we need to specify
+         * save/restore area to save/restore at every VM exit and entry.
+         * Instead, let the intercept functions save them into
+         * vmx_msr_state fields. See comment in vmx_restore_host_msrs().
+         * See also vmx_restore_guest_msrs().
+         */
     }
 
     /* I/O access bitmap. */
@@ -1011,7 +1109,11 @@ static int construct_vmcs(struct vcpu *v)
     __vmwrite(GUEST_DS_AR_BYTES, 0xc093);
     __vmwrite(GUEST_FS_AR_BYTES, 0xc093);
     __vmwrite(GUEST_GS_AR_BYTES, 0xc093);
-    __vmwrite(GUEST_CS_AR_BYTES, 0xc09b); /* exec/read, accessed */
+    if ( is_pvh_domain(d) )
+        /* CS.L == 1, exec, read/write, accessed. PVH 32bitfixme. */
+        __vmwrite(GUEST_CS_AR_BYTES, 0xa09b);
+    else
+        __vmwrite(GUEST_CS_AR_BYTES, 0xc09b); /* exec/read, accessed */
 
     /* Guest IDT. */
     __vmwrite(GUEST_IDTR_BASE, 0);
@@ -1041,12 +1143,29 @@ static int construct_vmcs(struct vcpu *v)
               | (1U << TRAP_no_device);
     vmx_update_exception_bitmap(v);
 
+    /* In HVM domains, this happens on the realmode->paging
+     * transition. Since PVH never goes through this transition, we
+     * need to do it at start-of-day. */
+    if ( is_pvh_domain(d) )
+        vmx_update_debug_state(v);
+
     v->arch.hvm_vcpu.guest_cr[0] = X86_CR0_PE | X86_CR0_ET;
+
+    /* PVH domains always start in paging mode */
+    if ( is_pvh_domain(d) )
+        v->arch.hvm_vcpu.guest_cr[0] |= X86_CR0_PG | X86_CR0_NE | X86_CR0_WP;
+
     hvm_update_guest_cr(v, 0);
 
-    v->arch.hvm_vcpu.guest_cr[4] = 0;
+    v->arch.hvm_vcpu.guest_cr[4] = is_pvh_domain(d) ?
+        (real_cr4_to_pv_guest_cr4(mmu_cr4_features)
+         & ~HVM_CR4_GUEST_RESERVED_BITS(v))
+        : 0;
     hvm_update_guest_cr(v, 4);
 
+    if ( is_pvh_domain(d) )
+        v->arch.hvm_vmx.vmx_realmode = 0;
+
     if ( cpu_has_vmx_tpr_shadow )
     {
         __vmwrite(VIRTUAL_APIC_PAGE_ADDR,
@@ -1076,9 +1195,14 @@ static int construct_vmcs(struct vcpu *v)
 
     vmx_vmcs_exit(v);
 
-    paging_update_paging_modes(v); /* will update HOST & GUEST_CR3 as reqd */
+    /* PVH: paging mode is updated by arch_set_info_guest(). */
+    if ( is_hvm_vcpu(v) )
+    {
+        /* will update HOST & GUEST_CR3 as reqd */
+        paging_update_paging_modes(v);
 
-    vmx_vlapic_msr_changed(v);
+        vmx_vlapic_msr_changed(v);
+    }
 
     return 0;
 }
-- 
1.7.9.5
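The two GUEST_CS_AR_BYTES values in this patch, 0xc09b for HVM and 0xa09b for PVH, differ only in the L and D/B bits. The decoder below is a standalone sketch for checking that by hand; the field layout is the VMX segment access-rights format as described in the Intel SDM to the best of my reading, and struct seg_ar and decode_ar() are names invented here, not Xen identifiers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical decoded view of a VMCS segment access-rights field. */
struct seg_ar {
    unsigned int type;  /* bits 3:0, segment type */
    bool s;             /* bit 4, 1 = code/data segment */
    unsigned int dpl;   /* bits 6:5, privilege level */
    bool present;       /* bit 7 */
    bool avl;           /* bit 12 */
    bool l;             /* bit 13, 64-bit code segment */
    bool db;            /* bit 14, default operation size */
    bool g;             /* bit 15, granularity */
};

static inline struct seg_ar decode_ar(uint32_t ar)
{
    struct seg_ar r = {
        .type    = ar & 0xf,
        .s       = (ar >> 4) & 1,
        .dpl     = (ar >> 5) & 3,
        .present = (ar >> 7) & 1,
        .avl     = (ar >> 12) & 1,
        .l       = (ar >> 13) & 1,
        .db      = (ar >> 14) & 1,
        .g       = (ar >> 15) & 1,
    };
    return r;
}
```

Decoding both constants gives type 0xb (execute/read, accessed), S=1, DPL=0, P=1 and G=1 in each case; the PVH value has L=1 and D/B=0, which is what a 64-bit code segment requires, while the HVM value has L=0 and D/B=1.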
George Dunlap
2013-Nov-04 12:14 UTC
[PATCH v14 08/17] pvh: Do not allow PVH guests to change paging modes
Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>
Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
v13: Removed unnecessary else.

CC: Jan Beulich <jbeulich@suse.com>
CC: Tim Deegan <tim@xen.org>
CC: Keir Fraser <keir@xen.org>
---
 xen/arch/x86/hvm/hvm.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index 72ca936..1e1bef0 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -1769,6 +1769,16 @@ int hvm_set_cr0(unsigned long value)
          (value & (X86_CR0_PE | X86_CR0_PG)) == X86_CR0_PG )
         goto gpf;
 
+
+
+    /* A pvh is not expected to change to real mode. */
+    if ( is_pvh_vcpu(v)
+         && (value & (X86_CR0_PE | X86_CR0_PG)) != (X86_CR0_PG | X86_CR0_PE) )
+    {
+        printk(XENLOG_G_WARNING
+               "PVH attempting to turn off PE/PG. CR0:%lx\n", value);
+        goto gpf;
+    }
     if ( (value & X86_CR0_PG) && !(old_value & X86_CR0_PG) )
     {
         if ( v->arch.hvm_vcpu.guest_efer & EFER_LME )
-- 
1.7.9.5
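The new check in hvm_set_cr0() boils down to "both PE and PG must stay set". A standalone sketch of just that predicate follows; pvh_cr0_write_allowed() is a name invented for the example, and the two constants are the architectural CR0 bit positions (bit 0 and bit 31) written out locally rather than taken from Xen headers.

```c
#include <stdbool.h>

/* Architectural CR0 bits, defined locally for the sketch. */
#define X86_CR0_PE 0x00000001UL /* protected mode enable */
#define X86_CR0_PG 0x80000000UL /* paging enable */

/*
 * Hypothetical helper: would a PVH guest be allowed to load this
 * value into CR0?  Both PE and PG have to remain set.
 */
static inline bool pvh_cr0_write_allowed(unsigned long value)
{
    const unsigned long need = X86_CR0_PE | X86_CR0_PG;

    return (value & need) == need;
}
```

Any value that clears either bit fails the test, which is the condition under which the patch injects #GP instead of applying the write.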
Hypercalls where we now have unrestricted access:
* memory_op
* console_io
* vcpu_op
* mmuext_op

We also restrict PVH domain access to HVMOP_*_param to reading and
writing HVM_PARAM_CALLBACK_IRQ.

Most hvm_op functions require "is_hvm_domain()" and will default to
-EINVAL; exceptions are HVMOP_get_time and HVMOP_xentrace.

Finally, we restrict setting IOPL permissions for a PVH domain.

Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>
Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
---
v14:
 - Get rid of (now) spurious null check for hvm_domain.params
v13:
 - Minor code tweaks, as suggested during review
 - return -ENOSYS for set_iopl and set_iobitmap calls
 - Allow HVMOP_set_param for HVM_PARAM_CALLBACK_IRQ. We still don't
   allow other values to be written.
CC: Jan Beulich <jbeulich@suse.com>
CC: Tim Deegan <tim@xen.org>
CC: Keir Fraser <keir@xen.org>
---
 xen/arch/x86/hvm/hvm.c  | 44 ++++++++++++++++++++++++++++++++++++--------
 xen/arch/x86/hvm/mtrr.c |  1 +
 xen/arch/x86/physdev.c  | 10 ++++++++++
 xen/common/kernel.c     | 13 +++++++++++--
 4 files changed, 58 insertions(+), 10 deletions(-)

diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index 1e1bef0..0708913 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -3376,6 +3376,24 @@ static hvm_hypercall_t *const hvm_hypercall32_table[NR_hypercalls] = {
     HYPERCALL(tmem_op)
 };
 
+/* PVH 32bitfixme. */
+static hvm_hypercall_t *const pvh_hypercall64_table[NR_hypercalls] = {
+    HYPERCALL(platform_op),
+    HYPERCALL(memory_op),
+    HYPERCALL(xen_version),
+    HYPERCALL(console_io),
+    [ __HYPERVISOR_grant_table_op ] = (hvm_hypercall_t *)hvm_grant_table_op,
+    HYPERCALL(vcpu_op),
+    HYPERCALL(mmuext_op),
+    HYPERCALL(xsm_op),
+    HYPERCALL(sched_op),
+    HYPERCALL(event_channel_op),
+    [ __HYPERVISOR_physdev_op ] = (hvm_hypercall_t *)hvm_physdev_op,
+    HYPERCALL(hvm_op),
+    HYPERCALL(sysctl),
+    HYPERCALL(domctl)
+};
+
 int hvm_do_hypercall(struct cpu_user_regs *regs)
 {
     struct vcpu *curr = current;
@@ -3402,7 +3420,9 @@ int hvm_do_hypercall(struct cpu_user_regs *regs)
     if ( (eax & 0x80000000) && is_viridian_domain(curr->domain) )
         return viridian_hypercall(regs);
 
-    if ( (eax >= NR_hypercalls) || !hvm_hypercall32_table[eax] )
+    if ( (eax >= NR_hypercalls) ||
+         (is_pvh_vcpu(curr) ? !pvh_hypercall64_table[eax]
+                            : !hvm_hypercall32_table[eax]) )
     {
         regs->eax = -ENOSYS;
         return HVM_HCALL_completed;
@@ -3417,16 +3437,20 @@ int hvm_do_hypercall(struct cpu_user_regs *regs)
                     regs->r10, regs->r8, regs->r9);
 
         curr->arch.hvm_vcpu.hcall_64bit = 1;
-        regs->rax = hvm_hypercall64_table[eax](regs->rdi,
-                                               regs->rsi,
-                                               regs->rdx,
-                                               regs->r10,
-                                               regs->r8,
-                                               regs->r9);
+        if ( is_pvh_vcpu(curr) )
+            regs->rax = pvh_hypercall64_table[eax](regs->rdi, regs->rsi,
+                                                   regs->rdx, regs->r10,
+                                                   regs->r8, regs->r9);
+        else
+            regs->rax = hvm_hypercall64_table[eax](regs->rdi, regs->rsi,
+                                                   regs->rdx, regs->r10,
+                                                   regs->r8, regs->r9);
         curr->arch.hvm_vcpu.hcall_64bit = 0;
     }
     else
     {
+        ASSERT(!is_pvh_vcpu(curr)); /* PVH 32bitfixme. */
+
         HVM_DBG_LOG(DBG_LEVEL_HCALL, "hcall%u(%x, %x, %x, %x, %x, %x)", eax,
                     (uint32_t)regs->ebx, (uint32_t)regs->ecx,
                     (uint32_t)regs->edx, (uint32_t)regs->esi,
@@ -3851,7 +3875,11 @@ long do_hvm_op(unsigned long op, XEN_GUEST_HANDLE_PARAM(void) arg)
             return -ESRCH;
 
         rc = -EINVAL;
-        if ( !is_hvm_domain(d) )
+        if ( !has_hvm_container_domain(d) )
+            goto param_fail;
+
+        if ( is_pvh_domain(d)
+             && (a.index != HVM_PARAM_CALLBACK_IRQ) )
             goto param_fail;
 
         rc = xsm_hvm_param(XSM_TARGET, d, op);
diff --git a/xen/arch/x86/hvm/mtrr.c b/xen/arch/x86/hvm/mtrr.c
index ef51a8d..9785cef 100644
--- a/xen/arch/x86/hvm/mtrr.c
+++ b/xen/arch/x86/hvm/mtrr.c
@@ -578,6 +578,7 @@ int32_t hvm_set_mem_pinned_cacheattr(
 {
     struct hvm_mem_pinned_cacheattr_range *range;
 
+    /* Side note: A PVH guest writes to MSR_IA32_CR_PAT natively. */
     if ( !((type == PAT_TYPE_UNCACHABLE) ||
            (type == PAT_TYPE_WRCOMB) ||
            (type == PAT_TYPE_WRTHROUGH) ||
diff --git a/xen/arch/x86/physdev.c b/xen/arch/x86/physdev.c
index dab6213..7d787dc 100644
--- a/xen/arch/x86/physdev.c
+++ b/xen/arch/x86/physdev.c
@@ -519,6 +519,11 @@ ret_t do_physdev_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
 
     case PHYSDEVOP_set_iopl: {
         struct physdev_set_iopl set_iopl;
+
+        ret = -ENOSYS;
+        if ( is_pvh_vcpu(current) )
+            break;
+
         ret = -EFAULT;
         if ( copy_from_guest(&set_iopl, arg, 1) != 0 )
             break;
@@ -532,6 +537,11 @@ ret_t do_physdev_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
 
     case PHYSDEVOP_set_iobitmap: {
         struct physdev_set_iobitmap set_iobitmap;
+
+        ret = -ENOSYS;
+        if ( is_pvh_vcpu(current) )
+            break;
+
         ret = -EFAULT;
         if ( copy_from_guest(&set_iobitmap, arg, 1) != 0 )
             break;
diff --git a/xen/common/kernel.c b/xen/common/kernel.c
index 97d9050..cc1f743 100644
--- a/xen/common/kernel.c
+++ b/xen/common/kernel.c
@@ -306,14 +306,23 @@ DO(xen_version)(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
         if ( current->domain == dom0 )
             fi.submap |= 1U << XENFEAT_dom0;
 #ifdef CONFIG_X86
-        if ( is_pv_vcpu(current) )
+        switch(d->guest_type) {
+        case
guest_type_pv: fi.submap |= (1U << XENFEAT_mmu_pt_update_preserve_ad) | (1U << XENFEAT_highmem_assist) | (1U << XENFEAT_gnttab_map_avail_bits); - else + break; + case guest_type_pvh: + fi.submap |= (1U << XENFEAT_hvm_safe_pvclock) | + (1U << XENFEAT_supervisor_mode_kernel) | + (1U << XENFEAT_hvm_callback_vector); + break; + case guest_type_hvm: fi.submap |= (1U << XENFEAT_hvm_safe_pvclock) | (1U << XENFEAT_hvm_callback_vector) | (1U << XENFEAT_hvm_pirqs); + break; + } #endif break; default: -- 1.7.9.5
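The dispatch scheme in the hypercall patch above (a sparse table indexed by hypercall number, with missing entries rejected as -ENOSYS) can be sketched on its own. This is an illustration under assumptions, not the Xen implementation: NR_CALLS, hcall_fn, the demo_* handlers and the slot numbers are all invented, and the real handlers take six arguments rather than one.

```c
#include <errno.h>
#include <stddef.h>

#define NR_CALLS 8  /* hypothetical table size */

typedef long (*hcall_fn)(unsigned long arg);

/* Two made-up handlers standing in for real hypercalls. */
static long demo_version(unsigned long arg) { (void)arg; return 4; }
static long demo_echo(unsigned long arg)    { return (long)arg; }

/*
 * Designated initializers leave every unlisted slot NULL, so the table
 * doubles as a whitelist: only the listed numbers are reachable.
 */
static hcall_fn const demo_table[NR_CALLS] = {
    [1] = demo_version,
    [5] = demo_echo,
};

/* Range check first, then the NULL check, then the indirect call. */
static long demo_dispatch(unsigned long nr, unsigned long arg)
{
    if (nr >= NR_CALLS || demo_table[nr] == NULL)
        return -ENOSYS;

    return demo_table[nr](arg);
}
```

Keeping a separate table per guest type, as the patch does, means the whitelist is visible in one place instead of being spread across per-handler checks.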
Allow PV e820 map to be set and read from a PVH domain. This requires moving the pv e820 struct out from the pv-specific domain struct and into the arch domain struct. Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com> Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> --- CC: Jan Beulich <jbeulich@suse.com> CC: Tim Deegan <tim@xen.org> CC: Keir Fraser <keir@xen.org> --- xen/arch/x86/domain.c | 9 +++------ xen/arch/x86/mm.c | 26 ++++++++++++-------------- xen/include/asm-x86/domain.h | 10 +++++----- 3 files changed, 20 insertions(+), 25 deletions(-) diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c index 358616c..8c2a57f 100644 --- a/xen/arch/x86/domain.c +++ b/xen/arch/x86/domain.c @@ -553,6 +553,7 @@ int arch_domain_create(struct domain *d, unsigned int domcr_flags) if ( (rc = iommu_domain_init(d)) != 0 ) goto fail; } + spin_lock_init(&d->arch.e820_lock); if ( has_hvm_container_domain(d) ) { @@ -563,13 +564,9 @@ int arch_domain_create(struct domain *d, unsigned int domcr_flags) } } else - { /* 64-bit PV guest by default. 
*/ d->arch.is_32bit_pv = d->arch.has_32bit_shinfo = 0; - spin_lock_init(&d->arch.pv_domain.e820_lock); - } - /* initialize default tsc behavior in case tools don't */ tsc_set_info(d, TSC_MODE_DEFAULT, 0UL, 0, 0); spin_lock_init(&d->arch.vtsc_lock); @@ -592,8 +589,8 @@ void arch_domain_destroy(struct domain *d) { if ( has_hvm_container_domain(d) ) hvm_domain_destroy(d); - else - xfree(d->arch.pv_domain.e820); + + xfree(d->arch.e820); free_domain_pirqs(d); if ( !is_idle_domain(d) ) diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c index 9621e22..6c26026 100644 --- a/xen/arch/x86/mm.c +++ b/xen/arch/x86/mm.c @@ -4759,11 +4759,11 @@ long arch_memory_op(int op, XEN_GUEST_HANDLE_PARAM(void) arg) return -EFAULT; } - spin_lock(&d->arch.pv_domain.e820_lock); - xfree(d->arch.pv_domain.e820); - d->arch.pv_domain.e820 = e820; - d->arch.pv_domain.nr_e820 = fmap.map.nr_entries; - spin_unlock(&d->arch.pv_domain.e820_lock); + spin_lock(&d->arch.e820_lock); + xfree(d->arch.e820); + d->arch.e820 = e820; + d->arch.nr_e820 = fmap.map.nr_entries; + spin_unlock(&d->arch.e820_lock); rcu_unlock_domain(d); return rc; @@ -4777,26 +4777,24 @@ long arch_memory_op(int op, XEN_GUEST_HANDLE_PARAM(void) arg) if ( copy_from_guest(&map, arg, 1) ) return -EFAULT; - spin_lock(&d->arch.pv_domain.e820_lock); + spin_lock(&d->arch.e820_lock); /* Backwards compatibility. 
*/ - if ( (d->arch.pv_domain.nr_e820 == 0) || - (d->arch.pv_domain.e820 == NULL) ) + if ( (d->arch.nr_e820 == 0) || (d->arch.e820 == NULL) ) { - spin_unlock(&d->arch.pv_domain.e820_lock); + spin_unlock(&d->arch.e820_lock); return -ENOSYS; } - map.nr_entries = min(map.nr_entries, d->arch.pv_domain.nr_e820); - if ( copy_to_guest(map.buffer, d->arch.pv_domain.e820, - map.nr_entries) || + map.nr_entries = min(map.nr_entries, d->arch.nr_e820); + if ( copy_to_guest(map.buffer, d->arch.e820, map.nr_entries) || __copy_to_guest(arg, &map, 1) ) { - spin_unlock(&d->arch.pv_domain.e820_lock); + spin_unlock(&d->arch.e820_lock); return -EFAULT; } - spin_unlock(&d->arch.pv_domain.e820_lock); + spin_unlock(&d->arch.e820_lock); return 0; } diff --git a/xen/include/asm-x86/domain.h b/xen/include/asm-x86/domain.h index 07e21f7..67c3d4b 100644 --- a/xen/include/asm-x86/domain.h +++ b/xen/include/asm-x86/domain.h @@ -234,11 +234,6 @@ struct pv_domain /* map_domain_page() mapping cache. */ struct mapcache_domain mapcache; - - /* Pseudophysical e820 map (XENMEM_memory_map). */ - spinlock_t e820_lock; - struct e820entry *e820; - unsigned int nr_e820; }; struct arch_domain @@ -313,6 +308,11 @@ struct arch_domain (possibly other cases in the future */ uint64_t vtsc_kerncount; /* for hvm, counts all vtsc */ uint64_t vtsc_usercount; /* not used for hvm */ + + /* Pseudophysical e820 map (XENMEM_memory_map). */ + spinlock_t e820_lock; + struct e820entry *e820; + unsigned int nr_e820; } __cacheline_aligned; #define has_arch_pdevs(d) (!list_empty(&(d)->arch.pdev_list)) -- 1.7.9.5
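The read side of the e820 handling above boils down to: fail with -ENOSYS if no map was ever set, otherwise clamp the caller's entry count to what is stored and copy that many entries. A rough standalone sketch of that logic follows. The demo_* types and the function name are invented, plain memcpy() stands in for copy_to_guest(), and the spinlock the real code holds around the access is omitted.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Simplified stand-in for struct e820entry. */
struct demo_e820entry {
    uint64_t addr;
    uint64_t size;
    uint32_t type;
};

/* Only the two fields the patch moves into struct arch_domain. */
struct demo_arch_domain {
    struct demo_e820entry *e820;
    unsigned int nr_e820;
};

/*
 * Hypothetical reader: on success *nr_entries is updated to the number
 * of entries actually copied into buf.
 */
static int demo_read_memory_map(const struct demo_arch_domain *d,
                                struct demo_e820entry *buf,
                                unsigned int *nr_entries)
{
    /* No map set by the toolstack: same "backwards compatibility" exit. */
    if (d->nr_e820 == 0 || d->e820 == NULL)
        return -ENOSYS;

    /* Equivalent of min(map.nr_entries, d->arch.nr_e820). */
    if (*nr_entries > d->nr_e820)
        *nr_entries = d->nr_e820;

    if (*nr_entries != 0)
        memcpy(buf, d->e820, *nr_entries * sizeof(*buf));

    return 0;
}
```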
George Dunlap
2013-Nov-04 12:15 UTC
[PATCH v14 11/17] pvh: Set up more PV stuff in set_info_guest
Allow the guest to set up a few more things when bringing up a vcpu. This includes cr3 and gs_base. Also set up wallclock, and only initialize a vcpu once. Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com> Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com> --- v14: - Share more of the codepath, removing a potential bug that might happen if paging functions are called with "is_initialised" set to zero. - Put cr3 in v->arch.guest_table, so the ref counting happens properly. This should fix the "zombie domains" problem. v13: - Get rid of separate pvh call, and fold gs_base write into hvm_set_info_guest - Check pvh parameters for validity at the top of arch_set_info_guest - Fix comment about PVH and set_info_guest CC: Jan Beulich <jbeulich@suse.com> CC: Tim Deegan <tim@xen.org> CC: Keir Fraser <keir@xen.org> --- xen/arch/x86/domain.c | 30 ++++++++++++++++++++++++++++-- xen/arch/x86/hvm/vmx/vmx.c | 7 ++++++- xen/include/asm-x86/hvm/hvm.h | 6 +++--- xen/include/public/arch-x86/xen.h | 11 +++++++++++ 4 files changed, 48 insertions(+), 6 deletions(-) diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c index 8c2a57f..c80ef4c 100644 --- a/xen/arch/x86/domain.c +++ b/xen/arch/x86/domain.c @@ -691,6 +691,18 @@ int arch_set_info_guest( (c(ldt_ents) > 8192) ) return -EINVAL; } + else if ( is_pvh_vcpu(v) ) + { + /* PVH 32bitfixme */ + ASSERT(!compat); + + if ( c(ctrlreg[1]) || c(ldt_base) || c(ldt_ents) || + c(user_regs.cs) || c(user_regs.ss) || c(user_regs.es) || + c(user_regs.ds) || c(user_regs.fs) || c(user_regs.gs) || + c.nat->gdt_ents || c.nat->fs_base || c.nat->gs_base_user ) + return -EINVAL; + + } v->fpu_initialised = !!(flags & VGCF_I387_VALID); @@ -728,8 +740,21 @@ int arch_set_info_guest( if ( has_hvm_container_vcpu(v) ) { - hvm_set_info_guest(v); - goto out; + hvm_set_info_guest(v, compat ? 
0 : c.nat->gs_base_kernel); + + if ( is_hvm_vcpu(v) || v->is_initialised ) + goto out; + + cr3_gfn = xen_cr3_to_pfn(c.nat->ctrlreg[3]); + cr3_page = get_page_from_gfn(d, cr3_gfn, NULL, P2M_ALLOC); + + v->arch.cr3 = page_to_maddr(cr3_page); + v->arch.hvm_vcpu.guest_cr[3] = c.nat->ctrlreg[3]; + v->arch.guest_table = pagetable_from_page(cr3_page); + + ASSERT(paging_mode_enabled(d)); + + goto pvh_skip_pv_stuff; } init_int80_direct_trap(v); @@ -934,6 +959,7 @@ int arch_set_info_guest( clear_bit(_VPF_in_reset, &v->pause_flags); + pvh_skip_pv_stuff: if ( v->vcpu_id == 0 ) update_domain_wallclock_time(d); diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c index fdb560e..94e9e21 100644 --- a/xen/arch/x86/hvm/vmx/vmx.c +++ b/xen/arch/x86/hvm/vmx/vmx.c @@ -1401,7 +1401,7 @@ static void vmx_set_uc_mode(struct vcpu *v) hvm_asid_flush_vcpu(v); } -static void vmx_set_info_guest(struct vcpu *v) +static void vmx_set_info_guest(struct vcpu *v, uint64_t gs_base_kernel) { unsigned long intr_shadow; @@ -1426,6 +1426,11 @@ static void vmx_set_info_guest(struct vcpu *v) __vmwrite(GUEST_INTERRUPTIBILITY_INFO, intr_shadow); } + /* PVH 32bitfixme */ + if ( is_pvh_vcpu(v) ) + __vmwrite(GUEST_GS_BASE, gs_base_kernel); + + vmx_vmcs_exit(v); } diff --git a/xen/include/asm-x86/hvm/hvm.h b/xen/include/asm-x86/hvm/hvm.h index 3376418..d6bfcf2 100644 --- a/xen/include/asm-x86/hvm/hvm.h +++ b/xen/include/asm-x86/hvm/hvm.h @@ -157,7 +157,7 @@ struct hvm_function_table { int (*msr_write_intercept)(unsigned int msr, uint64_t msr_content); void (*invlpg_intercept)(unsigned long vaddr); void (*set_uc_mode)(struct vcpu *v); - void (*set_info_guest)(struct vcpu *v); + void (*set_info_guest)(struct vcpu *v, uint64_t gs_base_kernel); void (*set_rdtsc_exiting)(struct vcpu *v, bool_t); /* Nested HVM */ @@ -431,10 +431,10 @@ void *hvm_map_guest_frame_rw(unsigned long gfn, bool_t permanent); void *hvm_map_guest_frame_ro(unsigned long gfn, bool_t permanent); void hvm_unmap_guest_frame(void *p, 
bool_t permanent); -static inline void hvm_set_info_guest(struct vcpu *v) +static inline void hvm_set_info_guest(struct vcpu *v, uint64_t gs_base_kernel) { if ( hvm_funcs.set_info_guest ) - return hvm_funcs.set_info_guest(v); + return hvm_funcs.set_info_guest(v, gs_base_kernel); } int hvm_debug_op(struct vcpu *v, int32_t op); diff --git a/xen/include/public/arch-x86/xen.h b/xen/include/public/arch-x86/xen.h index 908ef87..42b818e 100644 --- a/xen/include/public/arch-x86/xen.h +++ b/xen/include/public/arch-x86/xen.h @@ -154,6 +154,17 @@ typedef uint64_t tsc_timestamp_t; /* RDTSC timestamp */ /* * The following is all CPU context. Note that the fpu_ctxt block is filled * in by FXSAVE if the CPU has feature FXSR; otherwise FSAVE is used. + * + * Also note that when calling DOMCTL_setvcpucontext and VCPU_initialise + * for HVM and PVH guests, not all information in this structure is updated: + * + * - For HVM guests, the structures read include: fpu_ctxt (if + * VGCF_I387_VALID is set), flags, user_regs, debugreg[*] + * + * - PVH guests are the same as HVM guests, but additionally set cr3, + * and for 64-bit guests, gs_base_kernel. Additionally, the following + * entries must be 0: ctrlreg[1], ldt_base, ldt_ents, user_regs.{cs, + * ss, es, ds, fs, gs}, gdt_ents, fs_base, and gs_base_user. */ struct vcpu_guest_context { /* FPU registers come first so they can be aligned for FXSAVE/FXRSTOR. */ -- 1.7.9.5
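The "must be zero" rule that patch 11 applies to a PVH vcpu context can be shown with a cut-down structure. This is a sketch only: struct demo_vcpu_ctx is a hypothetical flattening of the handful of vcpu_guest_context fields the check looks at, and demo_check_pvh_ctx() is not a Xen function.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical subset of vcpu_guest_context, flattened for brevity. */
struct demo_vcpu_ctx {
    /* Fields a PVH guest must leave at zero. */
    uint64_t ctrlreg1, ldt_base, ldt_ents, gdt_ents, fs_base, gs_base_user;
    uint16_t cs, ss, es, ds, fs, gs;
    /* Fields a PVH guest is allowed to set. */
    uint64_t ctrlreg3, gs_base_kernel;
};

/* Returns 0 if the context is acceptable for PVH, -EINVAL otherwise. */
static int demo_check_pvh_ctx(const struct demo_vcpu_ctx *c)
{
    if (c->ctrlreg1 || c->ldt_base || c->ldt_ents ||
        c->cs || c->ss || c->es || c->ds || c->fs || c->gs ||
        c->gdt_ents || c->fs_base || c->gs_base_user)
        return -EINVAL;

    return 0;
}
```

Rejecting non-zero values up front, rather than silently ignoring them, leaves room to give those fields a meaning later without breaking existing callers.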
George Dunlap
2013-Nov-04 12:15 UTC
[PATCH v14 12/17] pvh: Use PV handlers for cpuid, and IO
For cpuid, this means putting hooks into the vmexit handler to call pv_cpuid() instead of the HVM handler. For IO, this now means putting a hook into the emulation code to call the PV guest_io_{read,write} functions. NB at this point this won't do the full "copy and execute on the stack with full GPRs" work-around; this may need to be sorted out for dom0 to allow these instructions to happen in guest context. Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com> Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com> --- v14-hvmio: - Use HVM path to emulate IO, calling into PV just for the final IO - Don't support forced invalid ops any more (so we can avoid PV emulation altogether) v13: - Remove unnecessary privilege check in PIO path, update related comment - Move ? and : to end of line rather than beginning, as requested CC: Jan Beulich <jbeulich@suse.com> CC: Tim Deegan <tim@xen.org> CC: Keir Fraser <keir@xen.org> --- xen/arch/x86/hvm/emulate.c | 75 ++++++++++++++++++++++++++++++++++----- xen/arch/x86/hvm/vmx/vmx.c | 3 +- xen/arch/x86/traps.c | 6 ++-- xen/include/asm-x86/processor.h | 2 ++ xen/include/asm-x86/traps.h | 8 +++++ 5 files changed, 81 insertions(+), 13 deletions(-) diff --git a/xen/arch/x86/hvm/emulate.c b/xen/arch/x86/hvm/emulate.c index a41eaa1..0d767c2 100644 --- a/xen/arch/x86/hvm/emulate.c +++ b/xen/arch/x86/hvm/emulate.c @@ -16,14 +16,14 @@ #include <xen/paging.h> #include <xen/trace.h> #include <asm/event.h> +#include <asm/traps.h> #include <asm/xstate.h> #include <asm/hvm/emulate.h> #include <asm/hvm/hvm.h> #include <asm/hvm/trace.h> #include <asm/hvm/support.h> -static void hvmtrace_io_assist(int is_mmio, ioreq_t *p) -{ +static void trace_io_assist(int is_mmio, int dir, int data_valid, paddr_t addr, unsigned int data) { unsigned int size, event; unsigned char buffer[12]; @@ -31,22 +31,28 @@ static void hvmtrace_io_assist(int is_mmio, ioreq_t *p) return; if ( is_mmio ) - event = p->dir ? TRC_HVM_IOMEM_READ : TRC_HVM_IOMEM_WRITE; + event = dir ? 
TRC_HVM_IOMEM_READ : TRC_HVM_IOMEM_WRITE; else - event = p->dir ? TRC_HVM_IOPORT_READ : TRC_HVM_IOPORT_WRITE; + event = dir ? TRC_HVM_IOPORT_READ : TRC_HVM_IOPORT_WRITE; - *(uint64_t *)buffer = p->addr; - size = (p->addr != (u32)p->addr) ? 8 : 4; + *(uint64_t *)buffer = addr; + size = (addr != (u32)addr) ? 8 : 4; if ( size == 8 ) event |= TRC_64_FLAG; - if ( !p->data_is_ptr ) + if ( data_valid ) { - *(uint32_t *)&buffer[size] = p->data; + *(uint32_t *)&buffer[size] = data; size += 4; } trace_var(event, 0/*!cycles*/, size, buffer); + +} + +static void hvmtrace_io_assist(int is_mmio, ioreq_t *p) +{ + trace_io_assist(is_mmio, p->dir, !p->data_is_ptr, p->addr, p->data); } static int hvmemul_do_io( @@ -140,6 +146,9 @@ static int hvmemul_do_io( } } + if ( is_pvh_vcpu(curr) ) + ASSERT(vio->io_state == HVMIO_none); + switch ( vio->io_state ) { case HVMIO_none: @@ -284,11 +293,59 @@ static int hvmemul_do_io( return X86EMUL_OKAY; } +static int pvhemul_do_pio( + unsigned long port, int size, paddr_t ram_gpa, int dir, void *p_data) +{ + paddr_t value = ram_gpa; + struct vcpu *curr = current; + struct cpu_user_regs *regs = guest_cpu_user_regs(); + + /* + * Weird-sized accesses have undefined behaviour: we discard writes + * and read all-ones. 
+ */ + if ( unlikely((size > sizeof(long)) || (size & (size - 1))) ) + { + gdprintk(XENLOG_WARNING, "bad mmio size %d\n", size); + ASSERT(p_data != NULL); /* cannot happen with a REP prefix */ + if ( dir == IOREQ_READ ) + memset(p_data, ~0, size); + return X86EMUL_UNHANDLEABLE; + } + + if ( dir == IOREQ_WRITE ) { + if ( (p_data != NULL) ) + { + memcpy(&value, p_data, size); + p_data = NULL; + } + + if ( dir == IOREQ_WRITE ) + trace_io_assist(0, dir, 1, port, value); + + guest_io_write(port, size, value, curr, regs); + } + else + { + value = guest_io_read(port, size, curr, regs); + trace_io_assist(0, dir, 1, port, value); + if ( (p_data != NULL) ) + memcpy(p_data, &value, size); + memcpy(&regs->eax, &value, size); + } + + return X86EMUL_OKAY; +} + + int hvmemul_do_pio( unsigned long port, unsigned long *reps, int size, paddr_t ram_gpa, int dir, int df, void *p_data) { - return hvmemul_do_io(0, port, reps, size, ram_gpa, dir, df, p_data); + return is_hvm_vcpu(current) ? + hvmemul_do_io(0, port, reps, size, ram_gpa, dir, df, p_data) : + pvhemul_do_pio(port, size, ram_gpa, dir, p_data); + } static int hvmemul_do_mmio( diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c index 94e9e21..5d1e367 100644 --- a/xen/arch/x86/hvm/vmx/vmx.c +++ b/xen/arch/x86/hvm/vmx/vmx.c @@ -56,6 +56,7 @@ #include <asm/apic.h> #include <asm/hvm/nestedhvm.h> #include <asm/event.h> +#include <asm/traps.h> enum handler_return { HNDL_done, HNDL_unhandled, HNDL_exception_raised }; @@ -2694,8 +2695,8 @@ void vmx_vmexit_handler(struct cpu_user_regs *regs) break; } case EXIT_REASON_CPUID: + is_pvh_vcpu(v) ? 
pv_cpuid(regs) : vmx_do_cpuid(regs); update_guest_eip(); /* Safe: CPUID */ - vmx_do_cpuid(regs); break; case EXIT_REASON_HLT: update_guest_eip(); /* Safe: HLT */ diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c index edb7a6a..6c278bc 100644 --- a/xen/arch/x86/traps.c +++ b/xen/arch/x86/traps.c @@ -729,7 +729,7 @@ int cpuid_hypervisor_leaves( uint32_t idx, uint32_t sub_idx, return 1; } -static void pv_cpuid(struct cpu_user_regs *regs) +void pv_cpuid(struct cpu_user_regs *regs) { uint32_t a, b, c, d; @@ -1681,7 +1681,7 @@ static int pci_cfg_ok(struct domain *d, int write, int size) return 1; } -static uint32_t guest_io_read( +uint32_t guest_io_read( unsigned int port, unsigned int bytes, struct vcpu *v, struct cpu_user_regs *regs) { @@ -1748,7 +1748,7 @@ static uint32_t guest_io_read( return data; } -static void guest_io_write( +void guest_io_write( unsigned int port, unsigned int bytes, uint32_t data, struct vcpu *v, struct cpu_user_regs *regs) { diff --git a/xen/include/asm-x86/processor.h b/xen/include/asm-x86/processor.h index 893afa3..551036d 100644 --- a/xen/include/asm-x86/processor.h +++ b/xen/include/asm-x86/processor.h @@ -567,6 +567,8 @@ void microcode_set_module(unsigned int); int microcode_update(XEN_GUEST_HANDLE_PARAM(const_void), unsigned long len); int microcode_resume_cpu(int cpu); +void pv_cpuid(struct cpu_user_regs *regs); + #endif /* !__ASSEMBLY__ */ #endif /* __ASM_X86_PROCESSOR_H */ diff --git a/xen/include/asm-x86/traps.h b/xen/include/asm-x86/traps.h index 82cbcee..a26b318 100644 --- a/xen/include/asm-x86/traps.h +++ b/xen/include/asm-x86/traps.h @@ -49,4 +49,12 @@ extern int guest_has_trap_callback(struct domain *d, uint16_t vcpuid, extern int send_guest_trap(struct domain *d, uint16_t vcpuid, unsigned int trap_nr); +uint32_t guest_io_read( + unsigned int port, unsigned int bytes, + struct vcpu *v, struct cpu_user_regs *regs); +void guest_io_write( + unsigned int port, unsigned int bytes, uint32_t data, + struct vcpu *v, struct 
cpu_user_regs *regs); + + #endif /* ASM_TRAP_H */ -- 1.7.9.5
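The "weird-sized access" test in pvhemul_do_pio() above relies on the usual power-of-two trick, size & (size - 1), combined with an upper bound of sizeof(long). A standalone sketch of that predicate is below. The function name is invented, and the explicit rejection of size 0 is an assumption added here for the sketch; the expression in the patch does not single out zero.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper: accepts only access sizes that are a power of
 * two and no wider than a long. A power of two has exactly one bit
 * set, so clearing its lowest set bit with (size & (size - 1)) leaves
 * zero; any other non-zero value leaves at least one bit behind.
 */
static bool demo_pio_size_ok(unsigned int size)
{
    if (size == 0)              /* assumption: treat 0 as invalid */
        return false;
    if (size > sizeof(long))
        return false;

    return (size & (size - 1)) == 0;
}
```

In the patch, an access that fails this test reads back as all-ones, writes are discarded, and the emulator returns X86EMUL_UNHANDLEABLE.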
George Dunlap
2013-Nov-04 12:15 UTC
[PATCH v14 13/17] pvh: Disable 32-bit guest support for now
To be implemented. Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com> Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com> --- Acked-by: Jan Beulich <jbeulich@suse.com> CC: Jan Beulich <jbeulich@suse.com> CC: Tim Deegan <tim@xen.org> CC: Keir Fraser <keir@xen.org> --- xen/arch/x86/domain.c | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c index c80ef4c..cafb4b8 100644 --- a/xen/arch/x86/domain.c +++ b/xen/arch/x86/domain.c @@ -339,6 +339,14 @@ int switch_compat(struct domain *d) if ( d == NULL ) return -EINVAL; + + if ( is_pvh_domain(d) ) + { + printk(XENLOG_INFO + "Xen currently does not support 32bit PVH guests\n"); + return -EINVAL; + } + if ( !may_switch_mode(d) ) return -EACCES; if ( is_pv_32on64_domain(d) ) -- 1.7.9.5
George Dunlap
2013-Nov-04 12:15 UTC
[PATCH v14 14/17] pvh: Restrict tsc_mode to NEVER_EMULATE for now
To be implemented. Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com> Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com> --- v13: - Only print a warning if tsc_mode != TSC_MODE_DEFAULT CC: Jan Beulich <jbeulich@suse.com> CC: Tim Deegan <tim@xen.org> CC: Keir Fraser <keir@xen.org> --- xen/arch/x86/time.c | 16 ++++++++++++++++ 1 file changed, 16 insertions(+) diff --git a/xen/arch/x86/time.c b/xen/arch/x86/time.c index c1bbd50..087b301 100644 --- a/xen/arch/x86/time.c +++ b/xen/arch/x86/time.c @@ -1827,6 +1827,22 @@ void tsc_set_info(struct domain *d, d->arch.vtsc = 0; return; } + if ( is_pvh_domain(d) ) + { + /* PVH fixme: support more tsc modes. */ + switch ( tsc_mode ) + { + case TSC_MODE_NEVER_EMULATE: + break; + default: + printk(XENLOG_WARNING + "PVH currently does not support tsc emulation. Setting timer_mode = never_emulate\n"); + /* FALLTHRU */ + case TSC_MODE_DEFAULT: + tsc_mode = TSC_MODE_NEVER_EMULATE; + break; + } + } switch ( d->arch.tsc_mode = tsc_mode ) { -- 1.7.9.5
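The switch in the tsc_set_info() hunk above uses a deliberate fall-through so that unsupported modes warn and then share the "force never_emulate" arm with the default mode. Here is that control flow on its own, as a sketch: the enum and function names are invented, and a flag stands in for the printk() so the behaviour can be observed.

```c
/* Hypothetical stand-ins for the TSC_MODE_* constants. */
enum demo_tsc_mode {
    DEMO_TSC_DEFAULT,
    DEMO_TSC_ALWAYS_EMULATE,
    DEMO_TSC_NEVER_EMULATE,
    DEMO_TSC_PVRDTSCP
};

/*
 * Returns the mode a PVH domain would end up with. *warned is set to 1
 * only when an explicitly requested, unsupported mode was overridden.
 */
static enum demo_tsc_mode demo_pvh_tsc_mode(enum demo_tsc_mode requested,
                                            int *warned)
{
    *warned = 0;

    switch (requested) {
    case DEMO_TSC_NEVER_EMULATE:
        break;                  /* already the only supported mode */
    default:
        *warned = 1;            /* the patch prints a warning here */
        /* fall through */
    case DEMO_TSC_DEFAULT:
        requested = DEMO_TSC_NEVER_EMULATE;
        break;
    }

    return requested;
}
```

Placing default before the DEFAULT case is what lets the quiet and the noisy path share one assignment.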
--- v14 - Update interface description - Update list of outstanding fixmes Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com> Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com> CC: Jan Beulich <jan.beulich@suse.com> CC: Tim Deegan <tim@xen.org> CC: Keir Fraser <keir@xen.org> --- docs/misc/pvh-readme.txt | 60 ++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 60 insertions(+) create mode 100644 docs/misc/pvh-readme.txt diff --git a/docs/misc/pvh-readme.txt b/docs/misc/pvh-readme.txt new file mode 100644 index 0000000..8913d1e --- /dev/null +++ b/docs/misc/pvh-readme.txt @@ -0,0 +1,60 @@ + +PVH : an x86 PV guest running in an HVM container. + +See: http://blog.xen.org/index.php/2012/10/23/the-paravirtualization-spectrum-part-1-the-ends-of-the-spectrum/ + +At the moment HAP is required for PVH. + +At present the only PVH guest is an x86 64bit PV linux. Patches are at: + git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen.git + +A PVH guest kernel must support the following features, as defined for linux +in arch/x86/xen/xen-head.S: + + #define FEATURES_PVH "|writable_descriptor_tables" \ + "|auto_translated_physmap" \ + "|supervisor_mode_kernel" \ + "|hvm_callback_vector" + +In a nutshell: +* the guest uses auto translate: + - p2m is managed by xen + - pagetables are owned by the guest + - mmu_update hypercall not available +* it uses event callback and not vlapic emulation, +* IDT is native, so set_trap_table hcall is also N/A for a PVH guest. + +For a full list of hcalls supported for PVH, see pvh_hypercall64_table +in arch/x86/hvm/hvm.c in xen. From the ABI perspective, it's mostly a +PV guest with auto translate, although it does use hvm_op for setting +callback vector, and has a special version of arch_set_info_guest for bringing +up secondary cpus. + +The initial phase targets the booting of a 64bit UP/SMP linux guest in PVH +mode. This is done by adding: pvh=1 in the config file. xl, and not xm, is +supported. 
Phase I patches are broken into three parts: + - xen changes for booting of 64bit PVH guest + - tools changes for creating a PVH guest + - boot of 64bit dom0 in PVH mode. + +Following fixme's exist in the code: + - arch/x86/time.c: support more tsc modes. + - implement arch_get_info_guest() for pvh. + +Following remain to be done for PVH: + - Investigate what else needs to be done for VMI support. + - AMD port. + - 32bit PVH guest support in both linux and xen. Xen changes are tagged + "32bitfixme". + - Add support for monitoring guest behavior. See hvm_memory_event* functions + in hvm.c + - vcpu hotplug support + - Live migration of PVH guests. + - Avail PVH dom0 of posted interrupts. (This will be a big win). + + +Note, any emails to me must be cc'd to xen devel mailing list. OTOH, please +cc me on PVH emails to the xen devel mailing list. + +Mukesh Rathor +mukesh.rathor [at] oracle [dot] com -- 1.7.9.5
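The readme above lists the features a PVH kernel has to advertise as a '|'-separated string. As a rough illustration of how such a string can be checked for a given feature name, here is a small matcher. It is not the parser the toolstack uses (that is elf_xen_parse_features()); demo_has_feature() is a made-up name and only does exact, whole-token comparison.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: returns true if 'name' appears as a complete
 * '|'-separated token in 'list'. Empty tokens, such as the one produced
 * by a leading '|', never match a non-empty name.
 */
static bool demo_has_feature(const char *list, const char *name)
{
    size_t len = strlen(name);
    const char *p = list;

    while (*p != '\0') {
        const char *end = strchr(p, '|');
        size_t tok = end ? (size_t)(end - p) : strlen(p);

        /* Compare lengths first so prefixes do not count as matches. */
        if (tok == len && strncmp(p, name, len) == 0)
            return true;

        if (end == NULL)
            break;
        p = end + 1;
    }

    return false;
}
```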
George Dunlap
2013-Nov-04 12:15 UTC
[PATCH v14 16/17] PVH xen tools: libxc changes to build a PVH guest.
From: Mukesh Rathor <mukesh.rathor@oracle.com> Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com> CC: Ian Jackson <ian.jackson@citrix.com> CC: Ian Campbell <ian.campbell@citrix.com> --- tools/libxc/xc_dom.h | 1 + tools/libxc/xc_dom_core.c | 9 +++++ tools/libxc/xc_dom_x86.c | 90 +++++++++++++++++++++++++++++---------------- 3 files changed, 69 insertions(+), 31 deletions(-) diff --git a/tools/libxc/xc_dom.h b/tools/libxc/xc_dom.h index 935b49e..90679da 100644 --- a/tools/libxc/xc_dom.h +++ b/tools/libxc/xc_dom.h @@ -127,6 +127,7 @@ struct xc_dom_image { domid_t console_domid; domid_t xenstore_domid; xen_pfn_t shared_info_mfn; + int pvh_enabled; xc_interface *xch; domid_t guest_domid; diff --git a/tools/libxc/xc_dom_core.c b/tools/libxc/xc_dom_core.c index 3bf51ef..9355fe8 100644 --- a/tools/libxc/xc_dom_core.c +++ b/tools/libxc/xc_dom_core.c @@ -766,6 +766,15 @@ int xc_dom_parse_image(struct xc_dom_image *dom) goto err; } + if ( dom->pvh_enabled ) + { + const char *pvh_features = "writable_descriptor_tables|" + "auto_translated_physmap|" + "supervisor_mode_kernel|" + "hvm_callback_vector"; + elf_xen_parse_features(pvh_features, dom->f_requested, NULL); + } + /* check features */ for ( i = 0; i < XENFEAT_NR_SUBMAPS; i++ ) { diff --git a/tools/libxc/xc_dom_x86.c b/tools/libxc/xc_dom_x86.c index 60fc544..a30d9cc 100644 --- a/tools/libxc/xc_dom_x86.c +++ b/tools/libxc/xc_dom_x86.c @@ -407,7 +407,8 @@ static int setup_pgtables_x86_64(struct xc_dom_image *dom) pgpfn = (addr - dom->parms.virt_base) >> PAGE_SHIFT_X86; l1tab[l1off] = pfn_to_paddr(xc_dom_p2m_guest(dom, pgpfn)) | L1_PROT; - if ( (addr >= dom->pgtables_seg.vstart) && + if ( (!dom->pvh_enabled) && + (addr >= dom->pgtables_seg.vstart) && (addr < dom->pgtables_seg.vend) ) l1tab[l1off] &= ~_PAGE_RW; /* page tables are r/o */ @@ -588,6 +589,13 @@ static int vcpu_x86_32(struct xc_dom_image *dom, void *ptr) DOMPRINTF_CALLED(dom->xch); + if ( dom->pvh_enabled ) + { + xc_dom_panic(dom->xch, XC_INTERNAL_ERROR, + "%s: 
PVH not supported for 32bit guests.", __FUNCTION__); + return -1; + } + /* clear everything */ memset(ctxt, 0, sizeof(*ctxt)); @@ -630,12 +638,6 @@ static int vcpu_x86_64(struct xc_dom_image *dom, void *ptr) /* clear everything */ memset(ctxt, 0, sizeof(*ctxt)); - ctxt->user_regs.ds = FLAT_KERNEL_DS_X86_64; - ctxt->user_regs.es = FLAT_KERNEL_DS_X86_64; - ctxt->user_regs.fs = FLAT_KERNEL_DS_X86_64; - ctxt->user_regs.gs = FLAT_KERNEL_DS_X86_64; - ctxt->user_regs.ss = FLAT_KERNEL_SS_X86_64; - ctxt->user_regs.cs = FLAT_KERNEL_CS_X86_64; ctxt->user_regs.rip = dom->parms.virt_entry; ctxt->user_regs.rsp = dom->parms.virt_base + (dom->bootstack_pfn + 1) * PAGE_SIZE_X86; @@ -643,15 +645,25 @@ static int vcpu_x86_64(struct xc_dom_image *dom, void *ptr) dom->parms.virt_base + (dom->start_info_pfn) * PAGE_SIZE_X86; ctxt->user_regs.rflags = 1 << 9; /* Interrupt Enable */ - ctxt->kernel_ss = ctxt->user_regs.ss; - ctxt->kernel_sp = ctxt->user_regs.esp; - ctxt->flags = VGCF_in_kernel_X86_64 | VGCF_online_X86_64; cr3_pfn = xc_dom_p2m_guest(dom, dom->pgtables_seg.pfn); ctxt->ctrlreg[3] = xen_pfn_to_cr3_x86_64(cr3_pfn); DOMPRINTF("%s: cr3: pfn 0x%" PRIpfn " mfn 0x%" PRIpfn "", __FUNCTION__, dom->pgtables_seg.pfn, cr3_pfn); + if ( dom->pvh_enabled ) + return 0; + + ctxt->user_regs.ds = FLAT_KERNEL_DS_X86_64; + ctxt->user_regs.es = FLAT_KERNEL_DS_X86_64; + ctxt->user_regs.fs = FLAT_KERNEL_DS_X86_64; + ctxt->user_regs.gs = FLAT_KERNEL_DS_X86_64; + ctxt->user_regs.ss = FLAT_KERNEL_SS_X86_64; + ctxt->user_regs.cs = FLAT_KERNEL_CS_X86_64; + + ctxt->kernel_ss = ctxt->user_regs.ss; + ctxt->kernel_sp = ctxt->user_regs.esp; + return 0; } @@ -752,7 +764,7 @@ int arch_setup_meminit(struct xc_dom_image *dom) rc = x86_compat(dom->xch, dom->guest_domid, dom->guest_type); if ( rc ) return rc; - if ( xc_dom_feature_translated(dom) ) + if ( xc_dom_feature_translated(dom) && !dom->pvh_enabled ) { dom->shadow_enabled = 1; rc = x86_shadow(dom->xch, dom->guest_domid); @@ -828,6 +840,38 @@ int 
arch_setup_bootearly(struct xc_dom_image *dom) return 0; } +/* + * Map grant table frames into guest physmap. PVH manages grant during boot + * via HVM mechanisms. + */ +static int map_grant_table_frames(struct xc_dom_image *dom) +{ + int i, rc; + + if ( dom->pvh_enabled ) + return 0; + + for ( i = 0; ; i++ ) + { + rc = xc_domain_add_to_physmap(dom->xch, dom->guest_domid, + XENMAPSPACE_grant_table, + i, dom->total_pages + i); + if ( rc != 0 ) + { + if ( (i > 0) && (errno == EINVAL) ) + { + DOMPRINTF("%s: %d grant tables mapped", __FUNCTION__, i); + break; + } + xc_dom_panic(dom->xch, XC_INTERNAL_ERROR, + "%s: mapping grant tables failed " "(pfn=0x%" PRIpfn + ", rc=%d)", __FUNCTION__, dom->total_pages + i, rc); + return rc; + } + } + return 0; +} + int arch_setup_bootlate(struct xc_dom_image *dom) { static const struct { @@ -866,7 +910,6 @@ int arch_setup_bootlate(struct xc_dom_image *dom) else { /* paravirtualized guest with auto-translation */ - int i; /* Map shared info frame into guest physmap. */ rc = xc_domain_add_to_physmap(dom->xch, dom->guest_domid, @@ -880,25 +923,10 @@ int arch_setup_bootlate(struct xc_dom_image *dom) return rc; } - /* Map grant table frames into guest physmap. */ - for ( i = 0; ; i++ ) - { - rc = xc_domain_add_to_physmap(dom->xch, dom->guest_domid, - XENMAPSPACE_grant_table, - i, dom->total_pages + i); - if ( rc != 0 ) - { - if ( (i > 0) && (errno == EINVAL) ) - { - DOMPRINTF("%s: %d grant tables mapped", __FUNCTION__, i); - break; - } - xc_dom_panic(dom->xch, XC_INTERNAL_ERROR, - "%s: mapping grant tables failed " "(pfn=0x%" - PRIpfn ", rc=%d)", __FUNCTION__, dom->total_pages + i, rc); - return rc; - } - } + rc = map_grant_table_frames(dom); + if ( rc != 0 ) + return rc; + shinfo = dom->shared_info_pfn; } -- 1.7.9.5
George Dunlap
2013-Nov-04 12:15 UTC
[PATCH v14 17/17] PVH xen tools: libxl changes to create a PVH guest.
From: Mukesh Rathor <mukesh.rathor@oracle.com> Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com> Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com> --- v13 (gwd): - Added XEN_DOMCTL_CDF_pvh_guest flag CC: Ian Jackson <ian.jackson@citrix.com> CC: Ian Campbell <ian.campbell@citrix.com> --- docs/man/xl.cfg.pod.5 | 3 +++ tools/libxl/libxl.h | 6 ++++++ tools/libxl/libxl_create.c | 13 +++++++++++++ tools/libxl/libxl_dom.c | 3 +++ tools/libxl/libxl_internal.h | 1 + tools/libxl/libxl_types.idl | 1 + tools/libxl/libxl_x86.c | 4 +++- tools/libxl/xl_cmdimpl.c | 1 + 8 files changed, 31 insertions(+), 1 deletion(-) diff --git a/docs/man/xl.cfg.pod.5 b/docs/man/xl.cfg.pod.5 index 278dba1..e0b5290 100644 --- a/docs/man/xl.cfg.pod.5 +++ b/docs/man/xl.cfg.pod.5 @@ -653,6 +653,9 @@ if your particular guest kernel does not require this behaviour then it is safe to allow this to be enabled but you may wish to disable it anyway. +=item B<pvh=BOOLEAN> +Selects whether to run this PV guest in an HVM container. Default is 0. + =back =head2 Fully-virtualised (HVM) Guest Specific Options diff --git a/tools/libxl/libxl.h b/tools/libxl/libxl.h index 1c6675d..a214d77 100644 --- a/tools/libxl/libxl.h +++ b/tools/libxl/libxl.h @@ -369,6 +369,12 @@ */ #define LIBXL_HAVE_DOMAIN_CREATE_RESTORE_PARAMS 1 +/* + * LIBXL_HAVE_CREATEINFO_PVH + * If this is defined, then libxl supports creation of a PVH guest. + */ +#define LIBXL_HAVE_CREATEINFO_PVH 1 + /* Functions annotated with LIBXL_EXTERNAL_CALLERS_ONLY may not be * called from within libxl itself. Callers outside libxl, who * do not #include libxl_internal.h, are fine. 
*/ diff --git a/tools/libxl/libxl_create.c b/tools/libxl/libxl_create.c index 1b320d3..cc25329 100644 --- a/tools/libxl/libxl_create.c +++ b/tools/libxl/libxl_create.c @@ -33,6 +33,9 @@ int libxl__domain_create_info_setdefault(libxl__gc *gc, if (c_info->type == LIBXL_DOMAIN_TYPE_HVM) { libxl_defbool_setdefault(&c_info->hap, true); libxl_defbool_setdefault(&c_info->oos, true); + } else { + libxl_defbool_setdefault(&c_info->pvh, false); + libxl_defbool_setdefault(&c_info->hap, libxl_defbool_val(c_info->pvh)); } libxl_defbool_setdefault(&c_info->run_hotplug_scripts, true); @@ -352,6 +355,8 @@ int libxl__domain_build(libxl__gc *gc, break; case LIBXL_DOMAIN_TYPE_PV: + state->pvh_enabled = libxl_defbool_val(d_config->c_info.pvh); + ret = libxl__build_pv(gc, domid, info, state); if (ret) goto out; @@ -411,6 +416,14 @@ int libxl__domain_make(libxl__gc *gc, libxl_domain_create_info *info, flags |= XEN_DOMCTL_CDF_hvm_guest; flags |= libxl_defbool_val(info->hap) ? XEN_DOMCTL_CDF_hap : 0; flags |= libxl_defbool_val(info->oos) ? 
0 : XEN_DOMCTL_CDF_oos_off; + } else if ( libxl_defbool_val(info->pvh) ) { + flags |= XEN_DOMCTL_CDF_pvh_guest; + if ( !libxl_defbool_val(info->hap) ) { + LOG(ERROR, "HAP must be on for PVH"); + rc = ERROR_INVAL; + goto out; + } + flags |= XEN_DOMCTL_CDF_hap; } *domid = -1; diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c index 1812bdc..301d29b 100644 --- a/tools/libxl/libxl_dom.c +++ b/tools/libxl/libxl_dom.c @@ -348,7 +348,10 @@ int libxl__build_pv(libxl__gc *gc, uint32_t domid, return ERROR_FAIL; } + dom->pvh_enabled = state->pvh_enabled; + LOG(DEBUG, "pv kernel mapped %d path %s", state->pv_kernel.mapped, state->pv_kernel.path); + if (state->pv_kernel.mapped) { ret = xc_dom_kernel_mem(dom, state->pv_kernel.data, diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h index 4f92522..79bc9c7 100644 --- a/tools/libxl/libxl_internal.h +++ b/tools/libxl/libxl_internal.h @@ -886,6 +886,7 @@ typedef struct { libxl__file_reference pv_kernel; libxl__file_reference pv_ramdisk; const char * pv_cmdline; + bool pvh_enabled; } libxl__domain_build_state; _hidden int libxl__build_pre(libxl__gc *gc, uint32_t domid, diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl index 5c43d6f..a189db3 100644 --- a/tools/libxl/libxl_types.idl +++ b/tools/libxl/libxl_types.idl @@ -257,6 +257,7 @@ libxl_domain_create_info = Struct("domain_create_info",[ ("platformdata", libxl_key_value_list), ("poolid", uint32), ("run_hotplug_scripts",libxl_defbool), + ("pvh", libxl_defbool), ], dir=DIR_IN) libxl_domain_restore_params = Struct("domain_restore_params", [ diff --git a/tools/libxl/libxl_x86.c b/tools/libxl/libxl_x86.c index a78c91d..87a8110 100644 --- a/tools/libxl/libxl_x86.c +++ b/tools/libxl/libxl_x86.c @@ -290,7 +290,9 @@ int libxl__arch_domain_create(libxl__gc *gc, libxl_domain_config *d_config, if (rtc_timeoffset) xc_domain_set_time_offset(ctx->xch, domid, rtc_timeoffset); - if (d_config->b_info.type == LIBXL_DOMAIN_TYPE_HVM) { + if 
(d_config->b_info.type == LIBXL_DOMAIN_TYPE_HVM || + libxl_defbool_val(d_config->c_info.pvh)) { + unsigned long shadow; shadow = (d_config->b_info.shadow_memkb + 1023) / 1024; xc_shadow_control(ctx->xch, domid, XEN_DOMCTL_SHADOW_OP_SET_ALLOCATION, NULL, 0, &shadow, 0, NULL); diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c index 40feb7d..ba2b97f 100644 --- a/tools/libxl/xl_cmdimpl.c +++ b/tools/libxl/xl_cmdimpl.c @@ -642,6 +642,7 @@ static void parse_config_data(const char *config_source, !strncmp(buf, "hvm", strlen(buf))) c_info->type = LIBXL_DOMAIN_TYPE_HVM; + xlu_cfg_get_defbool(config, "pvh", &c_info->pvh, 0); xlu_cfg_get_defbool(config, "hap", &c_info->hap, 0); if (xlu_cfg_replace_string (config, "name", &c_info->name, 0)) { -- 1.7.9.5
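The flag selection in libxl__domain_make() above can be sketched roughly as follows. This is only an illustration: the SK_CDF_* bit values are made up for the example and are not the real XEN_DOMCTL_CDF_* constants, and sketch_make_flags() is a hypothetical name. It shows the rule the patch enforces, namely that a PVH guest is a non-HVM guest that must have HAP enabled.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flag bits; NOT the real XEN_DOMCTL_CDF_* values. */
#define SK_CDF_HVM_GUEST (1u << 0)
#define SK_CDF_HAP       (1u << 1)
#define SK_CDF_OOS_OFF   (1u << 2)
#define SK_CDF_PVH_GUEST (1u << 3)

/* Returns 0 and fills *flags, or -1 if PVH is requested without HAP. */
static int sketch_make_flags(bool is_hvm, bool pvh, bool hap, bool oos,
                             unsigned int *flags)
{
    unsigned int f = 0;

    if ( is_hvm )
    {
        f |= SK_CDF_HVM_GUEST;
        f |= hap ? SK_CDF_HAP : 0;
        f |= oos ? 0 : SK_CDF_OOS_OFF;
    }
    else if ( pvh )
    {
        /* Mirrors the "HAP must be on for PVH" error path. */
        if ( !hap )
            return -1;
        f |= SK_CDF_PVH_GUEST | SK_CDF_HAP;
    }

    *flags = f;
    return 0;
}
```

Because the setdefault hunk makes hap default to the value of pvh for non-HVM guests, a config with only pvh=1 should end up on the successful PVH branch; the error path is reached only when hap is explicitly turned off.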
Jan Beulich
2013-Nov-04 16:01 UTC
Re: [PATCH v14 01/17] Allow vmx_update_debug_state to be called when v!=current
>>> On 04.11.13 at 13:14, George Dunlap <george.dunlap@eu.citrix.com> wrote:
> Removing the assert allows the PVH code to call this during vmcs
> construction in a later patch, making the code more robust by removing
> duplicate code.
>
> Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>

While the patch looks okay to me, this as well as at least a few other ones I quickly peeked into need to be ack-ed by VMX maintainers, yet (despite having pointed this out before) you didn't even Cc them.

Jan

>>> On 04.11.13 at 13:14, George Dunlap <george.dunlap@eu.citrix.com> wrote:
> --- a/xen/arch/x86/hvm/hvm.c
> +++ b/xen/arch/x86/hvm/hvm.c
> @@ -522,27 +522,27 @@ int hvm_domain_initialise(struct domain *d)
>      spin_lock_init(&d->arch.hvm_domain.irq_lock);
>      spin_lock_init(&d->arch.hvm_domain.uc_lock);
>
> -    INIT_LIST_HEAD(&d->arch.hvm_domain.msixtbl_list);
> -    spin_lock_init(&d->arch.hvm_domain.msixtbl_list_lock);

While I can see the need for moving stuff so that it gets done earlier - why do these two lines need to be moved _down_? Even if PVH wasn't using the MSI-X support code HVM needs, I can't see them doing any harm.

Jan
George Dunlap
2013-Nov-04 16:18 UTC
Re: [PATCH v14 01/17] Allow vmx_update_debug_state to be called when v!=current
CC'ing the vmx maintainers...

On 04/11/13 12:14, George Dunlap wrote:
> Removing the assert allows the PVH code to call this during vmcs
> construction in a later patch, making the code more robust by removing
> duplicate code.
>
> Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>
> ----
> v13: Add vmx_vmcs_{enter,exit}
>
> CC: Mukesh Rathor <mukesh.rathor@oracle.com>
> CC: Jan Beulich <jbeulich@suse.com>
> CC: Tim Deegan <tim@xen.org>
> CC: Keir Fraser <keir@xen.org>
> ---
>  xen/arch/x86/hvm/vmx/vmx.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
> index 9ca8632..fdb560e 100644
> --- a/xen/arch/x86/hvm/vmx/vmx.c
> +++ b/xen/arch/x86/hvm/vmx/vmx.c
> @@ -1051,8 +1051,6 @@ void vmx_update_debug_state(struct vcpu *v)
>  {
>      unsigned long mask;
>
> -    ASSERT(v == current);
> -
>      mask = 1u << TRAP_int3;
>      if ( !cpu_has_monitor_trap_flag )
>          mask |= 1u << TRAP_debug;
> @@ -1061,7 +1059,10 @@ void vmx_update_debug_state(struct vcpu *v)
>          v->arch.hvm_vmx.exception_bitmap |= mask;
>      else
>          v->arch.hvm_vmx.exception_bitmap &= ~mask;
> +
> +    vmx_vmcs_enter(v);
>      vmx_update_exception_bitmap(v);
> +    vmx_vmcs_exit(v);
>  }
>
>  static void vmx_update_guest_cr(struct vcpu *v, unsigned int cr)
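As a small illustration of what vmx_update_debug_state() computes before it touches the VMCS, the mask arithmetic can be sketched on its own. The sketch assumes the standard x86 vector numbers (#DB is 1, #BP is 3); the function name and the "intercept" parameter are hypothetical, standing in for whatever condition selects the set-or-clear branch in the real code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SK_TRAP_debug 1  /* #DB vector */
#define SK_TRAP_int3  3  /* #BP vector */

/*
 * Returns the new exception bitmap. int3 is always part of the mask;
 * #DB is added only when the CPU lacks the monitor trap flag.
 */
static uint32_t sketch_update_debug_bitmap(uint32_t bitmap, bool has_mtf,
                                           bool intercept)
{
    uint32_t mask = 1u << SK_TRAP_int3;

    if ( !has_mtf )
        mask |= 1u << SK_TRAP_debug;

    return intercept ? (bitmap | mask) : (bitmap & ~mask);
}
```

The pure computation does not depend on which vcpu is current; only the final write of the bitmap into the VMCS does, which is presumably why the patch wraps that one call in vmx_vmcs_enter()/vmx_vmcs_exit() instead of keeping the ASSERT.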
CC''ing the VMX maintainers.. -George On 04/11/13 12:14, George Dunlap wrote:> Changes: > * Enforce HAP mode for now > * Disable exits related to virtual interrupts or emulated APICs > * Disable changing paging mode > - "unrestricted guest" (i.e., real mode for EPT) disabled > - write guest EFER disabled > * Start in 64-bit mode > * Force TSC mode to be "none" > * Paging mode update to happen in arch_set_info_guest > > Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com> > Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com> > --- > v14: > - Mask out bits of cr4 that the guest is not allowed to set > v13: > - Fix up default cr0 settings > - Get rid of some unnecessary PVH-related changes > - Return EOPNOTSUPP instead of ENOSYS if hardware features are not present > - Remove an unnecessary variable from pvh_check_requirements > CC: Jan Beulich <jbeulich@suse.com> > CC: Tim Deegan <tim@xen.org> > CC: Keir Fraser <keir@xen.org> > --- > xen/arch/x86/hvm/vmx/vmcs.c | 132 +++++++++++++++++++++++++++++++++++++++++-- > 1 file changed, 128 insertions(+), 4 deletions(-) > > diff --git a/xen/arch/x86/hvm/vmx/vmcs.c b/xen/arch/x86/hvm/vmx/vmcs.c > index f2a2857..ba05ebb 100644 > --- a/xen/arch/x86/hvm/vmx/vmcs.c > +++ b/xen/arch/x86/hvm/vmx/vmcs.c > @@ -28,6 +28,7 @@ > #include <asm/msr.h> > #include <asm/xstate.h> > #include <asm/hvm/hvm.h> > +#include <asm/hvm/nestedhvm.h> > #include <asm/hvm/io.h> > #include <asm/hvm/support.h> > #include <asm/hvm/vmx/vmx.h> > @@ -841,6 +842,60 @@ void virtual_vmcs_vmwrite(void *vvmcs, u32 vmcs_encoding, u64 val) > virtual_vmcs_exit(vvmcs); > } > > +static int pvh_check_requirements(struct vcpu *v) > +{ > + u64 required; > + > + /* Check for required hardware features */ > + if ( !cpu_has_vmx_ept ) > + { > + printk(XENLOG_G_INFO "PVH: CPU does not have EPT support\n"); > + return -EOPNOTSUPP; > + } > + if ( !cpu_has_vmx_pat ) > + { > + printk(XENLOG_G_INFO "PVH: CPU does not have PAT support\n"); > + return -EOPNOTSUPP; > + } > + 
if ( !cpu_has_vmx_msr_bitmap ) > + { > + printk(XENLOG_G_INFO "PVH: CPU does not have msr bitmap\n"); > + return -EOPNOTSUPP; > + } > + if ( !cpu_has_vmx_secondary_exec_control ) > + { > + printk(XENLOG_G_INFO "CPU Secondary exec is required to run PVH\n"); > + return -EOPNOTSUPP; > + } > + required = X86_CR4_PAE | X86_CR4_VMXE | X86_CR4_OSFXSR; > + if ( (real_cr4_to_pv_guest_cr4(mmu_cr4_features) & required) != required ) > + { > + printk(XENLOG_G_INFO "PVH: required CR4 features not available:%lx\n", > + required); > + return -EOPNOTSUPP; > + } > + > + /* Check for required configuration options */ > + if ( !paging_mode_hap(v->domain) ) > + { > + printk(XENLOG_G_INFO "HAP is required for PVH guest.\n"); > + return -EINVAL; > + } > + /* > + * If rdtsc exiting is turned on and it goes thru emulate_privileged_op, > + * then pv_vcpu.ctrlreg must be added to the pvh struct. > + */ > + if ( v->domain->arch.vtsc ) > + { > + printk(XENLOG_G_INFO > + "At present PVH only supports the default timer mode\n"); > + return -EINVAL; > + } > + > + > + return 0; > +} > + > static int construct_vmcs(struct vcpu *v) > { > struct domain *d = v->domain; > @@ -849,6 +904,13 @@ static int construct_vmcs(struct vcpu *v) > u32 vmexit_ctl = vmx_vmexit_control; > u32 vmentry_ctl = vmx_vmentry_control; > > + if ( is_pvh_domain(d) ) > + { > + int rc = pvh_check_requirements(v); > + if ( rc ) > + return rc; > + } > + > vmx_vmcs_enter(v); > > /* VMCS controls. 
*/ > @@ -887,7 +949,32 @@ static int construct_vmcs(struct vcpu *v) > /* Do not enable Monitor Trap Flag unless start single step debug */ > v->arch.hvm_vmx.exec_control &= ~CPU_BASED_MONITOR_TRAP_FLAG; > > + if ( is_pvh_domain(d) ) > + { > + /* Disable virtual apics, TPR */ > + v->arch.hvm_vmx.secondary_exec_control &= > + ~(SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES > + | SECONDARY_EXEC_APIC_REGISTER_VIRT > + | SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY); > + v->arch.hvm_vmx.exec_control &= ~CPU_BASED_TPR_SHADOW; > + > + /* Disable wbinvd (only necessary for MMIO), > + * unrestricted guest (real mode for EPT) */ > + v->arch.hvm_vmx.secondary_exec_control &= > + ~(SECONDARY_EXEC_UNRESTRICTED_GUEST > + | SECONDARY_EXEC_WBINVD_EXITING); > + > + /* Start in 64-bit mode. > + * PVH 32bitfixme. */ > + vmentry_ctl |= VM_ENTRY_IA32E_MODE; /* GUEST_EFER.LME/LMA ignored */ > + > + ASSERT(v->arch.hvm_vmx.exec_control & CPU_BASED_ACTIVATE_SECONDARY_CONTROLS); > + ASSERT(v->arch.hvm_vmx.exec_control & CPU_BASED_ACTIVATE_MSR_BITMAP); > + ASSERT(!(v->arch.hvm_vmx.exec_control & CPU_BASED_RDTSC_EXITING)); > + } > + > vmx_update_cpu_exec_control(v); > + > __vmwrite(VM_EXIT_CONTROLS, vmexit_ctl); > __vmwrite(VM_ENTRY_CONTROLS, vmentry_ctl); > > @@ -923,6 +1010,17 @@ static int construct_vmcs(struct vcpu *v) > vmx_disable_intercept_for_msr(v, MSR_IA32_SYSENTER_EIP, MSR_TYPE_R | MSR_TYPE_W); > if ( cpu_has_vmx_pat && paging_mode_hap(d) ) > vmx_disable_intercept_for_msr(v, MSR_IA32_CR_PAT, MSR_TYPE_R | MSR_TYPE_W); > + if ( is_pvh_domain(d) ) > + vmx_disable_intercept_for_msr(v, MSR_SHADOW_GS_BASE, MSR_TYPE_R | MSR_TYPE_W); > + > + /* > + * PVH: We don't disable intercepts for MSRs: MSR_STAR, MSR_LSTAR, > + * MSR_CSTAR, and MSR_SYSCALL_MASK because we need to specify > + * save/restore area to save/restore at every VM exit and entry. > + * Instead, let the intercept functions save them into > + * vmx_msr_state fields. See comment in vmx_restore_host_msrs(). 
> + * See also vmx_restore_guest_msrs(). > + */ > } > > /* I/O access bitmap. */ > @@ -1011,7 +1109,11 @@ static int construct_vmcs(struct vcpu *v) > __vmwrite(GUEST_DS_AR_BYTES, 0xc093); > __vmwrite(GUEST_FS_AR_BYTES, 0xc093); > __vmwrite(GUEST_GS_AR_BYTES, 0xc093); > - __vmwrite(GUEST_CS_AR_BYTES, 0xc09b); /* exec/read, accessed */ > + if ( is_pvh_domain(d) ) > + /* CS.L == 1, exec, read/write, accessed. PVH 32bitfixme. */ > + __vmwrite(GUEST_CS_AR_BYTES, 0xa09b); > + else > + __vmwrite(GUEST_CS_AR_BYTES, 0xc09b); /* exec/read, accessed */ > > /* Guest IDT. */ > __vmwrite(GUEST_IDTR_BASE, 0); > @@ -1041,12 +1143,29 @@ static int construct_vmcs(struct vcpu *v) > | (1U << TRAP_no_device); > vmx_update_exception_bitmap(v); > > + /* In HVM domains, this happens on the realmode->paging > + * transition. Since PVH never goes through this transition, we > + * need to do it at start-of-day. */ > + if ( is_pvh_domain(d) ) > + vmx_update_debug_state(v); > + > v->arch.hvm_vcpu.guest_cr[0] = X86_CR0_PE | X86_CR0_ET; > + > + /* PVH domains always start in paging mode */ > + if ( is_pvh_domain(d) ) > + v->arch.hvm_vcpu.guest_cr[0] |= X86_CR0_PG | X86_CR0_NE | X86_CR0_WP; > + > hvm_update_guest_cr(v, 0); > > - v->arch.hvm_vcpu.guest_cr[4] = 0; > + v->arch.hvm_vcpu.guest_cr[4] = is_pvh_domain(d) ? > + (real_cr4_to_pv_guest_cr4(mmu_cr4_features) > + & ~HVM_CR4_GUEST_RESERVED_BITS(v)) > + : 0; > hvm_update_guest_cr(v, 4); > > + if ( is_pvh_domain(d) ) > + v->arch.hvm_vmx.vmx_realmode = 0; > + > if ( cpu_has_vmx_tpr_shadow ) > { > __vmwrite(VIRTUAL_APIC_PAGE_ADDR, > @@ -1076,9 +1195,14 @@ static int construct_vmcs(struct vcpu *v) > > vmx_vmcs_exit(v); > > - paging_update_paging_modes(v); /* will update HOST & GUEST_CR3 as reqd */ > + /* PVH: paging mode is updated by arch_set_info_guest(). 
*/ > + if ( is_hvm_vcpu(v) ) > + { > + /* will update HOST & GUEST_CR3 as reqd */ > + paging_update_paging_modes(v); > > - vmx_vlapic_msr_changed(v); > + vmx_vlapic_msr_changed(v); > + } > > return 0; > }
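The two GUEST_CS_AR_BYTES values in the patch (0xc09b for HVM, 0xa09b for PVH) differ only in the L and D/B bits. A minimal decoder sketch makes that visible. It assumes the VMCS segment access-rights layout as I understand it from the Intel SDM (type in bits 3:0, S bit 4, DPL bits 6:5, P bit 7, AVL bit 12, L bit 13, D/B bit 14, G bit 15); the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Decoded fields of a VMCS segment access-rights value (sketch only). */
struct sk_seg_ar {
    unsigned int type, s, dpl, p, avl, l, db, g;
};

static struct sk_seg_ar sk_decode_ar(uint32_t ar)
{
    struct sk_seg_ar f;

    f.type = ar & 0xf;         /* segment type */
    f.s    = (ar >> 4) & 1;    /* 1 = code/data descriptor */
    f.dpl  = (ar >> 5) & 3;    /* descriptor privilege level */
    f.p    = (ar >> 7) & 1;    /* present */
    f.avl  = (ar >> 12) & 1;   /* available for software use */
    f.l    = (ar >> 13) & 1;   /* 64-bit code segment */
    f.db   = (ar >> 14) & 1;   /* default operation size */
    f.g    = (ar >> 15) & 1;   /* granularity */
    return f;
}
```

Decoding 0xa09b gives L=1 and D/B=0 (a 64-bit code segment), whereas 0xc09b gives L=0 and D/B=1 (a 32-bit one); the type, S, DPL, P and G fields are identical in both. That is consistent with the "Start in 64-bit mode" change elsewhere in the same patch.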
George Dunlap
2013-Nov-04 16:20 UTC
Re: [PATCH v14 11/17] pvh: Set up more PV stuff in set_info_guest
On 04/11/13 12:15, George Dunlap wrote:> Allow the guest to set up a few more things when bringing up a vcpu. > > This includes cr3 and gs_base. > > Also set up wallclock, and only initialize a vcpu once. > > Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com> > Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com> > --- > v14: > - Share more of the codepath, removing a potential bug that might happen > if paging functions are called with "is_initialised" set to zero. > - Put cr3 in v->arch.guest_table, so the ref counting happens properly. > This should fix the "zombie domains" problem. > v13: > - Get rid of separate pvh call, and fold gs_base write into hvm_set_info_guest > - Check pvh parameters for validity at the top of arch_set_info_guest > - Fix comment about PVH and set_info_guest > > CC: Jan Beulich <jbeulich@suse.com> > CC: Tim Deegan <tim@xen.org> > CC: Keir Fraser <keir@xen.org> > --- > xen/arch/x86/domain.c | 30 ++++++++++++++++++++++++++++-- > xen/arch/x86/hvm/vmx/vmx.c | 7 ++++++- > xen/include/asm-x86/hvm/hvm.h | 6 +++--- > xen/include/public/arch-x86/xen.h | 11 +++++++++++ > 4 files changed, 48 insertions(+), 6 deletions(-) > > diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c > index 8c2a57f..c80ef4c 100644 > --- a/xen/arch/x86/domain.c > +++ b/xen/arch/x86/domain.c > @@ -691,6 +691,18 @@ int arch_set_info_guest( > (c(ldt_ents) > 8192) ) > return -EINVAL; > } > + else if ( is_pvh_vcpu(v) ) > + { > + /* PVH 32bitfixme */ > + ASSERT(!compat); > + > + if ( c(ctrlreg[1]) || c(ldt_base) || c(ldt_ents) || > + c(user_regs.cs) || c(user_regs.ss) || c(user_regs.es) || > + c(user_regs.ds) || c(user_regs.fs) || c(user_regs.gs) || > + c.nat->gdt_ents || c.nat->fs_base || c.nat->gs_base_user ) > + return -EINVAL; > + > + } > > v->fpu_initialised = !!(flags & VGCF_I387_VALID); > > @@ -728,8 +740,21 @@ int arch_set_info_guest( > > if ( has_hvm_container_vcpu(v) ) > { > - hvm_set_info_guest(v); > - goto out; > + hvm_set_info_guest(v, compat ? 
0 : c.nat->gs_base_kernel); > + > + if ( is_hvm_vcpu(v) || v->is_initialised ) > + goto out; > + > + cr3_gfn = xen_cr3_to_pfn(c.nat->ctrlreg[3]); > + cr3_page = get_page_from_gfn(d, cr3_gfn, NULL, P2M_ALLOC); > + > + v->arch.cr3 = page_to_maddr(cr3_page); > + v->arch.hvm_vcpu.guest_cr[3] = c.nat->ctrlreg[3]; > + v->arch.guest_table = pagetable_from_page(cr3_page); > + > + ASSERT(paging_mode_enabled(d)); > + > + goto pvh_skip_pv_stuff; > } > > init_int80_direct_trap(v); > @@ -934,6 +959,7 @@ int arch_set_info_guest( > > clear_bit(_VPF_in_reset, &v->pause_flags); > > + pvh_skip_pv_stuff: > if ( v->vcpu_id == 0 ) > update_domain_wallclock_time(d); > > diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c > index fdb560e..94e9e21 100644 > --- a/xen/arch/x86/hvm/vmx/vmx.c > +++ b/xen/arch/x86/hvm/vmx/vmx.c > @@ -1401,7 +1401,7 @@ static void vmx_set_uc_mode(struct vcpu *v) > hvm_asid_flush_vcpu(v); > } > > -static void vmx_set_info_guest(struct vcpu *v) > +static void vmx_set_info_guest(struct vcpu *v, uint64_t gs_base_kernel) > { > unsigned long intr_shadow; > > @@ -1426,6 +1426,11 @@ static void vmx_set_info_guest(struct vcpu *v) > __vmwrite(GUEST_INTERRUPTIBILITY_INFO, intr_shadow); > } > > + /* PVH 32bitfixme */ > + if ( is_pvh_vcpu(v) ) > + __vmwrite(GUEST_GS_BASE, gs_base_kernel); > + > + > vmx_vmcs_exit(v); > } > > diff --git a/xen/include/asm-x86/hvm/hvm.h b/xen/include/asm-x86/hvm/hvm.h > index 3376418..d6bfcf2 100644 > --- a/xen/include/asm-x86/hvm/hvm.h > +++ b/xen/include/asm-x86/hvm/hvm.h > @@ -157,7 +157,7 @@ struct hvm_function_table { > int (*msr_write_intercept)(unsigned int msr, uint64_t msr_content); > void (*invlpg_intercept)(unsigned long vaddr); > void (*set_uc_mode)(struct vcpu *v); > - void (*set_info_guest)(struct vcpu *v); > + void (*set_info_guest)(struct vcpu *v, uint64_t gs_base_kernel); > void (*set_rdtsc_exiting)(struct vcpu *v, bool_t); > > /* Nested HVM */ > @@ -431,10 +431,10 @@ void *hvm_map_guest_frame_rw(unsigned long 
gfn, bool_t permanent); > void *hvm_map_guest_frame_ro(unsigned long gfn, bool_t permanent); > void hvm_unmap_guest_frame(void *p, bool_t permanent); > > -static inline void hvm_set_info_guest(struct vcpu *v) > +static inline void hvm_set_info_guest(struct vcpu *v, uint64_t gs_base_kernel) > { > if ( hvm_funcs.set_info_guest ) > - return hvm_funcs.set_info_guest(v); > + return hvm_funcs.set_info_guest(v, gs_base_kernel); > } > > int hvm_debug_op(struct vcpu *v, int32_t op); > diff --git a/xen/include/public/arch-x86/xen.h b/xen/include/public/arch-x86/xen.h > index 908ef87..42b818e 100644 > --- a/xen/include/public/arch-x86/xen.h > +++ b/xen/include/public/arch-x86/xen.h > @@ -154,6 +154,17 @@ typedef uint64_t tsc_timestamp_t; /* RDTSC timestamp */ > /* > * The following is all CPU context. Note that the fpu_ctxt block is filled > * in by FXSAVE if the CPU has feature FXSR; otherwise FSAVE is used. > + * > + * Also note that when calling DOMCTL_setvcpucontext and VCPU_initialise > + * for HVM and PVH guests, not all information in this structure is updated: > + * > + * - For HVM guests, the structures read include: fpu_ctxt (if > + * VGCT_I387_VALID is set), flags, user_regs, debugreg[*] > + * > + * - PVH guests are the same as HVM guests, but additionally set cr3, > + * and for 64-bit guests, gs_base_kernel. Additionally, the following > + * entries must be 0: ctrlreg[1], ldt_base, ldg_ents, user_regs.{cs, > + * ss, es, ds, fs, gs), gdt_ents, fs_base, and gs_base_user. > */ > struct vcpu_guest_context { > /* FPU registers come first so they can be aligned for FXSAVE/FXRSTOR. */
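The "must be zero" rule for PVH in arch_set_info_guest() can be sketched as a standalone check. The struct below is a hypothetical, cut-down stand-in for vcpu_guest_context that holds only the fields named in the patch; it is not the real layout.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical subset of vcpu_guest_context, for illustration only. */
struct sk_pvh_ctx {
    uint64_t ctrlreg1, ldt_base, ldt_ents, gdt_ents, fs_base, gs_base_user;
    uint16_t cs, ss, es, ds, fs, gs;
    /* Fields a PVH guest is allowed to set: */
    uint64_t ctrlreg3, gs_base_kernel;
};

/* Returns 0 if the context is acceptable for PVH, -EINVAL otherwise. */
static int sk_pvh_ctx_check(const struct sk_pvh_ctx *c)
{
    if ( c->ctrlreg1 || c->ldt_base || c->ldt_ents ||
         c->cs || c->ss || c->es || c->ds || c->fs || c->gs ||
         c->gdt_ents || c->fs_base || c->gs_base_user )
        return -EINVAL;

    return 0;
}
```

A caller bringing up a vcpu would therefore zero the whole structure first and then fill in only cr3 and gs_base_kernel (plus the user_regs, flags and debug registers that the header comment lists for HVM).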
Jan Beulich
2013-Nov-04 16:20 UTC
Re: [PATCH v14 04/17] Introduce pv guest type and has_hvm_container macros
>>> On 04.11.13 at 13:14, George Dunlap <george.dunlap@eu.citrix.com> wrote: > The goal of this patch is to classify conditionals more clearly, as to > whether they relate to pv guests, hvm-only guests, or guests with an > "hvm container" (which will eventually include PVH). > > This patch introduces an enum for guest type, as well as two new macros > for switching behavior on and off: is_pv_* and has_hvm_container_*. At the > moment is_pv_* <=> !has_hvm_container_*. The purpose of having two is that > it seems to me different to take a path because something does *not* have PV > structures as to take a path because it *does* have HVM structures, even if > the > two happen to coincide 100% at the moment. The exact usage is occasionally > a bit > fuzzy though, and a judgement call just needs to be made on which is > clearer. > > In general, a switch should use is_pv_* (or !is_pv_*) if the code in > question > relates directly to a PV guest. Examples include use of pv_vcpu structs or > other behavior directly related to PV domains. > > hvm_container is more of a fuzzy concept, but in general:So sadly this is still being retained, despite its redundancy. Jan
George Dunlap
2013-Nov-04 16:20 UTC
Re: [PATCH v14 12/17] pvh: Use PV handlers for cpuid, and IO
On 04/11/13 12:15, George Dunlap wrote:> For cpuid, this means putting hooks into the vmexit handler to call it instead > of the hvm one. > > For IO, this now means putting a hook into the emulation code to call the PV > guest_io_{read,write} functions. > > NB at this point this won''t do the full "copy and execute on the stack > with full GPRs" work-around; this may need to be sorted out for dom0 to allow > these instructions to happen in guest context. > > Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com> > Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com> > --- > v14-hvmio: > - Use HVM path to emulate IO, calling into PV just for the final IO > - Don''t support forced invalid ops any more (so we can avoid PV emulation altogether) > v13: > - Remove unnecessary privilege check in PIO path, update related comment > - Move ? and : to end of line rather than beginning, as requested > > CC: Jan Beulich <jbeulich@suse.com> > CC: Tim Deegan <tim@xen.org> > CC: Keir Fraser <keir@xen.org> > --- > xen/arch/x86/hvm/emulate.c | 75 ++++++++++++++++++++++++++++++++++----- > xen/arch/x86/hvm/vmx/vmx.c | 3 +- > xen/arch/x86/traps.c | 6 ++-- > xen/include/asm-x86/processor.h | 2 ++ > xen/include/asm-x86/traps.h | 8 +++++ > 5 files changed, 81 insertions(+), 13 deletions(-) > > diff --git a/xen/arch/x86/hvm/emulate.c b/xen/arch/x86/hvm/emulate.c > index a41eaa1..0d767c2 100644 > --- a/xen/arch/x86/hvm/emulate.c > +++ b/xen/arch/x86/hvm/emulate.c > @@ -16,14 +16,14 @@ > #include <xen/paging.h> > #include <xen/trace.h> > #include <asm/event.h> > +#include <asm/traps.h> > #include <asm/xstate.h> > #include <asm/hvm/emulate.h> > #include <asm/hvm/hvm.h> > #include <asm/hvm/trace.h> > #include <asm/hvm/support.h> > > -static void hvmtrace_io_assist(int is_mmio, ioreq_t *p) > -{ > +static void trace_io_assist(int is_mmio, int dir, int data_valid, paddr_t addr, unsigned int data) { > unsigned int size, event; > unsigned char buffer[12]; > > @@ -31,22 +31,28 @@ static void 
hvmtrace_io_assist(int is_mmio, ioreq_t *p) > return; > > if ( is_mmio ) > - event = p->dir ? TRC_HVM_IOMEM_READ : TRC_HVM_IOMEM_WRITE; > + event = dir ? TRC_HVM_IOMEM_READ : TRC_HVM_IOMEM_WRITE; > else > - event = p->dir ? TRC_HVM_IOPORT_READ : TRC_HVM_IOPORT_WRITE; > + event = dir ? TRC_HVM_IOPORT_READ : TRC_HVM_IOPORT_WRITE; > > - *(uint64_t *)buffer = p->addr; > - size = (p->addr != (u32)p->addr) ? 8 : 4; > + *(uint64_t *)buffer = addr; > + size = (addr != (u32)addr) ? 8 : 4; > if ( size == 8 ) > event |= TRC_64_FLAG; > > - if ( !p->data_is_ptr ) > + if ( data_valid ) > { > - *(uint32_t *)&buffer[size] = p->data; > + *(uint32_t *)&buffer[size] = data; > size += 4; > } > > trace_var(event, 0/*!cycles*/, size, buffer); > + > +} > + > +static void hvmtrace_io_assist(int is_mmio, ioreq_t *p) > +{ > + trace_io_assist(is_mmio, p->dir, !p->data_is_ptr, p->addr, p->data); > } > > static int hvmemul_do_io( > @@ -140,6 +146,9 @@ static int hvmemul_do_io( > } > } > > + if ( is_pvh_vcpu(curr) ) > + ASSERT(vio->io_state == HVMIO_none); > + > switch ( vio->io_state ) > { > case HVMIO_none: > @@ -284,11 +293,59 @@ static int hvmemul_do_io( > return X86EMUL_OKAY; > } > > +static int pvhemul_do_pio( > + unsigned long port, int size, paddr_t ram_gpa, int dir, void *p_data) > +{ > + paddr_t value = ram_gpa; > + struct vcpu *curr = current; > + struct cpu_user_regs *regs = guest_cpu_user_regs(); > + > + /* > + * Weird-sized accesses have undefined behaviour: we discard writes > + * and read all-ones. 
> + */ > + if ( unlikely((size > sizeof(long)) || (size & (size - 1))) ) > + { > + gdprintk(XENLOG_WARNING, "bad mmio size %d\n", size); > + ASSERT(p_data != NULL); /* cannot happen with a REP prefix */ > + if ( dir == IOREQ_READ ) > + memset(p_data, ~0, size); > + return X86EMUL_UNHANDLEABLE; > + } > + > + if ( dir == IOREQ_WRITE ) { > + if ( (p_data != NULL) ) > + { > + memcpy(&value, p_data, size); > + p_data = NULL; > + } > + > + if ( dir == IOREQ_WRITE ) > + trace_io_assist(0, dir, 1, port, value); > + > + guest_io_write(port, size, value, curr, regs); > + } > + else > + { > + value = guest_io_read(port, size, curr, regs); > + trace_io_assist(0, dir, 1, port, value); > + if ( (p_data != NULL) ) > + memcpy(p_data, &value, size); > + memcpy(&regs->eax, &value, size); > + } > + > + return X86EMUL_OKAY; > +} > + > + > int hvmemul_do_pio( > unsigned long port, unsigned long *reps, int size, > paddr_t ram_gpa, int dir, int df, void *p_data) > { > - return hvmemul_do_io(0, port, reps, size, ram_gpa, dir, df, p_data); > + return is_hvm_vcpu(current) ? > + hvmemul_do_io(0, port, reps, size, ram_gpa, dir, df, p_data) : > + pvhemul_do_pio(port, size, ram_gpa, dir, p_data); > + > } > > static int hvmemul_do_mmio( > diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c > index 94e9e21..5d1e367 100644 > --- a/xen/arch/x86/hvm/vmx/vmx.c > +++ b/xen/arch/x86/hvm/vmx/vmx.c > @@ -56,6 +56,7 @@ > #include <asm/apic.h> > #include <asm/hvm/nestedhvm.h> > #include <asm/event.h> > +#include <asm/traps.h> > > enum handler_return { HNDL_done, HNDL_unhandled, HNDL_exception_raised }; > > @@ -2694,8 +2695,8 @@ void vmx_vmexit_handler(struct cpu_user_regs *regs) > break; > } > case EXIT_REASON_CPUID: > + is_pvh_vcpu(v) ? 
pv_cpuid(regs) : vmx_do_cpuid(regs); > update_guest_eip(); /* Safe: CPUID */ > - vmx_do_cpuid(regs); > break; > case EXIT_REASON_HLT: > update_guest_eip(); /* Safe: HLT */ > diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c > index edb7a6a..6c278bc 100644 > --- a/xen/arch/x86/traps.c > +++ b/xen/arch/x86/traps.c > @@ -729,7 +729,7 @@ int cpuid_hypervisor_leaves( uint32_t idx, uint32_t sub_idx, > return 1; > } > > -static void pv_cpuid(struct cpu_user_regs *regs) > +void pv_cpuid(struct cpu_user_regs *regs) > { > uint32_t a, b, c, d; > > @@ -1681,7 +1681,7 @@ static int pci_cfg_ok(struct domain *d, int write, int size) > return 1; > } > > -static uint32_t guest_io_read( > +uint32_t guest_io_read( > unsigned int port, unsigned int bytes, > struct vcpu *v, struct cpu_user_regs *regs) > { > @@ -1748,7 +1748,7 @@ static uint32_t guest_io_read( > return data; > } > > -static void guest_io_write( > +void guest_io_write( > unsigned int port, unsigned int bytes, uint32_t data, > struct vcpu *v, struct cpu_user_regs *regs) > { > diff --git a/xen/include/asm-x86/processor.h b/xen/include/asm-x86/processor.h > index 893afa3..551036d 100644 > --- a/xen/include/asm-x86/processor.h > +++ b/xen/include/asm-x86/processor.h > @@ -567,6 +567,8 @@ void microcode_set_module(unsigned int); > int microcode_update(XEN_GUEST_HANDLE_PARAM(const_void), unsigned long len); > int microcode_resume_cpu(int cpu); > > +void pv_cpuid(struct cpu_user_regs *regs); > + > #endif /* !__ASSEMBLY__ */ > > #endif /* __ASM_X86_PROCESSOR_H */ > diff --git a/xen/include/asm-x86/traps.h b/xen/include/asm-x86/traps.h > index 82cbcee..a26b318 100644 > --- a/xen/include/asm-x86/traps.h > +++ b/xen/include/asm-x86/traps.h > @@ -49,4 +49,12 @@ extern int guest_has_trap_callback(struct domain *d, uint16_t vcpuid, > extern int send_guest_trap(struct domain *d, uint16_t vcpuid, > unsigned int trap_nr); > > +uint32_t guest_io_read( > + unsigned int port, unsigned int bytes, > + struct vcpu *v, struct 
cpu_user_regs *regs); > +void guest_io_write( > + unsigned int port, unsigned int bytes, uint32_t data, > + struct vcpu *v, struct cpu_user_regs *regs); > + > + > #endif /* ASM_TRAP_H */
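The "weird-sized access" test at the top of pvhemul_do_pio() combines a range check with the usual power-of-two bit trick. A minimal standalone sketch of just that predicate follows; sk_bad_pio_size() is a hypothetical name, and the explicit cast only spells out the int-to-unsigned conversion that the original comparison performs implicitly.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Returns true for sizes the emulation path would reject: anything larger
 * than a long, or anything that is not a power of two.
 * Note: size == 0 passes this check, since (0 & -1) == 0.
 */
static bool sk_bad_pio_size(int size)
{
    /* A negative size converts to a huge unsigned value and is rejected. */
    if ( (size_t)size > sizeof(long) )
        return true;

    /* A power of two has exactly one bit set, so size & (size - 1) == 0. */
    return (size & (size - 1)) != 0;
}
```

For port I/O the interesting sizes are 1, 2 and 4, all of which pass; a size of 3 fails the bit test and a size of 16 fails the range test.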
George Dunlap
2013-Nov-04 16:21 UTC
Re: [PATCH v14 06/17] pvh: Disable unneeded features of HVM containers
On 04/11/13 12:14, George Dunlap wrote:> Things kept: > * cacheattr_region lists > * irq-related structures > * paging > * tm_list > * hvm params > > Things disabled for now: > * compat xlation > > Things disabled: > * Emulated timers and clock sources > * IO/MMIO io requests > * msix tables > * hvm_funcs > * nested HVM > * Fast-path for emulated lapic accesses > > Getting rid of the hvm_params struct required a couple other places to > check for its existence before attempting to read the params. > > Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com> > Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com> > --- > v14: > - Also free the params struct for pvh domains, since we''ve allocated it > - Fail io for pvh VMs further down the stack, as we will be using the emulation > code before calling into the pv pio handlers > v13: > - Removed unnecessary comment > - Allocate params for pvh domains; remove null checks necessary in last patch > - Add ASSERT(!is_pvh) to handle_pio > CC: Jan Beulich <jbeulich@suse.com> > CC: Tim Deegan <tim@xen.org> > CC: Keir Fraser <keir@xen.org> > --- > xen/arch/x86/hvm/emulate.c | 11 +++++++++- > xen/arch/x86/hvm/hvm.c | 50 +++++++++++++++++++++++++++++++++++++------ > xen/arch/x86/hvm/irq.c | 3 +++ > xen/arch/x86/hvm/vmx/intr.c | 3 ++- > 4 files changed, 58 insertions(+), 9 deletions(-) > > diff --git a/xen/arch/x86/hvm/emulate.c b/xen/arch/x86/hvm/emulate.c > index f39c173..a41eaa1 100644 > --- a/xen/arch/x86/hvm/emulate.c > +++ b/xen/arch/x86/hvm/emulate.c > @@ -57,12 +57,21 @@ static int hvmemul_do_io( > int value_is_ptr = (p_data == NULL); > struct vcpu *curr = current; > struct hvm_vcpu_io *vio; > - ioreq_t *p = get_ioreq(curr); > + ioreq_t *p; > unsigned long ram_gfn = paddr_to_pfn(ram_gpa); > p2m_type_t p2mt; > struct page_info *ram_page; > int rc; > > + /* PVH doesn''t have an ioreq infrastructure */ > + if ( is_pvh_vcpu(curr) ) > + { > + gdprintk(XENLOG_WARNING, "Unexpected io from PVH guest\n"); > + return 
X86EMUL_UNHANDLEABLE; > + } > + > + p = get_ioreq(curr); > + > /* Check for paged out page */ > ram_page = get_page_from_gfn(curr->domain, ram_gfn, &p2mt, P2M_UNSHARE); > if ( p2m_is_paging(p2mt) ) > diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c > index 87a6f42..72ca936 100644 > --- a/xen/arch/x86/hvm/hvm.c > +++ b/xen/arch/x86/hvm/hvm.c > @@ -301,6 +301,10 @@ u64 hvm_get_guest_tsc_adjust(struct vcpu *v) > > void hvm_migrate_timers(struct vcpu *v) > { > + /* PVH doesn''t use rtc and emulated timers, it uses pvclock mechanism. */ > + if ( is_pvh_vcpu(v) ) > + return; > + > rtc_migrate_timers(v); > pt_migrate(v); > } > @@ -342,10 +346,13 @@ void hvm_do_resume(struct vcpu *v) > { > ioreq_t *p; > > - pt_restore_timer(v); > - > check_wakeup_from_wait(); > > + if ( is_pvh_vcpu(v) ) > + goto check_inject_trap; > + > + pt_restore_timer(v); > + > /* NB. Optimised for common case (p->state == STATE_IOREQ_NONE). */ > p = get_ioreq(v); > while ( p->state != STATE_IOREQ_NONE ) > @@ -368,6 +375,7 @@ void hvm_do_resume(struct vcpu *v) > } > } > > + check_inject_trap: > /* Inject pending hw/sw trap */ > if ( v->arch.hvm_vcpu.inject_trap.vector != -1 ) > { > @@ -528,10 +536,16 @@ int hvm_domain_initialise(struct domain *d) > if ( rc != 0 ) > goto fail0; > > + rc = -ENOMEM; > d->arch.hvm_domain.params = xzalloc_array(uint64_t, HVM_NR_PARAMS); > + if ( !d->arch.hvm_domain.params ) > + goto fail1; > + > + if ( is_pvh_domain(d) ) > + return 0; > + > d->arch.hvm_domain.io_handler = xmalloc(struct hvm_io_handler); > - rc = -ENOMEM; > - if ( !d->arch.hvm_domain.params || !d->arch.hvm_domain.io_handler ) > + if ( !d->arch.hvm_domain.io_handler ) > goto fail1; > d->arch.hvm_domain.io_handler->num_slot = 0; > > @@ -578,6 +592,11 @@ int hvm_domain_initialise(struct domain *d) > > void hvm_domain_relinquish_resources(struct domain *d) > { > + xfree(d->arch.hvm_domain.params); > + > + if ( is_pvh_domain(d) ) > + return; > + > if ( hvm_funcs.nhvm_domain_relinquish_resources ) > 
hvm_funcs.nhvm_domain_relinquish_resources(d); > > @@ -596,12 +615,15 @@ void hvm_domain_relinquish_resources(struct domain *d) > } > > xfree(d->arch.hvm_domain.io_handler); > - xfree(d->arch.hvm_domain.params); > } > > void hvm_domain_destroy(struct domain *d) > { > hvm_destroy_cacheattr_region_list(d); > + > + if ( is_pvh_domain(d) ) > + return; > + > hvm_funcs.domain_destroy(d); > rtc_deinit(d); > stdvga_deinit(d); > @@ -1103,7 +1125,9 @@ int hvm_vcpu_initialise(struct vcpu *v) > goto fail1; > > /* NB: vlapic_init must be called before hvm_funcs.vcpu_initialise */ > - if ( (rc = vlapic_init(v)) != 0 ) /* teardown: vlapic_destroy */ > + if ( is_hvm_vcpu(v) ) > + rc = vlapic_init(v); > + if ( rc != 0 ) /* teardown: vlapic_destroy */ > goto fail2; > > if ( (rc = hvm_funcs.vcpu_initialise(v)) != 0 ) /* teardown: hvm_funcs.vcpu_destroy */ > @@ -1118,6 +1142,14 @@ int hvm_vcpu_initialise(struct vcpu *v) > > v->arch.hvm_vcpu.inject_trap.vector = -1; > > + if ( is_pvh_vcpu(v) ) > + { > + v->arch.hvm_vcpu.hcall_64bit = 1; /* PVH 32bitfixme. */ > + /* This for hvm_long_mode_enabled(v). */ > + v->arch.hvm_vcpu.guest_efer = EFER_SCE | EFER_LMA | EFER_LME; > + return 0; > + } > + > rc = setup_compat_arg_xlat(v); /* teardown: free_compat_arg_xlat() */ > if ( rc != 0 ) > goto fail4; > @@ -1189,7 +1221,10 @@ void hvm_vcpu_destroy(struct vcpu *v) > > tasklet_kill(&v->arch.hvm_vcpu.assert_evtchn_irq_tasklet); > hvm_vcpu_cacheattr_destroy(v); > - vlapic_destroy(v); > + > + if ( is_hvm_vcpu(v) ) > + vlapic_destroy(v); > + > hvm_funcs.vcpu_destroy(v); > > /* Event channel is already freed by evtchn_destroy(). */ > @@ -1390,6 +1425,7 @@ int hvm_hap_nested_page_fault(paddr_t gpa, > /* For the benefit of 32-bit WinXP (& older Windows) on AMD CPUs, > * a fast path for LAPIC accesses, skipping the p2m lookup. 
*/ > if ( !nestedhvm_vcpu_in_guestmode(v) > + && is_hvm_vcpu(v) > && gfn == PFN_DOWN(vlapic_base_address(vcpu_vlapic(v))) ) > { > if ( !handle_mmio() ) > diff --git a/xen/arch/x86/hvm/irq.c b/xen/arch/x86/hvm/irq.c > index 6a6fb68..677fbcd 100644 > --- a/xen/arch/x86/hvm/irq.c > +++ b/xen/arch/x86/hvm/irq.c > @@ -405,6 +405,9 @@ struct hvm_intack hvm_vcpu_has_pending_irq(struct vcpu *v) > && vcpu_info(v, evtchn_upcall_pending) ) > return hvm_intack_vector(plat->irq.callback_via.vector); > > + if ( is_pvh_vcpu(v) ) > + return hvm_intack_none; > + > if ( vlapic_accept_pic_intr(v) && plat->vpic[0].int_output ) > return hvm_intack_pic(0); > > diff --git a/xen/arch/x86/hvm/vmx/intr.c b/xen/arch/x86/hvm/vmx/intr.c > index 1942e31..7757910 100644 > --- a/xen/arch/x86/hvm/vmx/intr.c > +++ b/xen/arch/x86/hvm/vmx/intr.c > @@ -236,7 +236,8 @@ void vmx_intr_assist(void) > } > > /* Crank the handle on interrupt state. */ > - pt_vector = pt_update_irq(v); > + if ( is_hvm_vcpu(v) ) > + pt_vector = pt_update_irq(v); > > do { > unsigned long intr_info;
George Dunlap
2013-Nov-04 16:26 UTC
Re: [PATCH v14 04/17] Introduce pv guest type and has_hvm_container macros
On 04/11/13 16:20, Jan Beulich wrote:
>>>> On 04.11.13 at 13:14, George Dunlap <george.dunlap@eu.citrix.com> wrote:
>> The goal of this patch is to classify conditionals more clearly, as to
>> whether they relate to pv guests, hvm-only guests, or guests with an
>> "hvm container" (which will eventually include PVH).
>>
>> This patch introduces an enum for guest type, as well as two new macros
>> for switching behavior on and off: is_pv_* and has_hvm_container_*. At the
>> moment is_pv_* <=> !has_hvm_container_*. The purpose of having two is that
>> it seems to me different to take a path because something does *not* have PV
>> structures as to take a path because it *does* have HVM structures, even if
>> the two happen to coincide 100% at the moment. The exact usage is
>> occasionally a bit fuzzy though, and a judgement call just needs to be made
>> on which is clearer.
>>
>> In general, a switch should use is_pv_* (or !is_pv_*) if the code in
>> question relates directly to a PV guest. Examples include use of pv_vcpu
>> structs or other behavior directly related to PV domains.
>>
>> hvm_container is more of a fuzzy concept, but in general:
> So sadly this is still being retained, despite its redundancy.

I understood our discussion at XenSummit to be that I should send the series as-is, once I had fixed the outstanding bugs.

I'm inclined to consider Tim's suggestion (as I understand it), that we get rid of the separate PVH mode, but instead have a number of features which can be enabled and disabled (apic, qemu, &c). But that shouldn't affect the interface to Linux.

 -George
Jan Beulich
2013-Nov-04 16:37 UTC
Re: [PATCH v14 06/17] pvh: Disable unneeded features of HVM containers
>>> On 04.11.13 at 13:14, George Dunlap <george.dunlap@eu.citrix.com> wrote:
> Things kept:
> * cacheattr_region lists
> * irq-related structures
> * paging
> * tm_list
> * hvm params
>
> Things disabled for now:
> * compat xlation
>
> Things disabled:
> * Emulated timers and clock sources
> * IO/MMIO io requests
> * msix tables
> * hvm_funcs
> * nested HVM
> * Fast-path for emulated lapic accesses
>
> Getting rid of the hvm_params struct required a couple other places to
> check for its existence before attempting to read the params.
>
> Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>
> Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>

Reviewed-by: Jan Beulich <jbeulich@suse.com>
with a minor comment:

> @@ -1118,6 +1142,14 @@ int hvm_vcpu_initialise(struct vcpu *v)
>
>      v->arch.hvm_vcpu.inject_trap.vector = -1;
>
> +    if ( is_pvh_vcpu(v) )
> +    {
> +        v->arch.hvm_vcpu.hcall_64bit = 1;    /* PVH 32bitfixme. */
> +        /* This for hvm_long_mode_enabled(v). */

I think I read (earlier today) in a reply on your previous patch series that you were going to adjust this bogus comment...

> +        v->arch.hvm_vcpu.guest_efer = EFER_SCE | EFER_LMA | EFER_LME;
> +        return 0;
> +    }
> +
>      rc = setup_compat_arg_xlat(v); /* teardown: free_compat_arg_xlat() */
>      if ( rc != 0 )
>          goto fail4;

Jan
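For readers following the guest_efer assignment in the hunk above, a minimal C sketch of what the three bits mean may help. The bit positions are the architectural x86-64 ones. The helper names efer_long_mode_active() and pvh_initial_efer() are hypothetical, and treating hvm_long_mode_enabled(v) as a plain test of EFER.LMA is an assumption about Xen's macro, not something stated in this thread.

```c
#include <assert.h>
#include <stdint.h>

/* Architectural bit positions in the IA32_EFER MSR. */
#define EFER_SCE (1ull << 0)   /* SYSCALL/SYSRET enable */
#define EFER_LME (1ull << 8)   /* long mode enable (requested by software) */
#define EFER_LMA (1ull << 10)  /* long mode active (reported by hardware) */

/* Hypothetical helper: nonzero when the EFER value reports long mode active. */
static inline int efer_long_mode_active(uint64_t efer)
{
    return (efer & EFER_LMA) != 0;
}

/* The value the quoted hunk stores in guest_efer for a PVH vcpu. */
static inline uint64_t pvh_initial_efer(void)
{
    uint64_t efer = EFER_SCE | EFER_LMA | EFER_LME;

    /* A PVH vcpu starts directly in 64-bit mode, so LMA must be set. */
    assert(efer_long_mode_active(efer));
    return efer;
}
```

Under these definitions the start-of-day value works out to 0x501, whereas an EFER with only LME set (0x100) would not count as long mode being active.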
George Dunlap
2013-Nov-04 16:39 UTC
Re: [PATCH v14 04/17] Introduce pv guest type and has_hvm_container macros
On 04/11/13 16:20, Jan Beulich wrote:
>>>> On 04.11.13 at 13:14, George Dunlap <george.dunlap@eu.citrix.com> wrote:
>> The goal of this patch is to classify conditionals more clearly, as to
>> whether they relate to pv guests, hvm-only guests, or guests with an
>> "hvm container" (which will eventually include PVH).
>>
>> This patch introduces an enum for guest type, as well as two new macros
>> for switching behavior on and off: is_pv_* and has_hvm_container_*. At the
>> moment is_pv_* <=> !has_hvm_container_*. The purpose of having two is that
>> it seems to me different to take a path because something does *not* have PV
>> structures as to take a path because it *does* have HVM structures, even if
>> the two happen to coincide 100% at the moment. The exact usage is
>> occasionally a bit fuzzy though, and a judgement call just needs to be made
>> on which is clearer.
>>
>> In general, a switch should use is_pv_* (or !is_pv_*) if the code in
>> question relates directly to a PV guest. Examples include use of pv_vcpu
>> structs or other behavior directly related to PV domains.
>>
>> hvm_container is more of a fuzzy concept, but in general:
> So sadly this is still being retained, despite its redundancy.

Given that I'm going to have to respin anyway to address outstanding comments on v13, probably the thing I could use the most is a close review of patch 12, about using the HVM paths for IO emulation, then calling into the PV handlers.

 -George
>>> On 04.11.13 at 13:14, George Dunlap <george.dunlap@eu.citrix.com> wrote:
> @@ -887,7 +949,32 @@ static int construct_vmcs(struct vcpu *v)
>      /* Do not enable Monitor Trap Flag unless start single step debug */
>      v->arch.hvm_vmx.exec_control &= ~CPU_BASED_MONITOR_TRAP_FLAG;
>
> +    if ( is_pvh_domain(d) )
> +    {
> +        /* Disable virtual apics, TPR */
> +        v->arch.hvm_vmx.secondary_exec_control &=
> +            ~(SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES
> +              | SECONDARY_EXEC_APIC_REGISTER_VIRT
> +              | SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY);
> +        v->arch.hvm_vmx.exec_control &= ~CPU_BASED_TPR_SHADOW;
> +
> +        /* Disable wbinvd (only necessary for MMIO),
> +         * unrestricted guest (real mode for EPT) */
> +        v->arch.hvm_vmx.secondary_exec_control &=
> +            ~(SECONDARY_EXEC_UNRESTRICTED_GUEST
> +              | SECONDARY_EXEC_WBINVD_EXITING);

I think I commented on this before - when a PVH guest has a physical device assigned, why is WBINVD interception then not necessary just like it would be for HVM?

> @@ -1041,12 +1143,29 @@ static int construct_vmcs(struct vcpu *v)
>          | (1U << TRAP_no_device);
>      vmx_update_exception_bitmap(v);
>
> +    /* In HVM domains, this happens on the realmode->paging
> +     * transition. Since PVH never goes through this transition, we
> +     * need to do it at start-of-day. */
> +    if ( is_pvh_domain(d) )
> +        vmx_update_debug_state(v);
> +
>      v->arch.hvm_vcpu.guest_cr[0] = X86_CR0_PE | X86_CR0_ET;
> +
> +    /* PVH domains always start in paging mode */
> +    if ( is_pvh_domain(d) )
> +        v->arch.hvm_vcpu.guest_cr[0] |= X86_CR0_PG | X86_CR0_NE | X86_CR0_WP;
> +
>      hvm_update_guest_cr(v, 0);
>
> -    v->arch.hvm_vcpu.guest_cr[4] = 0;
> +    v->arch.hvm_vcpu.guest_cr[4] = is_pvh_domain(d) ?
> +        (real_cr4_to_pv_guest_cr4(mmu_cr4_features)
> +         & ~HVM_CR4_GUEST_RESERVED_BITS(v))
> +        : 0;
>      hvm_update_guest_cr(v, 4);
>
> +    if ( is_pvh_domain(d) )
> +        v->arch.hvm_vmx.vmx_realmode = 0;

Rather than doing this here, wouldn't it be more clean to suppress this getting set to 1 in the first place?

Jan
Jan Beulich
2013-Nov-04 16:53 UTC
Re: [PATCH v14 11/17] pvh: Set up more PV stuff in set_info_guest
>>> On 04.11.13 at 13:15, George Dunlap <george.dunlap@eu.citrix.com> wrote:
> @@ -728,8 +740,21 @@ int arch_set_info_guest(
>
>      if ( has_hvm_container_vcpu(v) )
>      {
> -        hvm_set_info_guest(v);
> -        goto out;
> +        hvm_set_info_guest(v, compat ? 0 : c.nat->gs_base_kernel);

I'm afraid this isn't correct - so far gs_base_kernel didn't get used for HVM guests, i.e. you're changing behavior here (even if only in a - presumably - benign way).

> +
> +        if ( is_hvm_vcpu(v) || v->is_initialised )
> +            goto out;
> +
> +        cr3_gfn = xen_cr3_to_pfn(c.nat->ctrlreg[3]);

I'd recommend against using this PV construct - the 32-bit counterpart won't be correct to be used here once 32-bit support gets added.

> @@ -1426,6 +1426,11 @@ static void vmx_set_info_guest(struct vcpu *v)
>          __vmwrite(GUEST_INTERRUPTIBILITY_INFO, intr_shadow);
>      }
>
> +    /* PVH 32bitfixme */
> +    if ( is_pvh_vcpu(v) )
> +        __vmwrite(GUEST_GS_BASE, gs_base_kernel);

Oh, I see, you suppress this here. I'd really suggest adjusting the caller, then you don't need to do anything here afaict.

> +
> +

Or if you need to, please add just a single blank line here.

Jan
Konrad Rzeszutek Wilk
2013-Nov-04 16:59 UTC
Re: [PATCH v14 00/20] Introduce PVH domU support
On Mon, Nov 04, 2013 at 12:14:49PM +0000, George Dunlap wrote:
> Updates:
> - Fixed bugs in v14:
>   Zombie domains, FreeBSD crash, Crash at 4GiB, HVM crash
>   (Thank you to Roger Pau Mone for fixes to the last 3)
> - Completely eliminated PV emulation codepath

Odd, you dropped Mukesh email from the patch series - so he can't
jump on answering questions right away.

>
> == RFC ==
>
> We had talked about accepting the patch series as-is once I had the
> known bugs fixed; but I couldn't help making an attempt at using the
> HVM IO emulation codepaths so that we could completely eliminate
> having to use the PV emulation code, in turn eliminating some of the
> uglier "support" patches required to make the PV emulation code
> capable of running on a PVH guest. The idea for "admin" pio ranges
> would be that we would use the vmx hardware to allow the guest direct
> access, rather than the "re-execute with guest GPRs" trick that PV
> uses. (This functionality is not implememted by this patch series, so
> we would need to make sure it was sorted for the dom0 series.)
>
> The result looks somewhat cleaner to me. On the other hand, because
> string in & out instructions use the full emulation code, it means
> opening up an extra 6k lines of code to PVH guests, including all the
> complexity of the ioreq path. (It doesn't actually send ioreqs, but
> since it shares much of the path, it shares much of the complexity.)
> Additionally, I'm not sure I've done it entirely correctly: the guest
> boots and the io instructions it executes seem to be handled
> correctly, but it may not be using the corner cases.

The case I think Mukesh was hitting was the 'speaker_io' path. But
perhaps I am misremembering it?

>
> This also means no support for "legacy" forced invalid ops -- only native
> cpuid is supported in this series.

OK.

>
> I have the fixes in another series, if people think it would be better
> to check in exactly what we had with bug fixes ASAP.
>
> Other "open issues" on the design (which need not stop the series
> going in) include:
>
> - Whether a completely separate mode is necessary, or whether having
> just having HVM mode with some flags to disable / change certain
> functionality would be better
>
> - Interface-wise: Right now PVH is special-cased for bringing up
> CPUs. Is this what we want to do going forward, or would it be better
> to try to make it more like PV (which was tried before and is hard), or more
> like HVM (which would involve having emulated APICs, &c &c).

How is it hard? From the Linux standpoint it is just an hypercall?

>
> == Summay ==
>
> This patch series is a reworking of a series developed by Mukesh
> Rathor at Oracle. The entirety of the design and development was done
> by him; I have only reworked, reorganized, and simplified things in a
> way that I think makes more sense. The vast majority of the credit
> for this effort therefore goes to him. This version is labelled v14
> because it is based on his most recent series, v11.
>
> Because this is based on his work, I retain the "Signed-off-by" in
> patches which are based on his code. This is not meant to imply that
> he supports the modified version, only that he is involved in
> certifying that the origin of the code for copyright purposes.
>
> This patch series is broken down into several broad strokes:
> * Miscellaneous fixes or tweaks
> * Code motion, so future patches are simpler
> * Introduction of the "hvm_container" concept, which will form the
> basis for sharing codepaths between hvm and pvh
> * Start with PVH as an HVM container
> * Disable unneeded HVM functionality
> * Enable PV functionality
> * Disable not-yet-implemented functionality
> * Enable toolstack changes required to make PVH guests
>
> This patch series can also be pulled from this git tree:
> git://xenbits.xen.org/people/gdunlap/xen.git out/pvh-v14
>
> The kernel code for PVH guests can be found here:
> git://oss.oracle.com/git/mrathor/linux.git pvh.v9-muk-1
> (That repo/branch also contains a config file, pvh-config-file)
>
> Changes in v14 can be found inline; major changes since v13 include:
>
> * Various bug fixes
>
> * Use HVM emulation for IO instructions
>
> * ...thus removing many of the changes required to allow the PV
> emulation codepath to work for PVH guests
>
> Changes in v13 can be found inline; major changes since v12 include:
>
> * Include Mukesh's toolstack patches (v4)
>
> * Allocate hvm_param struct for PVH domains; remove patch disabling
> memevents
>
> For those who have been following the series as it develops, here is a
> summary of the major changes from Mukesh's series (v11->v12):
>
> * Introduction of "has_hvm_container_*()" macros, rather than using
> "!is_pv_*". The patch which introduces this also does the vast
> majority of the "heavy lifting" in terms of defining PVH.
>
> * Effort is made to use as much common code as possible. No separate
> vmcs constructor, no separate vmexit handlers. More of a "start
> with everything and disable if necessary" approach rather than
> "start with nothing and enable as needed" approach.
>
> * One exception is arch_set_info_guest(), where a small amount of code
> duplication meant a lot fewer "if(!is_pvh_domain())"s in awkward
> places
>
> * I rely on things being disabled at a higher level and passed down.
> For instance, I no longer explicitly disable rdtsc exiting in
> construct_vmcs(), since that will happen automatically when we're in
> NEVER_EMULATE mode (which is currently enforced for PVH). Similarly
> for nested vmx and things relating to HAP mode.
>
> * I have also done a slightly more extensive audit of is_pv_* and
> is_hvm_* and tried to do more restrictions.
>
> * I changed the "enable PVH by setting PV + HAP", replacing it instead
> with a separate flag, just like the HVM case, since it makes sense
> to plan on using shadow in the future (although it is
>
> Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>
> CC: Mukesh Rathor <mukesh.rathor@oracle.com>
> CC: Jan Beulich <beulich@suse.com>
> CC: Tim Deegan <tim@xen.org>
> CC: Keir Fraser <keir@xen.org>
> CC: Ian Jackson <ian.jackson@citrix.com>
> CC: Ian Campbell <ian.campbell@citrix.com>
>
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel
On 04/11/13 16:59, Konrad Rzeszutek Wilk wrote:
> On Mon, Nov 04, 2013 at 12:14:49PM +0000, George Dunlap wrote:
>> Updates:
>> - Fixed bugs in v14:
>>   Zombie domains, FreeBSD crash, Crash at 4GiB, HVM crash
>>   (Thank you to Roger Pau Mone for fixes to the last 3)
>> - Completely eliminated PV emulation codepath
>
> Odd, you dropped Mukesh email from the patch series - so he can't
> jump on answering questions right away.

The mail I received has Mukesh cc'd in all the patches...

>
>> == RFC ==
>>
>> We had talked about accepting the patch series as-is once I had the
>> known bugs fixed; but I couldn't help making an attempt at using the
>> HVM IO emulation codepaths so that we could completely eliminate
>> having to use the PV emulation code, in turn eliminating some of the
>> uglier "support" patches required to make the PV emulation code
>> capable of running on a PVH guest. The idea for "admin" pio ranges
>> would be that we would use the vmx hardware to allow the guest direct
>> access, rather than the "re-execute with guest GPRs" trick that PV
>> uses. (This functionality is not implememted by this patch series, so
>> we would need to make sure it was sorted for the dom0 series.)
>>
>> The result looks somewhat cleaner to me. On the other hand, because
>> string in & out instructions use the full emulation code, it means
>> opening up an extra 6k lines of code to PVH guests, including all the
>> complexity of the ioreq path. (It doesn't actually send ioreqs, but
>> since it shares much of the path, it shares much of the complexity.)
>> Additionally, I'm not sure I've done it entirely correctly: the guest
>> boots and the io instructions it executes seem to be handled
>> correctly, but it may not be using the corner cases.
> The case I think Mukesh was hitting was the 'speaker_io' path. But
> perhaps I am misremembering it?

Well looking at the trace, it looks like the PVH kernel he gave me is actually attempting to enumerate the PCI space (writing a large range of values to cf8 then reading cfc). A full set of accesses is below:

vcpu 0 IO address summary:
   21:[w]      1  0.00s  0.00%  5387 cyc { 5387| 5387| 5387}
   70:[w]      8  0.00s  0.00%  1434 cyc {  916| 1005| 3651}
   71:[r]      8  0.00s  0.00%  1803 cyc { 1017| 1496| 5100}
   a1:[w]      1  0.00s  0.00%  1357 cyc { 1357| 1357| 1357}
  cf8:[r]      3  0.00s  0.00%  1202 cyc { 1088| 1150| 1369}
  cf8:[w]  16850  0.01s  0.00%   966 cyc {  896|  937| 1073}
  cfa:[w]      1  0.00s  0.00%   932 cyc {  932|  932|  932}
  cfb:[w]      2  0.00s  0.00%  2517 cyc { 2001| 3033| 3033}
  cfc:[r]  16560  0.01s  0.00%  1174 cyc { 1118| 1150| 1227}
  cfe:[r]    288  0.00s  0.00%  1380 cyc { 1032| 1431| 1499}
vcpu 1 IO address summary:
   60:[r]     16  0.00s  0.00%  1141 cyc { 1011| 1014| 2093}
   64:[r]  18276  0.01s  0.01%  1579 cyc { 1408| 1443| 2629}
vcpu 2 IO address summary:
   70:[w]     33  0.00s  0.00%  1192 cyc {  855|  920| 2306}
   71:[r]     31  0.00s  0.00%  1177 cyc {  988| 1032| 1567}
   71:[w]      2  0.00s  0.00%  1079 cyc { 1014| 1144| 1144}
  2e9:[r]      3  0.00s  0.00%  1697 cyc { 1002| 1011| 3080}
  2e9:[w]      3  0.00s  0.00%   998 cyc {  902|  952| 1141}
  2f9:[r]      3  0.00s  0.00%  1725 cyc {  996| 1020| 3160}
  2f9:[w]      3  0.00s  0.00%   990 cyc {  905|  935| 1130}
  3e9:[r]      3  0.00s  0.00%  1595 cyc { 1011| 1026| 2749}
  3e9:[w]      3  0.00s  0.00%  1012 cyc {  920|  976| 1142}
  3f9:[r]      3  0.00s  0.00%  2480 cyc {  988| 1079| 5375}
  3f9:[w]      3  0.00s  0.00%  1064 cyc {  913| 1035| 1245}
(No i/o from vcpu 3.)

Presumably some of these are just "the BIOS may be lying, check anyway" probes, which should be harmless for domUs.

>
>> This also means no support for "legacy" forced invalid ops -- only native
>> cpuid is supported in this series.
> OK.

(FWIW, support for legacy forced invalid ops was requested by Tim.)

>> I have the fixes in another series, if people think it would be better
>> to check in exactly what we had with bug fixes ASAP.
>>
>> Other "open issues" on the design (which need not stop the series
>> going in) include:
>>
>> - Whether a completely separate mode is necessary, or whether having
>> just having HVM mode with some flags to disable / change certain
>> functionality would be better
>>
>> - Interface-wise: Right now PVH is special-cased for bringing up
>> CPUs. Is this what we want to do going forward, or would it be better
>> to try to make it more like PV (which was tried before and is hard), or more
>> like HVM (which would involve having emulated APICs, &c &c).
> How is it hard? From the Linux standpoint it is just an hypercall?

This is my understanding of a discussion that happened between Tim and Mukesh just as I was joining the conversation. My understanding was that the issue had to do with pre-loading segments and DTs, which for PV guests is easy because Xen controls the tables themselves, but is harder to do in a reasonable way for HVM guests because the guest controls the tables. Mukesh had initially implemented it the full PV way (or mostly PV), but Tim was concerned about some kind of potential consistency issue. But I didn't read the discussion very carefully, as I was just trying to get my head around the series as a whole at that time.

The suggestion to just use an HVM-style method was made at the XenSummit by Glauber Costa. Glauber is a bit more of a KVM guy, so tends to lean towards "just behave like the real hardware". Nonetheless, I think his concern about adding an extra interface is a valid one, and worth keeping in mind.

 -George
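The cf8/cfc pattern in the trace above matches what PCI configuration mechanism #1 looks like from the port side. The following is a rough C sketch of how such an address is formed, assuming that mechanism is what the guest kernel is using; the helper names are illustrative and are not taken from Linux or Xen.

```c
#include <assert.h>
#include <stdint.h>

#define PCI_CONFIG_ADDRESS 0xcf8u  /* 32-bit address port */
#define PCI_CONFIG_DATA    0xcfcu  /* 32-bit data window, byte-addressable */

/* Build the dword written to 0xcf8 to select bus/device/function/register. */
static inline uint32_t pci_conf1_address(unsigned bus, unsigned dev,
                                         unsigned fn, unsigned reg)
{
    assert(bus < 256 && dev < 32 && fn < 8 && reg < 256);
    return 0x80000000u              /* enable bit */
         | ((uint32_t)bus << 16)
         | ((uint32_t)dev << 11)
         | ((uint32_t)fn << 8)
         | (reg & 0xfcu);           /* dword-aligned register offset */
}

/* Port used for a sub-dword access at register offset reg. */
static inline unsigned pci_conf1_data_port(unsigned reg)
{
    return PCI_CONFIG_DATA + (reg & 3u);
}
```

For example, bus 0, device 3, function 0, register 0x08 gives an address of 0x80001808, and a 16-bit read of register 0x0a lands on port 0xcfe. That would be one plausible reading of the cfe reads in the trace, though the trace alone does not confirm it.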
At 17:23 +0000 on 04 Nov (1383582187), George Dunlap wrote:
> On 04/11/13 16:59, Konrad Rzeszutek Wilk wrote:
> >> This also means no support for "legacy" forced invalid ops -- only native
> >> cpuid is supported in this series.
> > OK.
>
> (FWIW, support for legacy forced invalid ops was requested by Tim.)

I was worried about existing PV kernel code that used the fake-CPUID, which would break if the 'core' kernel code went from PV to PVH. But I guess I could be convinced that such kernel code is buggy?

Really, the high-order bit was consistency. The version I commented on supported them for user-space but not for kernel, which seemed like risking trouble for no benefit.

> >> - Interface-wise: Right now PVH is special-cased for bringing up
> >> CPUs. Is this what we want to do going forward, or would it be better
> >> to try to make it more like PV (which was tried before and is hard), or more
> >> like HVM (which would involve having emulated APICs, &c &c).
> > How is it hard? From the Linux standpoint it is just an hypercall?
>
> This is my understanding of a discussion that happened between Tim and
> Mukesh just as I was joining the conversation. My understanding was
> that the issue had to do with pre-loading segments and DTs, which for PV
> guests is easy because Xen controls the tables themselves, but is harder
> to do in a reasonable way for HVM guests because the guest controls the
> tables. Mukesh had initially implemented it the full PV way (or mostly
> PV), but Tim was concerned about some kind of potential consistency
> issue. But I didn't read the discussion very carefully, as I was just
> trying to get my head around the series as a whole at that time.

I don't think the PV code would be very hard. As I said, we already have code to load all that descriptor state for a HVM vcpu; it should just be a question of calling it all in the right order.

Cheers,

Tim.
Jan Beulich
2013-Nov-05 08:42 UTC
Re: [PATCH v14 12/17] pvh: Use PV handlers for cpuid, and IO
>>> On 04.11.13 at 13:15, George Dunlap <george.dunlap@eu.citrix.com> wrote:
> @@ -140,6 +146,9 @@ static int hvmemul_do_io(
>          }
>      }
>
> +    if ( is_pvh_vcpu(curr) )
> +        ASSERT(vio->io_state == HVMIO_none);

Can we really get here for PVH?

> +static int pvhemul_do_pio(
> +    unsigned long port, int size, paddr_t ram_gpa, int dir, void *p_data)
> +{
> +    paddr_t value = ram_gpa;
> +    struct vcpu *curr = current;
> +    struct cpu_user_regs *regs = guest_cpu_user_regs();
> +
> +    /*
> +     * Weird-sized accesses have undefined behaviour: we discard writes
> +     * and read all-ones.
> +     */
> +    if ( unlikely((size > sizeof(long)) || (size & (size - 1))) )

I think you can safely ASSERT() here - PIO instructions never have operand sizes not matching the criteria above.

> +    {
> +        gdprintk(XENLOG_WARNING, "bad mmio size %d\n", size);
> +        ASSERT(p_data != NULL); /* cannot happen with a REP prefix */
> +        if ( dir == IOREQ_READ )
> +            memset(p_data, ~0, size);
> +        return X86EMUL_UNHANDLEABLE;
> +    }
> +
> +    if ( dir == IOREQ_WRITE ) {
> +        if ( (p_data != NULL) )

Coding style (two instances).

> +        {
> +            memcpy(&value, p_data, size);
> +            p_data = NULL;
> +        }
> +
> +        if ( dir == IOREQ_WRITE )
> +            trace_io_assist(0, dir, 1, port, value);

Indentation (or really pointless if()).

> +
> +        guest_io_write(port, size, value, curr, regs);
> +    }
> +    else
> +    {
> +        value = guest_io_read(port, size, curr, regs);
> +        trace_io_assist(0, dir, 1, port, value);
> +        if ( (p_data != NULL) )

Coding style again (sort of at least).

> +            memcpy(p_data, &value, size);
> +        memcpy(&regs->eax, &value, size);

What is this being matched by in (a) the HVM equivalent and (b) the write code path? And even if needed, this surely wouldn't be correct for the size == 4 case (where the upper 32 bits of any destination register get zeroed).

Hmm, now that I take a second look, I see that this apparently originates from handle_pio() (which however does the reading of ->eax as well), so the above comment actually points out a bug there (which I'm going to prepare a patch for right away).

> +    }
> +
> +    return X86EMUL_OKAY;
> +}
> +
> +
>  int hvmemul_do_pio(
>      unsigned long port, unsigned long *reps, int size,
>      paddr_t ram_gpa, int dir, int df, void *p_data)
>  {
> -    return hvmemul_do_io(0, port, reps, size, ram_gpa, dir, df, p_data);
> +    return is_hvm_vcpu(current) ?
> +        hvmemul_do_io(0, port, reps, size, ram_gpa, dir, df, p_data) :
> +        pvhemul_do_pio(port, size, ram_gpa, dir, p_data);

You're losing "reps" and "df" here.

Jan
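The size == 4 point above can be illustrated with a small C sketch of how the result of an IN instruction is merged into the 64-bit accumulator on x86-64: 1- and 2-byte results replace only the low bits, while a 4-byte result zero-extends into the whole register. The function name merge_in_result() is hypothetical and this is not the fix that was eventually applied to handle_pio(); it only shows the architectural rule the comment refers to.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: compute the new RAX after an IN of 'size' bytes
 * returned 'value', given the old RAX contents.
 */
static inline uint64_t merge_in_result(uint64_t rax, uint64_t value,
                                       unsigned size)
{
    uint64_t mask;

    /* Port I/O operand sizes are only ever 1, 2 or 4 bytes. */
    assert(size == 1 || size == 2 || size == 4);

    /* A 32-bit register write clears bits 63:32 of the destination. */
    if (size == 4)
        return (uint32_t)value;

    /* 8- and 16-bit writes leave the upper bits of the register intact. */
    mask = (1ull << (size * 8)) - 1;
    return (rax & ~mask) | (value & mask);
}
```

By contrast, a plain 4-byte memcpy() into the register image on a little-endian host leaves bits 63:32 at their old value, which is the discrepancy being pointed out.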
Roger Pau Monné
2013-Nov-05 10:57 UTC
Re: [PATCH v14 02/17] libxc: Move temporary grant table mapping to end of memory
Ccing tools maintainers.

On 04/11/13 13:14, George Dunlap wrote:
> From: Roger Pau Monné <roger.pau@citrix.com>
>
> In order to set up the grant table for HVM guests, libxc needs to map
> the grant table temporarily. At the moment, it does this by adding the
> grant page to the HVM guest's p2m table in the MMIO hole (at gfn 0xFFFFE),
> then mapping that gfn, setting up the table, then unmapping the gfn and
> removing it from the p2m table.
>
> This breaks with PVH guests with 4G or more of ram, because there is
> no MMIO hole; so it ends up clobbering a valid RAM p2m entry, then
> leaving a "hole" when it removes the grant map from the p2m table.
> Since the guest thinks this is normal ram, when it maps it and tries
> to access the page, it crashes.
>
> This patch maps the page at max_gfn+1 instead.
>
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> ---
>  tools/libxc/xc_dom.h      |    3 ---
>  tools/libxc/xc_dom_boot.c |   14 ++++++++++++--
>  2 files changed, 12 insertions(+), 5 deletions(-)
>
> diff --git a/tools/libxc/xc_dom.h b/tools/libxc/xc_dom.h
> index 86e23ee..935b49e 100644
> --- a/tools/libxc/xc_dom.h
> +++ b/tools/libxc/xc_dom.h
> @@ -18,9 +18,6 @@
>
>  #define INVALID_P2M_ENTRY   ((xen_pfn_t)-1)
>
> -/* Scrach PFN for temporary mappings in HVM */
> -#define SCRATCH_PFN_GNTTAB 0xFFFFE
> -
>  /* --- typedefs and structs ---------------------------------------- */
>
>  typedef uint64_t xen_vaddr_t;
> diff --git a/tools/libxc/xc_dom_boot.c b/tools/libxc/xc_dom_boot.c
> index 71e1897..fdfeaf8 100644
> --- a/tools/libxc/xc_dom_boot.c
> +++ b/tools/libxc/xc_dom_boot.c
> @@ -361,17 +361,27 @@ int xc_dom_gnttab_hvm_seed(xc_interface *xch, domid_t domid,
>                             domid_t xenstore_domid)
>  {
>      int rc;
> +    xen_pfn_t max_gfn;
>      struct xen_add_to_physmap xatp = {
>          .domid = domid,
>          .space = XENMAPSPACE_grant_table,
>          .idx = 0,
> -        .gpfn = SCRATCH_PFN_GNTTAB
>      };
>      struct xen_remove_from_physmap xrfp = {
>          .domid = domid,
> -        .gpfn = SCRATCH_PFN_GNTTAB
>      };
>
> +    max_gfn = xc_domain_maximum_gpfn(xch, domid);
> +    if ( max_gfn <= 0 ) {
> +        xc_dom_panic(xch, XC_INTERNAL_ERROR,
> +                     "%s: failed to get max gfn "
> +                     "[errno=%d]\n",
> +                     __FUNCTION__, errno);
> +        return -1;
> +    }
> +    xatp.gpfn = max_gfn + 1;
> +    xrfp.gpfn = max_gfn + 1;
> +
>      rc = do_memory_op(xch, XENMEM_add_to_physmap, &xatp, sizeof(xatp));
>      if ( rc != 0 )
>      {
>
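The arithmetic behind the 4G failure described in the commit message can be sketched in a few lines of C. This assumes a simplified guest whose RAM is a single contiguous run starting at gfn 0 with no MMIO hole, and the helper names below are illustrative rather than libxc functions.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT      12
#define OLD_SCRATCH_GFN 0xFFFFEull  /* fixed gfn used before the patch */

/* Highest gfn of a guest whose RAM is one contiguous run starting at 0. */
static inline uint64_t max_gfn_for_ram(uint64_t ram_bytes)
{
    assert(ram_bytes >= (1ull << PAGE_SHIFT));
    return (ram_bytes >> PAGE_SHIFT) - 1;
}

/* Nonzero if gfn falls inside that contiguous RAM run. */
static inline int gfn_is_ram(uint64_t gfn, uint64_t max_gfn)
{
    return gfn <= max_gfn;
}

/* Scratch gfn chosen the way the patch describes: one past the end. */
static inline uint64_t scratch_gfn(uint64_t max_gfn)
{
    return max_gfn + 1;
}
```

Under these assumptions a 4GiB guest has max_gfn 0xFFFFF, so the old fixed gfn 0xFFFFE sits inside RAM, while max_gfn + 1 (0x100000) does not. A 2GiB guest never reaches 0xFFFFE, which fits the observation that the crash only appears at 4GiB and above.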
Ian Campbell
2013-Nov-05 11:01 UTC
Re: [PATCH v14 02/17] libxc: Move temporary grant table mapping to end of memory
On Tue, 2013-11-05 at 11:57 +0100, Roger Pau Monné wrote:
> Ccing tools maintainers.
>
> On 04/11/13 13:14, George Dunlap wrote:
> > From: Roger Pau Monné <roger.pau@citrix.com>
> >
> > In order to set up the grant table for HVM guests, libxc needs to map
> > the grant table temporarily. At the moment, it does this by adding the
> > grant page to the HVM guest's p2m table in the MMIO hole (at gfn 0xFFFFE),
> > then mapping that gfn, setting up the table, then unmapping the gfn and
> > removing it from the p2m table.
> >
> > This breaks with PVH guests with 4G or more of ram, because there is
> > no MMIO hole; so it ends up clobbering a valid RAM p2m entry, then
> > leaving a "hole" when it removes the grant map from the p2m table.
> > Since the guest thinks this is normal ram, when it maps it and tries
> > to access the page, it crashes.
> >
> > This patch maps the page at max_gfn+1 instead.
> >
> > Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>

Acked-by: Ian Campbell <ian.campbell@citrix.com>
At 12:14 +0000 on 04 Nov (1383563694), George Dunlap wrote:
> --- a/xen/common/domain.c
> +++ b/xen/common/domain.c
> @@ -239,6 +239,17 @@ struct domain *domain_create(
>
>      if ( domcr_flags & DOMCRF_hvm )
>          d->guest_type = guest_type_hvm;
> +    else if ( domcr_flags & DOMCRF_pvh )
> +    {
> +        if ( !(domcr_flags & DOMCRF_hap) )
> +        {
> +            err = -EOPNOTSUPP;
> +            printk(XENLOG_INFO "PVH guest must have HAP on\n");
> +            goto fail;
> +        }
> +        d->guest_type = guest_type_pvh;
> +        printk("Creating PVH guest d%d\n", d->domain_id);
> +    }

This check seems like it should be in arch-specific code. If it were in arch_domain_create(), it would also correctly handle the case where the tools asked for PVH+HAP but HAP wasn't available.

Tim.
Tim Deegan
2013-Nov-06 23:54 UTC
Re: [PATCH v14 06/17] pvh: Disable unneeded features of HVM containers
At 12:14 +0000 on 04 Nov (1383563695), George Dunlap wrote:
> Things kept:
> * cacheattr_region lists
> * irq-related structures
> * paging
> * tm_list
> * hvm params
>
> Things disabled for now:
> * compat xlation
>
> Things disabled:
> * Emulated timers and clock sources
> * IO/MMIO io requests
> * msix tables
> * hvm_funcs
> * nested HVM
> * Fast-path for emulated lapic accesses
>
> Getting rid of the hvm_params struct required a couple other places to
> check for its existence before attempting to read the params.
>
> Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>
> Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>

> @@ -528,10 +536,16 @@ int hvm_domain_initialise(struct domain *d)
>      if ( rc != 0 )
>          goto fail0;
>
> +    rc = -ENOMEM;
>      d->arch.hvm_domain.params = xzalloc_array(uint64_t, HVM_NR_PARAMS);
> +    if ( !d->arch.hvm_domain.params )
> +        goto fail1;
> +
> +    if ( is_pvh_domain(d) )
> +        return 0;

Doesn't this skip hvm_init_cacheattr_region_list() and paging_enable(), which are on your list of things to keep for PVH guests?

Tim.
At 12:14 +0000 on 04 Nov (1383563696), George Dunlap wrote:
> +    if ( is_pvh_domain(d) )
> +    {
> +        /* Disable virtual apics, TPR */
> +        v->arch.hvm_vmx.secondary_exec_control &=
> +            ~(SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES
> +              | SECONDARY_EXEC_APIC_REGISTER_VIRT
> +              | SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY);
> +        v->arch.hvm_vmx.exec_control &= ~CPU_BASED_TPR_SHADOW;
> +
> +        /* Disable wbinvd (only necessary for MMIO),
> +         * unrestricted guest (real mode for EPT) */
> +        v->arch.hvm_vmx.secondary_exec_control &=
> +            ~(SECONDARY_EXEC_UNRESTRICTED_GUEST
> +              | SECONDARY_EXEC_WBINVD_EXITING);

WBINVD exiting is used for supporting _real_ MMIO, which PVH guests
will still have, right?

> +    if ( is_pvh_domain(d) )
> +        vmx_disable_intercept_for_msr(v, MSR_SHADOW_GS_BASE, MSR_TYPE_R | MSR_TYPE_W);
> +
> +    /*
> +     * PVH: We don't disable intercepts for MSRs: MSR_STAR, MSR_LSTAR,
> +     * MSR_CSTAR, and MSR_SYSCALL_MASK because we need to specify
> +     * save/restore area to save/restore at every VM exit and entry.
> +     * Instead, let the intercept functions save them into
> +     * vmx_msr_state fields. See comment in vmx_restore_host_msrs().
> +     * See also vmx_restore_guest_msrs().
> +     */

Why are these MSRs special for PVH guests?  Are PVH guests restricted
in how they can use SHADOW_GS?

Tim.
At 16:42 +0000 on 04 Nov (1383579750), Jan Beulich wrote:
> >>> On 04.11.13 at 13:14, George Dunlap <george.dunlap@eu.citrix.com> wrote:
> > +    if ( is_pvh_domain(d) )
> > +        v->arch.hvm_vmx.vmx_realmode = 0;
>
> Rather than doing this here, wouldn't it be more clean to suppress
> this getting set to 1 in the first place?

I think this can probably just be dropped -- the call to
hvm_update_guest_cr(v, 0) above should have DTRT with the vmx_realmode
flag.

Tim.
At 12:14 +0000 on 04 Nov (1383563689), George Dunlap wrote:> Other "open issues" on the design (which need not stop the series > going in) include: > > - Whether a completely separate mode is necessary, or whether having > just having HVM mode with some flags to disable / change certain > functionality would be better > > - Interface-wise: Right now PVH is special-cased for bringing up > CPUs. Is this what we want to do going forward, or would it be better > to try to make it more like PV (which was tried before and is hard), or more > like HVM (which would involve having emulated APICs, &c &c).As we discussed in Edinburgh, I think that (a) we should just have HVM mode with some special flags and (b) completing the PV cpu bringup path would be fine. But we agreed to let this design go in and fix those things afterwards. So on that basis, patches 1-4 and 8-20 are Acked-by: Tim Deegan <tim@xen.org>. I commented on 5, 6 and 7 separately. Cheers, Tim.
Jan Beulich
2013-Nov-07 09:00 UTC
Re: [PATCH v14 06/17] pvh: Disable unneeded features of HVM containers
>>> On 07.11.13 at 00:54, Tim Deegan <tim@xen.org> wrote:
>> @@ -528,10 +536,16 @@ int hvm_domain_initialise(struct domain *d)
>>      if ( rc != 0 )
>>          goto fail0;
>>
>> +    rc = -ENOMEM;
>>      d->arch.hvm_domain.params = xzalloc_array(uint64_t, HVM_NR_PARAMS);
>> +    if ( !d->arch.hvm_domain.params )
>> +        goto fail1;
>> +
>> +    if ( is_pvh_domain(d) )
>> +        return 0;
>
> Doesn't this skip hvm_init_cacheattr_region_list() and
> paging_enable(), which are on your list of things to keep for PVH guests?

No, patch 02 moved this up before the patch context seen
here.

Jan
On 04/11/13 16:14, Jan Beulich wrote:
>>>> On 04.11.13 at 13:14, George Dunlap <george.dunlap@eu.citrix.com> wrote:
>> --- a/xen/arch/x86/hvm/hvm.c
>> +++ b/xen/arch/x86/hvm/hvm.c
>> @@ -522,27 +522,27 @@ int hvm_domain_initialise(struct domain *d)
>>      spin_lock_init(&d->arch.hvm_domain.irq_lock);
>>      spin_lock_init(&d->arch.hvm_domain.uc_lock);
>>
>> -    INIT_LIST_HEAD(&d->arch.hvm_domain.msixtbl_list);
>> -    spin_lock_init(&d->arch.hvm_domain.msixtbl_list_lock);
> While I can see the need for moving stuff so that it gets done
> earlier - why do these two lines need to be moved _down_?
> Even if PVH wasn't using the MSI-X support code HVM needs,
> I can't see them doing any harm.

Right -- sorry, when shuffling things around a lot (which I did while
trying to figure out how I had broken HVM guests in v13) you miss this
sort of thing.  I'll un-do this hunk.

 -George
George Dunlap
2013-Nov-07 10:55 UTC
Re: [PATCH v14 04/17] Introduce pv guest type and has_hvm_container macros
On 04/11/13 16:20, Jan Beulich wrote:
>>>> On 04.11.13 at 13:14, George Dunlap <george.dunlap@eu.citrix.com> wrote:
>> The goal of this patch is to classify conditionals more clearly, as to
>> whether they relate to pv guests, hvm-only guests, or guests with an
>> "hvm container" (which will eventually include PVH).
>>
>> This patch introduces an enum for guest type, as well as two new macros
>> for switching behavior on and off: is_pv_* and has_hvm_container_*.  At the
>> moment is_pv_* <=> !has_hvm_container_*.  The purpose of having two is that
>> it seems to me different to take a path because something does *not* have PV
>> structures as to take a path because it *does* have HVM structures, even if
>> the two happen to coincide 100% at the moment.  The exact usage is
>> occasionally a bit fuzzy though, and a judgement call just needs to be made
>> on which is clearer.
>>
>> In general, a switch should use is_pv_* (or !is_pv_*) if the code in
>> question relates directly to a PV guest.  Examples include use of pv_vcpu
>> structs or other behavior directly related to PV domains.
>>
>> hvm_container is more of a fuzzy concept, but in general:
> So sadly this is still being retained, despite its redundancy.

Jan,

Given that the long-term plan is to get rid of the extra mode entirely
(at which point has_hvm_container will also go away), are you OK with
this patchset going in as it is?

There are a lot of individual places to change, and I'm a bit afraid
that a large change like that will introduce some bugs I'll have to
spend more time tracking down...  I'd rather just do all the changes
when we get rid of the extra mode, if possible.

 -George
Jan Beulich
2013-Nov-07 11:04 UTC
Re: [PATCH v14 04/17] Introduce pv guest type and has_hvm_container macros
>>> On 07.11.13 at 11:55, George Dunlap <george.dunlap@eu.citrix.com> wrote:
> On 04/11/13 16:20, Jan Beulich wrote:
>>>>> On 04.11.13 at 13:14, George Dunlap <george.dunlap@eu.citrix.com> wrote:
>>> hvm_container is more of a fuzzy concept, but in general:
>> So sadly this is still being retained, despite its redundancy.
>
> Given that the long-term plan is to get rid of the extra mode entirely
> (at which point has_hvm_container will also go away), are you OK with
> this patchset going in as it is?

Yes, I am. I solely expressed that I'm not particularly happy with it.

Jan
George Dunlap
2013-Nov-07 11:11 UTC
Re: [PATCH v14 04/17] Introduce pv guest type and has_hvm_container macros
On 07/11/13 11:04, Jan Beulich wrote:
>>>> On 07.11.13 at 11:55, George Dunlap <george.dunlap@eu.citrix.com> wrote:
>> On 04/11/13 16:20, Jan Beulich wrote:
>>>>>> On 04.11.13 at 13:14, George Dunlap <george.dunlap@eu.citrix.com> wrote:
>>>> hvm_container is more of a fuzzy concept, but in general:
>>> So sadly this is still being retained, despite its redundancy.
>> Given that the long-term plan is to get rid of the extra mode entirely
>> (at which point has_hvm_container will also go away), are you OK with
>> this patchset going in as it is?
> Yes, I am. I solely expressed that I'm not particularly happy with it.

OK -- just wanting to make sure this wouldn't be interpreted as a
"Nack". :-)

 -George
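For readers following the is_pv_* versus has_hvm_container_* discussion, here is a rough sketch of the classification being debated. It is not the patch itself: `struct domain` is a one-field stand-in, the enum ordering is an assumption, `structs_for` is a hypothetical caller, and `guest_type_pvh` only appears later in the series.

```c
#include <assert.h>
#include <stddef.h>

/* Guest types as named in the series; the ordering is an assumption. */
enum guest_type { guest_type_pv, guest_type_pvh, guest_type_hvm };

/* Minimal stand-in for Xen's struct domain (hypothetical). */
struct domain { enum guest_type guest_type; };

#define is_pv_domain(d)             ((d)->guest_type == guest_type_pv)
#define is_pvh_domain(d)            ((d)->guest_type == guest_type_pvh)
#define is_hvm_domain(d)            ((d)->guest_type == guest_type_hvm)
/* Anything that is not PV lives inside an HVM container. */
#define has_hvm_container_domain(d) ((d)->guest_type != guest_type_pv)

/* Hypothetical caller: names the per-vcpu structures a path may touch. */
const char *structs_for(const struct domain *d)
{
    assert(d != NULL);
    return has_hvm_container_domain(d) ? "hvm_vcpu" : "pv_vcpu";
}
```

In this sketch `has_hvm_container_domain()` is exactly `!is_pv_domain()`, which is the redundancy Jan objects to; the two only differ in which property the calling code means to test.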
On 06/11/13 23:28, Tim Deegan wrote:
> At 12:14 +0000 on 04 Nov (1383563694), George Dunlap wrote:
>> --- a/xen/common/domain.c
>> +++ b/xen/common/domain.c
>> @@ -239,6 +239,17 @@ struct domain *domain_create(
>>
>>      if ( domcr_flags & DOMCRF_hvm )
>>          d->guest_type = guest_type_hvm;
>> +    else if ( domcr_flags & DOMCRF_pvh )
>> +    {
>> +        if ( !(domcr_flags & DOMCRF_hap) )
>> +        {
>> +            err = -EOPNOTSUPP;
>> +            printk(XENLOG_INFO "PVH guest must have HAP on\n");
>> +            goto fail;
>> +        }
>> +        d->guest_type = guest_type_pvh;
>> +        printk("Creating PVH guest d%d\n", d->domain_id);
>> +    }
> This check seems like it should be in arch-specific code.  If it
> were in arch_domain_create(), it would also correctly handle the
> case where the tools asked for PVH+HAP but HAP wasn't available.

Looking at the HVM case, this (and Jan's comments on the vmcs patch from
v13) should probably be handled the same way: i.e., at start of day, see
if we have the necessary hardware support to run in pvh mode; and set
"pvh_enabled" (analog to hvm_enabled) accordingly.  Then we can check
this in hvm_domain_initialise() just as we do for hvm guests.

 -George
On 07/11/13 00:27, Tim Deegan wrote:
> At 12:14 +0000 on 04 Nov (1383563696), George Dunlap wrote:
>> +    if ( is_pvh_domain(d) )
>> +    {
>> +        /* Disable virtual apics, TPR */
>> +        v->arch.hvm_vmx.secondary_exec_control &=
>> +            ~(SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES
>> +              | SECONDARY_EXEC_APIC_REGISTER_VIRT
>> +              | SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY);
>> +        v->arch.hvm_vmx.exec_control &= ~CPU_BASED_TPR_SHADOW;
>> +
>> +        /* Disable wbinvd (only necessary for MMIO),
>> +         * unrestricted guest (real mode for EPT) */
>> +        v->arch.hvm_vmx.secondary_exec_control &=
>> +            ~(SECONDARY_EXEC_UNRESTRICTED_GUEST
>> +              | SECONDARY_EXEC_WBINVD_EXITING);
> WBINVD exiting is used for supporting _real_ MMIO, which PVH guests
> will still have, right?
>
>> +    if ( is_pvh_domain(d) )
>> +        vmx_disable_intercept_for_msr(v, MSR_SHADOW_GS_BASE, MSR_TYPE_R | MSR_TYPE_W);
>> +
>> +    /*
>> +     * PVH: We don't disable intercepts for MSRs: MSR_STAR, MSR_LSTAR,
>> +     * MSR_CSTAR, and MSR_SYSCALL_MASK because we need to specify
>> +     * save/restore area to save/restore at every VM exit and entry.
>> +     * Instead, let the intercept functions save them into
>> +     * vmx_msr_state fields. See comment in vmx_restore_host_msrs().
>> +     * See also vmx_restore_guest_msrs().
>> +     */
> Why are these MSRs special for PVH guests?  Are PVH guests restricted
> in how they can use SHADOW_GS?

Your real question is, why is GS_BASE *less* restricted for PVH mode:
in HVM mode (as far as I can tell), we exit on accesses to
MSR_SHADOW_GS_BASE.

It looks like the others are trapped because updating them is rare and
saving / restoring them on every context switch would be expensive.  But
according to a comment in vmx.c:

     /*
      * We cannot cache SHADOW_GS_BASE while the VCPU runs, as it can
      * be updated at any time via SWAPGS, which we cannot trap.
      */

So SHADOW_GS_BASE is read and written on every context switch.

Is it OK for PVH not to exit here?  If so, do we actually need to do it
in HVM mode, or is that an artifact of doing things differently once upon
a time?

FWIW, at the moment, it looks like the trap for SHADOW_GS_BASE is
pointless for HVM as well -- all the handler does is pass through the
read or write without doing anything else -- not even updating
v->arch.hvm_vmx.shadow_gs.  SHADOW_GS_BASE is saved & restored
unconditionally on a context switch, so I think we probably could just
stop intercepting it.

Or, for this series, I think I'll take out the special case, and
separately send a patch to disable the intercept for SHADOW_GS_BASE for
all HVM domains.

 -George
On 07/11/13 14:50, George Dunlap wrote:
> On 07/11/13 00:27, Tim Deegan wrote:
>> At 12:14 +0000 on 04 Nov (1383563696), George Dunlap wrote:
>>> +    if ( is_pvh_domain(d) )
>>> +    {
>>> +        /* Disable virtual apics, TPR */
>>> +        v->arch.hvm_vmx.secondary_exec_control &=
>>> +            ~(SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES
>>> +              | SECONDARY_EXEC_APIC_REGISTER_VIRT
>>> +              | SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY);
>>> +        v->arch.hvm_vmx.exec_control &= ~CPU_BASED_TPR_SHADOW;
>>> +
>>> +        /* Disable wbinvd (only necessary for MMIO),
>>> +         * unrestricted guest (real mode for EPT) */
>>> +        v->arch.hvm_vmx.secondary_exec_control &=
>>> +            ~(SECONDARY_EXEC_UNRESTRICTED_GUEST
>>> +              | SECONDARY_EXEC_WBINVD_EXITING);
>> WBINVD exiting is used for supporting _real_ MMIO, which PVH guests
>> will still have, right?
>>
>>> +    if ( is_pvh_domain(d) )
>>> +        vmx_disable_intercept_for_msr(v, MSR_SHADOW_GS_BASE,
>>> MSR_TYPE_R | MSR_TYPE_W);
>>> +
>>> +    /*
>>> +     * PVH: We don't disable intercepts for MSRs: MSR_STAR, MSR_LSTAR,
>>> +     * MSR_CSTAR, and MSR_SYSCALL_MASK because we need to specify
>>> +     * save/restore area to save/restore at every VM exit and entry.
>>> +     * Instead, let the intercept functions save them into
>>> +     * vmx_msr_state fields. See comment in vmx_restore_host_msrs().
>>> +     * See also vmx_restore_guest_msrs().
>>> +     */
>> Why are these MSRs special for PVH guests?  Are PVH guests restricted
>> in how they can use SHADOW_GS?
>
> Your real question is, why is GS_BASE *less* restricted for PVH mode:
> in HVM mode (as far as I can tell), we exit on accesses to
> MSR_SHADOW_GS_BASE.

As far as this exiting goes, Paul and I looked at it and considered it
bogus in context.  We have turned it off in XenServer trunk and are
waiting for XenRT to test it thoroughly before formally upstreaming the
change.

A partner has indicated that it leads to an order of magnitude
performance degradation for 64bit windows which appears to rewrite
GS_BASE on every context switch.

~Andrew

> It looks like the others are trapped because updating them is rare
> and saving / restoring them on every context switch would be
> expensive.  But according to a comment in vmx.c:
>
>     /*
>      * We cannot cache SHADOW_GS_BASE while the VCPU runs, as it can
>      * be updated at any time via SWAPGS, which we cannot trap.
>      */
>
> So SHADOW_GS_BASE is read and written on every context switch.
>
> Is it OK for PVH not to exit here?  If so, do we actually need to do
> it in HVM mode, or is that an artifact of doing things differently
> once upon a time?
>
> FWIW, at the moment, it looks like the trap for SHADOW_GS_BASE is
> pointless for HVM as well -- all the handler does is pass through the
> read or write without doing anything else -- not even updating
> v->arch.hvm_vmx.shadow_gs.  SHADOW_GS_BASE is saved & restored
> unconditionally on a context switch, so I think we probably could just
> stop intercepting it.
>
> Or, for this series, I think I'll take out the special case, and
> separately send a patch to disable the intercept for SHADOW_GS_BASE
> for all HVM domains.
>
>  -George
On 07/11/13 15:40, Andrew Cooper wrote:
> On 07/11/13 14:50, George Dunlap wrote:
>> On 07/11/13 00:27, Tim Deegan wrote:
>>> At 12:14 +0000 on 04 Nov (1383563696), George Dunlap wrote:
>>>> +    if ( is_pvh_domain(d) )
>>>> +    {
>>>> +        /* Disable virtual apics, TPR */
>>>> +        v->arch.hvm_vmx.secondary_exec_control &=
>>>> +            ~(SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES
>>>> +              | SECONDARY_EXEC_APIC_REGISTER_VIRT
>>>> +              | SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY);
>>>> +        v->arch.hvm_vmx.exec_control &= ~CPU_BASED_TPR_SHADOW;
>>>> +
>>>> +        /* Disable wbinvd (only necessary for MMIO),
>>>> +         * unrestricted guest (real mode for EPT) */
>>>> +        v->arch.hvm_vmx.secondary_exec_control &=
>>>> +            ~(SECONDARY_EXEC_UNRESTRICTED_GUEST
>>>> +              | SECONDARY_EXEC_WBINVD_EXITING);
>>> WBINVD exiting is used for supporting _real_ MMIO, which PVH guests
>>> will still have, right?
>>>
>>>> +    if ( is_pvh_domain(d) )
>>>> +        vmx_disable_intercept_for_msr(v, MSR_SHADOW_GS_BASE,
>>>> MSR_TYPE_R | MSR_TYPE_W);
>>>> +
>>>> +    /*
>>>> +     * PVH: We don't disable intercepts for MSRs: MSR_STAR, MSR_LSTAR,
>>>> +     * MSR_CSTAR, and MSR_SYSCALL_MASK because we need to specify
>>>> +     * save/restore area to save/restore at every VM exit and entry.
>>>> +     * Instead, let the intercept functions save them into
>>>> +     * vmx_msr_state fields. See comment in vmx_restore_host_msrs().
>>>> +     * See also vmx_restore_guest_msrs().
>>>> +     */
>>> Why are these MSRs special for PVH guests?  Are PVH guests restricted
>>> in how they can use SHADOW_GS?
>> Your real question is, why is GS_BASE *less* restricted for PVH mode:
>> in HVM mode (as far as I can tell), we exit on accesses to
>> MSR_SHADOW_GS_BASE.
> As far as this exiting goes, Paul and I looked at it and considered it
> bogus in context.  We have turned it off in XenServer trunk and are
> waiting for XenRT to test it thoroughly before formally upstreaming the
> change.
>
> A partner has indicated that it leads to an order of magnitude
> performance degradation for 64bit windows which appears to rewrite
> GS_BASE on every context switch.

Excellent -- I'll take it off my list.

 -George
George Dunlap
2013-Nov-07 15:51 UTC
Re: [PATCH v14 11/17] pvh: Set up more PV stuff in set_info_guest
On 04/11/13 16:53, Jan Beulich wrote:
>>>> On 04.11.13 at 13:15, George Dunlap <george.dunlap@eu.citrix.com> wrote:
>> @@ -728,8 +740,21 @@ int arch_set_info_guest(
>>
>>      if ( has_hvm_container_vcpu(v) )
>>      {
>> -        hvm_set_info_guest(v);
>> -        goto out;
>> +        hvm_set_info_guest(v, compat ? 0 : c.nat->gs_base_kernel);
> I'm afraid this isn't correct - so far gs_base_kernel didn't get used
> for HVM guests, i.e. you're changing behavior here (even if only
> in a - presumably - benign way).
>
>> +
>> +        if ( is_hvm_vcpu(v) || v->is_initialised )
>> +            goto out;
>> +
>> +        cr3_gfn = xen_cr3_to_pfn(c.nat->ctrlreg[3]);
> I'd recommend against using this PV construct - the 32-bit
> counterpart won't be correct to be used here once 32-bit
> support gets added.

So the plan would be that once we support 32-bit, I'd just copy the code
from below:

     if ( !compat )
         cr3_gfn = xen_cr3_to_pfn(c.nat->ctrlreg[3]);
     else
         cr3_gfn = compat_cr3_to_pfn(c.cmp->ctrlreg[3]);

But since we know that compat is false here, it seems a bit silly to
have the if() statement.

But there should be a "PVH 32bitfixme" here -- is that enough for now?

>
>> @@ -1426,6 +1426,11 @@ static void vmx_set_info_guest(struct vcpu *v)
>>          __vmwrite(GUEST_INTERRUPTIBILITY_INFO, intr_shadow);
>>      }
>>
>> +    /* PVH 32bitfixme */
>> +    if ( is_pvh_vcpu(v) )
>> +        __vmwrite(GUEST_GS_BASE, gs_base_kernel);
> Oh, I see, you suppress this here.  I'd really suggest adjusting the
> caller, then you don't need to do anything here afaict.

What do you mean "adjusting the caller"?  What we want for HVM guests is
for this field to be entirely left alone, isn't it?

If we set GUEST_GS_BASE unconditionally here, the only way to effect "no
change" is to read it and pass in the existing value, which seems kind
of pointless.

 -George
Jan Beulich
2013-Nov-07 16:10 UTC
Re: [PATCH v14 11/17] pvh: Set up more PV stuff in set_info_guest
>>> On 07.11.13 at 16:51, George Dunlap <george.dunlap@eu.citrix.com> wrote:
> On 04/11/13 16:53, Jan Beulich wrote:
>>>>> On 04.11.13 at 13:15, George Dunlap <george.dunlap@eu.citrix.com> wrote:
>>> @@ -728,8 +740,21 @@ int arch_set_info_guest(
>>>
>>>      if ( has_hvm_container_vcpu(v) )
>>>      {
>>> -        hvm_set_info_guest(v);
>>> -        goto out;
>>> +        hvm_set_info_guest(v, compat ? 0 : c.nat->gs_base_kernel);
>> I'm afraid this isn't correct - so far gs_base_kernel didn't get used
>> for HVM guests, i.e. you're changing behavior here (even if only
>> in a - presumably - benign way).
>>
>>> +
>>> +        if ( is_hvm_vcpu(v) || v->is_initialised )
>>> +            goto out;
>>> +
>>> +        cr3_gfn = xen_cr3_to_pfn(c.nat->ctrlreg[3]);
>> I'd recommend against using this PV construct - the 32-bit
>> counterpart won't be correct to be used here once 32-bit
>> support gets added.
>
> So the plan would be that once we support 32-bit, I'd just copy the code
> from below:
>
>      if ( !compat )
>          cr3_gfn = xen_cr3_to_pfn(c.nat->ctrlreg[3]);
>      else
>          cr3_gfn = compat_cr3_to_pfn(c.cmp->ctrlreg[3]);
>
> But since we know that compat is false here, it seems a bit silly to
> have the if() statement.
>
> But there should be a "PVH 32bitfixme" here -- is that enough for now?

Sure, but that wasn't my point. I was recommending against
*_cr3_to_pfn() altogether here. Just use the control register
value shifted right by 12 bits.

>>> @@ -1426,6 +1426,11 @@ static void vmx_set_info_guest(struct vcpu *v)
>>>          __vmwrite(GUEST_INTERRUPTIBILITY_INFO, intr_shadow);
>>>      }
>>>
>>> +    /* PVH 32bitfixme */
>>> +    if ( is_pvh_vcpu(v) )
>>> +        __vmwrite(GUEST_GS_BASE, gs_base_kernel);
>> Oh, I see, you suppress this here. I'd really suggest adjusting the
>> caller, then you don't need to do anything here afaict.
>
> What do you mean "adjusting the caller"?  What we want for HVM guests is
> for this field to be entirely left alone, isn't it?
>
> If we set GUEST_GS_BASE unconditionally here, the only way to effect "no
> change" is to read it and pass in the existing value, which seems kind
> of pointless.

Oh, right, I didn't pay attention to the calling path also being
used for HVM, and us not wanting to write zero here in that
case. Or maybe I did, but concluded that the code can be used
only for initial state setup, in which case writing zero would be
benign.

Jan
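Jan's suggestion to shift the control register value right by 12 bits can be sketched as below. This is only an illustration for the 64-bit case the series currently supports; `cr3_to_gfn` and `cr3_to_gfn_selfcheck` are hypothetical names, not functions from the Xen tree.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12

/* Frame number of the top-level page table named by a 64-bit guest CR3. */
uint64_t cr3_to_gfn(uint64_t cr3)
{
    /* The low 12 bits carry flags, not address bits, and are dropped. */
    return cr3 >> PAGE_SHIFT;
}

/* Small self-check showing that flag bits do not affect the result. */
void cr3_to_gfn_selfcheck(void)
{
    assert(cr3_to_gfn(0x12345000ULL) == 0x12345ULL);
    assert(cr3_to_gfn(0x12345018ULL) == 0x12345ULL);
}
```

A plain shift keeps this path independent of the PV-specific *_cr3_to_pfn() helpers, which is the point Jan raises.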
George Dunlap
2013-Nov-07 16:33 UTC
Re: [PATCH v14 11/17] pvh: Set up more PV stuff in set_info_guest
On 07/11/13 16:10, Jan Beulich wrote:
>>>> On 07.11.13 at 16:51, George Dunlap <george.dunlap@eu.citrix.com> wrote:
>> On 04/11/13 16:53, Jan Beulich wrote:
>>>>>> On 04.11.13 at 13:15, George Dunlap <george.dunlap@eu.citrix.com> wrote:
>>>> @@ -728,8 +740,21 @@ int arch_set_info_guest(
>>>>
>>>>      if ( has_hvm_container_vcpu(v) )
>>>>      {
>>>> -        hvm_set_info_guest(v);
>>>> -        goto out;
>>>> +        hvm_set_info_guest(v, compat ? 0 : c.nat->gs_base_kernel);
>>> I'm afraid this isn't correct - so far gs_base_kernel didn't get used
>>> for HVM guests, i.e. you're changing behavior here (even if only
>>> in a - presumably - benign way).
>>>
>>>> +
>>>> +        if ( is_hvm_vcpu(v) || v->is_initialised )
>>>> +            goto out;
>>>> +
>>>> +        cr3_gfn = xen_cr3_to_pfn(c.nat->ctrlreg[3]);
>>> I'd recommend against using this PV construct - the 32-bit
>>> counterpart won't be correct to be used here once 32-bit
>>> support gets added.
>> So the plan would be that once we support 32-bit, I'd just copy the code
>> from below:
>>
>>      if ( !compat )
>>          cr3_gfn = xen_cr3_to_pfn(c.nat->ctrlreg[3]);
>>      else
>>          cr3_gfn = compat_cr3_to_pfn(c.cmp->ctrlreg[3]);
>>
>> But since we know that compat is false here, it seems a bit silly to
>> have the if() statement.
>>
>> But there should be a "PVH 32bitfixme" here -- is that enough for now?
> Sure, but that wasn't my point. I was recommending against
> *_cr3_to_pfn() altogether here. Just use the control register
> value shifted right by 12 bits.

Oh, right, I see.

>
>>>> @@ -1426,6 +1426,11 @@ static void vmx_set_info_guest(struct vcpu *v)
>>>>          __vmwrite(GUEST_INTERRUPTIBILITY_INFO, intr_shadow);
>>>>      }
>>>>
>>>> +    /* PVH 32bitfixme */
>>>> +    if ( is_pvh_vcpu(v) )
>>>> +        __vmwrite(GUEST_GS_BASE, gs_base_kernel);
>>> Oh, I see, you suppress this here. I'd really suggest adjusting the
>>> caller, then you don't need to do anything here afaict.
>> What do you mean "adjusting the caller"?  What we want for HVM guests is
>> for this field to be entirely left alone, isn't it?
>>
>> If we set GUEST_GS_BASE unconditionally here, the only way to effect "no
>> change" is to read it and pass in the existing value, which seems kind
>> of pointless.
> Oh, right, I didn't pay attention to the calling path also being
> used for HVM, and us not wanting to write zero here in that
> case. Or maybe I did, but concluded that the code can be used
> only for initial state setup, in which case writing zero would be
> benign.

I guess we could do that... but since we're talking about changing the
interface here anyway, it's kind of nitpicking at this point.

 -George
George Dunlap
2013-Nov-07 16:50 UTC
Re: [PATCH v14 12/17] pvh: Use PV handlers for cpuid, and IO
On 05/11/13 08:42, Jan Beulich wrote:
>>>> On 04.11.13 at 13:15, George Dunlap <george.dunlap@eu.citrix.com> wrote:
>> @@ -140,6 +146,9 @@ static int hvmemul_do_io(
>>          }
>>      }
>>
>> +    if ( is_pvh_vcpu(curr) )
>> +        ASSERT(vio->io_state == HVMIO_none);
> Can we really get here for PVH?

Nope -- sorry I missed that one. :-)

>
>> +static int pvhemul_do_pio(
>> +    unsigned long port, int size, paddr_t ram_gpa, int dir, void *p_data)
>> +{
>> +    paddr_t value = ram_gpa;
>> +    struct vcpu *curr = current;
>> +    struct cpu_user_regs *regs = guest_cpu_user_regs();
>> +
>> +    /*
>> +     * Weird-sized accesses have undefined behaviour: we discard writes
>> +     * and read all-ones.
>> +     */
>> +    if ( unlikely((size > sizeof(long)) || (size & (size - 1))) )
> I think you can safely ASSERT() here - PIO instructions never have
> operand sizes not matching the criteria above.
>
>> +    {
>> +        gdprintk(XENLOG_WARNING, "bad mmio size %d\n", size);
>> +        ASSERT(p_data != NULL); /* cannot happen with a REP prefix */
>> +        if ( dir == IOREQ_READ )
>> +            memset(p_data, ~0, size);
>> +        return X86EMUL_UNHANDLEABLE;
>> +    }
>> +
>> +    if ( dir == IOREQ_WRITE ) {
>> +        if ( (p_data != NULL) )
> Coding style (two instances).
>
>> +        {
>> +            memcpy(&value, p_data, size);
>> +            p_data = NULL;
>> +        }
>> +
>> +        if ( dir == IOREQ_WRITE )
>> +            trace_io_assist(0, dir, 1, port, value);
> Indentation (or really pointless if()).

Oops...

>
>> +
>> +        guest_io_write(port, size, value, curr, regs);
>> +    }
>> +    else
>> +    {
>> +        value = guest_io_read(port, size, curr, regs);
>> +        trace_io_assist(0, dir, 1, port, value);
>> +        if ( (p_data != NULL) )
> Coding style again (sort of at least).
>
>> +            memcpy(p_data, &value, size);
>> +        memcpy(&regs->eax, &value, size);
> What is this being matched by in (a) the HVM equivalent and (b)
> the write code path? And even if needed, this surely wouldn't
> be correct for the size == 4 case (where the upper 32 bits of
> any destination register get zeroed).
>
> Hmm, now that I take a second look, I see that this apparently
> originates from handle_pio() (which however does the reading
> of ->eax as well), so the above comment actually points out a
> bug there (which I'm going to prepare a patch for right away).
>
>> +    }
>> +
>> +    return X86EMUL_OKAY;
>> +}
>> +
>> +
>>  int hvmemul_do_pio(
>>      unsigned long port, unsigned long *reps, int size,
>>      paddr_t ram_gpa, int dir, int df, void *p_data)
>>  {
>> -    return hvmemul_do_io(0, port, reps, size, ram_gpa, dir, df, p_data);
>> +    return is_hvm_vcpu(current) ?
>> +        hvmemul_do_io(0, port, reps, size, ram_gpa, dir, df, p_data) :
>> +        pvhemul_do_pio(port, size, ram_gpa, dir, p_data);
> You're losing "reps" and "df" here.

Hmm... yes.  Time to do some re-thinking on this one.

 -George
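Jan's point about the size == 4 case can be shown with a small standalone sketch. It is not the eventual fix: `pio_size_ok` and `merge_in_result` are hypothetical helpers that model how the result of an IN should be merged into a 64-bit accumulator, assuming the usual x86-64 rule that 8- and 16-bit writes preserve the remaining bits while a 32-bit write zero-extends.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Port I/O operands are 1, 2 or 4 bytes wide. */
bool pio_size_ok(unsigned int size)
{
    return size == 1 || size == 2 || size == 4;
}

/*
 * Merge the result of an IN into a 64-bit accumulator value.
 * 8- and 16-bit results leave the remaining bits untouched;
 * a 32-bit result zero-extends into the upper half.
 */
uint64_t merge_in_result(uint64_t rax, uint32_t value, unsigned int size)
{
    assert(pio_size_ok(size));

    switch ( size )
    {
    case 1:
        return (rax & ~0xffULL) | (value & 0xffu);
    case 2:
        return (rax & ~0xffffULL) | (value & 0xffffu);
    default: /* size == 4 */
        return value;
    }
}
```

A plain memcpy() of `size` bytes into the register, as in the quoted hunk, matches the first two cases but leaves stale upper bits in the 4-byte case.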
At 11:21 +0000 on 07 Nov (1383819695), George Dunlap wrote:
> On 06/11/13 23:28, Tim Deegan wrote:
> > At 12:14 +0000 on 04 Nov (1383563694), George Dunlap wrote:
> >> --- a/xen/common/domain.c
> >> +++ b/xen/common/domain.c
> >> @@ -239,6 +239,17 @@ struct domain *domain_create(
> >>
> >>      if ( domcr_flags & DOMCRF_hvm )
> >>          d->guest_type = guest_type_hvm;
> >> +    else if ( domcr_flags & DOMCRF_pvh )
> >> +    {
> >> +        if ( !(domcr_flags & DOMCRF_hap) )
> >> +        {
> >> +            err = -EOPNOTSUPP;
> >> +            printk(XENLOG_INFO "PVH guest must have HAP on\n");
> >> +            goto fail;
> >> +        }
> >> +        d->guest_type = guest_type_pvh;
> >> +        printk("Creating PVH guest d%d\n", d->domain_id);
> >> +    }
> > This check seems like it should be in arch-specific code.  If it
> > were in arch_domain_create(), it would also correctly handle the
> > case where the tools asked for PVH+HAP but HAP wasn't available.
>
> Looking at the HVM case, this (and Jan's comments on the vmcs patch from
> v13) should probably be handled the same way: i.e., at start of day, see
> if we have the necessary hardware support to run in pvh mode; and set
> "pvh_enabled" (analog to hvm_enabled) accordingly.  Then we can check
> this in hvm_domain_initialise() just as we do for hvm guests.

Yep, that sounds like a good idea.

Tim.
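The "pvh_enabled" idea agreed above might look roughly like the sketch below. Everything apart from the names `hvm_enabled` and `pvh_enabled` is an assumption made for illustration: `struct hw_caps`, `probe_hw`, `container_domain_check` and the specific error values are hypothetical and do not come from the series.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

enum guest_type { guest_type_pv, guest_type_pvh, guest_type_hvm };

/* Hypothetical summary of what the hardware offers. */
struct hw_caps { bool vmx; bool hap; };

bool hvm_enabled;
bool pvh_enabled;   /* analogue of hvm_enabled, as proposed in the thread */

/* Run once at start of day. */
void probe_hw(const struct hw_caps *caps)
{
    assert(caps != NULL);
    hvm_enabled = caps->vmx;
    pvh_enabled = caps->vmx && caps->hap;   /* PVH requires HAP */
}

/* Stand-in for the check made when a container domain is initialised. */
int container_domain_check(enum guest_type type)
{
    if ( type == guest_type_hvm && !hvm_enabled )
        return -EINVAL;       /* error value is an assumption */
    if ( type == guest_type_pvh && !pvh_enabled )
        return -EOPNOTSUPP;   /* mirrors the error used in the quoted hunk */
    return 0;
}
```

Checking a flag computed once at boot also covers the case Tim mentions, where the tools ask for PVH+HAP but the hardware has no HAP.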
[SHADOW_GS_BASE]

At 14:50 +0000 on 07 Nov (1383832220), George Dunlap wrote:
> Or, for this series, I think I'll take out the special case, and
> separately send a patch to disable the intercept for SHADOW_GS_BASE for
> all HVM domains.

Great!

Tim.
Tim Deegan
2013-Nov-07 17:02 UTC
Re: [PATCH v14 06/17] pvh: Disable unneeded features of HVM containers
At 09:00 +0000 on 07 Nov (1383811221), Jan Beulich wrote:
> >>> On 07.11.13 at 00:54, Tim Deegan <tim@xen.org> wrote:
> >> @@ -528,10 +536,16 @@ int hvm_domain_initialise(struct domain *d)
> >>      if ( rc != 0 )
> >>          goto fail0;
> >>
> >> +    rc = -ENOMEM;
> >>      d->arch.hvm_domain.params = xzalloc_array(uint64_t, HVM_NR_PARAMS);
> >> +    if ( !d->arch.hvm_domain.params )
> >> +        goto fail1;
> >> +
> >> +    if ( is_pvh_domain(d) )
> >> +        return 0;
> >
> > Doesn't this skip hvm_init_cacheattr_region_list() and
> > paging_enable(), which are on your list of things to keep for PVH guests?
>
> No, patch 02 moved this up before the patch context seen
> here.

Ah, grand so.

Acked-by: Tim Deegan <tim@xen.org>
On 04/11/13 17:34, Tim Deegan wrote:
> At 17:23 +0000 on 04 Nov (1383582187), George Dunlap wrote:
>> On 04/11/13 16:59, Konrad Rzeszutek Wilk wrote:
>>>> This also means no support for "legacy" forced invalid ops -- only native
>>>> cpuid is supported in this series.
>>> OK.
>> (FWIW, support for legacy forced invalid ops was requested by Tim.)
> I was worried about existing PV kernel code that used the fake-CPUID,
> which would break if the 'core' kernel code went from PV to PVH.  But
> I guess I could be convinced that such kernel code is buggy?  Really,
> the high-order bit was consistency.  The version I commented on
> supported them for user-space but not for kernel, which seemed like
> risking trouble for no benefit.

Oh right -- I think Mukesh we do need to support forced invalid ops for
user space so that we can use the same xen tools binaries on PV and PVH
kernels.

Hmm, Mukesh / Konrad, what tools are you actually thinking about here?
Are these Oracle-specific tools?  I can't seem to find XEN_CPUID or
XEN_EMULATE_PREFIX anywhere in the tools/ directory of the xen repo...

 -George
On 08/11/13 15:41, George Dunlap wrote:
> On 04/11/13 17:34, Tim Deegan wrote:
>> At 17:23 +0000 on 04 Nov (1383582187), George Dunlap wrote:
>>> On 04/11/13 16:59, Konrad Rzeszutek Wilk wrote:
>>>>> This also means no support for "legacy" forced invalid ops -- only
>>>>> native
>>>>> cpuid is supported in this series.
>>>> OK.
>>> (FWIW, support for legacy forced invalid ops was requested by Tim.)
>> I was worried about existing PV kernel code that used the fake-CPUID,
>> which would break if the 'core' kernel code went from PV to PVH.  But
>> I guess I could be convinced that such kernel code is buggy?  Really,
>> the high-order bit was consistency.  The version I commented on
>> supported them for user-space but not for kernel, which seemed like
>> risking trouble for no benefit.
>
> Oh right -- I think Mukesh we do need to support forced invalid ops
> for user space so that we can use the same xen tools binaries on PV
> and PVH kernels.
>
> Hmm, Mukesh / Konrad, what tools are you actually thinking about
> here?  Are these Oracle-specific tools?  I can't seem to find
> XEN_CPUID or XEN_EMULATE_PREFIX anywhere in the tools/ directory of
> the xen repo...

It's misc/xen-detect.c, for the curious (which helpfully does not use
the macros above).

Handling forced invalid ops it is, then.

I think it would make sense to handle them in the plain HVM case as
well -- that way xen-detect could also work inside of PVHVM domains.

 -George
Konrad Rzeszutek Wilk
2013-Nov-08 15:58 UTC
Re: [PATCH v14 00/20] Introduce PVH domU support
On Fri, Nov 08, 2013 at 03:41:23PM +0000, George Dunlap wrote:
> On 04/11/13 17:34, Tim Deegan wrote:
> >At 17:23 +0000 on 04 Nov (1383582187), George Dunlap wrote:
> >>On 04/11/13 16:59, Konrad Rzeszutek Wilk wrote:
> >>>>This also means no support for "legacy" forced invalid ops -- only native
> >>>>cpuid is supported in this series.
> >>>OK.
> >>(FWIW, support for legacy forced invalid ops was requested by Tim.)
> >I was worried about existing PV kernel code that used the fake-CPUID,
> >which would break if the 'core' kernel code went from PV to PVH.  But
> >I guess I could be convinced that such kernel code is buggy?  Really,
> >the high-order bit was consistency.  The version I commented on
> >supported them for user-space but not for kernel, which seemed like
> >risking trouble for no benefit.
>
> Oh right -- I think Mukesh we do need to support forced invalid ops
> for user space so that we can use the same xen tools binaries on PV
> and PVH kernels.
>
> Hmm, Mukesh / Konrad, what tools are you actually thinking about
> here?  Are these Oracle-specific tools?  I can't seem to find
> XEN_CPUID or XEN_EMULATE_PREFIX anywhere in the tools/ directory of
> the xen repo...

xen/tools/misc/xen-detect.c

>
>  -George
At 15:53 +0000 on 08 Nov (1383922404), George Dunlap wrote:
> On 08/11/13 15:41, George Dunlap wrote:
> > On 04/11/13 17:34, Tim Deegan wrote:
> >> At 17:23 +0000 on 04 Nov (1383582187), George Dunlap wrote:
> >>> On 04/11/13 16:59, Konrad Rzeszutek Wilk wrote:
> >>>>> This also means no support for "legacy" forced invalid ops -- only
> >>>>> native cpuid is supported in this series.
> >>>> OK.
> >>> (FWIW, support for legacy forced invalid ops was requested by Tim.)
> >> I was worried about existing PV kernel code that used the fake-CPUID,
> >> which would break if the 'core' kernel code went from PV to PVH. But
> >> I guess I could be convinced that such kernel code is buggy? Really,
> >> the high-order bit was consistency. The version I commented on
> >> supported them for user-space but not for kernel, which seemed like
> >> risking trouble for no benefit.
> >
> > Oh right -- I think Mukesh we do need to support forced invalid ops
> > for user space so that we can use the same xen tools binaries on PV
> > and PVH kernels.
> >
> > Hmm, Mukesh / Konrad, what tools are you actually thinking about
> > here? Are these Oracle-specific tools? I can't seem to find
> > XEN_CPUID or XEN_EMULATE_PREFIX anywhere in the tools/ directory of
> > the xen repo...
>
> It's misc/xen-detect.c, for the curious (which helpfully does not use
> the macros above).
>
> Handling forced invalid ops it is, then.

The xen-detect code tries real CPUID first, so not supporting the fake CPUID doesn't actually affect it at all. That's why I could be convinced that detection failures were a guest-side bug.

It would affect older code that _only_ tried the fake CPUID, or code that for some reason needed to behave differently on PV vs HVM (though it's not at all clear that on PVH such code should use the 'PV' behaviour).

Tim.
On 08/11/13 17:01, Tim Deegan wrote:
> At 15:53 +0000 on 08 Nov (1383922404), George Dunlap wrote:
>> On 08/11/13 15:41, George Dunlap wrote:
>>> On 04/11/13 17:34, Tim Deegan wrote:
>>>> At 17:23 +0000 on 04 Nov (1383582187), George Dunlap wrote:
>>>>> On 04/11/13 16:59, Konrad Rzeszutek Wilk wrote:
>>>>>>> This also means no support for "legacy" forced invalid ops -- only
>>>>>>> native cpuid is supported in this series.
>>>>>> OK.
>>>>> (FWIW, support for legacy forced invalid ops was requested by Tim.)
>>>> I was worried about existing PV kernel code that used the fake-CPUID,
>>>> which would break if the 'core' kernel code went from PV to PVH. But
>>>> I guess I could be convinced that such kernel code is buggy? Really,
>>>> the high-order bit was consistency. The version I commented on
>>>> supported them for user-space but not for kernel, which seemed like
>>>> risking trouble for no benefit.
>>> Oh right -- I think Mukesh we do need to support forced invalid ops
>>> for user space so that we can use the same xen tools binaries on PV
>>> and PVH kernels.
>>>
>>> Hmm, Mukesh / Konrad, what tools are you actually thinking about
>>> here? Are these Oracle-specific tools? I can't seem to find
>>> XEN_CPUID or XEN_EMULATE_PREFIX anywhere in the tools/ directory of
>>> the xen repo...
>> It's misc/xen-detect.c, for the curious (which helpfully does not use
>> the macros above).
>>
>> Handling forced invalid ops it is, then.
> The xen-detect code tries real CPUID first, so not supporting the fake
> CPUID doesn't actually affect it at all. That's why I could be
> convinced that detection failures were a guest-side bug.
>
> It would affect older code that _only_ tried the fake CPUID, or code
> that for some reason needed to behave differently on PV vs HVM (though
> it's not at all clear that on PVH such code should use the 'PV'
> behaviour.

Yes, taking a closer look, xen-detect in a PVH domain will behave the same as in an HVM domain:

    # ./xen-detect
    Running in HVM context on Xen v4.4.

I don't think this is a big deal; to be robust your system should be able to operate properly in an HVM (or PVHVM) domU.

I'll leave this out for now, but I'll put a note in the cover letter and in the doc file so we don't forget about it.

-George
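The probing order discussed in this thread (native CPUID first, the forced-emulation form only as a fallback) can be sketched roughly as below. This is a simplified illustration, not the actual xen-detect.c source: the callback type, `detect_xen`, `is_xen_signature`, and the `fake_*` stand-in probes are hypothetical names, and a real tool would execute CPUID itself (optionally preceded by the `ud2a; .ascii "xen"` prefix) instead of calling function pointers. The sketch assumes the usual Xen convention that hypervisor leaf 0x40000000 returns the signature "XenVMMXenVMM" in EBX/ECX/EDX and that the following leaf returns the version as (major << 16) | minor in EAX; it only looks at the first leaf base.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Registers returned by a single CPUID invocation. */
struct cpuid_regs { uint32_t eax, ebx, ecx, edx; };

/*
 * Hypothetical probe callback: fills *out for the given leaf and returns
 * false if the probe cannot be used at all (for example, the prefixed
 * form is not handled and would fault).
 */
typedef bool (*cpuid_probe_fn)(uint32_t leaf, struct cpuid_regs *out);

enum xen_context { CTX_NONE, CTX_HVM, CTX_PV };

#define XEN_LEAF_BASE 0x40000000u

/* True if EBX/ECX/EDX spell out the Xen hypervisor signature. */
bool is_xen_signature(const struct cpuid_regs *r)
{
    char sig[12];
    memcpy(sig + 0, &r->ebx, 4);
    memcpy(sig + 4, &r->ecx, 4);
    memcpy(sig + 8, &r->edx, 4);
    return memcmp(sig, "XenVMMXenVMM", 12) == 0;
}

/* Run one probe; on success store the version word from leaf base + 1. */
static bool probe_for_xen(cpuid_probe_fn probe, uint32_t *version)
{
    struct cpuid_regs r;

    if (probe == NULL || !probe(XEN_LEAF_BASE, &r) || !is_xen_signature(&r))
        return false;
    if (!probe(XEN_LEAF_BASE + 1, &r))
        return false;
    *version = r.eax;
    return true;
}

/*
 * Native CPUID is tried first; the forced-emulation probe is consulted
 * only if the native one does not show Xen.
 */
enum xen_context detect_xen(cpuid_probe_fn native, cpuid_probe_fn forced,
                            uint32_t *version)
{
    assert(version != NULL);
    if (probe_for_xen(native, version))
        return CTX_HVM;
    if (probe_for_xen(forced, version))
        return CTX_PV;
    return CTX_NONE;
}

/* Stand-in probe (illustration only): behaves like Xen 4.4. */
bool fake_xen_4_4(uint32_t leaf, struct cpuid_regs *out)
{
    memset(out, 0, sizeof(*out));
    if (leaf == XEN_LEAF_BASE) {
        memcpy(&out->ebx, "XenV", 4);
        memcpy(&out->ecx, "MMXe", 4);
        memcpy(&out->edx, "nVMM", 4);
    } else if (leaf == XEN_LEAF_BASE + 1) {
        out->eax = (4u << 16) | 4u;   /* major 4, minor 4 */
    }
    return true;
}

/* Stand-in probe (illustration only): this CPUID form is not available. */
bool fake_unavailable(uint32_t leaf, struct cpuid_regs *out)
{
    (void)leaf;
    (void)out;
    return false;
}
```

With these stand-ins, `detect_xen(fake_xen_4_4, fake_unavailable, &v)` models the PVH case described above: the native probe already sees the Xen leaves, so the fallback is never reached and the result is an HVM context. Only code that tries the prefixed form exclusively, modelled by passing `fake_unavailable` as the native probe, would notice that forced invalid ops are unsupported.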
On 04/11/13 13:14, George Dunlap wrote:
> Updates:
> - Fixed bugs in v14:
>    Zombie domains, FreeBSD crash, Crash at 4GiB, HVM crash
>    (Thank you to Roger Pau Mone for fixes to the last 3)
> - Completely eliminated PV emulation codepath
>
> [...]

I've tested this new series, and all the bugs I've found in the previous version are gone:

Tested-by: Roger Pau Monné <roger.pau@citrix.com>