This patch series is a reworking of a series developed by Mukesh Rathor at Oracle. The entirety of the design and development was done by him; I have only reworked, reorganized, and simplified things in a way that I think makes more sense. The vast majority of the credit for this effort therefore goes to him. This version is labelled v13 because it is based on his most recent series, v11. Because this is based on his work, I retain the "Signed-off-by" in patches which are based on his code. This is not meant to imply that he supports the modified version, only that he is involved in certifying that the origin of the code for copyright purposes. This is still an RFC, and in several places I have kept code as I found it even though I didn''t understand it or didn''t agree with it because I wanted to be able to discuss it. This patch series is broken down into several broad strokes: * Miscellaneous fixes or tweaks * Code motion, so future patches are simpler * Introduction of the "hvm_container" concept, which will form the basis for sharing codepaths between hvm and pvh * Start with PVH as an HVM container * Disable unneeded HVM functionality * Enable PV functionality * Disable not-yet-implemented functionality * Enable toolstack changes required to make PVH guests This patch series can also be pulled from this git tree: git://xenbits.xen.org/people/gdunlap/xen.git out/pvh-v13 The kernel code for PVH guests can be found here: git://oss.oracle.com/git/mrathor/linux.git pvh.v9-muk-1 (That repo/branch also contains a config file, pvh-config-file) Changes in v13 can be found inline; major changes since v12 include: * Include Mukesh''s toolstack patches (v4) * Allocate hvm_param struct for PVH domains; remove patch disabling memevents For those who have been following the series as it develops, here is a summary of the major changes from Mukesh''s series (v11->v12): * Introduction of "has_hvm_container_*()" macros, rather than using "!is_pv_*". The patch which introduces this also does the vast majority of the "heavy lifting" in terms of defining PVH. * Effort is made to use as much common code as possible. No separate vmcs constructor, no separate vmexit handlers. More of a "start with everything and disable if necessary" approach rather than "start with nothing and enable as needed" approach. * One exception is arch_set_info_guest(), where a small amount of code duplication meant a lot fewer "if(!is_pvh_domain())"s in awkward places * I rely on things being disabled at a higher level and passed down. For instance, I no longer explicitly disable rdtsc exiting in construct_vmcs(), since that will happen automatically when we''re in NEVER_EMULATE mode (which is currently enforced for PVH). Similarly for nested vmx and things relating to HAP mode. * I have also done a slightly more extensive audit of is_pv_* and is_hvm_* and tried to do more restrictions. * I changed the "enable PVH by setting PV + HAP", replacing it instead with a separate flag, just like the HVM case, since it makes sense to plan on using shadow in the future (although it is Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com> CC: Mukesh Rathor <mukesh.rathor@oracle.com> CC: Jan Beulich <beulich@suse.com> CC: Tim Deegan <tim@xen.org> CC: Keir Fraser <keir@xen.org> CC: Ian Jackson <ian.jackson@citrix.com> CC: Ian Campbell <ian.campbell@citrix.com>
George Dunlap
2013-Sep-23  16:49 UTC
[PATCH RFC v13 01/20] Allow vmx_update_debug_state to be called when v!=current
Removing the assert allows the PVH code to call this during vmcs
construction in a later patch, making the code more robust by removing
duplicate code.
Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>
----
v13: Add vmx_vmcs_{enter,exit}
CC: Mukesh Rathor <mukesh.rathor@oracle.com>
CC: Jan Beulich <jbeulich@suse.com>
CC: Tim Deegan <tim@xen.org>
CC: Keir Fraser <keir@xen.org>
---
 xen/arch/x86/hvm/vmx/vmx.c |    5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index d6a7710..f9b589b 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -1041,8 +1041,6 @@ void vmx_update_debug_state(struct vcpu *v)
 {
     unsigned long mask;
 
-    ASSERT(v == current);
-
     mask = 1u << TRAP_int3;
     if ( !cpu_has_monitor_trap_flag )
         mask |= 1u << TRAP_debug;
@@ -1051,7 +1049,10 @@ void vmx_update_debug_state(struct vcpu *v)
         v->arch.hvm_vmx.exception_bitmap |= mask;
     else
         v->arch.hvm_vmx.exception_bitmap &= ~mask;
+    
+    vmx_vmcs_enter(v);
     vmx_update_exception_bitmap(v);
+    vmx_vmcs_exit(v);
 }
 
 static void vmx_update_guest_cr(struct vcpu *v, unsigned int cr)
-- 
1.7.9.5
There are many functions where PVH requires some code in common with
HVM.  Rearrange some of these functions so that the code is together.
In general, the HVM code that PVH also uses includes:
 - cacheattr functionality
 - paging
 - hvm_funcs
 - hvm_assert_evtchn_irq tasklet
 - tm_list
 - hvm_params
And code that PVH shares with PV but not with PVH:
 - updating the domain wallclock
 - setting v->is_initialized
There should be no end-to-end changes in behavior.
Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>
Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
---
v13:
- Don''t bother calling tasklet_kill in failure path of
  hvm_vcpu_initialize
- Allocate hvm_params for PVH domains
CC: Jan Beulich <jbeulich@suse.com>
CC: Tim Deegan <tim@xen.org>
CC: Keir Fraser <keir@xen.org>
---
 xen/arch/x86/domain.c  |   11 +++---
 xen/arch/x86/hvm/hvm.c |   92 +++++++++++++++++++++++++-----------------------
 2 files changed, 53 insertions(+), 50 deletions(-)
diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 316ef04..5ebc68c 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -945,17 +945,16 @@ int arch_set_info_guest(
 
     clear_bit(_VPF_in_reset, &v->pause_flags);
 
-    if ( v->vcpu_id == 0 )
-        update_domain_wallclock_time(d);
-
-    /* Don''t redo final setup */
-    v->is_initialised = 1;
-
     if ( paging_mode_enabled(d) )
         paging_update_paging_modes(v);
 
     update_cr3(v);
 
+    if ( v->vcpu_id == 0 )
+        update_domain_wallclock_time(d);
+
+    /* Don''t redo final setup */
+    v->is_initialised = 1;
  out:
     if ( flags & VGCF_online )
         clear_bit(_VPF_down, &v->pause_flags);
diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index e0e0f5d..aaf956a 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -522,27 +522,27 @@ int hvm_domain_initialise(struct domain *d)
     spin_lock_init(&d->arch.hvm_domain.irq_lock);
     spin_lock_init(&d->arch.hvm_domain.uc_lock);
 
-    INIT_LIST_HEAD(&d->arch.hvm_domain.msixtbl_list);
-    spin_lock_init(&d->arch.hvm_domain.msixtbl_list_lock);
+    hvm_init_cacheattr_region_list(d);
+
+    rc = paging_enable(d, PG_refcounts|PG_translate|PG_external);
+    if ( rc != 0 )
+        goto fail0;
 
     d->arch.hvm_domain.params = xzalloc_array(uint64_t, HVM_NR_PARAMS);
     d->arch.hvm_domain.io_handler = xmalloc(struct hvm_io_handler);
     rc = -ENOMEM;
     if ( !d->arch.hvm_domain.params || !d->arch.hvm_domain.io_handler )
-        goto fail0;
+        goto fail1;
     d->arch.hvm_domain.io_handler->num_slot = 0;
 
+    INIT_LIST_HEAD(&d->arch.hvm_domain.msixtbl_list);
+    spin_lock_init(&d->arch.hvm_domain.msixtbl_list_lock);
+
     hvm_init_guest_time(d);
 
     d->arch.hvm_domain.params[HVM_PARAM_HPET_ENABLED] = 1;
     d->arch.hvm_domain.params[HVM_PARAM_TRIPLE_FAULT_REASON] =
SHUTDOWN_reboot;
 
-    hvm_init_cacheattr_region_list(d);
-
-    rc = paging_enable(d, PG_refcounts|PG_translate|PG_external);
-    if ( rc != 0 )
-        goto fail1;
-
     vpic_init(d);
 
     rc = vioapic_init(d);
@@ -569,10 +569,10 @@ int hvm_domain_initialise(struct domain *d)
     stdvga_deinit(d);
     vioapic_deinit(d);
  fail1:
-    hvm_destroy_cacheattr_region_list(d);
- fail0:
     xfree(d->arch.hvm_domain.io_handler);
     xfree(d->arch.hvm_domain.params);
+ fail0:
+    hvm_destroy_cacheattr_region_list(d);
     return rc;
 }
 
@@ -601,11 +601,11 @@ void hvm_domain_relinquish_resources(struct domain *d)
 
 void hvm_domain_destroy(struct domain *d)
 {
+    hvm_destroy_cacheattr_region_list(d);
     hvm_funcs.domain_destroy(d);
     rtc_deinit(d);
     stdvga_deinit(d);
     vioapic_deinit(d);
-    hvm_destroy_cacheattr_region_list(d);
 }
 
 static int hvm_save_tsc_adjust(struct domain *d, hvm_domain_context_t *h)
@@ -1091,24 +1091,46 @@ int hvm_vcpu_initialise(struct vcpu *v)
 {
     int rc;
     struct domain *d = v->domain;
-    domid_t dm_domid = d->arch.hvm_domain.params[HVM_PARAM_DM_DOMAIN];
+    domid_t dm_domid;
 
     hvm_asid_flush_vcpu(v);
 
-    if ( (rc = vlapic_init(v)) != 0 )
+    spin_lock_init(&v->arch.hvm_vcpu.tm_lock);
+    INIT_LIST_HEAD(&v->arch.hvm_vcpu.tm_list);
+
+    rc = hvm_vcpu_cacheattr_init(v); /* t */
+    if ( rc != 0 )
         goto fail1;
 
-    if ( (rc = hvm_funcs.vcpu_initialise(v)) != 0 )
+    if ( (rc = hvm_funcs.vcpu_initialise(v)) != 0 ) /* teardown:
hvm_funcs.vcpu_destroy */
         goto fail2;
 
-    if ( nestedhvm_enabled(d) 
-         && (rc = nestedhvm_vcpu_initialise(v)) < 0 ) 
+    softirq_tasklet_init(
+        &v->arch.hvm_vcpu.assert_evtchn_irq_tasklet,
+        (void(*)(unsigned long))hvm_assert_evtchn_irq,
+        (unsigned long)v);
+
+    v->arch.user_regs.eflags = 2;
+
+    v->arch.hvm_vcpu.inject_trap.vector = -1;
+
+    rc = setup_compat_arg_xlat(v); /* teardown: free_compat_arg_xlat() */
+    if ( rc != 0 )
         goto fail3;
 
+    if ( (rc = vlapic_init(v)) != 0 ) /* teardown: vlapic_destroy */
+        goto fail4;
+
+    if ( nestedhvm_enabled(d) 
+         && (rc = nestedhvm_vcpu_initialise(v)) < 0 ) /* teardown:
nestedhvm_vcpu_destroy */
+        goto fail5;
+
+    dm_domid = d->arch.hvm_domain.params[HVM_PARAM_DM_DOMAIN];
+
     /* Create ioreq event channel. */
-    rc = alloc_unbound_xen_event_channel(v, dm_domid, NULL);
+    rc = alloc_unbound_xen_event_channel(v, dm_domid, NULL); /* teardown: none
*/
     if ( rc < 0 )
-        goto fail4;
+        goto fail6;
 
     /* Register ioreq event channel. */
     v->arch.hvm_vcpu.xen_port = rc;
@@ -1116,9 +1138,9 @@ int hvm_vcpu_initialise(struct vcpu *v)
     if ( v->vcpu_id == 0 )
     {
         /* Create bufioreq event channel. */
-        rc = alloc_unbound_xen_event_channel(v, dm_domid, NULL);
+        rc = alloc_unbound_xen_event_channel(v, dm_domid, NULL); /* teardown:
none */
         if ( rc < 0 )
-            goto fail4;
+            goto fail6;
         d->arch.hvm_domain.params[HVM_PARAM_BUFIOREQ_EVTCHN] = rc;
     }
 
@@ -1127,26 +1149,6 @@ int hvm_vcpu_initialise(struct vcpu *v)
         get_ioreq(v)->vp_eport = v->arch.hvm_vcpu.xen_port;
     spin_unlock(&d->arch.hvm_domain.ioreq.lock);
 
-    spin_lock_init(&v->arch.hvm_vcpu.tm_lock);
-    INIT_LIST_HEAD(&v->arch.hvm_vcpu.tm_list);
-
-    v->arch.hvm_vcpu.inject_trap.vector = -1;
-
-    rc = setup_compat_arg_xlat(v);
-    if ( rc != 0 )
-        goto fail4;
-
-    rc = hvm_vcpu_cacheattr_init(v);
-    if ( rc != 0 )
-        goto fail5;
-
-    softirq_tasklet_init(
-        &v->arch.hvm_vcpu.assert_evtchn_irq_tasklet,
-        (void(*)(unsigned long))hvm_assert_evtchn_irq,
-        (unsigned long)v);
-
-    v->arch.user_regs.eflags = 2;
-
     if ( v->vcpu_id == 0 )
     {
         /* NB. All these really belong in hvm_domain_initialise(). */
@@ -1164,14 +1166,16 @@ int hvm_vcpu_initialise(struct vcpu *v)
 
     return 0;
 
+ fail6:
+    nestedhvm_vcpu_destroy(v);
  fail5:
-    free_compat_arg_xlat(v);
+    vlapic_destroy(v);
  fail4:
-    nestedhvm_vcpu_destroy(v);
+    free_compat_arg_xlat(v);
  fail3:
     hvm_funcs.vcpu_destroy(v);
  fail2:
-    vlapic_destroy(v);
+    hvm_vcpu_cacheattr_destroy(v);
  fail1:
     return rc;
 }
-- 
1.7.9.5
George Dunlap
2013-Sep-23  16:49 UTC
[PATCH RFC v13 03/20] Introduce pv guest type and has_hvm_container macros
The goal of this patch is to classify conditionals more clearly, as to
whether they relate to pv guests, hvm-only guests, or guests with an
"hvm container" (which will eventually include PVH).
This patch introduces an enum for guest type, as well as two new macros
for switching behavior on and off: is_pv_* and has_hvm_container_*.  At the
moment is_pv_* <=> !has_hvm_container_*.  The purpose of having two is
that
it seems to me different to take a path because something does *not* have PV
structures as to take a path because it *does* have HVM structures, even if the
two happen to coincide 100% at the moment.  The exact usage is occasionally a
bit
fuzzy though, and a judgement call just needs to be made on which is clearer.
In general, a switch should use is_pv_* (or !is_pv_*) if the code in question
relates directly to a PV guest.  Examples include use of pv_vcpu structs or
other behavior directly related to PV domains.
hvm_container is more of a fuzzy concept, but in general:
* Most core HVM behavior will be included in this.  Behavior not
appropriate for PVH mode will be disabled in later patches
* Hypercalls related to HVM guests will *not* be included by default;
functionality needed by PVH guests will be enabled in future patches
* The following functionality are not considered part of the HVM
container, and PVH will end up behaving like PV by default: Event
channel, vtsc offset, code related to emulated timers, nested HVM,
emuirq, PoD
* Some features are left to implement for PVH later: vpmu, shadow mode
Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>
Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
---
v13:
- Readjust where we choose to use has_hvm_container or !is_pv (and vice versa)
- Do the memset in arch_set_info_guest unconditionally
- Chech for is_pv in do_page_walk
CC: Jan Beulich <jbeulich@suse.com>
CC: Tim Deegan <tim@xen.org>
CC: Keir Fraser <keir@xen.org>
---
 xen/arch/x86/acpi/suspend.c        |    2 +-
 xen/arch/x86/cpu/mcheck/vmce.c     |    6 ++--
 xen/arch/x86/debug.c               |    2 +-
 xen/arch/x86/domain.c              |   54 ++++++++++++++++++------------------
 xen/arch/x86/domain_page.c         |   10 +++----
 xen/arch/x86/domctl.c              |   11 ++++----
 xen/arch/x86/efi/runtime.c         |    4 +--
 xen/arch/x86/hvm/vmx/vmcs.c        |    4 +--
 xen/arch/x86/mm.c                  |    6 ++--
 xen/arch/x86/mm/shadow/common.c    |    6 ++--
 xen/arch/x86/mm/shadow/multi.c     |    7 +++--
 xen/arch/x86/physdev.c             |    4 +--
 xen/arch/x86/traps.c               |    5 ++--
 xen/arch/x86/x86_64/mm.c           |    3 +-
 xen/arch/x86/x86_64/traps.c        |    8 +++---
 xen/common/domain.c                |    2 +-
 xen/common/grant_table.c           |    4 +--
 xen/common/kernel.c                |    2 +-
 xen/include/asm-x86/domain.h       |    2 +-
 xen/include/asm-x86/event.h        |    2 +-
 xen/include/asm-x86/guest_access.h |   12 ++++----
 xen/include/asm-x86/guest_pt.h     |    4 +--
 xen/include/xen/sched.h            |   14 ++++++++--
 xen/include/xen/tmem_xen.h         |    2 +-
 24 files changed, 93 insertions(+), 83 deletions(-)
diff --git a/xen/arch/x86/acpi/suspend.c b/xen/arch/x86/acpi/suspend.c
index c690b45..f4373bb 100644
--- a/xen/arch/x86/acpi/suspend.c
+++ b/xen/arch/x86/acpi/suspend.c
@@ -78,7 +78,7 @@ void restore_rest_processor_state(void)
     }
 
     /* Maybe load the debug registers. */
-    BUG_ON(is_hvm_vcpu(curr));
+    BUG_ON(!is_pv_vcpu(curr));
     if ( !is_idle_vcpu(curr) && curr->arch.debugreg[7] )
     {
         write_debugreg(0, curr->arch.debugreg[0]);
diff --git a/xen/arch/x86/cpu/mcheck/vmce.c b/xen/arch/x86/cpu/mcheck/vmce.c
index af3b491..f6c35db 100644
--- a/xen/arch/x86/cpu/mcheck/vmce.c
+++ b/xen/arch/x86/cpu/mcheck/vmce.c
@@ -83,7 +83,7 @@ int vmce_restore_vcpu(struct vcpu *v, const struct
hvm_vmce_vcpu *ctxt)
     {
         dprintk(XENLOG_G_ERR, "%s restore: unsupported MCA
capabilities"
                 " %#" PRIx64 " for d%d:v%u (supported:
%#Lx)\n",
-                is_hvm_vcpu(v) ? "HVM" : "PV",
ctxt->caps,
+                has_hvm_container_vcpu(v) ? "HVM" : "PV",
ctxt->caps,
                 v->domain->domain_id, v->vcpu_id,
                 guest_mcg_cap & ~MCG_CAP_COUNT);
         return -EPERM;
@@ -357,7 +357,7 @@ int inject_vmce(struct domain *d, int vcpu)
         if ( vcpu != VMCE_INJECT_BROADCAST && vcpu != v->vcpu_id )
             continue;
 
-        if ( (is_hvm_domain(d) ||
+        if ( (has_hvm_container_domain(d) ||
               guest_has_trap_callback(d, v->vcpu_id, TRAP_machine_check))
&&
              !test_and_set_bool(v->mce_pending) )
         {
@@ -439,7 +439,7 @@ int unmmap_broken_page(struct domain *d, mfn_t mfn, unsigned
long gfn)
     if (!mfn_valid(mfn_x(mfn)))
         return -EINVAL;
 
-    if ( !is_hvm_domain(d) || !paging_mode_hap(d) )
+    if ( !has_hvm_container_domain(d) || !paging_mode_hap(d) )
         return -ENOSYS;
 
     rc = -1;
diff --git a/xen/arch/x86/debug.c b/xen/arch/x86/debug.c
index e67473e..3e21ca8 100644
--- a/xen/arch/x86/debug.c
+++ b/xen/arch/x86/debug.c
@@ -158,7 +158,7 @@ dbg_rw_guest_mem(dbgva_t addr, dbgbyte_t *buf, int len,
struct domain *dp,
 
         pagecnt = min_t(long, PAGE_SIZE - (addr & ~PAGE_MASK), len);
 
-        mfn = (dp->is_hvm
+        mfn = (has_hvm_container_domain(dp)
                ? dbg_hvm_va2mfn(addr, dp, toaddr, &gfn)
                : dbg_pv_va2mfn(addr, dp, pgd3));
         if ( mfn == INVALID_MFN ) 
diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 5ebc68c..ac18aa7 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -167,7 +167,7 @@ void dump_pageframe_info(struct domain *d)
         spin_unlock(&d->page_alloc_lock);
     }
 
-    if ( is_hvm_domain(d) )
+    if ( has_hvm_container_domain(d) )
         p2m_pod_dump_data(d);
 
     spin_lock(&d->page_alloc_lock);
@@ -385,7 +385,7 @@ int vcpu_initialise(struct vcpu *v)
 
     vmce_init_vcpu(v);
 
-    if ( is_hvm_domain(d) )
+    if ( has_hvm_container_domain(d) )
     {
         rc = hvm_vcpu_initialise(v);
         goto done;
@@ -438,7 +438,7 @@ int vcpu_initialise(struct vcpu *v)
     {
         vcpu_destroy_fpu(v);
 
-        if ( !is_hvm_domain(d) )
+        if ( is_pv_domain(d) )
             xfree(v->arch.pv_vcpu.trap_ctxt);
     }
 
@@ -452,7 +452,7 @@ void vcpu_destroy(struct vcpu *v)
 
     vcpu_destroy_fpu(v);
 
-    if ( is_hvm_vcpu(v) )
+    if ( has_hvm_container_vcpu(v) )
         hvm_vcpu_destroy(v);
     else
         xfree(v->arch.pv_vcpu.trap_ctxt);
@@ -464,7 +464,7 @@ int arch_domain_create(struct domain *d, unsigned int
domcr_flags)
     int rc = -ENOMEM;
 
     d->arch.hvm_domain.hap_enabled -        is_hvm_domain(d) &&
+        has_hvm_container_domain(d) &&
         hvm_funcs.hap_supported &&
         (domcr_flags & DOMCRF_hap);
     d->arch.hvm_domain.mem_sharing_enabled = 0;
@@ -490,7 +490,7 @@ int arch_domain_create(struct domain *d, unsigned int
domcr_flags)
                d->domain_id);
     }
 
-    if ( is_hvm_domain(d) )
+    if ( has_hvm_container_domain(d) )
         rc = create_perdomain_mapping(d, PERDOMAIN_VIRT_START, 0, NULL, NULL);
     else if ( is_idle_domain(d) )
         rc = 0;
@@ -512,7 +512,7 @@ int arch_domain_create(struct domain *d, unsigned int
domcr_flags)
     mapcache_domain_init(d);
 
     HYPERVISOR_COMPAT_VIRT_START(d) -        is_hvm_domain(d) ? ~0u :
__HYPERVISOR_COMPAT_VIRT_START;
+        is_pv_domain(d) ? __HYPERVISOR_COMPAT_VIRT_START : ~0u;
 
     if ( (rc = paging_domain_init(d, domcr_flags)) != 0 )
         goto fail;
@@ -554,7 +554,7 @@ int arch_domain_create(struct domain *d, unsigned int
domcr_flags)
             goto fail;
     }
 
-    if ( is_hvm_domain(d) )
+    if ( has_hvm_container_domain(d) )
     {
         if ( (rc = hvm_domain_initialise(d)) != 0 )
         {
@@ -583,14 +583,14 @@ int arch_domain_create(struct domain *d, unsigned int
domcr_flags)
     if ( paging_initialised )
         paging_final_teardown(d);
     free_perdomain_mappings(d);
-    if ( !is_hvm_domain(d) )
+    if ( is_pv_domain(d) )
         free_xenheap_page(d->arch.pv_domain.gdt_ldt_l1tab);
     return rc;
 }
 
 void arch_domain_destroy(struct domain *d)
 {
-    if ( is_hvm_domain(d) )
+    if ( has_hvm_container_domain(d) )
         hvm_domain_destroy(d);
     else
         xfree(d->arch.pv_domain.e820);
@@ -602,7 +602,7 @@ void arch_domain_destroy(struct domain *d)
     paging_final_teardown(d);
 
     free_perdomain_mappings(d);
-    if ( !is_hvm_domain(d) )
+    if ( is_pv_domain(d) )
         free_xenheap_page(d->arch.pv_domain.gdt_ldt_l1tab);
 
     free_xenheap_page(d->shared_info);
@@ -653,7 +653,7 @@ int arch_set_info_guest(
 #define c(fld) (compat ? (c.cmp->fld) : (c.nat->fld))
     flags = c(flags);
 
-    if ( !is_hvm_vcpu(v) )
+    if ( is_pv_vcpu(v) )
     {
         if ( !compat )
         {
@@ -706,7 +706,7 @@ int arch_set_info_guest(
     v->fpu_initialised = !!(flags & VGCF_I387_VALID);
 
     v->arch.flags &= ~TF_kernel_mode;
-    if ( (flags & VGCF_in_kernel) || is_hvm_vcpu(v)/*???*/ )
+    if ( (flags & VGCF_in_kernel) || has_hvm_container_vcpu(v)/*???*/ )
         v->arch.flags |= TF_kernel_mode;
 
     v->arch.vgc_flags = flags;
@@ -721,7 +721,7 @@ int arch_set_info_guest(
     if ( !compat )
     {
         memcpy(&v->arch.user_regs, &c.nat->user_regs,
sizeof(c.nat->user_regs));
-        if ( !is_hvm_vcpu(v) )
+        if ( is_pv_vcpu(v) )
             memcpy(v->arch.pv_vcpu.trap_ctxt, c.nat->trap_ctxt,
                    sizeof(c.nat->trap_ctxt));
     }
@@ -737,7 +737,7 @@ int arch_set_info_guest(
 
     v->arch.user_regs.eflags |= 2;
 
-    if ( is_hvm_vcpu(v) )
+    if ( has_hvm_container_vcpu(v) )
     {
         hvm_set_info_guest(v);
         goto out;
@@ -966,7 +966,7 @@ int arch_set_info_guest(
 
 int arch_vcpu_reset(struct vcpu *v)
 {
-    if ( !is_hvm_vcpu(v) )
+    if ( is_pv_vcpu(v) )
     {
         destroy_gdt(v);
         return vcpu_destroy_pagetables(v);
@@ -1316,7 +1316,7 @@ static void update_runstate_area(struct vcpu *v)
 
 static inline int need_full_gdt(struct vcpu *v)
 {
-    return (!is_hvm_vcpu(v) && !is_idle_vcpu(v));
+    return (is_pv_vcpu(v) && !is_idle_vcpu(v));
 }
 
 static void __context_switch(void)
@@ -1442,9 +1442,9 @@ void context_switch(struct vcpu *prev, struct vcpu *next)
     {
         __context_switch();
 
-        if ( !is_hvm_vcpu(next) &&
+        if ( is_pv_vcpu(next) &&
              (is_idle_vcpu(prev) ||
-              is_hvm_vcpu(prev) ||
+              has_hvm_container_vcpu(prev) ||
               is_pv_32on64_vcpu(prev) != is_pv_32on64_vcpu(next)) )
         {
             uint64_t efer = read_efer();
@@ -1455,13 +1455,13 @@ void context_switch(struct vcpu *prev, struct vcpu
*next)
         /* Re-enable interrupts before restoring state which may fault. */
         local_irq_enable();
 
-        if ( !is_hvm_vcpu(next) )
+        if ( is_pv_vcpu(next) )
         {
             load_LDT(next);
             load_segments(next);
         }
 
-        set_cpuid_faulting(!is_hvm_vcpu(next) &&
+        set_cpuid_faulting(is_pv_vcpu(next) &&
                            (next->domain->domain_id != 0));
     }
 
@@ -1544,7 +1544,7 @@ void hypercall_cancel_continuation(void)
     }
     else
     {
-        if ( !is_hvm_vcpu(current) )
+        if ( is_pv_vcpu(current) )
             regs->eip += 2; /* skip re-execute ''syscall'' /
''int $xx'' */
         else
             current->arch.hvm_vcpu.hcall_preempted = 0;
@@ -1581,12 +1581,12 @@ unsigned long hypercall_create_continuation(
         regs->eax  = op;
 
         /* Ensure the hypercall trap instruction is re-executed. */
-        if ( !is_hvm_vcpu(current) )
+        if ( is_pv_vcpu(current) )
             regs->eip -= 2;  /* re-execute ''syscall'' /
''int $xx'' */
         else
             current->arch.hvm_vcpu.hcall_preempted = 1;
 
-        if ( !is_hvm_vcpu(current) ?
+        if ( is_pv_vcpu(current) ?
              !is_pv_32on64_vcpu(current) :
              (hvm_guest_x86_mode(current) == 8) )
         {
@@ -1858,7 +1858,7 @@ int domain_relinquish_resources(struct domain *d)
                 return ret;
         }
 
-        if ( !is_hvm_domain(d) )
+        if ( is_pv_domain(d) )
         {
             for_each_vcpu ( d, v )
             {
@@ -1931,7 +1931,7 @@ int domain_relinquish_resources(struct domain *d)
         BUG();
     }
 
-    if ( is_hvm_domain(d) )
+    if ( has_hvm_container_domain(d) )
         hvm_domain_relinquish_resources(d);
 
     return 0;
@@ -2015,7 +2015,7 @@ void vcpu_mark_events_pending(struct vcpu *v)
     if ( already_pending )
         return;
 
-    if ( is_hvm_vcpu(v) )
+    if ( has_hvm_container_vcpu(v) )
         hvm_assert_evtchn_irq(v);
     else
         vcpu_kick(v);
diff --git a/xen/arch/x86/domain_page.c b/xen/arch/x86/domain_page.c
index bc18263..3903952 100644
--- a/xen/arch/x86/domain_page.c
+++ b/xen/arch/x86/domain_page.c
@@ -35,7 +35,7 @@ static inline struct vcpu *mapcache_current_vcpu(void)
      * then it means we are running on the idle domain''s page table
and must
      * therefore use its mapcache.
      */
-    if ( unlikely(pagetable_is_null(v->arch.guest_table)) &&
!is_hvm_vcpu(v) )
+    if ( unlikely(pagetable_is_null(v->arch.guest_table)) &&
is_pv_vcpu(v) )
     {
         /* If we really are idling, perform lazy context switch now. */
         if ( (v = idle_vcpu[smp_processor_id()]) == current )
@@ -72,7 +72,7 @@ void *map_domain_page(unsigned long mfn)
 #endif
 
     v = mapcache_current_vcpu();
-    if ( !v || is_hvm_vcpu(v) )
+    if ( !v || !is_pv_vcpu(v) )
         return mfn_to_virt(mfn);
 
     dcache = &v->domain->arch.pv_domain.mapcache;
@@ -177,7 +177,7 @@ void unmap_domain_page(const void *ptr)
     ASSERT(va >= MAPCACHE_VIRT_START && va < MAPCACHE_VIRT_END);
 
     v = mapcache_current_vcpu();
-    ASSERT(v && !is_hvm_vcpu(v));
+    ASSERT(v && is_pv_vcpu(v));
 
     dcache = &v->domain->arch.pv_domain.mapcache;
     ASSERT(dcache->inuse);
@@ -244,7 +244,7 @@ int mapcache_domain_init(struct domain *d)
     struct mapcache_domain *dcache = &d->arch.pv_domain.mapcache;
     unsigned int bitmap_pages;
 
-    if ( is_hvm_domain(d) || is_idle_domain(d) )
+    if ( !is_pv_domain(d) || is_idle_domain(d) )
         return 0;
 
 #ifdef NDEBUG
@@ -275,7 +275,7 @@ int mapcache_vcpu_init(struct vcpu *v)
     unsigned int ents = d->max_vcpus * MAPCACHE_VCPU_ENTRIES;
     unsigned int nr = PFN_UP(BITS_TO_LONGS(ents) * sizeof(long));
 
-    if ( is_hvm_vcpu(v) || !dcache->inuse )
+    if ( !is_pv_vcpu(v) || !dcache->inuse )
         return 0;
 
     if ( ents > dcache->entries )
diff --git a/xen/arch/x86/domctl.c b/xen/arch/x86/domctl.c
index e75918a..9531a16 100644
--- a/xen/arch/x86/domctl.c
+++ b/xen/arch/x86/domctl.c
@@ -800,7 +800,7 @@ long arch_do_domctl(
         if ( domctl->cmd == XEN_DOMCTL_get_ext_vcpucontext )
         {
             evc->size = sizeof(*evc);
-            if ( !is_hvm_domain(d) )
+            if ( is_pv_domain(d) )
             {
                 evc->sysenter_callback_cs                          
v->arch.pv_vcpu.sysenter_callback_cs;
@@ -833,7 +833,7 @@ long arch_do_domctl(
             ret = -EINVAL;
             if ( evc->size < offsetof(typeof(*evc), vmce) )
                 goto ext_vcpucontext_out;
-            if ( !is_hvm_domain(d) )
+            if ( is_pv_domain(d) )
             {
                 if ( !is_canonical_address(evc->sysenter_callback_eip) ||
                      !is_canonical_address(evc->syscall32_callback_eip) )
@@ -1246,8 +1246,7 @@ void arch_get_info_guest(struct vcpu *v,
vcpu_guest_context_u c)
     bool_t compat = is_pv_32on64_domain(v->domain);
 #define c(fld) (!compat ? (c.nat->fld) : (c.cmp->fld))
 
-    if ( is_hvm_vcpu(v) )
-        memset(c.nat, 0, sizeof(*c.nat));
+    memset(c.nat, 0, sizeof(*c.nat));
     memcpy(&c.nat->fpu_ctxt, v->arch.fpu_ctxt,
sizeof(c.nat->fpu_ctxt));
     c(flags = v->arch.vgc_flags & ~(VGCF_i387_valid|VGCF_in_kernel));
     if ( v->fpu_initialised )
@@ -1257,7 +1256,7 @@ void arch_get_info_guest(struct vcpu *v,
vcpu_guest_context_u c)
     if ( !compat )
     {
         memcpy(&c.nat->user_regs, &v->arch.user_regs,
sizeof(c.nat->user_regs));
-        if ( !is_hvm_vcpu(v) )
+        if ( is_pv_vcpu(v) )
             memcpy(c.nat->trap_ctxt, v->arch.pv_vcpu.trap_ctxt,
                    sizeof(c.nat->trap_ctxt));
     }
@@ -1272,7 +1271,7 @@ void arch_get_info_guest(struct vcpu *v,
vcpu_guest_context_u c)
     for ( i = 0; i < ARRAY_SIZE(v->arch.debugreg); ++i )
         c(debugreg[i] = v->arch.debugreg[i]);
 
-    if ( is_hvm_vcpu(v) )
+    if ( has_hvm_container_vcpu(v) )
     {
         struct segment_register sreg;
 
diff --git a/xen/arch/x86/efi/runtime.c b/xen/arch/x86/efi/runtime.c
index 37bb535..d7c884b 100644
--- a/xen/arch/x86/efi/runtime.c
+++ b/xen/arch/x86/efi/runtime.c
@@ -52,7 +52,7 @@ unsigned long efi_rs_enter(void)
     /* prevent fixup_page_fault() from doing anything */
     irq_enter();
 
-    if ( !is_hvm_vcpu(current) && !is_idle_vcpu(current) )
+    if ( is_pv_vcpu(current) && !is_idle_vcpu(current) )
     {
         struct desc_ptr gdt_desc = {
             .limit = LAST_RESERVED_GDT_BYTE,
@@ -71,7 +71,7 @@ unsigned long efi_rs_enter(void)
 void efi_rs_leave(unsigned long cr3)
 {
     write_cr3(cr3);
-    if ( !is_hvm_vcpu(current) && !is_idle_vcpu(current) )
+    if ( is_pv_vcpu(current) && !is_idle_vcpu(current) )
     {
         struct desc_ptr gdt_desc = {
             .limit = LAST_RESERVED_GDT_BYTE,
diff --git a/xen/arch/x86/hvm/vmx/vmcs.c b/xen/arch/x86/hvm/vmx/vmcs.c
index 0be5616..cf54d18 100644
--- a/xen/arch/x86/hvm/vmx/vmcs.c
+++ b/xen/arch/x86/hvm/vmx/vmcs.c
@@ -637,7 +637,7 @@ void vmx_vmcs_exit(struct vcpu *v)
     {
         /* Don''t confuse vmx_do_resume (for @v or @current!) */
         vmx_clear_vmcs(v);
-        if ( is_hvm_vcpu(current) )
+        if ( has_hvm_container_vcpu(current) )
             vmx_load_vmcs(current);
 
         spin_unlock(&v->arch.hvm_vmx.vmcs_lock);
@@ -1462,7 +1462,7 @@ static void vmcs_dump(unsigned char ch)
 
     for_each_domain ( d )
     {
-        if ( !is_hvm_domain(d) )
+        if ( !has_hvm_container_domain(d) )
             continue;
         printk("\n>>> Domain %d <<<\n",
d->domain_id);
         for_each_vcpu ( d, v )
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index e7f0e13..120599a 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -181,7 +181,7 @@ static uint32_t base_disallow_mask;
      (rangeset_is_empty((d)->iomem_caps) &&                     \
       rangeset_is_empty((d)->arch.ioport_caps) &&               \
       !has_arch_pdevs(d) &&                                     \
-      !is_hvm_domain(d)) ?                                      \
+      is_pv_domain(d)) ?                                        \
      L1_DISALLOW_MASK : (L1_DISALLOW_MASK & ~PAGE_CACHE_ATTRS))
 
 static void __init init_frametable_chunk(void *start, void *end)
@@ -433,7 +433,7 @@ int page_is_ram_type(unsigned long mfn, unsigned long
mem_type)
 
 unsigned long domain_get_maximum_gpfn(struct domain *d)
 {
-    if ( is_hvm_domain(d) )
+    if ( has_hvm_container_domain(d) )
         return p2m_get_hostp2m(d)->max_mapped_pfn;
     /* NB. PV guests specify nr_pfns rather than max_pfn so we adjust here. */
     return (arch_get_max_pfn(d) ?: 1) - 1;
@@ -2379,7 +2379,7 @@ static int __get_page_type(struct page_info *page,
unsigned long type,
     {
         /* Special pages should not be accessible from devices. */
         struct domain *d = page_get_owner(page);
-        if ( d && !is_hvm_domain(d) && unlikely(need_iommu(d))
)
+        if ( d && is_pv_domain(d) && unlikely(need_iommu(d)) )
         {
             if ( (x & PGT_type_mask) == PGT_writable_page )
                 iommu_unmap_page(d, mfn_to_gmfn(d, page_to_mfn(page)));
diff --git a/xen/arch/x86/mm/shadow/common.c b/xen/arch/x86/mm/shadow/common.c
index adffa06..0bfa595 100644
--- a/xen/arch/x86/mm/shadow/common.c
+++ b/xen/arch/x86/mm/shadow/common.c
@@ -367,7 +367,7 @@ const struct x86_emulate_ops *shadow_init_emulation(
     sh_ctxt->ctxt.regs = regs;
     sh_ctxt->ctxt.force_writeback = 0;
 
-    if ( !is_hvm_vcpu(v) )
+    if ( is_pv_vcpu(v) )
     {
         sh_ctxt->ctxt.addr_size = sh_ctxt->ctxt.sp_size = BITS_PER_LONG;
         return &pv_shadow_emulator_ops;
@@ -964,7 +964,7 @@ int sh_unsync(struct vcpu *v, mfn_t gmfn)
     if ( pg->shadow_flags & 
          ((SHF_page_type_mask & ~SHF_L1_ANY) | SHF_out_of_sync) 
          || sh_page_has_multiple_shadows(pg)
-         || !is_hvm_domain(v->domain)
+         || is_pv_domain(v->domain)
          || !v->domain->arch.paging.shadow.oos_active )
         return 0;
 
@@ -2753,7 +2753,7 @@ static void sh_update_paging_modes(struct vcpu *v)
     if ( v->arch.paging.mode )
         v->arch.paging.mode->shadow.detach_old_tables(v);
 
-    if ( !is_hvm_domain(d) )
+    if ( is_pv_domain(d) )
     {
         ///
         /// PV guest
diff --git a/xen/arch/x86/mm/shadow/multi.c b/xen/arch/x86/mm/shadow/multi.c
index 4c4c2ba..a17386f 100644
--- a/xen/arch/x86/mm/shadow/multi.c
+++ b/xen/arch/x86/mm/shadow/multi.c
@@ -711,8 +711,9 @@ _sh_propagate(struct vcpu *v,
     // PV guests in 64-bit mode use two different page tables for user vs
     // supervisor permissions, making the guest''s _PAGE_USER bit
irrelevant.
     // It is always shadowed as present...
-    if ( (GUEST_PAGING_LEVELS == 4) && !is_pv_32on64_domain(d) 
-         && !is_hvm_domain(d) )
+    if ( (GUEST_PAGING_LEVELS == 4)
+         && !is_pv_32on64_domain(d) 
+         && is_pv_domain(d) )
     {
         sflags |= _PAGE_USER;
     }
@@ -3918,7 +3919,7 @@ sh_update_cr3(struct vcpu *v, int do_locking)
 #endif
 
     /* Don''t do anything on an uninitialised vcpu */
-    if ( !is_hvm_domain(d) && !v->is_initialised )
+    if ( is_pv_domain(d) && !v->is_initialised )
     {
         ASSERT(v->arch.cr3 == 0);
         return;
diff --git a/xen/arch/x86/physdev.c b/xen/arch/x86/physdev.c
index 4835ed7..dab6213 100644
--- a/xen/arch/x86/physdev.c
+++ b/xen/arch/x86/physdev.c
@@ -310,10 +310,10 @@ ret_t do_physdev_op(int cmd, XEN_GUEST_HANDLE_PARAM(void)
arg)
             spin_unlock(&v->domain->event_lock);
             break;
         }
-        if ( !is_hvm_domain(v->domain) &&
+        if ( is_pv_domain(v->domain) &&
              v->domain->arch.pv_domain.auto_unmask )
             evtchn_unmask(pirq->evtchn);
-        if ( !is_hvm_domain(v->domain) ||
+        if ( is_pv_domain(v->domain) ||
              domain_pirq_to_irq(v->domain, eoi.irq) > 0 )
             pirq_guest_eoi(pirq);
         if ( is_hvm_domain(v->domain) &&
diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
index 72e8566..58a92a5 100644
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -119,6 +119,7 @@ static void show_guest_stack(struct vcpu *v, struct
cpu_user_regs *regs)
     unsigned long *stack, addr;
     unsigned long mask = STACK_SIZE;
 
+    /* Avoid HVM as we don''t know what the stack looks like. */
     if ( is_hvm_vcpu(v) )
         return;
 
@@ -555,7 +556,7 @@ static inline void do_trap(
     }
 
     if ( ((trapnr == TRAP_copro_error) || (trapnr == TRAP_simd_error))
&&
-         is_hvm_vcpu(curr) &&
curr->arch.hvm_vcpu.fpu_exception_callback )
+         has_hvm_container_vcpu(curr) &&
curr->arch.hvm_vcpu.fpu_exception_callback )
     {
         curr->arch.hvm_vcpu.fpu_exception_callback(
             curr->arch.hvm_vcpu.fpu_exception_callback_arg, regs);
@@ -721,7 +722,7 @@ int cpuid_hypervisor_leaves( uint32_t idx, uint32_t sub_idx,
             *ebx = 0x40000200;
         *ecx = 0;          /* Features 1 */
         *edx = 0;          /* Features 2 */
-        if ( !is_hvm_vcpu(current) )
+        if ( is_pv_vcpu(current) )
             *ecx |= XEN_CPUID_FEAT1_MMU_PT_UPDATE_PRESERVE_AD;
         break;
 
diff --git a/xen/arch/x86/x86_64/mm.c b/xen/arch/x86/x86_64/mm.c
index 26cb998..5424cc4 100644
--- a/xen/arch/x86/x86_64/mm.c
+++ b/xen/arch/x86/x86_64/mm.c
@@ -73,7 +73,8 @@ void *do_page_walk(struct vcpu *v, unsigned long addr)
     l2_pgentry_t l2e, *l2t;
     l1_pgentry_t l1e, *l1t;
 
-    if ( is_hvm_vcpu(v) )
+    /* FIXME: Implement for PVH */
+    if ( !is_pv_vcpu(v) )
         return NULL;
 
     l4t = map_domain_page(mfn);
diff --git a/xen/arch/x86/x86_64/traps.c b/xen/arch/x86/x86_64/traps.c
index 1cc977c..ef5c470 100644
--- a/xen/arch/x86/x86_64/traps.c
+++ b/xen/arch/x86/x86_64/traps.c
@@ -86,7 +86,7 @@ void show_registers(struct cpu_user_regs *regs)
     enum context context;
     struct vcpu *v = current;
 
-    if ( is_hvm_vcpu(v) && guest_mode(regs) )
+    if ( has_hvm_container_vcpu(v) && guest_mode(regs) )
     {
         struct segment_register sreg;
         context = CTXT_hvm_guest;
@@ -147,8 +147,8 @@ void vcpu_show_registers(const struct vcpu *v)
     const struct cpu_user_regs *regs = &v->arch.user_regs;
     unsigned long crs[8];
 
-    /* No need to handle HVM for now. */
-    if ( is_hvm_vcpu(v) )
+    /* Only handle PV guests for now */
+    if ( !is_pv_vcpu(v) )
         return;
 
     crs[0] = v->arch.pv_vcpu.ctrlreg[0];
@@ -622,7 +622,7 @@ static void hypercall_page_initialise_ring3_kernel(void
*hypercall_page)
 void hypercall_page_initialise(struct domain *d, void *hypercall_page)
 {
     memset(hypercall_page, 0xCC, PAGE_SIZE);
-    if ( is_hvm_domain(d) )
+    if ( has_hvm_container_domain(d) )
         hvm_hypercall_page_initialise(d, hypercall_page);
     else if ( !is_pv_32bit_domain(d) )
         hypercall_page_initialise_ring3_kernel(hypercall_page);
diff --git a/xen/common/domain.c b/xen/common/domain.c
index 5999779..995ba63 100644
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -238,7 +238,7 @@ struct domain *domain_create(
         goto fail;
 
     if ( domcr_flags & DOMCRF_hvm )
-        d->is_hvm = 1;
+        d->guest_type = guest_type_hvm;
 
     if ( domid == 0 )
     {
diff --git a/xen/common/grant_table.c b/xen/common/grant_table.c
index f42bc7a..8884c8d 100644
--- a/xen/common/grant_table.c
+++ b/xen/common/grant_table.c
@@ -721,7 +721,7 @@ __gnttab_map_grant_ref(
 
     double_gt_lock(lgt, rgt);
 
-    if ( !is_hvm_domain(ld) && need_iommu(ld) )
+    if ( is_pv_domain(ld) && need_iommu(ld) )
     {
         unsigned int wrc, rdc;
         int err = 0;
@@ -931,7 +931,7 @@ __gnttab_unmap_common(
             act->pin -= GNTPIN_hstw_inc;
     }
 
-    if ( !is_hvm_domain(ld) && need_iommu(ld) )
+    if ( is_pv_domain(ld) && need_iommu(ld) )
     {
         unsigned int wrc, rdc;
         int err = 0;
diff --git a/xen/common/kernel.c b/xen/common/kernel.c
index b8707d9..2d68a7c 100644
--- a/xen/common/kernel.c
+++ b/xen/common/kernel.c
@@ -305,7 +305,7 @@ DO(xen_version)(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
             if ( current->domain == dom0 )
                 fi.submap |= 1U << XENFEAT_dom0;
 #ifdef CONFIG_X86
-            if ( !is_hvm_vcpu(current) )
+            if ( is_pv_vcpu(current) )
                 fi.submap |= (1U << XENFEAT_mmu_pt_update_preserve_ad) |
                              (1U << XENFEAT_highmem_assist) |
                              (1U << XENFEAT_gnttab_map_avail_bits);
diff --git a/xen/include/asm-x86/domain.h b/xen/include/asm-x86/domain.h
index 909f449..0831700 100644
--- a/xen/include/asm-x86/domain.h
+++ b/xen/include/asm-x86/domain.h
@@ -16,7 +16,7 @@
 #define is_pv_32on64_domain(d) (is_pv_32bit_domain(d))
 #define is_pv_32on64_vcpu(v)   (is_pv_32on64_domain((v)->domain))
 
-#define is_hvm_pv_evtchn_domain(d) (is_hvm_domain(d) && \
+#define is_hvm_pv_evtchn_domain(d) (has_hvm_container_domain(d) && \
         d->arch.hvm_domain.irq.callback_via_type == HVMIRQ_callback_vector)
 #define is_hvm_pv_evtchn_vcpu(v) (is_hvm_pv_evtchn_domain(v->domain))
 
diff --git a/xen/include/asm-x86/event.h b/xen/include/asm-x86/event.h
index 7edeb5b..a82062e 100644
--- a/xen/include/asm-x86/event.h
+++ b/xen/include/asm-x86/event.h
@@ -23,7 +23,7 @@ int hvm_local_events_need_delivery(struct vcpu *v);
 static inline int local_events_need_delivery(void)
 {
     struct vcpu *v = current;
-    return (is_hvm_vcpu(v) ? hvm_local_events_need_delivery(v) :
+    return (has_hvm_container_vcpu(v) ? hvm_local_events_need_delivery(v) :
             (vcpu_info(v, evtchn_upcall_pending) &&
              !vcpu_info(v, evtchn_upcall_mask)));
 }
diff --git a/xen/include/asm-x86/guest_access.h
b/xen/include/asm-x86/guest_access.h
index ca700c9..88edb3f 100644
--- a/xen/include/asm-x86/guest_access.h
+++ b/xen/include/asm-x86/guest_access.h
@@ -14,27 +14,27 @@
 
 /* Raw access functions: no type checking. */
 #define raw_copy_to_guest(dst, src, len)        \
-    (is_hvm_vcpu(current) ?                     \
+    (has_hvm_container_vcpu(current) ?                     \
      copy_to_user_hvm((dst), (src), (len)) :    \
      copy_to_user((dst), (src), (len)))
 #define raw_copy_from_guest(dst, src, len)      \
-    (is_hvm_vcpu(current) ?                     \
+    (has_hvm_container_vcpu(current) ?                     \
      copy_from_user_hvm((dst), (src), (len)) :  \
      copy_from_user((dst), (src), (len)))
 #define raw_clear_guest(dst,  len)              \
-    (is_hvm_vcpu(current) ?                     \
+    (has_hvm_container_vcpu(current) ?                     \
      clear_user_hvm((dst), (len)) :             \
      clear_user((dst), (len)))
 #define __raw_copy_to_guest(dst, src, len)      \
-    (is_hvm_vcpu(current) ?                     \
+    (has_hvm_container_vcpu(current) ?                     \
      copy_to_user_hvm((dst), (src), (len)) :    \
      __copy_to_user((dst), (src), (len)))
 #define __raw_copy_from_guest(dst, src, len)    \
-    (is_hvm_vcpu(current) ?                     \
+    (has_hvm_container_vcpu(current) ?                     \
      copy_from_user_hvm((dst), (src), (len)) :  \
      __copy_from_user((dst), (src), (len)))
 #define __raw_clear_guest(dst,  len)            \
-    (is_hvm_vcpu(current) ?                     \
+    (has_hvm_container_vcpu(current) ?                     \
      clear_user_hvm((dst), (len)) :             \
      clear_user((dst), (len)))
 
diff --git a/xen/include/asm-x86/guest_pt.h b/xen/include/asm-x86/guest_pt.h
index b62bc6a..d2a8250 100644
--- a/xen/include/asm-x86/guest_pt.h
+++ b/xen/include/asm-x86/guest_pt.h
@@ -196,7 +196,7 @@ guest_supports_superpages(struct vcpu *v)
     /* The _PAGE_PSE bit must be honoured in HVM guests, whenever
      * CR4.PSE is set or the guest is in PAE or long mode. 
      * It''s also used in the dummy PT for vcpus with CR4.PG cleared.
*/
-    return (!is_hvm_vcpu(v)
+    return (is_pv_vcpu(v)
             ? opt_allow_superpage
             : (GUEST_PAGING_LEVELS != 2 
                || !hvm_paging_enabled(v)
@@ -214,7 +214,7 @@ guest_supports_nx(struct vcpu *v)
 {
     if ( GUEST_PAGING_LEVELS == 2 || !cpu_has_nx )
         return 0;
-    if ( !is_hvm_vcpu(v) )
+    if ( is_pv_vcpu(v) )
         return cpu_has_nx;
     return hvm_nx_enabled(v);
 }
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index 0013a8d..1025838 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -238,6 +238,10 @@ struct mem_event_per_domain
     struct mem_event_domain access;
 };
 
+enum guest_type {
+    guest_type_pv, guest_type_hvm
+};
+
 struct domain
 {
     domid_t          domain_id;
@@ -285,8 +289,8 @@ struct domain
     struct rangeset *iomem_caps;
     struct rangeset *irq_caps;
 
-    /* Is this an HVM guest? */
-    bool_t           is_hvm;
+    enum guest_type guest_type;
+
 #ifdef HAS_PASSTHROUGH
     /* Does this guest need iommu mappings? */
     bool_t           need_iommu;
@@ -738,8 +742,12 @@ void watchdog_domain_destroy(struct domain *d);
 
 #define VM_ASSIST(_d,_t) (test_bit((_t), &(_d)->vm_assist))
 
-#define is_hvm_domain(d) ((d)->is_hvm)
+#define is_pv_domain(d) ((d)->guest_type == guest_type_pv)
+#define is_pv_vcpu(v)   (is_pv_domain((v)->domain))
+#define is_hvm_domain(d) ((d)->guest_type == guest_type_hvm)
 #define is_hvm_vcpu(v)   (is_hvm_domain(v->domain))
+#define has_hvm_container_domain(d) ((d)->guest_type != guest_type_pv)
+#define has_hvm_container_vcpu(v)   (has_hvm_container_domain((v)->domain))
 #define is_pinned_vcpu(v) ((v)->domain->is_pinned || \
                            cpumask_weight((v)->cpu_affinity) == 1)
 #ifdef HAS_PASSTHROUGH
diff --git a/xen/include/xen/tmem_xen.h b/xen/include/xen/tmem_xen.h
index ad1ddd5..9fb7446 100644
--- a/xen/include/xen/tmem_xen.h
+++ b/xen/include/xen/tmem_xen.h
@@ -442,7 +442,7 @@ typedef XEN_GUEST_HANDLE_PARAM(char) tmem_cli_va_param_t;
 static inline int tmh_get_tmemop_from_client(tmem_op_t *op, tmem_cli_op_t uops)
 {
 #ifdef CONFIG_COMPAT
-    if ( is_hvm_vcpu(current) ?
+    if ( has_hvm_container_vcpu(current) ?
          hvm_guest_x86_mode(current) != 8 :
          is_pv_32on64_vcpu(current) )
     {
-- 
1.7.9.5
Introduce new PVH guest type, flags to create it, and ways to identify it.
To begin with, it will inherit functionality marked hvm_container.
Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>
Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
v13: Changed if/else in getdomaininfo into a switch statement as requested.
CC: Jan Beulich <jbeulich@suse.com>
CC: Tim Deegan <tim@xen.org>
CC: Keir Fraser <keir@xen.org>
---
 xen/common/domain.c         |   11 +++++++++++
 xen/common/domctl.c         |   24 +++++++++++++++++++++---
 xen/include/public/domctl.h |    8 +++++++-
 xen/include/xen/sched.h     |   11 ++++++++++-
 4 files changed, 49 insertions(+), 5 deletions(-)
diff --git a/xen/common/domain.c b/xen/common/domain.c
index 995ba63..19d96c2 100644
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -239,6 +239,17 @@ struct domain *domain_create(
 
     if ( domcr_flags & DOMCRF_hvm )
         d->guest_type = guest_type_hvm;
+    else if ( domcr_flags & DOMCRF_pvh )
+    {
+        if ( !(domcr_flags & DOMCRF_hap) )
+        {
+            err = -EOPNOTSUPP;
+            printk(XENLOG_INFO "PVH guest must have HAP on\n");
+            goto fail;
+        }
+        d->guest_type = guest_type_pvh;
+        printk("Creating PVH guest d%d\n", d->domain_id);
+    }
 
     if ( domid == 0 )
     {
diff --git a/xen/common/domctl.c b/xen/common/domctl.c
index 9760d50..2814b7c 100644
--- a/xen/common/domctl.c
+++ b/xen/common/domctl.c
@@ -185,8 +185,17 @@ void getdomaininfo(struct domain *d, struct
xen_domctl_getdomaininfo *info)
         (d->debugger_attached           ? XEN_DOMINF_debugged : 0) |
         d->shutdown_code << XEN_DOMINF_shutdownshift;
 
-    if ( is_hvm_domain(d) )
+    switch(d->guest_type)
+    {
+    case guest_type_hvm:
         info->flags |= XEN_DOMINF_hvm_guest;
+        break;
+    case guest_type_pvh:
+        info->flags |= XEN_DOMINF_pvh_guest;
+        break;
+    default:
+        break;
+    }
 
     xsm_security_domaininfo(d, info);
 
@@ -412,8 +421,11 @@ long do_domctl(XEN_GUEST_HANDLE_PARAM(xen_domctl_t)
u_domctl)
         ret = -EINVAL;
         if ( supervisor_mode_kernel ||
              (op->u.createdomain.flags &
-             ~(XEN_DOMCTL_CDF_hvm_guest | XEN_DOMCTL_CDF_hap |
-               XEN_DOMCTL_CDF_s3_integrity | XEN_DOMCTL_CDF_oos_off)) )
+             ~(XEN_DOMCTL_CDF_hvm_guest
+               | XEN_DOMCTL_CDF_pvh_guest
+               | XEN_DOMCTL_CDF_hap
+               | XEN_DOMCTL_CDF_s3_integrity
+               | XEN_DOMCTL_CDF_oos_off)) )
             break;
 
         dom = op->domain;
@@ -440,9 +452,15 @@ long do_domctl(XEN_GUEST_HANDLE_PARAM(xen_domctl_t)
u_domctl)
             rover = dom;
         }
 
+        if ( (op->u.createdomain.flags & XEN_DOMCTL_CDF_hvm_guest)
+             && (op->u.createdomain.flags &
XEN_DOMCTL_CDF_pvh_guest) )
+            return -EINVAL;
+
         domcr_flags = 0;
         if ( op->u.createdomain.flags & XEN_DOMCTL_CDF_hvm_guest )
             domcr_flags |= DOMCRF_hvm;
+        if ( op->u.createdomain.flags & XEN_DOMCTL_CDF_pvh_guest )
+            domcr_flags |= DOMCRF_pvh;
         if ( op->u.createdomain.flags & XEN_DOMCTL_CDF_hap )
             domcr_flags |= DOMCRF_hap;
         if ( op->u.createdomain.flags & XEN_DOMCTL_CDF_s3_integrity )
diff --git a/xen/include/public/domctl.h b/xen/include/public/domctl.h
index 4c5b2bb..e471179 100644
--- a/xen/include/public/domctl.h
+++ b/xen/include/public/domctl.h
@@ -47,7 +47,7 @@ struct xen_domctl_createdomain {
     /* IN parameters */
     uint32_t ssidref;
     xen_domain_handle_t handle;
- /* Is this an HVM guest (as opposed to a PV guest)? */
+ /* Is this an HVM guest (as opposed to a PVH or PV guest)? */
 #define _XEN_DOMCTL_CDF_hvm_guest     0
 #define XEN_DOMCTL_CDF_hvm_guest      (1U<<_XEN_DOMCTL_CDF_hvm_guest)
  /* Use hardware-assisted paging if available? */
@@ -59,6 +59,9 @@ struct xen_domctl_createdomain {
  /* Disable out-of-sync shadow page tables? */
 #define _XEN_DOMCTL_CDF_oos_off       3
 #define XEN_DOMCTL_CDF_oos_off        (1U<<_XEN_DOMCTL_CDF_oos_off)
+ /* Is this a PVH guest (as opposed to an HVM or PV guest)? */
+#define _XEN_DOMCTL_CDF_pvh_guest     4
+#define XEN_DOMCTL_CDF_pvh_guest      (1U<<_XEN_DOMCTL_CDF_pvh_guest)
     uint32_t flags;
 };
 typedef struct xen_domctl_createdomain xen_domctl_createdomain_t;
@@ -89,6 +92,9 @@ struct xen_domctl_getdomaininfo {
  /* Being debugged.  */
 #define _XEN_DOMINF_debugged  6
 #define XEN_DOMINF_debugged   (1U<<_XEN_DOMINF_debugged)
+/* domain is PVH */
+#define _XEN_DOMINF_pvh_guest 7
+#define XEN_DOMINF_pvh_guest   (1U<<_XEN_DOMINF_pvh_guest)
  /* XEN_DOMINF_shutdown guest-supplied code.  */
 #define XEN_DOMINF_shutdownmask 255
 #define XEN_DOMINF_shutdownshift 16
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index 1025838..cef5025 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -238,8 +238,12 @@ struct mem_event_per_domain
     struct mem_event_domain access;
 };
 
+/*
+ * PVH is a PV guest running in an HVM container.  is_hvm_* checks
+ * will be false, but has_hvm_container_* checks will be true.
+ */
 enum guest_type {
-    guest_type_pv, guest_type_hvm
+    guest_type_pv, guest_type_pvh, guest_type_hvm
 };
 
 struct domain
@@ -474,6 +478,9 @@ struct domain *domain_create(
  /* DOMCRF_oos_off: dont use out-of-sync optimization for shadow page tables */
 #define _DOMCRF_oos_off         4
 #define DOMCRF_oos_off          (1U<<_DOMCRF_oos_off)
+ /* DOMCRF_pvh: Create PV domain in HVM container. */
+#define _DOMCRF_pvh            5
+#define DOMCRF_pvh             (1U<<_DOMCRF_pvh)
 
 /*
  * rcu_lock_domain_by_id() is more efficient than get_domain_by_id().
@@ -744,6 +751,8 @@ void watchdog_domain_destroy(struct domain *d);
 
 #define is_pv_domain(d) ((d)->guest_type == guest_type_pv)
 #define is_pv_vcpu(v)   (is_pv_domain((v)->domain))
+#define is_pvh_domain(d) ((d)->guest_type == guest_type_pvh)
+#define is_pvh_vcpu(v)   (is_pvh_domain((v)->domain))
 #define is_hvm_domain(d) ((d)->guest_type == guest_type_hvm)
 #define is_hvm_vcpu(v)   (is_hvm_domain(v->domain))
 #define has_hvm_container_domain(d) ((d)->guest_type != guest_type_pv)
-- 
1.7.9.5
George Dunlap
2013-Sep-23  16:49 UTC
[PATCH RFC v13 05/20] pvh: Disable unneeded features of HVM containers
Things kept:
* cacheattr_region lists
* irq-related structures
* paging
* tm_list
* hvm params
Things disabled for now:
* compat xlation
Things disabled:
* Emulated timers and clock sources
* IO/MMIO emulation
* msix tables
* hvm_funcs
* nested HVM
* Fast-path for emulated lapic accesses
Getting rid of the hvm_params struct required a couple other places to
check for its existence before attempting to read the params.
Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>
Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
---
v13:
 - Removed unnecessary comment
 - Allocate params for hvm domains; remove null checks necessary in last patch
 - Add ASSERT(!is_pvh) to handle_pio
CC: Jan Beulich <jbeulich@suse.com>
CC: Tim Deegan <tim@xen.org>
CC: Keir Fraser <keir@xen.org>
---
 xen/arch/x86/hvm/hvm.c      |   43 ++++++++++++++++++++++++++++++++++++++-----
 xen/arch/x86/hvm/io.c       |    7 +++++++
 xen/arch/x86/hvm/irq.c      |    3 +++
 xen/arch/x86/hvm/vmx/intr.c |    3 ++-
 4 files changed, 50 insertions(+), 6 deletions(-)
diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index aaf956a..370bd4d 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -301,6 +301,10 @@ u64 hvm_get_guest_tsc_adjust(struct vcpu *v)
 
 void hvm_migrate_timers(struct vcpu *v)
 {
+    /* PVH doesn''t use rtc and emulated timers, it uses pvclock
mechanism. */
+    if ( is_pvh_vcpu(v) )
+        return;
+
     rtc_migrate_timers(v);
     pt_migrate(v);
 }
@@ -342,10 +346,13 @@ void hvm_do_resume(struct vcpu *v)
 {
     ioreq_t *p;
 
-    pt_restore_timer(v);
-
     check_wakeup_from_wait();
 
+    if ( is_pvh_vcpu(v) )
+        goto check_inject_trap;
+
+    pt_restore_timer(v);
+
     /* NB. Optimised for common case (p->state == STATE_IOREQ_NONE). */
     p = get_ioreq(v);
     while ( p->state != STATE_IOREQ_NONE )
@@ -368,6 +375,7 @@ void hvm_do_resume(struct vcpu *v)
         }
     }
 
+  check_inject_trap:
     /* Inject pending hw/sw trap */
     if ( v->arch.hvm_vcpu.inject_trap.vector != -1 ) 
     {
@@ -528,10 +536,16 @@ int hvm_domain_initialise(struct domain *d)
     if ( rc != 0 )
         goto fail0;
 
+    rc = -ENOMEM;
     d->arch.hvm_domain.params = xzalloc_array(uint64_t, HVM_NR_PARAMS);
+    if ( !d->arch.hvm_domain.params )
+        goto fail1;
+
+    if ( is_pvh_domain(d) )
+        return 0;
+
     d->arch.hvm_domain.io_handler = xmalloc(struct hvm_io_handler);
-    rc = -ENOMEM;
-    if ( !d->arch.hvm_domain.params || !d->arch.hvm_domain.io_handler )
+    if ( !d->arch.hvm_domain.io_handler )
         goto fail1;
     d->arch.hvm_domain.io_handler->num_slot = 0;
 
@@ -578,6 +592,9 @@ int hvm_domain_initialise(struct domain *d)
 
 void hvm_domain_relinquish_resources(struct domain *d)
 {
+    if ( is_pvh_domain(d) )
+        return;
+
     if ( hvm_funcs.nhvm_domain_relinquish_resources )
         hvm_funcs.nhvm_domain_relinquish_resources(d);
 
@@ -602,6 +619,10 @@ void hvm_domain_relinquish_resources(struct domain *d)
 void hvm_domain_destroy(struct domain *d)
 {
     hvm_destroy_cacheattr_region_list(d);
+
+    if ( is_pvh_domain(d) )
+        return;
+
     hvm_funcs.domain_destroy(d);
     rtc_deinit(d);
     stdvga_deinit(d);
@@ -1114,6 +1135,14 @@ int hvm_vcpu_initialise(struct vcpu *v)
 
     v->arch.hvm_vcpu.inject_trap.vector = -1;
 
+    if ( is_pvh_vcpu(v) )
+    {
+        v->arch.hvm_vcpu.hcall_64bit = 1;    /* PVH 32bitfixme. */
+        /* This for hvm_long_mode_enabled(v). */
+        v->arch.hvm_vcpu.guest_efer = EFER_SCE | EFER_LMA | EFER_LME;
+        return 0;
+    }
+
     rc = setup_compat_arg_xlat(v); /* teardown: free_compat_arg_xlat() */
     if ( rc != 0 )
         goto fail3;
@@ -1188,7 +1217,10 @@ void hvm_vcpu_destroy(struct vcpu *v)
 
     tasklet_kill(&v->arch.hvm_vcpu.assert_evtchn_irq_tasklet);
     hvm_vcpu_cacheattr_destroy(v);
-    vlapic_destroy(v);
+
+    if ( is_hvm_vcpu(v) )
+        vlapic_destroy(v);
+
     hvm_funcs.vcpu_destroy(v);
 
     /* Event channel is already freed by evtchn_destroy(). */
@@ -1389,6 +1421,7 @@ int hvm_hap_nested_page_fault(paddr_t gpa,
     /* For the benefit of 32-bit WinXP (& older Windows) on AMD CPUs,
      * a fast path for LAPIC accesses, skipping the p2m lookup. */
     if ( !nestedhvm_vcpu_in_guestmode(v)
+         && is_hvm_vcpu(v)
          && gfn == PFN_DOWN(vlapic_base_address(vcpu_vlapic(v))) )
     {
         if ( !handle_mmio() )
diff --git a/xen/arch/x86/hvm/io.c b/xen/arch/x86/hvm/io.c
index 4ae2c0c..6edc5c9 100644
--- a/xen/arch/x86/hvm/io.c
+++ b/xen/arch/x86/hvm/io.c
@@ -175,6 +175,10 @@ int handle_mmio(void)
     struct hvm_vcpu_io *vio = &curr->arch.hvm_vcpu.hvm_io;
     int rc;
 
+    /* No MMIO for PVH vcpus */
+    if ( is_pvh_vcpu(curr) )
+        return 0;
+
     hvm_emulate_prepare(&ctxt, guest_cpu_user_regs());
 
     rc = hvm_emulate_one(&ctxt);
@@ -228,6 +232,9 @@ int handle_pio(uint16_t port, int size, int dir)
     unsigned long data, reps = 1;
     int rc;
 
+    /* PIO for PVH is handled by the PV handlers */
+    ASSERT(!is_pvh_vcpu(curr));
+
     if ( dir == IOREQ_WRITE )
         data = guest_cpu_user_regs()->eax;
 
diff --git a/xen/arch/x86/hvm/irq.c b/xen/arch/x86/hvm/irq.c
index 6a6fb68..677fbcd 100644
--- a/xen/arch/x86/hvm/irq.c
+++ b/xen/arch/x86/hvm/irq.c
@@ -405,6 +405,9 @@ struct hvm_intack hvm_vcpu_has_pending_irq(struct vcpu *v)
          && vcpu_info(v, evtchn_upcall_pending) )
         return hvm_intack_vector(plat->irq.callback_via.vector);
 
+    if ( is_pvh_vcpu(v) )
+        return hvm_intack_none;
+
     if ( vlapic_accept_pic_intr(v) && plat->vpic[0].int_output )
         return hvm_intack_pic(0);
 
diff --git a/xen/arch/x86/hvm/vmx/intr.c b/xen/arch/x86/hvm/vmx/intr.c
index e6d5b46..482413e 100644
--- a/xen/arch/x86/hvm/vmx/intr.c
+++ b/xen/arch/x86/hvm/vmx/intr.c
@@ -225,7 +225,8 @@ void vmx_intr_assist(void)
     }
 
     /* Crank the handle on interrupt state. */
-    pt_vector = pt_update_irq(v);
+    if ( is_hvm_vcpu(v) )
+        pt_vector = pt_update_irq(v);
 
     do {
         intack = hvm_vcpu_has_pending_irq(v);
-- 
1.7.9.5
Changes:
* Enforce HAP mode for now
* Disable exits related to virtual interrupts or emulated APICs
* Disable changing paging mode
 - "unrestricted guest" (i.e., real mode for EPT) disabled
 - write guest EFER disabled
* Start in 64-bit mode
* Force TSC mode to be "none"
* Paging mode update to happen in arch_set_info_guest
Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>
Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
---
v13:
 - Fix up default cr0 settings
 - Get rid of some unnecessary PVH-related changes
 - Return EOPNOTSUPP instead of ENOSYS if hardware features are not present
 - Remove an unnecessary variable from pvh_check_requirements
CC: Jan Beulich <jbeulich@suse.com>
CC: Tim Deegan <tim@xen.org>
CC: Keir Fraser <keir@xen.org>
---
 xen/arch/x86/hvm/vmx/vmcs.c |  130 +++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 126 insertions(+), 4 deletions(-)
diff --git a/xen/arch/x86/hvm/vmx/vmcs.c b/xen/arch/x86/hvm/vmx/vmcs.c
index cf54d18..53fccdf 100644
--- a/xen/arch/x86/hvm/vmx/vmcs.c
+++ b/xen/arch/x86/hvm/vmx/vmcs.c
@@ -828,6 +828,60 @@ void virtual_vmcs_vmwrite(void *vvmcs, u32 vmcs_encoding,
u64 val)
     virtual_vmcs_exit(vvmcs);
 }
 
+static int pvh_check_requirements(struct vcpu *v)
+{
+    u64 required;
+
+    /* Check for required hardware features */
+    if ( !cpu_has_vmx_ept )
+    {
+        printk(XENLOG_G_INFO "PVH: CPU does not have EPT support\n");
+        return -EOPNOTSUPP;
+    }
+    if ( !cpu_has_vmx_pat )
+    {
+        printk(XENLOG_G_INFO "PVH: CPU does not have PAT support\n");
+        return -EOPNOTSUPP;
+    }
+    if ( !cpu_has_vmx_msr_bitmap )
+    {
+        printk(XENLOG_G_INFO "PVH: CPU does not have msr bitmap\n");
+        return -EOPNOTSUPP;
+    }
+    if ( !cpu_has_vmx_secondary_exec_control )
+    {
+        printk(XENLOG_G_INFO "CPU Secondary exec is required to run
PVH\n");
+        return -EOPNOTSUPP;
+    }
+    required = X86_CR4_PAE | X86_CR4_VMXE | X86_CR4_OSFXSR;
+    if ( (real_cr4_to_pv_guest_cr4(mmu_cr4_features) & required) !=
required )
+    {
+        printk(XENLOG_G_INFO "PVH: required CR4 features not
available:%lx\n",
+               required);
+        return -EOPNOTSUPP;
+    }
+
+    /* Check for required configuration options */
+    if ( !paging_mode_hap(v->domain) )
+    {
+        printk(XENLOG_G_INFO "HAP is required for PVH guest.\n");
+        return -EINVAL;
+    }
+    /*
+     * If rdtsc exiting is turned on and it goes thru emulate_privileged_op,
+     * then pv_vcpu.ctrlreg must be added to the pvh struct.
+     */
+    if ( v->domain->arch.vtsc )
+    {
+        printk(XENLOG_G_INFO
+               "At present PVH only supports the default timer
mode\n");
+        return -EINVAL;
+    }
+
+
+    return 0;
+}
+
 static int construct_vmcs(struct vcpu *v)
 {
     struct domain *d = v->domain;
@@ -836,6 +890,13 @@ static int construct_vmcs(struct vcpu *v)
     u32 vmexit_ctl = vmx_vmexit_control;
     u32 vmentry_ctl = vmx_vmentry_control;
 
+    if ( is_pvh_domain(d) )
+    {
+        int rc = pvh_check_requirements(v);
+        if ( rc )
+            return rc;
+    }
+
     vmx_vmcs_enter(v);
 
     /* VMCS controls. */
@@ -874,7 +935,32 @@ static int construct_vmcs(struct vcpu *v)
     /* Do not enable Monitor Trap Flag unless start single step debug */
     v->arch.hvm_vmx.exec_control &= ~CPU_BASED_MONITOR_TRAP_FLAG;
 
+    if ( is_pvh_domain(d) )
+    {
+        /* Disable virtual apics, TPR */
+        v->arch.hvm_vmx.secondary_exec_control &= 
+            ~(SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES
+              | SECONDARY_EXEC_APIC_REGISTER_VIRT
+              | SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY);
+        v->arch.hvm_vmx.exec_control &= ~CPU_BASED_TPR_SHADOW;
+
+        /* Disable wbinvd (only necessary for MMIO),
+         * unrestricted guest (real mode for EPT) */
+        v->arch.hvm_vmx.secondary_exec_control &= 
+            ~(SECONDARY_EXEC_UNRESTRICTED_GUEST
+              | SECONDARY_EXEC_WBINVD_EXITING);
+
+        /* Start in 64-bit mode.
+         * PVH 32bitfixme. */
+        vmentry_ctl |= VM_ENTRY_IA32E_MODE;       /* GUEST_EFER.LME/LMA ignored
*/
+
+        ASSERT(v->arch.hvm_vmx.exec_control &
CPU_BASED_ACTIVATE_SECONDARY_CONTROLS);
+        ASSERT(v->arch.hvm_vmx.exec_control &
CPU_BASED_ACTIVATE_MSR_BITMAP);
+        ASSERT(!(v->arch.hvm_vmx.exec_control &
CPU_BASED_RDTSC_EXITING));
+    }
+
     vmx_update_cpu_exec_control(v);
+
     __vmwrite(VM_EXIT_CONTROLS, vmexit_ctl);
     __vmwrite(VM_ENTRY_CONTROLS, vmentry_ctl);
 
@@ -910,6 +996,17 @@ static int construct_vmcs(struct vcpu *v)
         vmx_disable_intercept_for_msr(v, MSR_IA32_SYSENTER_EIP, MSR_TYPE_R |
MSR_TYPE_W);
         if ( cpu_has_vmx_pat && paging_mode_hap(d) )
             vmx_disable_intercept_for_msr(v, MSR_IA32_CR_PAT, MSR_TYPE_R |
MSR_TYPE_W);
+        if ( is_pvh_domain(d) )
+            vmx_disable_intercept_for_msr(v, MSR_SHADOW_GS_BASE, MSR_TYPE_R |
MSR_TYPE_W);
+
+        /*
+         * PVH: We don''t disable intercepts for MSRs: MSR_STAR,
MSR_LSTAR,
+         *      MSR_CSTAR, and MSR_SYSCALL_MASK because we need to specify
+         *      save/restore area to save/restore at every VM exit and entry.
+         *      Instead, let the intercept functions save them into
+         *      vmx_msr_state fields. See comment in vmx_restore_host_msrs().
+         *      See also vmx_restore_guest_msrs().
+         */
     }
 
     /* I/O access bitmap. */
@@ -998,7 +1095,11 @@ static int construct_vmcs(struct vcpu *v)
     __vmwrite(GUEST_DS_AR_BYTES, 0xc093);
     __vmwrite(GUEST_FS_AR_BYTES, 0xc093);
     __vmwrite(GUEST_GS_AR_BYTES, 0xc093);
-    __vmwrite(GUEST_CS_AR_BYTES, 0xc09b); /* exec/read, accessed */
+    if ( is_pvh_domain(d) )
+        /* CS.L == 1, exec, read/write, accessed. PVH 32bitfixme. */
+        __vmwrite(GUEST_CS_AR_BYTES, 0xa09b);
+    else
+        __vmwrite(GUEST_CS_AR_BYTES, 0xc09b); /* exec/read, accessed */
 
     /* Guest IDT. */
     __vmwrite(GUEST_IDTR_BASE, 0);
@@ -1028,12 +1129,28 @@ static int construct_vmcs(struct vcpu *v)
               | (1U << TRAP_no_device);
     vmx_update_exception_bitmap(v);
 
+    /* In HVM domains, this happens on the realmode->paging
+     * transition.  Since PVH never goes through this transition, we
+     * need to do it at start-of-day. */
+    if ( is_pvh_domain(d) )
+        vmx_update_debug_state(v);
+
     v->arch.hvm_vcpu.guest_cr[0] = X86_CR0_PE | X86_CR0_ET;
+
+    /* PVH domains always start in paging mode */
+    if ( is_pvh_domain(d) )
+        v->arch.hvm_vcpu.guest_cr[0] |= X86_CR0_PG | X86_CR0_NE |
X86_CR0_WP;
+
     hvm_update_guest_cr(v, 0);
 
-    v->arch.hvm_vcpu.guest_cr[4] = 0;
+    v->arch.hvm_vcpu.guest_cr[4] = is_pvh_domain(d) ?
+        real_cr4_to_pv_guest_cr4(mmu_cr4_features)
+        : 0;
     hvm_update_guest_cr(v, 4);
 
+    if ( is_pvh_domain(d) )
+        v->arch.hvm_vmx.vmx_realmode = 0;
+
     if ( cpu_has_vmx_tpr_shadow )
     {
         __vmwrite(VIRTUAL_APIC_PAGE_ADDR,
@@ -1063,9 +1180,14 @@ static int construct_vmcs(struct vcpu *v)
 
     vmx_vmcs_exit(v);
 
-    paging_update_paging_modes(v); /* will update HOST & GUEST_CR3 as reqd
*/
+    /* PVH: paging mode is updated by arch_set_info_guest(). */
+    if ( is_hvm_vcpu(v) )
+    {
+        /* will update HOST & GUEST_CR3 as reqd */
+        paging_update_paging_modes(v);
 
-    vmx_vlapic_msr_changed(v);
+        vmx_vlapic_msr_changed(v);
+    }
 
     return 0;
 }
-- 
1.7.9.5
George Dunlap
2013-Sep-23  16:49 UTC
[PATCH RFC v13 07/20] pvh: Do not allow PVH guests to change paging modes
Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>
Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
v13: Removed unnecessary else.
CC: Jan Beulich <jbeulich@suse.com>
CC: Tim Deegan <tim@xen.org>
CC: Keir Fraser <keir@xen.org>
---
 xen/arch/x86/hvm/hvm.c |   10 ++++++++++
 1 file changed, 10 insertions(+)
diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index 370bd4d..7f7b9f5 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -1765,6 +1765,16 @@ int hvm_set_cr0(unsigned long value)
          (value & (X86_CR0_PE | X86_CR0_PG)) == X86_CR0_PG )
         goto gpf;
 
+    
+
+    /* A pvh is not expected to change to real mode. */
+    if ( is_pvh_vcpu(v)
+         && (value & (X86_CR0_PE | X86_CR0_PG)) != (X86_CR0_PG |
X86_CR0_PE) )
+    {
+        printk(XENLOG_G_WARNING
+               "PVH attempting to turn off PE/PG. CR0:%lx\n", value);
+        goto gpf;
+    }
     if ( (value & X86_CR0_PG) && !(old_value & X86_CR0_PG) )
     {
         if ( v->arch.hvm_vcpu.guest_efer & EFER_LME )
-- 
1.7.9.5
Hypercalls where we now have unrestricted access:
* memory_op
* console_io
* vcpu_op
* mmuext_op
We also restrict PVH domain access to HVMOP_*_param to reading and
writing HVM_PARAM_CALLBACK_IRQ.
Most hvm_op functions require "is_hvm_domain()" and will default to
-EINVAL; exceptions are HVMOP_get_time and HVMOP_xentrace.
Finally, we restrict setting IOPL permissions for a PVH domain.
Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>
Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
---
v13:
 - Minor code tweaks, as suggested during review
 - return -ENOSYS for set_iopl and set_iobitmap calls
 - Allow HVMOP_set_param for HVM_PARAM_CALLBACK_IRQ.  We still don''t
   allow other values to be written.
CC: Jan Beulich <jbeulich@suse.com>
CC: Tim Deegan <tim@xen.org>
CC: Keir Fraser <keir@xen.org>
---
 xen/arch/x86/hvm/hvm.c  |   48 ++++++++++++++++++++++++++++++++++++++---------
 xen/arch/x86/hvm/mtrr.c |    1 +
 xen/arch/x86/physdev.c  |   10 ++++++++++
 xen/common/kernel.c     |   13 +++++++++++--
 4 files changed, 61 insertions(+), 11 deletions(-)
diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index 7f7b9f5..fd2e117 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -3375,6 +3375,24 @@ static hvm_hypercall_t *const
hvm_hypercall32_table[NR_hypercalls] = {
     HYPERCALL(tmem_op)
 };
 
+/* PVH 32bitfixme. */
+static hvm_hypercall_t *const pvh_hypercall64_table[NR_hypercalls] = {
+    HYPERCALL(platform_op),
+    HYPERCALL(memory_op),
+    HYPERCALL(xen_version),
+    HYPERCALL(console_io),
+    [ __HYPERVISOR_grant_table_op ]  = (hvm_hypercall_t *)hvm_grant_table_op,
+    HYPERCALL(vcpu_op),
+    HYPERCALL(mmuext_op),
+    HYPERCALL(xsm_op),
+    HYPERCALL(sched_op),
+    HYPERCALL(event_channel_op),
+    [ __HYPERVISOR_physdev_op ]      = (hvm_hypercall_t *)hvm_physdev_op,
+    HYPERCALL(hvm_op),
+    HYPERCALL(sysctl),
+    HYPERCALL(domctl)
+};
+
 int hvm_do_hypercall(struct cpu_user_regs *regs)
 {
     struct vcpu *curr = current;
@@ -3401,7 +3419,9 @@ int hvm_do_hypercall(struct cpu_user_regs *regs)
     if ( (eax & 0x80000000) && is_viridian_domain(curr->domain)
)
         return viridian_hypercall(regs);
 
-    if ( (eax >= NR_hypercalls) || !hvm_hypercall32_table[eax] )
+    if ( (eax >= NR_hypercalls) ||
+         (is_pvh_vcpu(curr) ? !pvh_hypercall64_table[eax]
+                            : !hvm_hypercall32_table[eax]) )
     {
         regs->eax = -ENOSYS;
         return HVM_HCALL_completed;
@@ -3416,16 +3436,20 @@ int hvm_do_hypercall(struct cpu_user_regs *regs)
                     regs->r10, regs->r8, regs->r9);
 
         curr->arch.hvm_vcpu.hcall_64bit = 1;
-        regs->rax = hvm_hypercall64_table[eax](regs->rdi,
-                                               regs->rsi,
-                                               regs->rdx,
-                                               regs->r10,
-                                               regs->r8,
-                                               regs->r9); 
+        if ( is_pvh_vcpu(curr) )
+            regs->rax = pvh_hypercall64_table[eax](regs->rdi,
regs->rsi,
+                                                   regs->rdx, regs->r10,
+                                                   regs->r8, regs->r9);
+        else
+            regs->rax = hvm_hypercall64_table[eax](regs->rdi,
regs->rsi,
+                                                   regs->rdx, regs->r10,
+                                                   regs->r8, regs->r9);
         curr->arch.hvm_vcpu.hcall_64bit = 0;
     }
     else
     {
+        ASSERT(!is_pvh_vcpu(curr));   /* PVH 32bitfixme. */
+
         HVM_DBG_LOG(DBG_LEVEL_HCALL, "hcall%u(%x, %x, %x, %x, %x,
%x)", eax,
                     (uint32_t)regs->ebx, (uint32_t)regs->ecx,
                     (uint32_t)regs->edx, (uint32_t)regs->esi,
@@ -3850,7 +3874,11 @@ long do_hvm_op(unsigned long op,
XEN_GUEST_HANDLE_PARAM(void) arg)
             return -ESRCH;
 
         rc = -EINVAL;
-        if ( !is_hvm_domain(d) )
+        if ( !has_hvm_container_domain(d) )
+            goto param_fail;
+
+        if ( is_pvh_domain(d)
+             && (a.index != HVM_PARAM_CALLBACK_IRQ) )
             goto param_fail;
 
         rc = xsm_hvm_param(XSM_TARGET, d, op);
@@ -4024,7 +4052,9 @@ long do_hvm_op(unsigned long op,
XEN_GUEST_HANDLE_PARAM(void) arg)
 
             if ( rc == 0 ) 
             {
-                d->arch.hvm_domain.params[a.index] = a.value;
+                /* PVH domains don''t have a params struct */
+                if ( d->arch.hvm_domain.params )
+                    d->arch.hvm_domain.params[a.index] = a.value;
 
                 switch( a.index )
                 {
diff --git a/xen/arch/x86/hvm/mtrr.c b/xen/arch/x86/hvm/mtrr.c
index ef51a8d..9785cef 100644
--- a/xen/arch/x86/hvm/mtrr.c
+++ b/xen/arch/x86/hvm/mtrr.c
@@ -578,6 +578,7 @@ int32_t hvm_set_mem_pinned_cacheattr(
 {
     struct hvm_mem_pinned_cacheattr_range *range;
 
+    /* Side note: A PVH guest writes to MSR_IA32_CR_PAT natively. */
     if ( !((type == PAT_TYPE_UNCACHABLE) ||
            (type == PAT_TYPE_WRCOMB) ||
            (type == PAT_TYPE_WRTHROUGH) ||
diff --git a/xen/arch/x86/physdev.c b/xen/arch/x86/physdev.c
index dab6213..7d787dc 100644
--- a/xen/arch/x86/physdev.c
+++ b/xen/arch/x86/physdev.c
@@ -519,6 +519,11 @@ ret_t do_physdev_op(int cmd, XEN_GUEST_HANDLE_PARAM(void)
arg)
 
     case PHYSDEVOP_set_iopl: {
         struct physdev_set_iopl set_iopl;
+        
+        ret = -ENOSYS;
+        if ( is_pvh_vcpu(current) )
+            break;
+
         ret = -EFAULT;
         if ( copy_from_guest(&set_iopl, arg, 1) != 0 )
             break;
@@ -532,6 +537,11 @@ ret_t do_physdev_op(int cmd, XEN_GUEST_HANDLE_PARAM(void)
arg)
 
     case PHYSDEVOP_set_iobitmap: {
         struct physdev_set_iobitmap set_iobitmap;
+
+        ret = -ENOSYS;
+        if ( is_pvh_vcpu(current) )
+            break;
+
         ret = -EFAULT;
         if ( copy_from_guest(&set_iobitmap, arg, 1) != 0 )
             break;
diff --git a/xen/common/kernel.c b/xen/common/kernel.c
index 2d68a7c..6a708f3 100644
--- a/xen/common/kernel.c
+++ b/xen/common/kernel.c
@@ -305,14 +305,23 @@ DO(xen_version)(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
             if ( current->domain == dom0 )
                 fi.submap |= 1U << XENFEAT_dom0;
 #ifdef CONFIG_X86
-            if ( is_pv_vcpu(current) )
+            switch(d->guest_type) {
+            case guest_type_pv:
                 fi.submap |= (1U << XENFEAT_mmu_pt_update_preserve_ad) |
                              (1U << XENFEAT_highmem_assist) |
                              (1U << XENFEAT_gnttab_map_avail_bits);
-            else
+                break;
+            case guest_type_pvh:
+                fi.submap |= (1U << XENFEAT_hvm_safe_pvclock) |
+                             (1U << XENFEAT_supervisor_mode_kernel) |
+                             (1U << XENFEAT_hvm_callback_vector);
+                break;
+            case guest_type_hvm:
                 fi.submap |= (1U << XENFEAT_hvm_safe_pvclock) |
                              (1U << XENFEAT_hvm_callback_vector) |
                              (1U << XENFEAT_hvm_pirqs);
+                break;
+            }
 #endif
             break;
         default:
-- 
1.7.9.5
Allow PV e820 map to be set and read from a PVH domain.  This requires
moving the pv e820 struct out from the pv-specific domain struct and
into the arch domain struct.
Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>
Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
CC: Jan Beulich <jbeulich@suse.com>
CC: Tim Deegan <tim@xen.org>
CC: Keir Fraser <keir@xen.org>
---
 xen/arch/x86/domain.c        |    9 +++------
 xen/arch/x86/mm.c            |   26 ++++++++++++--------------
 xen/include/asm-x86/domain.h |   10 +++++-----
 3 files changed, 20 insertions(+), 25 deletions(-)
diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index ac18aa7..5b7e1b2 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -553,6 +553,7 @@ int arch_domain_create(struct domain *d, unsigned int
domcr_flags)
         if ( (rc = iommu_domain_init(d)) != 0 )
             goto fail;
     }
+    spin_lock_init(&d->arch.e820_lock);
 
     if ( has_hvm_container_domain(d) )
     {
@@ -563,13 +564,9 @@ int arch_domain_create(struct domain *d, unsigned int
domcr_flags)
         }
     }
     else
-    {
         /* 64-bit PV guest by default. */
         d->arch.is_32bit_pv = d->arch.has_32bit_shinfo = 0;
 
-        spin_lock_init(&d->arch.pv_domain.e820_lock);
-    }
-
     /* initialize default tsc behavior in case tools don''t */
     tsc_set_info(d, TSC_MODE_DEFAULT, 0UL, 0, 0);
     spin_lock_init(&d->arch.vtsc_lock);
@@ -592,8 +589,8 @@ void arch_domain_destroy(struct domain *d)
 {
     if ( has_hvm_container_domain(d) )
         hvm_domain_destroy(d);
-    else
-        xfree(d->arch.pv_domain.e820);
+
+    xfree(d->arch.e820);
 
     free_domain_pirqs(d);
     if ( !is_idle_domain(d) )
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 120599a..f25df92 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -4760,11 +4760,11 @@ long arch_memory_op(int op, XEN_GUEST_HANDLE_PARAM(void)
arg)
             return -EFAULT;
         }
 
-        spin_lock(&d->arch.pv_domain.e820_lock);
-        xfree(d->arch.pv_domain.e820);
-        d->arch.pv_domain.e820 = e820;
-        d->arch.pv_domain.nr_e820 = fmap.map.nr_entries;
-        spin_unlock(&d->arch.pv_domain.e820_lock);
+        spin_lock(&d->arch.e820_lock);
+        xfree(d->arch.e820);
+        d->arch.e820 = e820;
+        d->arch.nr_e820 = fmap.map.nr_entries;
+        spin_unlock(&d->arch.e820_lock);
 
         rcu_unlock_domain(d);
         return rc;
@@ -4778,26 +4778,24 @@ long arch_memory_op(int op, XEN_GUEST_HANDLE_PARAM(void)
arg)
         if ( copy_from_guest(&map, arg, 1) )
             return -EFAULT;
 
-        spin_lock(&d->arch.pv_domain.e820_lock);
+        spin_lock(&d->arch.e820_lock);
 
         /* Backwards compatibility. */
-        if ( (d->arch.pv_domain.nr_e820 == 0) ||
-             (d->arch.pv_domain.e820 == NULL) )
+        if ( (d->arch.nr_e820 == 0) || (d->arch.e820 == NULL) )
         {
-            spin_unlock(&d->arch.pv_domain.e820_lock);
+            spin_unlock(&d->arch.e820_lock);
             return -ENOSYS;
         }
 
-        map.nr_entries = min(map.nr_entries, d->arch.pv_domain.nr_e820);
-        if ( copy_to_guest(map.buffer, d->arch.pv_domain.e820,
-                           map.nr_entries) ||
+        map.nr_entries = min(map.nr_entries, d->arch.nr_e820);
+        if ( copy_to_guest(map.buffer, d->arch.e820, map.nr_entries) ||
              __copy_to_guest(arg, &map, 1) )
         {
-            spin_unlock(&d->arch.pv_domain.e820_lock);
+            spin_unlock(&d->arch.e820_lock);
             return -EFAULT;
         }
 
-        spin_unlock(&d->arch.pv_domain.e820_lock);
+        spin_unlock(&d->arch.e820_lock);
         return 0;
     }
 
diff --git a/xen/include/asm-x86/domain.h b/xen/include/asm-x86/domain.h
index 0831700..a60d25c 100644
--- a/xen/include/asm-x86/domain.h
+++ b/xen/include/asm-x86/domain.h
@@ -234,11 +234,6 @@ struct pv_domain
 
     /* map_domain_page() mapping cache. */
     struct mapcache_domain mapcache;
-
-    /* Pseudophysical e820 map (XENMEM_memory_map).  */
-    spinlock_t e820_lock;
-    struct e820entry *e820;
-    unsigned int nr_e820;
 };
 
 struct arch_domain
@@ -313,6 +308,11 @@ struct arch_domain
                                 (possibly other cases in the future */
     uint64_t vtsc_kerncount; /* for hvm, counts all vtsc */
     uint64_t vtsc_usercount; /* not used for hvm */
+
+    /* Pseudophysical e820 map (XENMEM_memory_map).  */
+    spinlock_t e820_lock;
+    struct e820entry *e820;
+    unsigned int nr_e820;
 } __cacheline_aligned;
 
 #define has_arch_pdevs(d)    (!list_empty(&(d)->arch.pdev_list))
-- 
1.7.9.5
George Dunlap
2013-Sep-23  16:49 UTC
[PATCH RFC v13 10/20] pvh: Support guest_kernel_mode for PVH
This will be requred in order to do emulated_privops for PVH.
Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>
Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
v13: Make more general by using has_hvm_container rather than is_pvh.
CC: Jan Beulich <jbeulich@suse.com>
CC: Tim Deegan <tim@xen.org>
CC: Keir Fraser <keir@xen.org>
---
 xen/arch/x86/hvm/hvm.c            |    8 ++++++++
 xen/arch/x86/x86_64/traps.c       |    2 +-
 xen/include/asm-x86/domain.h      |    2 +-
 xen/include/asm-x86/x86_64/regs.h |   11 +++++++----
 4 files changed, 17 insertions(+), 6 deletions(-)
diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index fd2e117..af9567f 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -4739,6 +4739,14 @@ enum hvm_intblk nhvm_interrupt_blocked(struct vcpu *v)
     return hvm_funcs.nhvm_intr_blocked(v);
 }
 
+bool_t hvm_kernel_mode(struct vcpu *v)
+{
+    struct segment_register seg;
+
+    hvm_get_segment_register(v, x86_seg_ss, &seg);
+    return (seg.attr.fields.dpl == 0);
+}
+
 /*
  * Local variables:
  * mode: C
diff --git a/xen/arch/x86/x86_64/traps.c b/xen/arch/x86/x86_64/traps.c
index ef5c470..8644aaf 100644
--- a/xen/arch/x86/x86_64/traps.c
+++ b/xen/arch/x86/x86_64/traps.c
@@ -142,7 +142,7 @@ void show_registers(struct cpu_user_regs *regs)
     }
 }
 
-void vcpu_show_registers(const struct vcpu *v)
+void vcpu_show_registers(struct vcpu *v)
 {
     const struct cpu_user_regs *regs = &v->arch.user_regs;
     unsigned long crs[8];
diff --git a/xen/include/asm-x86/domain.h b/xen/include/asm-x86/domain.h
index a60d25c..e1877ca 100644
--- a/xen/include/asm-x86/domain.h
+++ b/xen/include/asm-x86/domain.h
@@ -447,7 +447,7 @@ struct arch_vcpu
 #define hvm_svm         hvm_vcpu.u.svm
 
 void vcpu_show_execution_state(struct vcpu *);
-void vcpu_show_registers(const struct vcpu *);
+void vcpu_show_registers(struct vcpu *);
 
 /* Clean up CR4 bits that are not under guest control. */
 unsigned long pv_guest_cr4_fixup(const struct vcpu *, unsigned long guest_cr4);
diff --git a/xen/include/asm-x86/x86_64/regs.h
b/xen/include/asm-x86/x86_64/regs.h
index 3cdc702..231ae4a 100644
--- a/xen/include/asm-x86/x86_64/regs.h
+++ b/xen/include/asm-x86/x86_64/regs.h
@@ -10,10 +10,13 @@
 #define ring_2(r)    (((r)->cs & 3) == 2)
 #define ring_3(r)    (((r)->cs & 3) == 3)
 
-#define guest_kernel_mode(v, r)                                 \
-    (!is_pv_32bit_vcpu(v) ?                                     \
-     (ring_3(r) && ((v)->arch.flags & TF_kernel_mode)) :       
\
-     (ring_1(r)))
+bool_t hvm_kernel_mode(struct vcpu *);
+
+#define guest_kernel_mode(v, r)                                   \
+    (has_hvm_container_vcpu(v) ? hvm_kernel_mode(v) :             \
+     (!is_pv_32bit_vcpu(v) ?                                      \
+      (ring_3(r) && ((v)->arch.flags & TF_kernel_mode)) :       
\
+      (ring_1(r))))
 
 #define permit_softint(dpl, v, r) \
     ((dpl) >= (guest_kernel_mode(v, r) ? 1 : 3))
-- 
1.7.9.5
George Dunlap
2013-Sep-23  16:49 UTC
[PATCH RFC v13 11/20] pvh: Support read_segment_register for PVH
This will be necessary to do PV-style emulated operations for PVH guests.
Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>
Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
---
v13:
 - Put read_selector next to other segment-related calls
 - rename pvh_get_selector to hvm_read_selector, to make the naming consistent
CC: Jan Beulich <jan.beulich@suse.com>
CC: Tim Deegan <tim@xen.org>
CC: Keir Fraser <keir@xen.org>
---
 xen/arch/x86/domain.c         |    8 ++++----
 xen/arch/x86/hvm/vmx/vmx.c    |   40 ++++++++++++++++++++++++++++++++++++++++
 xen/arch/x86/traps.c          |   26 ++++++++++++--------------
 xen/arch/x86/x86_64/traps.c   |   16 ++++++++--------
 xen/include/asm-x86/hvm/hvm.h |    6 ++++++
 xen/include/asm-x86/system.h  |   19 +++++++++++++++----
 6 files changed, 85 insertions(+), 30 deletions(-)
diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 5b7e1b2..4b4c66d 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -1217,10 +1217,10 @@ static void save_segments(struct vcpu *v)
     struct cpu_user_regs *regs = &v->arch.user_regs;
     unsigned int dirty_segment_mask = 0;
 
-    regs->ds = read_segment_register(ds);
-    regs->es = read_segment_register(es);
-    regs->fs = read_segment_register(fs);
-    regs->gs = read_segment_register(gs);
+    regs->ds = read_segment_register(v, regs, ds);
+    regs->es = read_segment_register(v, regs, es);
+    regs->fs = read_segment_register(v, regs, fs);
+    regs->gs = read_segment_register(v, regs, gs);
 
     if ( regs->ds )
         dirty_segment_mask |= DIRTY_DS;
diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index f9b589b..5392223 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -664,6 +664,45 @@ static void vmx_ctxt_switch_to(struct vcpu *v)
         .fields = { .type = 0xb, .s = 0, .dpl = 0, .p = 1, .avl = 0,    \
                     .l = 0, .db = 0, .g = 0, .pad = 0 } }).bytes)
 
+u16 vmx_read_selector(struct vcpu *v, enum x86_segment seg)
+{
+    u16 sel = 0;
+
+    vmx_vmcs_enter(v);
+    switch ( seg )
+    {
+    case x86_seg_cs:
+        sel = __vmread(GUEST_CS_SELECTOR);
+        break;
+
+    case x86_seg_ss:
+        sel = __vmread(GUEST_SS_SELECTOR);
+        break;
+
+    case x86_seg_es:
+        sel = __vmread(GUEST_ES_SELECTOR);
+        break;
+
+    case x86_seg_ds:
+        sel = __vmread(GUEST_DS_SELECTOR);
+        break;
+
+    case x86_seg_fs:
+        sel = __vmread(GUEST_FS_SELECTOR);
+        break;
+
+    case x86_seg_gs:
+        sel = __vmread(GUEST_GS_SELECTOR);
+        break;
+
+    default:
+        BUG();
+    }
+    vmx_vmcs_exit(v);
+
+    return sel;
+}
+
 void vmx_get_segment_register(struct vcpu *v, enum x86_segment seg,
                               struct segment_register *reg)
 {
@@ -1526,6 +1565,7 @@ static struct hvm_function_table __initdata
vmx_function_table = {
     .guest_x86_mode       = vmx_guest_x86_mode,
     .get_segment_register = vmx_get_segment_register,
     .set_segment_register = vmx_set_segment_register,
+    .read_selector        = vmx_read_selector,
     .get_shadow_gs_base   = vmx_get_shadow_gs_base,
     .update_host_cr3      = vmx_update_host_cr3,
     .update_guest_cr      = vmx_update_guest_cr,
diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
index 58a92a5..1eac9ff 100644
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -1835,8 +1835,6 @@ static inline uint64_t guest_misc_enable(uint64_t val)
     }                                                                       \
     (eip) += sizeof(_x); _x; })
 
-#define read_sreg(regs, sr) read_segment_register(sr)
-
 static int is_cpufreq_controller(struct domain *d)
 {
     return ((cpufreq_controller == FREQCTL_dom0_kernel) &&
@@ -1881,7 +1879,7 @@ static int emulate_privileged_op(struct cpu_user_regs
*regs)
         goto fail;
 
     /* emulating only opcodes not allowing SS to be default */
-    data_sel = read_sreg(regs, ds);
+    data_sel = read_segment_register(v, regs, ds);
 
     /* Legacy prefixes. */
     for ( i = 0; i < 8; i++, rex == opcode || (rex = 0) )
@@ -1899,17 +1897,17 @@ static int emulate_privileged_op(struct cpu_user_regs
*regs)
             data_sel = regs->cs;
             continue;
         case 0x3e: /* DS override */
-            data_sel = read_sreg(regs, ds);
+            data_sel = read_segment_register(v, regs, ds);
             continue;
         case 0x26: /* ES override */
-            data_sel = read_sreg(regs, es);
+            data_sel = read_segment_register(v, regs, es);
             continue;
         case 0x64: /* FS override */
-            data_sel = read_sreg(regs, fs);
+            data_sel = read_segment_register(v, regs, fs);
             lm_ovr = lm_seg_fs;
             continue;
         case 0x65: /* GS override */
-            data_sel = read_sreg(regs, gs);
+            data_sel = read_segment_register(v, regs, gs);
             lm_ovr = lm_seg_gs;
             continue;
         case 0x36: /* SS override */
@@ -1956,7 +1954,7 @@ static int emulate_privileged_op(struct cpu_user_regs
*regs)
 
         if ( !(opcode & 2) )
         {
-            data_sel = read_sreg(regs, es);
+            data_sel = read_segment_register(v, regs, es);
             lm_ovr = lm_seg_none;
         }
 
@@ -2689,22 +2687,22 @@ static void emulate_gate_op(struct cpu_user_regs *regs)
             ASSERT(opnd_sel);
             continue;
         case 0x3e: /* DS override */
-            opnd_sel = read_sreg(regs, ds);
+            opnd_sel = read_segment_register(v, regs, ds);
             if ( !opnd_sel )
                 opnd_sel = dpl;
             continue;
         case 0x26: /* ES override */
-            opnd_sel = read_sreg(regs, es);
+            opnd_sel = read_segment_register(v, regs, es);
             if ( !opnd_sel )
                 opnd_sel = dpl;
             continue;
         case 0x64: /* FS override */
-            opnd_sel = read_sreg(regs, fs);
+            opnd_sel = read_segment_register(v, regs, fs);
             if ( !opnd_sel )
                 opnd_sel = dpl;
             continue;
         case 0x65: /* GS override */
-            opnd_sel = read_sreg(regs, gs);
+            opnd_sel = read_segment_register(v, regs, gs);
             if ( !opnd_sel )
                 opnd_sel = dpl;
             continue;
@@ -2757,7 +2755,7 @@ static void emulate_gate_op(struct cpu_user_regs *regs)
                             switch ( modrm & 7 )
                             {
                             default:
-                                opnd_sel = read_sreg(regs, ds);
+                                opnd_sel = read_segment_register(v, regs, ds);
                                 break;
                             case 4: case 5:
                                 opnd_sel = regs->ss;
@@ -2785,7 +2783,7 @@ static void emulate_gate_op(struct cpu_user_regs *regs)
                             break;
                         }
                         if ( !opnd_sel )
-                            opnd_sel = read_sreg(regs, ds);
+                            opnd_sel = read_segment_register(v, regs, ds);
                         switch ( modrm & 7 )
                         {
                         case 0: case 2: case 4:
diff --git a/xen/arch/x86/x86_64/traps.c b/xen/arch/x86/x86_64/traps.c
index 8644aaf..3dfb309 100644
--- a/xen/arch/x86/x86_64/traps.c
+++ b/xen/arch/x86/x86_64/traps.c
@@ -123,10 +123,10 @@ void show_registers(struct cpu_user_regs *regs)
         fault_crs[0] = read_cr0();
         fault_crs[3] = read_cr3();
         fault_crs[4] = read_cr4();
-        fault_regs.ds = read_segment_register(ds);
-        fault_regs.es = read_segment_register(es);
-        fault_regs.fs = read_segment_register(fs);
-        fault_regs.gs = read_segment_register(gs);
+        fault_regs.ds = read_segment_register(v, regs, ds);
+        fault_regs.es = read_segment_register(v, regs, es);
+        fault_regs.fs = read_segment_register(v, regs, fs);
+        fault_regs.gs = read_segment_register(v, regs, gs);
     }
 
     print_xen_info();
@@ -239,10 +239,10 @@ void do_double_fault(struct cpu_user_regs *regs)
     crs[2] = read_cr2();
     crs[3] = read_cr3();
     crs[4] = read_cr4();
-    regs->ds = read_segment_register(ds);
-    regs->es = read_segment_register(es);
-    regs->fs = read_segment_register(fs);
-    regs->gs = read_segment_register(gs);
+    regs->ds = read_segment_register(current, regs, ds);
+    regs->es = read_segment_register(current, regs, es);
+    regs->fs = read_segment_register(current, regs, fs);
+    regs->gs = read_segment_register(current, regs, gs);
 
     printk("CPU:    %d\n", cpu);
     _show_registers(regs, crs, CTXT_hypervisor, NULL);
diff --git a/xen/include/asm-x86/hvm/hvm.h b/xen/include/asm-x86/hvm/hvm.h
index 3376418..9437ff7 100644
--- a/xen/include/asm-x86/hvm/hvm.h
+++ b/xen/include/asm-x86/hvm/hvm.h
@@ -114,6 +114,7 @@ struct hvm_function_table {
                                  struct segment_register *reg);
     void (*set_segment_register)(struct vcpu *v, enum x86_segment seg,
                                  struct segment_register *reg);
+    u16 (*read_selector)(struct vcpu *v, enum x86_segment seg);
     unsigned long (*get_shadow_gs_base)(struct vcpu *v);
 
     /* 
@@ -321,6 +322,11 @@ hvm_set_segment_register(struct vcpu *v, enum x86_segment
seg,
     hvm_funcs.set_segment_register(v, seg, reg);
 }
 
+static inline u16 hvm_read_selector(struct vcpu *v, enum x86_segment seg)
+{
+    return hvm_funcs.read_selector(v, seg);
+}
+
 static inline unsigned long hvm_get_shadow_gs_base(struct vcpu *v)
 {
     return hvm_funcs.get_shadow_gs_base(v);
diff --git a/xen/include/asm-x86/system.h b/xen/include/asm-x86/system.h
index 6ab7d56..34f7bbd 100644
--- a/xen/include/asm-x86/system.h
+++ b/xen/include/asm-x86/system.h
@@ -4,10 +4,21 @@
 #include <xen/lib.h>
 #include <xen/bitops.h>
 
-#define read_segment_register(name)                             \
-({  u16 __sel;                                                  \
-    asm volatile ( "movw %%" STR(name) ",%0" :
"=r" (__sel) );  \
-    __sel;                                                      \
+/*
+ * We need vcpu because during context switch, going from PV to PVH,
+ * in save_segments() current has been updated to next, and no longer pointing
+ * to the PV, but the intention is to get selector for the PV. Checking
+ * is_pvh_vcpu(current) will yield incorrect results in such a case.
+ */
+#define read_segment_register(vcpu, regs, name)                   \
+({  u16 __sel;                                                    \
+    struct cpu_user_regs *_regs = (regs);                         \
+                                                                  \
+    if ( is_pvh_vcpu(vcpu) && guest_mode(_regs) )                 \
+        __sel = hvm_read_selector(vcpu, x86_seg_##name);          \
+    else                                                          \
+        asm volatile ( "movw %%" #name ",%0" :
"=r" (__sel) );    \
+    __sel;                                                        \
 })
 
 #define wbinvd() \
-- 
1.7.9.5
George Dunlap
2013-Sep-23  16:49 UTC
[PATCH RFC v13 12/20] pvh: read_descriptor for PVH guests.
This is in preparation for enabling emulated privops for PVH guests.
This one unfortunately has a bit more of an impedance mismatch:
* For PV, the selector is hard-coded in by passing #name in the
  #define.  For PVH, we have to do a switch statement.
* For PV, given the desrciptor, it can read the resulting base, limit,
  &c directly; for PVH, we have to read the values currently loaded in the
vmcs.
This leads to a rather awkward construct where we *both* read the
descriptor, *and* specify a selector, and introduce an intermediate function,
read_descriptor_sel.
Unfortunately, without introducing a rather pointless switch()
statement to the PV path similar to the one in the PVH path, there''s
no way to make this less awkward.
Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>
Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
v13: Removed stray space
CC: Jan Beulich <jbeulich@suse.com>
CC: Tim Deegan <tim@xen.org>
CC: Keir Fraser <keir@xen.org>
---
 xen/arch/x86/traps.c       |   73 ++++++++++++++++++++++++++++++++++++++------
 xen/include/asm-x86/desc.h |    4 ++-
 2 files changed, 66 insertions(+), 11 deletions(-)
diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
index 1eac9ff..0463697 100644
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -1518,6 +1518,49 @@ static int read_descriptor(unsigned int sel,
     return 1;
 }
 
+static int read_descriptor_sel(unsigned int sel,
+                               enum x86_segment which_sel,
+                               struct vcpu *v,
+                               const struct cpu_user_regs *regs,
+                               unsigned long *base,
+                               unsigned long *limit,
+                               unsigned int *ar,
+                               unsigned int vm86attr)
+{
+    struct segment_register seg;
+    bool_t long_mode;
+
+    if ( !is_pvh_vcpu(v) )
+        return read_descriptor(sel, v, regs, base, limit, ar, vm86attr);
+
+    hvm_get_segment_register(v, x86_seg_cs, &seg);
+    long_mode = seg.attr.fields.l;
+
+    if ( which_sel != x86_seg_cs )
+        hvm_get_segment_register(v, which_sel, &seg);
+
+    /* "ar" is returned packed as in segment_attributes_t. Fix it up.
*/
+    *ar = seg.attr.bytes;
+    *ar = (*ar & 0xff) | ((*ar & 0xf00) << 4);
+    *ar <<= 8;
+
+    if ( long_mode )
+    {
+        *limit = ~0UL;
+
+        if ( which_sel < x86_seg_fs )
+        {
+            *base = 0UL;
+            return 1;
+        }
+   }
+   else
+       *limit = seg.limit;
+
+   *base = seg.base;
+    return 1;
+}
+
 static int read_gate_descriptor(unsigned int gate_sel,
                                 const struct vcpu *v,
                                 unsigned int *sel,
@@ -1845,6 +1888,7 @@ static int is_cpufreq_controller(struct domain *d)
 
 static int emulate_privileged_op(struct cpu_user_regs *regs)
 {
+    enum x86_segment which_sel;
     struct vcpu *v = current;
     unsigned long *reg, eip = regs->eip;
     u8 opcode, modrm_reg = 0, modrm_rm = 0, rep_prefix = 0, lock = 0, rex = 0;
@@ -1867,9 +1911,10 @@ static int emulate_privileged_op(struct cpu_user_regs
*regs)
     void (*io_emul)(struct cpu_user_regs *) __attribute__((__regparm__(1)));
     uint64_t val, msr_content;
 
-    if ( !read_descriptor(regs->cs, v, regs,
-                          &code_base, &code_limit, &ar,
-                          _SEGMENT_CODE|_SEGMENT_S|_SEGMENT_DPL|_SEGMENT_P) )
+    if ( !read_descriptor_sel(regs->cs, x86_seg_cs, v, regs,
+                              &code_base, &code_limit, &ar,
+                              _SEGMENT_CODE|_SEGMENT_S|
+                              _SEGMENT_DPL|_SEGMENT_P) )
         goto fail;
     op_default = op_bytes = (ar & (_SEGMENT_L|_SEGMENT_DB)) ? 4 : 2;
     ad_default = ad_bytes = (ar & _SEGMENT_L) ? 8 : op_default;
@@ -1880,6 +1925,7 @@ static int emulate_privileged_op(struct cpu_user_regs
*regs)
 
     /* emulating only opcodes not allowing SS to be default */
     data_sel = read_segment_register(v, regs, ds);
+    which_sel = x86_seg_ds;
 
     /* Legacy prefixes. */
     for ( i = 0; i < 8; i++, rex == opcode || (rex = 0) )
@@ -1895,23 +1941,29 @@ static int emulate_privileged_op(struct cpu_user_regs
*regs)
             continue;
         case 0x2e: /* CS override */
             data_sel = regs->cs;
+            which_sel = x86_seg_cs;
             continue;
         case 0x3e: /* DS override */
             data_sel = read_segment_register(v, regs, ds);
+            which_sel = x86_seg_ds;
             continue;
         case 0x26: /* ES override */
             data_sel = read_segment_register(v, regs, es);
+            which_sel = x86_seg_es;
             continue;
         case 0x64: /* FS override */
             data_sel = read_segment_register(v, regs, fs);
+            which_sel = x86_seg_fs;
             lm_ovr = lm_seg_fs;
             continue;
         case 0x65: /* GS override */
             data_sel = read_segment_register(v, regs, gs);
+            which_sel = x86_seg_gs;
             lm_ovr = lm_seg_gs;
             continue;
         case 0x36: /* SS override */
             data_sel = regs->ss;
+            which_sel = x86_seg_ss;
             continue;
         case 0xf0: /* LOCK */
             lock = 1;
@@ -1955,15 +2007,16 @@ static int emulate_privileged_op(struct cpu_user_regs
*regs)
         if ( !(opcode & 2) )
         {
             data_sel = read_segment_register(v, regs, es);
+            which_sel = x86_seg_es;
             lm_ovr = lm_seg_none;
         }
 
         if ( !(ar & _SEGMENT_L) )
         {
-            if ( !read_descriptor(data_sel, v, regs,
-                                  &data_base, &data_limit, &ar,
-                                  _SEGMENT_WR|_SEGMENT_S|_SEGMENT_DPL|
-                                  _SEGMENT_P) )
+            if ( !read_descriptor_sel(data_sel, which_sel, v, regs,
+                                      &data_base, &data_limit, &ar,
+                                      _SEGMENT_WR|_SEGMENT_S|_SEGMENT_DPL|
+                                      _SEGMENT_P) )
                 goto fail;
             if ( !(ar & _SEGMENT_S) ||
                  !(ar & _SEGMENT_P) ||
@@ -1993,9 +2046,9 @@ static int emulate_privileged_op(struct cpu_user_regs
*regs)
                 }
             }
             else
-                read_descriptor(data_sel, v, regs,
-                                &data_base, &data_limit, &ar,
-                                0);
+                read_descriptor_sel(data_sel, which_sel, v, regs,
+                                    &data_base, &data_limit, &ar,
+                                    0);
             data_limit = ~0UL;
             ar = _SEGMENT_WR|_SEGMENT_S|_SEGMENT_DPL|_SEGMENT_P;
         }
diff --git a/xen/include/asm-x86/desc.h b/xen/include/asm-x86/desc.h
index 354b889..041e9d3 100644
--- a/xen/include/asm-x86/desc.h
+++ b/xen/include/asm-x86/desc.h
@@ -38,7 +38,9 @@
 
 #ifndef __ASSEMBLY__
 
-#define GUEST_KERNEL_RPL(d) (is_pv_32bit_domain(d) ? 1 : 3)
+/* PVH 32bitfixme : see emulate_gate_op call from do_general_protection */
+#define GUEST_KERNEL_RPL(d) ({ ASSERT(is_pv_domain(d)); \
+                               is_pv_32bit_domain(d) ? 1 : 3; })
 
 /* Fix up the RPL of a guest segment selector. */
 #define __fixup_guest_selector(d, sel)                             \
-- 
1.7.9.5
George Dunlap
2013-Sep-23  16:49 UTC
[PATCH RFC v13 13/20] pvh: Set up more PV stuff in set_info_guest
Allow the guest to set up a few more things when bringing up a vcpu.
This includes cr3 and gs_base.
Also set up wallclock, and only initialize a vcpu once.
Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>
Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
---
v13:
 - Get rid of separate pvh call, and fold gs_base write into hvm_set_info_guest
 - Check pvh parameters for validity at the top of arch_set_info_guest
 - Fix comment about PVH and set_info_guest
CC: Jan Beulich <jbeulich@suse.com>
CC: Tim Deegan <tim@xen.org>
CC: Keir Fraser <keir@xen.org>
---
 xen/arch/x86/domain.c             |   33 +++++++++++++++++++++++++++++++--
 xen/arch/x86/hvm/vmx/vmx.c        |    7 ++++++-
 xen/include/asm-x86/hvm/hvm.h     |    6 +++---
 xen/include/public/arch-x86/xen.h |   11 +++++++++++
 4 files changed, 51 insertions(+), 6 deletions(-)
diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 4b4c66d..b2bb3fc 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -699,6 +699,18 @@ int arch_set_info_guest(
                 return -EINVAL;
         }
     }
+    else if ( is_pvh_vcpu(v) )
+    {
+        /* PVH 32bitfixme */
+        ASSERT(!compat);
+
+        if ( c(ctrlreg[1]) || c(ldt_base) || c(ldt_ents) ||
+             c(user_regs.cs) || c(user_regs.ss) || c(user_regs.es) ||
+             c(user_regs.ds) || c(user_regs.fs) || c(user_regs.gs) ||
+             c.nat->gdt_ents || c.nat->fs_base || c.nat->gs_base_user
)
+            return -EINVAL;
+
+    }
 
     v->fpu_initialised = !!(flags & VGCF_I387_VALID);
 
@@ -736,8 +748,24 @@ int arch_set_info_guest(
 
     if ( has_hvm_container_vcpu(v) )
     {
-        hvm_set_info_guest(v);
-        goto out;
+        hvm_set_info_guest(v, compat ? 0 : c.nat->gs_base_kernel);
+
+        if ( is_hvm_vcpu(v) || v->is_initialised )
+            goto out;
+
+        cr3_gfn = xen_cr3_to_pfn(c.nat->ctrlreg[3]);
+        cr3_page = get_page_from_gfn(d, cr3_gfn, NULL, P2M_ALLOC);
+
+        v->arch.cr3 = page_to_maddr(cr3_page);
+        v->arch.hvm_vcpu.guest_cr[3] = c.nat->ctrlreg[3];
+        
+        ASSERT(paging_mode_enabled(d));
+
+        paging_update_paging_modes(v);
+
+        update_cr3(v);
+
+        goto pvh_skip_pv_stuff;
     }
 
     init_int80_direct_trap(v);
@@ -947,6 +975,7 @@ int arch_set_info_guest(
 
     update_cr3(v);
 
+ pvh_skip_pv_stuff:
     if ( v->vcpu_id == 0 )
         update_domain_wallclock_time(d);
 
diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index 5392223..e9aa06b 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -1426,7 +1426,7 @@ static void vmx_set_uc_mode(struct vcpu *v)
     hvm_asid_flush_vcpu(v);
 }
 
-static void vmx_set_info_guest(struct vcpu *v)
+static void vmx_set_info_guest(struct vcpu *v, uint64_t gs_base_kernel)
 {
     unsigned long intr_shadow;
 
@@ -1451,6 +1451,11 @@ static void vmx_set_info_guest(struct vcpu *v)
         __vmwrite(GUEST_INTERRUPTIBILITY_INFO, intr_shadow);
     }
 
+    /* PVH 32bitfixme */
+    if ( is_pvh_vcpu(v) )
+        __vmwrite(GUEST_GS_BASE, gs_base_kernel);
+
+
     vmx_vmcs_exit(v);
 }
 
diff --git a/xen/include/asm-x86/hvm/hvm.h b/xen/include/asm-x86/hvm/hvm.h
index 9437ff7..cc401cc 100644
--- a/xen/include/asm-x86/hvm/hvm.h
+++ b/xen/include/asm-x86/hvm/hvm.h
@@ -158,7 +158,7 @@ struct hvm_function_table {
     int (*msr_write_intercept)(unsigned int msr, uint64_t msr_content);
     void (*invlpg_intercept)(unsigned long vaddr);
     void (*set_uc_mode)(struct vcpu *v);
-    void (*set_info_guest)(struct vcpu *v);
+    void (*set_info_guest)(struct vcpu *v, uint64_t gs_base_kernel);
     void (*set_rdtsc_exiting)(struct vcpu *v, bool_t);
 
     /* Nested HVM */
@@ -437,10 +437,10 @@ void *hvm_map_guest_frame_rw(unsigned long gfn, bool_t
permanent);
 void *hvm_map_guest_frame_ro(unsigned long gfn, bool_t permanent);
 void hvm_unmap_guest_frame(void *p, bool_t permanent);
 
-static inline void hvm_set_info_guest(struct vcpu *v)
+static inline void hvm_set_info_guest(struct vcpu *v, uint64_t gs_base_kernel)
 {
     if ( hvm_funcs.set_info_guest )
-        return hvm_funcs.set_info_guest(v);
+        return hvm_funcs.set_info_guest(v, gs_base_kernel);
 }
 
 int hvm_debug_op(struct vcpu *v, int32_t op);
diff --git a/xen/include/public/arch-x86/xen.h
b/xen/include/public/arch-x86/xen.h
index 908ef87..42b818e 100644
--- a/xen/include/public/arch-x86/xen.h
+++ b/xen/include/public/arch-x86/xen.h
@@ -154,6 +154,17 @@ typedef uint64_t tsc_timestamp_t; /* RDTSC timestamp */
 /*
  * The following is all CPU context. Note that the fpu_ctxt block is filled 
  * in by FXSAVE if the CPU has feature FXSR; otherwise FSAVE is used.
+ *
+ * Also note that when calling DOMCTL_setvcpucontext and VCPU_initialise
+ * for HVM and PVH guests, not all information in this structure is updated:
+ *
+ * - For HVM guests, the structures read include: fpu_ctxt (if
+ * VGCT_I387_VALID is set), flags, user_regs, debugreg[*]
+ *
+ * - PVH guests are the same as HVM guests, but additionally set cr3,
+ * and for 64-bit guests, gs_base_kernel.  Additionally, the following
+ * entries must be 0: ctrlreg[1], ldt_base, ldg_ents, user_regs.{cs,
+ * ss, es, ds, fs, gs), gdt_ents, fs_base, and gs_base_user.
  */
 struct vcpu_guest_context {
     /* FPU registers come first so they can be aligned for FXSAVE/FXRSTOR. */
-- 
1.7.9.5
George Dunlap
2013-Sep-23  16:49 UTC
[PATCH RFC v13 14/20] pvh: Use PV handlers for emulated forced invalid ops, cpuid, and IO
This means putting hooks into the vmexit handlers to call them instead
of the HVM ones.
It also means calling raw_copy_from_guest() rather than
copy_from_user(), and delivering page faults during PV emulation to
guests using the HVM injection path rather than the PV one for PVH
guests.
Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>
Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
---
v13:
 - Remove unnecessary privilege check in PIO path, update related comment
 - Move ? and : to end of line rather than beginning, as requested
CC: Jan Beulich <jbeulich@suse.com>
CC: Tim Deegan <tim@xen.org>
CC: Keir Fraser <keir@xen.org>
---
 xen/arch/x86/hvm/vmx/vmx.c      |   45 +++++++++++++++++++++++++++------------
 xen/arch/x86/traps.c            |   28 ++++++++++++++++++------
 xen/include/asm-x86/processor.h |    2 ++
 xen/include/asm-x86/traps.h     |    3 +++
 4 files changed, 57 insertions(+), 21 deletions(-)
diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index e9aa06b..c200409 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -56,6 +56,7 @@
 #include <asm/apic.h>
 #include <asm/hvm/nestedhvm.h>
 #include <asm/event.h>
+#include <asm/traps.h>
 
 enum handler_return { HNDL_done, HNDL_unhandled, HNDL_exception_raised };
 
@@ -2645,8 +2646,16 @@ void vmx_vmexit_handler(struct cpu_user_regs *regs)
             /* Already handled above. */
             break;
         case TRAP_invalid_op:
-            HVMTRACE_1D(TRAP, vector);
-            vmx_vmexit_ud_intercept(regs);
+            if ( is_pvh_vcpu(v) )
+            {
+                if ( !emulate_forced_invalid_op(regs) )
+                    hvm_inject_hw_exception(TRAP_invalid_op,
HVM_DELIVER_NO_ERROR_CODE);
+            }
+            else
+            {
+                HVMTRACE_1D(TRAP, vector);
+                vmx_vmexit_ud_intercept(regs);
+            }
             break;
         default:
             HVMTRACE_1D(TRAP, vector);
@@ -2695,8 +2704,8 @@ void vmx_vmexit_handler(struct cpu_user_regs *regs)
         break;
     }
     case EXIT_REASON_CPUID:
+        is_pvh_vcpu(v) ? pv_cpuid(regs) : vmx_do_cpuid(regs);
         update_guest_eip(); /* Safe: CPUID */
-        vmx_do_cpuid(regs);
         break;
     case EXIT_REASON_HLT:
         update_guest_eip(); /* Safe: HLT */
@@ -2844,21 +2853,29 @@ void vmx_vmexit_handler(struct cpu_user_regs *regs)
     }
 
     case EXIT_REASON_IO_INSTRUCTION:
-        exit_qualification = __vmread(EXIT_QUALIFICATION);
-        if ( exit_qualification & 0x10 )
+        if ( is_pvh_vcpu(v) )
         {
-            /* INS, OUTS */
-            if ( !handle_mmio() )
-                hvm_inject_hw_exception(TRAP_gp_fault, 0);
+            if ( !emulate_privileged_op(regs) )
+                hvm_inject_hw_exception(TRAP_gp_fault, regs->error_code);
         }
         else
         {
-            /* IN, OUT */
-            uint16_t port = (exit_qualification >> 16) & 0xFFFF;
-            int bytes = (exit_qualification & 0x07) + 1;
-            int dir = (exit_qualification & 0x08) ? IOREQ_READ :
IOREQ_WRITE;
-            if ( handle_pio(port, bytes, dir) )
-                update_guest_eip(); /* Safe: IN, OUT */
+            exit_qualification = __vmread(EXIT_QUALIFICATION);
+            if ( exit_qualification & 0x10 )
+            {
+                /* INS, OUTS */
+                if ( !handle_mmio() )
+                    hvm_inject_hw_exception(TRAP_gp_fault, 0);
+            }
+            else
+            {
+                /* IN, OUT */
+                uint16_t port = (exit_qualification >> 16) & 0xFFFF;
+                int bytes = (exit_qualification & 0x07) + 1;
+                int dir = (exit_qualification & 0x08) ? IOREQ_READ :
IOREQ_WRITE;
+                if ( handle_pio(port, bytes, dir) )
+                    update_guest_eip(); /* Safe: IN, OUT */
+            }
         }
         break;
 
diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
index 0463697..68ff4f6 100644
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -738,7 +738,7 @@ int cpuid_hypervisor_leaves( uint32_t idx, uint32_t sub_idx,
     return 1;
 }
 
-static void pv_cpuid(struct cpu_user_regs *regs)
+void pv_cpuid(struct cpu_user_regs *regs)
 {
     uint32_t a, b, c, d;
 
@@ -915,7 +915,7 @@ static int emulate_invalid_rdtscp(struct cpu_user_regs
*regs)
     return EXCRET_fault_fixed;
 }
 
-static int emulate_forced_invalid_op(struct cpu_user_regs *regs)
+int emulate_forced_invalid_op(struct cpu_user_regs *regs)
 {
     char sig[5], instr[2];
     unsigned long eip, rc;
@@ -923,7 +923,7 @@ static int emulate_forced_invalid_op(struct cpu_user_regs
*regs)
     eip = regs->eip;
 
     /* Check for forced emulation signature: ud2 ; .ascii "xen". */
-    if ( (rc = copy_from_user(sig, (char *)eip, sizeof(sig))) != 0 )
+    if ( (rc = raw_copy_from_guest(sig, (char *)eip, sizeof(sig))) != 0 )
     {
         propagate_page_fault(eip + sizeof(sig) - rc, 0);
         return EXCRET_fault_fixed;
@@ -933,7 +933,7 @@ static int emulate_forced_invalid_op(struct cpu_user_regs
*regs)
     eip += sizeof(sig);
 
     /* We only emulate CPUID. */
-    if ( ( rc = copy_from_user(instr, (char *)eip, sizeof(instr))) != 0 )
+    if ( ( rc = raw_copy_from_guest(instr, (char *)eip, sizeof(instr))) != 0 )
     {
         propagate_page_fault(eip + sizeof(instr) - rc, 0);
         return EXCRET_fault_fixed;
@@ -1074,7 +1074,7 @@ static void reserved_bit_page_fault(
     show_execution_state(regs);
 }
 
-void propagate_page_fault(unsigned long addr, u16 error_code)
+static void pv_inject_page_fault(unsigned long addr, u16 error_code)
 {
     struct trap_info *ti;
     struct vcpu *v = current;
@@ -1108,6 +1108,13 @@ void propagate_page_fault(unsigned long addr, u16
error_code)
         reserved_bit_page_fault(addr, guest_cpu_user_regs());
 }
 
+void propagate_page_fault(unsigned long addr, u16 error_code)
+{
+    is_pvh_vcpu(current) ?
+        hvm_inject_page_fault(error_code, addr) :
+        pv_inject_page_fault(addr, error_code);
+}
+
 static int handle_gdt_ldt_mapping_fault(
     unsigned long offset, struct cpu_user_regs *regs)
 {
@@ -1626,6 +1633,13 @@ static int guest_io_okay(
     int user_mode = !(v->arch.flags & TF_kernel_mode);
 #define TOGGLE_MODE() if ( user_mode ) toggle_guest_mode(v)
 
+    /*
+     * For PVH, privilege checks are done by the hardware.  If we''ve
+     * gotten here, then the access is good.
+     */
+    if ( is_pvh_vcpu(v) )
+        return 1;
+
     if ( !vm86_mode(regs) &&
          (v->arch.pv_vcpu.iopl >= (guest_kernel_mode(v, regs) ? 1 : 3)) )
         return 1;
@@ -1871,7 +1885,7 @@ static inline uint64_t guest_misc_enable(uint64_t val)
         _ptr = (unsigned int)_ptr;                                          \
     if ( (limit) < sizeof(_x) - 1 || (eip) > (limit) - (sizeof(_x) - 1) )
\
         goto fail;                                                          \
-    if ( (_rc = copy_from_user(&_x, (type *)_ptr, sizeof(_x))) != 0 )      
\
+    if ( (_rc = raw_copy_from_guest(&_x, (type *)_ptr, sizeof(_x))) != 0 ) 
\
     {                                                                       \
         propagate_page_fault(_ptr + sizeof(_x) - _rc, 0);                   \
         goto skip;                                                          \
@@ -1886,7 +1900,7 @@ static int is_cpufreq_controller(struct domain *d)
 
 #include "x86_64/mmconfig.h"
 
-static int emulate_privileged_op(struct cpu_user_regs *regs)
+int emulate_privileged_op(struct cpu_user_regs *regs)
 {
     enum x86_segment which_sel;
     struct vcpu *v = current;
diff --git a/xen/include/asm-x86/processor.h b/xen/include/asm-x86/processor.h
index 5cdacc7..22a9653 100644
--- a/xen/include/asm-x86/processor.h
+++ b/xen/include/asm-x86/processor.h
@@ -566,6 +566,8 @@ void microcode_set_module(unsigned int);
 int microcode_update(XEN_GUEST_HANDLE_PARAM(const_void), unsigned long len);
 int microcode_resume_cpu(int cpu);
 
+void pv_cpuid(struct cpu_user_regs *regs);
+
 #endif /* !__ASSEMBLY__ */
 
 #endif /* __ASM_X86_PROCESSOR_H */
diff --git a/xen/include/asm-x86/traps.h b/xen/include/asm-x86/traps.h
index 82cbcee..20c9151 100644
--- a/xen/include/asm-x86/traps.h
+++ b/xen/include/asm-x86/traps.h
@@ -49,4 +49,7 @@ extern int guest_has_trap_callback(struct domain *d, uint16_t
vcpuid,
 extern int send_guest_trap(struct domain *d, uint16_t vcpuid,
 				unsigned int trap_nr);
 
+int emulate_privileged_op(struct cpu_user_regs *regs);
+int emulate_forced_invalid_op(struct cpu_user_regs *regs);
+
 #endif /* ASM_TRAP_H */
-- 
1.7.9.5
George Dunlap
2013-Sep-23  16:49 UTC
[PATCH RFC v13 15/20] pvh: Disable 32-bit guest support for now
To be implemented.
Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>
Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
---
Acked-by: Jan Beulich <jbeulich@suse.com>
CC: Jan Beulich <jbeulich@suse.com>
CC: Tim Deegan <tim@xen.org>
CC: Keir Fraser <keir@xen.org>
---
 xen/arch/x86/domain.c |    8 ++++++++
 1 file changed, 8 insertions(+)
diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index b2bb3fc..9da9aa1 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -339,6 +339,14 @@ int switch_compat(struct domain *d)
 
     if ( d == NULL )
         return -EINVAL;
+
+    if ( is_pvh_domain(d) )
+    {
+        printk(XENLOG_INFO
+               "Xen currently does not support 32bit PVH guests\n");
+        return -EINVAL;
+    }
+
     if ( !may_switch_mode(d) )
         return -EACCES;
     if ( is_pv_32on64_domain(d) )
-- 
1.7.9.5
George Dunlap
2013-Sep-23  16:49 UTC
[PATCH RFC v13 16/20] pvh: Restrict tsc_mode to NEVER_EMULATE for now
To be implemented.
Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>
Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
---
v13:
 - Only print a warning if tsc_mode != TSC_MODE_DEFAULT
CC: Jan Beulich <jbeulich@suse.com>
CC: Tim Deegan <tim@xen.org>
CC: Keir Fraser <keir@xen.org>
---
 xen/arch/x86/time.c |   16 ++++++++++++++++
 1 file changed, 16 insertions(+)
diff --git a/xen/arch/x86/time.c b/xen/arch/x86/time.c
index c1bbd50..087b301 100644
--- a/xen/arch/x86/time.c
+++ b/xen/arch/x86/time.c
@@ -1827,6 +1827,22 @@ void tsc_set_info(struct domain *d,
         d->arch.vtsc = 0;
         return;
     }
+    if ( is_pvh_domain(d) )
+    {
+        /* PVH fixme: support more tsc modes. */
+        switch ( tsc_mode )
+        {
+        case TSC_MODE_NEVER_EMULATE:
+            break;
+        default:
+            printk(XENLOG_WARNING
+                   "PVH currently does not support tsc emulation. Setting
timer_mode = never_emulate\n");
+            /* FALLTHRU */
+        case TSC_MODE_DEFAULT:
+            tsc_mode = TSC_MODE_NEVER_EMULATE;
+            break;
+        }
+    }
 
     switch ( d->arch.tsc_mode = tsc_mode )
     {
-- 
1.7.9.5
George Dunlap
2013-Sep-23  16:49 UTC
[PATCH RFC v13 17/20] pvh: Disable debug traps when doing pv emulation for PVH domains
I''m not exactly sure why these were singled out in Mukesh''s
patch
series.  One accesses a pv_cpu structure, but the other just
accesses debugreg.
Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>
Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
CC: Jan Beulich <jbeulich@suse.com>
CC: Tim Deegan <tim@xen.org>
CC: Keir Fraser <keir@xen.org>
---
 xen/arch/x86/traps.c |    9 +++++++++
 1 file changed, 9 insertions(+)
diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
index 68ff4f6..b643d56 100644
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -461,6 +461,11 @@ static void instruction_done(
     struct cpu_user_regs *regs, unsigned long eip, unsigned int bpmatch)
 {
     regs->eip = eip;
+
+    /* PVH fixme: Make debug traps work */
+    if ( is_pvh_vcpu(current) )
+        return;
+
     regs->eflags &= ~X86_EFLAGS_RF;
     if ( bpmatch || (regs->eflags & X86_EFLAGS_TF) )
     {
@@ -477,6 +482,10 @@ static unsigned int check_guest_io_breakpoint(struct vcpu
*v,
     unsigned int width, i, match = 0;
     unsigned long start;
 
+    /* PVH fixme: support io breakpoint. */
+    if ( is_pvh_vcpu(v) )
+        return 0;
+
     if ( !(v->arch.debugreg[5]) ||
          !(v->arch.pv_vcpu.ctrlreg[4] & X86_CR4_DE) )
         return 0;
-- 
1.7.9.5
Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com> Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com> CC: Jan Beulich <jan.beulich@suse.com> CC: Tim Deegan <tim@xen.org> CC: Keir Fraser <keir@xen.org> --- docs/misc/pvh-readme.txt | 59 ++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 59 insertions(+) create mode 100644 docs/misc/pvh-readme.txt diff --git a/docs/misc/pvh-readme.txt b/docs/misc/pvh-readme.txt new file mode 100644 index 0000000..fc63f7d --- /dev/null +++ b/docs/misc/pvh-readme.txt @@ -0,0 +1,59 @@ + +PVH : an x86 PV guest running in an HVM container. + +See: http://blog.xen.org/index.php/2012/10/23/the-paravirtualization-spectrum-part-1-the-ends-of-the-spectrum/ + +At the moment HAP is required for PVH. + +At present the only PVH guest is an x86 64bit PV linux. Patches are at: + git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen.git + +A PVH guest kernel must support following features, as defined for linux +in arch/x86/xen/xen-head.S: + + #define FEATURES_PVH "|writable_descriptor_tables" \ + "|auto_translated_physmap" \ + "|supervisor_mode_kernel" \ + "|hvm_callback_vector" + +In a nutshell, the guest uses auto translate, ie, p2m is managed by +xen, it uses event callback and not vlapic emulation, the page tables +are native, so mmu_update hcall is N/A for PVH guest. Moreover IDT is +native, so set_trap_table hcall is also N/A for a PVH guest. For a +full list of hcalls supported for PVH, see pvh_hypercall64_table in +arch/x86/hvm/hvm.c in xen. From the ABI prespective, it''s mostly a PV +guest with auto translate, although it does use hvm_op for setting +callback vector. + +The initial phase targets the booting of a 64bit UP/SMP linux guest in PVH +mode. This is done by adding: pvh=1 in the config file. xl, and not xm, is +supported. Phase I patches are broken into three parts: + - xen changes for booting of 64bit PVH guest + - tools changes for creating a PVH guest + - boot of 64bit dom0 in PVH mode. + +Following fixme''s exist in the code: + - Add support for more memory types in arch/x86/hvm/mtrr.c. + - arch/x86/time.c: support more tsc modes. + - check_guest_io_breakpoint(): check/add support for IO breakpoint. + - implement arch_get_info_guest() for pvh. + - verify bp matching on emulated instructions will work same as HVM for + PVH guest. see instruction_done() and check_guest_io_breakpoint(). + +Following remain to be done for PVH: + - Investigate what else needs to be done for VMI support. + - AMD port. + - 32bit PVH guest support in both linux and xen. Xen changes are tagged + "32bitfixme". + - Add support for monitoring guest behavior. See hvm_memory_event* functions + in hvm.c + - vcpu hotplug support + - Live migration of PVH guests. + - Avail PVH dom0 of posted interrupts. (This will be a big win). + + +Note, any emails to me must be cc''d to xen devel mailing list. OTOH, please +cc me on PVH emails to the xen devel mailing list. + +Mukesh Rathor +mukesh.rathor [at] oracle [dot] com -- 1.7.9.5
George Dunlap
2013-Sep-23  16:49 UTC
[PATCH RFC v13 19/20] PVH xen tools: libxc changes to build a PVH guest.
From: Mukesh Rathor <mukesh.rathor@oracle.com>
Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
CC: Ian Jackson <ian.jackson@citrix.com>
CC: Ian Campbell <ian.campbell@citrix.com>
---
 tools/libxc/xc_dom.h      |    1 +
 tools/libxc/xc_dom_core.c |    9 +++++
 tools/libxc/xc_dom_x86.c  |   90 +++++++++++++++++++++++++++++----------------
 3 files changed, 69 insertions(+), 31 deletions(-)
diff --git a/tools/libxc/xc_dom.h b/tools/libxc/xc_dom.h
index 86e23ee..5168bcd 100644
--- a/tools/libxc/xc_dom.h
+++ b/tools/libxc/xc_dom.h
@@ -130,6 +130,7 @@ struct xc_dom_image {
     domid_t console_domid;
     domid_t xenstore_domid;
     xen_pfn_t shared_info_mfn;
+    int pvh_enabled;
 
     xc_interface *xch;
     domid_t guest_domid;
diff --git a/tools/libxc/xc_dom_core.c b/tools/libxc/xc_dom_core.c
index 0f367f6..faa7e0f 100644
--- a/tools/libxc/xc_dom_core.c
+++ b/tools/libxc/xc_dom_core.c
@@ -766,6 +766,15 @@ int xc_dom_parse_image(struct xc_dom_image *dom)
         goto err;
     }
 
+    if ( dom->pvh_enabled )
+    {
+        const char *pvh_features = "writable_descriptor_tables|"
+                                   "auto_translated_physmap|"
+                                   "supervisor_mode_kernel|"
+                                   "hvm_callback_vector";
+        elf_xen_parse_features(pvh_features, dom->f_requested, NULL);
+    }
+
     /* check features */
     for ( i = 0; i < XENFEAT_NR_SUBMAPS; i++ )
     {
diff --git a/tools/libxc/xc_dom_x86.c b/tools/libxc/xc_dom_x86.c
index 7cc2ff2..f36dfbb 100644
--- a/tools/libxc/xc_dom_x86.c
+++ b/tools/libxc/xc_dom_x86.c
@@ -407,7 +407,8 @@ static int setup_pgtables_x86_64(struct xc_dom_image *dom)
         pgpfn = (addr - dom->parms.virt_base) >> PAGE_SHIFT_X86;
         l1tab[l1off]              pfn_to_paddr(xc_dom_p2m_guest(dom, pgpfn)) |
L1_PROT;
-        if ( (addr >= dom->pgtables_seg.vstart) && 
+        if ( (!dom->pvh_enabled)                && 
+             (addr >= dom->pgtables_seg.vstart) && 
              (addr < dom->pgtables_seg.vend) )
             l1tab[l1off] &= ~_PAGE_RW; /* page tables are r/o */
 
@@ -588,6 +589,13 @@ static int vcpu_x86_32(struct xc_dom_image *dom, void *ptr)
 
     DOMPRINTF_CALLED(dom->xch);
 
+    if ( dom->pvh_enabled )
+    {
+        xc_dom_panic(dom->xch, XC_INTERNAL_ERROR,
+                     "%s: PVH not supported for 32bit guests.",
__FUNCTION__);
+        return -1;
+    }
+
     /* clear everything */
     memset(ctxt, 0, sizeof(*ctxt));
 
@@ -630,12 +638,6 @@ static int vcpu_x86_64(struct xc_dom_image *dom, void *ptr)
     /* clear everything */
     memset(ctxt, 0, sizeof(*ctxt));
 
-    ctxt->user_regs.ds = FLAT_KERNEL_DS_X86_64;
-    ctxt->user_regs.es = FLAT_KERNEL_DS_X86_64;
-    ctxt->user_regs.fs = FLAT_KERNEL_DS_X86_64;
-    ctxt->user_regs.gs = FLAT_KERNEL_DS_X86_64;
-    ctxt->user_regs.ss = FLAT_KERNEL_SS_X86_64;
-    ctxt->user_regs.cs = FLAT_KERNEL_CS_X86_64;
     ctxt->user_regs.rip = dom->parms.virt_entry;
     ctxt->user_regs.rsp          dom->parms.virt_base +
(dom->bootstack_pfn + 1) * PAGE_SIZE_X86;
@@ -643,15 +645,25 @@ static int vcpu_x86_64(struct xc_dom_image *dom, void
*ptr)
         dom->parms.virt_base + (dom->start_info_pfn) * PAGE_SIZE_X86;
     ctxt->user_regs.rflags = 1 << 9; /* Interrupt Enable */
 
-    ctxt->kernel_ss = ctxt->user_regs.ss;
-    ctxt->kernel_sp = ctxt->user_regs.esp;
-
     ctxt->flags = VGCF_in_kernel_X86_64 | VGCF_online_X86_64;
     cr3_pfn = xc_dom_p2m_guest(dom, dom->pgtables_seg.pfn);
     ctxt->ctrlreg[3] = xen_pfn_to_cr3_x86_64(cr3_pfn);
     DOMPRINTF("%s: cr3: pfn 0x%" PRIpfn " mfn 0x%" PRIpfn
"",
               __FUNCTION__, dom->pgtables_seg.pfn, cr3_pfn);
 
+    if ( dom->pvh_enabled )
+        return 0;
+
+    ctxt->user_regs.ds = FLAT_KERNEL_DS_X86_64;
+    ctxt->user_regs.es = FLAT_KERNEL_DS_X86_64;
+    ctxt->user_regs.fs = FLAT_KERNEL_DS_X86_64;
+    ctxt->user_regs.gs = FLAT_KERNEL_DS_X86_64;
+    ctxt->user_regs.ss = FLAT_KERNEL_SS_X86_64;
+    ctxt->user_regs.cs = FLAT_KERNEL_CS_X86_64;
+
+    ctxt->kernel_ss = ctxt->user_regs.ss;
+    ctxt->kernel_sp = ctxt->user_regs.esp;
+
     return 0;
 }
 
@@ -752,7 +764,7 @@ int arch_setup_meminit(struct xc_dom_image *dom)
     rc = x86_compat(dom->xch, dom->guest_domid, dom->guest_type);
     if ( rc )
         return rc;
-    if ( xc_dom_feature_translated(dom) )
+    if ( xc_dom_feature_translated(dom) && !dom->pvh_enabled )
     {
         dom->shadow_enabled = 1;
         rc = x86_shadow(dom->xch, dom->guest_domid);
@@ -828,6 +840,38 @@ int arch_setup_bootearly(struct xc_dom_image *dom)
     return 0;
 }
 
+/*
+ * Map grant table frames into guest physmap. PVH manages grant during boot
+ * via HVM mechanisms.
+ */
+static int map_grant_table_frames(struct xc_dom_image *dom)
+{
+    int i, rc;
+
+    if ( dom->pvh_enabled )
+        return 0;
+
+    for ( i = 0; ; i++ )
+    {
+        rc = xc_domain_add_to_physmap(dom->xch, dom->guest_domid,
+                                      XENMAPSPACE_grant_table,
+                                      i, dom->total_pages + i);
+        if ( rc != 0 )
+        {
+            if ( (i > 0) && (errno == EINVAL) )
+            {
+                DOMPRINTF("%s: %d grant tables mapped", __FUNCTION__,
i);
+                break;
+            }
+            xc_dom_panic(dom->xch, XC_INTERNAL_ERROR,
+                         "%s: mapping grant tables failed "
"(pfn=0x%" PRIpfn
+                         ", rc=%d)", __FUNCTION__,
dom->total_pages + i, rc);
+            return rc;
+        }
+    }
+    return 0;
+}
+
 int arch_setup_bootlate(struct xc_dom_image *dom)
 {
     static const struct {
@@ -866,7 +910,6 @@ int arch_setup_bootlate(struct xc_dom_image *dom)
     else
     {
         /* paravirtualized guest with auto-translation */
-        int i;
 
         /* Map shared info frame into guest physmap. */
         rc = xc_domain_add_to_physmap(dom->xch, dom->guest_domid,
@@ -880,25 +923,10 @@ int arch_setup_bootlate(struct xc_dom_image *dom)
             return rc;
         }
 
-        /* Map grant table frames into guest physmap. */
-        for ( i = 0; ; i++ )
-        {
-            rc = xc_domain_add_to_physmap(dom->xch, dom->guest_domid,
-                                          XENMAPSPACE_grant_table,
-                                          i, dom->total_pages + i);
-            if ( rc != 0 )
-            {
-                if ( (i > 0) && (errno == EINVAL) )
-                {
-                    DOMPRINTF("%s: %d grant tables mapped",
__FUNCTION__, i);
-                    break;
-                }
-                xc_dom_panic(dom->xch, XC_INTERNAL_ERROR,
-                             "%s: mapping grant tables failed "
"(pfn=0x%"
-                             PRIpfn ", rc=%d)", __FUNCTION__,
dom->total_pages + i, rc);
-                return rc;
-            }
-        }
+        rc = map_grant_table_frames(dom);
+        if ( rc != 0 )
+            return rc;
+
         shinfo = dom->shared_info_pfn;
     }
 
-- 
1.7.9.5
George Dunlap
2013-Sep-23  16:50 UTC
[PATCH RFC v13 20/20] PVH xen tools: libxl changes to create a PVH guest.
From: Mukesh Rathor <mukesh.rathor@oracle.com>
Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>
---
v13 (gwd):
 - Added XEN_DOMCTL_CDF_pvh_guest flag
CC: Ian Jackson <ian.jackson@citrix.com>
CC: Ian Campbell <ian.campbell@citrix.com>
---
 docs/man/xl.cfg.pod.5        |    3 +++
 tools/libxl/libxl.h          |    6 ++++++
 tools/libxl/libxl_create.c   |   13 +++++++++++++
 tools/libxl/libxl_dom.c      |    2 ++
 tools/libxl/libxl_internal.h |    1 +
 tools/libxl/libxl_types.idl  |    1 +
 tools/libxl/libxl_x86.c      |    4 +++-
 tools/libxl/xl_cmdimpl.c     |    1 +
 8 files changed, 30 insertions(+), 1 deletion(-)
diff --git a/docs/man/xl.cfg.pod.5 b/docs/man/xl.cfg.pod.5
index 769767b..69a4cc1 100644
--- a/docs/man/xl.cfg.pod.5
+++ b/docs/man/xl.cfg.pod.5
@@ -634,6 +634,9 @@ if your particular guest kernel does not require this
behaviour then
 it is safe to allow this to be enabled but you may wish to disable it
 anyway.
 
+=item B<pvh=BOOLEAN>
+Selects whether to run this PV guest in an HVM container. Default is 0.
+
 =back
 
 =head2 Fully-virtualised (HVM) Guest Specific Options
diff --git a/tools/libxl/libxl.h b/tools/libxl/libxl.h
index 4cab294..2801da4 100644
--- a/tools/libxl/libxl.h
+++ b/tools/libxl/libxl.h
@@ -355,6 +355,12 @@
  */
 #define LIBXL_HAVE_SPICE_VDAGENT 1
 
+/*
+ * LIBXL_HAVE_CREATEINFO_PVH
+ * If this is defined, then libxl supports creation of a PVH guest.
+ */
+#define LIBXL_HAVE_CREATEINFO_PVH 1
+
 /* Functions annotated with LIBXL_EXTERNAL_CALLERS_ONLY may not be
  * called from within libxl itself. Callers outside libxl, who
  * do not #include libxl_internal.h, are fine. */
diff --git a/tools/libxl/libxl_create.c b/tools/libxl/libxl_create.c
index 7567238..a8e773a 100644
--- a/tools/libxl/libxl_create.c
+++ b/tools/libxl/libxl_create.c
@@ -33,6 +33,9 @@ int libxl__domain_create_info_setdefault(libxl__gc *gc,
     if (c_info->type == LIBXL_DOMAIN_TYPE_HVM) {
         libxl_defbool_setdefault(&c_info->hap, true);
         libxl_defbool_setdefault(&c_info->oos, true);
+    } else {
+        libxl_defbool_setdefault(&c_info->pvh, false);
+        libxl_defbool_setdefault(&c_info->hap,
libxl_defbool_val(c_info->pvh));
     }
 
     libxl_defbool_setdefault(&c_info->run_hotplug_scripts, true);
@@ -349,6 +352,8 @@ int libxl__domain_build(libxl__gc *gc,
 
         break;
     case LIBXL_DOMAIN_TYPE_PV:
+        state->pvh_enabled = libxl_defbool_val(d_config->c_info.pvh);
+
         ret = libxl__build_pv(gc, domid, info, state);
         if (ret)
             goto out;
@@ -408,6 +413,14 @@ int libxl__domain_make(libxl__gc *gc,
libxl_domain_create_info *info,
         flags |= XEN_DOMCTL_CDF_hvm_guest;
         flags |= libxl_defbool_val(info->hap) ? XEN_DOMCTL_CDF_hap : 0;
         flags |= libxl_defbool_val(info->oos) ? 0 : XEN_DOMCTL_CDF_oos_off;
+    } else if ( libxl_defbool_val(info->pvh) ) {
+        flags |= XEN_DOMCTL_CDF_pvh_guest;
+        if ( !libxl_defbool_val(info->hap) ) {
+            LOG(ERROR, "HAP must be on for PVH");
+            rc = ERROR_INVAL;
+            goto out;
+        }
+        flags |= XEN_DOMCTL_CDF_hap;
     }
     *domid = -1;
 
diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c
index 6e2252a..89e919e 100644
--- a/tools/libxl/libxl_dom.c
+++ b/tools/libxl/libxl_dom.c
@@ -338,6 +338,8 @@ int libxl__build_pv(libxl__gc *gc, uint32_t domid,
         return ERROR_FAIL;
     }
 
+    dom->pvh_enabled = state->pvh_enabled;
+
     LOG(DEBUG, "pv kernel mapped %d path %s\n",
state->pv_kernel.mapped, state->pv_kernel.path);
     if (state->pv_kernel.mapped) {
         ret = xc_dom_kernel_mem(dom,
diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
index f051d91..441ec66 100644
--- a/tools/libxl/libxl_internal.h
+++ b/tools/libxl/libxl_internal.h
@@ -886,6 +886,7 @@ typedef struct {
     libxl__file_reference pv_kernel;
     libxl__file_reference pv_ramdisk;
     const char * pv_cmdline;
+    bool pvh_enabled;
 } libxl__domain_build_state;
 
 _hidden int libxl__build_pre(libxl__gc *gc, uint32_t domid,
diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl
index 049dbb5..e83b70d 100644
--- a/tools/libxl/libxl_types.idl
+++ b/tools/libxl/libxl_types.idl
@@ -257,6 +257,7 @@ libxl_domain_create_info =
Struct("domain_create_info",[
     ("platformdata", libxl_key_value_list),
     ("poolid",       uint32),
     ("run_hotplug_scripts",libxl_defbool),
+    ("pvh",          libxl_defbool),
     ], dir=DIR_IN)
 
 MemKB = UInt(64, init_val = "LIBXL_MEMKB_DEFAULT")
diff --git a/tools/libxl/libxl_x86.c b/tools/libxl/libxl_x86.c
index a78c91d..87a8110 100644
--- a/tools/libxl/libxl_x86.c
+++ b/tools/libxl/libxl_x86.c
@@ -290,7 +290,9 @@ int libxl__arch_domain_create(libxl__gc *gc,
libxl_domain_config *d_config,
     if (rtc_timeoffset)
         xc_domain_set_time_offset(ctx->xch, domid, rtc_timeoffset);
 
-    if (d_config->b_info.type == LIBXL_DOMAIN_TYPE_HVM) {
+    if (d_config->b_info.type == LIBXL_DOMAIN_TYPE_HVM ||
+        libxl_defbool_val(d_config->c_info.pvh)) {
+
         unsigned long shadow;
         shadow = (d_config->b_info.shadow_memkb + 1023) / 1024;
         xc_shadow_control(ctx->xch, domid,
XEN_DOMCTL_SHADOW_OP_SET_ALLOCATION, NULL, 0, &shadow, 0, NULL);
diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
index 3d7eaad..7f36423 100644
--- a/tools/libxl/xl_cmdimpl.c
+++ b/tools/libxl/xl_cmdimpl.c
@@ -610,6 +610,7 @@ static void parse_config_data(const char *config_source,
         !strncmp(buf, "hvm", strlen(buf)))
         c_info->type = LIBXL_DOMAIN_TYPE_HVM;
 
+    xlu_cfg_get_defbool(config, "pvh", &c_info->pvh, 0);
     xlu_cfg_get_defbool(config, "hap", &c_info->hap, 0);
 
     if (xlu_cfg_replace_string (config, "name", &c_info->name,
0)) {
-- 
1.7.9.5
Hi, At 17:49 +0100 on 23 Sep (1379958582), George Dunlap wrote:> There are many functions where PVH requires some code in common with > HVM. Rearrange some of these functions so that the code is together. > > In general, the HVM code that PVH also uses includes: > - cacheattr functionality > - paging > - hvm_funcs > - hvm_assert_evtchn_irq tasklet > - tm_list > - hvm_params > > And code that PVH shares with PV but not with PVH: > - updating the domain wallclock > - setting v->is_initialized > > There should be no end-to-end changes in behavior.> --- a/xen/arch/x86/domain.c > +++ b/xen/arch/x86/domain.c > @@ -945,17 +945,16 @@ int arch_set_info_guest( > > clear_bit(_VPF_in_reset, &v->pause_flags); > > - if ( v->vcpu_id == 0 ) > - update_domain_wallclock_time(d); > - > - /* Don''t redo final setup */ > - v->is_initialised = 1; > - > if ( paging_mode_enabled(d) ) > paging_update_paging_modes(v); > > update_cr3(v); > > + if ( v->vcpu_id == 0 ) > + update_domain_wallclock_time(d); > + > + /* Don''t redo final setup */ > + v->is_initialised = 1;This will change behaviour for at least some cases, e.g. update_cr3() -> sh_update_cr3() -> ``if ( !is_hvm_domain(d) && !v->is_initialised ) ...'''' Tim.
Tim Deegan
2013-Sep-26  11:53 UTC
Re: [PATCH RFC v13 03/20] Introduce pv guest type and has_hvm_container macros
Hi, At 17:49 +0100 on 23 Sep (1379958583), George Dunlap wrote:> The goal of this patch is to classify conditionals more clearly, as to > whether they relate to pv guests, hvm-only guests, or guests with an > "hvm container" (which will eventually include PVH).- speculative wombling begins - Reading this for the (apparently) 13th time, it strikes me that this distinction feels wrong, like entities are being multiplied beyond necessity. A PVH guest is basically a HVM guest that starts with paging enabled and uses event channel instead of virtual APIC. I''m not sure this needs any special treatment from Xen. We can supply a PVH guest with an APIC and if it never uses it, fine. And we can supply all HVM guests with any extra hypercalls that PVH needs (for vcpu bringup &c). Things like not having a qemu process to serve ioreqs can be arranged by the toolstack. This is from a position of ignorance, of course, and I''m happy to be corrected by the people who''ve actually been hacking on the code. - normal wombling resumes - Tim.
Ian Campbell
2013-Sep-26  12:54 UTC
Re: [PATCH RFC v13 03/20] Introduce pv guest type and has_hvm_container macros
On Thu, 2013-09-26 at 12:53 +0100, Tim Deegan wrote:> A PVH guest is basically a HVM guest that starts with paging enabled and > uses event channel instead of virtual APIC. > > I''m not sure this needs any special treatment from Xen. We can supply a > PVH guest with an APIC and if it never uses it, fine.It''s not entirely out of the questions that hardware APIC virtualisation might change that trade off before too long... Don''t hvm guests have an apic=0 setting anyway?
George Dunlap
2013-Sep-26  13:46 UTC
Re: [PATCH RFC v13 03/20] Introduce pv guest type and has_hvm_container macros
On 26/09/13 12:53, Tim Deegan wrote:> Hi, > > At 17:49 +0100 on 23 Sep (1379958583), George Dunlap wrote: >> The goal of this patch is to classify conditionals more clearly, as to >> whether they relate to pv guests, hvm-only guests, or guests with an >> "hvm container" (which will eventually include PVH). > - speculative wombling begins - > > Reading this for the (apparently) 13th time, it strikes me that this > distinction feels wrong, like entities are being multiplied beyond > necessity. > > A PVH guest is basically a HVM guest that starts with paging enabled and > uses event channel instead of virtual APIC. > > I''m not sure this needs any special treatment from Xen. We can supply a > PVH guest with an APIC and if it never uses it, fine. And we can supply > all HVM guests with any extra hypercalls that PVH needs (for vcpu > bringup &c). Things like not having a qemu process to serve ioreqs can > be arranged by the toolstack. > > This is from a position of ignorance, of course, and I''m happy to be > corrected by the people who''ve actually been hacking on the code.As I''ve been putting the series together, I''ve wondered the same myself. There''s more to not having a qemu than just not starting qemu, of course; a lot of the HVM codepaths assume that there is one and will dereference structures which will be empty. But that should be fairly easy to fix. Having the PV e820 map makes sense, but you can file that under "make available to hvm guests as well". The main things left are the PV paths for cpuid, PIO, and forced emulated ops. I haven''t taken a close look at how these differ, or what benefit is gained from using the PV version over the HVM version. The other big thing is being able to set up more state when bringing up a vcpu. One reason to disable unneeded things is the security angle: there is a risk, no matter how small, that there is somehow an exploitable bug in our emulated APIC / PIT code; so running with the code entirely disabled is more secure than running with it enabled but just not used. But it''s possible that we could just add a few options to the existing "HVM mode" that would implement all of this. -George
Jan Beulich
2013-Sep-26  15:22 UTC
Re: [PATCH RFC v13 05/20] pvh: Disable unneeded features of HVM containers
>>> On 23.09.13 at 18:49, George Dunlap <george.dunlap@eu.citrix.com> wrote: > @@ -578,6 +592,9 @@ int hvm_domain_initialise(struct domain *d) > > void hvm_domain_relinquish_resources(struct domain *d) > { > + if ( is_pvh_domain(d) ) > + return; > +Iirc in an earlier patch you''re allocating ->arch.hvm_domain.params. Which is supposed to be freed here, and hence would now be leaked for PVH domains.> @@ -1114,6 +1135,14 @@ int hvm_vcpu_initialise(struct vcpu *v) > > v->arch.hvm_vcpu.inject_trap.vector = -1; > > + if ( is_pvh_vcpu(v) ) > + { > + v->arch.hvm_vcpu.hcall_64bit = 1; /* PVH 32bitfixme. */ > + /* This for hvm_long_mode_enabled(v). */ > + v->arch.hvm_vcpu.guest_efer = EFER_SCE | EFER_LMA | EFER_LME;Please fix the comment (perhaps just missing an "is"). Jan
>>> On 23.09.13 at 18:49, George Dunlap <george.dunlap@eu.citrix.com> wrote: > Changes: > * Enforce HAP mode for now > * Disable exits related to virtual interrupts or emulated APICs > * Disable changing paging mode > - "unrestricted guest" (i.e., real mode for EPT) disabled > - write guest EFER disabled > * Start in 64-bit mode > * Force TSC mode to be "none" > * Paging mode update to happen in arch_set_info_guest > > Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com> > Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com> > --- > v13: > - Fix up default cr0 settings > - Get rid of some unnecessary PVH-related changes > - Return EOPNOTSUPP instead of ENOSYS if hardware features are not present > - Remove an unnecessary variable from pvh_check_requirements > CC: Jan Beulich <jbeulich@suse.com> > CC: Tim Deegan <tim@xen.org> > CC: Keir Fraser <keir@xen.org> > --- > xen/arch/x86/hvm/vmx/vmcs.c | 130 > +++++++++++++++++++++++++++++++++++++++++-- > 1 file changed, 126 insertions(+), 4 deletions(-) > > diff --git a/xen/arch/x86/hvm/vmx/vmcs.c b/xen/arch/x86/hvm/vmx/vmcs.c > index cf54d18..53fccdf 100644 > --- a/xen/arch/x86/hvm/vmx/vmcs.c > +++ b/xen/arch/x86/hvm/vmx/vmcs.c > @@ -828,6 +828,60 @@ void virtual_vmcs_vmwrite(void *vvmcs, u32 > vmcs_encoding, u64 val) > virtual_vmcs_exit(vvmcs); > } > > +static int pvh_check_requirements(struct vcpu *v) > +{ > + u64 required; > + > + /* Check for required hardware features */ > + if ( !cpu_has_vmx_ept ) > + { > + printk(XENLOG_G_INFO "PVH: CPU does not have EPT support\n"); > + return -EOPNOTSUPP; > + } > + if ( !cpu_has_vmx_pat ) > + { > + printk(XENLOG_G_INFO "PVH: CPU does not have PAT support\n"); > + return -EOPNOTSUPP; > + } > + if ( !cpu_has_vmx_msr_bitmap ) > + { > + printk(XENLOG_G_INFO "PVH: CPU does not have msr bitmap\n"); > + return -EOPNOTSUPP; > + } > + if ( !cpu_has_vmx_secondary_exec_control ) > + { > + printk(XENLOG_G_INFO "CPU Secondary exec is required to run PVH\n"); > + return -EOPNOTSUPP; > + }Up to here the checks are VMX specific, and hence belong in a VMX specific file, ...> + required = X86_CR4_PAE | X86_CR4_VMXE | X86_CR4_OSFXSR; > + if ( (real_cr4_to_pv_guest_cr4(mmu_cr4_features) & required) != required ) > + { > + printk(XENLOG_G_INFO "PVH: required CR4 features not available:%lx\n", > + required); > + return -EOPNOTSUPP; > + } > + > + /* Check for required configuration options */ > + if ( !paging_mode_hap(v->domain) ) > + { > + printk(XENLOG_G_INFO "HAP is required for PVH guest.\n"); > + return -EINVAL; > + } > + /* > + * If rdtsc exiting is turned on and it goes thru emulate_privileged_op, > + * then pv_vcpu.ctrlreg must be added to the pvh struct. > + */ > + if ( v->domain->arch.vtsc ) > + { > + printk(XENLOG_G_INFO > + "At present PVH only supports the default timer mode\n"); > + return -EINVAL; > + }... but all of these are pretty generic (apart from the X86_CR4_VMXE in the CR4 mask checked above, but I wonder whether that shouldn''t be checked much earlier - for HVM guests no such check exists here afaik).> @@ -874,7 +935,32 @@ static int construct_vmcs(struct vcpu *v) > /* Do not enable Monitor Trap Flag unless start single step debug */ > v->arch.hvm_vmx.exec_control &= ~CPU_BASED_MONITOR_TRAP_FLAG; > > + if ( is_pvh_domain(d) ) > + { > + /* Disable virtual apics, TPR */ > + v->arch.hvm_vmx.secondary_exec_control &= > + ~(SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES > + | SECONDARY_EXEC_APIC_REGISTER_VIRT > + | SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY); > + v->arch.hvm_vmx.exec_control &= ~CPU_BASED_TPR_SHADOW; > + > + /* Disable wbinvd (only necessary for MMIO), > + * unrestricted guest (real mode for EPT) */To not confuse the reader (I got confused the last time through, and now again) this should say "Disable wbinvd exiting ...". Jan
Jan Beulich
2013-Sep-26  15:30 UTC
Re: [PATCH RFC v13 07/20] pvh: Do not allow PVH guests to change paging modes
>>> On 23.09.13 at 18:49, George Dunlap <george.dunlap@eu.citrix.com> wrote: > --- a/xen/arch/x86/hvm/hvm.c > +++ b/xen/arch/x86/hvm/hvm.c > @@ -1765,6 +1765,16 @@ int hvm_set_cr0(unsigned long value) > (value & (X86_CR0_PE | X86_CR0_PG)) == X86_CR0_PG ) > goto gpf; > > + > +Please drop these bogus blank lines ...> + /* A pvh is not expected to change to real mode. */ > + if ( is_pvh_vcpu(v) > + && (value & (X86_CR0_PE | X86_CR0_PG)) != (X86_CR0_PG | X86_CR0_PE) ) > + { > + printk(XENLOG_G_WARNING > + "PVH attempting to turn off PE/PG. CR0:%lx\n", value); > + goto gpf; > + }... and perhaps insert one here instead.> if ( (value & X86_CR0_PG) && !(old_value & X86_CR0_PG) ) > { > if ( v->arch.hvm_vcpu.guest_efer & EFER_LME )Jan
Konrad Rzeszutek Wilk
2013-Sep-26  15:31 UTC
Re: [PATCH RFC v13 03/20] Introduce pv guest type and has_hvm_container macros
On Thu, Sep 26, 2013 at 02:46:04PM +0100, George Dunlap wrote:> On 26/09/13 12:53, Tim Deegan wrote: > >Hi, > > > >At 17:49 +0100 on 23 Sep (1379958583), George Dunlap wrote: > >>The goal of this patch is to classify conditionals more clearly, as to > >>whether they relate to pv guests, hvm-only guests, or guests with an > >>"hvm container" (which will eventually include PVH). > > - speculative wombling begins - > > > >Reading this for the (apparently) 13th time, it strikes me that this > >distinction feels wrong, like entities are being multiplied beyond > >necessity. > > > >A PVH guest is basically a HVM guest that starts with paging enabled and > >uses event channel instead of virtual APIC. > > > >I''m not sure this needs any special treatment from Xen. We can supply a > >PVH guest with an APIC and if it never uses it, fine. And we can supply > >all HVM guests with any extra hypercalls that PVH needs (for vcpu > >bringup &c). Things like not having a qemu process to serve ioreqs can > >be arranged by the toolstack. > > > >This is from a position of ignorance, of course, and I''m happy to be > >corrected by the people who''ve actually been hacking on the code. > > As I''ve been putting the series together, I''ve wondered the same myself. > > There''s more to not having a qemu than just not starting qemu, of > course; a lot of the HVM codepaths assume that there is one and will > dereference structures which will be empty. But that should be > fairly easy to fix. > > Having the PV e820 map makes sense, but you can file that under > "make available to hvm guests as well".Which I think Mr. Gordon (sp?) had been doing for some of the PCI NVidia passthrough stuff. Basically making e820_host work for HVM guests.> > The main things left are the PV paths for cpuid, PIO, and forced > emulated ops. I haven''t taken a close look at how these differ, or > what benefit is gained from using the PV version over the HVM > version. > > The other big thing is being able to set up more state when bringing > up a vcpu. > > One reason to disable unneeded things is the security angle: there > is a risk, no matter how small, that there is somehow an exploitable > bug in our emulated APIC / PIT code; so running with the code > entirely disabled is more secure than running with it enabled but > just not used. > > But it''s possible that we could just add a few options to the > existing "HVM mode" that would implement all of this. > > -George > > _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xen.org > http://lists.xen.org/xen-devel
>>> On 23.09.13 at 18:49, George Dunlap <george.dunlap@eu.citrix.com> wrote: > --- a/xen/arch/x86/hvm/mtrr.c > +++ b/xen/arch/x86/hvm/mtrr.c > @@ -578,6 +578,7 @@ int32_t hvm_set_mem_pinned_cacheattr( > { > struct hvm_mem_pinned_cacheattr_range *range; > > + /* Side note: A PVH guest writes to MSR_IA32_CR_PAT natively. */ > if ( !((type == PAT_TYPE_UNCACHABLE) || > (type == PAT_TYPE_WRCOMB) || > (type == PAT_TYPE_WRTHROUGH) ||I always wondered what this comment is trying to tell the reader. Apart from that: Reviewed-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich
2013-Sep-26  15:36 UTC
Re: [PATCH RFC v13 11/20] pvh: Support read_segment_register for PVH
>>> On 23.09.13 at 18:49, George Dunlap <george.dunlap@eu.citrix.com> wrote: > +static inline u16 hvm_read_selector(struct vcpu *v, enum x86_segment seg) > +{ > + return hvm_funcs.read_selector(v, seg); > +}This is sort of a latent crash as long as SVM doesn''t have a handler for this (i.e. if some other, non-PVH caller appears). Apart from that Reviewed-by: Jan Beulich <jbeulich@suse.com> Jan
Jan Beulich
2013-Sep-26  15:43 UTC
Re: [PATCH RFC v13 13/20] pvh: Set up more PV stuff in set_info_guest
>>> On 23.09.13 at 18:49, George Dunlap <george.dunlap@eu.citrix.com> wrote: > @@ -699,6 +699,18 @@ int arch_set_info_guest( > return -EINVAL; > } > } > + else if ( is_pvh_vcpu(v) ) > + { > + /* PVH 32bitfixme */ > + ASSERT(!compat); > + > + if ( c(ctrlreg[1]) || c(ldt_base) || c(ldt_ents) || > + c(user_regs.cs) || c(user_regs.ss) || c(user_regs.es) || > + c(user_regs.ds) || c(user_regs.fs) || c(user_regs.gs) || > + c.nat->gdt_ents || c.nat->fs_base || c.nat->gs_base_user ) > + return -EINVAL; > + > + } > > v->fpu_initialised = !!(flags & VGCF_I387_VALID); > > @@ -736,8 +748,24 @@ int arch_set_info_guest( > > if ( has_hvm_container_vcpu(v) ) > { > - hvm_set_info_guest(v); > - goto out; > + hvm_set_info_guest(v, compat ? 0 : c.nat->gs_base_kernel);I had pointed this out to Mukesh already: If you want to imply kernel mode here, the checks above should verify that the kernel mode flag is set in flags.> + * > + * Also note that when calling DOMCTL_setvcpucontext and VCPU_initialise > + * for HVM and PVH guests, not all information in this structure is updated: > + * > + * - For HVM guests, the structures read include: fpu_ctxt (if > + * VGCT_I387_VALID is set), flags, user_regs, debugreg[*] > + * > + * - PVH guests are the same as HVM guests, but additionally set cr3, > + * and for 64-bit guests, gs_base_kernel. Additionally, the following > + * entries must be 0: ctrlreg[1], ldt_base, ldg_ents, user_regs.{cs,ldt_ents> + * ss, es, ds, fs, gs), gdt_ents, fs_base, and gs_base_user. */ > struct vcpu_guest_context { > /* FPU registers come first so they can be aligned for FXSAVE/FXRSTOR. */Jan
Jan Beulich
2013-Sep-26  15:52 UTC
Re: [PATCH RFC v13 14/20] pvh: Use PV handlers for emulated forced invalid ops, cpuid, and IO
>>> On 23.09.13 at 18:49, George Dunlap <george.dunlap@eu.citrix.com> wrote: > v13: > - Remove unnecessary privilege check in PIO path, update related commentWhile this is the correct thing, ...> case EXIT_REASON_IO_INSTRUCTION: > - exit_qualification = __vmread(EXIT_QUALIFICATION); > - if ( exit_qualification & 0x10 ) > + if ( is_pvh_vcpu(v) ) > { > - /* INS, OUTS */ > - if ( !handle_mmio() ) > - hvm_inject_hw_exception(TRAP_gp_fault, 0); > + if ( !emulate_privileged_op(regs) ).. we still shouldn''t do this blindly. It is a latent security issue to do double decoding on an instruction: The hardware decoded it, and did all guest side checks. If a malicious guest forges the instruction to be something else by the time emulate_privileged_op() gets to decode it, we may induce a guest side security violation. As the decoded information is available, we ought to use it here.> + hvm_inject_hw_exception(TRAP_gp_fault, regs->error_code); > } > else > { > - /* IN, OUT */ > - uint16_t port = (exit_qualification >> 16) & 0xFFFF; > - int bytes = (exit_qualification & 0x07) + 1; > - int dir = (exit_qualification & 0x08) ? IOREQ_READ : IOREQ_WRITE; > - if ( handle_pio(port, bytes, dir) ) > - update_guest_eip(); /* Safe: IN, OUT */ > + exit_qualification = __vmread(EXIT_QUALIFICATION); > + if ( exit_qualification & 0x10 ) > + { > + /* INS, OUTS */ > + if ( !handle_mmio() )With this moved into a !PVH code path, is there any path that can still actively reach handle_mmio() for a PVH guest? If not, the check an earlier patch put there propbably ought to become an ASSERT(). Also, to reduce the cde churn due to this patch, if you added a "break;" to the end of the if() path above, most of the indentation only changes here would go away. Jan
Jan Beulich
2013-Sep-26  15:55 UTC
Re: [PATCH RFC v13 17/20] pvh: Disable debug traps when doing pv emulation for PVH domains
>>> On 23.09.13 at 18:49, George Dunlap <george.dunlap@eu.citrix.com> wrote: > I''m not exactly sure why these were singled out in Mukesh''s patch > series. One accesses a pv_cpu structure, but the other just > accesses debugreg.Right, but as long as he promises to deal with them subsequently one way or another, that fine for the moment.> Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com> > Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>Reviewed-by: Jan Beulich <jbeulich@suse.com>> CC: Tim Deegan <tim@xen.org> > CC: Keir Fraser <keir@xen.org> > --- > xen/arch/x86/traps.c | 9 +++++++++ > 1 file changed, 9 insertions(+) > > diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c > index 68ff4f6..b643d56 100644 > --- a/xen/arch/x86/traps.c > +++ b/xen/arch/x86/traps.c > @@ -461,6 +461,11 @@ static void instruction_done( > struct cpu_user_regs *regs, unsigned long eip, unsigned int bpmatch) > { > regs->eip = eip; > + > + /* PVH fixme: Make debug traps work */ > + if ( is_pvh_vcpu(current) ) > + return; > + > regs->eflags &= ~X86_EFLAGS_RF; > if ( bpmatch || (regs->eflags & X86_EFLAGS_TF) ) > { > @@ -477,6 +482,10 @@ static unsigned int check_guest_io_breakpoint(struct > vcpu *v, > unsigned int width, i, match = 0; > unsigned long start; > > + /* PVH fixme: support io breakpoint. */ > + if ( is_pvh_vcpu(v) ) > + return 0; > + > if ( !(v->arch.debugreg[5]) || > !(v->arch.pv_vcpu.ctrlreg[4] & X86_CR4_DE) ) > return 0; > -- > 1.7.9.5
Tim Deegan
2013-Sep-26  16:24 UTC
Re: [PATCH RFC v13 03/20] Introduce pv guest type and has_hvm_container macros
At 14:46 +0100 on 26 Sep (1380206764), George Dunlap wrote:> There''s more to not having a qemu than just not starting qemu, of > course; a lot of the HVM codepaths assume that there is one and will > dereference structures which will be empty. But that should be fairly > easy to fix.True. And I suspect with various patchsets around that allow for multiple ioreq-servicing backends we can allow for there to be none.> Having the PV e820 map makes sense, but you can file that under "make > available to hvm guests as well". > > The main things left are the PV paths for cpuid, PIO, and forced > emulated ops. I haven''t taken a close look at how these differ, or what > benefit is gained from using the PV version over the HVM version.I would be inclined to use the HVM paths for PIO and emulated ops; the cpuid interface might need fudging, I guess.> The other big thing is being able to set up more state when bringing up > a vcpu.Sure. But again, probably OK to expose a fuller setvcpucontext to all HVM guests.> One reason to disable unneeded things is the security angle: there is a > risk, no matter how small, that there is somehow an exploitable bug in > our emulated APIC / PIT code; so running with the code entirely disabled > is more secure than running with it enabled but just not used.That''s a fair point. Could we arrange that by having control flags flags for things like RTC and [[IO]A]PIC, the way we do for HPET? The same argument goes the other way -- might we want to have a HVM param that disables the extended PV interface? We haven''t done that before (except, I guess, for the Viridian interface), but it would be easy enough to arrange, and it seems less intrusive than having a third class of guests at the top level. Tim.
Konrad Rzeszutek Wilk
2013-Sep-27  13:08 UTC
Re: [PATCH RFC v13 00/20] Introduce PVH domU support
> * I changed the "enable PVH by setting PV + HAP", replacing it instead > with a separate flag, just like the HVM case, since it makes sense > to plan on using shadow in the future (although it is... ?> > Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com> > CC: Mukesh Rathor <mukesh.rathor@oracle.com> > CC: Jan Beulich <beulich@suse.com> > CC: Tim Deegan <tim@xen.org> > CC: Keir Fraser <keir@xen.org> > CC: Ian Jackson <ian.jackson@citrix.com> > CC: Ian Campbell <ian.campbell@citrix.com> > > > _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xen.org > http://lists.xen.org/xen-devel
On Mon, Sep 23, 2013 at 05:49:49PM +0100, George Dunlap wrote:> Allow PV e820 map to be set and read from a PVH domain. This requires > moving the pv e820 struct out from the pv-specific domain struct and > into the arch domain struct. > > Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com> > Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com> > Reviewed-by: Jan Beulich <jbeulich@suse.com>Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>> --- > > CC: Jan Beulich <jbeulich@suse.com> > CC: Tim Deegan <tim@xen.org> > CC: Keir Fraser <keir@xen.org> > --- > xen/arch/x86/domain.c | 9 +++------ > xen/arch/x86/mm.c | 26 ++++++++++++-------------- > xen/include/asm-x86/domain.h | 10 +++++----- > 3 files changed, 20 insertions(+), 25 deletions(-) > > diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c > index ac18aa7..5b7e1b2 100644 > --- a/xen/arch/x86/domain.c > +++ b/xen/arch/x86/domain.c > @@ -553,6 +553,7 @@ int arch_domain_create(struct domain *d, unsigned int domcr_flags) > if ( (rc = iommu_domain_init(d)) != 0 ) > goto fail; > } > + spin_lock_init(&d->arch.e820_lock); > > if ( has_hvm_container_domain(d) ) > { > @@ -563,13 +564,9 @@ int arch_domain_create(struct domain *d, unsigned int domcr_flags) > } > } > else > - { > /* 64-bit PV guest by default. */ > d->arch.is_32bit_pv = d->arch.has_32bit_shinfo = 0; > > - spin_lock_init(&d->arch.pv_domain.e820_lock); > - } > - > /* initialize default tsc behavior in case tools don''t */ > tsc_set_info(d, TSC_MODE_DEFAULT, 0UL, 0, 0); > spin_lock_init(&d->arch.vtsc_lock); > @@ -592,8 +589,8 @@ void arch_domain_destroy(struct domain *d) > { > if ( has_hvm_container_domain(d) ) > hvm_domain_destroy(d); > - else > - xfree(d->arch.pv_domain.e820); > + > + xfree(d->arch.e820); > > free_domain_pirqs(d); > if ( !is_idle_domain(d) ) > diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c > index 120599a..f25df92 100644 > --- a/xen/arch/x86/mm.c > +++ b/xen/arch/x86/mm.c > @@ -4760,11 +4760,11 @@ long arch_memory_op(int op, XEN_GUEST_HANDLE_PARAM(void) arg) > return -EFAULT; > } > > - spin_lock(&d->arch.pv_domain.e820_lock); > - xfree(d->arch.pv_domain.e820); > - d->arch.pv_domain.e820 = e820; > - d->arch.pv_domain.nr_e820 = fmap.map.nr_entries; > - spin_unlock(&d->arch.pv_domain.e820_lock); > + spin_lock(&d->arch.e820_lock); > + xfree(d->arch.e820); > + d->arch.e820 = e820; > + d->arch.nr_e820 = fmap.map.nr_entries; > + spin_unlock(&d->arch.e820_lock); > > rcu_unlock_domain(d); > return rc; > @@ -4778,26 +4778,24 @@ long arch_memory_op(int op, XEN_GUEST_HANDLE_PARAM(void) arg) > if ( copy_from_guest(&map, arg, 1) ) > return -EFAULT; > > - spin_lock(&d->arch.pv_domain.e820_lock); > + spin_lock(&d->arch.e820_lock); > > /* Backwards compatibility. */ > - if ( (d->arch.pv_domain.nr_e820 == 0) || > - (d->arch.pv_domain.e820 == NULL) ) > + if ( (d->arch.nr_e820 == 0) || (d->arch.e820 == NULL) ) > { > - spin_unlock(&d->arch.pv_domain.e820_lock); > + spin_unlock(&d->arch.e820_lock); > return -ENOSYS; > } > > - map.nr_entries = min(map.nr_entries, d->arch.pv_domain.nr_e820); > - if ( copy_to_guest(map.buffer, d->arch.pv_domain.e820, > - map.nr_entries) || > + map.nr_entries = min(map.nr_entries, d->arch.nr_e820); > + if ( copy_to_guest(map.buffer, d->arch.e820, map.nr_entries) || > __copy_to_guest(arg, &map, 1) ) > { > - spin_unlock(&d->arch.pv_domain.e820_lock); > + spin_unlock(&d->arch.e820_lock); > return -EFAULT; > } > > - spin_unlock(&d->arch.pv_domain.e820_lock); > + spin_unlock(&d->arch.e820_lock); > return 0; > } > > diff --git a/xen/include/asm-x86/domain.h b/xen/include/asm-x86/domain.h > index 0831700..a60d25c 100644 > --- a/xen/include/asm-x86/domain.h > +++ b/xen/include/asm-x86/domain.h > @@ -234,11 +234,6 @@ struct pv_domain > > /* map_domain_page() mapping cache. */ > struct mapcache_domain mapcache; > - > - /* Pseudophysical e820 map (XENMEM_memory_map). */ > - spinlock_t e820_lock; > - struct e820entry *e820; > - unsigned int nr_e820; > }; > > struct arch_domain > @@ -313,6 +308,11 @@ struct arch_domain > (possibly other cases in the future */ > uint64_t vtsc_kerncount; /* for hvm, counts all vtsc */ > uint64_t vtsc_usercount; /* not used for hvm */ > + > + /* Pseudophysical e820 map (XENMEM_memory_map). */ > + spinlock_t e820_lock; > + struct e820entry *e820; > + unsigned int nr_e820; > } __cacheline_aligned; > > #define has_arch_pdevs(d) (!list_empty(&(d)->arch.pdev_list)) > -- > 1.7.9.5 > > > _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xen.org > http://lists.xen.org/xen-devel
Konrad Rzeszutek Wilk
2013-Sep-27  18:34 UTC
Re: [PATCH RFC v13 12/20] pvh: read_descriptor for PVH guests.
On Mon, Sep 23, 2013 at 05:49:52PM +0100, George Dunlap wrote:> This is in preparation for enabling emulated privops for PVH guests. > > This one unfortunately has a bit more of an impedance mismatch: > > * For PV, the selector is hard-coded in by passing #name in the > #define. For PVH, we have to do a switch statement. > > * For PV, given the desrciptor, it can read the resulting base, limit,descriptor.
Konrad Rzeszutek Wilk
2013-Sep-27  18:37 UTC
Re: [PATCH RFC v13 19/20] PVH xen tools: libxc changes to build a PVH guest.
On Mon, Sep 23, 2013 at 05:49:59PM +0100, George Dunlap wrote:> From: Mukesh Rathor <mukesh.rathor@oracle.com> > > Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com> >Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>> CC: Ian Jackson <ian.jackson@citrix.com> > CC: Ian Campbell <ian.campbell@citrix.com> > --- > tools/libxc/xc_dom.h | 1 + > tools/libxc/xc_dom_core.c | 9 +++++ > tools/libxc/xc_dom_x86.c | 90 +++++++++++++++++++++++++++++---------------- > 3 files changed, 69 insertions(+), 31 deletions(-) > > diff --git a/tools/libxc/xc_dom.h b/tools/libxc/xc_dom.h > index 86e23ee..5168bcd 100644 > --- a/tools/libxc/xc_dom.h > +++ b/tools/libxc/xc_dom.h > @@ -130,6 +130,7 @@ struct xc_dom_image { > domid_t console_domid; > domid_t xenstore_domid; > xen_pfn_t shared_info_mfn; > + int pvh_enabled; > > xc_interface *xch; > domid_t guest_domid; > diff --git a/tools/libxc/xc_dom_core.c b/tools/libxc/xc_dom_core.c > index 0f367f6..faa7e0f 100644 > --- a/tools/libxc/xc_dom_core.c > +++ b/tools/libxc/xc_dom_core.c > @@ -766,6 +766,15 @@ int xc_dom_parse_image(struct xc_dom_image *dom) > goto err; > } > > + if ( dom->pvh_enabled ) > + { > + const char *pvh_features = "writable_descriptor_tables|" > + "auto_translated_physmap|" > + "supervisor_mode_kernel|" > + "hvm_callback_vector"; > + elf_xen_parse_features(pvh_features, dom->f_requested, NULL); > + } > + > /* check features */ > for ( i = 0; i < XENFEAT_NR_SUBMAPS; i++ ) > { > diff --git a/tools/libxc/xc_dom_x86.c b/tools/libxc/xc_dom_x86.c > index 7cc2ff2..f36dfbb 100644 > --- a/tools/libxc/xc_dom_x86.c > +++ b/tools/libxc/xc_dom_x86.c > @@ -407,7 +407,8 @@ static int setup_pgtables_x86_64(struct xc_dom_image *dom) > pgpfn = (addr - dom->parms.virt_base) >> PAGE_SHIFT_X86; > l1tab[l1off] > pfn_to_paddr(xc_dom_p2m_guest(dom, pgpfn)) | L1_PROT; > - if ( (addr >= dom->pgtables_seg.vstart) && > + if ( (!dom->pvh_enabled) && > + (addr >= dom->pgtables_seg.vstart) && > (addr < dom->pgtables_seg.vend) ) > l1tab[l1off] &= ~_PAGE_RW; /* page tables are r/o */ > > @@ -588,6 +589,13 @@ static int vcpu_x86_32(struct xc_dom_image *dom, void *ptr) > > DOMPRINTF_CALLED(dom->xch); > > + if ( dom->pvh_enabled ) > + { > + xc_dom_panic(dom->xch, XC_INTERNAL_ERROR, > + "%s: PVH not supported for 32bit guests.", __FUNCTION__); > + return -1; > + } > + > /* clear everything */ > memset(ctxt, 0, sizeof(*ctxt)); > > @@ -630,12 +638,6 @@ static int vcpu_x86_64(struct xc_dom_image *dom, void *ptr) > /* clear everything */ > memset(ctxt, 0, sizeof(*ctxt)); > > - ctxt->user_regs.ds = FLAT_KERNEL_DS_X86_64; > - ctxt->user_regs.es = FLAT_KERNEL_DS_X86_64; > - ctxt->user_regs.fs = FLAT_KERNEL_DS_X86_64; > - ctxt->user_regs.gs = FLAT_KERNEL_DS_X86_64; > - ctxt->user_regs.ss = FLAT_KERNEL_SS_X86_64; > - ctxt->user_regs.cs = FLAT_KERNEL_CS_X86_64; > ctxt->user_regs.rip = dom->parms.virt_entry; > ctxt->user_regs.rsp > dom->parms.virt_base + (dom->bootstack_pfn + 1) * PAGE_SIZE_X86; > @@ -643,15 +645,25 @@ static int vcpu_x86_64(struct xc_dom_image *dom, void *ptr) > dom->parms.virt_base + (dom->start_info_pfn) * PAGE_SIZE_X86; > ctxt->user_regs.rflags = 1 << 9; /* Interrupt Enable */ > > - ctxt->kernel_ss = ctxt->user_regs.ss; > - ctxt->kernel_sp = ctxt->user_regs.esp; > - > ctxt->flags = VGCF_in_kernel_X86_64 | VGCF_online_X86_64; > cr3_pfn = xc_dom_p2m_guest(dom, dom->pgtables_seg.pfn); > ctxt->ctrlreg[3] = xen_pfn_to_cr3_x86_64(cr3_pfn); > DOMPRINTF("%s: cr3: pfn 0x%" PRIpfn " mfn 0x%" PRIpfn "", > __FUNCTION__, dom->pgtables_seg.pfn, cr3_pfn); > > + if ( dom->pvh_enabled ) > + return 0; > + > + ctxt->user_regs.ds = FLAT_KERNEL_DS_X86_64; > + ctxt->user_regs.es = FLAT_KERNEL_DS_X86_64; > + ctxt->user_regs.fs = FLAT_KERNEL_DS_X86_64; > + ctxt->user_regs.gs = FLAT_KERNEL_DS_X86_64; > + ctxt->user_regs.ss = FLAT_KERNEL_SS_X86_64; > + ctxt->user_regs.cs = FLAT_KERNEL_CS_X86_64; > + > + ctxt->kernel_ss = ctxt->user_regs.ss; > + ctxt->kernel_sp = ctxt->user_regs.esp; > + > return 0; > } > > @@ -752,7 +764,7 @@ int arch_setup_meminit(struct xc_dom_image *dom) > rc = x86_compat(dom->xch, dom->guest_domid, dom->guest_type); > if ( rc ) > return rc; > - if ( xc_dom_feature_translated(dom) ) > + if ( xc_dom_feature_translated(dom) && !dom->pvh_enabled ) > { > dom->shadow_enabled = 1; > rc = x86_shadow(dom->xch, dom->guest_domid); > @@ -828,6 +840,38 @@ int arch_setup_bootearly(struct xc_dom_image *dom) > return 0; > } > > +/* > + * Map grant table frames into guest physmap. PVH manages grant during boot > + * via HVM mechanisms. > + */ > +static int map_grant_table_frames(struct xc_dom_image *dom) > +{ > + int i, rc; > + > + if ( dom->pvh_enabled ) > + return 0; > + > + for ( i = 0; ; i++ ) > + { > + rc = xc_domain_add_to_physmap(dom->xch, dom->guest_domid, > + XENMAPSPACE_grant_table, > + i, dom->total_pages + i); > + if ( rc != 0 ) > + { > + if ( (i > 0) && (errno == EINVAL) ) > + { > + DOMPRINTF("%s: %d grant tables mapped", __FUNCTION__, i); > + break; > + } > + xc_dom_panic(dom->xch, XC_INTERNAL_ERROR, > + "%s: mapping grant tables failed " "(pfn=0x%" PRIpfn > + ", rc=%d)", __FUNCTION__, dom->total_pages + i, rc); > + return rc; > + } > + } > + return 0; > +} > + > int arch_setup_bootlate(struct xc_dom_image *dom) > { > static const struct { > @@ -866,7 +910,6 @@ int arch_setup_bootlate(struct xc_dom_image *dom) > else > { > /* paravirtualized guest with auto-translation */ > - int i; > > /* Map shared info frame into guest physmap. */ > rc = xc_domain_add_to_physmap(dom->xch, dom->guest_domid, > @@ -880,25 +923,10 @@ int arch_setup_bootlate(struct xc_dom_image *dom) > return rc; > } > > - /* Map grant table frames into guest physmap. */ > - for ( i = 0; ; i++ ) > - { > - rc = xc_domain_add_to_physmap(dom->xch, dom->guest_domid, > - XENMAPSPACE_grant_table, > - i, dom->total_pages + i); > - if ( rc != 0 ) > - { > - if ( (i > 0) && (errno == EINVAL) ) > - { > - DOMPRINTF("%s: %d grant tables mapped", __FUNCTION__, i); > - break; > - } > - xc_dom_panic(dom->xch, XC_INTERNAL_ERROR, > - "%s: mapping grant tables failed " "(pfn=0x%" > - PRIpfn ", rc=%d)", __FUNCTION__, dom->total_pages + i, rc); > - return rc; > - } > - } > + rc = map_grant_table_frames(dom); > + if ( rc != 0 ) > + return rc; > + > shinfo = dom->shared_info_pfn; > } > > -- > 1.7.9.5 > > > _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xen.org > http://lists.xen.org/xen-devel
Konrad Rzeszutek Wilk
2013-Sep-27  18:38 UTC
Re: [PATCH RFC v13 20/20] PVH xen tools: libxl changes to create a PVH guest.
On Mon, Sep 23, 2013 at 05:50:00PM +0100, George Dunlap wrote:> From: Mukesh Rathor <mukesh.rathor@oracle.com> > > Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com> > Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>> --- > v13 (gwd): > - Added XEN_DOMCTL_CDF_pvh_guest flag > CC: Ian Jackson <ian.jackson@citrix.com> > CC: Ian Campbell <ian.campbell@citrix.com> > --- > docs/man/xl.cfg.pod.5 | 3 +++ > tools/libxl/libxl.h | 6 ++++++ > tools/libxl/libxl_create.c | 13 +++++++++++++ > tools/libxl/libxl_dom.c | 2 ++ > tools/libxl/libxl_internal.h | 1 + > tools/libxl/libxl_types.idl | 1 + > tools/libxl/libxl_x86.c | 4 +++- > tools/libxl/xl_cmdimpl.c | 1 + > 8 files changed, 30 insertions(+), 1 deletion(-) > > diff --git a/docs/man/xl.cfg.pod.5 b/docs/man/xl.cfg.pod.5 > index 769767b..69a4cc1 100644 > --- a/docs/man/xl.cfg.pod.5 > +++ b/docs/man/xl.cfg.pod.5 > @@ -634,6 +634,9 @@ if your particular guest kernel does not require this behaviour then > it is safe to allow this to be enabled but you may wish to disable it > anyway. > > +=item B<pvh=BOOLEAN> > +Selects whether to run this PV guest in an HVM container. Default is 0. > + > =back > > =head2 Fully-virtualised (HVM) Guest Specific Options > diff --git a/tools/libxl/libxl.h b/tools/libxl/libxl.h > index 4cab294..2801da4 100644 > --- a/tools/libxl/libxl.h > +++ b/tools/libxl/libxl.h > @@ -355,6 +355,12 @@ > */ > #define LIBXL_HAVE_SPICE_VDAGENT 1 > > +/* > + * LIBXL_HAVE_CREATEINFO_PVH > + * If this is defined, then libxl supports creation of a PVH guest. > + */ > +#define LIBXL_HAVE_CREATEINFO_PVH 1 > + > /* Functions annotated with LIBXL_EXTERNAL_CALLERS_ONLY may not be > * called from within libxl itself. Callers outside libxl, who > * do not #include libxl_internal.h, are fine. */ > diff --git a/tools/libxl/libxl_create.c b/tools/libxl/libxl_create.c > index 7567238..a8e773a 100644 > --- a/tools/libxl/libxl_create.c > +++ b/tools/libxl/libxl_create.c > @@ -33,6 +33,9 @@ int libxl__domain_create_info_setdefault(libxl__gc *gc, > if (c_info->type == LIBXL_DOMAIN_TYPE_HVM) { > libxl_defbool_setdefault(&c_info->hap, true); > libxl_defbool_setdefault(&c_info->oos, true); > + } else { > + libxl_defbool_setdefault(&c_info->pvh, false); > + libxl_defbool_setdefault(&c_info->hap, libxl_defbool_val(c_info->pvh)); > } > > libxl_defbool_setdefault(&c_info->run_hotplug_scripts, true); > @@ -349,6 +352,8 @@ int libxl__domain_build(libxl__gc *gc, > > break; > case LIBXL_DOMAIN_TYPE_PV: > + state->pvh_enabled = libxl_defbool_val(d_config->c_info.pvh); > + > ret = libxl__build_pv(gc, domid, info, state); > if (ret) > goto out; > @@ -408,6 +413,14 @@ int libxl__domain_make(libxl__gc *gc, libxl_domain_create_info *info, > flags |= XEN_DOMCTL_CDF_hvm_guest; > flags |= libxl_defbool_val(info->hap) ? XEN_DOMCTL_CDF_hap : 0; > flags |= libxl_defbool_val(info->oos) ? 0 : XEN_DOMCTL_CDF_oos_off; > + } else if ( libxl_defbool_val(info->pvh) ) { > + flags |= XEN_DOMCTL_CDF_pvh_guest; > + if ( !libxl_defbool_val(info->hap) ) { > + LOG(ERROR, "HAP must be on for PVH"); > + rc = ERROR_INVAL; > + goto out; > + } > + flags |= XEN_DOMCTL_CDF_hap; > } > *domid = -1; > > diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c > index 6e2252a..89e919e 100644 > --- a/tools/libxl/libxl_dom.c > +++ b/tools/libxl/libxl_dom.c > @@ -338,6 +338,8 @@ int libxl__build_pv(libxl__gc *gc, uint32_t domid, > return ERROR_FAIL; > } > > + dom->pvh_enabled = state->pvh_enabled; > + > LOG(DEBUG, "pv kernel mapped %d path %s\n", state->pv_kernel.mapped, state->pv_kernel.path); > if (state->pv_kernel.mapped) { > ret = xc_dom_kernel_mem(dom, > diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h > index f051d91..441ec66 100644 > --- a/tools/libxl/libxl_internal.h > +++ b/tools/libxl/libxl_internal.h > @@ -886,6 +886,7 @@ typedef struct { > libxl__file_reference pv_kernel; > libxl__file_reference pv_ramdisk; > const char * pv_cmdline; > + bool pvh_enabled; > } libxl__domain_build_state; > > _hidden int libxl__build_pre(libxl__gc *gc, uint32_t domid, > diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl > index 049dbb5..e83b70d 100644 > --- a/tools/libxl/libxl_types.idl > +++ b/tools/libxl/libxl_types.idl > @@ -257,6 +257,7 @@ libxl_domain_create_info = Struct("domain_create_info",[ > ("platformdata", libxl_key_value_list), > ("poolid", uint32), > ("run_hotplug_scripts",libxl_defbool), > + ("pvh", libxl_defbool), > ], dir=DIR_IN) > > MemKB = UInt(64, init_val = "LIBXL_MEMKB_DEFAULT") > diff --git a/tools/libxl/libxl_x86.c b/tools/libxl/libxl_x86.c > index a78c91d..87a8110 100644 > --- a/tools/libxl/libxl_x86.c > +++ b/tools/libxl/libxl_x86.c > @@ -290,7 +290,9 @@ int libxl__arch_domain_create(libxl__gc *gc, libxl_domain_config *d_config, > if (rtc_timeoffset) > xc_domain_set_time_offset(ctx->xch, domid, rtc_timeoffset); > > - if (d_config->b_info.type == LIBXL_DOMAIN_TYPE_HVM) { > + if (d_config->b_info.type == LIBXL_DOMAIN_TYPE_HVM || > + libxl_defbool_val(d_config->c_info.pvh)) { > + > unsigned long shadow; > shadow = (d_config->b_info.shadow_memkb + 1023) / 1024; > xc_shadow_control(ctx->xch, domid, XEN_DOMCTL_SHADOW_OP_SET_ALLOCATION, NULL, 0, &shadow, 0, NULL); > diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c > index 3d7eaad..7f36423 100644 > --- a/tools/libxl/xl_cmdimpl.c > +++ b/tools/libxl/xl_cmdimpl.c > @@ -610,6 +610,7 @@ static void parse_config_data(const char *config_source, > !strncmp(buf, "hvm", strlen(buf))) > c_info->type = LIBXL_DOMAIN_TYPE_HVM; > > + xlu_cfg_get_defbool(config, "pvh", &c_info->pvh, 0); > xlu_cfg_get_defbool(config, "hap", &c_info->hap, 0); > > if (xlu_cfg_replace_string (config, "name", &c_info->name, 0)) { > -- > 1.7.9.5 > > > _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xen.org > http://lists.xen.org/xen-devel
Mukesh Rathor
2013-Sep-27  21:15 UTC
Re: [PATCH RFC v13 08/20] pvh: PVH access to hypercalls
On Thu, 26 Sep 2013 16:33:07 +0100 "Jan Beulich" <JBeulich@suse.com> wrote:> >>> On 23.09.13 at 18:49, George Dunlap <george.dunlap@eu.citrix.com> > >>> wrote: > > --- a/xen/arch/x86/hvm/mtrr.c > > +++ b/xen/arch/x86/hvm/mtrr.c > > @@ -578,6 +578,7 @@ int32_t hvm_set_mem_pinned_cacheattr( > > { > > struct hvm_mem_pinned_cacheattr_range *range; > > > > + /* Side note: A PVH guest writes to MSR_IA32_CR_PAT natively. > > */ if ( !((type == PAT_TYPE_UNCACHABLE) || > > (type == PAT_TYPE_WRCOMB) || > > (type == PAT_TYPE_WRTHROUGH) || > > I always wondered what this comment is trying to tell the > reader.Original code: /* Side note: A PVH guest writes to MSR_IA32_CR_PAT natively. */ if ( is_pvh_domain(d) ) return -EOPNOTSUPP; We don''t support this because the guest writes PAT natively, was my intention. Either it was your comment to return EOPNOTSUPP or just my thought - it will return EINVAL now, I guess ok too. thanks mukesh
>>> On 27.09.13 at 23:15, Mukesh Rathor <mukesh.rathor@oracle.com> wrote: > On Thu, 26 Sep 2013 16:33:07 +0100 > "Jan Beulich" <JBeulich@suse.com> wrote: > >> >>> On 23.09.13 at 18:49, George Dunlap <george.dunlap@eu.citrix.com> >> >>> wrote: >> > --- a/xen/arch/x86/hvm/mtrr.c >> > +++ b/xen/arch/x86/hvm/mtrr.c >> > @@ -578,6 +578,7 @@ int32_t hvm_set_mem_pinned_cacheattr( >> > { >> > struct hvm_mem_pinned_cacheattr_range *range; >> > >> > + /* Side note: A PVH guest writes to MSR_IA32_CR_PAT natively. >> > */ if ( !((type == PAT_TYPE_UNCACHABLE) || >> > (type == PAT_TYPE_WRCOMB) || >> > (type == PAT_TYPE_WRTHROUGH) || >> >> I always wondered what this comment is trying to tell the >> reader. > > Original code: > > /* Side note: A PVH guest writes to MSR_IA32_CR_PAT natively. */ > if ( is_pvh_domain(d) ) > return -EOPNOTSUPP; > > We don''t support this because the guest writes PAT natively, was > my intention. Either it was your comment to return EOPNOTSUPP > or just my thought - it will return EINVAL now, I guess ok too.That wasn''t the question (albeit nevertheless good you comment on it). I was really after understanding what "writes natively" is supposed to mean. _If_ there''s any behavioral difference to how "normal" HVM accesses the PAT MSR, this surely requires (in the commit message at least) a more detailed explanation. And if there was no difference, the comment would seem rather pointless. Jan
On 23/09/13 18:49, George Dunlap wrote:> There are many functions where PVH requires some code in common with > HVM. Rearrange some of these functions so that the code is together. > > In general, the HVM code that PVH also uses includes: > - cacheattr functionality > - paging > - hvm_funcs > - hvm_assert_evtchn_irq tasklet > - tm_list > - hvm_params > > And code that PVH shares with PV but not with PVH: > - updating the domain wallclock > - setting v->is_initialized > > There should be no end-to-end changes in behavior.According to my bisection, this patch breaks HVM domains, here is the failure when trying to boot a HVM domain: (XEN) <vm_launch_fail> error code 7 (XEN) domain_crash_sync called from vmcs.c:1224 (XEN) Domain 1 (vcpu#0) crashed on cpu#7: (XEN) ----[ Xen-4.4-unstable x86_64 debug=y Not tainted ]---- (XEN) CPU: 7 (XEN) RIP: 0000:[<0000000000000000>] (XEN) RFLAGS: 0000000000000002 CONTEXT: hvm guest (XEN) rax: 0000000000000000 rbx: ffff83019a148000 rcx: ffff83019a14ff80 (XEN) rdx: ffff82d0801cd03f rsi: 0000000000000000 rdi: ffff82d0801da5ec (XEN) rbp: ffff82d080105a78 rsp: 0000000000000000 r8: 0000000000000004 (XEN) r9: ffff82d080105ae7 r10: ffff83019a14ff70 r11: 0000000000000000 (XEN) r12: ffff83019a14ff50 r13: ffff83011da64000 r14: 0000000000000004 (XEN) r15: ffff82d080184a90 cr0: 0000000000000011 cr4: 0000000000000000 (XEN) cr3: 0000000000000000 cr2: 0000000000000000 (XEN) ds: 0000 es: 0000 fs: 0000 gs: 0000 ss: 0000 cs: 0000
On 23/09/13 18:49, George Dunlap wrote:> @@ -1028,12 +1129,28 @@ static int construct_vmcs(struct vcpu *v) > | (1U << TRAP_no_device); > vmx_update_exception_bitmap(v); > > + /* In HVM domains, this happens on the realmode->paging > + * transition. Since PVH never goes through this transition, we > + * need to do it at start-of-day. */ > + if ( is_pvh_domain(d) ) > + vmx_update_debug_state(v); > + > v->arch.hvm_vcpu.guest_cr[0] = X86_CR0_PE | X86_CR0_ET; > + > + /* PVH domains always start in paging mode */ > + if ( is_pvh_domain(d) ) > + v->arch.hvm_vcpu.guest_cr[0] |= X86_CR0_PG | X86_CR0_NE | X86_CR0_WP; > + > hvm_update_guest_cr(v, 0); > > - v->arch.hvm_vcpu.guest_cr[4] = 0; > + v->arch.hvm_vcpu.guest_cr[4] = is_pvh_domain(d) ? > + real_cr4_to_pv_guest_cr4(mmu_cr4_features)Here we need to mask the bits in CR4 that the guest isn''t allowed to set. Right now Xen is setting the VMXE bit by default, which the guest is not able to modify, so if the guests tries to update CR4 based on the previous value Xen is going to complain: + real_cr4_to_pv_guest_cr4(mmu_cr4_features) & + ~HVM_CR4_GUEST_RESERVED_BITS(v)
On 07/10/13 16:55, Roger Pau Monné wrote:> On 23/09/13 18:49, George Dunlap wrote: >> @@ -1028,12 +1129,28 @@ static int construct_vmcs(struct vcpu *v) >> | (1U << TRAP_no_device); >> vmx_update_exception_bitmap(v); >> >> + /* In HVM domains, this happens on the realmode->paging >> + * transition. Since PVH never goes through this transition, we >> + * need to do it at start-of-day. */ >> + if ( is_pvh_domain(d) ) >> + vmx_update_debug_state(v); >> + >> v->arch.hvm_vcpu.guest_cr[0] = X86_CR0_PE | X86_CR0_ET; >> + >> + /* PVH domains always start in paging mode */ >> + if ( is_pvh_domain(d) ) >> + v->arch.hvm_vcpu.guest_cr[0] |= X86_CR0_PG | X86_CR0_NE | X86_CR0_WP; >> + >> hvm_update_guest_cr(v, 0); >> >> - v->arch.hvm_vcpu.guest_cr[4] = 0; >> + v->arch.hvm_vcpu.guest_cr[4] = is_pvh_domain(d) ? >> + real_cr4_to_pv_guest_cr4(mmu_cr4_features) > Here we need to mask the bits in CR4 that the guest isn''t allowed to > set. Right now Xen is setting the VMXE bit by default, which the guest > is not able to modify, so if the guests tries to update CR4 based on the > previous value Xen is going to complain: > > + real_cr4_to_pv_guest_cr4(mmu_cr4_features) & > + ~HVM_CR4_GUEST_RESERVED_BITS(v)Thanks for testing that -- I''ll include it in the next spin-up. I''ll probably also add a patch ASSERT''ing at the end of this function that the guest CR values have no reserved bits set. -George
At 17:06 +0100 on 07 Oct (1381165615), George Dunlap wrote:> On 07/10/13 16:55, Roger Pau Monné wrote: > >On 23/09/13 18:49, George Dunlap wrote: > >>@@ -1028,12 +1129,28 @@ static int construct_vmcs(struct vcpu *v) > >> | (1U << TRAP_no_device); > >> vmx_update_exception_bitmap(v); > >> > >>+ /* In HVM domains, this happens on the realmode->paging > >>+ * transition. Since PVH never goes through this transition, we > >>+ * need to do it at start-of-day. */ > >>+ if ( is_pvh_domain(d) ) > >>+ vmx_update_debug_state(v); > >>+ > >> v->arch.hvm_vcpu.guest_cr[0] = X86_CR0_PE | X86_CR0_ET; > >>+ > >>+ /* PVH domains always start in paging mode */ > >>+ if ( is_pvh_domain(d) ) > >>+ v->arch.hvm_vcpu.guest_cr[0] |= X86_CR0_PG | X86_CR0_NE | > >>X86_CR0_WP; > >>+ > >> hvm_update_guest_cr(v, 0); > >> > >>- v->arch.hvm_vcpu.guest_cr[4] = 0; > >>+ v->arch.hvm_vcpu.guest_cr[4] = is_pvh_domain(d) ? > >>+ real_cr4_to_pv_guest_cr4(mmu_cr4_features) > >Here we need to mask the bits in CR4 that the guest isn''t allowed to > >set. Right now Xen is setting the VMXE bit by default, which the guest > >is not able to modify, so if the guests tries to update CR4 based on the > >previous value Xen is going to complain: > > > >+ real_cr4_to_pv_guest_cr4(mmu_cr4_features) & > >+ ~HVM_CR4_GUEST_RESERVED_BITS(v) > > Thanks for testing that -- I''ll include it in the next spin-up.<harping on> This is the sort of thing that would be easier if PVH guests were just HVM ones with extra hypercalls. <harping off> Tim.
On 07/10/13 17:12, Tim Deegan wrote:> At 17:06 +0100 on 07 Oct (1381165615), George Dunlap wrote: >> On 07/10/13 16:55, Roger Pau Monné wrote: >>> On 23/09/13 18:49, George Dunlap wrote: >>>> @@ -1028,12 +1129,28 @@ static int construct_vmcs(struct vcpu *v) >>>> | (1U << TRAP_no_device); >>>> vmx_update_exception_bitmap(v); >>>> >>>> + /* In HVM domains, this happens on the realmode->paging >>>> + * transition. Since PVH never goes through this transition, we >>>> + * need to do it at start-of-day. */ >>>> + if ( is_pvh_domain(d) ) >>>> + vmx_update_debug_state(v); >>>> + >>>> v->arch.hvm_vcpu.guest_cr[0] = X86_CR0_PE | X86_CR0_ET; >>>> + >>>> + /* PVH domains always start in paging mode */ >>>> + if ( is_pvh_domain(d) ) >>>> + v->arch.hvm_vcpu.guest_cr[0] |= X86_CR0_PG | X86_CR0_NE | >>>> X86_CR0_WP; >>>> + >>>> hvm_update_guest_cr(v, 0); >>>> >>>> - v->arch.hvm_vcpu.guest_cr[4] = 0; >>>> + v->arch.hvm_vcpu.guest_cr[4] = is_pvh_domain(d) ? >>>> + real_cr4_to_pv_guest_cr4(mmu_cr4_features) >>> Here we need to mask the bits in CR4 that the guest isn''t allowed to >>> set. Right now Xen is setting the VMXE bit by default, which the guest >>> is not able to modify, so if the guests tries to update CR4 based on the >>> previous value Xen is going to complain: >>> >>> + real_cr4_to_pv_guest_cr4(mmu_cr4_features) & >>> + ~HVM_CR4_GUEST_RESERVED_BITS(v) >> Thanks for testing that -- I''ll include it in the next spin-up. > <harping on> This is the sort of thing that would be easier if PVH guests > were just HVM ones with extra hypercalls. <harping off>I''m not sure this would have helped -- the issue here is that we start the guest with long mode and paging enabled, which I think is something we want to do (rather than having to start in 16-bit mode and level-up through 32-bit to 64-bit mode). To do that, Mukesh very reasonably began by setting the guest cr4 to what a PV guest would see of the cr4, and for him it worked -- apparently because his PVH Linux doesn''t touch cr4. It''s possible that his version of the cr4 exit handler tolerated the VMXE bit being set, in which case it was the reworking I did that introduced the bug. (Although that demonstrates why having duplicate code is a really bad idea.) -George
At 17:20 +0100 on 07 Oct (1381166400), George Dunlap wrote:> On 07/10/13 17:12, Tim Deegan wrote: > >At 17:06 +0100 on 07 Oct (1381165615), George Dunlap wrote: > >>On 07/10/13 16:55, Roger Pau Monné wrote: > >>>On 23/09/13 18:49, George Dunlap wrote: > >>>>@@ -1028,12 +1129,28 @@ static int construct_vmcs(struct vcpu *v) > >>>> | (1U << TRAP_no_device); > >>>> vmx_update_exception_bitmap(v); > >>>> > >>>>+ /* In HVM domains, this happens on the realmode->paging > >>>>+ * transition. Since PVH never goes through this transition, we > >>>>+ * need to do it at start-of-day. */ > >>>>+ if ( is_pvh_domain(d) ) > >>>>+ vmx_update_debug_state(v); > >>>>+ > >>>> v->arch.hvm_vcpu.guest_cr[0] = X86_CR0_PE | X86_CR0_ET; > >>>>+ > >>>>+ /* PVH domains always start in paging mode */ > >>>>+ if ( is_pvh_domain(d) ) > >>>>+ v->arch.hvm_vcpu.guest_cr[0] |= X86_CR0_PG | X86_CR0_NE | > >>>>X86_CR0_WP; > >>>>+ > >>>> hvm_update_guest_cr(v, 0); > >>>> > >>>>- v->arch.hvm_vcpu.guest_cr[4] = 0; > >>>>+ v->arch.hvm_vcpu.guest_cr[4] = is_pvh_domain(d) ? > >>>>+ real_cr4_to_pv_guest_cr4(mmu_cr4_features) > >>>Here we need to mask the bits in CR4 that the guest isn''t allowed to > >>>set. Right now Xen is setting the VMXE bit by default, which the guest > >>>is not able to modify, so if the guests tries to update CR4 based on the > >>>previous value Xen is going to complain: > >>> > >>>+ real_cr4_to_pv_guest_cr4(mmu_cr4_features) & > >>>+ ~HVM_CR4_GUEST_RESERVED_BITS(v) > >>Thanks for testing that -- I''ll include it in the next spin-up. > ><harping on> This is the sort of thing that would be easier if PVH guests > >were just HVM ones with extra hypercalls. <harping off> > > I''m not sure this would have helped -- the issue here is that we start > the guest with long mode and paging enabled, which I think is something > we want to do (rather than having to start in 16-bit mode and level-up > through 32-bit to 64-bit mode). To do that, Mukesh very reasonably > began by setting the guest cr4 to what a PV guest would see of the cr4, > and for him it worked -- apparently because his PVH Linux doesn''t touch > cr4.Sure. But if we used the existing HVM state-setting hypercalls to build PVH guests, they would DTRT and return an error to the builder. Tim.
>>> On 07.10.13 at 18:12, Tim Deegan <tim@xen.org> wrote: > At 17:06 +0100 on 07 Oct (1381165615), George Dunlap wrote: >> On 07/10/13 16:55, Roger Pau Monné wrote: >> >On 23/09/13 18:49, George Dunlap wrote: >> >>@@ -1028,12 +1129,28 @@ static int construct_vmcs(struct vcpu *v) >> >> | (1U << TRAP_no_device); >> >> vmx_update_exception_bitmap(v); >> >> >> >>+ /* In HVM domains, this happens on the realmode->paging >> >>+ * transition. Since PVH never goes through this transition, we >> >>+ * need to do it at start-of-day. */ >> >>+ if ( is_pvh_domain(d) ) >> >>+ vmx_update_debug_state(v); >> >>+ >> >> v->arch.hvm_vcpu.guest_cr[0] = X86_CR0_PE | X86_CR0_ET; >> >>+ >> >>+ /* PVH domains always start in paging mode */ >> >>+ if ( is_pvh_domain(d) ) >> >>+ v->arch.hvm_vcpu.guest_cr[0] |= X86_CR0_PG | X86_CR0_NE | >> >>X86_CR0_WP; >> >>+ >> >> hvm_update_guest_cr(v, 0); >> >> >> >>- v->arch.hvm_vcpu.guest_cr[4] = 0; >> >>+ v->arch.hvm_vcpu.guest_cr[4] = is_pvh_domain(d) ? >> >>+ real_cr4_to_pv_guest_cr4(mmu_cr4_features) >> >Here we need to mask the bits in CR4 that the guest isn't allowed to >> >set. Right now Xen is setting the VMXE bit by default, which the guest >> >is not able to modify, so if the guests tries to update CR4 based on the >> >previous value Xen is going to complain: >> > >> >+ real_cr4_to_pv_guest_cr4(mmu_cr4_features) & >> >+ ~HVM_CR4_GUEST_RESERVED_BITS(v) >> >> Thanks for testing that -- I'll include it in the next spin-up. > > <harping on> This is the sort of thing that would be easier if PVH guests > were just HVM ones with extra hypercalls. <harping off>And indeed I'm rather puzzled by the simplistic re-use of a PV construct: There are memory management related bits in CR4, and with memory management not being PV for PVH guests, this seems conceptually wrong. Jan _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Roger Pau Monné
2013-Oct-18  16:45 UTC
Re: [PATCH RFC v13 19/20] PVH xen tools: libxc changes to build a PVH guest.
This patch is incomplete, PVH guests with 4GB of RAM or more will fail 
to boot due to libxc temporary mapping the grant table to gfn 0xFFFFE, 
which is an invalid gfn on HVM guests, but not on PVH guests. When 
libxc removes the mapping of 0xFFFFE it will leave an invalid mfn in 
the p2m table, that will make the guest crash when trying to access 
that gfn.
The easiest way I''ve found to solve this is to temporary map the grant 
table to the maximum gfn + 1. This fixes the PVH issue, and also seems 
to work fine with HVM guests, here is the diff which should probably be 
integrated into this patch:
---
diff --git a/tools/libxc/xc_dom.h b/tools/libxc/xc_dom.h
index 5168bcd..90679da 100644
--- a/tools/libxc/xc_dom.h
+++ b/tools/libxc/xc_dom.h
@@ -18,9 +18,6 @@
 
 #define INVALID_P2M_ENTRY   ((xen_pfn_t)-1)
 
-/* Scrach PFN for temporary mappings in HVM */
-#define SCRATCH_PFN_GNTTAB 0xFFFFE
-
 /* --- typedefs and structs ---------------------------------------- */
 
 typedef uint64_t xen_vaddr_t;
diff --git a/tools/libxc/xc_dom_boot.c b/tools/libxc/xc_dom_boot.c
index 71e1897..fdfeaf8 100644
--- a/tools/libxc/xc_dom_boot.c
+++ b/tools/libxc/xc_dom_boot.c
@@ -361,17 +361,27 @@ int xc_dom_gnttab_hvm_seed(xc_interface *xch, domid_t
domid,
                            domid_t xenstore_domid)
 {
     int rc;
+    xen_pfn_t max_gfn;
     struct xen_add_to_physmap xatp = {
         .domid = domid,
         .space = XENMAPSPACE_grant_table,
         .idx   = 0,
-        .gpfn  = SCRATCH_PFN_GNTTAB
     };
     struct xen_remove_from_physmap xrfp = {
         .domid = domid,
-        .gpfn  = SCRATCH_PFN_GNTTAB
     };
 
+    max_gfn = xc_domain_maximum_gpfn(xch, domid);
+    if ( max_gfn <= 0 ) {
+        xc_dom_panic(xch, XC_INTERNAL_ERROR,
+                     "%s: failed to get max gfn "
+                     "[errno=%d]\n",
+                     __FUNCTION__, errno);
+        return -1;
+    }
+    xatp.gpfn = max_gfn + 1;
+    xrfp.gpfn = max_gfn + 1;
+
     rc = do_memory_op(xch, XENMEM_add_to_physmap, &xatp, sizeof(xatp));
     if ( rc != 0 )
     {
George Dunlap
2013-Nov-04  11:56 UTC
Re: [PATCH RFC v13 19/20] PVH xen tools: libxc changes to build a PVH guest.
On 18/10/13 17:45, Roger Pau Monné wrote:> This patch is incomplete, PVH guests with 4GB of RAM or more will fail > to boot due to libxc temporary mapping the grant table to gfn 0xFFFFE, > which is an invalid gfn on HVM guests, but not on PVH guests. When > libxc removes the mapping of 0xFFFFE it will leave an invalid mfn in > the p2m table, that will make the guest crash when trying to access > that gfn. > > The easiest way I''ve found to solve this is to temporary map the grant > table to the maximum gfn + 1. This fixes the PVH issue, and also seems > to work fine with HVM guests, here is the diff which should probably be > integrated into this patch:Thanks Roger. I assume it''s OK to add your sign-off to this patch? -George> > --- > diff --git a/tools/libxc/xc_dom.h b/tools/libxc/xc_dom.h > index 5168bcd..90679da 100644 > --- a/tools/libxc/xc_dom.h > +++ b/tools/libxc/xc_dom.h > @@ -18,9 +18,6 @@ > > #define INVALID_P2M_ENTRY ((xen_pfn_t)-1) > > -/* Scrach PFN for temporary mappings in HVM */ > -#define SCRATCH_PFN_GNTTAB 0xFFFFE > - > /* --- typedefs and structs ---------------------------------------- */ > > typedef uint64_t xen_vaddr_t; > diff --git a/tools/libxc/xc_dom_boot.c b/tools/libxc/xc_dom_boot.c > index 71e1897..fdfeaf8 100644 > --- a/tools/libxc/xc_dom_boot.c > +++ b/tools/libxc/xc_dom_boot.c > @@ -361,17 +361,27 @@ int xc_dom_gnttab_hvm_seed(xc_interface *xch, domid_t domid, > domid_t xenstore_domid) > { > int rc; > + xen_pfn_t max_gfn; > struct xen_add_to_physmap xatp = { > .domid = domid, > .space = XENMAPSPACE_grant_table, > .idx = 0, > - .gpfn = SCRATCH_PFN_GNTTAB > }; > struct xen_remove_from_physmap xrfp = { > .domid = domid, > - .gpfn = SCRATCH_PFN_GNTTAB > }; > > + max_gfn = xc_domain_maximum_gpfn(xch, domid); > + if ( max_gfn <= 0 ) { > + xc_dom_panic(xch, XC_INTERNAL_ERROR, > + "%s: failed to get max gfn " > + "[errno=%d]\n", > + __FUNCTION__, errno); > + return -1; > + } > + xatp.gpfn = max_gfn + 1; > + xrfp.gpfn = max_gfn + 1; > + > rc = do_memory_op(xch, XENMEM_add_to_physmap, &xatp, sizeof(xatp)); > if ( rc != 0 ) > { >
George Dunlap
2013-Nov-04  12:31 UTC
Re: [PATCH RFC v13 05/20] pvh: Disable unneeded features of HVM containers
On 26/09/13 16:22, Jan Beulich wrote:>>>> On 23.09.13 at 18:49, George Dunlap <george.dunlap@eu.citrix.com> wrote: >> @@ -578,6 +592,9 @@ int hvm_domain_initialise(struct domain *d) >> >> void hvm_domain_relinquish_resources(struct domain *d) >> { >> + if ( is_pvh_domain(d) ) >> + return; >> + > Iirc in an earlier patch you''re allocating ->arch.hvm_domain.params. > Which is supposed to be freed here, and hence would now be > leaked for PVH domains. > >> @@ -1114,6 +1135,14 @@ int hvm_vcpu_initialise(struct vcpu *v) >> >> v->arch.hvm_vcpu.inject_trap.vector = -1; >> >> + if ( is_pvh_vcpu(v) ) >> + { >> + v->arch.hvm_vcpu.hcall_64bit = 1; /* PVH 32bitfixme. */ >> + /* This for hvm_long_mode_enabled(v). */ >> + v->arch.hvm_vcpu.guest_efer = EFER_SCE | EFER_LMA | EFER_LME; > Please fix the comment (perhaps just missing an "is").Gah... Sorry Jan, I thought I had already gone through and addressed a number of these things you pointed out, but apparently I hadn''t. I''ll go through the series again... -George
Roger Pau Monné
2013-Nov-04  13:18 UTC
Re: [PATCH RFC v13 19/20] PVH xen tools: libxc changes to build a PVH guest.
On 04/11/13 12:56, George Dunlap wrote:> On 18/10/13 17:45, Roger Pau Monné wrote: >> This patch is incomplete, PVH guests with 4GB of RAM or more will fail >> to boot due to libxc temporary mapping the grant table to gfn 0xFFFFE, >> which is an invalid gfn on HVM guests, but not on PVH guests. When >> libxc removes the mapping of 0xFFFFE it will leave an invalid mfn in >> the p2m table, that will make the guest crash when trying to access >> that gfn. >> >> The easiest way I''ve found to solve this is to temporary map the grant >> table to the maximum gfn + 1. This fixes the PVH issue, and also seems >> to work fine with HVM guests, here is the diff which should probably be >> integrated into this patch: > > Thanks Roger. I assume it''s OK to add your sign-off to this patch?Yes, you can merge it with the current patch. Roger.
On 08/10/13 09:45, Jan Beulich wrote:>>>> On 07.10.13 at 18:12, Tim Deegan <tim@xen.org> wrote: >> At 17:06 +0100 on 07 Oct (1381165615), George Dunlap wrote: >>> On 07/10/13 16:55, Roger Pau Monné wrote: >>>> On 23/09/13 18:49, George Dunlap wrote: >>>>> @@ -1028,12 +1129,28 @@ static int construct_vmcs(struct vcpu *v) >>>>> | (1U << TRAP_no_device); >>>>> vmx_update_exception_bitmap(v); >>>>> >>>>> + /* In HVM domains, this happens on the realmode->paging >>>>> + * transition. Since PVH never goes through this transition, we >>>>> + * need to do it at start-of-day. */ >>>>> + if ( is_pvh_domain(d) ) >>>>> + vmx_update_debug_state(v); >>>>> + >>>>> v->arch.hvm_vcpu.guest_cr[0] = X86_CR0_PE | X86_CR0_ET; >>>>> + >>>>> + /* PVH domains always start in paging mode */ >>>>> + if ( is_pvh_domain(d) ) >>>>> + v->arch.hvm_vcpu.guest_cr[0] |= X86_CR0_PG | X86_CR0_NE | >>>>> X86_CR0_WP; >>>>> + >>>>> hvm_update_guest_cr(v, 0); >>>>> >>>>> - v->arch.hvm_vcpu.guest_cr[4] = 0; >>>>> + v->arch.hvm_vcpu.guest_cr[4] = is_pvh_domain(d) ? >>>>> + real_cr4_to_pv_guest_cr4(mmu_cr4_features) >>>> Here we need to mask the bits in CR4 that the guest isn't allowed to >>>> set. Right now Xen is setting the VMXE bit by default, which the guest >>>> is not able to modify, so if the guests tries to update CR4 based on the >>>> previous value Xen is going to complain: >>>> >>>> + real_cr4_to_pv_guest_cr4(mmu_cr4_features) & >>>> + ~HVM_CR4_GUEST_RESERVED_BITS(v) >>> Thanks for testing that -- I'll include it in the next spin-up. >> <harping on> This is the sort of thing that would be easier if PVH guests >> were just HVM ones with extra hypercalls. <harping off> > And indeed I'm rather puzzled by the simplistic re-use of a PV > construct: There are memory management related bits in CR4, > and with memory management not being PV for PVH guests, > this seems conceptually wrong.What would be a better approach, do you think? -George _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
>>> On 07.11.13 at 13:02, George Dunlap <george.dunlap@eu.citrix.com> wrote: > On 08/10/13 09:45, Jan Beulich wrote: >>>>> On 07.10.13 at 18:12, Tim Deegan <tim@xen.org> wrote: >>> At 17:06 +0100 on 07 Oct (1381165615), George Dunlap wrote: >>>> On 07/10/13 16:55, Roger Pau Monné wrote: >>>>> On 23/09/13 18:49, George Dunlap wrote: >>>>>> @@ -1028,12 +1129,28 @@ static int construct_vmcs(struct vcpu *v) >>>>>> | (1U << TRAP_no_device); >>>>>> vmx_update_exception_bitmap(v); >>>>>> >>>>>> + /* In HVM domains, this happens on the realmode->paging >>>>>> + * transition. Since PVH never goes through this transition, we >>>>>> + * need to do it at start-of-day. */ >>>>>> + if ( is_pvh_domain(d) ) >>>>>> + vmx_update_debug_state(v); >>>>>> + >>>>>> v->arch.hvm_vcpu.guest_cr[0] = X86_CR0_PE | X86_CR0_ET; >>>>>> + >>>>>> + /* PVH domains always start in paging mode */ >>>>>> + if ( is_pvh_domain(d) ) >>>>>> + v->arch.hvm_vcpu.guest_cr[0] |= X86_CR0_PG | X86_CR0_NE | >>>>>> X86_CR0_WP; >>>>>> + >>>>>> hvm_update_guest_cr(v, 0); >>>>>> >>>>>> - v->arch.hvm_vcpu.guest_cr[4] = 0; >>>>>> + v->arch.hvm_vcpu.guest_cr[4] = is_pvh_domain(d) ? >>>>>> + real_cr4_to_pv_guest_cr4(mmu_cr4_features) >>>>> Here we need to mask the bits in CR4 that the guest isn't allowed to >>>>> set. Right now Xen is setting the VMXE bit by default, which the guest >>>>> is not able to modify, so if the guests tries to update CR4 based on the >>>>> previous value Xen is going to complain: >>>>> >>>>> + real_cr4_to_pv_guest_cr4(mmu_cr4_features) & >>>>> + ~HVM_CR4_GUEST_RESERVED_BITS(v) >>>> Thanks for testing that -- I'll include it in the next spin-up. >>> <harping on> This is the sort of thing that would be easier if PVH guests >>> were just HVM ones with extra hypercalls. <harping off> >> And indeed I'm rather puzzled by the simplistic re-use of a PV >> construct: There are memory management related bits in CR4, >> and with memory management not being PV for PVH guests, >> this seems conceptually wrong. > > What would be a better approach, do you think?Split between the memory management related bits and the others, taking the former from what HVM would and the latter from how PV would. Jan _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
On 26/09/13 16:29, Jan Beulich wrote:>>>> On 23.09.13 at 18:49, George Dunlap <george.dunlap@eu.citrix.com> wrote: >> Changes: >> * Enforce HAP mode for now >> * Disable exits related to virtual interrupts or emulated APICs >> * Disable changing paging mode >> - "unrestricted guest" (i.e., real mode for EPT) disabled >> - write guest EFER disabled >> * Start in 64-bit mode >> * Force TSC mode to be "none" >> * Paging mode update to happen in arch_set_info_guest >> >> Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com> >> Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com> >> --- >> v13: >> - Fix up default cr0 settings >> - Get rid of some unnecessary PVH-related changes >> - Return EOPNOTSUPP instead of ENOSYS if hardware features are not present >> - Remove an unnecessary variable from pvh_check_requirements >> CC: Jan Beulich <jbeulich@suse.com> >> CC: Tim Deegan <tim@xen.org> >> CC: Keir Fraser <keir@xen.org> >> --- >> xen/arch/x86/hvm/vmx/vmcs.c | 130 >> +++++++++++++++++++++++++++++++++++++++++-- >> 1 file changed, 126 insertions(+), 4 deletions(-) >> >> diff --git a/xen/arch/x86/hvm/vmx/vmcs.c b/xen/arch/x86/hvm/vmx/vmcs.c >> index cf54d18..53fccdf 100644 >> --- a/xen/arch/x86/hvm/vmx/vmcs.c >> +++ b/xen/arch/x86/hvm/vmx/vmcs.c >> @@ -828,6 +828,60 @@ void virtual_vmcs_vmwrite(void *vvmcs, u32 >> vmcs_encoding, u64 val) >> virtual_vmcs_exit(vvmcs); >> } >> >> +static int pvh_check_requirements(struct vcpu *v) >> +{ >> + u64 required; >> + >> + /* Check for required hardware features */ >> + if ( !cpu_has_vmx_ept ) >> + { >> + printk(XENLOG_G_INFO "PVH: CPU does not have EPT support\n"); >> + return -EOPNOTSUPP; >> + } >> + if ( !cpu_has_vmx_pat ) >> + { >> + printk(XENLOG_G_INFO "PVH: CPU does not have PAT support\n"); >> + return -EOPNOTSUPP; >> + } >> + if ( !cpu_has_vmx_msr_bitmap ) >> + { >> + printk(XENLOG_G_INFO "PVH: CPU does not have msr bitmap\n"); >> + return -EOPNOTSUPP; >> + } >> + if ( !cpu_has_vmx_secondary_exec_control ) >> + { >> + printk(XENLOG_G_INFO "CPU Secondary exec is required to run PVH\n"); >> + return -EOPNOTSUPP; >> + } > Up to here the checks are VMX specific, and hence belong in a VMX > specific file, ... > >> + required = X86_CR4_PAE | X86_CR4_VMXE | X86_CR4_OSFXSR; >> + if ( (real_cr4_to_pv_guest_cr4(mmu_cr4_features) & required) != required ) >> + { >> + printk(XENLOG_G_INFO "PVH: required CR4 features not available:%lx\n", >> + required); >> + return -EOPNOTSUPP; >> + } >> + >> + /* Check for required configuration options */ >> + if ( !paging_mode_hap(v->domain) ) >> + { >> + printk(XENLOG_G_INFO "HAP is required for PVH guest.\n"); >> + return -EINVAL; >> + } >> + /* >> + * If rdtsc exiting is turned on and it goes thru emulate_privileged_op, >> + * then pv_vcpu.ctrlreg must be added to the pvh struct. >> + */ >> + if ( v->domain->arch.vtsc ) >> + { >> + printk(XENLOG_G_INFO >> + "At present PVH only supports the default timer mode\n"); >> + return -EINVAL; >> + } > ... but all of these are pretty generic (apart from the X86_CR4_VMXE > in the CR4 mask checked above, but I wonder whether that > shouldn''t be checked much earlier - for HVM guests no such check > exists here afaik).The hap check can be removed and checked in hvm_domain_initialise(), and the vtsc one moved to tsc_set_info(). Is it really necessary to check these bits in cr4? Or is this more or less an ASSERT() to make sure people haven''t accidentally disabled these in the real->guest cr4 conversion? -George
>>> On 07.11.13 at 15:14, George Dunlap <george.dunlap@eu.citrix.com> wrote: > On 26/09/13 16:29, Jan Beulich wrote: >>>>> On 23.09.13 at 18:49, George Dunlap <george.dunlap@eu.citrix.com> wrote: >>> + required = X86_CR4_PAE | X86_CR4_VMXE | X86_CR4_OSFXSR; >>> + if ( (real_cr4_to_pv_guest_cr4(mmu_cr4_features) & required) != required ) >>> + { >>> + printk(XENLOG_G_INFO "PVH: required CR4 features not available:%lx\n", >>> + required); >>> + return -EOPNOTSUPP; >>> + } >>> + >>> + /* Check for required configuration options */ >>> + if ( !paging_mode_hap(v->domain) ) >>> + { >>> + printk(XENLOG_G_INFO "HAP is required for PVH guest.\n"); >>> + return -EINVAL; >>> + } >>> + /* >>> + * If rdtsc exiting is turned on and it goes thru emulate_privileged_op, >>> + * then pv_vcpu.ctrlreg must be added to the pvh struct. >>> + */ >>> + if ( v->domain->arch.vtsc ) >>> + { >>> + printk(XENLOG_G_INFO >>> + "At present PVH only supports the default timer mode\n"); >>> + return -EINVAL; >>> + } >> ... but all of these are pretty generic (apart from the X86_CR4_VMXE >> in the CR4 mask checked above, but I wonder whether that >> shouldn''t be checked much earlier - for HVM guests no such check >> exists here afaik). > > The hap check can be removed and checked in hvm_domain_initialise(), and > the vtsc one moved to tsc_set_info(). > > Is it really necessary to check these bits in cr4? Or is this more or > less an ASSERT() to make sure people haven''t accidentally disabled these > in the real->guest cr4 conversion?That''s a question to Mukesh really; I don''t know the motivation for them being there. Apart from that the check is also guest independent, and hence could be done - if needed at all - once (when setting the proposed pvh_enabled). So - PAE is always there and doesn''t require checking, - VMXE would better be replaced by hvm_enabled being a prereq for pvh_enabled, - FXSR is also always there (now that we haven''t been supporting 32-bit hosts for quite a while), and it escapes me what relation this has to PVH anyway. Jan
George Dunlap
2013-Nov-07  15:57 UTC
Re: [PATCH RFC v13 13/20] pvh: Set up more PV stuff in set_info_guest
On 26/09/13 16:43, Jan Beulich wrote:>> @@ -736,8 +748,24 @@ int arch_set_info_guest( >> >> if ( has_hvm_container_vcpu(v) ) >> { >> - hvm_set_info_guest(v); >> - goto out; >> + hvm_set_info_guest(v, compat ? 0 : c.nat->gs_base_kernel); > I had pointed this out to Mukesh already: If you want to imply kernel > mode here, the checks above should verify that the kernel mode flag > is set in flags.It looks like for some reason TF_kernel_mode is set unconditionally for HVM guests (though the /*???*/ would seem to indicate that someone at some point didn''t understand why): if ( (flags & VGCF_in_kernel) || has_hvm_container_vcpu(v)/*???*/ ) v->arch.flags |= TF_kernel_mode; I''ll add a comment pointing this out. -George