Qing He
2010-Apr-22  09:41 UTC
[Xen-devel] [PATCH 00/17][RFC] Nested virtualization for VMX
This patch set enables nested virtualization for VMX, that
is, it allows a VMX guest (L1) to run other VMX guests (L2).
The patch can generally run on different configurations:
  - EPT-on-EPT, shadow-on-EPT, shadow-on-shadow
  - different 32/64-bit combinations of L1 and L2
  - L1/L2 SMP
EPT-on-EPT is, however, preferable because of its performance
advantage. I've tested the patch on a 64-bit NHM L0,
against Xen cs. 21190. With EPT-on-EPT and a kernel
build workload, L2 needs around 17% more time to complete.
Known problems:
  - L1/L2=64/64, shadow-on-shadow doesn't work for now
  - On 21190, even without the nested patchset, Xen as L1
    suffers a considerable boot lag. This was not
    observed on my previous base, around cs.
    20200
  - multiple L2 guests in one L1 haven't been tested
The patch list is below. It contains 3 preparation
patches (01 -- 03), 11 generic patches (04 -- 14), 1 to
enable EPT-on-EPT (15), and 2 support patches (16, 17).
[PATCH 01/17] vmx: nest: fix CR4.VME in update_guest_cr
[PATCH 02/17] vmx: nest: rename host_vmcs
[PATCH 03/17] vmx: nest: wrapper for control update
[PATCH 04/17] vmx: nest: domain and vcpu flags
[PATCH 05/17] vmx: nest: nested control structure
[PATCH 06/17] vmx: nest: virtual vmcs layout
[PATCH 07/17] vmx: nest: handling VMX instruction exits
[PATCH 08/17] vmx: nest: L1 <-> L2 context switch
[PATCH 09/17] vmx: nest: interrupt
[PATCH 10/17] vmx: nest: VMExit handler in L2
[PATCH 11/17] vmx: nest: L2 tsc
[PATCH 12/17] vmx: nest: CR0.TS and #NM
[PATCH 13/17] vmx: nest: capability reporting MSRs
[PATCH 14/17] vmx: nest: enable virtual VMX
[PATCH 15/17] vmx: nest: virtual ept for nested
[PATCH 16/17] vmx: nest: hvmtrace for nested
[PATCH 17/17] tools: nest: allow enabling nesting
Thanks,
Qing He
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
Qing He
2010-Apr-22  09:41 UTC
[Xen-devel] [PATCH 01/17] vmx: nest: fix CR4.VME in update_guest_cr
X86_CR4_VME in guest_cr[4] is updated in cr0 handling, but not in
cr4 handling. Fix this for guest VM86.
Signed-off-by: Qing He <qing.he@intel.com>
---
 vmx.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
diff -r 9be1d3918ec7 -r ca507122f84e xen/arch/x86/hvm/vmx/vmx.c
--- a/xen/arch/x86/hvm/vmx/vmx.c	Wed Apr 21 23:43:59 2010 +0800
+++ b/xen/arch/x86/hvm/vmx/vmx.c	Thu Apr 22 21:28:41 2010 +0800
@@ -1174,7 +1174,8 @@
         if ( paging_mode_hap(v->domain) )
             v->arch.hvm_vcpu.hw_cr[4] &= ~X86_CR4_PAE;
         v->arch.hvm_vcpu.hw_cr[4] |= v->arch.hvm_vcpu.guest_cr[4];
-        if ( v->arch.hvm_vmx.vmx_realmode ) 
+        if ( v->arch.hvm_vmx.vmx_realmode ||
+             (v->arch.hvm_vcpu.hw_cr[4] & X86_CR4_VME) )
             v->arch.hvm_vcpu.hw_cr[4] |= X86_CR4_VME;
         if ( paging_mode_hap(v->domain) && !hvm_paging_enabled(v) )
         {
Qing He
[Xen-devel] [PATCH 02/17] vmx: nest: rename host_vmcs
The VMCS region used for vmxon is named host_vmcs, which is
somewhat misleading in a nested virtualization context. Rename it
to vmxon_vmcs.
Signed-off-by: Qing He <qing.he@intel.com>
---
 vmcs.c |   10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)
diff -r ca507122f84e -r fe49b7452637 xen/arch/x86/hvm/vmx/vmcs.c
--- a/xen/arch/x86/hvm/vmx/vmcs.c	Thu Apr 22 21:28:41 2010 +0800
+++ b/xen/arch/x86/hvm/vmx/vmcs.c	Thu Apr 22 21:49:38 2010 +0800
@@ -67,7 +67,7 @@
 u8 vmx_ept_super_page_level_limit __read_mostly;
 bool_t cpu_has_vmx_ins_outs_instr_info __read_mostly;
 
-static DEFINE_PER_CPU_READ_MOSTLY(struct vmcs_struct *, host_vmcs);
+static DEFINE_PER_CPU_READ_MOSTLY(struct vmcs_struct *, vmxon_vmcs);
 static DEFINE_PER_CPU(struct vmcs_struct *, current_vmcs);
 static DEFINE_PER_CPU(struct list_head, active_vmcs_list);
 
@@ -338,11 +338,11 @@
 
 int vmx_cpu_prepare(unsigned int cpu)
 {
-    if ( per_cpu(host_vmcs, cpu) != NULL )
+    if ( per_cpu(vmxon_vmcs, cpu) != NULL )
         return 0;
 
-    per_cpu(host_vmcs, cpu) = vmx_alloc_vmcs();
-    if ( per_cpu(host_vmcs, cpu) != NULL )
+    per_cpu(vmxon_vmcs, cpu) = vmx_alloc_vmcs();
+    if ( per_cpu(vmxon_vmcs, cpu) != NULL )
         return 0;
 
     printk("CPU%d: Could not allocate host VMCS\n", cpu);
@@ -399,7 +399,7 @@
     if ( vmx_cpu_prepare(cpu) != 0 )
         return 0;
 
-    switch ( __vmxon(virt_to_maddr(this_cpu(host_vmcs))) )
+    switch ( __vmxon(virt_to_maddr(this_cpu(vmxon_vmcs))) )
     {
     case -2: /* #UD or #GP */
         if ( bios_locked &&
Qing He
2010-Apr-22  09:41 UTC
[Xen-devel] [PATCH 03/17] vmx: nest: wrapper for control update
In nested virtualization, the controls maintained by L0 may differ
from the controls in the physical VMCS.
Explicitly maintain the guest controls in variables and use wrappers
for control updates; do not rely on the physical control values.
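The idea can be illustrated with a minimal standalone sketch. It is not
code from this series: struct toy_vcpu, its l1_exception_bitmap and
in_l2 members, and the toy_* functions are hypothetical stand-ins, and
merging in L1's intercepts is only an assumption about how a later
patch might use the wrapper. The point is that callers modify the
cached copy and leave the physical write to one function, so the
physical value is free to differ from what L0 wants for itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_TRAP_NO_DEVICE   7   /* #NM vector */
#define TOY_TRAP_PAGE_FAULT 14   /* #PF vector */

/* Hypothetical stand-in for the per-vcpu VMX state; not Xen's struct. */
struct toy_vcpu {
    uint32_t exception_bitmap;     /* L0's cached copy, the source of truth */
    uint32_t l1_exception_bitmap;  /* assumed: intercepts requested by L1 */
    int      in_l2;                /* assumed: vcpu is running an L2 guest */
    uint32_t hw_exception_bitmap;  /* models the physical VMCS field */
};

/* The only place where the (modelled) physical field is written. */
static void toy_update_exception_bitmap(struct toy_vcpu *v)
{
    uint32_t val = v->exception_bitmap;

    if ( v->in_l2 )
        val |= v->l1_exception_bitmap;

    v->hw_exception_bitmap = val;
}

/* Callers change the cached copy, then call the wrapper. */
static void toy_fpu_enter(struct toy_vcpu *v)
{
    assert(v != NULL);
    v->exception_bitmap &= ~(1u << TOY_TRAP_NO_DEVICE);
    toy_update_exception_bitmap(v);
}
```

A read-modify-write of the physical field would copy L1's extra bits
back into L0's own view, which is why the cached variable is kept.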
Signed-off-by: Qing He <qing.he@intel.com>
---
 arch/x86/hvm/vmx/intr.c        |    4 +-
 arch/x86/hvm/vmx/vmcs.c        |    6 +--
 arch/x86/hvm/vmx/vmx.c         |   72 ++++++++++++++++++++++++-----------------
 include/asm-x86/hvm/vmx/vmcs.h |    1 
 include/asm-x86/hvm/vmx/vmx.h  |    3 +
 5 files changed, 52 insertions(+), 34 deletions(-)
diff -r fe49b7452637 -r a0bbec37b529 xen/arch/x86/hvm/vmx/intr.c
--- a/xen/arch/x86/hvm/vmx/intr.c	Thu Apr 22 21:49:38 2010 +0800
+++ b/xen/arch/x86/hvm/vmx/intr.c	Thu Apr 22 21:49:38 2010 +0800
@@ -106,7 +106,7 @@
     if ( !(*cpu_exec_control & ctl) )
     {
         *cpu_exec_control |= ctl;
-        __vmwrite(CPU_BASED_VM_EXEC_CONTROL, *cpu_exec_control);
+        vmx_update_cpu_exec_control(v);
     }
 }
 
@@ -121,7 +121,7 @@
     if ( unlikely(v->arch.hvm_vcpu.single_step) )
     {
         v->arch.hvm_vmx.exec_control |= CPU_BASED_MONITOR_TRAP_FLAG;
-        __vmwrite(CPU_BASED_VM_EXEC_CONTROL, v->arch.hvm_vmx.exec_control);
+        vmx_update_cpu_exec_control(v);
         return;
     }
 
diff -r fe49b7452637 -r a0bbec37b529 xen/arch/x86/hvm/vmx/vmcs.c
--- a/xen/arch/x86/hvm/vmx/vmcs.c	Thu Apr 22 21:49:38 2010 +0800
+++ b/xen/arch/x86/hvm/vmx/vmcs.c	Thu Apr 22 21:49:38 2010 +0800
@@ -737,10 +737,10 @@
     __vmwrite(VMCS_LINK_POINTER_HIGH, ~0UL);
 #endif
 
-    __vmwrite(EXCEPTION_BITMAP,
-              HVM_TRAP_MASK
+    v->arch.hvm_vmx.exception_bitmap = HVM_TRAP_MASK
               | (paging_mode_hap(d) ? 0 : (1U << TRAP_page_fault))
-              | (1U << TRAP_no_device));
+              | (1U << TRAP_no_device);
+    __vmwrite(EXCEPTION_BITMAP, v->arch.hvm_vmx.exception_bitmap);
 
     v->arch.hvm_vcpu.guest_cr[0] = X86_CR0_PE | X86_CR0_ET;
     hvm_update_guest_cr(v, 0);
diff -r fe49b7452637 -r a0bbec37b529 xen/arch/x86/hvm/vmx/vmx.c
--- a/xen/arch/x86/hvm/vmx/vmx.c	Thu Apr 22 21:49:38 2010 +0800
+++ b/xen/arch/x86/hvm/vmx/vmx.c	Thu Apr 22 21:49:38 2010 +0800
@@ -390,6 +390,22 @@
 
 #endif /* __i386__ */
 
+void vmx_update_cpu_exec_control(struct vcpu *v)
+{
+    __vmwrite(CPU_BASED_VM_EXEC_CONTROL, v->arch.hvm_vmx.exec_control);
+}
+
+void vmx_update_secondary_exec_control(struct vcpu *v)
+{
+    __vmwrite(SECONDARY_VM_EXEC_CONTROL,
+              v->arch.hvm_vmx.secondary_exec_control);
+}
+
+void vmx_update_exception_bitmap(struct vcpu *v)
+{
+    __vmwrite(EXCEPTION_BITMAP, v->arch.hvm_vmx.exception_bitmap);
+}
+
 static int vmx_guest_x86_mode(struct vcpu *v)
 {
     unsigned int cs_ar_bytes;
@@ -413,7 +429,7 @@
     /* Clear the DR dirty flag and re-enable intercepts for DR accesses. */
     v->arch.hvm_vcpu.flag_dr_dirty = 0;
     v->arch.hvm_vmx.exec_control |= CPU_BASED_MOV_DR_EXITING;
-    __vmwrite(CPU_BASED_VM_EXEC_CONTROL, v->arch.hvm_vmx.exec_control);
+    vmx_update_cpu_exec_control(v);
 
     v->arch.guest_context.debugreg[0] = read_debugreg(0);
     v->arch.guest_context.debugreg[1] = read_debugreg(1);
@@ -627,7 +643,8 @@
 static void vmx_fpu_enter(struct vcpu *v)
 {
     setup_fpu(v);
-    __vm_clear_bit(EXCEPTION_BITMAP, TRAP_no_device);
+    v->arch.hvm_vmx.exception_bitmap &= ~(1u << TRAP_no_device);
+    vmx_update_exception_bitmap(v);
     v->arch.hvm_vmx.host_cr0 &= ~X86_CR0_TS;
     __vmwrite(HOST_CR0, v->arch.hvm_vmx.host_cr0);
 }
@@ -653,7 +670,8 @@
     {
         v->arch.hvm_vcpu.hw_cr[0] |= X86_CR0_TS;
         __vmwrite(GUEST_CR0, v->arch.hvm_vcpu.hw_cr[0]);
-        __vm_set_bit(EXCEPTION_BITMAP, TRAP_no_device);
+        v->arch.hvm_vmx.exception_bitmap |= (1u << TRAP_no_device);
+        vmx_update_exception_bitmap(v);
     }
 }
 
@@ -959,7 +977,7 @@
     v->arch.hvm_vmx.exec_control &= ~CPU_BASED_RDTSC_EXITING;
     if ( enable )
         v->arch.hvm_vmx.exec_control |= CPU_BASED_RDTSC_EXITING;
-    __vmwrite(CPU_BASED_VM_EXEC_CONTROL, v->arch.hvm_vmx.exec_control);
+    vmx_update_cpu_exec_control(v);
     vmx_vmcs_exit(v);
 }
 
@@ -1052,7 +1070,7 @@
 
 void vmx_update_debug_state(struct vcpu *v)
 {
-    unsigned long intercepts, mask;
+    unsigned long mask;
 
     ASSERT(v == current);
 
@@ -1060,12 +1078,11 @@
     if ( !cpu_has_monitor_trap_flag )
         mask |= 1u << TRAP_debug;
 
-    intercepts = __vmread(EXCEPTION_BITMAP);
     if ( v->arch.hvm_vcpu.debug_state_latch )
-        intercepts |= mask;
+        v->arch.hvm_vmx.exception_bitmap |= mask;
     else
-        intercepts &= ~mask;
-    __vmwrite(EXCEPTION_BITMAP, intercepts);
+        v->arch.hvm_vmx.exception_bitmap &= ~mask;
+    vmx_update_exception_bitmap(v);
 }
 
 static void vmx_update_guest_cr(struct vcpu *v, unsigned int cr)
@@ -1092,7 +1109,7 @@
             v->arch.hvm_vmx.exec_control &= ~cr3_ctls;
             if ( !hvm_paging_enabled(v) )
                 v->arch.hvm_vmx.exec_control |= cr3_ctls;
-            __vmwrite(CPU_BASED_VM_EXEC_CONTROL, v->arch.hvm_vmx.exec_control);
+            vmx_update_cpu_exec_control(v);
 
             /* Changing CR0.PE can change some bits in real CR4. */
             vmx_update_guest_cr(v, 4);
@@ -1127,7 +1144,8 @@
                     vmx_set_segment_register(v, s, &reg[s]);
                 v->arch.hvm_vcpu.hw_cr[4] |= X86_CR4_VME;
                 __vmwrite(GUEST_CR4, v->arch.hvm_vcpu.hw_cr[4]);
-                __vmwrite(EXCEPTION_BITMAP, 0xffffffff);
+                v->arch.hvm_vmx.exception_bitmap = 0xffffffff;
+                vmx_update_exception_bitmap(v);
             }
             else 
             {
@@ -1139,11 +1157,11 @@
                     ((v->arch.hvm_vcpu.hw_cr[4] & ~X86_CR4_VME)
                      |(v->arch.hvm_vcpu.guest_cr[4] & X86_CR4_VME));
                 __vmwrite(GUEST_CR4, v->arch.hvm_vcpu.hw_cr[4]);
-                __vmwrite(EXCEPTION_BITMAP, 
-                          HVM_TRAP_MASK
+                v->arch.hvm_vmx.exception_bitmap = HVM_TRAP_MASK
                           | (paging_mode_hap(v->domain) ?
                              0 : (1U << TRAP_page_fault))
-                          | (1U << TRAP_no_device));
+                          | (1U << TRAP_no_device);
+                vmx_update_exception_bitmap(v);
                 vmx_update_debug_state(v);
             }
         }
@@ -1556,7 +1574,7 @@
 
     /* Allow guest direct access to DR registers */
     v->arch.hvm_vmx.exec_control &= ~CPU_BASED_MOV_DR_EXITING;
-    __vmwrite(CPU_BASED_VM_EXEC_CONTROL, v->arch.hvm_vmx.exec_control);
+    vmx_update_cpu_exec_control(v);
 }
 
 static void vmx_invlpg_intercept(unsigned long vaddr)
@@ -1949,18 +1967,18 @@
 void vmx_vlapic_msr_changed(struct vcpu *v)
 {
     struct vlapic *vlapic = vcpu_vlapic(v);
-    uint32_t ctl;
 
     if ( !cpu_has_vmx_virtualize_apic_accesses )
         return;
 
     vmx_vmcs_enter(v);
-    ctl  = __vmread(SECONDARY_VM_EXEC_CONTROL);
-    ctl &= ~SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
+    v->arch.hvm_vmx.secondary_exec_control
+        &= ~SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
     if ( !vlapic_hw_disabled(vlapic) &&
          (vlapic_base_address(vlapic) == APIC_DEFAULT_PHYS_BASE) )
-        ctl |= SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
-    __vmwrite(SECONDARY_VM_EXEC_CONTROL, ctl);
+        v->arch.hvm_vmx.secondary_exec_control
+            |= SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
+    vmx_update_secondary_exec_control(v);
     vmx_vmcs_exit(v);
 }
 
@@ -2497,14 +2515,12 @@
     case EXIT_REASON_PENDING_VIRT_INTR:
         /* Disable the interrupt window. */
         v->arch.hvm_vmx.exec_control &= ~CPU_BASED_VIRTUAL_INTR_PENDING;
-        __vmwrite(CPU_BASED_VM_EXEC_CONTROL,
-                  v->arch.hvm_vmx.exec_control);
+        vmx_update_cpu_exec_control(v);
         break;
     case EXIT_REASON_PENDING_VIRT_NMI:
         /* Disable the NMI window. */
         v->arch.hvm_vmx.exec_control &= ~CPU_BASED_VIRTUAL_NMI_PENDING;
-        __vmwrite(CPU_BASED_VM_EXEC_CONTROL,
-                  v->arch.hvm_vmx.exec_control);
+        vmx_update_cpu_exec_control(v);
         break;
     case EXIT_REASON_TASK_SWITCH: {
         const enum hvm_task_switch_reason reasons[] = {
@@ -2644,7 +2660,7 @@
 
     case EXIT_REASON_MONITOR_TRAP_FLAG:
         v->arch.hvm_vmx.exec_control &= ~CPU_BASED_MONITOR_TRAP_FLAG;
-        __vmwrite(CPU_BASED_VM_EXEC_CONTROL, v->arch.hvm_vmx.exec_control);
+        vmx_update_cpu_exec_control(v);
         if ( v->domain->debugger_attached && v->arch.hvm_vcpu.single_step )
             domain_pause_for_debugger();
         break;
@@ -2694,16 +2710,14 @@
             /* VPID was disabled: now enabled. */
             curr->arch.hvm_vmx.secondary_exec_control |=
                 SECONDARY_EXEC_ENABLE_VPID;
-            __vmwrite(SECONDARY_VM_EXEC_CONTROL,
-                      curr->arch.hvm_vmx.secondary_exec_control);
+            vmx_update_secondary_exec_control(curr);
         }
         else if ( old_asid && !new_asid )
         {
             /* VPID was enabled: now disabled. */
             curr->arch.hvm_vmx.secondary_exec_control &=
                 ~SECONDARY_EXEC_ENABLE_VPID;
-            __vmwrite(SECONDARY_VM_EXEC_CONTROL,
-                      curr->arch.hvm_vmx.secondary_exec_control);
+            vmx_update_secondary_exec_control(curr);
         }
     }
 
diff -r fe49b7452637 -r a0bbec37b529 xen/include/asm-x86/hvm/vmx/vmcs.h
--- a/xen/include/asm-x86/hvm/vmx/vmcs.h	Thu Apr 22 21:49:38 2010 +0800
+++ b/xen/include/asm-x86/hvm/vmx/vmcs.h	Thu Apr 22 21:49:38 2010 +0800
@@ -90,6 +90,7 @@
     /* Cache of cpu execution control. */
     u32                  exec_control;
     u32                  secondary_exec_control;
+    u32                  exception_bitmap;
 
     /* PMU */
     struct vpmu_struct   vpmu;
diff -r fe49b7452637 -r a0bbec37b529 xen/include/asm-x86/hvm/vmx/vmx.h
--- a/xen/include/asm-x86/hvm/vmx/vmx.h	Thu Apr 22 21:49:38 2010 +0800
+++ b/xen/include/asm-x86/hvm/vmx/vmx.h	Thu Apr 22 21:49:38 2010 +0800
@@ -60,6 +60,9 @@
 void vmx_vlapic_msr_changed(struct vcpu *v);
 void vmx_realmode(struct cpu_user_regs *regs);
 void vmx_update_debug_state(struct vcpu *v);
+void vmx_update_cpu_exec_control(struct vcpu *v);
+void vmx_update_secondary_exec_control(struct vcpu *v);
+void vmx_update_exception_bitmap(struct vcpu *v);
 
 /*
  * Exit Reasons
Qing He
[Xen-devel] [PATCH 04/17] vmx: nest: domain and vcpu flags
Introduce a domain create flag that lets the user control whether
nested virtualization is available to a domain.
When the flag is not set, all nested virtualization reporting and
functionality is disabled, which improves guest security.
A separate per-vcpu flag indicates whether the vcpu
is in L1 or L2 context.
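The flag handling can be sketched in isolation as follows. This is a
simplified model, not the Xen code: the TOY_CDF_* constants, struct
toy_domain and toy_domain_create() are made-up names, and only two of
the create flags are modelled. It shows the two checks the patch
relies on: unknown flag bits are rejected, and nesting is only made
available to HVM domains.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical create flags, modelled on XEN_DOMCTL_CDF_*. */
#define TOY_CDF_hvm_guest  (1u << 0)
#define TOY_CDF_nesting    (1u << 4)
#define TOY_CDF_known      (TOY_CDF_hvm_guest | TOY_CDF_nesting)

struct toy_domain {
    int is_hvm;
    int nesting_avail;   /* nested virtualization allowed for this domain */
};

/* Returns 0 on success, -1 if any unknown flag bit is set. */
static int toy_domain_create(struct toy_domain *d, uint32_t flags)
{
    assert(d != NULL);

    if ( flags & ~TOY_CDF_known )
        return -1;

    d->is_hvm = !!(flags & TOY_CDF_hvm_guest);
    /* Nesting only makes sense for HVM guests. */
    d->nesting_avail = d->is_hvm && (flags & TOY_CDF_nesting);

    return 0;
}
```

With this shape, a domain created without the nesting flag never sees
nesting_avail set, so later code can gate every nested feature on that
single field.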
Signed-off-by: Qing He <qing.he@intel.com>
---
 arch/x86/domain.c            |    4 ++++
 common/domctl.c              |    5 ++++-
 include/asm-x86/hvm/domain.h |    1 +
 include/asm-x86/hvm/vcpu.h   |    2 ++
 include/public/domctl.h      |    3 +++
 include/xen/sched.h          |    3 +++
 6 files changed, 17 insertions(+), 1 deletion(-)
diff -r a0bbec37b529 -r 6f0f41f80285 xen/arch/x86/domain.c
--- a/xen/arch/x86/domain.c	Thu Apr 22 21:49:38 2010 +0800
+++ b/xen/arch/x86/domain.c	Thu Apr 22 22:30:00 2010 +0800
@@ -413,6 +413,10 @@
 
     d->arch.s3_integrity = !!(domcr_flags & DOMCRF_s3_integrity);
 
+    d->arch.hvm_domain.nesting_avail =
+        is_hvm_domain(d) &&
+        (domcr_flags & DOMCRF_nesting);
+
     INIT_LIST_HEAD(&d->arch.pdev_list);
 
     d->arch.relmem = RELMEM_not_started;
diff -r a0bbec37b529 -r 6f0f41f80285 xen/common/domctl.c
--- a/xen/common/domctl.c	Thu Apr 22 21:49:38 2010 +0800
+++ b/xen/common/domctl.c	Thu Apr 22 22:30:00 2010 +0800
@@ -393,7 +393,8 @@
         if ( supervisor_mode_kernel ||
              (op->u.createdomain.flags &
              ~(XEN_DOMCTL_CDF_hvm_guest | XEN_DOMCTL_CDF_hap |
-               XEN_DOMCTL_CDF_s3_integrity | XEN_DOMCTL_CDF_oos_off)) )
+               XEN_DOMCTL_CDF_s3_integrity | XEN_DOMCTL_CDF_oos_off |
+               XEN_DOMCTL_CDF_nesting)) )
             break;
 
         dom = op->domain;
@@ -429,6 +430,8 @@
             domcr_flags |= DOMCRF_s3_integrity;
         if ( op->u.createdomain.flags & XEN_DOMCTL_CDF_oos_off )
             domcr_flags |= DOMCRF_oos_off;
+        if ( op->u.createdomain.flags & XEN_DOMCTL_CDF_nesting )
+            domcr_flags |= DOMCRF_nesting;
 
         ret = -ENOMEM;
         d = domain_create(dom, domcr_flags, op->u.createdomain.ssidref);
diff -r a0bbec37b529 -r 6f0f41f80285 xen/include/asm-x86/hvm/domain.h
--- a/xen/include/asm-x86/hvm/domain.h	Thu Apr 22 21:49:38 2010 +0800
+++ b/xen/include/asm-x86/hvm/domain.h	Thu Apr 22 22:30:00 2010 +0800
@@ -93,6 +93,7 @@
     bool_t                 mem_sharing_enabled;
     bool_t                 qemu_mapcache_invalidate;
     bool_t                 is_s3_suspended;
+    bool_t                 nesting_avail;
 
     union {
         struct vmx_domain vmx;
diff -r a0bbec37b529 -r 6f0f41f80285 xen/include/asm-x86/hvm/vcpu.h
--- a/xen/include/asm-x86/hvm/vcpu.h	Thu Apr 22 21:49:38 2010 +0800
+++ b/xen/include/asm-x86/hvm/vcpu.h	Thu Apr 22 22:30:00 2010 +0800
@@ -70,6 +70,8 @@
     bool_t              debug_state_latch;
     bool_t              single_step;
 
+    bool_t              in_nesting;
+
     u64                 asid_generation;
     u32                 asid;
 
diff -r a0bbec37b529 -r 6f0f41f80285 xen/include/public/domctl.h
--- a/xen/include/public/domctl.h	Thu Apr 22 21:49:38 2010 +0800
+++ b/xen/include/public/domctl.h	Thu Apr 22 22:30:00 2010 +0800
@@ -64,6 +64,9 @@
  /* Disable out-of-sync shadow page tables? */
 #define _XEN_DOMCTL_CDF_oos_off       3
 #define XEN_DOMCTL_CDF_oos_off        (1U<<_XEN_DOMCTL_CDF_oos_off)
+ /* Is nested virtualization allowed */
+#define _XEN_DOMCTL_CDF_nesting       4
+#define XEN_DOMCTL_CDF_nesting        (1U<<_XEN_DOMCTL_CDF_nesting)
 };
 typedef struct xen_domctl_createdomain xen_domctl_createdomain_t;
 DEFINE_XEN_GUEST_HANDLE(xen_domctl_createdomain_t);
diff -r a0bbec37b529 -r 6f0f41f80285 xen/include/xen/sched.h
--- a/xen/include/xen/sched.h	Thu Apr 22 21:49:38 2010 +0800
+++ b/xen/include/xen/sched.h	Thu Apr 22 22:30:00 2010 +0800
@@ -393,6 +393,9 @@
  /* DOMCRF_oos_off: dont use out-of-sync optimization for shadow page tables */
 #define _DOMCRF_oos_off         4
 #define DOMCRF_oos_off          (1U<<_DOMCRF_oos_off)
+ /* DOMCRF_nesting: Create a domain that allows nested virtualization . */
+#define _DOMCRF_nesting       5
+#define DOMCRF_nesting        (1U<<_DOMCRF_nesting)
 
 /*
  * rcu_lock_domain_by_id() is more efficient than get_domain_by_id().
Qing He
2010-Apr-22  09:41 UTC
[Xen-devel] [PATCH 05/17] vmx: nest: nested control structure
Add v->arch.hvm_vmx.nest as the control structure for nested
virtualization.
Signed-off-by: Qing He <qing.he@intel.com>
---
 b/xen/include/asm-x86/hvm/vmx/nest.h |   45 +++++++++++++++++++++++++++++++++++
 xen/include/asm-x86/hvm/vmx/vmcs.h   |    4 +++
 2 files changed, 49 insertions(+)
diff -r 6f0f41f80285 -r fe50c7458a43 xen/include/asm-x86/hvm/vmx/nest.h
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/xen/include/asm-x86/hvm/vmx/nest.h	Thu Apr 22 22:30:09 2010 +0800
@@ -0,0 +1,45 @@
+/*
+ * nest.h: nested virtualization for VMX.
+ *
+ * Copyright (c) 2010, Intel Corporation.
+ * Author: Qing He <qing.he@intel.com>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program; if not, write to the Free Software Foundation, Inc., 59 Temple
+ * Place - Suite 330, Boston, MA 02111-1307 USA.
+ *
+ */
+#ifndef __ASM_X86_HVM_NEST_H__
+#define __ASM_X86_HVM_NEST_H__
+
+struct vmcs_struct;
+
+struct vmx_nest_struct {
+    paddr_t              guest_vmxon_pa;
+
+    /* Saved host vmcs for vcpu itself */
+    struct vmcs_struct  *hvmcs;
+
+    /*
+     * Guest's `current vmcs' of vcpu
+     *  - gvmcs_pa: guest VMCS region physical address
+     *  - vvmcs:    (guest) virtual vmcs
+     *  - svmcs:    effective vmcs for the guest of this vcpu
+     *  - invalid:  launch state: invalid on clear, valid on ld
+     */
+    paddr_t              gvmcs_pa;
+    void                *vvmcs;
+    struct vmcs_struct  *svmcs;
+    int                  vmcs_invalid;
+};
+
+#endif /* __ASM_X86_HVM_NEST_H__ */
diff -r 6f0f41f80285 -r fe50c7458a43 xen/include/asm-x86/hvm/vmx/vmcs.h
--- a/xen/include/asm-x86/hvm/vmx/vmcs.h	Thu Apr 22 22:30:00 2010 +0800
+++ b/xen/include/asm-x86/hvm/vmx/vmcs.h	Thu Apr 22 22:30:09 2010 +0800
@@ -22,6 +22,7 @@
 #include <asm/config.h>
 #include <asm/hvm/io.h>
 #include <asm/hvm/vmx/vpmu.h>
+#include <asm/hvm/vmx/nest.h>
 
 extern void start_vmx(void);
 extern void vmcs_dump_vcpu(struct vcpu *v);
@@ -95,6 +96,9 @@
     /* PMU */
     struct vpmu_struct   vpmu;
 
+    /* nested virtualization */
+    struct vmx_nest_struct nest;
+
 #ifdef __x86_64__
     struct vmx_msr_state msr_state;
     unsigned long        shadow_gs;
Qing He
[Xen-devel] [PATCH 06/17] vmx: nest: virtual vmcs layout
Since the physical VMCS layout is unknown, a customized virtual VMCS
(vvmcs) is introduced. The VMCS field encoding is converted to an
offset into the vvmcs page.
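As a worked example of the conversion, the sketch below recomputes the
offset with plain shifts instead of the bitfield union. The enc_* and
toy_vvmcs_offset() names are made up for illustration; the arithmetic
follows vvmcs_offset() in this patch, and the bit positions (index in
bits 1-9, type in bits 10-11, width in bits 13-14) follow the
architectural VMCS field encoding.

```c
#include <assert.h>
#include <stdint.h>

/* Field positions inside a 32-bit VMCS field encoding. */
static unsigned int enc_index(uint32_t enc) { return (enc >> 1) & 0x1ff; }
static unsigned int enc_type(uint32_t enc)  { return (enc >> 10) & 0x3; }
static unsigned int enc_width(uint32_t enc) { return (enc >> 13) & 0x3; }

/*
 * Offset (in u64 units) into the 4k vvmcs page:
 * 5 bits of index, 2 bits of type, 2 bits of width.
 */
static unsigned int toy_vvmcs_offset(uint32_t enc)
{
    unsigned int offset = (enc_index(enc) & 0x1f) |
                          enc_type(enc) << 5 |
                          enc_width(enc) << 7;

    if ( offset == 0 )      /* VPID: moved out of the header area */
        offset = 0x3f;

    assert(offset < 512);   /* 4096 bytes / 8 bytes per slot */
    return offset;
}
```

For instance, VPID (encoding 0x0000) lands in slot 63, the primary
processor-based execution controls (0x4002) in slot 257, and guest RIP
(0x681E) in slot 463.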
Signed-off-by: Qing He <qing.he@intel.com>
---
 vvmcs.h |  154 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 154 insertions(+)
diff -r fe50c7458a43 -r 9cb31076d2d0 xen/include/asm-x86/hvm/vmx/vvmcs.h
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/xen/include/asm-x86/hvm/vmx/vvmcs.h	Thu Apr 22 22:30:09 2010 +0800
@@ -0,0 +1,154 @@
+/*
+ * vvmcs.h: virtual VMCS access for nested virtualization.
+ *
+ * Copyright (c) 2010, Intel Corporation.
+ * Author: Qing He <qing.he@intel.com>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program; if not, write to the Free Software Foundation, Inc., 59 Temple
+ * Place - Suite 330, Boston, MA 02111-1307 USA.
+ *
+ */
+
+#include <xen/config.h>
+#include <asm/types.h>
+
+/*
+ * Virtual VMCS layout
+ *
+ * Since physical VMCS layout is unknown, a custom layout is used
+ * for virtual VMCS seen by guest. It occupies a 4k page, and the
+ * field is offset by an 9-bit offset into u64[], The offset is as
+ * follow, which means every <width, type> pair has a max of 32
+ * fields available.
+ *
+ *             9       7      5               0
+ *             --------------------------------
+ *     offset: | width | type |     index     |
+ *             --------------------------------
+ *
+ * Also, since the lower range <width=0, type={0,1}> has only one
+ * field: VPID, it is moved to a higher offset (63), and leaves the
+ * lower range to non-indexed field like VMCS revision.
+ *
+ */
+
+#define VVMCS_REVISION 0x40000001u
+
+struct vvmcs_header {
+    u32 revision;
+    u32 abort;
+};
+
+union vmcs_encoding {
+    struct {
+        u32 access_type : 1;
+        u32 index : 9;
+        u32 type : 2;
+        u32 rsv1 : 1;
+        u32 width : 2;
+        u32 rsv2 : 17;
+    };
+    u32 word;
+};
+
+enum vvmcs_encoding_width {
+    VVMCS_WIDTH_16 = 0,
+    VVMCS_WIDTH_64,
+    VVMCS_WIDTH_32,
+    VVMCS_WIDTH_NATURAL,
+};
+
+enum vvmcs_encoding_type {
+    VVMCS_TYPE_CONTROL = 0,
+    VVMCS_TYPE_RO,
+    VVMCS_TYPE_GSTATE,
+    VVMCS_TYPE_HSTATE,
+};
+
+static inline int vvmcs_offset(u32 width, u32 type, u32 index)
+{
+    int offset;
+
+    offset = (index & 0x1f) | type << 5 | width << 7;
+
+    if ( offset == 0 )    /* vpid */
+        offset = 0x3f;
+
+    return offset;
+}
+
+static inline u64 __get_vvmcs(void *vvmcs, u32 vmcs_encoding)
+{
+    union vmcs_encoding enc;
+    u64 *content = (u64 *) vvmcs;
+    int offset;
+    u64 res;
+
+    enc.word = vmcs_encoding;
+    offset = vvmcs_offset(enc.width, enc.type, enc.index);
+    res = content[offset];
+
+    switch ( enc.width ) {
+    case VVMCS_WIDTH_16:
+        res &= 0xffff;
+        break;
+    case VVMCS_WIDTH_64:
+        if ( enc.access_type )
+            res >>= 32;
+        break;
+    case VVMCS_WIDTH_32:
+        res &= 0xffffffff;
+        break;
+    case VVMCS_WIDTH_NATURAL:
+    default:
+        break;
+    }
+
+    return res;
+}
+
+static inline void __set_vvmcs(void *vvmcs, u32 vmcs_encoding, u64 val)
+{
+    union vmcs_encoding enc;
+    u64 *content = (u64 *) vvmcs;
+    int offset;
+    u64 res;
+
+    enc.word = vmcs_encoding;
+    offset = vvmcs_offset(enc.width, enc.type, enc.index);
+    res = content[offset];
+
+    switch ( enc.width ) {
+    case VVMCS_WIDTH_16:
+        res = val & 0xffff;
+        break;
+    case VVMCS_WIDTH_64:
+        if ( enc.access_type )
+        {
+            res &= 0xffffffff;
+            res |= val << 32;
+        }
+        else
+            res = val;
+        break;
+    case VVMCS_WIDTH_32:
+        res = val & 0xffffffff;
+        break;
+    case VVMCS_WIDTH_NATURAL:
+    default:
+        res = val;
+        break;
+    }
+
+    content[offset] = res;
+}
Qing He
2010-Apr-22  09:41 UTC
[Xen-devel] [PATCH 07/17] vmx: nest: handling VMX instruction exits
Add a VMX instruction decoder and handle the simple VMX instructions,
that is, all except vmlaunch/vmresume and invept.
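The operand decoding can be sketched on its own like this. It is a
simplified model rather than the code in nest.c: struct toy_operand and
toy_decode() are hypothetical, the register file is passed in as a
plain array, and the segment base and displacement are parameters
instead of being read from the vcpu and the VMCS. The bit positions
match union vmx_inst_info in this patch.

```c
#include <assert.h>
#include <stdint.h>

struct toy_operand {
    int          is_reg;  /* 1: register operand, 0: memory operand */
    unsigned int reg1;    /* valid when is_reg */
    unsigned int reg2;
    uint64_t     addr;    /* valid when !is_reg */
    unsigned int len;     /* address size in bytes, valid when !is_reg */
};

/* info: instruction-information word; disp: exit qualification. */
static struct toy_operand toy_decode(uint32_t info, uint64_t disp,
                                     uint64_t seg_base,
                                     const uint64_t gpr[16])
{
    struct toy_operand op = { 0 };
    unsigned int scaling   = info & 0x3;          /* bits 0-1 */
    unsigned int addr_size = (info >> 7) & 0x7;   /* bits 7-9 */

    op.reg2 = (info >> 28) & 0xf;                 /* bits 28-31 */

    if ( (info >> 10) & 1 )                       /* bit 10: register form */
    {
        op.is_reg = 1;
        op.reg1 = (info >> 3) & 0xf;              /* bits 3-6 */
        return op;
    }

    /* Bits 22 and 27 mark the index and base registers as unused. */
    uint64_t index = ((info >> 22) & 1) ? 0 : gpr[(info >> 18) & 0xf];
    uint64_t base  = ((info >> 27) & 1) ? 0 : gpr[(info >> 23) & 0xf];

    assert(addr_size <= 2);                       /* 16-, 32- or 64-bit */
    op.addr = seg_base + base + index * (1u << scaling) + disp;
    op.len  = 1u << (addr_size + 1);
    return op;
}
```

With RBX = 0x1000 as base, RSI = 0x10 as index, a scale of 4 and a
displacement of 8, the decoded address is 0x1048.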
Signed-off-by: Qing He <qing.he@intel.com>
---
 b/xen/arch/x86/hvm/vmx/nest.c      |  502 +++++++++++++++++++++++++++++++++++++
 xen/arch/x86/hvm/vmx/Makefile      |    1 
 xen/arch/x86/hvm/vmx/vmx.c         |   43 ++-
 xen/include/asm-x86/hvm/vmx/nest.h |   10 
 4 files changed, 549 insertions(+), 7 deletions(-)
diff -r 9cb31076d2d0 -r 38a4757e94ef xen/arch/x86/hvm/vmx/Makefile
--- a/xen/arch/x86/hvm/vmx/Makefile	Thu Apr 22 22:30:09 2010 +0800
+++ b/xen/arch/x86/hvm/vmx/Makefile	Thu Apr 22 22:30:09 2010 +0800
@@ -5,3 +5,4 @@
 obj-y += vmx.o
 obj-y += vpmu.o
 obj-y += vpmu_core2.o
+obj-y += nest.o
diff -r 9cb31076d2d0 -r 38a4757e94ef xen/arch/x86/hvm/vmx/nest.c
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/xen/arch/x86/hvm/vmx/nest.c	Thu Apr 22 22:30:09 2010 +0800
@@ -0,0 +1,502 @@
+/*
+ * nest.c: nested virtualization for VMX.
+ *
+ * Copyright (c) 2010, Intel Corporation.
+ * Author: Qing He <qing.he@intel.com>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program; if not, write to the Free Software Foundation, Inc., 59 Temple
+ * Place - Suite 330, Boston, MA 02111-1307 USA.
+ *
+ */
+
+#include <xen/config.h>
+#include <asm/types.h>
+#include <asm/hvm/vmx/vmx.h>
+#include <asm/hvm/vmx/vvmcs.h>
+#include <asm/hvm/vmx/nest.h>
+
+/*
+ * VMX instructions support functions
+ */
+
+enum vmx_regs_enc {
+    VMX_REG_RAX,
+    VMX_REG_RCX,
+    VMX_REG_RDX,
+    VMX_REG_RBX,
+    VMX_REG_RSP,
+    VMX_REG_RBP,
+    VMX_REG_RSI,
+    VMX_REG_RDI,
+#ifdef CONFIG_X86_64
+    VMX_REG_R8,
+    VMX_REG_R9,
+    VMX_REG_R10,
+    VMX_REG_R11,
+    VMX_REG_R12,
+    VMX_REG_R13,
+    VMX_REG_R14,
+    VMX_REG_R15,
+#endif
+};
+
+enum vmx_sregs_enc {
+    VMX_SREG_ES,
+    VMX_SREG_CS,
+    VMX_SREG_SS,
+    VMX_SREG_DS,
+    VMX_SREG_FS,
+    VMX_SREG_GS,
+};
+
+enum x86_segment sreg_to_index[] = {
+    [VMX_SREG_ES] = x86_seg_es,
+    [VMX_SREG_CS] = x86_seg_cs,
+    [VMX_SREG_SS] = x86_seg_ss,
+    [VMX_SREG_DS] = x86_seg_ds,
+    [VMX_SREG_FS] = x86_seg_fs,
+    [VMX_SREG_GS] = x86_seg_gs,
+};
+
+union vmx_inst_info {
+    struct {
+        unsigned int scaling           :2; /* bit 0-1 */
+        unsigned int __rsvd0           :1; /* bit 2 */
+        unsigned int reg1              :4; /* bit 3-6 */
+        unsigned int addr_size         :3; /* bit 7-9 */
+        unsigned int memreg            :1; /* bit 10 */
+        unsigned int __rsvd1           :4; /* bit 11-14 */
+        unsigned int segment           :3; /* bit 15-17 */
+        unsigned int index_reg         :4; /* bit 18-21 */
+        unsigned int index_reg_invalid :1; /* bit 22 */
+        unsigned int base_reg          :4; /* bit 23-26 */
+        unsigned int base_reg_invalid  :1; /* bit 27 */
+        unsigned int reg2              :4; /* bit 28-31 */
+    } fields;
+    u32 word;
+};
+
+struct vmx_inst_decoded {
+#define VMX_INST_MEMREG_TYPE_MEMORY 0
+#define VMX_INST_MEMREG_TYPE_REG    1
+    int type;
+    union {
+        struct {
+            unsigned long mem;
+            unsigned int  len;
+        };
+        enum vmx_regs_enc reg1;
+    };
+
+    enum vmx_regs_enc reg2;
+};
+
+enum vmx_ops_result {
+    VMSUCCEED,
+    VMFAIL_VALID,
+    VMFAIL_INVALID,
+};
+
+#define CASE_SET_REG(REG, reg)      \
+    case VMX_REG_ ## REG: regs->reg = value; break
+#define CASE_GET_REG(REG, reg)      \
+    case VMX_REG_ ## REG: value = regs->reg; break
+
+#define CASE_EXTEND_SET_REG         \
+    CASE_EXTEND_REG(S)
+#define CASE_EXTEND_GET_REG         \
+    CASE_EXTEND_REG(G)
+
+#ifdef __i386__
+#define CASE_EXTEND_REG(T)
+#else
+#define CASE_EXTEND_REG(T)          \
+    CASE_ ## T ## ET_REG(R8, r8);   \
+    CASE_ ## T ## ET_REG(R9, r9);   \
+    CASE_ ## T ## ET_REG(R10, r10); \
+    CASE_ ## T ## ET_REG(R11, r11); \
+    CASE_ ## T ## ET_REG(R12, r12); \
+    CASE_ ## T ## ET_REG(R13, r13); \
+    CASE_ ## T ## ET_REG(R14, r14); \
+    CASE_ ## T ## ET_REG(R15, r15)
+#endif
+
+static unsigned long reg_read(struct cpu_user_regs *regs,
+                              enum vmx_regs_enc index)
+{
+    unsigned long value = 0;
+
+    switch ( index ) {
+    CASE_GET_REG(RAX, eax);
+    CASE_GET_REG(RCX, ecx);
+    CASE_GET_REG(RDX, edx);
+    CASE_GET_REG(RBX, ebx);
+    CASE_GET_REG(RBP, ebp);
+    CASE_GET_REG(RSI, esi);
+    CASE_GET_REG(RDI, edi);
+    CASE_GET_REG(RSP, esp);
+    CASE_EXTEND_GET_REG;
+    default:
+        break;
+    }
+
+    return value;
+}
+
+static void reg_write(struct cpu_user_regs *regs,
+                      enum vmx_regs_enc index,
+                      unsigned long value)
+{
+    switch ( index ) {
+    CASE_SET_REG(RAX, eax);
+    CASE_SET_REG(RCX, ecx);
+    CASE_SET_REG(RDX, edx);
+    CASE_SET_REG(RBX, ebx);
+    CASE_SET_REG(RBP, ebp);
+    CASE_SET_REG(RSI, esi);
+    CASE_SET_REG(RDI, edi);
+    CASE_SET_REG(RSP, esp);
+    CASE_EXTEND_SET_REG;
+    default:
+        break;
+    }
+}
+
+static void decode_vmx_inst(struct cpu_user_regs *regs,
+                            struct vmx_inst_decoded *decode)
+{
+    struct vcpu *v = current;
+    union vmx_inst_info info;
+    struct segment_register seg;
+    unsigned long base, index, seg_base, disp;
+    int scale;
+
+    info.word = __vmread(VMX_INSTRUCTION_INFO);
+
+    if ( info.fields.memreg ) {
+        decode->type = VMX_INST_MEMREG_TYPE_REG;
+        decode->reg1 = info.fields.reg1;
+    }
+    else
+    {
+        decode->type = VMX_INST_MEMREG_TYPE_MEMORY;
+        hvm_get_segment_register(v, sreg_to_index[info.fields.segment], &seg);
+        seg_base = seg.base;
+
+        base = info.fields.base_reg_invalid ? 0 :
+            reg_read(regs, info.fields.base_reg);
+
+        index = info.fields.index_reg_invalid ? 0 :
+            reg_read(regs, info.fields.index_reg);
+
+        scale = 1 << info.fields.scaling;
+
+        disp = __vmread(EXIT_QUALIFICATION);
+
+
+        decode->mem = seg_base + base + index * scale + disp;
+        decode->len = 1 << (info.fields.addr_size + 1);
+    }
+
+    decode->reg2 = info.fields.reg2;
+}
+
+static void vmreturn(struct cpu_user_regs *regs, enum vmx_ops_result res)
+{
+    unsigned long eflags = regs->eflags;
+    unsigned long mask = X86_EFLAGS_CF | X86_EFLAGS_PF | X86_EFLAGS_AF |
+                         X86_EFLAGS_ZF | X86_EFLAGS_SF | X86_EFLAGS_OF;
+
+    eflags &= ~mask;
+
+    switch ( res ) {
+    case VMSUCCEED:
+        break;
+    case VMFAIL_VALID:
+        /* TODO: error number of VMFailValid */
+        eflags |= X86_EFLAGS_ZF;
+        break;
+    case VMFAIL_INVALID:
+    default:
+        eflags |= X86_EFLAGS_CF;
+        break;
+    }
+
+    regs->eflags = eflags;
+}
+
+static void __clear_current_vvmcs(struct vmx_nest_struct *nest)
+{
+    if ( nest->svmcs )
+        __vmpclear(virt_to_maddr(nest->svmcs));
+
+    hvm_copy_to_guest_phys(nest->gvmcs_pa, nest->vvmcs, PAGE_SIZE);
+
+    nest->vmcs_invalid = 1;
+}
+
+/*
+ * VMX instructions handling
+ */
+
+int vmx_nest_handle_vmxon(struct cpu_user_regs *regs)
+{
+    struct vcpu *v = current;
+    struct vmx_nest_struct *nest = &v->arch.hvm_vmx.nest;
+    struct vmx_inst_decoded decode;
+    unsigned long gpa = 0;
+
+    if ( !v->domain->arch.hvm_domain.nesting_avail )
+        goto invalid_op;
+
+    decode_vmx_inst(regs, &decode);
+
+    ASSERT(decode.type == VMX_INST_MEMREG_TYPE_MEMORY);
+    hvm_copy_from_guest_virt(&gpa, decode.mem, decode.len, 0);
+
+    nest->guest_vmxon_pa = gpa;
+    nest->gvmcs_pa = 0;
+    nest->vmcs_invalid = 1;
+    nest->vvmcs = alloc_xenheap_page();
+    if ( !nest->vvmcs )
+    {
+        gdprintk(XENLOG_ERR, "nest: allocation for virtual vmcs failed\n");
+        vmreturn(regs, VMFAIL_INVALID);
+        goto out;
+    }
+    nest->svmcs = alloc_xenheap_page();
+    if ( !nest->svmcs )
+    {
+        gdprintk(XENLOG_ERR, "nest: allocation for shadow vmcs failed\n");
+        free_xenheap_page(nest->vvmcs);
+        vmreturn(regs, VMFAIL_INVALID);
+        goto out;
+    }
+
+    /*
+     * `fork' the host vmcs to shadow_vmcs
+     * vmcs_lock is not needed since we are on current
+     */
+    nest->hvmcs = v->arch.hvm_vmx.vmcs;
+    __vmpclear(virt_to_maddr(nest->hvmcs));
+    memcpy(nest->svmcs, nest->hvmcs, PAGE_SIZE);
+    __vmptrld(virt_to_maddr(nest->hvmcs));
+    v->arch.hvm_vmx.launched = 0;
+
+    vmreturn(regs, VMSUCCEED);
+
+out:
+    return X86EMUL_OKAY;
+
+invalid_op:
+    hvm_inject_exception(TRAP_invalid_op, 0, 0);
+    return X86EMUL_EXCEPTION;
+}
+
+int vmx_nest_handle_vmxoff(struct cpu_user_regs *regs)
+{
+    struct vcpu *v = current;
+    struct vmx_nest_struct *nest = &v->arch.hvm_vmx.nest;
+
+    if ( unlikely(!nest->guest_vmxon_pa) )
+        goto invalid_op;
+
+    nest->guest_vmxon_pa = 0;
+    __vmpclear(virt_to_maddr(nest->svmcs));
+
+    free_xenheap_page(nest->vvmcs);
+    free_xenheap_page(nest->svmcs);
+
+    vmreturn(regs, VMSUCCEED);
+    return X86EMUL_OKAY;
+
+invalid_op:
+    hvm_inject_exception(TRAP_invalid_op, 0, 0);
+    return X86EMUL_EXCEPTION;
+}
+
+int vmx_nest_handle_vmptrld(struct cpu_user_regs *regs)
+{
+    struct vcpu *v = current;
+    struct vmx_inst_decoded decode;
+    struct vmx_nest_struct *nest = &v->arch.hvm_vmx.nest;
+    unsigned long gpa = 0;
+
+    if ( unlikely(!nest->guest_vmxon_pa) )
+        goto invalid_op;
+
+    decode_vmx_inst(regs, &decode);
+
+    ASSERT(decode.type == VMX_INST_MEMREG_TYPE_MEMORY);
+    hvm_copy_from_guest_virt(&gpa, decode.mem, decode.len, 0);
+
+    if ( gpa == nest->guest_vmxon_pa || gpa & 0xfff )
+    {
+        vmreturn(regs, VMFAIL_INVALID);
+        goto out;
+    }
+
+    if ( nest->gvmcs_pa != gpa )
+    {
+        if ( !nest->vmcs_invalid )
+            __clear_current_vvmcs(nest);
+        nest->gvmcs_pa = gpa;
+        ASSERT(nest->vmcs_invalid == 1);
+    }
+
+
+    if ( nest->vmcs_invalid )
+    {
+        hvm_copy_from_guest_phys(nest->vvmcs, nest->gvmcs_pa, PAGE_SIZE);
+        nest->vmcs_invalid = 0;
+    }
+
+    vmreturn(regs, VMSUCCEED);
+
+out:
+    return X86EMUL_OKAY;
+
+invalid_op:
+    hvm_inject_exception(TRAP_invalid_op, 0, 0);
+    return X86EMUL_EXCEPTION;
+}
+
+int vmx_nest_handle_vmptrst(struct cpu_user_regs *regs)
+{
+    struct vcpu *v = current;
+    struct vmx_inst_decoded decode;
+    struct vmx_nest_struct *nest = &v->arch.hvm_vmx.nest;
+    unsigned long gpa = 0;
+
+    if ( unlikely(!nest->guest_vmxon_pa) )
+        goto invalid_op;
+
+    decode_vmx_inst(regs, &decode);
+
+    ASSERT(decode.type == VMX_INST_MEMREG_TYPE_MEMORY);
+
+    gpa = nest->gvmcs_pa;
+
+    hvm_copy_to_guest_virt(decode.mem, &gpa, decode.len, 0);
+
+    vmreturn(regs, VMSUCCEED);
+    return X86EMUL_OKAY;
+
+invalid_op:
+    hvm_inject_exception(TRAP_invalid_op, 0, 0);
+    return X86EMUL_EXCEPTION;
+}
+
+int vmx_nest_handle_vmclear(struct cpu_user_regs *regs)
+{
+    struct vcpu *v = current;
+    struct vmx_inst_decoded decode;
+    struct vmx_nest_struct *nest = &v->arch.hvm_vmx.nest;
+    unsigned long gpa = 0;
+
+    if ( unlikely(!nest->guest_vmxon_pa) )
+        goto invalid_op;
+
+    decode_vmx_inst(regs, &decode);
+
+    ASSERT(decode.type == VMX_INST_MEMREG_TYPE_MEMORY);
+    hvm_copy_from_guest_virt(&gpa, decode.mem, decode.len, 0);
+
+    if ( gpa & 0xfff )
+    {
+        vmreturn(regs, VMFAIL_VALID);
+        goto out;
+    }
+
+    if ( gpa != nest->gvmcs_pa )
+    {
+        gdprintk(XENLOG_ERR, "vmclear gpa not the same with current vmcs\n");
+        vmreturn(regs, VMSUCCEED);
+        goto out;
+    }
+
+    __clear_current_vvmcs(nest);
+
+    vmreturn(regs, VMSUCCEED);
+
+out:
+    return X86EMUL_OKAY;
+
+invalid_op:
+    hvm_inject_exception(TRAP_invalid_op, 0, 0);
+    return X86EMUL_EXCEPTION;
+}
+
+
+
+int vmx_nest_handle_vmread(struct cpu_user_regs *regs)
+{
+    struct vcpu *v = current;
+    struct vmx_inst_decoded decode;
+    struct vmx_nest_struct *nest = &v->arch.hvm_vmx.nest;
+    u64 value = 0;
+
+    if ( unlikely(!nest->guest_vmxon_pa) )
+        goto invalid_op;
+
+    decode_vmx_inst(regs, &decode);
+
+    value = __get_vvmcs(nest->vvmcs, reg_read(regs, decode.reg2));
+
+    switch ( decode.type ) {
+    case VMX_INST_MEMREG_TYPE_MEMORY:
+        hvm_copy_to_guest_virt(decode.mem, &value, decode.len, 0);
+        break;
+    case VMX_INST_MEMREG_TYPE_REG:
+        reg_write(regs, decode.reg1, value);
+        break;
+    }
+
+    vmreturn(regs, VMSUCCEED);
+    return X86EMUL_OKAY;
+
+invalid_op:
+    hvm_inject_exception(TRAP_invalid_op, 0, 0);
+    return X86EMUL_EXCEPTION;
+}
+
+int vmx_nest_handle_vmwrite(struct cpu_user_regs *regs)
+{
+    struct vcpu *v = current;
+    struct vmx_inst_decoded decode;
+    struct vmx_nest_struct *nest = &v->arch.hvm_vmx.nest;
+    u64 value = 0;
+
+    if ( unlikely(!nest->guest_vmxon_pa) )
+        goto invalid_op;
+
+    decode_vmx_inst(regs, &decode);
+
+    switch ( decode.type ) {
+    case VMX_INST_MEMREG_TYPE_MEMORY:
+        hvm_copy_from_guest_virt(&value, decode.mem, decode.len, 0);
+        break;
+    case VMX_INST_MEMREG_TYPE_REG:
+        value = reg_read(regs, decode.reg1);
+        break;
+    }
+
+    __set_vvmcs(nest->vvmcs, reg_read(regs, decode.reg2), value);
+
+    vmreturn(regs, VMSUCCEED);
+    return X86EMUL_OKAY;
+
+invalid_op:
+    hvm_inject_exception(TRAP_invalid_op, 0, 0);
+    return X86EMUL_EXCEPTION;
+}
diff -r 9cb31076d2d0 -r 38a4757e94ef xen/arch/x86/hvm/vmx/vmx.c
--- a/xen/arch/x86/hvm/vmx/vmx.c	Thu Apr 22 22:30:09 2010 +0800
+++ b/xen/arch/x86/hvm/vmx/vmx.c	Thu Apr 22 22:30:09 2010 +0800
@@ -2605,17 +2605,46 @@
             __update_guest_eip(inst_len);
         break;
 
+    case EXIT_REASON_VMCLEAR:
+        inst_len = __get_instruction_length();
+        if ( vmx_nest_handle_vmclear(regs) == X86EMUL_OKAY )
+            __update_guest_eip(inst_len);
+        break;
+    case EXIT_REASON_VMPTRLD:
+        inst_len = __get_instruction_length();
+        if ( vmx_nest_handle_vmptrld(regs) == X86EMUL_OKAY )
+            __update_guest_eip(inst_len);
+        break;
+    case EXIT_REASON_VMPTRST:
+        inst_len = __get_instruction_length();
+        if ( vmx_nest_handle_vmptrst(regs) == X86EMUL_OKAY )
+            __update_guest_eip(inst_len);
+        break;
+    case EXIT_REASON_VMREAD:
+        inst_len = __get_instruction_length();
+        if ( vmx_nest_handle_vmread(regs) == X86EMUL_OKAY )
+            __update_guest_eip(inst_len);
+        break;
+    case EXIT_REASON_VMWRITE:
+        inst_len = __get_instruction_length();
+        if ( vmx_nest_handle_vmwrite(regs) == X86EMUL_OKAY )
+            __update_guest_eip(inst_len);
+        break;
+    case EXIT_REASON_VMXOFF:
+        inst_len = __get_instruction_length();
+        if ( vmx_nest_handle_vmxoff(regs) == X86EMUL_OKAY )
+            __update_guest_eip(inst_len);
+        break;
+    case EXIT_REASON_VMXON:
+        inst_len = __get_instruction_length();
+        if ( vmx_nest_handle_vmxon(regs) == X86EMUL_OKAY )
+            __update_guest_eip(inst_len);
+        break;
+
     case EXIT_REASON_MWAIT_INSTRUCTION:
     case EXIT_REASON_MONITOR_INSTRUCTION:
-    case EXIT_REASON_VMCLEAR:
     case EXIT_REASON_VMLAUNCH:
-    case EXIT_REASON_VMPTRLD:
-    case EXIT_REASON_VMPTRST:
-    case EXIT_REASON_VMREAD:
     case EXIT_REASON_VMRESUME:
-    case EXIT_REASON_VMWRITE:
-    case EXIT_REASON_VMXOFF:
-    case EXIT_REASON_VMXON:
         vmx_inject_hw_exception(TRAP_invalid_op, HVM_DELIVER_NO_ERROR_CODE);
         break;
 
diff -r 9cb31076d2d0 -r 38a4757e94ef xen/include/asm-x86/hvm/vmx/nest.h
--- a/xen/include/asm-x86/hvm/vmx/nest.h	Thu Apr 22 22:30:09 2010 +0800
+++ b/xen/include/asm-x86/hvm/vmx/nest.h	Thu Apr 22 22:30:09 2010 +0800
@@ -42,4 +42,14 @@
     int                  vmcs_invalid;
 };
 
+int vmx_nest_handle_vmxon(struct cpu_user_regs *regs);
+int vmx_nest_handle_vmxoff(struct cpu_user_regs *regs);
+
+int vmx_nest_handle_vmptrld(struct cpu_user_regs *regs);
+int vmx_nest_handle_vmptrst(struct cpu_user_regs *regs);
+int vmx_nest_handle_vmclear(struct cpu_user_regs *regs);
+
+int vmx_nest_handle_vmread(struct cpu_user_regs *regs);
+int vmx_nest_handle_vmwrite(struct cpu_user_regs *regs);
+
 #endif /* __ASM_X86_HVM_NEST_H__ */
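
The operand decoding done by decode_vmx_inst() in the patch above can be
sketched with plain shifts and masks instead of a bitfield union. This is
only an illustration, not the patch's code: the names vmx_operand, field,
decode_info and effective_address are hypothetical, the bit positions are
taken from the comments in union vmx_inst_info, and the scaling field is
assumed to occupy bits 0-1.

```c
#include <stdint.h>

/* Decoded form of the VMX instruction-information word (hypothetical type). */
struct vmx_operand {
    unsigned int scale;       /* index multiplier: 1, 2, 4 or 8 */
    unsigned int addr_bytes;  /* length as the patch derives it from addr_size */
    unsigned int is_reg;      /* 1: register operand, 0: memory operand */
    unsigned int segment;     /* segment register encoding */
    unsigned int index_reg, base_reg, reg1, reg2;
    unsigned int index_valid, base_valid;
};

/* Extract `width` bits of `word`, starting at bit `lo`. */
static uint32_t field(uint32_t word, unsigned int lo, unsigned int width)
{
    return (word >> lo) & ((1u << width) - 1u);
}

static struct vmx_operand decode_info(uint32_t info)
{
    struct vmx_operand op;

    op.scale       = 1u << field(info, 0, 2);         /* bits 0-1 (assumed) */
    op.reg1        = field(info, 3, 4);               /* bits 3-6 */
    op.addr_bytes  = 1u << (field(info, 7, 3) + 1u);  /* bits 7-9 */
    op.is_reg      = field(info, 10, 1);              /* bit 10 */
    op.segment     = field(info, 15, 3);              /* bits 15-17 */
    op.index_reg   = field(info, 18, 4);              /* bits 18-21 */
    op.index_valid = !field(info, 22, 1);             /* bit 22: invalid flag */
    op.base_reg    = field(info, 23, 4);              /* bits 23-26 */
    op.base_valid  = !field(info, 27, 1);             /* bit 27: invalid flag */
    op.reg2        = field(info, 28, 4);              /* bits 28-31 */
    return op;
}

/* seg_base + base + index * scale + disp, skipping invalid registers. */
static uint64_t effective_address(const struct vmx_operand *op,
                                  uint64_t seg_base, uint64_t base,
                                  uint64_t index, uint64_t disp)
{
    uint64_t addr = seg_base + disp;

    if ( op->base_valid )
        addr += base;
    if ( op->index_valid )
        addr += index * op->scale;
    return addr;
}
```

In the patch itself the displacement comes from EXIT_QUALIFICATION and the
register values from reg_read(); here they are passed in as plain arguments
so the arithmetic can be checked in isolation.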
Qing He
2010-Apr-22  09:41 UTC
[Xen-devel] [PATCH 08/17] vmx: nest: L1 <-> L2 context switch
This patch adds the mode switch between L1 and L2. The handling of
many controls and states may need additional scrutiny.
Roughly, at virtual VMEntry time, the sVMCS is loaded, the L2 controls
are combined from the controls of L0 and the vVMCS, and the L2 state is
taken from the vVMCS guest state.
At virtual VMExit time, the host VMCS is loaded, the L1 controls come
from L0, and the L1 state is taken from the vVMCS host state.
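As a rough illustration of how the L2 controls are combined at virtual
VMEntry, the sketch below merges two control words the way
set_shadow_control() and load_l2_control() in this patch do. The function
names merge_exec_control and merge_exit_control are hypothetical; the
guest-owned bit mask is the EXIT_CONTROL_GUEST_BITS value from the patch.

```c
#include <stdint.h>

/* VM-exit control bits that L1 is allowed to choose (value from the patch). */
#define EXIT_CONTROL_GUEST_BITS ((1u << 2) | (1u << 18) | (1u << 20) | (1u << 22))

/* Execution controls: L2 runs with the union of what L0 and L1 ask for. */
static uint32_t merge_exec_control(uint32_t l1_ctl, uint32_t l0_ctl)
{
    return l1_ctl | l0_ctl;
}

/*
 * Exit controls: L0 owns every bit except the guest-owned ones, which are
 * taken from the value L1 wrote into the virtual VMCS.
 */
static uint32_t merge_exit_control(uint32_t l1_exit, uint32_t l0_exit)
{
    return (l1_exit & EXIT_CONTROL_GUEST_BITS) |
           (l0_exit & ~EXIT_CONTROL_GUEST_BITS);
}
```

Taking the union for execution controls means an exit is taken whenever
either L0 or L1 wants one; deciding which level then handles the exit is
left to the VMExit handler.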
Signed-off-by: Qing He <qing.he@intel.com>
---
 arch/x86/hvm/vmx/entry.S       |    1 
 arch/x86/hvm/vmx/nest.c        |  410 +++++++++++++++++++++++++++++++++++++++++
 arch/x86/hvm/vmx/vmcs.c        |   35 +++
 arch/x86/hvm/vmx/vmx.c         |   32 ++-
 include/asm-x86/hvm/vmx/nest.h |   14 +
 include/asm-x86/hvm/vmx/vmcs.h |    4 
 6 files changed, 490 insertions(+), 6 deletions(-)
diff -r 38a4757e94ef -r c7e763bbea63 xen/arch/x86/hvm/vmx/entry.S
--- a/xen/arch/x86/hvm/vmx/entry.S	Thu Apr 22 22:30:09 2010 +0800
+++ b/xen/arch/x86/hvm/vmx/entry.S	Thu Apr 22 22:30:09 2010 +0800
@@ -123,6 +123,7 @@
 .globl vmx_asm_do_vmentry
 vmx_asm_do_vmentry:
         call vmx_intr_assist
+        call vmx_nest_switch_mode
 
         get_current(bx)
         cli
diff -r 38a4757e94ef -r c7e763bbea63 xen/arch/x86/hvm/vmx/nest.c
--- a/xen/arch/x86/hvm/vmx/nest.c	Thu Apr 22 22:30:09 2010 +0800
+++ b/xen/arch/x86/hvm/vmx/nest.c	Thu Apr 22 22:30:09 2010 +0800
@@ -21,6 +21,8 @@
 
 #include <xen/config.h>
 #include <asm/types.h>
+#include <asm/paging.h>
+#include <asm/hvm/support.h>
 #include <asm/hvm/vmx/vmx.h>
 #include <asm/hvm/vmx/vvmcs.h>
 #include <asm/hvm/vmx/nest.h>
@@ -500,3 +502,411 @@
     hvm_inject_exception(TRAP_invalid_op, 0, 0);
     return X86EMUL_EXCEPTION;
 }
+
+int vmx_nest_handle_vmresume(struct cpu_user_regs *regs)
+{
+    struct vcpu *v = current;
+    struct vmx_nest_struct *nest = &v->arch.hvm_vmx.nest;
+
+    if ( unlikely(!nest->guest_vmxon_pa) )
+        goto invalid_op;
+
+    if ( nest->vmcs_invalid == 0 )
+        nest->vmresume_pending = 1;
+    else
+        vmreturn(regs, VMFAIL_INVALID);
+
+    return X86EMUL_OKAY;
+
+invalid_op:
+    hvm_inject_exception(TRAP_invalid_op, 0, 0);
+    return X86EMUL_EXCEPTION;
+}
+
+int vmx_nest_handle_vmlaunch(struct cpu_user_regs *regs)
+{
+    /* reuse vmresume for now */
+    return vmx_nest_handle_vmresume(regs);
+}
+
+/*
+ * Nested VMX context switch
+ */
+
+static unsigned long vmcs_gstate_field[] = {
+    /* 16 BITS */
+    GUEST_ES_SELECTOR,
+    GUEST_CS_SELECTOR,
+    GUEST_SS_SELECTOR,
+    GUEST_DS_SELECTOR,
+    GUEST_FS_SELECTOR,
+    GUEST_GS_SELECTOR,
+    GUEST_LDTR_SELECTOR,
+    GUEST_TR_SELECTOR,
+    /* 64 BITS */
+    VMCS_LINK_POINTER,
+    GUEST_IA32_DEBUGCTL,
+#ifndef CONFIG_X86_64
+    VMCS_LINK_POINTER_HIGH,
+    GUEST_IA32_DEBUGCTL_HIGH,
+#endif
+    /* 32 BITS */
+    GUEST_ES_LIMIT,
+    GUEST_CS_LIMIT,
+    GUEST_SS_LIMIT,
+    GUEST_DS_LIMIT,
+    GUEST_FS_LIMIT,
+    GUEST_GS_LIMIT,
+    GUEST_LDTR_LIMIT,
+    GUEST_TR_LIMIT,
+    GUEST_GDTR_LIMIT,
+    GUEST_IDTR_LIMIT,
+    GUEST_ES_AR_BYTES,
+    GUEST_CS_AR_BYTES,
+    GUEST_SS_AR_BYTES,
+    GUEST_DS_AR_BYTES,
+    GUEST_FS_AR_BYTES,
+    GUEST_GS_AR_BYTES,
+    GUEST_LDTR_AR_BYTES,
+    GUEST_TR_AR_BYTES,
+    GUEST_INTERRUPTIBILITY_INFO,
+    GUEST_ACTIVITY_STATE,
+    GUEST_SYSENTER_CS,
+    /* natural */
+    GUEST_ES_BASE,
+    GUEST_CS_BASE,
+    GUEST_SS_BASE,
+    GUEST_DS_BASE,
+    GUEST_FS_BASE,
+    GUEST_GS_BASE,
+    GUEST_LDTR_BASE,
+    GUEST_TR_BASE,
+    GUEST_GDTR_BASE,
+    GUEST_IDTR_BASE,
+    GUEST_DR7,
+    GUEST_RSP,
+    GUEST_RIP,
+    GUEST_RFLAGS,
+    GUEST_PENDING_DBG_EXCEPTIONS,
+    GUEST_SYSENTER_ESP,
+    GUEST_SYSENTER_EIP,
+};
+
+static unsigned long vmcs_ro_field[] = {
+    GUEST_PHYSICAL_ADDRESS,
+    VM_INSTRUCTION_ERROR,
+    VM_EXIT_REASON,
+    VM_EXIT_INTR_INFO,
+    VM_EXIT_INTR_ERROR_CODE,
+    IDT_VECTORING_INFO,
+    IDT_VECTORING_ERROR_CODE,
+    VM_EXIT_INSTRUCTION_LEN,
+    VMX_INSTRUCTION_INFO,
+    EXIT_QUALIFICATION,
+    GUEST_LINEAR_ADDRESS
+};
+
+static struct vmcs_host_to_guest {
+    unsigned long host_field;
+    unsigned long guest_field;
+} vmcs_h2g_field[] = {
+    {HOST_ES_SELECTOR, GUEST_ES_SELECTOR},
+    {HOST_CS_SELECTOR, GUEST_CS_SELECTOR},
+    {HOST_SS_SELECTOR, GUEST_SS_SELECTOR},
+    {HOST_DS_SELECTOR, GUEST_DS_SELECTOR},
+    {HOST_FS_SELECTOR, GUEST_FS_SELECTOR},
+    {HOST_GS_SELECTOR, GUEST_GS_SELECTOR},
+    {HOST_TR_SELECTOR, GUEST_TR_SELECTOR},
+    {HOST_SYSENTER_CS, GUEST_SYSENTER_CS},
+    {HOST_FS_BASE, GUEST_FS_BASE},
+    {HOST_GS_BASE, GUEST_GS_BASE},
+    {HOST_TR_BASE, GUEST_TR_BASE},
+    {HOST_GDTR_BASE, GUEST_GDTR_BASE},
+    {HOST_IDTR_BASE, GUEST_IDTR_BASE},
+    {HOST_SYSENTER_ESP, GUEST_SYSENTER_ESP},
+    {HOST_SYSENTER_EIP, GUEST_SYSENTER_EIP},
+};
+
+
+static void set_shadow_control(struct vmx_nest_struct *nest,
+                               unsigned int field,
+                               u32 host_value)
+{
+    u32 value;
+
+    value = (u32) __get_vvmcs(nest->vvmcs, field) | host_value;
+    __vmwrite(field, value);
+}
+
+void vmx_nest_update_exec_control(struct vcpu *v, unsigned long value)
+{
+    struct vmx_nest_struct *nest = &v->arch.hvm_vmx.nest;
+
+    set_shadow_control(nest, CPU_BASED_VM_EXEC_CONTROL, value);
+}
+
+void vmx_nest_update_secondary_exec_control(struct vcpu *v,
+                                            unsigned long value)
+{
+    struct vmx_nest_struct *nest = &v->arch.hvm_vmx.nest;
+
+    set_shadow_control(nest, SECONDARY_VM_EXEC_CONTROL, value);
+}
+
+void vmx_nest_update_exception_bitmap(struct vcpu *v, unsigned long value)
+{
+    struct vmx_nest_struct *nest = &v->arch.hvm_vmx.nest;
+
+    set_shadow_control(nest, EXCEPTION_BITMAP, value);
+}
+
+static void vvmcs_to_shadow(void *vvmcs, unsigned int field)
+{
+    u64 value;
+
+    value = __get_vvmcs(vvmcs, field);
+    __vmwrite(field, value);
+}
+
+static void vvmcs_from_shadow(void *vvmcs, unsigned int field)
+{
+    u64 value;
+    int rc;
+
+    value = __vmread_safe(field, &rc);
+    if ( !rc )
+        __set_vvmcs(vvmcs, field, value);
+}
+
+static void load_l2_control(struct vmx_nest_struct *nest)
+{
+    u32 exit_control;
+    struct vcpu *v = current;
+
+    /* PIN_BASED, CPU_BASED controls: the union of L0 & L1 */
+    set_shadow_control(nest, PIN_BASED_VM_EXEC_CONTROL,
+                       vmx_pin_based_exec_control);
+    vmx_update_cpu_exec_control(v);
+
+    /* VM_EXIT_CONTROLS: owned by L0 except bits below */
+#define EXIT_CONTROL_GUEST_BITS    ((1<<2) | (1<<18) | (1<<20) | (1<<22))
+    exit_control = __get_vvmcs(nest->vvmcs, VM_EXIT_CONTROLS) &
+                   EXIT_CONTROL_GUEST_BITS;
+    exit_control |= (vmx_vmexit_control & ~EXIT_CONTROL_GUEST_BITS);
+    __vmwrite(VM_EXIT_CONTROLS, exit_control);
+
+    /* VM_ENTRY_CONTROLS: owned by L1 */
+    vvmcs_to_shadow(nest->vvmcs, VM_ENTRY_CONTROLS);
+
+    vmx_update_exception_bitmap(v);
+}
+
+static void load_vvmcs_guest_state(struct vmx_nest_struct *nest)
+{
+    struct vcpu *v = current;
+    int i;
+
+    /* vvmcs.gstate to svmcs.gstate */
+    for ( i = 0; i < ARRAY_SIZE(vmcs_gstate_field); i++ )
+        vvmcs_to_shadow(nest->vvmcs, vmcs_gstate_field[i]);
+
+    hvm_set_cr0(__get_vvmcs(nest->vvmcs, GUEST_CR0));
+    hvm_set_cr4(__get_vvmcs(nest->vvmcs, GUEST_CR4));
+    hvm_set_cr3(__get_vvmcs(nest->vvmcs, GUEST_CR3));
+
+    vvmcs_to_shadow(nest->vvmcs, VM_ENTRY_INTR_INFO);
+    vvmcs_to_shadow(nest->vvmcs, VM_ENTRY_EXCEPTION_ERROR_CODE);
+    vvmcs_to_shadow(nest->vvmcs, VM_ENTRY_INSTRUCTION_LEN);
+
+    /* XXX: should refer to GUEST_HOST_MASK of both L0 and L1 */
+    vvmcs_to_shadow(nest->vvmcs, CR0_READ_SHADOW);
+    vvmcs_to_shadow(nest->vvmcs, CR4_READ_SHADOW);
+    vvmcs_to_shadow(nest->vvmcs, CR0_GUEST_HOST_MASK);
+    vvmcs_to_shadow(nest->vvmcs, CR4_GUEST_HOST_MASK);
+
+    /* TODO: PDPTRs for nested ept */
+    /* TODO: CR3 target control */
+}
+
+static void virtual_vmentry(struct cpu_user_regs *regs)
+{
+    struct vcpu *v = current;
+    struct vmx_nest_struct *nest = &v->arch.hvm_vmx.nest;
+#ifdef __x86_64__
+    unsigned long lm_l1, lm_l2;
+#endif
+
+    vmx_vmcs_switch_current(v, v->arch.hvm_vmx.vmcs, nest->svmcs);
+
+    v->arch.hvm_vcpu.in_nesting = 1;
+    nest->vmresume_pending = 0;
+    nest->vmresume_in_progress = 1;
+
+#ifdef __x86_64__
+    /*
+     * EFER handling:
+     * hvm_set_efer won't work if CR0.PG = 1, so we change the value
+     * directly to make hvm_long_mode_enabled(v) work in L2.
+     * An additional update_paging_modes is also needed if
+     * there is a 32/64 switch. v->arch.hvm_vcpu.guest_efer doesn't
+     * need to be saved, since its value on vmexit is determined by
+     * L1 exit_controls
+     */
+    lm_l1 = !!hvm_long_mode_enabled(v);
+    lm_l2 = !!(__get_vvmcs(nest->vvmcs, VM_ENTRY_CONTROLS) &
+                           VM_ENTRY_IA32E_MODE);
+
+    if ( lm_l2 )
+        v->arch.hvm_vcpu.guest_efer |= EFER_LMA | EFER_LME;
+    else
+        v->arch.hvm_vcpu.guest_efer &= ~(EFER_LMA | EFER_LME);
+#endif
+
+    load_l2_control(nest);
+    load_vvmcs_guest_state(nest);
+
+#ifdef __x86_64__
+    if ( lm_l1 != lm_l2 )
+    {
+        paging_update_paging_modes(v);
+    }
+#endif
+
+    regs->rip = __get_vvmcs(nest->vvmcs, GUEST_RIP);
+    regs->rsp = __get_vvmcs(nest->vvmcs, GUEST_RSP);
+    regs->rflags = __get_vvmcs(nest->vvmcs, GUEST_RFLAGS);
+
+    /* TODO: EPT_POINTER */
+}
+
+static void sync_vvmcs_guest_state(struct vmx_nest_struct *nest)
+{
+    int i;
+    unsigned long mask;
+    unsigned long cr;
+
+    /* copy svmcs.gstate back to vvmcs.gstate */
+    for ( i = 0; i < ARRAY_SIZE(vmcs_gstate_field); i++ )
+        vvmcs_from_shadow(nest->vvmcs, vmcs_gstate_field[i]);
+
+    /* SDM 20.6.6: L2 guest execution may change GUEST CR0/CR4 */
+    mask = __get_vvmcs(nest->vvmcs, CR0_GUEST_HOST_MASK);
+    if ( ~mask )
+    {
+        cr = __get_vvmcs(nest->vvmcs, GUEST_CR0);
+        cr = (cr & mask) | (__vmread(GUEST_CR0) & ~mask);
+        __set_vvmcs(nest->vvmcs, GUEST_CR0, cr);
+    }
+
+    mask = __get_vvmcs(nest->vvmcs, CR4_GUEST_HOST_MASK);
+    if ( ~mask )
+    {
+        cr = __get_vvmcs(nest->vvmcs, GUEST_CR4);
+        cr = (cr & mask) | (__vmread(GUEST_CR4) & ~mask);
+        __set_vvmcs(nest->vvmcs, GUEST_CR4, cr);
+    }
+
+    /* CR3 sync if exec doesn't want cr3 load exiting: i.e. nested EPT */
+    if ( !(__get_vvmcs(nest->vvmcs, CPU_BASED_VM_EXEC_CONTROL) &
+           CPU_BASED_CR3_LOAD_EXITING) )
+        vvmcs_from_shadow(nest->vvmcs, GUEST_CR3);
+}
+
+static void sync_vvmcs_ro(struct vmx_nest_struct *nest)
+{
+    int i;
+
+    for ( i = 0; i < ARRAY_SIZE(vmcs_ro_field); i++ )
+        vvmcs_from_shadow(nest->vvmcs, vmcs_ro_field[i]);
+}
+
+static void load_vvmcs_host_state(struct vmx_nest_struct *nest)
+{
+    struct vcpu *v = current;
+    int i;
+    u64 r;
+
+    for ( i = 0; i < ARRAY_SIZE(vmcs_h2g_field); i++ )
+    {
+        r = __get_vvmcs(nest->vvmcs, vmcs_h2g_field[i].host_field);
+        __vmwrite(vmcs_h2g_field[i].guest_field, r);
+    }
+
+    hvm_set_cr0(__get_vvmcs(nest->vvmcs, HOST_CR0));
+    hvm_set_cr4(__get_vvmcs(nest->vvmcs, HOST_CR4));
+    hvm_set_cr3(__get_vvmcs(nest->vvmcs, HOST_CR3));
+
+    __set_vvmcs(nest->vvmcs, VM_ENTRY_INTR_INFO, 0);
+}
+
+static void virtual_vmexit(struct cpu_user_regs *regs)
+{
+    struct vcpu *v = current;
+    struct vmx_nest_struct *nest = &v->arch.hvm_vmx.nest;
+#ifdef __x86_64__
+    unsigned long lm_l1, lm_l2;
+#endif
+
+    sync_vvmcs_ro(nest);
+    sync_vvmcs_guest_state(nest);
+
+    vmx_vmcs_switch_current(v, v->arch.hvm_vmx.vmcs, nest->hvmcs);
+
+    v->arch.hvm_vcpu.in_nesting = 0;
+    nest->vmexit_pending = 0;
+
+#ifdef __x86_64__
+    lm_l2 = !!hvm_long_mode_enabled(v);
+    lm_l1 = !!(__get_vvmcs(nest->vvmcs, VM_EXIT_CONTROLS) &
+                           VM_EXIT_IA32E_MODE);
+
+    if ( lm_l1 )
+        v->arch.hvm_vcpu.guest_efer |= EFER_LMA | EFER_LME;
+    else
+        v->arch.hvm_vcpu.guest_efer &= ~(EFER_LMA | EFER_LME);
+#endif
+
+    /* TODO: can be removed? */
+    vmx_update_cpu_exec_control(v);
+    vmx_update_exception_bitmap(v);
+
+    load_vvmcs_host_state(nest);
+
+#ifdef __x86_64__
+    if ( lm_l1 != lm_l2 )
+        paging_update_paging_modes(v);
+#endif
+
+    regs->rip = __get_vvmcs(nest->vvmcs, HOST_RIP);
+    regs->rsp = __get_vvmcs(nest->vvmcs, HOST_RSP);
+    regs->rflags = __vmread(GUEST_RFLAGS);
+
+    vmreturn(regs, VMSUCCEED);
+}
+
+asmlinkage void vmx_nest_switch_mode(void)
+{
+    struct vcpu *v = current;
+    struct vmx_nest_struct *nest = &v->arch.hvm_vmx.nest;
+    struct cpu_user_regs *regs = guest_cpu_user_regs();
+
+    /*
+     * A softirq may interrupt us between the handling of a virtual
+     * vmentry and the true vmentry. If, during this window, an L1
+     * virtual interrupt would cause another virtual vmexit, we
+     * cannot let that happen or VM_ENTRY_INTR_INFO will be lost.
+     */
+    if ( unlikely(nest->vmresume_in_progress) )
+        return;
+
+    if ( v->arch.hvm_vcpu.in_nesting && nest->vmexit_pending )
+    {
+        local_irq_enable();
+        virtual_vmexit(regs);
+    }
+    else if ( !v->arch.hvm_vcpu.in_nesting && nest->vmresume_pending )
+    {
+        local_irq_enable();
+        virtual_vmentry(regs);
+    }
+}
diff -r 38a4757e94ef -r c7e763bbea63 xen/arch/x86/hvm/vmx/vmcs.c
--- a/xen/arch/x86/hvm/vmx/vmcs.c	Thu Apr 22 22:30:09 2010 +0800
+++ b/xen/arch/x86/hvm/vmx/vmcs.c	Thu Apr 22 22:30:09 2010 +0800
@@ -542,6 +542,35 @@
               (unsigned long)&get_cpu_info()->guest_cpu_user_regs.error_code);
 }
 
+void vmx_vmcs_switch_current(struct vcpu *v,
+                             struct vmcs_struct *from,
+                             struct vmcs_struct *to)
+{
+    /* no foreign access */
+    if ( unlikely(v != current) )
+        return;
+
+    if ( unlikely(current->arch.hvm_vmx.vmcs != from) )
+        return;
+
+    spin_lock(&v->arch.hvm_vmx.vmcs_lock);
+
+    __vmpclear(virt_to_maddr(from));
+    __vmptrld(virt_to_maddr(to));
+
+    v->arch.hvm_vmx.vmcs = to;
+    v->arch.hvm_vmx.launched = 0;
+    this_cpu(current_vmcs) = to;
+
+    if ( v->arch.hvm_vmx.vmcs_host_updated )
+    {
+        v->arch.hvm_vmx.vmcs_host_updated = 0;
+        vmx_set_host_env(v);
+    }
+
+    spin_unlock(&v->arch.hvm_vmx.vmcs_lock);
+}
+
 void vmx_disable_intercept_for_msr(struct vcpu *v, u32 msr)
 {
     unsigned long *msr_bitmap = v->arch.hvm_vmx.msr_bitmap;
@@ -976,6 +1005,12 @@
         hvm_migrate_pirqs(v);
         vmx_set_host_env(v);
         hvm_asid_flush_vcpu(v);
+
+        /*
+         * nesting: we need to do additional host env sync if we have other
+         * VMCS's. Currently this works with only one active sVMCS.
+         */
+        v->arch.hvm_vmx.vmcs_host_updated = 1;
     }
 
     debug_state = v->domain->debugger_attached;
diff -r 38a4757e94ef -r c7e763bbea63 xen/arch/x86/hvm/vmx/vmx.c
--- a/xen/arch/x86/hvm/vmx/vmx.c	Thu Apr 22 22:30:09 2010 +0800
+++ b/xen/arch/x86/hvm/vmx/vmx.c	Thu Apr 22 22:30:09 2010 +0800
@@ -392,18 +392,28 @@
 
 void vmx_update_cpu_exec_control(struct vcpu *v)
 {
-    __vmwrite(CPU_BASED_VM_EXEC_CONTROL, v->arch.hvm_vmx.exec_control);
+    if ( v->arch.hvm_vcpu.in_nesting )
+        vmx_nest_update_exec_control(v, v->arch.hvm_vmx.exec_control);
+    else
+        __vmwrite(CPU_BASED_VM_EXEC_CONTROL, v->arch.hvm_vmx.exec_control);
 }
 
 void vmx_update_secondary_exec_control(struct vcpu *v)
 {
-    __vmwrite(SECONDARY_VM_EXEC_CONTROL,
-              v->arch.hvm_vmx.secondary_exec_control);
+    if ( v->arch.hvm_vcpu.in_nesting )
+        vmx_nest_update_secondary_exec_control(v,
+            v->arch.hvm_vmx.secondary_exec_control);
+    else
+        __vmwrite(SECONDARY_VM_EXEC_CONTROL,
+                  v->arch.hvm_vmx.secondary_exec_control);
 }
 
 void vmx_update_exception_bitmap(struct vcpu *v)
 {
-    __vmwrite(EXCEPTION_BITMAP, v->arch.hvm_vmx.exception_bitmap);
+    if ( v->arch.hvm_vcpu.in_nesting )
+        vmx_nest_update_exception_bitmap(v, v->arch.hvm_vmx.exception_bitmap);
+    else
+        __vmwrite(EXCEPTION_BITMAP, v->arch.hvm_vmx.exception_bitmap);
 }
 
 static int vmx_guest_x86_mode(struct vcpu *v)
@@ -2348,6 +2358,8 @@
     /* Now enable interrupts so it's safe to take locks. */
     local_irq_enable();
 
+    v->arch.hvm_vmx.nest.vmresume_in_progress = 0;
+
     if ( unlikely(exit_reason & VMX_EXIT_REASONS_FAILED_VMENTRY) )
         return vmx_failed_vmentry(exit_reason, regs);
 
@@ -2610,6 +2622,11 @@
         if ( vmx_nest_handle_vmclear(regs) == X86EMUL_OKAY )
             __update_guest_eip(inst_len);
         break;
+    case EXIT_REASON_VMLAUNCH:
+        inst_len = __get_instruction_length();
+        if ( vmx_nest_handle_vmlaunch(regs) == X86EMUL_OKAY )
+            __update_guest_eip(inst_len);
+        break;
     case EXIT_REASON_VMPTRLD:
         inst_len = __get_instruction_length();
         if ( vmx_nest_handle_vmptrld(regs) == X86EMUL_OKAY )
@@ -2630,6 +2647,11 @@
         if ( vmx_nest_handle_vmwrite(regs) == X86EMUL_OKAY )
             __update_guest_eip(inst_len);
         break;
+    case EXIT_REASON_VMRESUME:
+        inst_len = __get_instruction_length();
+        if ( vmx_nest_handle_vmresume(regs) == X86EMUL_OKAY )
+            __update_guest_eip(inst_len);
+        break;
     case EXIT_REASON_VMXOFF:
         inst_len = __get_instruction_length();
         if ( vmx_nest_handle_vmxoff(regs) == X86EMUL_OKAY )
@@ -2643,8 +2665,6 @@
 
     case EXIT_REASON_MWAIT_INSTRUCTION:
     case EXIT_REASON_MONITOR_INSTRUCTION:
-    case EXIT_REASON_VMLAUNCH:
-    case EXIT_REASON_VMRESUME:
         vmx_inject_hw_exception(TRAP_invalid_op, HVM_DELIVER_NO_ERROR_CODE);
         break;
 
diff -r 38a4757e94ef -r c7e763bbea63 xen/include/asm-x86/hvm/vmx/nest.h
--- a/xen/include/asm-x86/hvm/vmx/nest.h	Thu Apr 22 22:30:09 2010 +0800
+++ b/xen/include/asm-x86/hvm/vmx/nest.h	Thu Apr 22 22:30:09 2010 +0800
@@ -40,8 +40,14 @@
     void                *vvmcs;
     struct vmcs_struct  *svmcs;
     int                  vmcs_invalid;
+
+    int                  vmexit_pending;
+    int                  vmresume_pending;
+    int                  vmresume_in_progress;
 };
 
+asmlinkage void vmx_nest_switch_mode(void);
+
 int vmx_nest_handle_vmxon(struct cpu_user_regs *regs);
 int vmx_nest_handle_vmxoff(struct cpu_user_regs *regs);
 
@@ -52,4 +58,12 @@
 int vmx_nest_handle_vmread(struct cpu_user_regs *regs);
 int vmx_nest_handle_vmwrite(struct cpu_user_regs *regs);
 
+int vmx_nest_handle_vmresume(struct cpu_user_regs *regs);
+int vmx_nest_handle_vmlaunch(struct cpu_user_regs *regs);
+
+void vmx_nest_update_exec_control(struct vcpu *v, unsigned long value);
+void vmx_nest_update_secondary_exec_control(struct vcpu *v,
+                                            unsigned long value);
+void vmx_nest_update_exception_bitmap(struct vcpu *v, unsigned long value);
+
 #endif /* __ASM_X86_HVM_NEST_H__ */
diff -r 38a4757e94ef -r c7e763bbea63 xen/include/asm-x86/hvm/vmx/vmcs.h
--- a/xen/include/asm-x86/hvm/vmx/vmcs.h	Thu Apr 22 22:30:09 2010 +0800
+++ b/xen/include/asm-x86/hvm/vmx/vmcs.h	Thu Apr 22 22:30:09 2010 +0800
@@ -98,6 +98,7 @@
 
     /* nested virtualization */
     struct vmx_nest_struct nest;
+    int                  vmcs_host_updated;
 
 #ifdef __x86_64__
     struct vmx_msr_state msr_state;
@@ -377,6 +378,9 @@
 int vmx_write_guest_msr(u32 msr, u64 val);
 int vmx_add_guest_msr(u32 msr);
 int vmx_add_host_load_msr(u32 msr);
+void vmx_vmcs_switch_current(struct vcpu *v,
+                             struct vmcs_struct *from,
+                             struct vmcs_struct *to);
 
 #endif /* ASM_X86_HVM_VMX_VMCS_H__ */
 
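
The CR0/CR4 write-back performed in sync_vvmcs_guest_state() follows one
rule: bits covered by L1's guest/host mask keep the value stored in the
virtual VMCS, while all other bits are taken from the hardware guest
register, since L2 may have changed them without an exit. A minimal sketch
of that rule, with the hypothetical helper name sync_guest_cr and plain
integers standing in for the VMCS accesses:

```c
#include <stdint.h>

/*
 * vvmcs_cr:        GUEST_CRx value currently stored in the virtual VMCS
 * hw_cr:           GUEST_CRx value read back from the shadow VMCS
 * guest_host_mask: L1's CRx_GUEST_HOST_MASK (set bits are owned by L1)
 */
static uint64_t sync_guest_cr(uint64_t vvmcs_cr, uint64_t hw_cr,
                              uint64_t guest_host_mask)
{
    return (vvmcs_cr & guest_host_mask) | (hw_cr & ~guest_host_mask);
}
```

The same helper applies to both registers, provided the hardware value and
the mask passed in belong to the same control register.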
This patch adds interrupt handling for nested virtualization. It mainly
includes:
  - an L1 interrupt causing L2 to exit,
  - IDT vectoring handling in L2,
  - interrupt blocking handling in L2.
Signed-off-by: Qing He <qing.he@intel.com>
---
 arch/x86/hvm/vmx/intr.c        |   91 +++++++++++++++++++++++++++++++++++++++++
 arch/x86/hvm/vmx/nest.c        |   66 +++++++++++++++++++++++++++++
 arch/x86/hvm/vmx/vmx.c         |   78 +++++++++++++++++++++++------------
 include/asm-x86/hvm/vmx/nest.h |    5 ++
 4 files changed, 214 insertions(+), 26 deletions(-)
diff -r c7e763bbea63 -r a7de30ed250d xen/arch/x86/hvm/vmx/intr.c
--- a/xen/arch/x86/hvm/vmx/intr.c	Thu Apr 22 22:30:09 2010 +0800
+++ b/xen/arch/x86/hvm/vmx/intr.c	Thu Apr 22 22:30:09 2010 +0800
@@ -33,6 +33,7 @@
 #include <asm/hvm/support.h>
 #include <asm/hvm/vmx/vmx.h>
 #include <asm/hvm/vmx/vmcs.h>
+#include <asm/hvm/vmx/vvmcs.h>
 #include <asm/hvm/vpic.h>
 #include <asm/hvm/vlapic.h>
 #include <public/hvm/ioreq.h>
@@ -110,6 +111,93 @@
     }
 }
 
+/*
+ * Nested virtualization interrupt handling:
+ *
+ *   When the vcpu runs in nested context (L2), event delivery from
+ *   L0 to L1 may be blocked for several reasons:
+ *       - virtual VMExit
+ *       - virtual VMEntry
+ *       - IDT vectoring reinjection
+ *
+ *   However, while nested, such an interrupt should not be held
+ *   back by the usual conditions like RFLAGS.IF (it generates a
+ *   VMExit instead), so relying on the intr window alone may delay
+ *   its delivery.
+ *
+ *   To solve this, the algorithm below is used.
+ *   v->arch.hvm_vmx.exec_control.VIRTUAL_INTR_PENDING now denotes
+ *   only the L0 control; the physical control may differ from it.
+ *       - if in L1, it behaves normally: the intr window is written
+ *         to the physical control as it is
+ *       - if in L2, replace it with MTF (or NMI window) if possible
+ *       - if MTF/NMI window is not used, the intr window can still be
+ *         used but may have a negative impact on interrupt performance.
+ */
+static int nest_intr_blocked(struct vcpu *v, struct hvm_intack intack)
+{
+    int r = 0;
+
+    if ( !v->arch.hvm_vcpu.in_nesting &&
+         v->arch.hvm_vmx.nest.vmresume_pending )
+        r = 1;
+
+    if ( v->arch.hvm_vcpu.in_nesting )
+    {
+        if ( v->arch.hvm_vmx.nest.vmexit_pending ||
+             v->arch.hvm_vmx.nest.vmresume_in_progress ||
+             (__vmread(VM_ENTRY_INTR_INFO) & INTR_INFO_VALID_MASK) )
+            r = 1;
+    }
+
+    return r;
+}
+
+static int vmx_nest_intr_intercept(struct vcpu *v, struct hvm_intack intack)
+{
+    u32 exit_ctrl;
+
+    /*
+     * TODO:
+     *   - if L1 intr-window exiting == 0
+     *   - vNMI
+     */
+
+    if ( nest_intr_blocked(v, intack) )
+    {
+        enable_intr_window(v, intack);
+        return 1;
+    }
+
+    if ( v->arch.hvm_vcpu.in_nesting )
+    {
+        if ( intack.source == hvm_intsrc_pic ||
+                 intack.source == hvm_intsrc_lapic )
+        {
+            vmx_inject_extint(intack.vector);
+
+            exit_ctrl = __get_vvmcs(v->arch.hvm_vmx.nest.vvmcs,
+                            VM_EXIT_CONTROLS);
+            if ( exit_ctrl & VM_EXIT_ACK_INTR_ON_EXIT )
+            {
+                /* for now, duplicate the ack path in vmx_intr_assist */
+                hvm_vcpu_ack_pending_irq(v, intack);
+                pt_intr_post(v, intack);
+
+                intack = hvm_vcpu_has_pending_irq(v);
+                if ( unlikely(intack.source != hvm_intsrc_none) )
+                    enable_intr_window(v, intack);
+            }
+            else
+                enable_intr_window(v, intack);
+
+            return 1;
+        }
+    }
+
+    return 0;
+}
+
 asmlinkage void vmx_intr_assist(void)
 {
     struct hvm_intack intack;
@@ -133,6 +221,9 @@
         if ( likely(intack.source == hvm_intsrc_none) )
             goto out;
 
+        if ( unlikely(vmx_nest_intr_intercept(v, intack)) )
+            goto out;
+
         intblk = hvm_interrupt_blocked(v, intack);
         if ( intblk == hvm_intblk_tpr )
         {
diff -r c7e763bbea63 -r a7de30ed250d xen/arch/x86/hvm/vmx/nest.c
--- a/xen/arch/x86/hvm/vmx/nest.c	Thu Apr 22 22:30:09 2010 +0800
+++ b/xen/arch/x86/hvm/vmx/nest.c	Thu Apr 22 22:30:09 2010 +0800
@@ -642,6 +642,7 @@
 {
     struct vmx_nest_struct *nest = &v->arch.hvm_vmx.nest;
 
+    /* TODO: change L0 intr window to MTF or NMI window */
     set_shadow_control(nest, CPU_BASED_VM_EXEC_CONTROL, value);
 }
 
@@ -839,6 +840,33 @@
     __set_vvmcs(nest->vvmcs, VM_ENTRY_INTR_INFO, 0);
 }
 
+static void vmx_nest_intr_exit(struct vmx_nest_struct *nest)
+{
+    if ( !(nest->intr_info & INTR_INFO_VALID_MASK) )
+        return;
+
+    switch ( nest->intr_info & INTR_INFO_INTR_TYPE_MASK )
+    {
+    case X86_EVENTTYPE_EXT_INTR:
+        /* rename exit_reason to EXTERNAL_INTERRUPT */
+        __set_vvmcs(nest->vvmcs, VM_EXIT_REASON, EXIT_REASON_EXTERNAL_INTERRUPT);
+        __set_vvmcs(nest->vvmcs, EXIT_QUALIFICATION, 0);
+        __set_vvmcs(nest->vvmcs, VM_EXIT_INTR_INFO, nest->intr_info);
+        break;
+
+    case X86_EVENTTYPE_HW_EXCEPTION:
+    case X86_EVENTTYPE_SW_INTERRUPT:
+    case X86_EVENTTYPE_SW_EXCEPTION:
+        /* throw to L1 */
+        __set_vvmcs(nest->vvmcs, VM_EXIT_INTR_INFO, nest->intr_info);
+        __set_vvmcs(nest->vvmcs, VM_EXIT_INTR_ERROR_CODE, nest->error_code);
+        break;
+    case X86_EVENTTYPE_NMI:
+    default:
+        break;
+    }
+}
+
 static void virtual_vmexit(struct cpu_user_regs *regs)
 {
     struct vcpu *v = current;
@@ -848,6 +876,8 @@
 #endif
 
     sync_vvmcs_ro(nest);
+    vmx_nest_intr_exit(nest);
+
     sync_vvmcs_guest_state(nest);
 
     vmx_vmcs_switch_current(v, v->arch.hvm_vmx.vmcs, nest->hvmcs);
@@ -910,3 +940,39 @@
         virtual_vmentry(regs);
     }
 }
+
+void vmx_nest_idtv_handling(void)
+{
+    struct vcpu *v = current;
+    struct vmx_nest_struct *nest = &v->arch.hvm_vmx.nest;
+    unsigned int idtv_info = __vmread(IDT_VECTORING_INFO);
+
+    if ( likely(!(idtv_info & INTR_INFO_VALID_MASK)) )
+        return;
+
+    /*
+     * If L0 can solve the fault that causes idt vectoring, it should
+     * be reinjected, otherwise, pass to L1.
+     */
+    if ( (__vmread(VM_EXIT_REASON) != EXIT_REASON_EPT_VIOLATION &&
+          !(nest->intr_info & INTR_INFO_VALID_MASK)) ||
+         (__vmread(VM_EXIT_REASON) == EXIT_REASON_EPT_VIOLATION &&
+          !nest->vmexit_pending) )
+    {
+        __vmwrite(VM_ENTRY_INTR_INFO, idtv_info & ~INTR_INFO_RESVD_BITS_MASK);
+        if ( idtv_info & INTR_INFO_DELIVER_CODE_MASK )
+            __vmwrite(VM_ENTRY_EXCEPTION_ERROR_CODE,
+                        __vmread(IDT_VECTORING_ERROR_CODE));
+        /*
+         * SDM 23.2.4, if L1 tries to inject a software interrupt
+         * and the delivery fails, VM_EXIT_INSTRUCTION_LEN receives
+         * the value of previous VM_ENTRY_INSTRUCTION_LEN.
+         *
+         * This means EXIT_INSTRUCTION_LEN is always valid here, for
+         * software interrupts both injected by L1, and generated in L2.
+         */
+        __vmwrite(VM_ENTRY_INSTRUCTION_LEN, __vmread(VM_EXIT_INSTRUCTION_LEN));
+    }
+
+    /* TODO: NMI */
+}
diff -r c7e763bbea63 -r a7de30ed250d xen/arch/x86/hvm/vmx/vmx.c
--- a/xen/arch/x86/hvm/vmx/vmx.c	Thu Apr 22 22:30:09 2010 +0800
+++ b/xen/arch/x86/hvm/vmx/vmx.c	Thu Apr 22 22:30:09 2010 +0800
@@ -1274,6 +1274,7 @@
 {
     unsigned long intr_fields;
     struct vcpu *curr = current;
+    struct vmx_nest_struct *nest = &curr->arch.hvm_vmx.nest;
 
     /*
      * NB. Callers do not need to worry about clearing STI/MOV-SS blocking:
@@ -1285,11 +1286,21 @@
 
     intr_fields = (INTR_INFO_VALID_MASK | (type<<8) | trap);
     if ( error_code != HVM_DELIVER_NO_ERROR_CODE ) {
-        __vmwrite(VM_ENTRY_EXCEPTION_ERROR_CODE, error_code);
         intr_fields |= INTR_INFO_DELIVER_CODE_MASK;
+        if ( curr->arch.hvm_vcpu.in_nesting )
+            nest->error_code = error_code;
+        else
+            __vmwrite(VM_ENTRY_EXCEPTION_ERROR_CODE, error_code);
     }
 
-    __vmwrite(VM_ENTRY_INTR_INFO, intr_fields);
+    if ( curr->arch.hvm_vcpu.in_nesting )
+    {
+        nest->intr_info = intr_fields;
+        nest->vmexit_pending = 1;
+        return;
+    }
+    else
+        __vmwrite(VM_ENTRY_INTR_INFO, intr_fields);
 
     /* Can't inject exceptions in virtual 8086 mode because they would
      * use the protected-mode IDT.  Emulate at the next vmenter instead. */
@@ -1299,9 +1310,14 @@
 
 void vmx_inject_hw_exception(int trap, int error_code)
 {
-    unsigned long intr_info = __vmread(VM_ENTRY_INTR_INFO);
+    unsigned long intr_info;
     struct vcpu *curr = current;
 
+    if ( curr->arch.hvm_vcpu.in_nesting )
+        intr_info = curr->arch.hvm_vmx.nest.intr_info;
+    else
+        intr_info = __vmread(VM_ENTRY_INTR_INFO);
+
     switch ( trap )
     {
     case TRAP_debug:
@@ -2314,9 +2330,31 @@
     return -1;
 }
 
+static void vmx_idtv_reinject(unsigned long idtv_info)
+{
+    if ( hvm_event_needs_reinjection((idtv_info>>8)&7, idtv_info&0xff) )
+    {
+        /* See SDM 3B 25.7.1.1 and .2 for info about masking resvd bits. */
+        __vmwrite(VM_ENTRY_INTR_INFO,
+                  idtv_info & ~INTR_INFO_RESVD_BITS_MASK);
+        if ( idtv_info & INTR_INFO_DELIVER_CODE_MASK )
+            __vmwrite(VM_ENTRY_EXCEPTION_ERROR_CODE,
+                      __vmread(IDT_VECTORING_ERROR_CODE));
+    }
+
+    /*
+     * Clear NMI-blocking interruptibility info if an NMI delivery faulted.
+     * Re-delivery will re-set it (see SDM 3B 25.7.1.2).
+     */
+    if ( (idtv_info & INTR_INFO_INTR_TYPE_MASK) == (X86_EVENTTYPE_NMI<<8) )
+        __vmwrite(GUEST_INTERRUPTIBILITY_INFO,
+                  __vmread(GUEST_INTERRUPTIBILITY_INFO) &
+                  ~VMX_INTR_SHADOW_NMI);
+}
+
 asmlinkage void vmx_vmexit_handler(struct cpu_user_regs *regs)
 {
-    unsigned int exit_reason, idtv_info, intr_info = 0, vector = 0;
+    unsigned int exit_reason, idtv_info = 0, intr_info = 0, vector = 0;
     unsigned long exit_qualification, inst_len = 0;
     struct vcpu *v = current;
 
@@ -2398,29 +2436,14 @@
 
     hvm_maybe_deassert_evtchn_irq();
 
-    /* Event delivery caused this intercept? Queue for redelivery. */
-    idtv_info = __vmread(IDT_VECTORING_INFO);
-    if ( unlikely(idtv_info & INTR_INFO_VALID_MASK) &&
-         (exit_reason != EXIT_REASON_TASK_SWITCH) )
+    /* TODO: consolidate nested idtv handling with ordinary one */
+    if ( !v->arch.hvm_vcpu.in_nesting )
     {
-        if ( hvm_event_needs_reinjection((idtv_info>>8)&7, idtv_info&0xff) )
-        {
-            /* See SDM 3B 25.7.1.1 and .2 for info about masking resvd bits. */
-            __vmwrite(VM_ENTRY_INTR_INFO,
-                      idtv_info & ~INTR_INFO_RESVD_BITS_MASK);
-            if ( idtv_info & INTR_INFO_DELIVER_CODE_MASK )
-                __vmwrite(VM_ENTRY_EXCEPTION_ERROR_CODE,
-                          __vmread(IDT_VECTORING_ERROR_CODE));
-        }
-
-        /*
-         * Clear NMI-blocking interruptibility info if an NMI delivery faulted.
-         * Re-delivery will re-set it (see SDM 3B 25.7.1.2).
-         */
-        if ( (idtv_info & INTR_INFO_INTR_TYPE_MASK) == (X86_EVENTTYPE_NMI<<8) )
-            __vmwrite(GUEST_INTERRUPTIBILITY_INFO,
-                      __vmread(GUEST_INTERRUPTIBILITY_INFO) &
-                      ~VMX_INTR_SHADOW_NMI);
+        /* Event delivery caused this intercept? Queue for redelivery. */
+        idtv_info = __vmread(IDT_VECTORING_INFO);
+        if ( unlikely(idtv_info & INTR_INFO_VALID_MASK) &&
+             (exit_reason != EXIT_REASON_TASK_SWITCH) )
+            vmx_idtv_reinject(idtv_info);
     }
 
     switch ( exit_reason )
@@ -2736,6 +2759,9 @@
         domain_crash(v->domain);
         break;
     }
+
+    if ( v->arch.hvm_vcpu.in_nesting )
+        vmx_nest_idtv_handling();
 }
 
 asmlinkage void vmx_vmenter_helper(void)
diff -r c7e763bbea63 -r a7de30ed250d xen/include/asm-x86/hvm/vmx/nest.h
--- a/xen/include/asm-x86/hvm/vmx/nest.h	Thu Apr 22 22:30:09 2010 +0800
+++ b/xen/include/asm-x86/hvm/vmx/nest.h	Thu Apr 22 22:30:09 2010 +0800
@@ -44,6 +44,9 @@
     int                  vmexit_pending;
     int                  vmresume_pending;
     int                  vmresume_in_progress;
+
+    unsigned long        intr_info;
+    unsigned long        error_code;
 };
 
 asmlinkage void vmx_nest_switch_mode(void);
@@ -66,4 +69,6 @@
                                             unsigned long value);
 void vmx_nest_update_exception_bitmap(struct vcpu *v, unsigned long value);
 
+void vmx_nest_idtv_handling(void);
+
 #endif /* __ASM_X86_HVM_NEST_H__ */
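The blocking rule in nest_intr_blocked() can be read in isolation from the VMCS plumbing. Below is a minimal standalone sketch of that decision, assuming a simplified state structure. The struct nest_state and its entry_intr_info field are hypothetical stand-ins for the per-vcpu nested state and for __vmread(VM_ENTRY_INTR_INFO); they are not part of the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit 31 of an interruption-information field marks it as valid. */
#define INTR_INFO_VALID_MASK 0x80000000u

/* Hypothetical, simplified stand-in for the per-vcpu nested state. */
struct nest_state {
    bool in_nesting;            /* vcpu is currently running L2 */
    bool vmexit_pending;        /* a virtual VMExit to L1 is queued */
    bool vmresume_pending;      /* a virtual VMEntry to L2 is queued */
    bool vmresume_in_progress;  /* the virtual VMEntry has not completed */
    uint32_t entry_intr_info;   /* models __vmread(VM_ENTRY_INTR_INFO) */
};

/* Returns true when L0 must hold the interrupt back for now. */
static bool nest_intr_blocked(const struct nest_state *s)
{
    /* In L1: blocked only while a virtual VMEntry is about to happen. */
    if ( !s->in_nesting )
        return s->vmresume_pending;

    /*
     * In L2: blocked by a pending virtual VMExit, an unfinished virtual
     * VMEntry, or an event already queued for injection.
     */
    return s->vmexit_pending ||
           s->vmresume_in_progress ||
           (s->entry_intr_info & INTR_INFO_VALID_MASK) != 0;
}
```

When the function returns true, the patch opens the interrupt window and retries later instead of injecting the interrupt right away.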
Handle VMExits that happen in L2.
Signed-off-by: Qing He <qing.he@intel.com>
---
 arch/x86/hvm/vmx/nest.c        |  182 +++++++++++++++++++++++++++++++++++++++++
 arch/x86/hvm/vmx/vmx.c         |    6 +
 include/asm-x86/hvm/vmx/nest.h |    3 
 include/asm-x86/hvm/vmx/vmx.h  |    1 
 4 files changed, 192 insertions(+)
diff -r a7de30ed250d -r 2f9ba6dbbe62 xen/arch/x86/hvm/vmx/nest.c
--- a/xen/arch/x86/hvm/vmx/nest.c	Thu Apr 22 22:30:09 2010 +0800
+++ b/xen/arch/x86/hvm/vmx/nest.c	Thu Apr 22 22:30:09 2010 +0800
@@ -976,3 +976,185 @@
 
     /* TODO: NMI */
 }
+
+/*
+ * L2 VMExit handling
+ */
+
+static struct control_bit_for_reason {
+    int reason;
+    unsigned long bit;
+} control_bit_for_reason [] = {
+    {EXIT_REASON_PENDING_VIRT_INTR, CPU_BASED_VIRTUAL_INTR_PENDING},
+    {EXIT_REASON_HLT, CPU_BASED_HLT_EXITING},
+    {EXIT_REASON_INVLPG, CPU_BASED_INVLPG_EXITING},
+    {EXIT_REASON_MWAIT_INSTRUCTION, CPU_BASED_MWAIT_EXITING},
+    {EXIT_REASON_RDPMC, CPU_BASED_RDPMC_EXITING},
+    {EXIT_REASON_RDTSC, CPU_BASED_RDTSC_EXITING},
+    {EXIT_REASON_PENDING_VIRT_NMI, CPU_BASED_VIRTUAL_NMI_PENDING},
+    {EXIT_REASON_DR_ACCESS, CPU_BASED_MOV_DR_EXITING},
+    {EXIT_REASON_MONITOR_INSTRUCTION, CPU_BASED_MONITOR_EXITING},
+    {EXIT_REASON_PAUSE_INSTRUCTION, CPU_BASED_PAUSE_EXITING},
+};
+
+int vmx_nest_l2_vmexit_handler(struct cpu_user_regs *regs,
+                               unsigned int exit_reason)
+{
+    struct vcpu *v = current;
+    struct vmx_nest_struct *nest = &v->arch.hvm_vmx.nest;
+    u32 ctrl;
+    int bypass_l0 = 0;
+
+    nest->vmexit_pending = 0;
+    nest->intr_info = 0;
+    nest->error_code = 0;
+
+    switch (exit_reason) {
+    case EXIT_REASON_EXCEPTION_NMI:
+    {
+        u32 intr_info = __vmread(VM_EXIT_INTR_INFO);
+        u32 valid_mask = (X86_EVENTTYPE_HW_EXCEPTION << 8) |
+                         INTR_INFO_VALID_MASK;
+        u64 exec_bitmap;
+        int vector = intr_info & INTR_INFO_VECTOR_MASK;
+
+        /*
+         * decided by L0 and L1 exception bitmap, if the vector is set by
+         * both, L0 has priority on #PF, L1 has priority on others
+         */
+        if ( vector == TRAP_page_fault )
+        {
+            if ( paging_mode_hap(v->domain) )
+                nest->vmexit_pending = 1;
+        }
+        else if ( (intr_info & valid_mask) == valid_mask )
+        {
+            exec_bitmap =__get_vvmcs(nest->vvmcs, EXCEPTION_BITMAP);
+
+            if ( exec_bitmap & (1 << vector) )
+                nest->vmexit_pending = 1;
+        }
+        break;
+    }
+
+    case EXIT_REASON_WBINVD:
+    case EXIT_REASON_EPT_VIOLATION:
+    case EXIT_REASON_EPT_MISCONFIG:
+    case EXIT_REASON_EXTERNAL_INTERRUPT:
+        /* pass to L0 handler */
+        break;
+
+    case VMX_EXIT_REASONS_FAILED_VMENTRY:
+    case EXIT_REASON_TRIPLE_FAULT:
+    case EXIT_REASON_TASK_SWITCH:
+    case EXIT_REASON_IO_INSTRUCTION:
+    case EXIT_REASON_CPUID:
+    case EXIT_REASON_MSR_READ:
+    case EXIT_REASON_MSR_WRITE:
+    case EXIT_REASON_VMCALL:
+    case EXIT_REASON_VMCLEAR:
+    case EXIT_REASON_VMLAUNCH:
+    case EXIT_REASON_VMPTRLD:
+    case EXIT_REASON_VMPTRST:
+    case EXIT_REASON_VMREAD:
+    case EXIT_REASON_VMRESUME:
+    case EXIT_REASON_VMWRITE:
+    case EXIT_REASON_VMXOFF:
+    case EXIT_REASON_VMXON:
+    case EXIT_REASON_INVEPT:
+        /* inject to L1 */
+        nest->vmexit_pending = 1;
+        break;
+
+    case EXIT_REASON_PENDING_VIRT_INTR:
+    {
+        ctrl = v->arch.hvm_vmx.exec_control;
+
+        /*
+         * if both open the intr/nmi window, L0 has priority.
+         *
+         * Note that this is not strictly correct: in L2 context,
+         * L0's intr/nmi window flag should be replaced with MTF,
+         * causing an immediate VMExit, but MTF may not be available
+         * on all hardware.
+         */
+        if ( !(ctrl & CPU_BASED_VIRTUAL_INTR_PENDING) )
+            nest->vmexit_pending = 1;
+
+        break;
+    }
+    case EXIT_REASON_PENDING_VIRT_NMI:
+    {
+        ctrl = v->arch.hvm_vmx.exec_control;
+
+        if ( !(ctrl & CPU_BASED_VIRTUAL_NMI_PENDING) )
+            nest->vmexit_pending = 1;
+
+        break;
+    }
+
+    case EXIT_REASON_HLT:
+    case EXIT_REASON_RDTSC:
+    case EXIT_REASON_RDPMC:
+    case EXIT_REASON_MWAIT_INSTRUCTION:
+    case EXIT_REASON_PAUSE_INSTRUCTION:
+    case EXIT_REASON_MONITOR_INSTRUCTION:
+    case EXIT_REASON_DR_ACCESS:
+    case EXIT_REASON_INVLPG:
+    {
+        int i;
+
+        /* exit according to guest exec_control */
+        ctrl = __get_vvmcs(nest->vvmcs, CPU_BASED_VM_EXEC_CONTROL);
+
+        for ( i = 0; i < ARRAY_SIZE(control_bit_for_reason); i++ )
+            if ( control_bit_for_reason[i].reason == exit_reason )
+                break;
+
+        if ( i == ARRAY_SIZE(control_bit_for_reason) )
+            break;
+
+        if ( control_bit_for_reason[i].bit & ctrl )
+            nest->vmexit_pending = 1;
+
+        break;
+    }
+    case EXIT_REASON_CR_ACCESS:
+    {
+        u64 exit_qualification = __vmread(EXIT_QUALIFICATION);
+        int cr = exit_qualification & 15;
+        int write = (exit_qualification >> 4) & 3;
+        u32 mask = 0;
+
+        /* also according to guest exec_control */
+        ctrl = __get_vvmcs(nest->vvmcs, CPU_BASED_VM_EXEC_CONTROL);
+
+        if ( cr == 3 )
+        {
+            mask = write? CPU_BASED_CR3_STORE_EXITING:
+                          CPU_BASED_CR3_LOAD_EXITING;
+            if ( ctrl & mask )
+                nest->vmexit_pending = 1;
+        }
+        else if ( cr == 8 )
+        {
+            mask = write? CPU_BASED_CR8_STORE_EXITING:
+                          CPU_BASED_CR8_LOAD_EXITING;
+            if ( ctrl & mask )
+                nest->vmexit_pending = 1;
+        }
+        else  /* CR0, CR4, CLTS, LMSW */
+            nest->vmexit_pending = 1;
+
+        break;
+    }
+    default:
+        gdprintk(XENLOG_WARNING, "Unknown nested vmexit reason %x.\n",
+                 exit_reason);
+    }
+
+    if ( nest->vmexit_pending )
+        bypass_l0 = 1;
+
+    return bypass_l0;
+}
diff -r a7de30ed250d -r 2f9ba6dbbe62 xen/arch/x86/hvm/vmx/vmx.c
--- a/xen/arch/x86/hvm/vmx/vmx.c	Thu Apr 22 22:30:09 2010 +0800
+++ b/xen/arch/x86/hvm/vmx/vmx.c	Thu Apr 22 22:30:09 2010 +0800
@@ -2397,6 +2397,11 @@
     local_irq_enable();
 
     v->arch.hvm_vmx.nest.vmresume_in_progress = 0;
+    if ( v->arch.hvm_vcpu.in_nesting )
+    {
+        if ( vmx_nest_l2_vmexit_handler(regs, exit_reason) )
+            goto out;
+    }
 
     if ( unlikely(exit_reason & VMX_EXIT_REASONS_FAILED_VMENTRY) )
         return vmx_failed_vmentry(exit_reason, regs);
@@ -2760,6 +2765,7 @@
         break;
     }
 
+out:
     if ( v->arch.hvm_vcpu.in_nesting )
         vmx_nest_idtv_handling();
 }
diff -r a7de30ed250d -r 2f9ba6dbbe62 xen/include/asm-x86/hvm/vmx/nest.h
--- a/xen/include/asm-x86/hvm/vmx/nest.h	Thu Apr 22 22:30:09 2010 +0800
+++ b/xen/include/asm-x86/hvm/vmx/nest.h	Thu Apr 22 22:30:09 2010 +0800
@@ -71,4 +71,7 @@
 
 void vmx_nest_idtv_handling(void);
 
+int vmx_nest_l2_vmexit_handler(struct cpu_user_regs *regs,
+                               unsigned int exit_reason);
+
 #endif /* __ASM_X86_HVM_NEST_H__ */
diff -r a7de30ed250d -r 2f9ba6dbbe62 xen/include/asm-x86/hvm/vmx/vmx.h
--- a/xen/include/asm-x86/hvm/vmx/vmx.h	Thu Apr 22 22:30:09 2010 +0800
+++ b/xen/include/asm-x86/hvm/vmx/vmx.h	Thu Apr 22 22:30:09 2010 +0800
@@ -112,6 +112,7 @@
 #define EXIT_REASON_APIC_ACCESS         44
 #define EXIT_REASON_EPT_VIOLATION       48
 #define EXIT_REASON_EPT_MISCONFIG       49
+#define EXIT_REASON_INVEPT              50
 #define EXIT_REASON_RDTSCP              51
 #define EXIT_REASON_WBINVD              54
 #define EXIT_REASON_XSETBV              55
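The EXIT_REASON_CR_ACCESS case above decodes the exit qualification and then consults L1's execution controls. The following is a minimal standalone sketch of that decision, assuming the layout given in the Intel SDM (bits 3:0 hold the control register number, bits 5:4 the access type). The helper name cr_access_goes_to_l1() is hypothetical and not part of the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Primary processor-based execution controls (bit positions per the SDM). */
#define CPU_BASED_CR3_LOAD_EXITING   0x00008000u
#define CPU_BASED_CR3_STORE_EXITING  0x00010000u
#define CPU_BASED_CR8_LOAD_EXITING   0x00080000u
#define CPU_BASED_CR8_STORE_EXITING  0x00100000u

/*
 * Hypothetical helper, not part of the patch: returns true when a
 * CR-access exit taken in L2 should be reflected to L1.
 */
static bool cr_access_goes_to_l1(uint64_t exit_qualification,
                                 uint32_t l1_exec_control)
{
    unsigned int cr = exit_qualification & 15;          /* bits 3:0 */
    unsigned int type = (exit_qualification >> 4) & 3;  /* bits 5:4 */
    /* type: 0 = MOV to CR (load), 1 = MOV from CR (store), 2 = CLTS, 3 = LMSW */
    bool store = (type != 0);

    if ( cr == 3 )
        return (l1_exec_control & (store ? CPU_BASED_CR3_STORE_EXITING
                                         : CPU_BASED_CR3_LOAD_EXITING)) != 0;

    if ( cr == 8 )
        return (l1_exec_control & (store ? CPU_BASED_CR8_STORE_EXITING
                                         : CPU_BASED_CR8_LOAD_EXITING)) != 0;

    /* CR0, CR4, CLTS and LMSW always go to L1, as in the patch. */
    return true;
}
```

Note that the variable the patch calls "write" is nonzero for MOV from CR, which is why it selects the *_STORE_EXITING bit.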
L2 TSC needs special handling, whether rdtsc exiting is
turned on or off.
Signed-off-by: Qing He <qing.he@intel.com>
---
 arch/x86/hvm/vmx/nest.c        |   31 +++++++++++++++++++++++++++++++
 arch/x86/hvm/vmx/vmx.c         |    4 ++++
 include/asm-x86/hvm/vmx/nest.h |    2 ++
 3 files changed, 37 insertions(+)
diff -r 2f9ba6dbbe62 -r 2332586ff957 xen/arch/x86/hvm/vmx/nest.c
--- a/xen/arch/x86/hvm/vmx/nest.c	Thu Apr 22 22:30:09 2010 +0800
+++ b/xen/arch/x86/hvm/vmx/nest.c	Thu Apr 22 22:30:09 2010 +0800
@@ -533,6 +533,18 @@
  * Nested VMX context switch
  */
 
+u64 vmx_nest_get_tsc_offset(struct vcpu *v)
+{
+    u64 offset = 0;
+    struct vmx_nest_struct *nest = &v->arch.hvm_vmx.nest;
+
+    if ( __get_vvmcs(nest->vvmcs, CPU_BASED_VM_EXEC_CONTROL) &
+         CPU_BASED_USE_TSC_OFFSETING )
+        offset = __get_vvmcs(nest->vvmcs, TSC_OFFSET);
+
+    return offset;
+}
+
 static unsigned long vmcs_gstate_field[] = {
     /* 16 BITS */
     GUEST_ES_SELECTOR,
@@ -715,6 +727,8 @@
     hvm_set_cr4(__get_vvmcs(nest->vvmcs, GUEST_CR4));
     hvm_set_cr3(__get_vvmcs(nest->vvmcs, GUEST_CR3));
 
+    hvm_funcs.set_tsc_offset(v, v->arch.hvm_vcpu.cache_tsc_offset);
+
     vvmcs_to_shadow(nest->vvmcs, VM_ENTRY_INTR_INFO);
     vvmcs_to_shadow(nest->vvmcs, VM_ENTRY_EXCEPTION_ERROR_CODE);
     vvmcs_to_shadow(nest->vvmcs, VM_ENTRY_INSTRUCTION_LEN);
@@ -837,6 +851,8 @@
     hvm_set_cr4(__get_vvmcs(nest->vvmcs, HOST_CR4));
     hvm_set_cr3(__get_vvmcs(nest->vvmcs, HOST_CR3));
 
+    hvm_funcs.set_tsc_offset(v, v->arch.hvm_vcpu.cache_tsc_offset);
+
     __set_vvmcs(nest->vvmcs, VM_ENTRY_INTR_INFO, 0);
 }
 
@@ -1116,6 +1132,21 @@
 
         if ( control_bit_for_reason[i].bit & ctrl )
             nest->vmexit_pending = 1;
+        else if ( exit_reason == EXIT_REASON_RDTSC )
+        {
+            uint64_t tsc;
+
+            /*
+             * rdtsc can't be handled normally in the L0 handler
+             * if L1 doesn't want it
+             */
+            tsc = hvm_get_guest_tsc(v);
+            tsc += __get_vvmcs(nest->vvmcs, TSC_OFFSET);
+            regs->eax = (uint32_t)tsc;
+            regs->edx = (uint32_t)(tsc >> 32);
+
+            bypass_l0 = 1;
+        }
 
         break;
     }
diff -r 2f9ba6dbbe62 -r 2332586ff957 xen/arch/x86/hvm/vmx/vmx.c
--- a/xen/arch/x86/hvm/vmx/vmx.c	Thu Apr 22 22:30:09 2010 +0800
+++ b/xen/arch/x86/hvm/vmx/vmx.c	Thu Apr 22 22:30:09 2010 +0800
@@ -974,6 +974,10 @@
 static void vmx_set_tsc_offset(struct vcpu *v, u64 offset)
 {
     vmx_vmcs_enter(v);
+
+    if ( v->arch.hvm_vcpu.in_nesting )
+        offset += vmx_nest_get_tsc_offset(v);
+
     __vmwrite(TSC_OFFSET, offset);
 #if defined (__i386__)
     __vmwrite(TSC_OFFSET_HIGH, offset >> 32);
diff -r 2f9ba6dbbe62 -r 2332586ff957 xen/include/asm-x86/hvm/vmx/nest.h
--- a/xen/include/asm-x86/hvm/vmx/nest.h	Thu Apr 22 22:30:09 2010 +0800
+++ b/xen/include/asm-x86/hvm/vmx/nest.h	Thu Apr 22 22:30:09 2010 +0800
@@ -69,6 +69,8 @@
                                             unsigned long value);
 void vmx_nest_update_exception_bitmap(struct vcpu *v, unsigned long value);
 
+u64 vmx_nest_get_tsc_offset(struct vcpu *v);
+
 void vmx_nest_idtv_handling(void);
 
 int vmx_nest_l2_vmexit_handler(struct cpu_user_regs *regs,
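The TSC handling in this patch boils down to stacking two offsets: the one L0 applies for L1, plus the one L1 programs for L2 when it enables TSC offsetting. A minimal standalone sketch of that arithmetic follows, together with the EDX:EAX split used when RDTSC is emulated. The helper names l2_tsc_offset() and split_tsc() are hypothetical and only illustrate the idea.

```c
#include <stdint.h>

/* "Use TSC offsetting" is bit 3 of the primary processor-based controls. */
#define CPU_BASED_USE_TSC_OFFSETING 0x00000008u

/*
 * Hypothetical helper: the offset to program into the physical VMCS
 * while L2 runs. Unsigned wraparound makes negative offsets work.
 */
static uint64_t l2_tsc_offset(uint64_t l0_offset, uint32_t l1_exec_control,
                              uint64_t l1_offset)
{
    uint64_t offset = l0_offset;

    /* L1's offset only counts if L1 turned TSC offsetting on. */
    if ( l1_exec_control & CPU_BASED_USE_TSC_OFFSETING )
        offset += l1_offset;

    return offset;
}

/* Hypothetical helper: split a 64-bit TSC value as RDTSC returns it. */
static void split_tsc(uint64_t tsc, uint32_t *eax, uint32_t *edx)
{
    *eax = (uint32_t)tsc;          /* low 32 bits */
    *edx = (uint32_t)(tsc >> 32);  /* high 32 bits */
}
```

This mirrors vmx_nest_get_tsc_offset() being added on top of the cached offset in vmx_set_tsc_offset().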
1. #NM exits from L2 should be handled by L0 if L0 wants them.
2. HOST_CR0.TS may need to be updated after an L1<->L2 switch.
Signed-off-by: Qing He <qing.he@intel.com>
---
 nest.c |   13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)
diff -r 2332586ff957 -r 25c338cbc024 xen/arch/x86/hvm/vmx/nest.c
--- a/xen/arch/x86/hvm/vmx/nest.c	Thu Apr 22 22:30:09 2010 +0800
+++ b/xen/arch/x86/hvm/vmx/nest.c	Thu Apr 22 22:30:09 2010 +0800
@@ -791,6 +791,9 @@
     regs->rsp = __get_vvmcs(nest->vvmcs, GUEST_RSP);
     regs->rflags = __get_vvmcs(nest->vvmcs, GUEST_RFLAGS);
 
+    /* updating host cr0 to sync TS bit */
+    __vmwrite(HOST_CR0, v->arch.hvm_vmx.host_cr0);
+
     /* TODO: EPT_POINTER */
 }
 
@@ -927,6 +930,9 @@
     regs->rsp = __get_vvmcs(nest->vvmcs, HOST_RSP);
     regs->rflags = __vmread(GUEST_RFLAGS);
 
+    /* updating host cr0 to sync TS bit */
+    __vmwrite(HOST_CR0, v->arch.hvm_vmx.host_cr0);
+
     vmreturn(regs, VMSUCCEED);
 }
 
@@ -1036,13 +1042,18 @@
 
         /*
          * decided by L0 and L1 exception bitmap, if the vector is set by
-         * both, L0 has priority on #PF, L1 has priority on others
+         * both, L0 has priority on #PF and #NM, L1 has priority on others
          */
         if ( vector == TRAP_page_fault )
         {
             if ( paging_mode_hap(v->domain) )
                 nest->vmexit_pending = 1;
         }
+        else if ( vector == TRAP_no_device )
+        {
+            if ( v->fpu_dirtied )
+                nest->vmexit_pending = 1;
+        }
         else if ( (intr_info & valid_mask) == valid_mask )
         {
             exec_bitmap =__get_vvmcs(nest->vvmcs, EXCEPTION_BITMAP);
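With this change the exception routing has three tiers: #PF, #NM, and everything else. A minimal standalone sketch of that priority order follows. It assumes the exit was already identified as a valid hardware exception, and the helper name exception_goes_to_l1() is hypothetical. The boolean parameters stand in for paging_mode_hap(v->domain) and v->fpu_dirtied.

```c
#include <stdbool.h>
#include <stdint.h>

#define TRAP_no_device   7   /* #NM */
#define TRAP_page_fault 14   /* #PF */

/*
 * Hypothetical helper: returns true when an exception raised in L2
 * should be reflected to L1 instead of being handled by L0.
 */
static bool exception_goes_to_l1(unsigned int vector, bool hap_enabled,
                                 bool fpu_dirtied,
                                 uint32_t l1_exception_bitmap)
{
    /* #PF: with shadow paging L0 must see the fault first. */
    if ( vector == TRAP_page_fault )
        return hap_enabled;

    /* #NM: L0 takes it for lazy FPU restore unless the FPU is already live. */
    if ( vector == TRAP_no_device )
        return fpu_dirtied;

    /* Everything else follows L1's exception bitmap. */
    if ( vector >= 32 )
        return false;
    return ((l1_exception_bitmap >> vector) & 1u) != 0;
}
```

As in the patch, the #PF and #NM branches do not consult L1's bitmap at all.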
Qing He
2010-Apr-22  09:41 UTC
[Xen-devel] [PATCH 13/17] vmx: nest: capability reporting MSRs
Handle the VMX capability reporting MSRs.
Some features are masked so that L1 sees a rather
simple configuration.
Signed-off-by: Qing He <qing.he@intel.com>
---
 arch/x86/hvm/vmx/nest.c        |   94 +++++++++++++++++++++++++++++++++++++++++
 arch/x86/hvm/vmx/vmx.c         |   14 ++++--
 include/asm-x86/hvm/vmx/nest.h |    5 ++
 include/asm-x86/hvm/vmx/vmcs.h |    5 ++
 include/asm-x86/msr-index.h    |    1 
 5 files changed, 115 insertions(+), 4 deletions(-)
diff -r 25c338cbc024 -r 0f0e32a70c02 xen/arch/x86/hvm/vmx/nest.c
--- a/xen/arch/x86/hvm/vmx/nest.c	Thu Apr 22 22:30:09 2010 +0800
+++ b/xen/arch/x86/hvm/vmx/nest.c	Thu Apr 22 22:30:09 2010 +0800
@@ -1200,3 +1200,97 @@
 
     return bypass_l0;
 }
+
+/*
+ * Capability reporting
+ */
+int vmx_nest_msr_read_intercept(struct cpu_user_regs *regs, u64 *msr_content)
+{
+    u32 eax, edx;
+    u64 data = 0;
+    int r = 1;
+    u32 mask = 0;
+
+    if ( !current->domain->arch.hvm_domain.nesting_avail )
+        return 0;
+
+    switch (regs->ecx) {
+    case MSR_IA32_VMX_BASIC:
+        rdmsr(regs->ecx, eax, edx);
+        data = edx;
+        data = (data & ~0x1fff) | 0x1000;     /* request 4KB for guest VMCS */
+        data &= ~(1 << 23);                   /* disable TRUE_xxx_CTLS */
+        data = (data << 32) | VVMCS_REVISION; /* VVMCS revision */
+        break;
+    case MSR_IA32_VMX_PINBASED_CTLS:
+#define REMOVED_PIN_CONTROL_CAP (PIN_BASED_PREEMPT_TIMER)
+        rdmsr(regs->ecx, eax, edx);
+        data = edx & ~REMOVED_PIN_CONTROL_CAP;
+        data = (data << 32) | eax;
+        break;
+    case MSR_IA32_VMX_PROCBASED_CTLS:
+        rdmsr(regs->ecx, eax, edx);
+#define REMOVED_EXEC_CONTROL_CAP (CPU_BASED_TPR_SHADOW \
+            | CPU_BASED_ACTIVATE_MSR_BITMAP            \
+            | CPU_BASED_ACTIVATE_SECONDARY_CONTROLS)
+        data = edx & ~REMOVED_EXEC_CONTROL_CAP;
+        data = (data << 32) | eax;
+        break;
+    case MSR_IA32_VMX_EXIT_CTLS:
+        rdmsr(regs->ecx, eax, edx);
+#define REMOVED_EXIT_CONTROL_CAP (VM_EXIT_SAVE_GUEST_PAT \
+            | VM_EXIT_LOAD_HOST_PAT                      \
+            | VM_EXIT_SAVE_GUEST_EFER                    \
+            | VM_EXIT_LOAD_HOST_EFER                     \
+            | VM_EXIT_SAVE_PREEMPT_TIMER)
+        data = edx & ~REMOVED_EXIT_CONTROL_CAP;
+        data = (data << 32) | eax;
+        break;
+    case MSR_IA32_VMX_ENTRY_CTLS:
+        rdmsr(regs->ecx, eax, edx);
+#define REMOVED_ENTRY_CONTROL_CAP (VM_ENTRY_LOAD_GUEST_PAT \
+            | VM_ENTRY_LOAD_GUEST_EFER)
+        data = edx & ~REMOVED_ENTRY_CONTROL_CAP;
+        data = (data << 32) | eax;
+        break;
+    case MSR_IA32_VMX_PROCBASED_CTLS2:
+        mask = 0;
+
+        rdmsr(regs->ecx, eax, edx);
+        data = edx & mask;
+        data = (data << 32) | eax;
+        break;
+
+    /* pass through MSRs */
+    case IA32_FEATURE_CONTROL_MSR:
+    case MSR_IA32_VMX_MISC:
+    case MSR_IA32_VMX_CR0_FIXED0:
+    case MSR_IA32_VMX_CR0_FIXED1:
+    case MSR_IA32_VMX_CR4_FIXED0:
+    case MSR_IA32_VMX_CR4_FIXED1:
+    case MSR_IA32_VMX_VMCS_ENUM:
+        rdmsr(regs->ecx, eax, edx);
+        data = edx;
+        data = (data << 32) | eax;
+        gdprintk(XENLOG_INFO,
+            "nest: pass through VMX cap reporting register, %lx\n",
+            regs->ecx);
+        break;
+    default:
+        r = 0;
+        break;
+    }
+
+    if (r == 1)
+        gdprintk(XENLOG_DEBUG, "nest: intercepted msr access: %lx: %lx\n",
+            regs->ecx, data);
+
+    *msr_content = data;
+    return r;
+}
+
+int vmx_nest_msr_write_intercept(struct cpu_user_regs *regs, u64 msr_content)
+{
+    /* silently ignore for now */
+    return 1;
+}
diff -r 25c338cbc024 -r 0f0e32a70c02 xen/arch/x86/hvm/vmx/vmx.c
--- a/xen/arch/x86/hvm/vmx/vmx.c	Thu Apr 22 22:30:09 2010 +0800
+++ b/xen/arch/x86/hvm/vmx/vmx.c	Thu Apr 22 22:30:09 2010 +0800
@@ -1892,8 +1892,11 @@
         msr_content |= (u64)__vmread(GUEST_IA32_DEBUGCTL_HIGH) << 32;
 #endif
         break;
-    case MSR_IA32_VMX_BASIC...MSR_IA32_VMX_PROCBASED_CTLS2:
-        goto gp_fault;
+    case IA32_FEATURE_CONTROL_MSR:
+    case MSR_IA32_VMX_BASIC...MSR_IA32_VMX_TRUE_ENTRY_CTLS:
+        if ( !vmx_nest_msr_read_intercept(regs, &msr_content) )
+            goto gp_fault;
+        break;
     case MSR_IA32_MISC_ENABLE:
         rdmsrl(MSR_IA32_MISC_ENABLE, msr_content);
         /* Debug Trace Store is not supported. */
@@ -2067,8 +2070,11 @@
 
         break;
     }
-    case MSR_IA32_VMX_BASIC...MSR_IA32_VMX_PROCBASED_CTLS2:
-        goto gp_fault;
+    case IA32_FEATURE_CONTROL_MSR:
+    case MSR_IA32_VMX_BASIC...MSR_IA32_VMX_TRUE_ENTRY_CTLS:
+        if ( !vmx_nest_msr_write_intercept(regs, msr_content) )
+            goto gp_fault;
+        break;
     default:
         if ( vpmu_do_wrmsr(regs) )
             return X86EMUL_OKAY;
diff -r 25c338cbc024 -r 0f0e32a70c02 xen/include/asm-x86/hvm/vmx/nest.h
--- a/xen/include/asm-x86/hvm/vmx/nest.h	Thu Apr 22 22:30:09 2010 +0800
+++ b/xen/include/asm-x86/hvm/vmx/nest.h	Thu Apr 22 22:30:09 2010 +0800
@@ -76,4 +76,9 @@
 int vmx_nest_l2_vmexit_handler(struct cpu_user_regs *regs,
                                unsigned int exit_reason);
 
+int vmx_nest_msr_read_intercept(struct cpu_user_regs *regs,
+                                u64 *msr_content);
+int vmx_nest_msr_write_intercept(struct cpu_user_regs *regs,
+                                 u64 msr_content);
+
 #endif /* __ASM_X86_HVM_NEST_H__ */
diff -r 25c338cbc024 -r 0f0e32a70c02 xen/include/asm-x86/hvm/vmx/vmcs.h
--- a/xen/include/asm-x86/hvm/vmx/vmcs.h	Thu Apr 22 22:30:09 2010 +0800
+++ b/xen/include/asm-x86/hvm/vmx/vmcs.h	Thu Apr 22 22:30:09 2010 +0800
@@ -157,18 +157,23 @@
 #define PIN_BASED_EXT_INTR_MASK         0x00000001
 #define PIN_BASED_NMI_EXITING           0x00000008
 #define PIN_BASED_VIRTUAL_NMIS          0x00000020
+#define PIN_BASED_PREEMPT_TIMER         0x00000040
 extern u32 vmx_pin_based_exec_control;
 
 #define VM_EXIT_IA32E_MODE              0x00000200
 #define VM_EXIT_ACK_INTR_ON_EXIT        0x00008000
 #define VM_EXIT_SAVE_GUEST_PAT          0x00040000
 #define VM_EXIT_LOAD_HOST_PAT           0x00080000
+#define VM_EXIT_SAVE_GUEST_EFER         0x00100000
+#define VM_EXIT_LOAD_HOST_EFER          0x00200000
+#define VM_EXIT_SAVE_PREEMPT_TIMER      0x00400000
 extern u32 vmx_vmexit_control;
 
 #define VM_ENTRY_IA32E_MODE             0x00000200
 #define VM_ENTRY_SMM                    0x00000400
 #define VM_ENTRY_DEACT_DUAL_MONITOR     0x00000800
 #define VM_ENTRY_LOAD_GUEST_PAT         0x00004000
+#define VM_ENTRY_LOAD_GUEST_EFER        0x00008000
 extern u32 vmx_vmentry_control;
 
 #define SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES 0x00000001
diff -r 25c338cbc024 -r 0f0e32a70c02 xen/include/asm-x86/msr-index.h
--- a/xen/include/asm-x86/msr-index.h	Thu Apr 22 22:30:09 2010 +0800
+++ b/xen/include/asm-x86/msr-index.h	Thu Apr 22 22:30:09 2010 +0800
@@ -165,6 +165,7 @@
 #define MSR_IA32_VMX_CR0_FIXED1                 0x487
 #define MSR_IA32_VMX_CR4_FIXED0                 0x488
 #define MSR_IA32_VMX_CR4_FIXED1                 0x489
+#define MSR_IA32_VMX_VMCS_ENUM                  0x48a
 #define MSR_IA32_VMX_PROCBASED_CTLS2            0x48b
 #define MSR_IA32_VMX_EPT_VPID_CAP               0x48c
 #define MSR_IA32_VMX_TRUE_PINBASED_CTLS         0x48d
Expose the VMX CPUID feature bit and allow the guest to enable VMX (CR4.VMXE).
Signed-off-by: Qing He <qing.he@intel.com>
---
 arch/x86/hvm/vmx/vmx.c    |    5 +++++
 include/asm-x86/hvm/hvm.h |    3 ++-
 2 files changed, 7 insertions(+), 1 deletion(-)
diff -r 0f0e32a70c02 -r 22df5f7ec6d3 xen/arch/x86/hvm/vmx/vmx.c
--- a/xen/arch/x86/hvm/vmx/vmx.c	Thu Apr 22 22:30:09 2010 +0800
+++ b/xen/arch/x86/hvm/vmx/vmx.c	Thu Apr 22 22:30:09 2010 +0800
@@ -1561,6 +1561,11 @@
 
     switch ( input )
     {
+        case 0x1:
+            if ( v->domain->arch.hvm_domain.nesting_avail )
+                *ecx |= 1 << 5;    /* VMX capability */
+            break;
+
         case 0x80000001:
             /* SYSCALL is visible iff running in long mode. */
             hvm_get_segment_register(v, x86_seg_cs, &cs);
diff -r 0f0e32a70c02 -r 22df5f7ec6d3 xen/include/asm-x86/hvm/hvm.h
--- a/xen/include/asm-x86/hvm/hvm.h	Thu Apr 22 22:30:09 2010 +0800
+++ b/xen/include/asm-x86/hvm/hvm.h	Thu Apr 22 22:30:09 2010 +0800
@@ -272,7 +272,8 @@
         X86_CR4_DE  | X86_CR4_PSE | X86_CR4_PAE |       \
         X86_CR4_MCE | X86_CR4_PGE | X86_CR4_PCE |       \
         X86_CR4_OSFXSR | X86_CR4_OSXMMEXCPT |           \
-        (cpu_has_xsave ? X86_CR4_OSXSAVE : 0))))
+        (cpu_has_xsave ? X86_CR4_OSXSAVE : 0)   |       \
+        X86_CR4_VMXE)))
 
 /* These exceptions must always be intercepted. */
 #define HVM_TRAP_MASK ((1U << TRAP_machine_check) | (1U << TRAP_invalid_op))
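The two hunks above can be sketched in isolation: CPUID leaf 1 reports VMX in ECX bit 5 only when nesting is available for the domain, and CR4.VMXE (bit 13) stops being treated as a reserved bit. The helper names and the `nesting_avail` parameter below are hypothetical stand-ins for illustration, not the Xen functions themselves.

```c
#include <stdbool.h>
#include <stdint.h>

#define CPUID1_ECX_VMX  (1u << 5)    /* CPUID.1:ECX bit 5: VMX supported */
#define CR4_VMXE        (1ul << 13)  /* CR4 bit 13: VMX enable */

/* Hypothetical helper: ECX value the guest sees for CPUID leaf 1. */
static uint32_t guest_cpuid1_ecx(uint32_t ecx, bool nesting_avail)
{
    if (nesting_avail)
        ecx |= CPUID1_ECX_VMX;
    return ecx;
}

/* Hypothetical helper: does a CR4 value set bits outside the allowed set? */
static bool cr4_has_reserved_bits(unsigned long cr4, unsigned long allowed)
{
    return (cr4 & ~allowed) != 0;
}
```

With `CR4_VMXE` missing from the allowed set, a guest write that sets it counts as touching a reserved bit; once it is added, the same write passes the check.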
This patch adds virtual EPT capability to L1.
It's implemented as a simple per-vCPU, vTLB-like component,
independent of the domain-wide p2m.
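
A minimal sketch of the per-vCPU slot cache idea, assuming a fixed number of slots keyed by guest EPTP with least-recently-used replacement. It is an illustration only: `slot_cache` and `slot_cache_load` are hypothetical names, and the patch itself uses linked lists and also tracks the pages backing each slot.

```c
#include <stdint.h>
#include <string.h>

#define MAX_SLOTS 8   /* mirrors VEPT_MAX_SLOTS in the patch */

struct slot_cache {
    uint64_t eptp[MAX_SLOTS];  /* ordered oldest (index 0) to newest */
    int      used;             /* number of slots in use */
};

/*
 * Look up a guest EPTP, inserting it on a miss and evicting the
 * least recently used entry when the cache is full.
 * Returns 1 on a hit, 0 on a miss.
 */
static int slot_cache_load(struct slot_cache *c, uint64_t eptp)
{
    int i, hit = 0;

    for (i = 0; i < c->used; i++) {
        if (c->eptp[i] == eptp) {
            hit = 1;
            break;
        }
    }

    if (!hit && c->used == MAX_SLOTS)
        i = 0;                       /* evict the oldest entry */

    if (hit || c->used == MAX_SLOTS) {
        /* Remove entry i by shifting the younger entries down. */
        memmove(&c->eptp[i], &c->eptp[i + 1],
                (size_t)(c->used - 1 - i) * sizeof(c->eptp[0]));
        c->used--;
    }

    c->eptp[c->used++] = eptp;       /* newest entry goes to the tail */
    return hit;
}
```

Because each vCPU owns its own cache, no locking is needed around the lookup, which matches the per-vCPU design described above.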
Signed-off-by: Qing He <qing.he@intel.com>
---
 b/xen/arch/x86/hvm/vmx/vept.c        |  574 +++++++++++++++++++++++++++++++++++
 b/xen/include/asm-x86/hvm/vmx/vept.h |   10 
 xen/arch/x86/hvm/vmx/Makefile        |    1 
 xen/arch/x86/hvm/vmx/nest.c          |  136 +++++++-
 xen/arch/x86/hvm/vmx/vmx.c           |   13 
 xen/include/asm-x86/hvm/vmx/nest.h   |    7 
 6 files changed, 734 insertions(+), 7 deletions(-)
diff -r 22df5f7ec6d3 -r 7f54e6615e1e xen/arch/x86/hvm/vmx/Makefile
--- a/xen/arch/x86/hvm/vmx/Makefile	Thu Apr 22 22:30:09 2010 +0800
+++ b/xen/arch/x86/hvm/vmx/Makefile	Thu Apr 22 22:30:10 2010 +0800
@@ -6,3 +6,4 @@
 obj-y += vpmu.o
 obj-y += vpmu_core2.o
 obj-y += nest.o
+obj-y += vept.o
diff -r 22df5f7ec6d3 -r 7f54e6615e1e xen/arch/x86/hvm/vmx/nest.c
--- a/xen/arch/x86/hvm/vmx/nest.c	Thu Apr 22 22:30:09 2010 +0800
+++ b/xen/arch/x86/hvm/vmx/nest.c	Thu Apr 22 22:30:10 2010 +0800
@@ -26,6 +26,7 @@
 #include <asm/hvm/vmx/vmx.h>
 #include <asm/hvm/vmx/vvmcs.h>
 #include <asm/hvm/vmx/nest.h>
+#include <asm/hvm/vmx/vept.h>
 
 /*
  * VMX instructions support functions
@@ -295,6 +296,9 @@
     __vmptrld(virt_to_maddr(nest->hvmcs));
     v->arch.hvm_vmx.launched = 0;
 
+    nest->geptp = 0;
+    nest->vept = vept_init(v);
+
     vmreturn(regs, VMSUCCEED);
 
 out:
@@ -313,6 +317,9 @@
     if ( unlikely(!nest->guest_vmxon_pa) )
         goto invalid_op;
 
+    vept_teardown(nest->vept);
+    nest->vept = 0;
+
     nest->guest_vmxon_pa = 0;
     __vmpclear(virt_to_maddr(nest->svmcs));
 
@@ -529,6 +536,67 @@
     return vmx_nest_handle_vmresume(regs);
 }
 
+int vmx_nest_handle_invept(struct cpu_user_regs *regs)
+{
+    struct vcpu *v = current;
+    struct vmx_inst_decoded decode;
+    struct vmx_nest_struct *nest = &v->arch.hvm_vmx.nest;
+    mfn_t mfn;
+    u64 eptp;
+    int type;
+
+    if ( unlikely(!nest->guest_vmxon_pa) )
+        goto invalid_op;
+
+    decode_vmx_inst(regs, &decode);
+
+    hvm_copy_from_guest_virt(&eptp, decode.mem, sizeof(eptp), 0);
+    type = reg_read(regs, decode.reg2);
+
+    /* TODO: physical invept on other cpus */
+    switch ( type )
+    {
+    case 1:
+        mfn = vept_invalidate(nest->vept, eptp);
+        if ( eptp == nest->geptp )
+            nest->geptp = 0;
+
+        if ( __mfn_valid(mfn_x(mfn)) )
+            __invept(1, mfn_x(mfn) << PAGE_SHIFT | (eptp & 0xfff), 0);
+        break;
+    case 2:
+        vept_invalidate_all(nest->vept);
+        nest->geptp = 0;
+        break;
+    default:
+        gdprintk(XENLOG_ERR, "nest: unsupported invept type %d\n", type);
+        break;
+    }
+
+    vmreturn(regs, VMSUCCEED);
+
+    return X86EMUL_OKAY;
+
+invalid_op:
+    hvm_inject_exception(TRAP_invalid_op, 0, 0);
+    return X86EMUL_EXCEPTION;
+}
+
+int vmx_nest_vept(struct vcpu *v)
+{
+    struct vmx_nest_struct *nest = &v->arch.hvm_vmx.nest;
+    int r = 0;
+
+    if ( paging_mode_hap(v->domain) &&
+         (__get_vvmcs(nest->vvmcs, CPU_BASED_VM_EXEC_CONTROL) &
+          CPU_BASED_ACTIVATE_SECONDARY_CONTROLS) &&
+         (__get_vvmcs(nest->vvmcs, SECONDARY_VM_EXEC_CONTROL) &
+          SECONDARY_EXEC_ENABLE_EPT) )
+        r = 1;
+
+    return r;
+}
+
 /*
  * Nested VMX context switch
  */
@@ -739,7 +807,14 @@
     vvmcs_to_shadow(nest->vvmcs, CR0_GUEST_HOST_MASK);
     vvmcs_to_shadow(nest->vvmcs, CR4_GUEST_HOST_MASK);
 
-    /* TODO: PDPTRs for nested ept */
+    if ( vmx_nest_vept(v) )
+    {
+        vvmcs_to_shadow(nest->vvmcs, GUEST_PDPTR0);
+        vvmcs_to_shadow(nest->vvmcs, GUEST_PDPTR1);
+        vvmcs_to_shadow(nest->vvmcs, GUEST_PDPTR2);
+        vvmcs_to_shadow(nest->vvmcs, GUEST_PDPTR3);
+    }
+
     /* TODO: CR3 target control */
 }
 
@@ -787,14 +862,32 @@
     }
 #endif
 
+
+    /* loading EPT_POINTER for L2 */
+    if ( vmx_nest_vept(v) )
+    {
+        u64 geptp;
+        mfn_t mfn;
+
+        geptp = __get_vvmcs(nest->vvmcs, EPT_POINTER);
+        if ( geptp != nest->geptp )
+        {
+            mfn = vept_load_eptp(nest->vept, geptp);
+            nest->geptp = geptp;
+
+            __vmwrite(EPT_POINTER, (mfn_x(mfn) << PAGE_SHIFT) | 0x1e);
+#ifdef __i386__
+            __vmwrite(EPT_POINTER_HIGH, (mfn_x(mfn) << PAGE_SHIFT) >> 32);
+#endif
+        }
+    }
+
     regs->rip = __get_vvmcs(nest->vvmcs, GUEST_RIP);
     regs->rsp = __get_vvmcs(nest->vvmcs, GUEST_RSP);
     regs->rflags = __get_vvmcs(nest->vvmcs, GUEST_RFLAGS);
 
     /* updating host cr0 to sync TS bit */
     __vmwrite(HOST_CR0, v->arch.hvm_vmx.host_cr0);
-
-    /* TODO: EPT_POINTER */
 }
 
 static void sync_vvmcs_guest_state(struct vmx_nest_struct *nest)
@@ -1064,8 +1157,26 @@
         break;
     }
 
+    case EXIT_REASON_EPT_VIOLATION:
+    {
+        unsigned long exit_qualification = __vmread(EXIT_QUALIFICATION);
+        paddr_t gpa = __vmread(GUEST_PHYSICAL_ADDRESS);
+#ifdef __i386__
+        gpa |= (paddr_t)__vmread(GUEST_PHYSICAL_ADDRESS_HIGH) << 32;
+#endif
+        if ( vmx_nest_vept(v) )
+        {
+            if ( !vept_ept_violation(nest->vept, nest->geptp,
+                     exit_qualification, gpa) )
+                bypass_l0 = 1;
+            else
+                nest->vmexit_pending = 1;
+        }
+
+        break;
+    }
+
     case EXIT_REASON_WBINVD:
-    case EXIT_REASON_EPT_VIOLATION:
     case EXIT_REASON_EPT_MISCONFIG:
     case EXIT_REASON_EXTERNAL_INTERRUPT:
         /* pass to L0 handler */
@@ -1229,11 +1340,14 @@
         data = (data << 32) | eax;
         break;
     case MSR_IA32_VMX_PROCBASED_CTLS:
+        mask = paging_mode_hap(current->domain)?
+                   0: CPU_BASED_ACTIVATE_SECONDARY_CONTROLS;
+
         rdmsr(regs->ecx, eax, edx);
 #define REMOVED_EXEC_CONTROL_CAP (CPU_BASED_TPR_SHADOW \
-            | CPU_BASED_ACTIVATE_MSR_BITMAP            \
-            | CPU_BASED_ACTIVATE_SECONDARY_CONTROLS)
+            | CPU_BASED_ACTIVATE_MSR_BITMAP)
         data = edx & ~REMOVED_EXEC_CONTROL_CAP;
+        data &= ~mask;
         data = (data << 32) | eax;
         break;
     case MSR_IA32_VMX_EXIT_CTLS:
@@ -1254,12 +1368,20 @@
         data = (data << 32) | eax;
         break;
     case MSR_IA32_VMX_PROCBASED_CTLS2:
-        mask = 0;
+        mask = paging_mode_hap(current->domain)?
+                   SECONDARY_EXEC_ENABLE_EPT : 0;
 
         rdmsr(regs->ecx, eax, edx);
         data = edx & mask;
         data = (data << 32) | eax;
         break;
+    case MSR_IA32_VMX_EPT_VPID_CAP:
+        rdmsr(regs->ecx, eax, edx);
+#define REMOVED_EPT_VPID_CAP_HIGH   ( 1 | 1<<8 | 1<<9 | 1<<10 | 1<<11 )
+#define REMOVED_EPT_VPID_CAP_LOW    ( 1<<16 | 1<<17 | 1<<26 )
+        data = edx & ~REMOVED_EPT_VPID_CAP_HIGH;
+        data = (data << 32) | (eax & ~REMOVED_EPT_VPID_CAP_LOW);
+        break;
 
     /* pass through MSRs */
     case IA32_FEATURE_CONTROL_MSR:
diff -r 22df5f7ec6d3 -r 7f54e6615e1e xen/arch/x86/hvm/vmx/vept.c
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/xen/arch/x86/hvm/vmx/vept.c	Thu Apr 22 22:30:10 2010 +0800
@@ -0,0 +1,574 @@
+/*
+ * vept.c: virtual EPT for nested virtualization
+ *
+ * Copyright (c) 2010, Intel Corporation.
+ * Author: Qing He <qing.he@intel.com>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program; if not, write to the Free Software Foundation, Inc., 59 Temple
+ * Place - Suite 330, Boston, MA 02111-1307 USA.
+ *
+ */
+
+#include <xen/config.h>
+#include <xen/types.h>
+#include <xen/list.h>
+#include <xen/mm.h>
+#include <xen/paging.h>
+#include <xen/domain_page.h>
+#include <xen/sched.h>
+#include <asm/page.h>
+#include <xen/numa.h>
+#include <asm/hvm/vmx/vmx.h>
+#include <asm/hvm/vmx/vept.h>
+
+#undef mfn_to_page
+#define mfn_to_page(_m) __mfn_to_page(mfn_x(_m))
+#undef mfn_valid
+#define mfn_valid(_mfn) __mfn_valid(mfn_x(_mfn))
+#undef page_to_mfn
+#define page_to_mfn(_pg) _mfn(__page_to_mfn(_pg))
+
+/*
+ * This virtual EPT implementation is independent of the p2m facility
+ * and has some different characteristics. It works in a similar
+ * way to a shadow page table (guest table and host table composition),
+ * but is per-vCPU and vTLB style:
+ *   - per-vCPU, so no lock is required
+ *   - vTLB style means honoring all invalidations rather than relying
+ *     on write protection. Unlike ordinary page tables, EPT updates
+ *     and invalidations are minimal in a well-written VMM, so the
+ *     overhead is also minimized.
+ *
+ * The physical root is loaded directly into the L2 sVMCS, without entering
+ * any other host controls. Multiple `cache slots' are maintained
+ * for multiple guest EPTPs, with simple LRU replacement.
+ *
+ * One of the limitations so far is that it doesn't work with the
+ * L0 emulation code, so L1 p2m_mmio_direct on top of L0 p2m_mmio_dm
+ * is not supported for now.
+ */
+
+#define VEPT_MAX_SLOTS 8
+#define VEPT_ALLOCATION_SIZE 512
+
+struct vept_slot {
+    u64               eptp;   /* guest eptp */
+    mfn_t             root;   /* root of phys table */
+    struct list_head  list;
+
+    struct page_list_head page_list;
+};
+
+struct vept {
+    struct list_head   used_slots; /* lru: new->tail, old->head */
+    struct list_head   free_slots;
+
+    int                total_pages;
+    int                free_pages;
+    struct page_list_head freelist;
+
+    struct vcpu       *vcpu;
+};
+
+
+static struct vept_slot *__get_eptp_slot(struct vept *vept, u64 geptp)
+{
+    struct vept_slot *slot, *tmp;
+
+    list_for_each_entry_safe( slot, tmp, &vept->used_slots, list )
+        if ( slot->eptp == geptp )
+            return slot;
+
+    return NULL;
+}
+
+static struct vept_slot *get_eptp_slot(struct vept *vept, u64 geptp)
+{
+    struct vept_slot *slot;
+
+    slot = __get_eptp_slot(vept, geptp);
+    if ( slot != NULL )
+        list_del(&slot->list);
+
+    return slot;
+}
+
+static void __clear_slot(struct vept *vept, struct vept_slot *slot)
+{
+    struct page_info *pg;
+
+    slot->eptp = 0;
+
+    while ( !page_list_empty(&slot->page_list) )
+    {
+        pg = page_list_remove_head(&slot->page_list);
+        page_list_add_tail(pg, &vept->freelist);
+
+        vept->free_pages++;
+    }
+}
+
+static struct vept_slot *get_free_slot(struct vept *vept)
+{
+    struct vept_slot *slot = NULL;
+
+    if ( !list_empty(&vept->free_slots) )
+    {
+        slot = list_entry(vept->free_slots.next, struct vept_slot, list);
+        list_del(&slot->list);
+    }
+    else if ( !list_empty(&vept->used_slots) )
+    {
+        slot = list_entry(vept->used_slots.next, struct vept_slot, list);
+        list_del(&slot->list);
+        __clear_slot(vept, slot);
+    }
+
+    return slot;
+}
+
+static void clear_all_slots(struct vept *vept)
+{
+    struct vept_slot *slot, *tmp;
+
+    list_for_each_entry_safe( slot, tmp, &vept->used_slots, list )
+    {
+        list_del(&slot->list);
+        __clear_slot(vept, slot);
+        list_add_tail(&slot->list, &vept->free_slots);
+    }
+}
+
+static int free_some_pages(struct vept *vept, struct vept_slot *curr)
+{
+    struct vept_slot *slot;
+    int r = 0;
+
+    if ( !list_empty(&vept->used_slots) )
+    {
+        slot = list_entry(vept->used_slots.next, struct vept_slot, list);
+        if ( slot != curr )
+        {
+            list_del(&slot->list);
+            __clear_slot(vept, slot);
+            list_add_tail(&slot->list, &vept->free_slots);
+
+            r = 1;
+        }
+    }
+
+    return r;
+}
+
+struct vept *vept_init(struct vcpu *v)
+{
+    struct vept *vept;
+    struct vept_slot *slot;
+    struct page_info *pg;
+    int i;
+
+    vept = xmalloc(struct vept);
+    if ( vept == NULL )
+        goto out;
+
+    memset(vept, 0, sizeof(*vept));
+    vept->vcpu = v;
+
+    INIT_PAGE_LIST_HEAD(&vept->freelist);
+    INIT_LIST_HEAD(&vept->used_slots);
+    INIT_LIST_HEAD(&vept->free_slots);
+
+    for ( i = 0; i < VEPT_MAX_SLOTS; i++ )
+    {
+        slot = xmalloc(struct vept_slot);
+        if ( slot == NULL )
+            break;
+
+        memset(slot, 0, sizeof(*slot));
+
+        INIT_LIST_HEAD(&slot->list);
+        INIT_PAGE_LIST_HEAD(&slot->page_list);
+
+        list_add(&slot->list, &vept->free_slots);
+    }
+
+    for ( i = 0; i < VEPT_ALLOCATION_SIZE; i++ )
+    {
+        pg = alloc_domheap_page(NULL, MEMF_node(domain_to_node(v->domain)));
+        if ( pg == NULL )
+            break;
+
+        page_list_add_tail(pg, &vept->freelist);
+        vept->total_pages++;
+        vept->free_pages++;
+    }
+
+ out:
+    return vept;
+}
+
+void vept_teardown(struct vept *vept)
+{
+    struct page_info *pg;
+    struct vept_slot *slot, *tmp;
+
+    clear_all_slots(vept);
+
+    while ( !page_list_empty(&vept->freelist) )
+    {
+        pg = page_list_remove_head(&vept->freelist);
+        free_domheap_page(pg);
+        vept->free_pages--;
+        vept->total_pages--;
+    }
+
+    list_for_each_entry_safe( slot, tmp, &vept->free_slots, list )
+        xfree(slot);
+
+    xfree(vept);
+}
+
+mfn_t vept_load_eptp(struct vept *vept, u64 geptp)
+{
+    struct page_info *pg;
+    struct vept_slot *slot;
+    mfn_t mfn = _mfn(INVALID_MFN);
+    void *addr;
+
+    ASSERT(vept->vcpu == current);
+
+    slot = get_eptp_slot(vept, geptp);
+    if ( slot == NULL )
+    {
+        slot = get_free_slot(vept);
+        if ( unlikely(slot == NULL) )
+        {
+            gdprintk(XENLOG_ERR, "nest: can't get free slot\n");
+            return mfn;
+        }
+
+        while ( !vept->free_pages )
+            if ( !free_some_pages(vept, slot) )
+            {
+                slot->eptp = 0;
+                list_add_tail(&slot->list, &vept->free_slots);
+                gdprintk(XENLOG_ERR, "nest: vept no free pages\n");
+
+                return mfn;
+            }
+
+        vept->free_pages--;
+        pg = page_list_remove_head(&vept->freelist);
+
+        mfn = page_to_mfn(pg);
+        addr = map_domain_page(mfn_x(mfn));
+        clear_page(addr);
+        unmap_domain_page(addr);
+        page_list_add_tail(pg, &slot->page_list);
+        slot->eptp = geptp;
+        slot->root = mfn;
+    }
+
+    mfn = slot->root;
+    list_add_tail(&slot->list, &vept->used_slots);
+
+    return mfn;
+}
+
+mfn_t vept_invalidate(struct vept *vept, u64 geptp)
+{
+    struct vept_slot *slot;
+    mfn_t mfn = _mfn(INVALID_MFN);
+
+    ASSERT(vept->vcpu == current);
+
+    slot = get_eptp_slot(vept, geptp);
+    if ( slot != NULL )
+    {
+        mfn = slot->root;
+        __clear_slot(vept, slot);
+        list_add_tail(&slot->list, &vept->free_slots);
+    }
+
+    return mfn;
+}
+
+void vept_invalidate_all(struct vept *vept)
+{
+    ASSERT(vept->vcpu == current);
+
+    clear_all_slots(vept);
+}
+
+/*
+ * guest EPT walk and EPT violation
+ */
+struct ept_walk {
+    unsigned long gfn;
+    unsigned long gfn_remainder;
+    ept_entry_t l4e, l3e, l2e, l1e;
+    mfn_t l4mfn, l3mfn, l2mfn, l1mfn;
+    int sp;
+};
+typedef struct ept_walk ept_walk_t;
+
+#define GEPT_NORMAL_PAGE  0
+#define GEPT_SUPER_PAGE   1
+#define GEPT_NOT_PRESENT  2
+static int guest_ept_next_level(struct vcpu *v, ept_entry_t **table,
+               unsigned long *gfn_remainder, int level, u32 *ar,
+               ept_entry_t *entry, mfn_t *next_mfn)
+{
+    int index;
+    ept_entry_t *ept_entry;
+    ept_entry_t *next;
+    p2m_type_t p2mt;
+    int rc = GEPT_NORMAL_PAGE;
+    mfn_t mfn;
+
+    index = *gfn_remainder >> (level * EPT_TABLE_ORDER);
+
+    ept_entry = (*table) + index;
+    *entry = *ept_entry;
+    *ar &= entry->epte & 0x7;
+
+    *gfn_remainder &= (1UL << (level * EPT_TABLE_ORDER)) - 1;
+
+    if ( !(ept_entry->epte & 0x7) )
+        rc = GEPT_NOT_PRESENT;
+    else if ( ept_entry->sp_avail )
+        rc = GEPT_SUPER_PAGE;
+    else
+    {
+        mfn = gfn_to_mfn(v->domain, ept_entry->mfn, &p2mt);
+        if ( !p2m_is_ram(p2mt) )
+            return GEPT_NOT_PRESENT;
+
+        if ( next_mfn )
+        {
+            next = map_domain_page(mfn_x(mfn));
+            unmap_domain_page(*table);
+
+            *table = next;
+            *next_mfn = mfn;
+        }
+    }
+
+    return rc;
+}
+
+static u32 guest_walk_ept(struct vcpu *v, ept_walk_t *gw,
+                          u64 geptp, u64 ggpa)
+{
+    ept_entry_t *table;
+    p2m_type_t p2mt;
+    int rc;
+    u32 ar = 0x7;
+
+    unsigned long gfn = (unsigned long) (ggpa >> PAGE_SHIFT);
+    unsigned long gfn_remainder = gfn;
+
+    memset(gw, 0, sizeof(*gw));
+    gw->gfn = gfn;
+    gw->sp = 0;
+
+    gw->l4mfn = gfn_to_mfn(v->domain, geptp >> PAGE_SHIFT, &p2mt);
+    if ( !p2m_is_ram(p2mt) )
+        return 0;
+
+    table = map_domain_page(mfn_x(gw->l4mfn));
+
+    rc = guest_ept_next_level(v, &table, &gfn_remainder, 3, &ar,
+                              &gw->l4e, &gw->l3mfn);
+
+    if ( rc )
+        goto out;
+
+    rc = guest_ept_next_level(v, &table, &gfn_remainder, 2, &ar,
+                              &gw->l3e, &gw->l2mfn);
+
+    if ( rc == GEPT_SUPER_PAGE )
+        gw->sp = 2;
+    if ( rc )
+        goto out;
+
+    rc = guest_ept_next_level(v, &table, &gfn_remainder, 1, &ar,
+                              &gw->l2e, &gw->l1mfn);
+
+    if ( rc == GEPT_SUPER_PAGE )
+        gw->sp = 1;
+    if ( rc )
+        goto out;
+
+    rc = guest_ept_next_level(v, &table, &gfn_remainder, 0, &ar,
+                              &gw->l1e, NULL);
+
+ out:
+    gw->gfn_remainder = gfn_remainder;
+    unmap_domain_page(table);
+    return ar;
+}
+
+static void epte_set_ar_bits(ept_entry_t *entry, unsigned long ar)
+{
+    entry->epte &= ~0x7f;
+    entry->epte |= ar & 0x7f;
+}
+
+static int shadow_ept_next_level(struct vept *vept, struct vept_slot *slot,
+                       ept_entry_t **table, unsigned long *gfn_remainder,
+                       int level, u32 *ar, ept_entry_t gentry)
+{
+    int index;
+    ept_entry_t *sentry;
+    ept_entry_t *next;
+    mfn_t mfn;
+    struct page_info *pg;
+
+    index = *gfn_remainder >> (level * EPT_TABLE_ORDER);
+
+    sentry = (*table) + index;
+    *ar = sentry->epte & 0x7;
+
+    *gfn_remainder &= (1UL << (level * EPT_TABLE_ORDER)) - 1;
+
+    if ( !(sentry->epte & 0x7) )
+    {
+        while ( !vept->free_pages )
+            if ( !free_some_pages(vept, slot) )
+            {
+                gdprintk(XENLOG_ERR, "nest: vept no free pages\n");
+                return 0;
+            }
+
+        vept->free_pages--;
+        pg = page_list_remove_head(&vept->freelist);
+        page_list_add_tail(pg, &slot->page_list);
+        mfn = page_to_mfn(pg);
+        next = map_domain_page(mfn_x(mfn));
+        clear_page(next);
+
+        sentry->mfn = mfn_x(mfn);
+    }
+    else
+    {
+        next = map_domain_page(sentry->mfn);
+    }
+
+    epte_set_ar_bits(sentry, gentry.epte);
+
+    unmap_domain_page(*table);
+    *table = next;
+
+    return 1;
+}
+
+int vept_ept_violation(struct vept *vept, u64 geptp,
+                       unsigned long qualification, paddr_t addr)
+{
+    ept_walk_t gw;
+    struct vept_slot *slot;
+    ept_entry_t *table, *gept;
+    ept_entry_t *sentry, *gentry;
+    u32 old_entry, sp_ar = 0;
+    p2m_type_t p2mt;
+    unsigned long mfn_start = 0;
+    unsigned long gfn_remainder;
+    int rc, i;
+
+    ASSERT(vept->vcpu == current);
+
+    slot = __get_eptp_slot(vept, geptp);
+    if ( unlikely(slot == NULL) )
+        return 0;
+
+    rc = guest_walk_ept(vept->vcpu, &gw, geptp, addr);
+
+    if ( !(rc & (qualification & 0x7)) )    /* inject to guest */
+        return 1;
+
+    if ( gw.sp == 2 )  /* 1G */
+    {
+        sp_ar = gw.l3e.epte & 0x7;
+        mfn_start = gw.l3e.mfn +
+                    (gw.gfn_remainder & ~((1 << EPT_TABLE_ORDER) - 1));
+    }
+    else if ( gw.sp == 1 )  /* 2M */
+    {
+        sp_ar = gw.l2e.epte & 0x7;
+        mfn_start = gw.l2e.mfn;
+    }
+    else
+        mfn_start = 0;
+
+    table = map_domain_page(mfn_x(slot->root));
+    gfn_remainder = gw.gfn;
+
+    shadow_ept_next_level(vept, slot, &table, &gfn_remainder, 3,
+                          &old_entry, gw.l4e);
+
+    shadow_ept_next_level(vept, slot, &table, &gfn_remainder, 2,
+                          &old_entry, gw.l3e);
+
+    shadow_ept_next_level(vept, slot, &table, &gfn_remainder, 1,
+                          &old_entry, (gw.sp == 2) ? gw.l3e : gw.l2e);
+
+    /* if l1p is just allocated, do a full prefetch */
+    if ( !old_entry && !gw.sp )
+    {
+        gept = map_domain_page(mfn_x(gw.l1mfn));
+        for ( i = 0; i < 512; i++ )
+        {
+            gentry = gept + i;
+            sentry = table + i;
+            if ( gentry->epte & 0x7 )
+            {
+                sentry->mfn = mfn_x(gfn_to_mfn_guest(vept->vcpu->domain,
+                                        gentry->mfn, &p2mt));
+                epte_set_ar_bits(sentry, gentry->epte);
+            }
+            else
+                sentry->epte = 0;
+        }
+        unmap_domain_page(gept);
+    }
+    else if ( !old_entry && gw.sp )
+    {
+        for ( i = 0; i < 512; i++ )
+        {
+            sentry = table + i;
+            sentry->mfn = mfn_x(gfn_to_mfn_guest(vept->vcpu->domain,
+                                    mfn_start + i, &p2mt));
+            epte_set_ar_bits(sentry, sp_ar);
+        }
+    }
+    else if ( old_entry && !gw.sp )
+    {
+        i = gw.gfn & ((1 << EPT_TABLE_ORDER) - 1);
+        sentry = table + i;
+        sentry->mfn = mfn_x(gfn_to_mfn_guest(vept->vcpu->domain,
+                                gw.l1e.mfn, &p2mt));
+        epte_set_ar_bits(sentry, gw.l1e.epte);
+    }
+    else    // old_entry && gw.sp
+    {
+        i = gw.gfn & ((1 << EPT_TABLE_ORDER) - 1);
+        sentry = table + i;
+        sentry->mfn = mfn_x(gfn_to_mfn_guest(vept->vcpu->domain,
+                                mfn_start + i, &p2mt));
+        epte_set_ar_bits(sentry, sp_ar);
+    }
+
+    unmap_domain_page(table);
+    return 0;
+}
diff -r 22df5f7ec6d3 -r 7f54e6615e1e xen/arch/x86/hvm/vmx/vmx.c
--- a/xen/arch/x86/hvm/vmx/vmx.c	Thu Apr 22 22:30:09 2010 +0800
+++ b/xen/arch/x86/hvm/vmx/vmx.c	Thu Apr 22 22:30:10 2010 +0800
@@ -1032,6 +1032,14 @@
     p2m_type_t p2mt;
     char *p;
 
+    /*
+     * In nested EPT operation, L0 doesn't know how to interpret CR3;
+     * it's L1's responsibility to provide GUEST_PDPTRn, and we rely
+     * solely on them.
+     */
+    if ( v->arch.hvm_vcpu.in_nesting && vmx_nest_vept(v) )
+        return;
+
     /* EPT needs to load PDPTRS into VMCS for PAE. */
     if ( !hvm_pae_enabled(v) || (v->arch.hvm_vcpu.guest_efer & EFER_LMA) )
         return;
@@ -2705,6 +2713,11 @@
         if ( vmx_nest_handle_vmxon(regs) == X86EMUL_OKAY )
             __update_guest_eip(inst_len);
         break;
+    case EXIT_REASON_INVEPT:
+        inst_len = __get_instruction_length();
+        if ( vmx_nest_handle_invept(regs) == X86EMUL_OKAY )
+            __update_guest_eip(inst_len);
+        break;
 
     case EXIT_REASON_MWAIT_INSTRUCTION:
     case EXIT_REASON_MONITOR_INSTRUCTION:
diff -r 22df5f7ec6d3 -r 7f54e6615e1e xen/include/asm-x86/hvm/vmx/nest.h
--- a/xen/include/asm-x86/hvm/vmx/nest.h	Thu Apr 22 22:30:09 2010 +0800
+++ b/xen/include/asm-x86/hvm/vmx/nest.h	Thu Apr 22 22:30:10 2010 +0800
@@ -47,6 +47,9 @@
 
     unsigned long        intr_info;
     unsigned long        error_code;
+
+    u64                  geptp;
+    struct vept         *vept;
 };
 
 asmlinkage void vmx_nest_switch_mode(void);
@@ -64,6 +67,8 @@
 int vmx_nest_handle_vmresume(struct cpu_user_regs *regs);
 int vmx_nest_handle_vmlaunch(struct cpu_user_regs *regs);
 
+int vmx_nest_handle_invept(struct cpu_user_regs *regs);
+
 void vmx_nest_update_exec_control(struct vcpu *v, unsigned long value);
 void vmx_nest_update_secondary_exec_control(struct vcpu *v,
                                             unsigned long value);
@@ -81,4 +86,6 @@
 int vmx_nest_msr_write_intercept(struct cpu_user_regs *regs,
                                  u64 msr_content);
 
+int vmx_nest_vept(struct vcpu *v);
+
 #endif /* __ASM_X86_HVM_NEST_H__ */
diff -r 22df5f7ec6d3 -r 7f54e6615e1e xen/include/asm-x86/hvm/vmx/vept.h
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/xen/include/asm-x86/hvm/vmx/vept.h	Thu Apr 22 22:30:10 2010 +0800
@@ -0,0 +1,10 @@
+#include <asm/hvm/vmx/vmx.h>
+
+
+struct vept *vept_init(struct vcpu *v);
+void vept_teardown(struct vept *vept);
+mfn_t vept_load_eptp(struct vept *vept, u64 eptp);
+mfn_t vept_invalidate(struct vept *vept, u64 eptp);
+void vept_invalidate_all(struct vept *vept);
+int vept_ept_violation(struct vept *vept, u64 eptp,
+                       unsigned long qualification, paddr_t addr);
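
The guest and shadow walks in vept.c both pick a table index per level by shifting the guest frame number, and the L2 EPT_POINTER is built by OR-ing 0x1e into the root address. The sketch below shows that arithmetic on its own, assuming 4 KiB pages and 9 index bits per level (EPT_TABLE_ORDER is not defined in this patch, so the value 9 is an assumption consistent with the 512-entry loops). `ept_index` and `make_eptp` are hypothetical helpers, and reading 0x1e as write-back memory type (6) plus a 4-level walk (3 << 3) follows the architectural EPTP layout rather than anything stated in the patch.

```c
#include <stdint.h>

#define PAGE_SHIFT       12
#define EPT_TABLE_ORDER  9    /* assumed: 512 entries per table */

#define EPTP_MEMTYPE_WB  6            /* bits 2:0, write-back */
#define EPTP_WALK_LEN_4  (3 << 3)     /* bits 5:3, walk length minus one */

/* Index into the table at `level` (3 = top level, 0 = leaf). */
static unsigned int ept_index(uint64_t gpa, int level)
{
    uint64_t gfn = gpa >> PAGE_SHIFT;

    return (unsigned int)((gfn >> (level * EPT_TABLE_ORDER)) &
                          ((1u << EPT_TABLE_ORDER) - 1));
}

/* Build an EPT pointer from the machine frame of the root table. */
static uint64_t make_eptp(uint64_t root_mfn)
{
    return (root_mfn << PAGE_SHIFT) | EPTP_MEMTYPE_WB | EPTP_WALK_LEN_4;
}
```

The patch keeps a running `gfn_remainder` that is masked after each level instead of masking the full frame number each time; both approaches yield the same per-level index.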
xentrace can now trace virtual vmexit and virtual vmentry
in a separate category.
Signed-off-by: Qing He <qing.he@intel.com>
---
 tools/xentrace/formats          |    2 ++
 xen/arch/x86/hvm/vmx/nest.c     |   10 ++++++++++
 xen/include/asm-x86/hvm/trace.h |    3 +++
 xen/include/public/trace.h      |    4 ++++
 4 files changed, 19 insertions(+)
diff -r 7f54e6615e1e -r 3a7b55a0be9c tools/xentrace/formats
--- a/tools/xentrace/formats	Thu Apr 22 22:30:10 2010 +0800
+++ b/tools/xentrace/formats	Thu Apr 22 22:30:10 2010 +0800
@@ -70,6 +70,8 @@
 0x00082018  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  CLTS
 0x00082019  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  LMSW        [ value = 0x%(1)08x ]
 0x00082119  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  LMSW        [ value = 0x%(2)08x%(1)08x ]
+0x00083001  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  NEST_ENTRY  [ entry_intr = 0x%(1)08x, rIP  = 0x%(3)08x%(2)08x ]
+0x00083002  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  NEST_EXIT   [ exitcode = 0x%(1)08x, rIP  = 0x%(3)08x%(2)08x ]
 
 0x0010f001  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  page_grant_map      [ domid = %(1)d ]
 0x0010f002  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  page_grant_unmap    [ domid = %(1)d ]
diff -r 7f54e6615e1e -r 3a7b55a0be9c xen/arch/x86/hvm/vmx/nest.c
--- a/xen/arch/x86/hvm/vmx/nest.c	Thu Apr 22 22:30:10 2010 +0800
+++ b/xen/arch/x86/hvm/vmx/nest.c	Thu Apr 22 22:30:10 2010 +0800
@@ -888,6 +888,11 @@
 
     /* updating host cr0 to sync TS bit */
     __vmwrite(HOST_CR0, v->arch.hvm_vmx.host_cr0);
+
+    HVMTRACE_ND(NEST_ENTRY, 1/*cycles*/, 3,
+                __get_vvmcs(nest->vvmcs, VM_ENTRY_INTR_INFO),
+                (uint32_t)regs->rip, (uint32_t)((uint64_t)regs->rip >> 32),
+                0, 0, 0);
 }
 
 static void sync_vvmcs_guest_state(struct vmx_nest_struct *nest)
@@ -1026,6 +1031,11 @@
     /* updating host cr0 to sync TS bit */
     __vmwrite(HOST_CR0, v->arch.hvm_vmx.host_cr0);
 
+    HVMTRACE_ND(NEST_EXIT, 1/*cycles*/, 3,
+                __get_vvmcs(nest->vvmcs, VM_EXIT_REASON),
+                (uint32_t)regs->rip, (uint32_t)((uint64_t)regs->rip >> 32),
+                0, 0, 0);
+
     vmreturn(regs, VMSUCCEED);
 }
 
diff -r 7f54e6615e1e -r 3a7b55a0be9c xen/include/asm-x86/hvm/trace.h
--- a/xen/include/asm-x86/hvm/trace.h	Thu Apr 22 22:30:10 2010 +0800
+++ b/xen/include/asm-x86/hvm/trace.h	Thu Apr 22 22:30:10 2010 +0800
@@ -13,6 +13,7 @@
 #define DEFAULT_HVM_REGACCESS  DEFAULT_HVM_TRACE_ON
 #define DEFAULT_HVM_MISC       DEFAULT_HVM_TRACE_ON
 #define DEFAULT_HVM_INTR       DEFAULT_HVM_TRACE_ON
+#define DEFAULT_HVM_NEST       DEFAULT_HVM_TRACE_ON
 
 #define DO_TRC_HVM_VMENTRY     DEFAULT_HVM_VMSWITCH
 #define DO_TRC_HVM_VMEXIT      DEFAULT_HVM_VMSWITCH
@@ -49,6 +50,8 @@
 #define DO_TRC_HVM_CLTS        DEFAULT_HVM_MISC
 #define DO_TRC_HVM_LMSW        DEFAULT_HVM_MISC
 #define DO_TRC_HVM_LMSW64      DEFAULT_HVM_MISC
+#define DO_TRC_HVM_NEST_ENTRY  DEFAULT_HVM_NEST
+#define DO_TRC_HVM_NEST_EXIT   DEFAULT_HVM_NEST
 
 
 #ifdef __x86_64__
diff -r 7f54e6615e1e -r 3a7b55a0be9c xen/include/public/trace.h
--- a/xen/include/public/trace.h	Thu Apr 22 22:30:10 2010 +0800
+++ b/xen/include/public/trace.h	Thu Apr 22 22:30:10 2010 +0800
@@ -51,6 +51,7 @@
 /* trace subclasses for SVM */
 #define TRC_HVM_ENTRYEXIT 0x00081000   /* VMENTRY and #VMEXIT       */
 #define TRC_HVM_HANDLER   0x00082000   /* various HVM handlers      */
+#define TRC_HVM_NEST      0x00083000   /* nested virtualization     */
 
 #define TRC_SCHED_MIN       0x00021000   /* Just runstate changes */
 #define TRC_SCHED_CLASS     0x00022000   /* Scheduler-specific    */
@@ -161,6 +162,9 @@
 #define TRC_HVM_IOPORT_WRITE    (TRC_HVM_HANDLER + 0x216)
 #define TRC_HVM_IOMEM_WRITE     (TRC_HVM_HANDLER + 0x217)
 
+#define TRC_HVM_NEST_ENTRY      (TRC_HVM_NEST + 0x01)
+#define TRC_HVM_NEST_EXIT       (TRC_HVM_NEST + 0x02)
+
 /* trace subclasses for power management */
 #define TRC_PM_FREQ     0x00801000      /* xen cpu freq events */
 #define TRC_PM_IDLE     0x00802000      /* xen cpu idle events */
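
The event numbering added by this patch can be sketched in Python. The three subclass values and the two offsets are copied from the diff above; event_subclass and split_rip are hypothetical helper names, and the 0xfffff000 mask is an assumption inferred from the 0x0008x000 layout of the neighbouring subclasses, not something the patch defines.

```python
# Values copied from the patch to xen/include/public/trace.h.
TRC_HVM_ENTRYEXIT = 0x00081000
TRC_HVM_HANDLER = 0x00082000
TRC_HVM_NEST = 0x00083000

TRC_HVM_NEST_ENTRY = TRC_HVM_NEST + 0x01
TRC_HVM_NEST_EXIT = TRC_HVM_NEST + 0x02

# Assumption: the low 12 bits select the event inside a subclass.
SUBCLASS_MASK = 0xFFFFF000


def event_subclass(event):
    """Return the subclass part of a trace event id (hypothetical helper)."""
    return event & SUBCLASS_MASK


def split_rip(rip):
    """Split a 64-bit RIP into (low, high) 32-bit words, mirroring the two
    casts passed to HVMTRACE_ND(NEST_EXIT, ...) in the hunk above."""
    return rip & 0xFFFFFFFF, (rip >> 32) & 0xFFFFFFFF
```

A trace consumer would reassemble the address as (high << 32) | low.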
Qing He
2010-Apr-22  09:41 UTC
[Xen-devel] [PATCH 17/17] tools: nest: allow enabling nesting
Add hvm config option to allow nesting
Signed-off-by: Qing He <qing.he@intel.com>
---
 examples/xmexample.hvm            |   12 ++++++++++++
 python/xen/xend/XendConfig.py     |    4 ++++
 python/xen/xend/XendDomainInfo.py |    5 ++++-
 python/xen/xm/create.py           |    7 ++++++-
 python/xen/xm/xenapi_create.py    |    1 +
 5 files changed, 27 insertions(+), 2 deletions(-)
diff -r 3a7b55a0be9c -r 682f3d39f719 tools/examples/xmexample.hvm
--- a/tools/examples/xmexample.hvm	Thu Apr 22 22:30:10 2010 +0800
+++ b/tools/examples/xmexample.hvm	Thu Apr 22 22:39:56 2010 +0800
@@ -371,3 +371,15 @@
 #
 
 #vscsi = [ '/dev/sdx, 0:0:0:0' ]
+
+#   Enable nested virtualization
+#
+#   Turning this option on exposes virtualization hardware to the
+# guest, so that the guest can use VMX or SVM to run another guest.
+# Turning the option off not only masks availability reporting, but
+# also disables all related controls, including related instructions,
+# cpuid, msr, etc., for security reasons.
+#
+# This feature is experimental, and the default is off.
+#
+#nesting=0
diff -r 3a7b55a0be9c -r 682f3d39f719 tools/python/xen/xend/XendConfig.py
--- a/tools/python/xen/xend/XendConfig.py	Thu Apr 22 22:30:10 2010 +0800
+++ b/tools/python/xen/xend/XendConfig.py	Thu Apr 22 22:39:56 2010 +0800
@@ -176,6 +176,7 @@
     'vhpt': int,
     'guest_os_type': str,
     'hap': int,
+    'nesting': int,
     'xen_extended_power_mgmt': int,
     'pci_msitranslate': int,
     'pci_power_mgmt': int,
@@ -2219,6 +2220,9 @@
     def is_hap(self):
         return self['platform'].get('hap', 0)
 
+    def is_nesting(self):
+        return self['platform'].get('nesting', 0)
+
     def is_pv_and_has_pci(self):
         for dev_type, dev_info in self.all_devices_sxpr():
             if dev_type != 'pci':
diff -r 3a7b55a0be9c -r 682f3d39f719 tools/python/xen/xend/XendDomainInfo.py
--- a/tools/python/xen/xend/XendDomainInfo.py	Thu Apr 22 22:30:10 2010 +0800
+++ b/tools/python/xen/xend/XendDomainInfo.py	Thu Apr 22 22:39:56 2010 +0800
@@ -2511,9 +2511,11 @@
         self.restart_in_progress = False
 
         hap = 0
+        nesting = 0
         hvm = self.info.is_hvm()
         if hvm:
             hap = self.info.is_hap()
+            nesting = self.info.is_nesting()
             info = xc.xeninfo()
             if 'hvm' not in info['xen_caps']:
                 raise VmError("HVM guest support is unavailable: is VT/AMD-V "
@@ -2540,7 +2542,8 @@
         oos = self.info['platform'].get('oos', 1)
         oos_off = 1 - int(oos)
 
-        flags = (int(hvm) << 0) | (int(hap) << 1) | (int(s3_integrity) << 2) | (int(oos_off) << 3)
+        flags = (int(hvm) << 0) | (int(hap) << 1) | (int(s3_integrity) << 2) \
+              | (int(oos_off) << 3) | (int(nesting) << 4)
 
         try:
             self.domid = xc.domain_create(
diff -r 3a7b55a0be9c -r 682f3d39f719 tools/python/xen/xm/create.py
--- a/tools/python/xen/xm/create.py	Thu Apr 22 22:30:10 2010 +0800
+++ b/tools/python/xen/xm/create.py	Thu Apr 22 22:39:56 2010 +0800
@@ -643,6 +643,11 @@
           use="""Should out-of-sync shadow page tabled be enabled?
           (0=OOS is disabled; 1=OOS is enabled.""")
 
+gopts.var('nesting', val='Nesting',
+          fn=set_int, default=0,
+          use="""Nesting availability (0=nesting is forbidden;
+          1=nesting is allowed.""")
+
 gopts.var('cpuid', val="IN[,SIN]:eax=EAX,ebx=EBX,ecx=ECX,edx=EDX",
           fn=append_value, default=[],
           use="""Cpuid description.""")
@@ -1065,7 +1070,7 @@
              'device_model', 'display',
              'fda', 'fdb',
              'gfx_passthru', 'guest_os_type',
-             'hap', 'hpet',
+             'hap', 'hpet', 'nesting',
              ''isa'',
              ''keymap'',
              ''localtime'',
diff -r 3a7b55a0be9c -r 682f3d39f719 tools/python/xen/xm/xenapi_create.py
--- a/tools/python/xen/xm/xenapi_create.py	Thu Apr 22 22:30:10 2010 +0800
+++ b/tools/python/xen/xm/xenapi_create.py	Thu Apr 22 22:39:56 2010 +0800
@@ -1105,6 +1105,7 @@
             'guest_os_type',
             'hap',
             'oos',
+            'nesting',
             'pci_msitranslate',
             'pci_power_mgmt',
             'xen_platform_pci',
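
The flags word that XendDomainInfo.py hands to xc.domain_create() can be sketched as below. The bit positions come from the patched flags expression; pack_domain_flags and the FLAG_* names are hypothetical and exist only for this illustration. One deliberate difference from the patch: each input is normalised to 0 or 1 first, whereas the patch shifts int(nesting) directly, so a config value such as nesting=2 would land in bit 5 rather than bit 4.

```python
# Bit positions as used by the flags expression in XendDomainInfo.py.
FLAG_HVM = 0
FLAG_HAP = 1
FLAG_S3_INTEGRITY = 2
FLAG_OOS_OFF = 3
FLAG_NESTING = 4  # added by this patch


def _bit(value, position):
    """Normalise value to 0/1 and move it to the given bit position."""
    return int(bool(value)) << position


def pack_domain_flags(hvm, hap, s3_integrity, oos, nesting):
    """Build the domain-creation flags word (hypothetical helper).

    Note that the config option is 'oos' (enabled), while the flag bit
    means 'oos off', so the value is inverted before packing.
    """
    oos_off = 1 - int(bool(oos))
    return (_bit(hvm, FLAG_HVM)
            | _bit(hap, FLAG_HAP)
            | _bit(s3_integrity, FLAG_S3_INTEGRITY)
            | _bit(oos_off, FLAG_OOS_OFF)
            | _bit(nesting, FLAG_NESTING))
```

With hvm=1, hap=1, oos=1 and nesting=1 this yields bits 0, 1 and 4.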
Christoph Egger
2010-Apr-22  10:15 UTC
Re: [Xen-devel] [PATCH 00/17][RFC] Nested virtualization for VMX
On Thursday 22 April 2010 11:41:12 Qing He wrote:
> This patch set enables nested virtualization for VMX, That
> is to allow a VMX guest (L1) to run other VMX guests (L2).
>
> The patch can generally run on different configurations:
>   - EPT-on-EPT, shadow-on-EPT, shadow-on-shadow
>   - different 32/64 combination of L1 and L2
>   - L1/L2 SMP
>
> EPT-on-EPT is however, preferrable due to performance
> advantage, I've tested the patch on a 64bit NHM L0,
> against Xen cs. 21190. With EPT-on-EPT and a a kernel
> build workload, L2 needs around 17% more time to complete.
>
>
> Known problems:
>   - L1/L2=64/64, shadow-on-shadow doesn't work as for now
>   - On 21190, even without nested patchset, Xen as L1
>     suffers a considerable booting lag, this phenomenon
>     was not observed on my previous base, around cs.
>     20200

I can reproduce this bug as well. Last known working c/s is 20382
and known broken c/s is 20390.
Potential candidates are c/s 20384, 20386, 20389 and 20390
which introduced the bug.
I wasn't able to verify c/s 20384, 20386 and 20389 due to
build or boot problems.

>   - multiple L2 in one L1 hasn't been tested

I can run Windows 7 and NetBSD with hap-on-hap and with shadow-on-hap
simultaneously. Thanks to Tim's review I found and fixed some bugs.
I have Windows 7 XP mode working.
I will resend my nested virtualization patchset with the fixes soon.

Do you also have a paper on how your patchset works?

>
>
> The patch list is as below, it contains 3 preparation
> patches (01 -- 03), 11 generic patches (04 -- 14), 1 to
> enable EPT-on-EPT (15), and 2 support patches (16, 17).
>
> [PATCH 01/17] vmx: nest: fix CR4.VME in update_guest_cr
> [PATCH 02/17] vmx: nest: rename host_vmcs
> [PATCH 03/17] vmx: nest: wrapper for control update
> [PATCH 04/17] vmx: nest: domain and vcpu flags
> [PATCH 05/17] vmx: nest: nested control structure
> [PATCH 06/17] vmx: nest: virtual vmcs layout
> [PATCH 07/17] vmx: nest: handling VMX instruction exits
> [PATCH 08/17] vmx: nest: L1 <-> L2 context switch
> [PATCH 09/17] vmx: nest: interrupt
> [PATCH 10/17] vmx: nest: VMExit handler in L2
> [PATCH 11/17] vmx: nest: L2 tsc
> [PATCH 12/17] vmx: nest: CR0.TS and #NM
> [PATCH 13/17] vmx: nest: capability reporting MSRs
> [PATCH 14/17] vmx: nest: enable virtual VMX
> [PATCH 15/17] vmx: nest: virtual ept for nested
> [PATCH 16/17] vmx: nest: hvmtrace for nested
> [PATCH 17/17] tools: nest: allow enabling nesting
>
> Thanks,
> Qing He

-- 
---to satisfy European Law for business letters:
Advanced Micro Devices GmbH
Karl-Hammerschmidt-Str. 34, 85609 Dornach b. Muenchen
Geschaeftsfuehrer: Andrew Bowd, Thomas M. McCoy, Giuliano Meroni
Sitz: Dornach, Gemeinde Aschheim, Landkreis Muenchen
Registergericht Muenchen, HRB Nr. 43632
He, Qing
2010-Apr-23  10:10 UTC
RE: [Xen-devel] [PATCH 00/17][RFC] Nested virtualization for VMX
On Thu, 2010-04-22 at 18:16 +0800, Christoph Egger wrote:
>On Thursday 22 April 2010 11:41:12 Qing He wrote:
>>   - On 21190, even without nested patchset, Xen as L1
>>     suffers a considerable booting lag, this phenomenon
>>     was not observed on my previous base, around cs.
>>     20200
>
>I can reproduce this bug as well. Last known working c/s is 20382
>and known broken c/s is 20390.
>Potential candidates are c/s 20384, 20386, 20389 and 20390
>which introduced the bug.
>I wasn't able to verify c/s 20384, 20386 and 20389 due to
>build or boot problems.

That's a much narrower range, which should make the root cause easier
to find. I tried to bisect when I first hit the problem, but didn't get
much out of it because of ioemu/xen dependencies.

>
>Do you also have a paper how your patchset works ?

While I don't have a long description of the details, there was a
Xen Summit talk explaining the basic ideas; the slides can be found at
http://www.xen.org/files/xensummit_intel09/xensummit-nested-virt.pdf

Basically, it's built on homogeneity to gain better performance.
A "shadow VMCS" is constructed from the host VMCS and the virtual VMCS,
and is then loaded as the physical VMCS to control the L2 guest's
behavior. The policy for shadow VMCS construction and VMExit handling
is the result of inspecting individual fields and VMExit types. There
are some comments in the code describing the policy used, which you can
have a look at if interested.

Thanks,
Qing
Tim Deegan
2010-May-20  09:26 UTC
Re: [Xen-devel] [PATCH 01/17] vmx: nest: fix CR4.VME in update_guest_cr
At 10:41 +0100 on 22 Apr (1271932873), Qing He wrote:
> X86_CR4_VME in guest_cr[4] is updated in cr0 handling, but not in
> cr4 handling, fix it for guest VM86.

Nack.  This patch doesn't actually do anything.

Cheers,

Tim.

> Signed-off-by: Qing He <qing.he@intel.com>
>
> ---
>  vmx.c |    3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff -r 9be1d3918ec7 -r ca507122f84e xen/arch/x86/hvm/vmx/vmx.c
> --- a/xen/arch/x86/hvm/vmx/vmx.c Wed Apr 21 23:43:59 2010 +0800
> +++ b/xen/arch/x86/hvm/vmx/vmx.c Thu Apr 22 21:28:41 2010 +0800
> @@ -1174,7 +1174,8 @@
>          if ( paging_mode_hap(v->domain) )
>              v->arch.hvm_vcpu.hw_cr[4] &= ~X86_CR4_PAE;
>          v->arch.hvm_vcpu.hw_cr[4] |= v->arch.hvm_vcpu.guest_cr[4];
> -        if ( v->arch.hvm_vmx.vmx_realmode )
> +        if ( v->arch.hvm_vmx.vmx_realmode ||
> +             (v->arch.hvm_vcpu.hw_cr[4] & X86_CR4_VME) )
>              v->arch.hvm_vcpu.hw_cr[4] |= X86_CR4_VME;
>          if ( paging_mode_hap(v->domain) && !hvm_paging_enabled(v) )
>          {

-- 
Tim Deegan <Tim.Deegan@citrix.com>
Principal Software Engineer, XenServer Engineering
Citrix Systems UK Ltd.  (Company #02937203, SL9 0BG)
Tim Deegan
2010-May-20  09:34 UTC
Re: [Xen-devel] [PATCH 03/17] vmx: nest: wrapper for control update
At 10:41 +0100 on 22 Apr (1271932875), Qing He wrote:
> In nested virtualization, the L0 controls may not be the same
> with controls in physical VMCS.
> Explict maintain guest controls in variables and use wrappers
> for control update, do not rely on physical control value.
>
> Signed-off-by: Qing He <qing.he@intel.com>

> diff -r fe49b7452637 -r a0bbec37b529 xen/arch/x86/hvm/vmx/vmcs.c
> --- a/xen/arch/x86/hvm/vmx/vmcs.c Thu Apr 22 21:49:38 2010 +0800
> +++ b/xen/arch/x86/hvm/vmx/vmcs.c Thu Apr 22 21:49:38 2010 +0800
> @@ -737,10 +737,10 @@
>      __vmwrite(VMCS_LINK_POINTER_HIGH, ~0UL);
>  #endif
>
> -    __vmwrite(EXCEPTION_BITMAP,
> -              HVM_TRAP_MASK
> +    v->arch.hvm_vmx.exception_bitmap = HVM_TRAP_MASK
>                | (paging_mode_hap(d) ? 0 : (1U << TRAP_page_fault))
> -              | (1U << TRAP_no_device));
> +              | (1U << TRAP_no_device);
> +    __vmwrite(EXCEPTION_BITMAP, v->arch.hvm_vmx.exception_bitmap);

Shouldn't this use the new vmx_update_exception_bitmap()?

Cheers,

Tim.

-- 
Tim Deegan <Tim.Deegan@citrix.com>
Principal Software Engineer, XenServer Engineering
Citrix Systems UK Ltd.  (Company #02937203, SL9 0BG)
Qing He
2010-May-20  09:36 UTC
Re: [Xen-devel] [PATCH 01/17] vmx: nest: fix CR4.VME in update_guest_cr
On Thu, 2010-05-20 at 17:26 +0800, Tim Deegan wrote:
> At 10:41 +0100 on 22 Apr (1271932873), Qing He wrote:
> > X86_CR4_VME in guest_cr[4] is updated in cr0 handling, but not in
> > cr4 handling, fix it for guest VM86.
>
> Nack.  This patch doesn't actually do anything.
>

Thank you. I intended to write
'if ( realmode || (guest_cr[4] & VME) ) hw_cr[4]...', and only just now
noticed that there is a hw_cr[4] |= guest_cr[4] right above it.

> Cheers,
>
> Tim.
>
> > Signed-off-by: Qing He <qing.he@intel.com>
> >
> > ---
> >  vmx.c |    3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff -r 9be1d3918ec7 -r ca507122f84e xen/arch/x86/hvm/vmx/vmx.c
> > --- a/xen/arch/x86/hvm/vmx/vmx.c Wed Apr 21 23:43:59 2010 +0800
> > +++ b/xen/arch/x86/hvm/vmx/vmx.c Thu Apr 22 21:28:41 2010 +0800
> > @@ -1174,7 +1174,8 @@
> >          if ( paging_mode_hap(v->domain) )
> >              v->arch.hvm_vcpu.hw_cr[4] &= ~X86_CR4_PAE;
> >          v->arch.hvm_vcpu.hw_cr[4] |= v->arch.hvm_vcpu.guest_cr[4];
> > -        if ( v->arch.hvm_vmx.vmx_realmode )
> > +        if ( v->arch.hvm_vmx.vmx_realmode ||
> > +             (v->arch.hvm_vcpu.hw_cr[4] & X86_CR4_VME) )
> >              v->arch.hvm_vcpu.hw_cr[4] |= X86_CR4_VME;
> >          if ( paging_mode_hap(v->domain) && !hvm_paging_enabled(v) )
> >          {
>
> --
> Tim Deegan <Tim.Deegan@citrix.com>
> Principal Software Engineer, XenServer Engineering
> Citrix Systems UK Ltd.  (Company #02937203, SL9 0BG)
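
Why the posted hunk has no effect can be checked with a small model of the two code paths. This is only a sketch of the lines quoted in the diff: the function names are made up for the comparison, and the HAP/PAE handling around them is left out because both versions share it.

```python
X86_CR4_VME = 0x001  # CR4 bit 0


def hw_cr4_before(hw_cr4, guest_cr4, realmode):
    """The original logic: force VME only for emulated real mode."""
    hw_cr4 |= guest_cr4
    if realmode:
        hw_cr4 |= X86_CR4_VME
    return hw_cr4


def hw_cr4_after(hw_cr4, guest_cr4, realmode):
    """The logic from patch 01: also test VME in hw_cr4 itself."""
    hw_cr4 |= guest_cr4
    if realmode or (hw_cr4 & X86_CR4_VME):
        hw_cr4 |= X86_CR4_VME  # only reached via the new test if VME is already set
    return hw_cr4


def patch_changes_anything():
    """Compare both versions over every combination of the low six CR4 bits."""
    for hw_cr4 in range(0x40):
        for guest_cr4 in range(0x40):
            for realmode in (False, True):
                before = hw_cr4_before(hw_cr4, guest_cr4, realmode)
                after = hw_cr4_after(hw_cr4, guest_cr4, realmode)
                if before != after:
                    return True
    return False
```

The added condition can only be true when the VME bit is already set in hw_cr[4], so setting it again is a no-op, which matches the Nack.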
Tim Deegan
2010-May-20  09:37 UTC
Re: [Xen-devel] [PATCH 04/17] vmx: nest: domain and vcpu flags
At 10:41 +0100 on 22 Apr (1271932876), Qing He wrote:
> Introduce a domain create flag to allow user to set availability
> of nested virtualization.
> The flag will be used to disable all reporting and function
> facilities, improving guest security.

I have the same reservation about this as Christoph's patch: I don't
think this needs to be a create-time flag - there's no reason it can't
be enabled or disabled with a domctl after domain creation.  (And of
course we'll want it to be the same interface on both SVM and VMX.)

Tim.

> Another per vcpu flag is used to indicate whether the vcpu
> is in L1 or L2 context.
>
> Signed-off-by: Qing He <qing.he@intel.com>
>
> ---
>  arch/x86/domain.c            |    4 ++++
>  common/domctl.c              |    5 ++++-
>  include/asm-x86/hvm/domain.h |    1 +
>  include/asm-x86/hvm/vcpu.h   |    2 ++
>  include/public/domctl.h      |    3 +++
>  include/xen/sched.h          |    3 +++
>  6 files changed, 17 insertions(+), 1 deletion(-)
>
> diff -r a0bbec37b529 -r 6f0f41f80285 xen/arch/x86/domain.c
> --- a/xen/arch/x86/domain.c Thu Apr 22 21:49:38 2010 +0800
> +++ b/xen/arch/x86/domain.c Thu Apr 22 22:30:00 2010 +0800
> @@ -413,6 +413,10 @@
>
>      d->arch.s3_integrity = !!(domcr_flags & DOMCRF_s3_integrity);
>
> +    d->arch.hvm_domain.nesting_avail =
> +        is_hvm_domain(d) &&
> +        (domcr_flags & DOMCRF_nesting);
> +
>      INIT_LIST_HEAD(&d->arch.pdev_list);
>
>      d->arch.relmem = RELMEM_not_started;
> diff -r a0bbec37b529 -r 6f0f41f80285 xen/common/domctl.c
> --- a/xen/common/domctl.c Thu Apr 22 21:49:38 2010 +0800
> +++ b/xen/common/domctl.c Thu Apr 22 22:30:00 2010 +0800
> @@ -393,7 +393,8 @@
>          if ( supervisor_mode_kernel ||
>               (op->u.createdomain.flags &
>               ~(XEN_DOMCTL_CDF_hvm_guest | XEN_DOMCTL_CDF_hap |
> -               XEN_DOMCTL_CDF_s3_integrity | XEN_DOMCTL_CDF_oos_off)) )
> +               XEN_DOMCTL_CDF_s3_integrity | XEN_DOMCTL_CDF_oos_off |
> +               XEN_DOMCTL_CDF_nesting)) )
>              break;
>
>          dom = op->domain;
> @@ -429,6 +430,8 @@
>              domcr_flags |= DOMCRF_s3_integrity;
>          if ( op->u.createdomain.flags & XEN_DOMCTL_CDF_oos_off )
>              domcr_flags |= DOMCRF_oos_off;
> +        if ( op->u.createdomain.flags & XEN_DOMCTL_CDF_nesting )
> +            domcr_flags |= DOMCRF_nesting;
>
>          ret = -ENOMEM;
>          d = domain_create(dom, domcr_flags, op->u.createdomain.ssidref);
> diff -r a0bbec37b529 -r 6f0f41f80285 xen/include/asm-x86/hvm/domain.h
> --- a/xen/include/asm-x86/hvm/domain.h Thu Apr 22 21:49:38 2010 +0800
> +++ b/xen/include/asm-x86/hvm/domain.h Thu Apr 22 22:30:00 2010 +0800
> @@ -93,6 +93,7 @@
>      bool_t mem_sharing_enabled;
>      bool_t qemu_mapcache_invalidate;
>      bool_t is_s3_suspended;
> +    bool_t nesting_avail;
>
>      union {
>          struct vmx_domain vmx;
> diff -r a0bbec37b529 -r 6f0f41f80285 xen/include/asm-x86/hvm/vcpu.h
> --- a/xen/include/asm-x86/hvm/vcpu.h Thu Apr 22 21:49:38 2010 +0800
> +++ b/xen/include/asm-x86/hvm/vcpu.h Thu Apr 22 22:30:00 2010 +0800
> @@ -70,6 +70,8 @@
>      bool_t debug_state_latch;
>      bool_t single_step;
>
> +    bool_t in_nesting;
> +
>      u64 asid_generation;
>      u32 asid;
>
> diff -r a0bbec37b529 -r 6f0f41f80285 xen/include/public/domctl.h
> --- a/xen/include/public/domctl.h Thu Apr 22 21:49:38 2010 +0800
> +++ b/xen/include/public/domctl.h Thu Apr 22 22:30:00 2010 +0800
> @@ -64,6 +64,9 @@
>  /* Disable out-of-sync shadow page tables? */
>  #define _XEN_DOMCTL_CDF_oos_off 3
>  #define XEN_DOMCTL_CDF_oos_off (1U<<_XEN_DOMCTL_CDF_oos_off)
> + /* Is nested virtualization allowed */
> +#define _XEN_DOMCTL_CDF_nesting 4
> +#define XEN_DOMCTL_CDF_nesting (1U<<_XEN_DOMCTL_CDF_nesting)
>  };
>  typedef struct xen_domctl_createdomain xen_domctl_createdomain_t;
>  DEFINE_XEN_GUEST_HANDLE(xen_domctl_createdomain_t);
> diff -r a0bbec37b529 -r 6f0f41f80285 xen/include/xen/sched.h
> --- a/xen/include/xen/sched.h Thu Apr 22 21:49:38 2010 +0800
> +++ b/xen/include/xen/sched.h Thu Apr 22 22:30:00 2010 +0800
> @@ -393,6 +393,9 @@
>  /* DOMCRF_oos_off: dont use out-of-sync optimization for shadow page tables */
>  #define _DOMCRF_oos_off 4
>  #define DOMCRF_oos_off (1U<<_DOMCRF_oos_off)
> + /* DOMCRF_nesting: Create a domain that allows nested virtualization . */
> +#define _DOMCRF_nesting 5
> +#define DOMCRF_nesting (1U<<_DOMCRF_nesting)
>
>  /*
>   * rcu_lock_domain_by_id() is more efficient than get_domain_by_id().

-- 
Tim Deegan <Tim.Deegan@citrix.com>
Principal Software Engineer, XenServer Engineering
Citrix Systems UK Ltd.  (Company #02937203, SL9 0BG)
Qing He
2010-May-20  09:46 UTC
Re: [Xen-devel] [PATCH 03/17] vmx: nest: wrapper for control update
On Thu, 2010-05-20 at 17:34 +0800, Tim Deegan wrote:
> At 10:41 +0100 on 22 Apr (1271932875), Qing He wrote:
> > In nested virtualization, the L0 controls may not be the same
> > with controls in physical VMCS.
> > Explict maintain guest controls in variables and use wrappers
> > for control update, do not rely on physical control value.
> >
> > Signed-off-by: Qing He <qing.he@intel.com>
>
> > diff -r fe49b7452637 -r a0bbec37b529 xen/arch/x86/hvm/vmx/vmcs.c
> > --- a/xen/arch/x86/hvm/vmx/vmcs.c Thu Apr 22 21:49:38 2010 +0800
> > +++ b/xen/arch/x86/hvm/vmx/vmcs.c Thu Apr 22 21:49:38 2010 +0800
> > @@ -737,10 +737,10 @@
> >      __vmwrite(VMCS_LINK_POINTER_HIGH, ~0UL);
> >  #endif
> >
> > -    __vmwrite(EXCEPTION_BITMAP,
> > -              HVM_TRAP_MASK
> > +    v->arch.hvm_vmx.exception_bitmap = HVM_TRAP_MASK
> >                | (paging_mode_hap(d) ? 0 : (1U << TRAP_page_fault))
> > -              | (1U << TRAP_no_device));
> > +              | (1U << TRAP_no_device);
> > +    __vmwrite(EXCEPTION_BITMAP, v->arch.hvm_vmx.exception_bitmap);
>
> Shouldn't this use the new vmx_update_exception_bitmap()?

I left it unchanged because it's in vmcs.c. To me, vmx.c sits on top of
vmcs.c, and I'd rather avoid an inter-dependency. That said, I don't
feel strongly about it, and I'm fine with using
vmx_update_exception_bitmap here since the inter-dependency already
exists.

Thanks,
Qing

>
> Cheers,
>
> Tim.
>
> --
> Tim Deegan <Tim.Deegan@citrix.com>
> Principal Software Engineer, XenServer Engineering
> Citrix Systems UK Ltd.  (Company #02937203, SL9 0BG)
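
The idea behind patch 03, keeping the wanted exception bitmap in a variable and pushing it to the VMCS through one wrapper, can be sketched as follows. This is an illustration under assumptions, not the Xen code: VcpuSketch, its methods and the dict standing in for the physical VMCS are hypothetical, and HVM_TRAP_MASK is passed in as a parameter because its value is not given in this thread. The two vector numbers are the architectural ones for #NM and #PF.

```python
TRAP_no_device = 7    # #NM, device not available
TRAP_page_fault = 14  # #PF


class VcpuSketch:
    """Toy model of a vcpu that caches its exception bitmap."""

    def __init__(self, hvm_trap_mask, hap):
        self.vmcs = {}  # stand-in for the physical VMCS
        # Same expression as the quoted hunk: with HAP, #PF is not intercepted.
        self.exception_bitmap = (hvm_trap_mask
                                 | (0 if hap else 1 << TRAP_page_fault)
                                 | (1 << TRAP_no_device))
        self.update_exception_bitmap()

    def update_exception_bitmap(self):
        """Single place where the cached value reaches the VMCS."""
        self.vmcs["EXCEPTION_BITMAP"] = self.exception_bitmap

    def set_trap(self, vector):
        self.exception_bitmap |= 1 << vector
        self.update_exception_bitmap()

    def clear_trap(self, vector):
        self.exception_bitmap &= ~(1 << vector)
        self.update_exception_bitmap()
```

Because callers never read the VMCS field back, the cached value stays meaningful even when a different VMCS is loaded for L2, which is the motivation given in the patch description.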
Christoph Egger
2010-May-20  09:51 UTC
Re: [Xen-devel] [PATCH 04/17] vmx: nest: domain and vcpu flags
On Thursday 20 May 2010 11:37:53 Tim Deegan wrote:
> At 10:41 +0100 on 22 Apr (1271932876), Qing He wrote:
> > Introduce a domain create flag to allow user to set availability
> > of nested virtualization.
> > The flag will be used to disable all reporting and function
> > facilities, improving guest security.
>
> I have the same reservation about this as Christoph's patch: I don't
> think this needs to be a create-time flag - there's no reason it can't
> be enabled or disabled with a domctl after domain creation.  (And of
> course we'll want it to bve the same interface on both SVM and VMX.)

I already reworked that part to use HVM_PARAM_*.
That exposed one caveat: nestedhvm_enabled() only becomes true after
p2m_init() has run, so the hap-on-hap code wasn't initialized.
I worked around that by initialising the nestedp2m's in p2m_init()
unconditionally, whether or not nestedhvm=1 is set in the guest
config file.

Christoph

-- 
---to satisfy European Law for business letters:
Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach b. Muenchen
Geschaeftsfuehrer: Andrew Bowd, Thomas M. McCoy, Giuliano Meroni
Sitz: Dornach, Gemeinde Aschheim, Landkreis Muenchen
Registergericht Muenchen, HRB Nr. 43632
Qing He
2010-May-20  09:54 UTC
Re: [Xen-devel] [PATCH 04/17] vmx: nest: domain and vcpu flags
On Thu, 2010-05-20 at 17:37 +0800, Tim Deegan wrote:
> At 10:41 +0100 on 22 Apr (1271932876), Qing He wrote:
> > Introduce a domain create flag to allow user to set availability
> > of nested virtualization.
> > The flag will be used to disable all reporting and function
> > facilities, improving guest security.
>
> I have the same reservation about this as Christoph's patch: I don't
> think this needs to be a create-time flag - there's no reason it can't
> be enabled or disabled with a domctl after domain creation.

I had seen that discussion before I posted this patch set. I still put
the flag here because some people have expressed security concerns: in
some situations, hardware virtualization needs to be explicitly
disabled to avoid a stealth VMM. That means not merely hiding the
feature, but disabling it altogether.

By using a domctl, do you mean putting the flag in xenstore and letting
QEmu do this? That looks good to me.

> (And of course we'll want it to bve the same interface on both SVM
> and VMX.)
>

Yeah, I just wanted to show my original intention. After discussion, we
can use the same interface.

Thanks,
Qing
Tim Deegan
2010-May-20  10:53 UTC
Re: [Xen-devel] [PATCH 07/17] vmx: nest: handling VMX instruction exits
At 10:41 +0100 on 22 Apr (1271932879), Qing He wrote:
> add a VMX instruction decoder and handle simple VMX instructions
> except vmlaunch/vmresume and invept
>
> Signed-off-by: Qing He <qing.he@intel.com>

> +static void decode_vmx_inst(struct cpu_user_regs *regs,
> +                            struct vmx_inst_decoded *decode)
> +{
> +    struct vcpu *v = current;
> +    union vmx_inst_info info;
> +    struct segment_register seg;
> +    unsigned long base, index, seg_base, disp;
> +    int scale;
> +
> +    info.word = __vmread(VMX_INSTRUCTION_INFO);
> +
> +    if ( info.fields.memreg ) {
> +        decode->type = VMX_INST_MEMREG_TYPE_REG;
> +        decode->reg1 = info.fields.reg1;
> +    }
> +    else
> +    {
> +        decode->type = VMX_INST_MEMREG_TYPE_MEMORY;
> +        hvm_get_segment_register(v, sreg_to_index[info.fields.segment], &seg);
> +        seg_base = seg.base;
> +
> +        base = info.fields.base_reg_invalid ? 0 :
> +            reg_read(regs, info.fields.base_reg);
> +
> +        index = info.fields.index_reg_invalid ? 0 :
> +            reg_read(regs, info.fields.index_reg);
> +
> +        scale = 1 << info.fields.scaling;
> +
> +        disp = __vmread(EXIT_QUALIFICATION);
> +
> +
> +        decode->mem = seg_base + base + index * scale + disp;
> +        decode->len = 1 << (info.fields.addr_size + 1);

Don't we need to check the segment limit, type &c here?

> +    }
> +
> +    decode->reg2 = info.fields.reg2;
> +}
> +
> +static void vmreturn(struct cpu_user_regs *regs, enum vmx_ops_result res)
> +{
> +    unsigned long eflags = regs->eflags;
> +    unsigned long mask = X86_EFLAGS_CF | X86_EFLAGS_PF | X86_EFLAGS_AF |
> +                         X86_EFLAGS_ZF | X86_EFLAGS_SF | X86_EFLAGS_OF;
> +
> +    eflags &= ~mask;
> +
> +    switch ( res ) {
> +    case VMSUCCEED:
> +        break;
> +    case VMFAIL_VALID:
> +        /* TODO: error number of VMFailValid */

? :)

> +        eflags |= X86_EFLAGS_ZF;
> +        break;
> +    case VMFAIL_INVALID:
> +    default:
> +        eflags |= X86_EFLAGS_CF;
> +        break;
> +    }
> +
> +    regs->eflags = eflags;
> +}
> +
> +static void __clear_current_vvmcs(struct vmx_nest_struct *nest)
> +{
> +    if ( nest->svmcs )
> +        __vmpclear(virt_to_maddr(nest->svmcs));
> +
> +    hvm_copy_to_guest_phys(nest->gvmcs_pa, nest->vvmcs, PAGE_SIZE);

Do we care about failure here?

> +    nest->vmcs_invalid = 1;
> +}
> +
> +/*
> + * VMX instructions handling
> + */
> +
> +int vmx_nest_handle_vmxon(struct cpu_user_regs *regs)
> +{
> +    struct vcpu *v = current;
> +    struct vmx_nest_struct *nest = &v->arch.hvm_vmx.nest;
> +    struct vmx_inst_decoded decode;
> +    unsigned long gpa = 0;
> +
> +    if ( !v->domain->arch.hvm_domain.nesting_avail )
> +        goto invalid_op;
> +
> +    decode_vmx_inst(regs, &decode);
> +
> +    ASSERT(decode.type == VMX_INST_MEMREG_TYPE_MEMORY);
> +    hvm_copy_from_guest_virt(&gpa, decode.mem, decode.len, 0);

We _definitely_ care about failure here!  We need to inject #PF rather
than just using zero (and #GP/#SS based on the segment limit check I
mentioned above).

Also somewhere we should be checking CR0.PE, CR4.VMXE and RFLAGS.VM and
returning #UD if they're not correct.  And checking that CPL == 0, too.

> +    nest->guest_vmxon_pa = gpa;
> +    nest->gvmcs_pa = 0;
> +    nest->vmcs_invalid = 1;
> +    nest->vvmcs = alloc_xenheap_page();
> +    if ( !nest->vvmcs )
> +    {
> +        gdprintk(XENLOG_ERR, "nest: allocation for virtual vmcs failed\n");
> +        vmreturn(regs, VMFAIL_INVALID);
> +        goto out;
> +    }

Could we just take a writeable refcount of the guest memory rather than
allocating our own copy?  ISTR the guest's not allowed to write directly
to the VMCS memory anyway.  It would be expensive on 32-bit Xen (because
of having to map/unmap all the time) but cheaper on 64-bit Xen (by
skipping various 4k memcpy()s)

> +    nest->svmcs = alloc_xenheap_page();
> +    if ( !nest->svmcs )
> +    {
> +        gdprintk(XENLOG_ERR, "nest: allocation for shadow vmcs failed\n");
> +        free_xenheap_page(nest->vvmcs);
> +        vmreturn(regs, VMFAIL_INVALID);
> +        goto out;
> +    }
> +
> +    /*
> +     * `fork' the host vmcs to shadow_vmcs
> +     * vmcs_lock is not needed since we are on current
> +     */
> +    nest->hvmcs = v->arch.hvm_vmx.vmcs;
> +    __vmpclear(virt_to_maddr(nest->hvmcs));
> +    memcpy(nest->svmcs, nest->hvmcs, PAGE_SIZE);
> +    __vmptrld(virt_to_maddr(nest->hvmcs));
> +    v->arch.hvm_vmx.launched = 0;
> +
> +    vmreturn(regs, VMSUCCEED);
> +
> +out:
> +    return X86EMUL_OKAY;
> +
> +invalid_op:
> +    hvm_inject_exception(TRAP_invalid_op, 0, 0);
> +    return X86EMUL_EXCEPTION;
> +}
> +
> +int vmx_nest_handle_vmxoff(struct cpu_user_regs *regs)
> +{

Needs error handling...

> +    struct vcpu *v = current;
> +    struct vmx_nest_struct *nest = &v->arch.hvm_vmx.nest;
> +
> +    if ( unlikely(!nest->guest_vmxon_pa) )
> +        goto invalid_op;
> +
> +    nest->guest_vmxon_pa = 0;
> +    __vmpclear(virt_to_maddr(nest->svmcs));
> +
> +    free_xenheap_page(nest->vvmcs);
> +    free_xenheap_page(nest->svmcs);
> +
> +    vmreturn(regs, VMSUCCEED);
> +    return X86EMUL_OKAY;
> +
> +invalid_op:
> +    hvm_inject_exception(TRAP_invalid_op, 0, 0);
> +    return X86EMUL_EXCEPTION;
> +}
> +
> +int vmx_nest_handle_vmptrld(struct cpu_user_regs *regs)
> +{
> +    struct vcpu *v = current;
> +    struct vmx_inst_decoded decode;
> +    struct vmx_nest_struct *nest = &v->arch.hvm_vmx.nest;
> +    unsigned long gpa = 0;
> +
> +    if ( unlikely(!nest->guest_vmxon_pa) )
> +        goto invalid_op;
> +
> +    decode_vmx_inst(regs, &decode);
> +
> +    ASSERT(decode.type == VMX_INST_MEMREG_TYPE_MEMORY);
> +    hvm_copy_from_guest_virt(&gpa, decode.mem, decode.len, 0);

Error handling... #PF, segments, CPL != 0

> +    if ( gpa == nest->guest_vmxon_pa || gpa & 0xfff )
> +    {
> +        vmreturn(regs, VMFAIL_INVALID);
> +        goto out;
> +    }
> +
> +    if ( nest->gvmcs_pa != gpa )
> +    {
> +        if ( !nest->vmcs_invalid )
> +            __clear_current_vvmcs(nest);
> +        nest->gvmcs_pa = gpa;
> +        ASSERT(nest->vmcs_invalid == 1);
> +    }
> +
> +
> +    if ( nest->vmcs_invalid )
> +    {
> +        hvm_copy_from_guest_phys(nest->vvmcs, nest->gvmcs_pa, PAGE_SIZE);

I think you know what I'm going to say here. :)

Apart from the error paths the rest of this patch looks OK to me.

Tim.

-- 
Tim Deegan <Tim.Deegan@citrix.com>
Principal Software Engineer, XenServer Engineering
Citrix Systems UK Ltd.  (Company #02937203, SL9 0BG)
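
The status convention that the quoted vmreturn() implements, plus the address test at the top of the vmptrld handler, can be modelled in a few lines. This is a sketch of the quoted logic only: the EFLAGS bit values are the architectural ones, vmptrld_status is a hypothetical name, and none of the missing fault checks raised in the review (#PF, segment limits, CPL) are modelled.

```python
X86_EFLAGS_CF = 0x0001
X86_EFLAGS_PF = 0x0004
X86_EFLAGS_AF = 0x0010
X86_EFLAGS_ZF = 0x0040
X86_EFLAGS_SF = 0x0080
X86_EFLAGS_OF = 0x0800
ARITH_FLAGS = (X86_EFLAGS_CF | X86_EFLAGS_PF | X86_EFLAGS_AF |
               X86_EFLAGS_ZF | X86_EFLAGS_SF | X86_EFLAGS_OF)

VMSUCCEED, VMFAIL_VALID, VMFAIL_INVALID = range(3)


def vmreturn(eflags, res):
    """Return eflags updated the way the quoted vmreturn() does it."""
    eflags &= ~ARITH_FLAGS          # all six arithmetic flags are cleared first
    if res == VMFAIL_VALID:
        eflags |= X86_EFLAGS_ZF     # failure with a current VMCS
    elif res != VMSUCCEED:
        eflags |= X86_EFLAGS_CF     # VMFAIL_INVALID and anything unknown
    return eflags


def vmptrld_status(gpa, vmxon_pa):
    """Model the first check in vmx_nest_handle_vmptrld (hypothetical name):
    the operand must be 4K aligned and must not be the VMXON region."""
    if (gpa == vmxon_pa) or (gpa & 0xFFF):
        return VMFAIL_INVALID
    return VMSUCCEED
```

The TODO in the patch refers to the error number that accompanies VMFAIL_VALID; this sketch leaves it out as well.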
Tim Deegan
2010-May-20  10:55 UTC
Re: [Xen-devel] [PATCH 04/17] vmx: nest: domain and vcpu flags
At 10:54 +0100 on 20 May (1274352874), Qing He wrote:
> But I still put this flags here because there have been some people
> expressing security concerns, that in some situations, hardware
> virtualization needs to be explicitly disabled to avoid stealth VMM.

I understand that people might want to disable nested HVM, and it's
fine to do that in the domain builder; I just don't think that domcrf is
the right Xen interface.  Christoph's use of HVM_PARAM sounds right to
me.

Tim.

-- 
Tim Deegan <Tim.Deegan@citrix.com>
Principal Software Engineer, XenServer Engineering
Citrix Systems UK Ltd.  (Company #02937203, SL9 0BG)
Tim Deegan
2010-May-20  11:11 UTC
Re: [Xen-devel] [PATCH 08/17] vmx: nest: L1 <-> L2 context switch
At 10:41 +0100 on 22 Apr (1271932880), Qing He wrote:
> This patch adds mode switch between L1 and L2, many controls
> and states handling may need additioinal scrutiny.

Yep - this clearly needs some more work.  I'm going to wait for a later
version that's got the additional scrutiny. :)

Cheers,

Tim.

> Roughly, at virtual VMEntry time, sVMCS is loaded, L2 control
> is combined from controls of L0 and vVMCS, L2 state from vVMCS
> guest state.
> when virtual VMExit, host VMCS is loaded, L1 control is from L0,
> L1 state from vVMCS host state.
>
> Signed-off-by: Qing He <qing.he@intel.com>

-- 
Tim Deegan <Tim.Deegan@citrix.com>
Principal Software Engineer, XenServer Engineering
Citrix Systems UK Ltd.  (Company #02937203, SL9 0BG)
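
The control handling described in the patch 08 summary could look roughly like the sketch below. The thread only says the L2 controls are "combined" from L0 and the vVMCS; taking the bitwise OR of the exiting controls is an assumption made here for illustration, and the real patch may treat individual bits differently. controls_for is a hypothetical name; the two bit values are the architectural positions of HLT exiting (bit 7) and RDTSC exiting (bit 12) in the primary processor-based controls.

```python
CPU_BASED_HLT_EXITING = 0x00000080
CPU_BASED_RDTSC_EXITING = 0x00001000


def controls_for(in_l2, l0_controls, vvmcs_controls):
    """Pick the execution controls to load (hypothetical helper).

    Assumed policy: while L2 runs, an exit is taken if either L0 or L1
    asked for it; once back in L1, only L0's own controls apply.
    """
    if in_l2:
        return l0_controls | vvmcs_controls
    return l0_controls
```

Under this assumed policy, a control that L1 sets in its vVMCS only takes effect while L2 is running, and L0 never loses an exit it depends on.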
At 10:41 +0100 on 22 Apr (1271932881), Qing He wrote:
> +/*
> + * Nested virtualization interrupt handling:
> + *
> + *  When vcpu runs in nested context (L2), the event delivery from
> + *  L0 to L1 may be blocked by several reasons:
> + *   - virtual VMExit
> + *   - virtual VMEntry

I'm not sure I understand what the plan is here.  It looks like you
queue up a virtual vmentry or vmexit so that it happens just before the
real vmentry and then have to hold off interrupt injection because of
it.  I'm a little worried that we'll end up taking a virtual vmexit for
an interrupt, and then not injecting the interrupt.

Maybe you could outline the overall design of how interrupt delivery and
virtual vmenter/vmexit should work in nested VMX.  I suspect that I've
just misunderstood the code.

Cheers,

Tim.

-- 
Tim Deegan <Tim.Deegan@citrix.com>
Principal Software Engineer, XenServer Engineering
Citrix Systems UK Ltd.  (Company #02937203, SL9 0BG)
Tim Deegan
2010-May-20  11:44 UTC
Re: [Xen-devel] [PATCH 10/17] vmx: nest: VMExit handler in L2
At 10:41 +0100 on 22 Apr (1271932882), Qing He wrote:
> Handles VMExits happened in L2
>
> Signed-off-by: Qing He <qing.he@intel.com>
>
> ---
>  arch/x86/hvm/vmx/nest.c        |  182 +++++++++++++++++++++++++++++++++++++++++
>  arch/x86/hvm/vmx/vmx.c         |    6 +
>  include/asm-x86/hvm/vmx/nest.h |    3
>  include/asm-x86/hvm/vmx/vmx.h  |    1
>  4 files changed, 192 insertions(+)
>
> diff -r a7de30ed250d -r 2f9ba6dbbe62 xen/arch/x86/hvm/vmx/nest.c
> --- a/xen/arch/x86/hvm/vmx/nest.c Thu Apr 22 22:30:09 2010 +0800
> +++ b/xen/arch/x86/hvm/vmx/nest.c Thu Apr 22 22:30:09 2010 +0800
> @@ -976,3 +976,185 @@
>
>      /* TODO: NMI */
>  }
> +
> +/*
> + * L2 VMExit handling
> + */
> +
> +static struct control_bit_for_reason {
> +    int reason;
> +    unsigned long bit;
> +} control_bit_for_reason [] = {
> +    {EXIT_REASON_PENDING_VIRT_INTR, CPU_BASED_VIRTUAL_INTR_PENDING},
> +    {EXIT_REASON_HLT, CPU_BASED_HLT_EXITING},
> +    {EXIT_REASON_INVLPG, CPU_BASED_INVLPG_EXITING},
> +    {EXIT_REASON_MWAIT_INSTRUCTION, CPU_BASED_MWAIT_EXITING},
> +    {EXIT_REASON_RDPMC, CPU_BASED_RDPMC_EXITING},
> +    {EXIT_REASON_RDTSC, CPU_BASED_RDTSC_EXITING},
> +    {EXIT_REASON_PENDING_VIRT_NMI, CPU_BASED_VIRTUAL_NMI_PENDING},
> +    {EXIT_REASON_DR_ACCESS, CPU_BASED_MOV_DR_EXITING},
> +    {EXIT_REASON_MONITOR_INSTRUCTION, CPU_BASED_MONITOR_EXITING},
> +    {EXIT_REASON_PAUSE_INSTRUCTION, CPU_BASED_PAUSE_EXITING},
> +};
> +
> +int vmx_nest_l2_vmexit_handler(struct cpu_user_regs *regs,
> +                               unsigned int exit_reason)
> +{
> +    struct vcpu *v = current;
> +    struct vmx_nest_struct *nest = &v->arch.hvm_vmx.nest;
> +    u32 ctrl;
> +    int bypass_l0 = 0;
> +
> +    nest->vmexit_pending = 0;
> +    nest->intr_info = 0;
> +    nest->error_code = 0;
> +
> +    switch (exit_reason) {
> +    case EXIT_REASON_EXCEPTION_NMI:
> +    {
> +        u32 intr_info = __vmread(VM_EXIT_INTR_INFO);
> +        u32 valid_mask = (X86_EVENTTYPE_HW_EXCEPTION << 8) |
> +                         INTR_INFO_VALID_MASK;
> +        u64 exec_bitmap;
> +        int vector = intr_info & INTR_INFO_VECTOR_MASK;
> +
> +        /*
> +         * decided by L0 and L1 exception bitmap, if the vetor is set by
> +         * both, L0 has priority on #PF, L1 has priority on others
> +         */
> +        if ( vector == TRAP_page_fault )
> +        {
> +            if ( paging_mode_hap(v->domain) )
> +                nest->vmexit_pending = 1;
> +        }
> +        else if ( (intr_info & valid_mask) == valid_mask )
> +        {
> +            exec_bitmap =__get_vvmcs(nest->vvmcs, EXCEPTION_BITMAP);
> +
> +            if ( exec_bitmap & (1 << vector) )
> +                nest->vmexit_pending = 1;
> +        }
> +        break;
> +    }
> +
> +    case EXIT_REASON_WBINVD:
> +    case EXIT_REASON_EPT_VIOLATION:
> +    case EXIT_REASON_EPT_MISCONFIG:
> +    case EXIT_REASON_EXTERNAL_INTERRUPT:
> +        /* pass to L0 handler */
> +        break;
> +
> +    case VMX_EXIT_REASONS_FAILED_VMENTRY:
> +    case EXIT_REASON_TRIPLE_FAULT:
> +    case EXIT_REASON_TASK_SWITCH:
> +    case EXIT_REASON_IO_INSTRUCTION:
> +    case EXIT_REASON_CPUID:
> +    case EXIT_REASON_MSR_READ:
> +    case EXIT_REASON_MSR_WRITE:

Aren't these gated on a control bitmap in the L1 VMCS?

> +    case EXIT_REASON_VMCALL:
> +    case EXIT_REASON_VMCLEAR:
> +    case EXIT_REASON_VMLAUNCH:
> +    case EXIT_REASON_VMPTRLD:
> +    case EXIT_REASON_VMPTRST:
> +    case EXIT_REASON_VMREAD:
> +    case EXIT_REASON_VMRESUME:
> +    case EXIT_REASON_VMWRITE:
> +    case EXIT_REASON_VMXOFF:
> +    case EXIT_REASON_VMXON:
> +    case EXIT_REASON_INVEPT:
> +        /* inject to L1 */
> +        nest->vmexit_pending = 1;
> +        break;
> +
> +    case EXIT_REASON_PENDING_VIRT_INTR:
> +    {
> +        ctrl = v->arch.hvm_vmx.exec_control;
> +
> +        /*
> +         * if both open intr/nmi window, L0 has priority.
> +         *
> +         * Note that this is not strictly correct, in L2 context,
> +         * L0's intr/nmi window flag should be replaced to MTF,
> +         * causing an imediate VMExit, but MTF may not be available
> +         * on all hardware.
> +         */
> +        if ( !(ctrl & CPU_BASED_VIRTUAL_INTR_PENDING) )
> +            nest->vmexit_pending = 1;
> +
> +        break;
> +    }
> +    case EXIT_REASON_PENDING_VIRT_NMI:
> +    {
> +        ctrl = v->arch.hvm_vmx.exec_control;
> +
> +        if ( !(ctrl & CPU_BASED_VIRTUAL_NMI_PENDING) )
> +            nest->vmexit_pending = 1;
> +
> +        break;
> +    }
> +
> +    case EXIT_REASON_HLT:
> +    case EXIT_REASON_RDTSC:
> +    case EXIT_REASON_RDPMC:
> +    case EXIT_REASON_MWAIT_INSTRUCTION:
> +    case EXIT_REASON_PAUSE_INSTRUCTION:
> +    case EXIT_REASON_MONITOR_INSTRUCTION:
> +    case EXIT_REASON_DR_ACCESS:
> +    case EXIT_REASON_INVLPG:
> +    {
> +        int i;
> +
> +        /* exit according to guest exec_control */
> +        ctrl = __get_vvmcs(nest->vvmcs, CPU_BASED_VM_EXEC_CONTROL);
> +
> +        for ( i = 0; i < ARRAY_SIZE(control_bit_for_reason); i++ )
> +            if ( control_bit_for_reason[i].reason == exit_reason )
> +                break;

You've already got a switch statement - why not gate these individually
rather than bundling them together and scanning an array?

> +        if ( i == ARRAY_SIZE(control_bit_for_reason) )
> +            break;
> +
> +        if ( control_bit_for_reason[i].bit & ctrl )
> +            nest->vmexit_pending = 1;
> +
> +        break;
> +    }
> +    case EXIT_REASON_CR_ACCESS:
> +    {
> +        u64 exit_qualification = __vmread(EXIT_QUALIFICATION);
> +        int cr = exit_qualification & 15;
> +        int write = (exit_qualification >> 4) & 3;
> +        u32 mask = 0;
> +
> +        /* also according to guest exec_control */
> +        ctrl = __get_vvmcs(nest->vvmcs, CPU_BASED_VM_EXEC_CONTROL);
> +
> +        if ( cr == 3 )
> +        {
> +            mask = write? CPU_BASED_CR3_STORE_EXITING:
> +                          CPU_BASED_CR3_LOAD_EXITING;
> +            if ( ctrl & mask )
> +                nest->vmexit_pending = 1;
> +        }
> +        else if ( cr == 8 )
> +        {
> +            mask = write? CPU_BASED_CR8_STORE_EXITING:
> +                          CPU_BASED_CR8_LOAD_EXITING;
> +            if ( ctrl & mask )
> +                nest->vmexit_pending = 1;
> +        }
> +        else  /* CR0, CR4, CLTS, LMSW */
> +            nest->vmexit_pending = 1;
> +
> +        break;
> +    }
> +    default:
> +        gdprintk(XENLOG_WARNING, "Unknown nested vmexit reason %x.\n",
> +                 exit_reason);
> +    }
> +
> +    if ( nest->vmexit_pending )
> +        bypass_l0 = 1;

This variable doesn't seem to do anything useful.

> +    return bypass_l0;
> +}

Cheers,

Tim.

--
Tim Deegan <Tim.Deegan@citrix.com>
Principal Software Engineer, XenServer Engineering
Citrix Systems UK Ltd.  (Company #02937203, SL9 0BG)
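
For illustration, the review suggestion above (gate each exit reason in the
switch instead of scanning control_bit_for_reason[]) might look roughly like
the sketch below.  It is not taken from the patch: l1_wants_exit() is a
hypothetical helper, and the EXIT_REASON_* and CPU_BASED_* constants are
local stand-ins (values as I understand them from the Intel SDM) so that the
fragment compiles on its own outside the Xen tree.

```c
#include <stdint.h>

/* Local stand-ins for the Xen definitions (assumed SDM values). */
enum {
    EXIT_REASON_HLT                 = 12,
    EXIT_REASON_INVLPG              = 14,
    EXIT_REASON_RDPMC               = 15,
    EXIT_REASON_RDTSC               = 16,
    EXIT_REASON_DR_ACCESS           = 29,
    EXIT_REASON_MWAIT_INSTRUCTION   = 36,
    EXIT_REASON_MONITOR_INSTRUCTION = 39,
    EXIT_REASON_PAUSE_INSTRUCTION   = 40
};

#define CPU_BASED_HLT_EXITING     0x00000080u
#define CPU_BASED_INVLPG_EXITING  0x00000200u
#define CPU_BASED_MWAIT_EXITING   0x00000400u
#define CPU_BASED_RDPMC_EXITING   0x00000800u
#define CPU_BASED_RDTSC_EXITING   0x00001000u
#define CPU_BASED_MOV_DR_EXITING  0x00800000u
#define CPU_BASED_MONITOR_EXITING 0x20000000u
#define CPU_BASED_PAUSE_EXITING   0x40000000u

/*
 * Hypothetical helper: returns 1 if L1's processor-based execution
 * control asks for a VMExit on this reason, 0 otherwise.  Reasons that
 * are not controlled by a single exec-control bit return 0.
 */
int l1_wants_exit(unsigned int exit_reason, uint32_t l1_exec_control)
{
    uint32_t bit;

    switch ( exit_reason )
    {
    case EXIT_REASON_HLT:                 bit = CPU_BASED_HLT_EXITING;     break;
    case EXIT_REASON_INVLPG:              bit = CPU_BASED_INVLPG_EXITING;  break;
    case EXIT_REASON_RDPMC:               bit = CPU_BASED_RDPMC_EXITING;   break;
    case EXIT_REASON_RDTSC:               bit = CPU_BASED_RDTSC_EXITING;   break;
    case EXIT_REASON_DR_ACCESS:           bit = CPU_BASED_MOV_DR_EXITING;  break;
    case EXIT_REASON_MWAIT_INSTRUCTION:   bit = CPU_BASED_MWAIT_EXITING;   break;
    case EXIT_REASON_MONITOR_INSTRUCTION: bit = CPU_BASED_MONITOR_EXITING; break;
    case EXIT_REASON_PAUSE_INSTRUCTION:   bit = CPU_BASED_PAUSE_EXITING;   break;
    default:
        return 0;
    }

    return (l1_exec_control & bit) != 0;
}
```

In the handler this would presumably reduce the bundled case to something
like "if ( l1_wants_exit(exit_reason, ctrl) ) nest->vmexit_pending = 1;",
with the reason-to-bit mapping resolved by the compiler rather than by a
runtime array scan.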
At 10:41 +0100 on 22 Apr (1271932883), Qing He wrote:
> L2 TSC needs special handling, either rdtsc exiting is
> turned on or off

Looks OK to me, but I'm a bit behind on some of the recent changes to
TSC handling.

Tim.

> Signed-off-by: Qing He <qing.he@intel.com>
>
> ---
>  arch/x86/hvm/vmx/nest.c        |   31 +++++++++++++++++++++++++++++++
>  arch/x86/hvm/vmx/vmx.c         |    4 ++++
>  include/asm-x86/hvm/vmx/nest.h |    2 ++
>  3 files changed, 37 insertions(+)
>
> diff -r 2f9ba6dbbe62 -r 2332586ff957 xen/arch/x86/hvm/vmx/nest.c
> --- a/xen/arch/x86/hvm/vmx/nest.c Thu Apr 22 22:30:09 2010 +0800
> +++ b/xen/arch/x86/hvm/vmx/nest.c Thu Apr 22 22:30:09 2010 +0800
> @@ -533,6 +533,18 @@
>   * Nested VMX context switch
>   */
>
> +u64 vmx_nest_get_tsc_offset(struct vcpu *v)
> +{
> +    u64 offset = 0;
> +    struct vmx_nest_struct *nest = &v->arch.hvm_vmx.nest;
> +
> +    if ( __get_vvmcs(nest->vvmcs, CPU_BASED_VM_EXEC_CONTROL) &
> +         CPU_BASED_USE_TSC_OFFSETING )
> +        offset = __get_vvmcs(nest->vvmcs, TSC_OFFSET);
> +
> +    return offset;
> +}
> +
>  static unsigned long vmcs_gstate_field[] = {
>      /* 16 BITS */
>      GUEST_ES_SELECTOR,
> @@ -715,6 +727,8 @@
>      hvm_set_cr4(__get_vvmcs(nest->vvmcs, GUEST_CR4));
>      hvm_set_cr3(__get_vvmcs(nest->vvmcs, GUEST_CR3));
>
> +    hvm_funcs.set_tsc_offset(v, v->arch.hvm_vcpu.cache_tsc_offset);
> +
>      vvmcs_to_shadow(nest->vvmcs, VM_ENTRY_INTR_INFO);
>      vvmcs_to_shadow(nest->vvmcs, VM_ENTRY_EXCEPTION_ERROR_CODE);
>      vvmcs_to_shadow(nest->vvmcs, VM_ENTRY_INSTRUCTION_LEN);
> @@ -837,6 +851,8 @@
>      hvm_set_cr4(__get_vvmcs(nest->vvmcs, HOST_CR4));
>      hvm_set_cr3(__get_vvmcs(nest->vvmcs, HOST_CR3));
>
> +    hvm_funcs.set_tsc_offset(v, v->arch.hvm_vcpu.cache_tsc_offset);
> +
>      __set_vvmcs(nest->vvmcs, VM_ENTRY_INTR_INFO, 0);
>  }
>
> @@ -1116,6 +1132,21 @@
>
>          if ( control_bit_for_reason[i].bit & ctrl )
>              nest->vmexit_pending = 1;
> +        else if ( exit_reason == EXIT_REASON_RDTSC )
> +        {
> +            uint64_t tsc;
> +
> +            /*
> +             * rdtsc can't be handled normally in the L0 handler
> +             * if L1 doesn't want it
> +             */
> +            tsc = hvm_get_guest_tsc(v);
> +            tsc += __get_vvmcs(nest->vvmcs, TSC_OFFSET);
> +            regs->eax = (uint32_t)tsc;
> +            regs->edx = (uint32_t)(tsc >> 32);
> +
> +            bypass_l0 = 1;
> +        }
>
>          break;
>      }
> diff -r 2f9ba6dbbe62 -r 2332586ff957 xen/arch/x86/hvm/vmx/vmx.c
> --- a/xen/arch/x86/hvm/vmx/vmx.c Thu Apr 22 22:30:09 2010 +0800
> +++ b/xen/arch/x86/hvm/vmx/vmx.c Thu Apr 22 22:30:09 2010 +0800
> @@ -974,6 +974,10 @@
>  static void vmx_set_tsc_offset(struct vcpu *v, u64 offset)
>  {
>      vmx_vmcs_enter(v);
> +
> +    if ( v->arch.hvm_vcpu.in_nesting )
> +        offset += vmx_nest_get_tsc_offset(v);
> +
>      __vmwrite(TSC_OFFSET, offset);
>  #if defined (__i386__)
>      __vmwrite(TSC_OFFSET_HIGH, offset >> 32);
> diff -r 2f9ba6dbbe62 -r 2332586ff957 xen/include/asm-x86/hvm/vmx/nest.h
> --- a/xen/include/asm-x86/hvm/vmx/nest.h Thu Apr 22 22:30:09 2010 +0800
> +++ b/xen/include/asm-x86/hvm/vmx/nest.h Thu Apr 22 22:30:09 2010 +0800
> @@ -69,6 +69,8 @@
>                                              unsigned long value);
>  void vmx_nest_update_exception_bitmap(struct vcpu *v, unsigned long value);
>
> +u64 vmx_nest_get_tsc_offset(struct vcpu *v);
> +
>  void vmx_nest_idtv_handling(void);
>
>  int vmx_nest_l2_vmexit_handler(struct cpu_user_regs *regs,

--
Tim Deegan <Tim.Deegan@citrix.com>
Principal Software Engineer, XenServer Engineering
Citrix Systems UK Ltd.  (Company #02937203, SL9 0BG)
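
As a reading aid, the offset arithmetic in the patch above can be sketched in
isolation.  While L2 runs, the offset written to the hardware VMCS is L0's
offset for L1 plus L1's offset for L2, and L1's part only counts when L1 has
enabled TSC offsetting.  The function names here are hypothetical, and
CPU_BASED_USE_TSC_OFFSETING is a local stand-in (bit 3, as I understand the
SDM); this is a model of the arithmetic, not the Xen code.

```c
#include <stdint.h>

/* Local stand-in for the Xen definition (assumed SDM value, bit 3). */
#define CPU_BASED_USE_TSC_OFFSETING 0x00000008u

/* Offset L1 applies to L2, or 0 if L1 has TSC offsetting turned off. */
uint64_t l1_tsc_offset_for_l2(uint32_t l1_exec_control,
                              uint64_t vvmcs_tsc_offset)
{
    return (l1_exec_control & CPU_BASED_USE_TSC_OFFSETING)
           ? vvmcs_tsc_offset : 0;
}

/*
 * Offset to program while L2 runs: L0's offset for L1 plus L1's offset
 * for L2.  Unsigned wrap-around makes negative offsets work as usual.
 */
uint64_t combined_tsc_offset(uint64_t l0_offset, uint32_t l1_exec_control,
                             uint64_t vvmcs_tsc_offset)
{
    return l0_offset + l1_tsc_offset_for_l2(l1_exec_control,
                                            vvmcs_tsc_offset);
}

/* Split a 64-bit TSC value into the EAX/EDX halves that rdtsc returns. */
void split_tsc(uint64_t tsc, uint32_t *eax, uint32_t *edx)
{
    *eax = (uint32_t)tsc;
    *edx = (uint32_t)(tsc >> 32);
}
```

combined_tsc_offset() mirrors what vmx_set_tsc_offset() does with
vmx_nest_get_tsc_offset() when in_nesting is set, and split_tsc() mirrors
the eax/edx assignment in the emulated rdtsc path.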
Tim Deegan
2010-May-20  11:52 UTC
Re: [Xen-devel] [PATCH 13/17] vmx: nest: capability reporting MSRs
At 10:41 +0100 on 22 Apr (1271932885), Qing He wrote:
> handles VMX capability reporting MSRs.
> Some features are masked so L1 would see a rather
> simple configuration

Would it be better to whitelist features that we know are safely
virtualized?

> Signed-off-by: Qing He <qing.he@intel.com>
>
> ---
>  arch/x86/hvm/vmx/nest.c        |   94 +++++++++++++++++++++++++++++++++++++++++
>  arch/x86/hvm/vmx/vmx.c         |   14 ++++--
>  include/asm-x86/hvm/vmx/nest.h |    5 ++
>  include/asm-x86/hvm/vmx/vmcs.h |    5 ++
>  include/asm-x86/msr-index.h    |    1
>  5 files changed, 115 insertions(+), 4 deletions(-)
>
> diff -r 25c338cbc024 -r 0f0e32a70c02 xen/arch/x86/hvm/vmx/nest.c
> --- a/xen/arch/x86/hvm/vmx/nest.c Thu Apr 22 22:30:09 2010 +0800
> +++ b/xen/arch/x86/hvm/vmx/nest.c Thu Apr 22 22:30:09 2010 +0800
> @@ -1200,3 +1200,97 @@
>
>      return bypass_l0;
>  }
> +
> +/*
> + * Capability reporting
> + */
> +int vmx_nest_msr_read_intercept(struct cpu_user_regs *regs, u64 *msr_content)
> +{
> +    u32 eax, edx;
> +    u64 data = 0;
> +    int r = 1;
> +    u32 mask = 0;
> +
> +    if ( !current->domain->arch.hvm_domain.nesting_avail )
> +        return 0;
> +
> +    switch (regs->ecx) {
> +    case MSR_IA32_VMX_BASIC:
> +        rdmsr(regs->ecx, eax, edx);
> +        data = edx;
> +        data = (data & ~0x1fff) | 0x1000;     /* request 4KB for guest VMCS */
> +        data &= ~(1 << 23);                   /* disable TRUE_xxx_CTLS */
> +        data = (data << 32) | VVMCS_REVISION; /* VVMCS revision */
> +        break;
> +    case MSR_IA32_VMX_PINBASED_CTLS:
> +#define REMOVED_PIN_CONTROL_CAP (PIN_BASED_PREEMPT_TIMER)

Did you mean to use this to mask the value below?

> +        rdmsr(regs->ecx, eax, edx);
> +        data = edx;
> +        data = (data << 32) | eax;
> +        break;
> +    case MSR_IA32_VMX_PROCBASED_CTLS:
> +        rdmsr(regs->ecx, eax, edx);
> +#define REMOVED_EXEC_CONTROL_CAP (CPU_BASED_TPR_SHADOW \
> +            | CPU_BASED_ACTIVATE_MSR_BITMAP            \
> +            | CPU_BASED_ACTIVATE_SECONDARY_CONTROLS)
> +        data = edx & ~REMOVED_EXEC_CONTROL_CAP;
> +        data = (data << 32) | eax;
> +        break;
> +    case MSR_IA32_VMX_EXIT_CTLS:
> +        rdmsr(regs->ecx, eax, edx);
> +#define REMOVED_EXIT_CONTROL_CAP (VM_EXIT_SAVE_GUEST_PAT \
> +            | VM_EXIT_LOAD_HOST_PAT                      \
> +            | VM_EXIT_SAVE_GUEST_EFER                    \
> +            | VM_EXIT_LOAD_HOST_EFER                     \
> +            | VM_EXIT_SAVE_PREEMPT_TIMER)
> +        data = edx & ~REMOVED_EXIT_CONTROL_CAP;
> +        data = (data << 32) | eax;
> +        break;
> +    case MSR_IA32_VMX_ENTRY_CTLS:
> +        rdmsr(regs->ecx, eax, edx);
> +#define REMOVED_ENTRY_CONTROL_CAP (VM_ENTRY_LOAD_GUEST_PAT \
> +            | VM_ENTRY_LOAD_GUEST_EFER)
> +        data = edx & ~REMOVED_ENTRY_CONTROL_CAP;
> +        data = (data << 32) | eax;
> +        break;
> +    case MSR_IA32_VMX_PROCBASED_CTLS2:
> +        mask = 0;
> +
> +        rdmsr(regs->ecx, eax, edx);
> +        data = edx & mask;
> +        data = (data << 32) | eax;
> +        break;
> +
> +    /* pass through MSRs */
> +    case IA32_FEATURE_CONTROL_MSR:
> +    case MSR_IA32_VMX_MISC:
> +    case MSR_IA32_VMX_CR0_FIXED0:
> +    case MSR_IA32_VMX_CR0_FIXED1:
> +    case MSR_IA32_VMX_CR4_FIXED0:
> +    case MSR_IA32_VMX_CR4_FIXED1:
> +    case MSR_IA32_VMX_VMCS_ENUM:
> +        rdmsr(regs->ecx, eax, edx);
> +        data = edx;
> +        data = (data << 32) | eax;
> +        gdprintk(XENLOG_INFO,
> +                 "nest: pass through VMX cap reporting register, %lx\n",
> +                 regs->ecx);
> +        break;
> +    default:
> +        r = 0;
> +        break;
> +    }
> +
> +    if (r == 1)
> +        gdprintk(XENLOG_DEBUG, "nest: intercepted msr access: %lx: %lx\n",
> +                 regs->ecx, data);

These debug printks should go.

> +
> +    *msr_content = data;
> +    return r;
> +}
> +
> +int vmx_nest_msr_write_intercept(struct cpu_user_regs *regs, u64 msr_content)
> +{
> +    /* silently ignore for now */
> +    return 1;
> +}

Cheers,

Tim.

--
Tim Deegan <Tim.Deegan@citrix.com>
Principal Software Engineer, XenServer Engineering
Citrix Systems UK Ltd.  (Company #02937203, SL9 0BG)
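
To make the whitelist question above concrete, here is one possible shape for
it, as a sketch only.  It assumes the usual layout of the VMX control
capability MSRs (low 32 bits: controls that must be 1; high 32 bits: controls
that may be 1).  whitelist_vmx_ctls() and its parameters are hypothetical
names, not part of the patch.

```c
#include <stdint.h>

/*
 * Hypothetical helper.  Given a host VMX control capability MSR value,
 * report to L1 only the optional controls named in 'whitelist'.
 *
 * Bits 31:0  (allowed 0-settings): a set bit means the control must be 1.
 * Bits 63:32 (allowed 1-settings): a set bit means the control may be 1.
 *
 * Controls the hardware forces to 1 are kept regardless of the
 * whitelist, since hiding them would describe a configuration that
 * cannot actually be entered.
 */
uint64_t whitelist_vmx_ctls(uint64_t host_msr, uint32_t whitelist)
{
    uint32_t must_be_one = (uint32_t)host_msr;
    uint32_t may_be_one  = (uint32_t)(host_msr >> 32);

    may_be_one &= whitelist | must_be_one;

    return ((uint64_t)may_be_one << 32) | must_be_one;
}
```

Compared with the REMOVED_*_CAP masks in the patch, a control that a newer
CPU adds stays hidden from L1 until someone adds it to the whitelist, which
is the property the review question is after.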
Tim Deegan
2010-May-20  12:21 UTC
Re: [Xen-devel] [PATCH 15/17] vmx: nest: virtual ept for nested
At 10:41 +0100 on 22 Apr (1271932887), Qing He wrote:
> This patch adds virtual ept capability to L1.
> It's implemented as a simple per vCPU vTLB like component
> independent to domain wide p2m.
>
> Signed-off-by: Qing He <qing.he@intel.com>

> diff -r 22df5f7ec6d3 -r 7f54e6615e1e xen/arch/x86/hvm/vmx/nest.c
> --- a/xen/arch/x86/hvm/vmx/nest.c Thu Apr 22 22:30:09 2010 +0800
> +++ b/xen/arch/x86/hvm/vmx/nest.c Thu Apr 22 22:30:10 2010 +0800
> @@ -26,6 +26,7 @@
>  #include <asm/hvm/vmx/vmx.h>
>  #include <asm/hvm/vmx/vvmcs.h>
>  #include <asm/hvm/vmx/nest.h>
> +#include <asm/hvm/vmx/vept.h>
>
>  /*
>   * VMX instructions support functions
> @@ -295,6 +296,9 @@
>      __vmptrld(virt_to_maddr(nest->hvmcs));
>      v->arch.hvm_vmx.launched = 0;
>
> +    nest->geptp = 0;
> +    nest->vept = vept_init(v);
> +
>      vmreturn(regs, VMSUCCEED);
>
>  out:
> @@ -313,6 +317,9 @@
>      if ( unlikely(!nest->guest_vmxon_pa) )
>          goto invalid_op;
>
> +    vept_teardown(nest->vept);
> +    nest->vept = 0;
> +
>      nest->guest_vmxon_pa = 0;
>      __vmpclear(virt_to_maddr(nest->svmcs));
>
> @@ -529,6 +536,67 @@
>      return vmx_nest_handle_vmresume(regs);
>  }
>
> +int vmx_nest_handle_invept(struct cpu_user_regs *regs)
> +{
> +    struct vcpu *v = current;
> +    struct vmx_inst_decoded decode;
> +    struct vmx_nest_struct *nest = &v->arch.hvm_vmx.nest;
> +    mfn_t mfn;
> +    u64 eptp;
> +    int type;
> +
> +    if ( unlikely(!nest->guest_vmxon_pa) )
> +        goto invalid_op;
> +
> +    decode_vmx_inst(regs, &decode);
> +
> +    hvm_copy_from_guest_virt(&eptp, decode.mem, sizeof(eptp), 0);
> +    type = reg_read(regs, decode.reg2);

Needs error handling like the other new instructions.

> +    /* TODO: physical invept on other cpus */

?

> +    switch ( type )
> +    {
> +    case 1:
> +        mfn = vept_invalidate(nest->vept, eptp);
> +        if ( eptp == nest->geptp )
> +            nest->geptp = 0;
> +
> +        if ( __mfn_valid(mfn_x(mfn)) )
> +            __invept(1, mfn_x(mfn) << PAGE_SHIFT | (eptp & 0xfff), 0);
> +        break;
> +    case 2:
> +        vept_invalidate_all(nest->vept);
> +        nest->geptp = 0;
> +        break;
> +    default:
> +        gdprintk(XENLOG_ERR, "nest: unsupported invept type %d\n", type);
> +        break;
> +    }
> +
> +    vmreturn(regs, VMSUCCEED);
> +
> +    return X86EMUL_OKAY;
> +
> +invalid_op:
> +    hvm_inject_exception(TRAP_invalid_op, 0, 0);
> +    return X86EMUL_EXCEPTION;
> +}
> +
> +int vmx_nest_vept(struct vcpu *v)
> +{
> +    struct vmx_nest_struct *nest = &v->arch.hvm_vmx.nest;
> +    int r = 0;
> +
> +    if ( paging_mode_hap(v->domain) &&
> +         (__get_vvmcs(nest->vvmcs, CPU_BASED_VM_EXEC_CONTROL) &
> +          CPU_BASED_ACTIVATE_SECONDARY_CONTROLS) &&
> +         (__get_vvmcs(nest->vvmcs, SECONDARY_VM_EXEC_CONTROL) &
> +          SECONDARY_EXEC_ENABLE_EPT) )
> +        r = 1;
> +
> +    return r;
> +}
> +
>  /*
>   * Nested VMX context switch
>   */
> @@ -739,7 +807,14 @@
>      vvmcs_to_shadow(nest->vvmcs, CR0_GUEST_HOST_MASK);
>      vvmcs_to_shadow(nest->vvmcs, CR4_GUEST_HOST_MASK);
>
> -    /* TODO: PDPTRs for nested ept */
> +    if ( vmx_nest_vept(v) )
> +    {
> +        vvmcs_to_shadow(nest->vvmcs, GUEST_PDPTR0);
> +        vvmcs_to_shadow(nest->vvmcs, GUEST_PDPTR1);
> +        vvmcs_to_shadow(nest->vvmcs, GUEST_PDPTR2);
> +        vvmcs_to_shadow(nest->vvmcs, GUEST_PDPTR3);
> +    }
> +
>      /* TODO: CR3 target control */
>  }
>
> @@ -787,14 +862,32 @@
>      }
>  #endif
>
> +
> +    /* loading EPT_POINTER for L2 */
> +    if ( vmx_nest_vept(v) )
> +    {
> +        u64 geptp;
> +        mfn_t mfn;
> +
> +        geptp = __get_vvmcs(nest->vvmcs, EPT_POINTER);
> +        if ( geptp != nest->geptp )
> +        {
> +            mfn = vept_load_eptp(nest->vept, geptp);

What if vept_load_eptp() returns INVALID_MFN?

> +            nest->geptp = geptp;
> +
> +            __vmwrite(EPT_POINTER, (mfn_x(mfn) << PAGE_SHIFT) | 0x1e);
> +#ifdef __i386__
> +            __vmwrite(EPT_POINTER_HIGH, (mfn_x(mfn) << PAGE_SHIFT) >> 32);
> +#endif
> +        }
> +    }
> +
>      regs->rip = __get_vvmcs(nest->vvmcs, GUEST_RIP);
>      regs->rsp = __get_vvmcs(nest->vvmcs, GUEST_RSP);
>      regs->rflags = __get_vvmcs(nest->vvmcs, GUEST_RFLAGS);
>
>      /* updating host cr0 to sync TS bit */
>      __vmwrite(HOST_CR0, v->arch.hvm_vmx.host_cr0);
> -
> -    /* TODO: EPT_POINTER */
>  }
>
>  static void sync_vvmcs_guest_state(struct vmx_nest_struct *nest)
> @@ -1064,8 +1157,26 @@
>          break;
>      }
>
> +    case EXIT_REASON_EPT_VIOLATION:
> +    {
> +        unsigned long exit_qualification = __vmread(EXIT_QUALIFICATION);
> +        paddr_t gpa = __vmread(GUEST_PHYSICAL_ADDRESS);
> +#ifdef __i386__
> +        gpa |= (paddr_t)__vmread(GUEST_PHYSICAL_ADDRESS_HIGH) << 32;
> +#endif
> +        if ( vmx_nest_vept(v) )
> +        {
> +            if ( !vept_ept_violation(nest->vept, nest->geptp,
> +                                     exit_qualification, gpa) )
> +                bypass_l0 = 1;
> +            else
> +                nest->vmexit_pending = 1;

Since bypass_l0 is set from vmexit_pending() here it looks like it's
always going to be set.  Does that mean we never handle a real EPT
violation at L0?  I would expect there to be three possible outcomes
here: give the violation to L1, give it to L0, or fix it in the vept and
discard it.

> +        }
> +
> +        break;
> +    }
> +
>      case EXIT_REASON_WBINVD:
> -    case EXIT_REASON_EPT_VIOLATION:
>      case EXIT_REASON_EPT_MISCONFIG:
>      case EXIT_REASON_EXTERNAL_INTERRUPT:
>          /* pass to L0 handler */
> @@ -1229,11 +1340,14 @@
>          data = (data << 32) | eax;
>          break;
>      case MSR_IA32_VMX_PROCBASED_CTLS:
> +        mask = paging_mode_hap(current->domain)?
> +                   0: CPU_BASED_ACTIVATE_SECONDARY_CONTROLS;
> +
>          rdmsr(regs->ecx, eax, edx);
>  #define REMOVED_EXEC_CONTROL_CAP (CPU_BASED_TPR_SHADOW \
> -            | CPU_BASED_ACTIVATE_MSR_BITMAP            \
> -            | CPU_BASED_ACTIVATE_SECONDARY_CONTROLS)
> +            | CPU_BASED_ACTIVATE_MSR_BITMAP)
>          data = edx & ~REMOVED_EXEC_CONTROL_CAP;
> +        data = edx & ~mask;
>          data = (data << 32) | eax;
>          break;
>      case MSR_IA32_VMX_EXIT_CTLS:
> @@ -1254,12 +1368,20 @@
>          data = (data << 32) | eax;
>          break;
>      case MSR_IA32_VMX_PROCBASED_CTLS2:
> -        mask = 0;
> +        mask = paging_mode_hap(current->domain)?
> +                   SECONDARY_EXEC_ENABLE_EPT : 0;
>
>          rdmsr(regs->ecx, eax, edx);
>          data = edx & mask;
>          data = (data << 32) | eax;
>          break;
> +    case MSR_IA32_VMX_EPT_VPID_CAP:
> +        rdmsr(regs->ecx, eax, edx);
> +#define REMOVED_EPT_VPID_CAP_HIGH ( 1 | 1<<8 | 1<<9 | 1<<10 | 1<<11 )
> +#define REMOVED_EPT_VPID_CAP_LOW ( 1<<16 | 1<<17 | 1<<26 )
> +        data = edx & ~REMOVED_EPT_VPID_CAP_HIGH;
> +        data = (data << 32) | (eax & ~REMOVED_EPT_VPID_CAP_LOW);
> +        break;
>
>      /* pass through MSRs */
>      case IA32_FEATURE_CONTROL_MSR:
> diff -r 22df5f7ec6d3 -r 7f54e6615e1e xen/arch/x86/hvm/vmx/vept.c
> --- /dev/null Thu Jan 01 00:00:00 1970 +0000
> +++ b/xen/arch/x86/hvm/vmx/vept.c Thu Apr 22 22:30:10 2010 +0800
> @@ -0,0 +1,574 @@
> +/*
> + * vept.c: virtual EPT for nested virtualization
> + *
> + * Copyright (c) 2010, Intel Corporation.
> + * Author: Qing He <qing.he@intel.com>
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms and conditions of the GNU General Public License,
> + * version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope it will be useful, but WITHOUT
> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
> + * more details.
> + *
> + * You should have received a copy of the GNU General Public License along with
> + * this program; if not, write to the Free Software Foundation, Inc., 59 Temple
> + * Place - Suite 330, Boston, MA 02111-1307 USA.
> + *
> + */
> +
> +#include <xen/config.h>
> +#include <xen/types.h>
> +#include <xen/list.h>
> +#include <xen/mm.h>
> +#include <xen/paging.h>
> +#include <xen/domain_page.h>
> +#include <xen/sched.h>
> +#include <asm/page.h>
> +#include <xen/numa.h>
> +#include <asm/hvm/vmx/vmx.h>
> +#include <asm/hvm/vmx/vept.h>
> +
> +#undef mfn_to_page
> +#define mfn_to_page(_m) __mfn_to_page(mfn_x(_m))
> +#undef mfn_valid
> +#define mfn_valid(_mfn) __mfn_valid(mfn_x(_mfn))
> +#undef page_to_mfn
> +#define page_to_mfn(_pg) _mfn(__page_to_mfn(_pg))
> +
> +/*
> + * This virtual EPT implementation is independent to p2m facility
> + * and has some different characteristics. It works in a similar
> + * way as shadow page table (guest table and host table composition),
> + * but is per-vcpu, and of vTLB style
> + *   - per vCPU so no lock is required

What happens when dom0 changes domU's p2m table?  Don't you need to
shoot down existing vEPT tables from a foreign CPU?

> + *   - vTLB style signifies honoring all invalidations, and not
> + *     write protection. Unlike ordinary page table, since EPT updates
> + *     and invalidations are minimal in a well written VMM, overhead
> + *     is also minimized.
> + *
> + * The physical root is loaded directly to L2 sVMCS, without entering
> + * any other host controls. Multiple `cache slots' are maintained
> + * for multiple guest EPTPs, with simple LRU replacement.
> + *
> + * One of the limitations so far, is that it doesn't work with
> + * L0 emulation code, so L1 p2m_mmio_direct on top of L0 p2m_mmio_dm
> + * is not supported as for now.

Is this something you intend to fix before we check it in?

> + */
> +
> +#define VEPT_MAX_SLOTS 8
> +#define VEPT_ALLOCATION_SIZE 512
> +
> +struct vept_slot {
> +    u64 eptp;             /* guest eptp */
> +    mfn_t root;           /* root of phys table */
> +    struct list_head list;
> +
> +    struct page_list_head page_list;
> +};
> +
> +struct vept {
> +    struct list_head used_slots; /* lru: new->tail, old->head */
> +    struct list_head free_slots;
> +
> +    int total_pages;
> +    int free_pages;
> +    struct page_list_head freelist;
> +
> +    struct vcpu *vcpu;
> +};
> +
> +
> +static struct vept_slot *__get_eptp_slot(struct vept *vept, u64 geptp)
> +{
> +    struct vept_slot *slot, *tmp;
> +
> +    list_for_each_entry_safe( slot, tmp, &vept->used_slots, list )
> +        if ( slot->eptp == geptp )
> +            return slot;
> +
> +    return NULL;
> +}
> +
> +static struct vept_slot *get_eptp_slot(struct vept *vept, u64 geptp)
> +{
> +    struct vept_slot *slot;
> +
> +    slot = __get_eptp_slot(vept, geptp);
> +    if ( slot != NULL )
> +        list_del(&slot->list);
> +
> +    return slot;
> +}
> +
> +static void __clear_slot(struct vept *vept, struct vept_slot *slot)
> +{
> +    struct page_info *pg;
> +
> +    slot->eptp = 0;
> +
> +    while ( !page_list_empty(&slot->page_list) )
> +    {
> +        pg = page_list_remove_head(&slot->page_list);
> +        page_list_add_tail(pg, &vept->freelist);
> +
> +        vept->free_pages++;
> +    }
> +}
> +
> +static struct vept_slot *get_free_slot(struct vept *vept)
> +{
> +    struct vept_slot *slot = NULL;
> +
> +    if ( !list_empty(&vept->free_slots) )
> +    {
> +        slot = list_entry(vept->free_slots.next, struct vept_slot, list);
> +        list_del(&slot->list);
> +    }
> +    else if ( !list_empty(&vept->used_slots) )
> +    {
> +        slot = list_entry(vept->used_slots.next, struct vept_slot, list);
> +        list_del(&slot->list);
> +        __clear_slot(vept, slot);
> +    }
> +
> +    return slot;
> +}
> +
> +static void clear_all_slots(struct vept *vept)
> +{
> +    struct vept_slot *slot, *tmp;
> +
> +    list_for_each_entry_safe( slot, tmp, &vept->used_slots, list )
> +    {
> +        list_del(&slot->list);
> +        __clear_slot(vept, slot);
> +        list_add_tail(&slot->list, &vept->free_slots);
> +    }
> +}
> +
> +static int free_some_pages(struct vept *vept, struct vept_slot *curr)
> +{
> +    struct vept_slot *slot;
> +    int r = 0;
> +
> +    if ( !list_empty(&vept->used_slots) )
> +    {
> +        slot = list_entry(vept->used_slots.next, struct vept_slot, list);
> +        if ( slot != curr )
> +        {
> +            list_del(&slot->list);
> +            __clear_slot(vept, slot);
> +            list_add_tail(&slot->list, &vept->free_slots);
> +
> +            r = 1;
> +        }
> +    }
> +
> +    return r;
> +}
> +
> +struct vept *vept_init(struct vcpu *v)
> +{
> +    struct vept *vept;
> +    struct vept_slot *slot;
> +    struct page_info *pg;
> +    int i;
> +
> +    vept = xmalloc(struct vept);
> +    if ( vept == NULL )
> +        goto out;
> +
> +    memset(vept, 0, sizeof(*vept));
> +    vept->vcpu = v;
> +
> +    INIT_PAGE_LIST_HEAD(&vept->freelist);
> +    INIT_LIST_HEAD(&vept->used_slots);
> +    INIT_LIST_HEAD(&vept->free_slots);
> +
> +    for ( i = 0; i < VEPT_MAX_SLOTS; i++ )
> +    {
> +        slot = xmalloc(struct vept_slot);
> +        if ( slot == NULL )
> +            break;
> +
> +        memset(slot, 0, sizeof(*slot));
> +
> +        INIT_LIST_HEAD(&slot->list);
> +        INIT_PAGE_LIST_HEAD(&slot->page_list);
> +
> +        list_add(&slot->list, &vept->free_slots);
> +    }
> +
> +    for ( i = 0; i < VEPT_ALLOCATION_SIZE; i++ )

Why a fixed 2MB allocation?  What if your nested domains are very large?

> +    {
> +        pg = alloc_domheap_page(NULL, MEMF_node(domain_to_node(v->domain)));

Shouldn't this be allocated from the paging pool like other EPT memory?

> +        if ( pg == NULL )
> +            break;

Return an error?

> +        page_list_add_tail(pg, &vept->freelist);
> +        vept->total_pages++;
> +        vept->free_pages++;
> +    }
> +
> + out:
> +    return vept;
> +}
> +
> +void vept_teardown(struct vept *vept)
> +{
> +    struct page_info *pg;
> +    struct vept_slot *slot, *tmp;
> +
> +    clear_all_slots(vept);
> +
> +    while ( !page_list_empty(&vept->freelist) )
> +    {
> +        pg = page_list_remove_head(&vept->freelist);
> +        free_domheap_page(pg);
> +        vept->free_pages++;
> +        vept->total_pages++;
> +    }
> +
> +    list_for_each_entry_safe( slot, tmp, &vept->free_slots, list )
> +        xfree(slot);
> +
> +    xfree(vept);
> +}
> +
> +mfn_t vept_load_eptp(struct vept *vept, u64 geptp)
> +{
> +    struct page_info *pg;
> +    struct vept_slot *slot;
> +    mfn_t mfn = _mfn(INVALID_MFN);
> +    void *addr;
> +
> +    ASSERT(vept->vcpu == current);
> +
> +    slot = get_eptp_slot(vept, geptp);
> +    if ( slot == NULL )
> +    {
> +        slot = get_free_slot(vept);
> +        if ( unlikely(slot == NULL) )
> +        {
> +            gdprintk(XENLOG_ERR, "nest: can't get free slot\n");
> +            return mfn;
> +        }
> +
> +        while ( !vept->free_pages )
> +            if ( !free_some_pages(vept, slot) )
> +            {
> +                slot->eptp = 0;
> +                list_add_tail(&slot->list, &vept->free_slots);
> +                gdprintk(XENLOG_ERR, "nest: vept no free pages\n");
> +
> +                return mfn;
> +            }
> +
> +        vept->free_pages--;
> +        pg = page_list_remove_head(&vept->freelist);
> +
> +        mfn = page_to_mfn(pg);
> +        addr = map_domain_page(mfn_x(mfn));
> +        clear_page(addr);
> +        unmap_domain_page(addr);
> +        page_list_add_tail(pg, &slot->page_list);
> +        slot->eptp = geptp;
> +        slot->root = mfn;
> +    }
> +
> +    mfn = slot->root;
> +    list_add_tail(&slot->list, &vept->used_slots);
> +
> +    return mfn;
> +}
> +
> +mfn_t vept_invalidate(struct vept *vept, u64 geptp)
> +{
> +    struct vept_slot *slot;
> +    mfn_t mfn = _mfn(INVALID_MFN);
> +
> +    ASSERT(vept->vcpu == current);
> +
> +    slot = get_eptp_slot(vept, geptp);
> +    if ( slot != NULL )
> +    {
> +        mfn = slot->root;
> +        __clear_slot(vept, slot);
> +        list_add_tail(&slot->list, &vept->free_slots);
> +    }
> +
> +    return mfn;
> +}
> +
> +void vept_invalidate_all(struct vept *vept)
> +{
> +    ASSERT(vept->vcpu == current);
> +
> +    clear_all_slots(vept);
> +}
> +
> +/*
> + * guest EPT walk and EPT violation
> + */
> +struct ept_walk {
> +    unsigned long gfn;
> +    unsigned long gfn_remainder;
> +    ept_entry_t l4e, l3e, l2e, l1e;
> +    mfn_t l4mfn, l3mfn, l2mfn, l1mfn;
> +    int sp;
> +};
> +typedef struct ept_walk ept_walk_t;
> +
> +#define GEPT_NORMAL_PAGE 0
> +#define GEPT_SUPER_PAGE 1
> +#define GEPT_NOT_PRESENT 2
> +static int guest_ept_next_level(struct vcpu *v, ept_entry_t **table,
> +               unsigned long *gfn_remainder, int level, u32 *ar,
> +               ept_entry_t *entry, mfn_t *next_mfn)
> +{
> +    int index;
> +    ept_entry_t *ept_entry;
> +    ept_entry_t *next;
> +    p2m_type_t p2mt;
> +    int rc = GEPT_NORMAL_PAGE;
> +    mfn_t mfn;
> +
> +    index = *gfn_remainder >> (level * EPT_TABLE_ORDER);
> +
> +    ept_entry = (*table) + index;
> +    *entry = *ept_entry;
> +    *ar &= entry->epte & 0x7;
> +
> +    *gfn_remainder &= (1UL << (level * EPT_TABLE_ORDER)) - 1;
> +
> +    if ( !(ept_entry->epte & 0x7) )
> +        rc = GEPT_NOT_PRESENT;
> +    else if ( ept_entry->sp_avail )
> +        rc = GEPT_SUPER_PAGE;
> +    else
> +    {
> +        mfn = gfn_to_mfn(v->domain, ept_entry->mfn, &p2mt);
> +        if ( !p2m_is_ram(p2mt) )
> +            return GEPT_NOT_PRESENT;
> +
> +        if ( next_mfn )
> +        {
> +            next = map_domain_page(mfn_x(mfn));
> +            unmap_domain_page(*table);
> +
> +            *table = next;
> +            *next_mfn = mfn;
> +        }
> +    }
> +
> +    return rc;
> +}
> +
> +static u32 guest_walk_ept(struct vcpu *v, ept_walk_t *gw,
> +                          u64 geptp, u64 ggpa)
> +{
> +    ept_entry_t *table;
> +    p2m_type_t p2mt;
> +    int rc;
> +    u32 ar = 0x7;
> +
> +    unsigned long gfn = (unsigned long) (ggpa >> PAGE_SHIFT);
> +    unsigned long gfn_remainder = gfn;
> +
> +    memset(gw, 0, sizeof(*gw));
> +    gw->gfn = gfn;
> +    gw->sp = 0;
> +
> +    gw->l4mfn = gfn_to_mfn(v->domain, geptp >> PAGE_SHIFT, &p2mt);
> +    if ( !p2m_is_ram(p2mt) )
> +        return 0;
> +
> +    table = map_domain_page(mfn_x(gw->l4mfn));
> +
> +    rc = guest_ept_next_level(v, &table, &gfn_remainder, 3, &ar,
> +                              &gw->l4e, &gw->l3mfn);
> +
> +    if ( rc )
> +        goto out;
> +
> +    rc = guest_ept_next_level(v, &table, &gfn_remainder, 2, &ar,
> +                              &gw->l3e, &gw->l2mfn);
> +
> +    if ( rc == GEPT_SUPER_PAGE )
> +        gw->sp = 2;
> +    if ( rc )
> +        goto out;
> +
> +    rc = guest_ept_next_level(v, &table, &gfn_remainder, 1, &ar,
> +                              &gw->l2e, &gw->l1mfn);
> +
> +    if ( rc == GEPT_SUPER_PAGE )
> +        gw->sp = 1;
> +    if ( rc )
> +        goto out;
> +
> +    rc = guest_ept_next_level(v, &table, &gfn_remainder, 0, &ar,
> +                              &gw->l1e, NULL);
> +
> + out:
> +    gw->gfn_remainder = gfn_remainder;
> +    unmap_domain_page(*table);
> +    return ar;
> +}
> +
> +static void epte_set_ar_bits(ept_entry_t *entry, unsigned long ar)
> +{
> +    entry->epte &= ~0x7f;
> +    entry->epte |= ar & 0x7f;
> +}
> +
> +static int shadow_ept_next_level(struct vept *vept, struct vept_slot *slot,
> +                   ept_entry_t **table, unsigned long *gfn_remainder,
> +                   int level, u32 *ar, ept_entry_t gentry)
> +{
> +    int index;
> +    ept_entry_t *sentry;
> +    ept_entry_t *next;
> +    mfn_t mfn;
> +    struct page_info *pg;
> +
> +    index = *gfn_remainder >> (level * EPT_TABLE_ORDER);
> +
> +    sentry = (*table) + index;
> +    *ar = sentry->epte & 0x7;
> +
> +    *gfn_remainder &= (1UL << (level * EPT_TABLE_ORDER)) - 1;
> +
> +    if ( !(sentry->epte & 0x7) )
> +    {
> +        while ( !vept->free_pages )
> +            if ( !free_some_pages(vept, slot) )
> +            {
> +                gdprintk(XENLOG_ERR, "nest: vept no free pages\n");
> +                return 0;
> +            }
> +
> +        vept->free_pages--;
> +        pg = page_list_remove_head(&vept->freelist);
> +        page_list_add_tail(pg, &slot->page_list);
> +        mfn = page_to_mfn(pg);
> +        next = map_domain_page(mfn_x(mfn));
> +        clear_page(next);
> +
> +        sentry->mfn = mfn_x(mfn);
> +    }
> +    else
> +    {
> +        next = map_domain_page(sentry->mfn);
> +    }
> +
> +    epte_set_ar_bits(sentry, gentry.epte);
> +
> +    unmap_domain_page(*table);
> +    *table = next;
> +
> +    return 1;
> +}
> +
> +int vept_ept_violation(struct vept *vept, u64 geptp,
> +                       unsigned long qualification, paddr_t addr)
> +{
> +    ept_walk_t gw;
> +    struct vept_slot *slot;
> +    ept_entry_t *table, *gept;
> +    ept_entry_t *sentry, *gentry;
> +    u32 old_entry, sp_ar = 0;
> +    p2m_type_t p2mt;
> +    unsigned long mfn_start = 0;
> +    unsigned long gfn_remainder;
> +    int rc, i;
> +
> +    ASSERT(vept->vcpu == current);
> +
> +    slot = __get_eptp_slot(vept, geptp);
> +    if ( unlikely(slot == NULL) )
> +        return 0;
> +
> +    rc = guest_walk_ept(vept->vcpu, &gw, geptp, addr);
> +
> +    if ( !(rc & (qualification & 0x7)) )    /* inject to guest */
> +        return 1;
> +
> +    if ( gw.sp == 2 )  /* 1G */
> +    {
> +        sp_ar = gw.l3e.epte & 0x7;
> +        mfn_start = gw.l3e.mfn +
> +                    (gw.gfn_remainder & (~(1 << EPT_TABLE_ORDER) - 1));
> +    }
> +    if ( gw.sp == 1 )  /* 2M */
> +    {
> +        sp_ar = gw.l2e.epte & 0x7;
> +        mfn_start = gw.l2e.mfn;
> +    }
> +    else
> +        mfn_start = 0;
> +
> +    table = map_domain_page(mfn_x(slot->root));
> +    gfn_remainder = gw.gfn;
> +
> +    shadow_ept_next_level(vept, slot, &table, &gfn_remainder, 3,
> +                          &old_entry, gw.l4e);

What if shadow_ept_next_level() returns 0 ?

> +    shadow_ept_next_level(vept, slot, &table, &gfn_remainder, 2,
> +                          &old_entry, gw.l3e);

Ditto

> +    shadow_ept_next_level(vept, slot, &table, &gfn_remainder, 1,
> +                          &old_entry, (gw.sp == 2) ? gw.l3e : gw.l2e);

Ditto

> +    /* if l1p is just allocated, do a full prefetch */
> +    if ( !old_entry && !gw.sp )
> +    {
> +        gept = map_domain_page(mfn_x(gw.l1mfn));
> +        for ( i = 0; i < 512; i++ )
> +        {
> +            gentry = gept + i;
> +            sentry = table + i;
> +            if ( gentry->epte & 0x7 )
> +            {
> +                sentry->mfn = mfn_x(gfn_to_mfn_guest(vept->vcpu->domain,
> +                                        gentry->mfn, &p2mt));
> +                epte_set_ar_bits(sentry, gentry->epte);
> +            }
> +            else
> +                sentry->epte = 0;
> +        }
> +        unmap_domain_page(gept);
> +    }
> +    else if ( !old_entry && gw.sp )
> +    {
> +        for ( i = 0; i < 512; i++ )
> +        {
> +            sentry = table + i;
> +            sentry->mfn = mfn_x(gfn_to_mfn_guest(vept->vcpu->domain,
> +                                    mfn_start + i, &p2mt));
> +            epte_set_ar_bits(sentry, sp_ar);
> +        }
> +    }
> +    else if ( old_entry && !gw.sp )
> +    {
> +        i = gw.gfn & ((1 << EPT_TABLE_ORDER) - 1);
> +        sentry = table + i;
> +        sentry->mfn = mfn_x(gfn_to_mfn_guest(vept->vcpu->domain,
> +                                gw.l1e.mfn, &p2mt));
> +        epte_set_ar_bits(sentry, gw.l1e.epte);
> +    }
> +    else // old_entry && gw.sp
> +    {
> +        i = gw.gfn & ((1 << EPT_TABLE_ORDER) - 1);
> +        sentry = table + i;
> +        sentry->mfn = mfn_x(gfn_to_mfn_guest(vept->vcpu->domain,
> +                                mfn_start + i, &p2mt));
> +        epte_set_ar_bits(sentry, sp_ar);
> +    }
> +
> +    unmap_domain_page(table);
> +    return 0;
> +}
> diff -r 22df5f7ec6d3 -r 7f54e6615e1e xen/arch/x86/hvm/vmx/vmx.c
> --- a/xen/arch/x86/hvm/vmx/vmx.c Thu Apr 22 22:30:09 2010 +0800
> +++ b/xen/arch/x86/hvm/vmx/vmx.c Thu Apr 22 22:30:10 2010 +0800
> @@ -1032,6 +1032,14 @@
>      p2m_type_t p2mt;
>      char *p;
>
> +    /*
> +     * If in nesting EPT operation, L0 doesn't have the knowledge on
> +     * how to interpret CR3, it's L1's responsibility to provide
> +     * GUEST_PDPTRn, we rely solely on them.
> +     */
> +    if ( v->arch.hvm_vcpu.in_nesting && vmx_nest_vept(v) )
> +        return;
> +
>      /* EPT needs to load PDPTRS into VMCS for PAE. */
>      if ( !hvm_pae_enabled(v) || (v->arch.hvm_vcpu.guest_efer & EFER_LMA) )
>          return;
> @@ -2705,6 +2713,11 @@
>          if ( vmx_nest_handle_vmxon(regs) == X86EMUL_OKAY )
>              __update_guest_eip(inst_len);
>          break;
> +    case EXIT_REASON_INVEPT:
> +        inst_len = __get_instruction_length();
> +        if ( vmx_nest_handle_invept(regs) == X86EMUL_OKAY )
> +            __update_guest_eip(inst_len);
> +        break;
>
>      case EXIT_REASON_MWAIT_INSTRUCTION:
>      case EXIT_REASON_MONITOR_INSTRUCTION:
> diff -r 22df5f7ec6d3 -r 7f54e6615e1e xen/include/asm-x86/hvm/vmx/nest.h
> --- a/xen/include/asm-x86/hvm/vmx/nest.h Thu Apr 22 22:30:09 2010 +0800
> +++ b/xen/include/asm-x86/hvm/vmx/nest.h Thu Apr 22 22:30:10 2010 +0800
> @@ -47,6 +47,9 @@
>
>      unsigned long intr_info;
>      unsigned long error_code;
> +
> +    u64 geptp;
> +    struct vept *vept;
>  };
>
>  asmlinkage void vmx_nest_switch_mode(void);
> @@ -64,6 +67,8 @@
>  int vmx_nest_handle_vmresume(struct cpu_user_regs *regs);
>  int vmx_nest_handle_vmlaunch(struct cpu_user_regs *regs);
>
> +int vmx_nest_handle_invept(struct cpu_user_regs *regs);
> +
>  void vmx_nest_update_exec_control(struct vcpu *v, unsigned long value);
>  void vmx_nest_update_secondary_exec_control(struct vcpu *v,
>                                              unsigned long value);
> @@ -81,4 +86,6 @@
>  int vmx_nest_msr_write_intercept(struct cpu_user_regs *regs,
>                                   u64 msr_content);
>
> +int vmx_nest_vept(struct vcpu *v);
> +
>  #endif /* __ASM_X86_HVM_NEST_H__ */
> diff -r 22df5f7ec6d3 -r 7f54e6615e1e xen/include/asm-x86/hvm/vmx/vept.h
> --- /dev/null Thu Jan 01 00:00:00 1970 +0000
> +++ b/xen/include/asm-x86/hvm/vmx/vept.h Thu Apr 22 22:30:10 2010 +0800
> @@ -0,0 +1,10 @@
> +#include <asm/hvm/vmx/vmx.h>
> +
> +
> +struct vept *vept_init(struct vcpu *v);
> +void vept_teardown(struct vept *vept);
> +mfn_t vept_load_eptp(struct vept *vept, u64 eptp);
> +mfn_t vept_invalidate(struct vept *vept, u64 eptp);
> +void vept_invalidate_all(struct vept *vept);
> +int vept_ept_violation(struct vept *vept, u64 eptp,
> +                       unsigned long qualification, paddr_t
addr); > > _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xensource.com > http://lists.xensource.com/xen-devel-- Tim Deegan <Tim.Deegan@citrix.com> Principal Software Engineer, XenServer Engineering Citrix Systems UK Ltd. (Company #02937203, SL9 0BG) _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
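The index arithmetic in shadow_ept_next_level(), together with the review question about what happens when a step returns 0, can be sketched in isolation. This is a minimal illustration, not the patch's code: `ept_split_index`, `ept_walk_checked`, `walk_step_fn` and `step_fail_at` are hypothetical names, and the sketch assumes a four-level table with 512 entries per level and a gfn that fits in 36 bits.

```c
#include <assert.h>
#include <stdint.h>

#define EPT_TABLE_ORDER 9   /* 512 entries per table, as in the patch */

/*
 * Peel the table index for `level` (3 = top level, 0 = leaf) off the
 * remaining guest frame number, mirroring the shift-and-mask pair in
 * shadow_ept_next_level().  Assumes the gfn fits in 36 bits.
 */
static unsigned int ept_split_index(uint64_t *gfn_remainder, int level)
{
    unsigned int shift;
    unsigned int index;

    assert(level >= 0 && level <= 3);
    shift = (unsigned int)level * EPT_TABLE_ORDER;

    index = (unsigned int)(*gfn_remainder >> shift);
    *gfn_remainder &= ((uint64_t)1 << shift) - 1;
    return index;
}

/*
 * Hypothetical stand-in for one step of the shadow walk: returns 1 on
 * success and 0 on failure (for example, no free shadow pages).
 */
typedef int (*walk_step_fn)(int level, unsigned int index, void *ctx);

/* Example step: fails at the level stored in *ctx, succeeds otherwise. */
static int step_fail_at(int level, unsigned int index, void *ctx)
{
    (void)index;
    return level != *(const int *)ctx;
}

/*
 * Walk levels 3..1 and stop at the first failing step, so the caller
 * never touches a table that was not set up.  Returns 1 on success and
 * stores the index into the level-1 table in *leaf_index.
 */
static int ept_walk_checked(uint64_t gfn, walk_step_fn step, void *ctx,
                            unsigned int *leaf_index)
{
    uint64_t remainder = gfn;
    int level;

    for ( level = 3; level >= 1; level-- )
    {
        unsigned int index = ept_split_index(&remainder, level);

        if ( !step(level, index, ctx) )
            return 0;
    }

    *leaf_index = (unsigned int)remainder;
    return 1;
}
```

Propagating the failure out of the walk leaves the decision (retry, inject to L1, or crash the domain) to the caller, which is the part the review asks the patch to spell out.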
Qing He
2010-May-20  12:53 UTC
Re: [Xen-devel] [PATCH 04/17] vmx: nest: domain and vcpu flags
On Thu, 2010-05-20 at 18:55 +0800, Tim Deegan wrote:
> At 10:54 +0100 on 20 May (1274352874), Qing He wrote:
> > But I still put these flags here because some people have
> > expressed security concerns that, in some situations, hardware
> > virtualization needs to be explicitly disabled to avoid a stealth VMM.
>
> I understand that people might want to disable nested HVM, and it's fine
> to do that in the domain builder; I just don't think that domcrf is the
> right Xen interface.  Christoph's use of HVM_PARAM sounds right to me.

OK, I'll change to the HVM_PARAM solution.

> Tim.
>
> -- 
> Tim Deegan <Tim.Deegan@citrix.com>
> Principal Software Engineer, XenServer Engineering
> Citrix Systems UK Ltd.  (Company #02937203, SL9 0BG)
Keir Fraser
2010-May-20  12:57 UTC
Re: [Xen-devel] [PATCH 03/17] vmx: nest: wrapper for control update
On 20/05/2010 10:46, "Qing He" <qing.he@intel.com> wrote:
>> Shouldn't this use the new vmx_update_exception_bitmap()?
>
> I left it unchanged because it's in vmcs.c. To me, vmx.c is on top of
> vmcs.c and I'm against inter-dependency.

The split between vmx.c and vmcs.c is a bit arbitrary in some cases. But
it's cleaner and more maintainable to have everyone use a single interface
to the exception bitmap if possible.

 -- Keir

> Anyway this feeling is not strong. And I'm fine with using
> vmx_update_exception_bitmap here since inter-dependency is already
> the case.
Qing He
2010-May-20  13:28 UTC
Re: [Xen-devel] [PATCH 07/17] vmx: nest: handling VMX instruction exits
On Thu, 2010-05-20 at 18:53 +0800, Tim Deegan wrote:
> At 10:41 +0100 on 22 Apr (1271932879), Qing He wrote:
> > +    else
> > +    {
> > +        decode->type = VMX_INST_MEMREG_TYPE_MEMORY;
> > +        hvm_get_segment_register(v, sreg_to_index[info.fields.segment], &seg);
> > +        seg_base = seg.base;
> > +
> > +        base = info.fields.base_reg_invalid ? 0 :
> > +            reg_read(regs, info.fields.base_reg);
> > +
> > +        index = info.fields.index_reg_invalid ? 0 :
> > +            reg_read(regs, info.fields.index_reg);
> > +
> > +        scale = 1 << info.fields.scaling;
> > +
> > +        disp = __vmread(EXIT_QUALIFICATION);
> > +
> > +
> > +        decode->mem = seg_base + base + index * scale + disp;
> > +        decode->len = 1 << (info.fields.addr_size + 1);
>
> Don't we need to check the segment limit, type &c here?

Definitely. I knew that a lot of error handling is missing, and in
particular, not handling errors of hvm_copy_from_user is nearly
unacceptable. But since it was an RFC, I decided to show the algorithm
first. I'll fix the missing error handling in the next version.

> > +    case VMFAIL_VALID:
> > +        /* TODO: error number of VMFailValid */
>
> ? :)

There is a long list of VMFail error numbers, but VMMs typically don't
care about them very much.

> > +    hvm_copy_to_guest_phys(nest->gvmcs_pa, nest->vvmcs, PAGE_SIZE);
>
> Do we care about failure here?
>
> > +    ASSERT(decode.type == VMX_INST_MEMREG_TYPE_MEMORY);
> > +    hvm_copy_from_guest_virt(&gpa, decode.mem, decode.len, 0);
>
> We _definitely_ care about failure here!  We need to inject #PF rather
> than just using zero (and #GP/#SS based on the segment limit check I
> mentioned above).
>
> Also somewhere we should be checking CR0.PE, CR4.VMXE and RFLAGS.VM and
> returning #UD if they're not correct.  And checking that CPL == 0, too.

Yes, and I think I forgot about CPL == 0; that is an important check.

> > +    nest->vvmcs = alloc_xenheap_page();
> > +    if ( !nest->vvmcs )
> > +    {
> > +        gdprintk(XENLOG_ERR, "nest: allocation for virtual vmcs failed\n");
> > +        vmreturn(regs, VMFAIL_INVALID);
> > +        goto out;
> > +    }
>
> Could we just take a writeable refcount of the guest memory rather than
> allocating our own copy?  ISTR the guest's not allowed to write directly
> to the VMCS memory anyway.  It would be expensive on 32-bit Xen (because
> of having to map/unmap all the time) but cheaper on 64-bit Xen (by
> skipping various 4k memcpy()s)

The original intent was to make it more analogous to a possible hardware
solution (in which the memory is not guaranteed to be usable until an
explicit vmclear). However, we do have a so-called `PV VMCS' patch that
does what you want (so the guest can manipulate it directly).

On second thought, I think there is really no special benefit in not
mapping it directly. I'll change it to do so.

> > +int vmx_nest_handle_vmxoff(struct cpu_user_regs *regs)
> > +{
>
> Needs error handling...
>
> > +    ASSERT(decode.type == VMX_INST_MEMREG_TYPE_MEMORY);
> > +    hvm_copy_from_guest_virt(&gpa, decode.mem, decode.len, 0);
>
> Error handling... #PF, segments, CPL != 0
>
> > +    if ( nest->vmcs_invalid )
> > +    {
> > +        hvm_copy_from_guest_phys(nest->vvmcs, nest->gvmcs_pa, PAGE_SIZE);
>
> I think you know what I'm going to say here. :)  Apart from the error
> paths the rest of this patch looks OK to me.

I'll revise them.

Thanks,
Qing
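The checks discussed in this exchange can be sketched outside Xen as three small pure functions. This is a hedged illustration, not the patch's code: `struct vmx_mem_operand`, `vmx_operand_address`, `vmx_insn_check` and `seg_access_ok` are hypothetical names, the limit check covers only expand-up segments, and raising #GP for CPL != 0 follows my reading of the Intel SDM rather than anything stated in the thread.

```c
#include <stdint.h>

#define X86_CR0_PE     (1u << 0)    /* protected mode enable */
#define X86_CR4_VMXE   (1u << 13)   /* VMX enable */
#define X86_EFLAGS_VM  (1u << 17)   /* virtual-8086 mode */

enum vmx_fault { VMX_FAULT_NONE, VMX_FAULT_UD, VMX_FAULT_GP };

/* Hypothetical flat view of the decoded memory operand. */
struct vmx_mem_operand {
    uint64_t seg_base;
    uint64_t base;
    int base_valid;
    uint64_t index;
    int index_valid;
    unsigned int scaling;   /* 0..3, scale factor is 1 << scaling */
    uint64_t disp;
};

/* Same formula as the patch: seg_base + base + index * scale + disp. */
static uint64_t vmx_operand_address(const struct vmx_mem_operand *op)
{
    uint64_t base = op->base_valid ? op->base : 0;
    uint64_t index = op->index_valid ? op->index : 0;

    return op->seg_base + base + (index << op->scaling) + op->disp;
}

/*
 * Mode and privilege checks: #UD if CR0.PE or CR4.VMXE is clear or
 * RFLAGS.VM is set, then #GP if CPL is not 0 (assumption, see above).
 */
static enum vmx_fault vmx_insn_check(uint64_t cr0, uint64_t cr4,
                                     uint64_t rflags, unsigned int cpl)
{
    if ( !(cr0 & X86_CR0_PE) || !(cr4 & X86_CR4_VMXE) ||
         (rflags & X86_EFLAGS_VM) )
        return VMX_FAULT_UD;

    if ( cpl != 0 )
        return VMX_FAULT_GP;

    return VMX_FAULT_NONE;
}

/*
 * Limit check for an expand-up segment: every byte of the access
 * [offset, offset + len - 1] must lie at or below `limit`.
 */
static int seg_access_ok(uint64_t offset, unsigned int len, uint64_t limit)
{
    if ( len == 0 )
        return 0;

    return offset <= limit && (uint64_t)(len - 1) <= limit - offset;
}
```

Keeping the checks free of side effects lets the caller decide which exception to inject (#GP or #SS for a limit violation depends on the segment in use).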
Qing He
2010-May-20  13:49 UTC
Re: [Xen-devel] [PATCH 08/17] vmx: nest: L1 <-> L2 context switch
On Thu, 2010-05-20 at 19:11 +0800, Tim Deegan wrote:
> At 10:41 +0100 on 22 Apr (1271932880), Qing He wrote:
> > This patch adds mode switch between L1 and L2, many controls
> > and states handling may need additional scrutiny.
>
> Yep - this clearly needs some more work.  I'm going to wait for a later
> version that's got the additional scrutiny. :)

Hmm, I'll scrutinize it, but meanwhile, can you share some of your
thoughts?

I mean, the code doesn't seem to be well organized, partly because there
are many different states to cover, and some tricks are used to
work with the current code; vmx_set_host_env would be a good example
of such tricks. Do you have any suggestions for a better code
organization?

Another issue is that 64-bit shadow on 64-bit shadow is the only case that
doesn't work. I'm curious about it; I did some initial investigation
but didn't come up with anything. Do you have any ideas?

Thanks,
Qing

> Cheers,
>
> Tim.
>
> > Roughly, at virtual VMEntry time, sVMCS is loaded, L2 control
> > is combined from controls of L0 and vVMCS, L2 state from vVMCS
> > guest state.
> > when virtual VMExit, host VMCS is loaded, L1 control is from L0,
> > L1 state from vVMCS host state.
> >
> > Signed-off-by: Qing He <qing.he@intel.com>
>
> -- 
> Tim Deegan <Tim.Deegan@citrix.com>
> Principal Software Engineer, XenServer Engineering
> Citrix Systems UK Ltd.  (Company #02937203, SL9 0BG)
Christoph Egger
2010-May-20  14:06 UTC
Re: [Xen-devel] [PATCH 04/17] vmx: nest: domain and vcpu flags
On Thursday 20 May 2010 14:53:41 Qing He wrote:
> On Thu, 2010-05-20 at 18:55 +0800, Tim Deegan wrote:
> > At 10:54 +0100 on 20 May (1274352874), Qing He wrote:
> > > But I still put this flags here because there have been some people
> > > expressing security concerns, that in some situations, hardware
> > > virtualization needs to be explicitly disabled to avoid stealth VMM.
> >
> > I understand that people might want to disable nested HVM, and it's fine
> > to do that in the domain builder; I just don't think that domcrf is the
> > right Xen interface.  Christoph's use of HVM_PARAM sounds right to me.
>
> OK, I'll change to HVM_PARAM solution.

Do you really want to duplicate the work?
IMO, it is better to adapt my patch.

Christoph

-- 
---to satisfy European Law for business letters:
Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach b. Muenchen
Geschaeftsfuehrer: Andrew Bowd, Thomas M. McCoy, Giuliano Meroni
Sitz: Dornach, Gemeinde Aschheim, Landkreis Muenchen
Registergericht Muenchen, HRB Nr. 43632
On Thu, 2010-05-20 at 19:21 +0800, Tim Deegan wrote:
> At 10:41 +0100 on 22 Apr (1271932881), Qing He wrote:
> > +/*
> > + * Nested virtualization interrupt handling:
> > + *
> > + *  When vcpu runs in nested context (L2), the event delivery from
> > + *  L0 to L1 may be blocked by several reasons:
> > + *   - virtual VMExit
> > + *   - virtual VMEntry
>
> I'm not sure I understand what the plan is here.  It looks like you
> queue up a virtual vmentry or vmexit so that it happens just before the
> real vmentry and then have to hold off interrupt injection because of
> it.  I'm a little worried that we'll end up taking a virtual vmexit for
> an interrupt, and then not injecting the interrupt.

Interrupt handling was one of the most error-prone parts when I was
implementing nested virtualization. I'll write a more detailed outline
in the next several days, but I'll list some points here in the hope
that they help.

Let's say there are three kinds of interrupt: physical intr (physical),
l0 intr (to be injected to l1), and l1 intr (to be injected to l2). Then:
  - simple rule: when running in L2, an l0 intr causes a virtual vmexit
  - what if there is a pending l0 intr when an l1 intr is to be
    injected? And what if there is a pending softirq in the window
    between loading the L2 vmcs and the physical vmentry?
  - a failed l1 intr may be injected by l0 or l1, depending on who
    resolves the fault
  - when running in L2, if an l0 intr is to be injected but is blocked
    (vmentry in progress, idtv injection), it is not exactly correct
    to simply open the intr window. Intr window will cause L2 to exit
    till L2's IF is cleared, but that should not be a blocking reason
    for l0 intr
  - there is the option of ack-intr-on-exit, which needs to be handled
    carefully in the original vmx_intr_assist

Thanks,
Qing

> Maybe you could outline the overall design of how interrupt delivery and
> virtual vmenter/vmexit should work in nested VMX.  I suspect that I've
> just misunderstood the code.
>
> Cheers,
>
> Tim.
>
> -- 
> Tim Deegan <Tim.Deegan@citrix.com>
> Principal Software Engineer, XenServer Engineering
> Citrix Systems UK Ltd.  (Company #02937203, SL9 0BG)
Qing He
2010-May-20  16:06 UTC
Re: [Xen-devel] [PATCH 10/17] vmx: nest: VMExit handler in L2
On Thu, 2010-05-20 at 19:44 +0800, Tim Deegan wrote:
> At 10:41 +0100 on 22 Apr (1271932882), Qing He wrote:
> > +
> > +    case VMX_EXIT_REASONS_FAILED_VMENTRY:
> > +    case EXIT_REASON_TRIPLE_FAULT:
> > +    case EXIT_REASON_TASK_SWITCH:
> > +    case EXIT_REASON_IO_INSTRUCTION:
> > +    case EXIT_REASON_CPUID:
> > +    case EXIT_REASON_MSR_READ:
> > +    case EXIT_REASON_MSR_WRITE:
>
> Aren't these gated on a control bitmap in the L1 VMCS?

cpuid is unconditional.
io and msr are controlled through bitmaps, but the bitmaps are turned
off in the capability reporting.

> > +    case EXIT_REASON_HLT:
> > +    case EXIT_REASON_RDTSC:
> > +    case EXIT_REASON_RDPMC:
> > +    case EXIT_REASON_MWAIT_INSTRUCTION:
> > +    case EXIT_REASON_PAUSE_INSTRUCTION:
> > +    case EXIT_REASON_MONITOR_INSTRUCTION:
> > +    case EXIT_REASON_DR_ACCESS:
> > +    case EXIT_REASON_INVLPG:
> > +    {
> > +        int i;
> > +
> > +        /* exit according to guest exec_control */
> > +        ctrl = __get_vvmcs(nest->vvmcs, CPU_BASED_VM_EXEC_CONTROL);
> > +
> > +        for ( i = 0; i < ARRAY_SIZE(control_bit_for_reason); i++ )
> > +            if ( control_bit_for_reason[i].reason == exit_reason )
> > +                break;
>
> You've already got a switch statement - why not gate these individually
> rather than bundling them together and scanning an array?

Well, they are the `regular' part of exit handling: a bit in the control
bitmap corresponds to the behavior of each.

> > +    if ( nest->vmexit_pending )
> > +        bypass_l0 = 1;
>
> This variable doesn't seem to do anything useful.

This is preparation for the exits that don't generate a virtual
vmexit but need to bypass the normal L0 exit handler.

> > +    return bypass_l0;
> > +}
>
> Cheers,
>
> Tim.
>
> -- 
> Tim Deegan <Tim.Deegan@citrix.com>
> Principal Software Engineer, XenServer Engineering
> Citrix Systems UK Ltd.  (Company #02937203, SL9 0BG)
On Thu, 2010-05-20 at 19:47 +0800, Tim Deegan wrote:
> At 10:41 +0100 on 22 Apr (1271932883), Qing He wrote:
> > L2 TSC needs special handling, either rdtsc exiting is
> > turned on or off
>
> Looks OK to me, but I'm a bit behind on some of the recent changes to
> TSC handling.

I'm also behind on the TSC changes. I'll try to understand more about
those changes, maybe after version 2.

Thanks,
Qing

> Tim.
>
> > Signed-off-by: Qing He <qing.he@intel.com>
> >
> > ---
> >  arch/x86/hvm/vmx/nest.c        |   31 +++++++++++++++++++++++++++++++
> >  arch/x86/hvm/vmx/vmx.c         |    4 ++++
> >  include/asm-x86/hvm/vmx/nest.h |    2 ++
> >  3 files changed, 37 insertions(+)
> >
> > diff -r 2f9ba6dbbe62 -r 2332586ff957 xen/arch/x86/hvm/vmx/nest.c
> > --- a/xen/arch/x86/hvm/vmx/nest.c Thu Apr 22 22:30:09 2010 +0800
> > +++ b/xen/arch/x86/hvm/vmx/nest.c Thu Apr 22 22:30:09 2010 +0800
> > @@ -533,6 +533,18 @@
> >   * Nested VMX context switch
> >   */
> >
> > +u64 vmx_nest_get_tsc_offset(struct vcpu *v)
> > +{
> > +    u64 offset = 0;
> > +    struct vmx_nest_struct *nest = &v->arch.hvm_vmx.nest;
> > +
> > +    if ( __get_vvmcs(nest->vvmcs, CPU_BASED_VM_EXEC_CONTROL) &
> > +         CPU_BASED_USE_TSC_OFFSETING )
> > +        offset = __get_vvmcs(nest->vvmcs, TSC_OFFSET);
> > +
> > +    return offset;
> > +}
> > +
> >  static unsigned long vmcs_gstate_field[] = {
> >      /* 16 BITS */
> >      GUEST_ES_SELECTOR,
> > @@ -715,6 +727,8 @@
> >      hvm_set_cr4(__get_vvmcs(nest->vvmcs, GUEST_CR4));
> >      hvm_set_cr3(__get_vvmcs(nest->vvmcs, GUEST_CR3));
> >
> > +    hvm_funcs.set_tsc_offset(v, v->arch.hvm_vcpu.cache_tsc_offset);
> > +
> >      vvmcs_to_shadow(nest->vvmcs, VM_ENTRY_INTR_INFO);
> >      vvmcs_to_shadow(nest->vvmcs, VM_ENTRY_EXCEPTION_ERROR_CODE);
> >      vvmcs_to_shadow(nest->vvmcs, VM_ENTRY_INSTRUCTION_LEN);
> > @@ -837,6 +851,8 @@
> >      hvm_set_cr4(__get_vvmcs(nest->vvmcs, HOST_CR4));
> >      hvm_set_cr3(__get_vvmcs(nest->vvmcs, HOST_CR3));
> >
> > +    hvm_funcs.set_tsc_offset(v, v->arch.hvm_vcpu.cache_tsc_offset);
> > +
> >      __set_vvmcs(nest->vvmcs, VM_ENTRY_INTR_INFO, 0);
> >  }
> >
> > @@ -1116,6 +1132,21 @@
> >
> >          if ( control_bit_for_reason[i].bit & ctrl )
> >              nest->vmexit_pending = 1;
> > +        else if ( exit_reason == EXIT_REASON_RDTSC )
> > +        {
> > +            uint64_t tsc;
> > +
> > +            /*
> > +             * rdtsc can't be handled normally in the L0 handler
> > +             * if L1 doesn't want it
> > +             */
> > +            tsc = hvm_get_guest_tsc(v);
> > +            tsc += __get_vvmcs(nest->vvmcs, TSC_OFFSET);
> > +            regs->eax = (uint32_t)tsc;
> > +            regs->edx = (uint32_t)(tsc >> 32);
> > +
> > +            bypass_l0 = 1;
> > +        }
> >
> >          break;
> >      }
> > diff -r 2f9ba6dbbe62 -r 2332586ff957 xen/arch/x86/hvm/vmx/vmx.c
> > --- a/xen/arch/x86/hvm/vmx/vmx.c Thu Apr 22 22:30:09 2010 +0800
> > +++ b/xen/arch/x86/hvm/vmx/vmx.c Thu Apr 22 22:30:09 2010 +0800
> > @@ -974,6 +974,10 @@
> >  static void vmx_set_tsc_offset(struct vcpu *v, u64 offset)
> >  {
> >      vmx_vmcs_enter(v);
> > +
> > +    if ( v->arch.hvm_vcpu.in_nesting )
> > +        offset += vmx_nest_get_tsc_offset(v);
> > +
> >      __vmwrite(TSC_OFFSET, offset);
> >  #if defined (__i386__)
> >      __vmwrite(TSC_OFFSET_HIGH, offset >> 32);
> > diff -r 2f9ba6dbbe62 -r 2332586ff957 xen/include/asm-x86/hvm/vmx/nest.h
> > --- a/xen/include/asm-x86/hvm/vmx/nest.h Thu Apr 22 22:30:09 2010 +0800
> > +++ b/xen/include/asm-x86/hvm/vmx/nest.h Thu Apr 22 22:30:09 2010 +0800
> > @@ -69,6 +69,8 @@
> >                                              unsigned long value);
> >  void vmx_nest_update_exception_bitmap(struct vcpu *v, unsigned long value);
> >
> > +u64 vmx_nest_get_tsc_offset(struct vcpu *v);
> > +
> >  void vmx_nest_idtv_handling(void);
> >
> >  int vmx_nest_l2_vmexit_handler(struct cpu_user_regs *regs,
>
> -- 
> Tim Deegan <Tim.Deegan@citrix.com>
> Principal Software Engineer, XenServer Engineering
> Citrix Systems UK Ltd.  (Company #02937203, SL9 0BG)
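The offset composition in the quoted patch can be reduced to plain arithmetic. The sketch below is only an illustration under stated assumptions: `l2_tsc_offset` and `split_tsc` are hypothetical helper names, the control-bit value is the one I recall from Xen's VMX headers, and the function takes plain values instead of reading a vVMCS.

```c
#include <stdint.h>

/* Bit 3 of the primary processor-based controls (assumed value). */
#define CPU_BASED_USE_TSC_OFFSETING 0x00000008u

/*
 * Offset to program while L2 runs: the offset L0 applies to L1, plus
 * the offset L1 asked for in its vVMCS, but only if L1 enabled TSC
 * offsetting.  Unsigned wraparound gives correct results for negative
 * offsets stored as two's complement.
 */
static uint64_t l2_tsc_offset(uint64_t l1_offset, uint32_t l1_exec_control,
                              uint64_t vvmcs_tsc_offset)
{
    uint64_t offset = l1_offset;

    if ( l1_exec_control & CPU_BASED_USE_TSC_OFFSETING )
        offset += vvmcs_tsc_offset;

    return offset;
}

/* Split a 64-bit TSC value into the EDX:EAX pair that rdtsc returns. */
static void split_tsc(uint64_t tsc, uint32_t *eax, uint32_t *edx)
{
    *eax = (uint32_t)tsc;
    *edx = (uint32_t)(tsc >> 32);
}
```

With these helpers, an emulated rdtsc for L2 would return `host_tsc + l2_tsc_offset(...)` split across the two registers.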
Tim Deegan
2010-May-21  08:42 UTC
Re: [Xen-devel] [PATCH 10/17] vmx: nest: VMExit handler in L2
At 17:06 +0100 on 20 May (1274375166), Qing He wrote:
> On Thu, 2010-05-20 at 19:44 +0800, Tim Deegan wrote:
> > At 10:41 +0100 on 22 Apr (1271932882), Qing He wrote:
> > > +
> > > +    case VMX_EXIT_REASONS_FAILED_VMENTRY:
> > > +    case EXIT_REASON_TRIPLE_FAULT:
> > > +    case EXIT_REASON_TASK_SWITCH:
> > > +    case EXIT_REASON_IO_INSTRUCTION:
> > > +    case EXIT_REASON_CPUID:
> > > +    case EXIT_REASON_MSR_READ:
> > > +    case EXIT_REASON_MSR_WRITE:
> >
> > Aren't these gated on a control bitmap in the L1 VMCS?
>
> cpuid is unconditional
> io and msr are controlled through bitmaps, but they are turned
> off in the capability reporting

OK.

> > > +    case EXIT_REASON_HLT:
> > > +    case EXIT_REASON_RDTSC:
> > > +    case EXIT_REASON_RDPMC:
> > > +    case EXIT_REASON_MWAIT_INSTRUCTION:
> > > +    case EXIT_REASON_PAUSE_INSTRUCTION:
> > > +    case EXIT_REASON_MONITOR_INSTRUCTION:
> > > +    case EXIT_REASON_DR_ACCESS:
> > > +    case EXIT_REASON_INVLPG:
> > > +    {
> > > +        int i;
> > > +
> > > +        /* exit according to guest exec_control */
> > > +        ctrl = __get_vvmcs(nest->vvmcs, CPU_BASED_VM_EXEC_CONTROL);
> > > +
> > > +        for ( i = 0; i < ARRAY_SIZE(control_bit_for_reason); i++ )
> > > +            if ( control_bit_for_reason[i].reason == exit_reason )
> > > +                break;
> >
> > You've already got a switch statement - why not gate these individually
> > rather than bundling them together and scanning an array?
>
> Well, they are the `regular' part of exit handling, a bit in the control
> bitmap corresponds to their behavior

I understand that.  It just seems inefficient to bundle them all
together into one clause of the switch statement and then scan an array
looking for which one you've hit.  Wouldn't it be better to give each
one its own clause and then use goto (!) or similar to jump to the
common code?

> > > +    if ( nest->vmexit_pending )
> > > +        bypass_l0 = 1;
> >
> > This variable doesn't seem to do anything useful.
>
> This is a preparation for those doesn't generate virtual
> vmexit but need to bypass normal L0 exit handler

Yes, I saw that in the later patch.

Tim.

> > > +    return bypass_l0;
> > > +}
> >
> > Cheers,
> >
> > Tim.
> >
> > -- 
> > Tim Deegan <Tim.Deegan@citrix.com>
> > Principal Software Engineer, XenServer Engineering
> > Citrix Systems UK Ltd.  (Company #02937203, SL9 0BG)

-- 
Tim Deegan <Tim.Deegan@citrix.com>
Principal Software Engineer, XenServer Engineering
Citrix Systems UK Ltd.  (Company #02937203, SL9 0BG)
Tim Deegan
2010-May-21  09:19 UTC
Re: [Xen-devel] [PATCH 08/17] vmx: nest: L1 <-> L2 context switch
At 14:49 +0100 on 20 May (1274366991), Qing He wrote:
> I mean, the code doesn't seem to organize well, partly because there
> are many different states to cover, and some tricks are used to
> work with the current code, vmx_set_host_env would be a good example
> of such kind of tricks. Do you have any suggestions on a better code
> organization?

TBH I expect that any implementation of this is going to be messy.  It's
a big interface and there are too many special cases.

The only thing that strikes me is that you seem to do a full translation
of the vvmcs on every vmentry.  Would it be possible (since we already
have to intercept every vmread/vmwrite) to keep the svmcs in sync all
the time?

Cheers,

Tim.

> Another issue is that 64 shadow on 64 shadow is the only case that
> doesn't work, I'm curious about it, I had some initial investigation
> but didn't come with something, do you have some ideas?
>
> Thanks,
> Qing
>
> > Cheers,
> >
> > Tim.
> >
> > > Roughly, at virtual VMEntry time, sVMCS is loaded, L2 control
> > > is combined from controls of L0 and vVMCS, L2 state from vVMCS
> > > guest state.
> > > when virtual VMExit, host VMCS is loaded, L1 control is from L0,
> > > L1 state from vVMCS host state.
> > >
> > > Signed-off-by: Qing He <qing.he@intel.com>
> >
> > -- 
> > Tim Deegan <Tim.Deegan@citrix.com>
> > Principal Software Engineer, XenServer Engineering
> > Citrix Systems UK Ltd.  (Company #02937203, SL9 0BG)

-- 
Tim Deegan <Tim.Deegan@citrix.com>
Principal Software Engineer, XenServer Engineering
Citrix Systems UK Ltd.  (Company #02937203, SL9 0BG)
Qing He
2010-May-21  10:24 UTC
Re: [Xen-devel] [PATCH 15/17] vmx: nest: virtual ept for nested
On Thu, 2010-05-20 at 20:21 +0800, Tim Deegan wrote:
> At 10:41 +0100 on 22 Apr (1271932887), Qing He wrote:
> > +    hvm_copy_from_guest_virt(&eptp, decode.mem, sizeof(eptp), 0);
> > +    type = reg_read(regs, decode.reg2);
>
> Needs error handling like the other new instructions.

I'll fix the error handling in the next version.

> > +    /* TODO: physical invept on other cpus */
>
> ?

A vcpu may migrate among multiple physical cpus. When a guest invept
happens, the shadow ept should be invalidated on every cpu.

An alternative would be to invalidate the shadow ept at the time of
vcpu migration, plus invalidate the shadow ept at the time EPT_POINTER
is changed (otherwise, we lose track).

Both methods seem to have an efficiency problem.

> > +    case EXIT_REASON_EPT_VIOLATION:
> > +    {
> > +        unsigned long exit_qualification = __vmread(EXIT_QUALIFICATION);
> > +        paddr_t gpa = __vmread(GUEST_PHYSICAL_ADDRESS);
> > +#ifdef __i386__
> > +        gpa |= (paddr_t)__vmread(GUEST_PHYSICAL_ADDRESS_HIGH) << 32;
> > +#endif
> > +        if ( vmx_nest_vept(v) )
> > +        {
> > +            if ( !vept_ept_violation(nest->vept, nest->geptp,
> > +                     exit_qualification, gpa) )
> > +                bypass_l0 = 1;
> > +            else
> > +                nest->vmexit_pending = 1;
>
> Since bypass_l0 is set from vmexit_pending() here it looks like it's
> always going to be set.  Does that mean we never handle a real EPT
> violation at L0?  I would expect there to be three possible outcomes
> here: give the violation to L1, give it to L0, or fix it in the vept and
> discard it.

I didn't intend to implement a complete solution in a single run.
The first step is based on the assumptions that 1) domU's p2m is fixed,
which means no copy on write, memory over-commitment or ballooning down
(PoD should work though); and 2) no VT-d (no mmio_direct).
The reason is that it works in a wide range of situations without
messing with the p2m and emulation.

The virtual ept has some resemblance to shadow memory, and there is
a dedicated path to shadow. I thought the p2m system was written with
1-level virtualization in mind, and admittedly I'm a little afraid
of making major changes to the memory system, so the L0 p2m is
effectively skipped altogether to avoid messing things up.

Anyway, consider the following:
  1. (in interception) if the virtual EPT does not have the entry,
     give it to L1
  2. (in interception) if the shadow and virtual ept are not in sync,
     try to build the shadow entry
  3. (in common code) if that failed (i.e. gfn_to_mfn failed), pass to
     the L0 p2m

Does 3 automatically work?
(Just put hvm_emulation aside first)

> > + * This virtual EPT implementation is independent to p2m facility
> > + * and has some different characteristics. It works in a similar
> > + * way as shadow page table (guest table and host table composition),
> > + * but is per-vcpu, and of vTLB style
> > + *   - per vCPU so no lock is required
>
> What happens when dom0 changes domU's p2m table?  Don't you need to
> shoot down existing vEPT tables from a foreign CPU?

As stated above, this is not handled.
However, all the vEPT tables need to be invalidated anyway, whether
active (as EPT_POINTER) or inactive, so I hope there are no other
caveats.

> > + * One of the limitations so far, is that it doesn't work with
> > + * L0 emulation code, so L1 p2m_mmio_direct on top of L0 p2m_mmio_dm
> > + * is not supported as for now.
>
> Is this something you intend to fix before we check it in?

I intended to delay this until the time of virtual VT-d. Not only
because the L0 p2m is skipped entirely, but also because hvm_emulate
needs non-trivial changes, at least:
  1. reload gva_to_gfn to work with 2-level hap
  2. hvm_emulate needs explicit knowledge of 2-level hap, because it has
     to differentiate a guest page fault from a guest hap fault.

> > +    for ( i = 0; i < VEPT_MAX_SLOTS; i++ )
> > +    {
> > +        slot = xmalloc(struct vept_slot);
> > +        if ( slot == NULL )
> > +            break;
> > +
> > +        memset(slot, 0, sizeof(*slot));
> > +
> > +        INIT_LIST_HEAD(&slot->list);
> > +        INIT_PAGE_LIST_HEAD(&slot->page_list);
> > +
> > +        list_add(&slot->list, &vept->free_slots);
> > +    }
> > +
> > +    for ( i = 0; i < VEPT_ALLOCATION_SIZE; i++ )
>
> Why a fixed 2MB allocation?  What if your nested domains are very large?
>
> > +    {
> > +        pg = alloc_domheap_page(NULL, MEMF_node(domain_to_node(v->domain)));
>
> Shouldn't this be allocated from the paging pool like other EPT memory?

It just creates a pool for every vcpu, like the hap paging pool, to avoid
locking (vept is not per-domain). Besides, I think the reservation
adjustment of the EPT paging pool is not dynamic either but goes through
domctls (I'm not that sure about this, I just can't find the dynamic
part)?

But yes, definitely, dynamic adjustment is needed.

> > +        if ( pg == NULL )
> > +            break;
>
> Return an error?

It doesn't fill the pool, but there is still some water in it to work
with, so the failure is silently ignored.

> > +    shadow_ept_next_level(vept, slot, &table, &gfn_remainder, 3,
> > +                          &old_entry, gw.l4e);
>
> What if shadow_ept_next_level() returns 0 ?
>
> > +    shadow_ept_next_level(vept, slot, &table, &gfn_remainder, 2,
> > +                          &old_entry, gw.l3e);
>
> Ditto
>
> > +    shadow_ept_next_level(vept, slot, &table, &gfn_remainder, 1,
> > +                          &old_entry, (gw.sp == 2) ? gw.l3e : gw.l2e);
>
> Ditto

Of course, again, error handling.

Also, what do you think of the separation of the vept logic in general?
It does some memory management work but circumvents the normal paging
system. I think this makes vept simpler (than creating a `shadow hap')
and avoids introducing nested-related bugs in generic paging code, but
I'm also worried about its adaptation to future changes.

Thanks,
Qing
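The three outcomes discussed above (give the violation to L1, fix it in the vept, or hand it to L0) can be written as a small classifier. This is a sketch under assumptions, not the patch's logic: `vept_classify` and the enum are hypothetical names, bits 0-2 of both the access rights and the exit qualification are taken to mean read/write/execute, and the check requires every requested access to be allowed, which is stricter than the `!(rc & (qualification & 0x7))` test in the posted patch.

```c
enum vept_outcome {
    VEPT_INJECT_L1,   /* guest EPT forbids the access: virtual vmexit */
    VEPT_FIXED,       /* shadow entry (re)built, resume L2 */
    VEPT_PASS_L0      /* L0 translation failed: let the L0 p2m handle it */
};

/*
 * guest_ar:          R/W/X rights accumulated over the guest EPT walk
 * qualification:     exit qualification of the EPT violation
 * l0_translation_ok: nonzero if gfn_to_mfn succeeded for the target
 */
static enum vept_outcome vept_classify(unsigned int guest_ar,
                                       unsigned long qualification,
                                       int l0_translation_ok)
{
    unsigned int wanted = (unsigned int)(qualification & 0x7);

    /* Step 1: L1's own tables do not permit the access. */
    if ( (guest_ar & wanted) != wanted )
        return VEPT_INJECT_L1;

    /* Step 3: L1 permits it, but L0 cannot back the frame. */
    if ( !l0_translation_ok )
        return VEPT_PASS_L0;

    /* Step 2: the shadow was merely out of sync with the guest EPT. */
    return VEPT_FIXED;
}
```

Separating the decision from the table manipulation makes the "never handled at L0" case raised in the review explicit in the return value.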
Qing He
2010-May-21  10:31 UTC
Re: [Xen-devel] [PATCH 08/17] vmx: nest: L1 <-> L2 context switch
On Fri, 2010-05-21 at 17:19 +0800, Tim Deegan wrote:
> At 14:49 +0100 on 20 May (1274366991), Qing He wrote:
> > I mean, the code doesn't seem to organize well, partly because there
> > are many different states to cover, and some tricks are used to
> > work with the current code, vmx_set_host_env would be a good example
> > of such kind of tricks. Do you have any suggestions on a better code
> > organization?
>
> TBH I expect that any implementation of this is going to be messy.  It's
> a big interface and there are too many special cases.
>
> The only thing that strikes me is that you seem to do a full translation
> of the vvmcs on every vmentry.  Would it be possible (since we already
> have to intercept every vmread/vmwrite) to keep the svmcs in sync all
> the time?

I don't think it's a good idea to change the svmcs at vmread/vmwrite
time, because:
  1. it means 2 additional vmclears and 2 additional vmptrlds for every
     vmread/vmwrite
  2. it makes things like pv vmcs impossible
  3. vmread/vmwrite is supposed to be a simple access; changing the
     svmcs at these points doesn't look right

I did consider a bitmap-based solution that only updates fields
that have been written. However, it needs a new encoding to be defined
and is purely an optimization, so I'd like to just leave it as a TODO
for the moment.

Thanks,
Qing
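The bitmap-based idea mentioned above, updating only the fields that have been written, might look roughly like the following. This is a sketch only: `struct vvmcs_cache`, `vvmcs_write`, `vvmcs_sync` and the dense index space `VVMCS_NR_FIELDS` are all hypothetical, standing in for the "new encoding" the message says would have to be defined, and the `shadow` array stands in for vmwrites to the real shadow VMCS.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

#define VVMCS_NR_FIELDS 128   /* hypothetical dense field count */
#define BITS_PER_WORD   (sizeof(unsigned long) * CHAR_BIT)
#define DIRTY_WORDS     ((VVMCS_NR_FIELDS + BITS_PER_WORD - 1) / BITS_PER_WORD)

struct vvmcs_cache {
    uint64_t field[VVMCS_NR_FIELDS];    /* what L1 wrote */
    uint64_t shadow[VVMCS_NR_FIELDS];   /* stand-in for the shadow VMCS */
    unsigned long dirty[DIRTY_WORDS];   /* one bit per written field */
};

/* Emulated vmwrite: store the value and remember the field is dirty. */
static void vvmcs_write(struct vvmcs_cache *c, size_t idx, uint64_t val)
{
    assert(idx < VVMCS_NR_FIELDS);
    c->field[idx] = val;
    c->dirty[idx / BITS_PER_WORD] |= 1UL << (idx % BITS_PER_WORD);
}

/*
 * At virtual vmentry: propagate only the dirty fields, then clear the
 * bitmap.  Returns the number of fields copied.
 */
static unsigned int vvmcs_sync(struct vvmcs_cache *c)
{
    unsigned int copied = 0;
    size_t idx;

    for ( idx = 0; idx < VVMCS_NR_FIELDS; idx++ )
    {
        unsigned long bit = 1UL << (idx % BITS_PER_WORD);

        if ( c->dirty[idx / BITS_PER_WORD] & bit )
        {
            c->shadow[idx] = c->field[idx];
            c->dirty[idx / BITS_PER_WORD] &= ~bit;
            copied++;
        }
    }

    return copied;
}
```

The emulated vmwrite stays a plain memory store, and the cost of the full translation is deferred to entry time and limited to fields that actually changed.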
Qing He
2010-May-21  10:35 UTC
Re: [Xen-devel] [PATCH 10/17] vmx: nest: VMExit handler in L2
On Fri, 2010-05-21 at 16:42 +0800, Tim Deegan wrote:
> At 17:06 +0100 on 20 May (1274375166), Qing He wrote:
> > On Thu, 2010-05-20 at 19:44 +0800, Tim Deegan wrote:
> > > At 10:41 +0100 on 22 Apr (1271932882), Qing He wrote:
> > > > +    case EXIT_REASON_HLT:
> > > > +    case EXIT_REASON_RDTSC:
> > > > +    case EXIT_REASON_RDPMC:
> > > > +    case EXIT_REASON_MWAIT_INSTRUCTION:
> > > > +    case EXIT_REASON_PAUSE_INSTRUCTION:
> > > > +    case EXIT_REASON_MONITOR_INSTRUCTION:
> > > > +    case EXIT_REASON_DR_ACCESS:
> > > > +    case EXIT_REASON_INVLPG:
> > > > +    {
> > > > +        int i;
> > > > +
> > > > +        /* exit according to guest exec_control */
> > > > +        ctrl = __get_vvmcs(nest->vvmcs, CPU_BASED_VM_EXEC_CONTROL);
> > > > +
> > > > +        for ( i = 0; i < ARRAY_SIZE(control_bit_for_reason); i++ )
> > > > +            if ( control_bit_for_reason[i].reason == exit_reason )
> > > > +                break;
> > >
> > > You've already got a switch statement - why not gate these individually
> > > rather than bundling them together and scanning an array?
> > >
> >
> > Well, they are the `regular' part of exit handling, a bit in the control
> > bitmap corresponds to their behavior
>
> I understand that. It just seems inefficient to bundle them all
> together into one clause of the switch statement and then scan an array
> looking for which one you've hit. Wouldn't it be better to give each
> one its own clause and then use goto (!) or similar to jump to the
> common code?

Ok, I'll change it to switch clauses, does it mean to be more friendly to
the compiler?

Thanks,
Qing
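For reference, the table-scan approach in the quoted hunk can be sketched in isolation as follows. This is only an illustration under stated assumptions: the `TBL_*` constants are placeholder values, not the real VMX exit-reason or exec-control encodings, and `tbl_control_bit_for_reason` and `tbl_l1_wants_exit()` are hypothetical names modelled on the `control_bit_for_reason` array in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Placeholder values: NOT the real VMX encodings. */
enum { TBL_EXIT_HLT = 1, TBL_EXIT_RDTSC = 2, TBL_EXIT_INVLPG = 3 };

#define TBL_CTRL_HLT_EXITING    (1u << 0)
#define TBL_CTRL_RDTSC_EXITING  (1u << 1)
#define TBL_CTRL_INVLPG_EXITING (1u << 2)

/* Maps each "regular" exit reason to the control bit that requests it. */
static const struct {
    unsigned int reason;
    uint32_t bit;
} tbl_control_bit_for_reason[] = {
    { TBL_EXIT_HLT,    TBL_CTRL_HLT_EXITING    },
    { TBL_EXIT_RDTSC,  TBL_CTRL_RDTSC_EXITING  },
    { TBL_EXIT_INVLPG, TBL_CTRL_INVLPG_EXITING },
};

/*
 * Linear scan of the table.  Returns 1 if L1's exec control asks for
 * this exit, 0 if it does not or if the reason is not in the table.
 */
static int tbl_l1_wants_exit(unsigned int exit_reason, uint32_t l1_ctrl)
{
    size_t i;

    for ( i = 0; i < ARRAY_SIZE(tbl_control_bit_for_reason); i++ )
    {
        if ( tbl_control_bit_for_reason[i].reason != exit_reason )
            continue;

        /* A zero bit could never be requested, so it would be a table bug. */
        assert(tbl_control_bit_for_reason[i].bit != 0);
        return (l1_ctrl & tbl_control_bit_for_reason[i].bit) != 0;
    }

    return 0;
}
```

The scan runs on every matching exit even though the switch statement has already identified the reason, which is the cost the review comment points at.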
Tim Deegan
2010-May-25  15:27 UTC
Re: [Xen-devel] [PATCH 08/17] vmx: nest: L1 <-> L2 context switch
At 11:31 +0100 on 21 May (1274441514), Qing He wrote:
> On Fri, 2010-05-21 at 17:19 +0800, Tim Deegan wrote:
> > At 14:49 +0100 on 20 May (1274366991), Qing He wrote:
> > > I mean, the code doesn't seem to organize well, partly because there
> > > are many different states to cover, and some tricks are used to
> > > work with the current code, vmx_set_host_env would be a good example
> > > of such kind of tricks. Do you have any suggestions on a better code
> > > organization?
> >
> > TBH I expect that any implementation of this is going to be messy. It's
> > a big interface and there are too many special cases.
> >
> > The only thing that strikes me is that you seem to do a full translation
> > of the vvmcs on every vmentry. Would it be possible (since we already
> > have to intercept every vmread/vmwrite) to keep the svmcs in sync all
> > the time?
>
> I don't think it's a good idea to change svmcs at the vmread/vmwrite
> time, because
>   1. that means 2 additional vmclears and 2 additional vmptrld for every
>      vmread/vmwrite

Yes, I guess it does. :(

>   2. it makes things like pv vmcs impossible
>   3. vmread/vmwrite is supposed to be simple access, changing svmcs at
>      these points doesn't look right
>
> I did consider a bitmap based solution, to only update fields
> that have been written. However, it needs to define a new encoding
> and is purely optimization, so I'd like to just put it as a TODO at
> the moment.

Fair enough.

Cheers,

Tim.

-- 
Tim Deegan <Tim.Deegan@citrix.com>
Principal Software Engineer, XenServer Engineering
Citrix Systems UK Ltd.  (Company #02937203, SL9 0BG)
Tim Deegan
2010-May-25  15:34 UTC
Re: [Xen-devel] [PATCH 10/17] vmx: nest: VMExit handler in L2
At 11:35 +0100 on 21 May (1274441723), Qing He wrote:
> > I understand that. It just seems inefficient to bundle them all
> > together into one clause of the switch statement and then scan an array
> > looking for which one you've hit. Wouldn't it be better to give each
> > one its own clause and then use goto (!) or similar to jump to the
> > common code?
>
> Ok, I'll change it to switch clauses, does it mean to be more friendly to
> the compiler?

No, it's just faster; I don't think GCC can optimize out a while loop,
even scanning a static array with a known limited set of possible inputs
(though I would be delighted to hear otherwise).

Just to be clear, I'm talking about replacing this kind of logic

    switch (x) {
    case a:
    case b:
    case c:
        for (i = 0; i < 3; i++)
            if ( x == array[i] )
                /* do case-specific thing */
        /* do common case */
    }

with this equivalent:

    switch (x) {
    case a:
        /* do a-specific thing */
        goto common;
    case b:
        /* do b-specific thing */
        goto common;
    case c:
        /* do c-specific thing */
        goto common;
    common:
        /* do common case */
    }

Cheers,

Tim.

-- 
Tim Deegan <Tim.Deegan@citrix.com>
Principal Software Engineer, XenServer Engineering
Citrix Systems UK Ltd.  (Company #02937203, SL9 0BG)
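A compilable version of the second form might look like the sketch below. It assumes placeholder exit reasons and control bits (the `SW_*` names are hypothetical and do not carry the real VMX encodings), and it takes "the common case" to be a check of the selected bit against L1's exec control, which is how the quoted patch hunk uses the table.

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder values: NOT the real VMX encodings. */
enum sw_reason { SW_REASON_HLT = 1, SW_REASON_RDTSC, SW_REASON_INVLPG,
                 SW_REASON_OTHER };

#define SW_CTRL_HLT_EXITING    (1u << 0)
#define SW_CTRL_RDTSC_EXITING  (1u << 1)
#define SW_CTRL_INVLPG_EXITING (1u << 2)

/*
 * Returns 1 if the exit should be forwarded to L1 according to its
 * exec control, 0 otherwise.  Each case picks its own bit and then
 * jumps to the shared tail, so no table scan is needed.
 */
static int sw_l1_wants_exit(enum sw_reason reason, uint32_t l1_exec_control)
{
    uint32_t bit = 0;
    int forward = 0;

    switch ( reason )
    {
    case SW_REASON_HLT:
        bit = SW_CTRL_HLT_EXITING;
        goto common;
    case SW_REASON_RDTSC:
        bit = SW_CTRL_RDTSC_EXITING;
        goto common;
    case SW_REASON_INVLPG:
        bit = SW_CTRL_INVLPG_EXITING;
        goto common;
    common:
        /* Every case that jumps here must have selected a bit. */
        assert(bit != 0);
        forward = (l1_exec_control & bit) != 0;
        break;
    default:
        /* Not one of the "regular" exits: handled elsewhere. */
        break;
    }

    return forward;
}
```

Because the `common:` label sits inside the switch body, reasons that match no case skip it entirely and fall to `default`.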
Tim Deegan
2010-May-25  16:02 UTC
Re: [Xen-devel] [PATCH 15/17] vmx: nest: virtual ept for nested
At 11:24 +0100 on 21 May (1274441043), Qing He wrote:
> vcpu may migrate among multiple physical cpus
> When a guest invept happens, the shadow ept should be invalidated
> on every cpu.
>
> An alternative would be to invalidate shadow ept at the time of
> vcpu migration, plus invalidate shadow ept at the time EPT_POINTER
> is changed (otherwise, we lose track)

Since your shadows are per-vcpu, doing the invept at migration time
sounds like a better idea than sending IPIs. It's not _that_
inefficient (we'd expect most of the TLB to be full of other vcpus' EPT
entries when we migrate to a new cpu), but I'm sure some kind of
dirty-mask or tlbflush timestamp trick could make it better if it became
a bottleneck.

> The two methods above both seem to have efficiency problem.
>
> > > +    case EXIT_REASON_EPT_VIOLATION:
> > > +    {
> > > +        unsigned long exit_qualification = __vmread(EXIT_QUALIFICATION);
> > > +        paddr_t gpa = __vmread(GUEST_PHYSICAL_ADDRESS);
> > > +#ifdef __i386__
> > > +        gpa |= (paddr_t)__vmread(GUEST_PHYSICAL_ADDRESS_HIGH) << 32;
> > > +#endif
> > > +        if ( vmx_nest_vept(v) )
> > > +        {
> > > +            if ( !vept_ept_violation(nest->vept, nest->geptp,
> > > +                                     exit_qualification, gpa) )
> > > +                bypass_l0 = 1;
> > > +            else
> > > +                nest->vmexit_pending = 1;
> >
> > Since bypass_l0 is set from vmexit_pending() here it looks like it's
> > always going to be set. Does that mean we never handle a real EPT
> > violation at L0? I would expect there to be three possible outcomes
> > here: give the violation to L1, give it to L0, or fix it in the vept and
> > discard it.
>
> I didn't intend to implement a full complete solution in a single run,
> the first step is based on the assumption that 1) domU's p2m is fixed,
> which means no copy on write, memory over-commitment or ballooning down
> (PoD should work though); and 2) no VT-D (no mmio_direct). The reason is
> that it works for a wide situations without messing the p2m and
> emulation.

Ah, OK.
I'd prefer if you enforced those assumptions; otherwise I think we're
at risk of breaking the basic memory isolation rules. In particular, if
anything _does_ change the p2m we need to do something to make sure that
the guest doesn't retain old mappings of pages that now belong to
someone else. I think hooking the current EPT code to flush all the
vepts on every p2m change would be safe, if very slow. Then you can
optimize it from there. :)

> The virtual ept has somewhat resemblance to shadow memory, and there is
> a dedicated path to shadow. I thought p2m system is written with 1 level
> virtualization in mind, and admittedly I'm a little afraid of making
> major changes to the memory system, so the L0 p2m is effectively skipped
> altogether to avoid messing things up.
>
> Anyway, considering the following:
>   1. (in interception) if virtual EPT does not have the entry, give to L1
>   2. (in interception) if shadow and virtual ept is not sync, try to build
>      the shadow entry
>   3. (in common code) if failed (i.e. gfn_to_mfn failed), pass to L0 p2m
>
> Does 3 automatically work? (Just put hvm_emulation aside first)

Looking at your patch, I don't think 3 can happen at all. But if you
change the logic to pass the EPT fault to the l0 handler it should work.

> > > + * This virtual EPT implementation is independent to p2m facility
> > > + * and has some different characteristics. It works in a similar
> > > + * way as shadow page table (guest table and host table composition),
> > > + * but is per-vcpu, and of vTLB style
> > > + *   - per vCPU so no lock is required
> >
> > What happens when dom0 changes domU's p2m table? Don't you need to
> > shoot down existing vEPT tables from a foreign CPU?
>
> As stated above, this is not handled.
> However, all the vEPT tables need to be invalidated, either active (as
> EPT_POINTER) or inactive anyway, so I hope there is no other caveats.

OK.

> > > + * One of the limitations so far, is that it doesn't work with
> > > + * L0 emulation code, so L1 p2m_mmio_direct on top of L0 p2m_mmio_dm
> > > + * is not supported as for now.
> >
> > Is this something you intend to fix before we check it in?
>
> I intended to delay this to the time of virtual VT-D. Not only because
> L0 p2m is even skipped, but also hvm_emulate needs non-trivial change,
> at least:
>   1. reload gva_to_gfn to work with 2 level hap
>   2. hvm_emulate needs explicit knowledge of 2 level hap, because it has
>      to differentiate a guest page fault and a guest hap fault.

Urgh. I can see why you want to leave this out for now. :)

> > > +    for ( i = 0; i < VEPT_MAX_SLOTS; i++ )
> > > +    {
> > > +        slot = xmalloc(struct vept_slot);
> > > +        if ( slot == NULL )
> > > +            break;
> > > +
> > > +        memset(slot, 0, sizeof(*slot));
> > > +
> > > +        INIT_LIST_HEAD(&slot->list);
> > > +        INIT_PAGE_LIST_HEAD(&slot->page_list);
> > > +
> > > +        list_add(&slot->list, &vept->free_slots);
> > > +    }
> > > +
> > > +    for ( i = 0; i < VEPT_ALLOCATION_SIZE; i++ )
> >
> > Why a fixed 2MB allocation? What if your nested domains are very large?
> >
> > > +    {
> > > +        pg = alloc_domheap_page(NULL, MEMF_node(domain_to_node(v->domain)));
> >
> > Shouldn't this be allocated from the paging pool like other EPT memory?
>
> It just creates a pool for every vcpu like the hap paging pool to avoid
> locking (vept is not per-domain). Besides, I think the reservation
> adjustment of EPT paging pool is not dynamic as well but through
> domctls (I'm not that sure on this, just can't find the dynamic part)?

I think it's best to take this from the general EPT memory, rather than
adding this extra overhead to every VCPU, even if the amount isn't
dynamic.

> But yes, definitely, dynamic adjustment is needed.
>
> >
> > > +        if ( pg == NULL )
> > > +            break;
> >
> > Return an error?
>
> It doesn't fill the pool but still has some water that can be worked
> with, so silently ignore.

But it might have no water at all if the first allocation fails!

> Also, what do you think on the separation of vept logics in general?
> It does some part of memory management work but circumvents the normal
> paging system. I think this makes vept simpler (than creating a `shadow hap'),
> and avoids introducing nested related bugs in generic paging code,
> but am also worried about its adaption with future changes.

I'm happy enough for the vept code to be separate (although possibly
some of the EPT-walking code could be made common). In fact it's quite
helpful in some ways to limit the confusion that having so many address
spaces introduces. :)

The important thing is that the paging interface must still do the right
thing in every situation, without its callers needing to know about
vept. That will need hooks in the normal EPT code to call into the vept
code.

Cheers,

Tim.

-- 
Tim Deegan <Tim.Deegan@citrix.com>
Principal Software Engineer, XenServer Engineering
Citrix Systems UK Ltd.  (Company #02937203, SL9 0BG)