Jan Beulich
[PATCH 2/2] x86/vMCE: save/restore MCA capabilities

This allows migration to a host with fewer MCA banks than the source
host had; without this patch, accesses to the excess banks' MSRs caused
#GPs in the guest after migration (and it depended on the guest kernel
whether this would be fatal).

A fundamental question is whether we should also save/restore MCG_CTL
and MCi_CTL, as the HVM save record had better be defined to cover the
complete state that needs saving from the beginning (I'm unsure whether
the save/restore logic allows for future extension of an existing
record).

Of course, this change is expected to make migration from newer to older
Xen impossible (again I'm unsure what the save/restore logic does with
records it doesn't even know about).

The (trivial) tools side change may seem unrelated, but the code should
have been that way from the beginning to allow the hypervisor to look at
currently unused ext_vcpucontext fields without risking reading garbage
when those fields get a meaning assigned in the future. This isn't being
enforced here - should it be? (Obviously, for backwards compatibility,
the hypervisor must assume these fields to be clear only when the
extended context's size exceeds the old original one.)

A future addition to this change might be to allow configuring the
number of banks and other MCA capabilities for a guest before it starts
(i.e. so that it does not inherit the values seen on the first host it
runs on).

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/tools/libxc/xc_domain_save.c
+++ b/tools/libxc/xc_domain_save.c
@@ -1879,6 +1879,7 @@ int xc_domain_save(xc_interface *xch, in
 
             domctl.cmd = XEN_DOMCTL_get_ext_vcpucontext;
             domctl.domain = dom;
+            memset(&domctl.u, 0, sizeof(domctl.u));
             domctl.u.ext_vcpucontext.vcpu = i;
             if ( xc_domctl(xch, &domctl) < 0 )
             {
--- a/tools/misc/xen-hvmctx.c
+++ b/tools/misc/xen-hvmctx.c
@@ -383,6 +383,13 @@ static void dump_viridian_vcpu(void)
            (unsigned long long) p.apic_assist);
 }
 
+static void dump_vmce_vcpu(void)
+{
+    HVM_SAVE_TYPE(VMCE_VCPU) p;
+    READ(p);
+    printf("    VMCE_VCPU: caps %" PRIx64 "\n", p.caps);
+}
+
 int main(int argc, char **argv)
 {
     int entry, domid;
@@ -449,6 +456,7 @@ int main(int argc, char **argv)
         case HVM_SAVE_CODE(MTRR): dump_mtrr(); break;
         case HVM_SAVE_CODE(VIRIDIAN_DOMAIN): dump_viridian_domain(); break;
         case HVM_SAVE_CODE(VIRIDIAN_VCPU): dump_viridian_vcpu(); break;
+        case HVM_SAVE_CODE(VMCE_VCPU): dump_vmce_vcpu(); break;
         case HVM_SAVE_CODE(END): break;
         default:
             printf(" ** Don't understand type %u: skipping\n",
--- a/xen/arch/x86/cpu/mcheck/mce.h
+++ b/xen/arch/x86/cpu/mcheck/mce.h
@@ -3,6 +3,7 @@
 #define _MCE_H
 
 #include <xen/init.h>
+#include <xen/sched.h>
 #include <xen/smp.h>
 #include <asm/types.h>
 #include <asm/traps.h>
@@ -54,8 +55,8 @@ int unmmap_broken_page(struct domain *d,
 u64 mce_cap_init(void);
 extern unsigned int firstbank;
 
-int intel_mce_rdmsr(uint32_t msr, uint64_t *val);
-int intel_mce_wrmsr(uint32_t msr, uint64_t val);
+int intel_mce_rdmsr(const struct vcpu *, uint32_t msr, uint64_t *val);
+int intel_mce_wrmsr(struct vcpu *, uint32_t msr, uint64_t val);
 
 struct mcinfo_extended *intel_get_extended_msrs(
     struct mcinfo_global *mig, struct mc_info *mi);
@@ -171,18 +172,20 @@ int vmce_domain_inject(struct mcinfo_ban
 
 extern int vmce_init(struct cpuinfo_x86 *c);
 
-static inline int mce_vendor_bank_msr(uint32_t msr)
+static inline int mce_vendor_bank_msr(const struct vcpu *v, uint32_t msr)
 {
     if ( boot_cpu_data.x86_vendor == X86_VENDOR_INTEL &&
-         msr >= MSR_IA32_MC0_CTL2 && msr < (MSR_IA32_MC0_CTL2 + nr_mce_banks) )
+         msr >= MSR_IA32_MC0_CTL2 &&
+         msr < MSR_IA32_MCx_CTL2(v->arch.mcg_cap & MCG_CAP_COUNT) )
         return 1;
     return 0;
 }
 
-static inline int mce_bank_msr(uint32_t msr)
+static inline int mce_bank_msr(const struct vcpu *v, uint32_t msr)
 {
-    if ( (msr >= MSR_IA32_MC0_CTL && msr < MSR_IA32_MCx_CTL(nr_mce_banks)) ||
-         mce_vendor_bank_msr(msr) )
+    if ( (msr >= MSR_IA32_MC0_CTL &&
+          msr < MSR_IA32_MCx_CTL(v->arch.mcg_cap & MCG_CAP_COUNT)) ||
+         mce_vendor_bank_msr(v, msr) )
         return 1;
     return 0;
 }
--- a/xen/arch/x86/cpu/mcheck/mce_intel.c
+++ b/xen/arch/x86/cpu/mcheck/mce_intel.c
@@ -1421,11 +1421,12 @@ enum mcheck_type intel_mcheck_init(struc
 }
 
 /* intel specific MCA MSR */
-int intel_mce_wrmsr(uint32_t msr, uint64_t val)
+int intel_mce_wrmsr(struct vcpu *v, uint32_t msr, uint64_t val)
 {
     int ret = 0;
 
-    if (msr >= MSR_IA32_MC0_CTL2 && msr < (MSR_IA32_MC0_CTL2 + nr_mce_banks))
+    if ( msr >= MSR_IA32_MC0_CTL2 &&
+         msr < MSR_IA32_MCx_CTL2(v->arch.mcg_cap & MCG_CAP_COUNT) )
     {
         mce_printk(MCE_QUIET, "We have disabled CMCI capability, "
                    "Guest should not write this MSR!\n");
@@ -1435,11 +1436,12 @@ int intel_mce_wrmsr(uint32_t msr, uint64
     return ret;
 }
 
-int intel_mce_rdmsr(uint32_t msr, uint64_t *val)
+int intel_mce_rdmsr(const struct vcpu *v, uint32_t msr, uint64_t *val)
 {
     int ret = 0;
 
-    if (msr >= MSR_IA32_MC0_CTL2 && msr < (MSR_IA32_MC0_CTL2 + nr_mce_banks))
+    if ( msr >= MSR_IA32_MC0_CTL2 &&
+         msr < MSR_IA32_MCx_CTL2(v->arch.mcg_cap & MCG_CAP_COUNT) )
     {
         mce_printk(MCE_QUIET, "We have disabled CMCI capability, "
                    "Guest should not read this MSR!\n");
--- a/xen/arch/x86/cpu/mcheck/vmce.c
+++ b/xen/arch/x86/cpu/mcheck/vmce.c
@@ -10,6 +10,7 @@
 #include <xen/delay.h>
 #include <xen/smp.h>
 #include <xen/mm.h>
+#include <xen/hvm/save.h>
 #include <asm/processor.h>
 #include <public/sysctl.h>
 #include <asm/system.h>
@@ -42,7 +43,6 @@ int vmce_init_msr(struct domain *d)
            nr_mce_banks * sizeof(*dom_vmce(d)->mci_ctl));
 
     dom_vmce(d)->mcg_status = 0x0;
-    dom_vmce(d)->mcg_cap = g_mcg_cap;
     dom_vmce(d)->mcg_ctl = ~(uint64_t)0x0;
     dom_vmce(d)->nr_injection = 0;
 
@@ -61,21 +61,41 @@ void vmce_destroy_msr(struct domain *d)
     dom_vmce(d) = NULL;
 }
 
-static int bank_mce_rdmsr(struct domain *d, uint32_t msr, uint64_t *val)
+void vmce_init_vcpu(struct vcpu *v)
 {
-    int bank, ret = 1;
-    struct domain_mca_msrs *vmce = dom_vmce(d);
+    v->arch.mcg_cap = g_mcg_cap;
+}
+
+int vmce_restore_vcpu(struct vcpu *v, uint64_t caps)
+{
+    if ( caps & ~g_mcg_cap & ~MCG_CAP_COUNT & ~MCG_CTL_P )
+    {
+        dprintk(XENLOG_G_ERR, "%s restore: unsupported MCA capabilities"
+                " %#" PRIx64 " for d%d:v%u (supported: %#Lx)\n",
+                is_hvm_vcpu(v) ? "HVM" : "PV", caps, v->domain->domain_id,
+                v->vcpu_id, g_mcg_cap & ~MCG_CAP_COUNT);
+        return -EPERM;
+    }
+
+    v->arch.mcg_cap = caps;
+    return 0;
+}
+
+static int bank_mce_rdmsr(const struct vcpu *v, uint32_t msr, uint64_t *val)
+{
+    int ret = 1;
+    unsigned int bank = (msr - MSR_IA32_MC0_CTL) / 4;
+    struct domain_mca_msrs *vmce = dom_vmce(v->domain);
     struct bank_entry *entry;
 
-    bank = (msr - MSR_IA32_MC0_CTL) / 4;
-    if ( bank >= nr_mce_banks )
-        return -1;
+    *val = 0;
 
     switch ( msr & (MSR_IA32_MC0_CTL | 3) )
     {
     case MSR_IA32_MC0_CTL:
-        *val = vmce->mci_ctl[bank] &
-               (h_mci_ctrl ? h_mci_ctrl[bank] : ~0UL);
+        if ( bank < nr_mce_banks )
+            *val = vmce->mci_ctl[bank] &
+                   (h_mci_ctrl ? h_mci_ctrl[bank] : ~0UL);
         mce_printk(MCE_VERBOSE, "MCE: rdmsr MC%u_CTL 0x%"PRIx64"\n",
                    bank, *val);
         break;
@@ -126,7 +146,7 @@ static int bank_mce_rdmsr(struct domain
         switch ( boot_cpu_data.x86_vendor )
         {
         case X86_VENDOR_INTEL:
-            ret = intel_mce_rdmsr(msr, val);
+            ret = intel_mce_rdmsr(v, msr, val);
             break;
         default:
             ret = 0;
@@ -145,13 +165,13 @@ static int bank_mce_rdmsr(struct domain
  */
 int vmce_rdmsr(uint32_t msr, uint64_t *val)
 {
-    struct domain *d = current->domain;
-    struct domain_mca_msrs *vmce = dom_vmce(d);
+    const struct vcpu *cur = current;
+    struct domain_mca_msrs *vmce = dom_vmce(cur->domain);
    int ret = 1;
 
     *val = 0;
 
-    spin_lock(&dom_vmce(d)->lock);
+    spin_lock(&vmce->lock);
 
     switch ( msr )
     {
@@ -162,39 +182,38 @@ int vmce_rdmsr(uint32_t msr, uint64_t *v
                    "MCE: rdmsr MCG_STATUS 0x%"PRIx64"\n", *val);
         break;
     case MSR_IA32_MCG_CAP:
-        *val = vmce->mcg_cap;
+        *val = cur->arch.mcg_cap;
         mce_printk(MCE_VERBOSE, "MCE: rdmsr MCG_CAP 0x%"PRIx64"\n",
                    *val);
         break;
     case MSR_IA32_MCG_CTL:
         /* Always 0 if no CTL support */
-        *val = vmce->mcg_ctl & h_mcg_ctl;
+        if ( cur->arch.mcg_cap & MCG_CTL_P )
+            *val = vmce->mcg_ctl & h_mcg_ctl;
         mce_printk(MCE_VERBOSE, "MCE: rdmsr MCG_CTL 0x%"PRIx64"\n",
                    *val);
         break;
     default:
-        ret = mce_bank_msr(msr) ? bank_mce_rdmsr(d, msr, val) : 0;
+        ret = mce_bank_msr(cur, msr) ? bank_mce_rdmsr(cur, msr, val) : 0;
         break;
     }
 
-    spin_unlock(&dom_vmce(d)->lock);
+    spin_unlock(&vmce->lock);
     return ret;
 }
 
-static int bank_mce_wrmsr(struct domain *d, u32 msr, u64 val)
+static int bank_mce_wrmsr(struct vcpu *v, u32 msr, u64 val)
 {
-    int bank, ret = 1;
-    struct domain_mca_msrs *vmce = dom_vmce(d);
+    int ret = 1;
+    unsigned int bank = (msr - MSR_IA32_MC0_CTL) / 4;
+    struct domain_mca_msrs *vmce = dom_vmce(v->domain);
     struct bank_entry *entry = NULL;
 
-    bank = (msr - MSR_IA32_MC0_CTL) / 4;
-    if ( bank >= nr_mce_banks )
-        return -EINVAL;
-
     switch ( msr & (MSR_IA32_MC0_CTL | 3) )
     {
     case MSR_IA32_MC0_CTL:
-        vmce->mci_ctl[bank] = val;
+        if ( bank < nr_mce_banks )
+            vmce->mci_ctl[bank] = val;
         break;
     case MSR_IA32_MC0_STATUS:
         /* Give the first entry of the list, it corresponds to current
@@ -202,9 +221,9 @@ static int bank_mce_wrmsr(struct domain
          * the guest, this node will be deleted.
          * Only error bank is written. Non-error banks simply return.
          */
-        if ( !list_empty(&dom_vmce(d)->impact_header) )
+        if ( !list_empty(&vmce->impact_header) )
         {
-            entry = list_entry(dom_vmce(d)->impact_header.next,
+            entry = list_entry(vmce->impact_header.next,
                                struct bank_entry, list);
             if ( entry->bank == bank )
                 entry->mci_status = val;
@@ -228,7 +247,7 @@ static int bank_mce_wrmsr(struct domain
         switch ( boot_cpu_data.x86_vendor )
         {
         case X86_VENDOR_INTEL:
-            ret = intel_mce_wrmsr(msr, val);
+            ret = intel_mce_wrmsr(v, msr, val);
             break;
         default:
             ret = 0;
@@ -247,9 +266,9 @@ static int bank_mce_wrmsr(struct domain
  */
 int vmce_wrmsr(u32 msr, u64 val)
 {
-    struct domain *d = current->domain;
+    struct vcpu *cur = current;
     struct bank_entry *entry = NULL;
-    struct domain_mca_msrs *vmce = dom_vmce(d);
+    struct domain_mca_msrs *vmce = dom_vmce(cur->domain);
     int ret = 1;
 
     if ( !g_mcg_cap )
@@ -266,7 +285,7 @@ int vmce_wrmsr(u32 msr, u64 val)
         vmce->mcg_status = val;
         mce_printk(MCE_VERBOSE, "MCE: wrmsr MCG_STATUS %"PRIx64"\n", val);
         /* For HVM guest, this is the point for deleting vMCE injection node */
-        if ( d->is_hvm && (vmce->nr_injection > 0) )
+        if ( is_hvm_vcpu(cur) && (vmce->nr_injection > 0) )
         {
             vmce->nr_injection--; /* Should be 0 */
             if ( !list_empty(&vmce->impact_header) )
@@ -293,7 +312,7 @@ int vmce_wrmsr(u32 msr, u64 val)
         ret = -1;
         break;
     default:
-        ret = mce_bank_msr(msr) ? bank_mce_wrmsr(d, msr, val) : 0;
+        ret = mce_bank_msr(cur, msr) ? bank_mce_wrmsr(cur, msr, val) : 0;
         break;
     }
 
@@ -301,6 +320,46 @@ int vmce_wrmsr(u32 msr, u64 val)
     return ret;
 }
 
+static int vmce_save_vcpu_ctxt(struct domain *d, hvm_domain_context_t *h)
+{
+    struct vcpu *v;
+    int err = 0;
+
+    for_each_vcpu( d, v ) {
+        struct hvm_vmce_vcpu ctxt = {
+            .caps = v->arch.mcg_cap
+        };
+
+        err = hvm_save_entry(VMCE_VCPU, v->vcpu_id, h, &ctxt);
+        if ( err )
+            break;
+    }
+
+    return err;
+}
+
+static int vmce_load_vcpu_ctxt(struct domain *d, hvm_domain_context_t *h)
+{
+    unsigned int vcpuid = hvm_load_instance(h);
+    struct vcpu *v;
+    struct hvm_vmce_vcpu ctxt;
+    int err;
+
+    if ( vcpuid >= d->max_vcpus || (v = d->vcpu[vcpuid]) == NULL )
+    {
+        dprintk(XENLOG_G_ERR, "HVM restore: dom%d has no vcpu%u\n",
+                d->domain_id, vcpuid);
+        err = -EINVAL;
+    }
+    else
+        err = hvm_load_entry(VMCE_VCPU, h, &ctxt);
+
+    return err ?: vmce_restore_vcpu(v, ctxt.caps);
+}
+
+HVM_REGISTER_SAVE_RESTORE(VMCE_VCPU, vmce_save_vcpu_ctxt,
+                          vmce_load_vcpu_ctxt, 1, HVMSR_PER_VCPU);
+
 int inject_vmce(struct domain *d)
 {
     int cpu = smp_processor_id();
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -422,6 +422,8 @@ int vcpu_initialise(struct vcpu *v)
     if ( (rc = vcpu_init_fpu(v)) != 0 )
         return rc;
 
+    vmce_init_vcpu(v);
+
     if ( is_hvm_domain(d) )
     {
         rc = hvm_vcpu_initialise(v);
--- a/xen/arch/x86/domctl.c
+++ b/xen/arch/x86/domctl.c
@@ -1027,11 +1027,12 @@ long arch_do_domctl(
             evc->syscall32_callback_eip = 0;
             evc->syscall32_disables_events = 0;
         }
+        evc->mcg_cap = v->arch.mcg_cap;
     }
     else
     {
         ret = -EINVAL;
-        if ( evc->size != sizeof(*evc) )
+        if ( evc->size < offsetof(typeof(*evc), mcg_cap) )
             goto ext_vcpucontext_out;
 #ifdef __x86_64__
         if ( !is_hvm_domain(d) )
@@ -1059,6 +1060,10 @@ long arch_do_domctl(
              (evc->syscall32_callback_cs & ~3) ||
              evc->syscall32_callback_eip )
             goto ext_vcpucontext_out;
+
+        if ( evc->size >= offsetof(typeof(*evc), mcg_cap) +
+                          sizeof(evc->mcg_cap) )
+            ret = vmce_restore_vcpu(v, evc->mcg_cap);
     }
 
     ret = 0;
--- a/xen/include/asm-x86/domain.h
+++ b/xen/include/asm-x86/domain.h
@@ -488,6 +488,8 @@ struct arch_vcpu
     /* This variable determines whether nonlazy extended state has been used,
      * and thus should be saved/restored. */
     bool_t nonlazy_xstate_used;
+
+    uint64_t mcg_cap;
 
     struct paging_vcpu paging;
 
--- a/xen/include/asm-x86/mce.h
+++ b/xen/include/asm-x86/mce.h
@@ -16,7 +16,6 @@ struct bank_entry {
 struct domain_mca_msrs
 {
     /* Guest should not change below values after DOM boot up */
-    uint64_t mcg_cap;
     uint64_t mcg_ctl;
     uint64_t mcg_status;
     uint64_t *mci_ctl;
@@ -28,6 +27,8 @@ struct domain_mca_msrs
 /* Guest vMCE MSRs virtualization */
 extern int vmce_init_msr(struct domain *d);
 extern void vmce_destroy_msr(struct domain *d);
+extern void vmce_init_vcpu(struct vcpu *);
+extern int vmce_restore_vcpu(struct vcpu *, uint64_t caps);
 extern int vmce_wrmsr(uint32_t msr, uint64_t val);
 extern int vmce_rdmsr(uint32_t msr, uint64_t *val);
 
--- a/xen/include/public/arch-x86/hvm/save.h
+++ b/xen/include/public/arch-x86/hvm/save.h
@@ -585,9 +585,15 @@ struct hvm_viridian_vcpu_context {
 
 DECLARE_HVM_SAVE_TYPE(VIRIDIAN_VCPU, 17, struct hvm_viridian_vcpu_context);
 
+struct hvm_vmce_vcpu {
+    uint64_t caps;
+};
+
+DECLARE_HVM_SAVE_TYPE(VMCE_VCPU, 18, struct hvm_vmce_vcpu);
+
 /*
  * Largest type-code in use
  */
-#define HVM_SAVE_CODE_MAX 17
+#define HVM_SAVE_CODE_MAX 18
 
 #endif /* __XEN_PUBLIC_HVM_SAVE_X86_H__ */
--- a/xen/include/public/domctl.h
+++ b/xen/include/public/domctl.h
@@ -559,7 +559,7 @@ struct xen_domctl_ext_vcpucontext {
     uint32_t vcpu;
     /*
      * SET: Size of struct (IN)
-     * GET: Size of struct (OUT)
+     * GET: Size of struct (OUT, up to 128 bytes)
      */
     uint32_t size;
 #if defined(__i386__) || defined(__x86_64__)
@@ -571,6 +571,7 @@ struct xen_domctl_ext_vcpucontext {
     uint16_t sysenter_callback_cs;
     uint8_t syscall32_disables_events;
     uint8_t sysenter_disables_events;
+    uint64_aligned_t mcg_cap;
 #endif
 };
 typedef struct xen_domctl_ext_vcpucontext xen_domctl_ext_vcpucontext_t;
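For readers following the MSR arithmetic above: the bank-range check the
patch switches to can be illustrated standalone. This is a minimal sketch
rather than code from the patch; the constant values are architectural
(IA32_MC0_CTL is MSR 0x400, and the bank count lives in MCG_CAP[7:0]),
while the helper name is made up here:

#include <stdint.h>

#define MSR_IA32_MC0_CTL    0x400
#define MSR_IA32_MCx_CTL(x) (MSR_IA32_MC0_CTL + 4 * (x))
#define MCG_CAP_COUNT       0xff   /* bank count field of IA32_MCG_CAP */

/* Each bank owns 4 consecutive MSRs (CTL/STATUS/ADDR/MISC), so the
 * guest-visible range is [MC0_CTL, MCx_CTL(count)).  After this patch
 * the bound comes from the vCPU's saved mcg_cap rather than the host's
 * nr_mce_banks, which is what keeps a migrated guest from taking #GP. */
static int is_guest_mca_bank_msr(uint64_t mcg_cap, uint32_t msr)
{
    unsigned int count = mcg_cap & MCG_CAP_COUNT;

    return msr >= MSR_IA32_MC0_CTL && msr < MSR_IA32_MCx_CTL(count);
}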
Liu, Jinsong
2012-Mar-05 20:19 UTC
Re: [PATCH 2/2] x86/vMCE: save/restore MCA capabilities
Jan Beulich wrote:
> This allows migration to a host with fewer MCA banks than the source
> host had; without this patch, accesses to the excess banks' MSRs caused
> #GPs in the guest after migration (and it depended on the guest kernel
> whether this would be fatal).
>
> A fundamental question is whether we should also save/restore MCG_CTL
> and MCi_CTL, as the HVM save record had better be defined to cover the
> complete state that needs saving from the beginning (I'm unsure whether
> the save/restore logic allows for future extension of an existing
> record).

Not sure about this point. I always feel confused about the meaning of
MCG_CTL/MCi_CTL, and their definition in the SDM looks ambiguous to me.
ASK TONY FOR HELP: what is the real h/w meaning of MCG_CTL/MCi_CTL? The
MCE logic seems to seldom rely on them, especially bit-by-bit on MCi_CTL.

Another question is, why does the patch define mcg_cap as per-vCPU while
the others (mcg_ctl/mcg_status/mci_ctl) stay per-domain? Semantically it
looks somewhat weird anyway.

Thanks,
Jinsong
Luck, Tony
Re: [PATCH 2/2] x86/vMCE: save/restore MCA capabilities

Meaning of the MCi_CTL registers: a number of different h/w structures
and error events within those structures may be reported in a single
bank. The MCi_CTL register for that bank has a bitmask that allows each
of the errors to be enabled individually. The detail of which bits in
the register enable which errors is model specific. Architecturally the
SDM recommends writing 0xffffff...fffff to them to enable all errors.
Sometimes, if there is a problem on a cpu, we might disable some bits
(usually handled by BIOS or microcode quietly clearing, or never
setting, some of the bits).

MCG_CTL does a similar thing - but at a global scope - for this cpu
(i.e. it can affect whether an error is reported in any of the banks
that are associated with it). Each logical cpu has its own copy of the
MCG_* registers - so the 'G' for 'Global' isn't all the way 'global',
just locally global :-) - I think the names were given before
multi-thread/multi-core. Again the exact meaning of the bits is model
specific, and the architectural recommendation is to enable everything
with a write of 0xffffff....fffff.

I'm not a virtualization expert - so I'm not sure what the tradeoffs are
on providing almost pass-through access to the host banks. I'd have
thought that most useful functionality could be provided with a wholly
virtualized approach:

1) The guests really don't need to see any corrected errors at all. If
there are any predictive actions to be taken (e.g. too many corrected
errors from a memory location -> stop using a page) these can be taken
at the hypervisor level.

2) Recoverable errors - the hypervisor could remap these from whatever
bank they occur in to a virtual bank. It already needs to convert real
physical addresses to guest physical - so this shouldn't be much extra
effort.

3) Fatal errors - the whole system (with all guests) is going down -
really not much value in trying to tell all the guests exactly why.

But perhaps I'm missing some subtlety.

-Tony
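For concreteness, the "enable everything" programming Tony recommends
would look roughly like this at OS boot time. An illustrative sketch
only, not Xen or Linux code: the rdmsr()/wrmsr() wrappers are assumed
helpers, while the MSR addresses are the architectural ones (IA32_MCG_CAP
is 0x179, IA32_MCG_CTL is 0x17b):

#include <stdint.h>

#define MSR_IA32_MCG_CAP    0x179
#define MSR_IA32_MCG_CTL    0x17b
#define MSR_IA32_MCx_CTL(n) (0x400 + 4 * (n))
#define MCG_CAP_COUNT       0xff
#define MCG_CTL_P           (1ULL << 8)   /* MCG_CTL register present */

extern uint64_t rdmsr(uint32_t msr);       /* assumed MSR access helpers */
extern void wrmsr(uint32_t msr, uint64_t val);

static void mca_enable_all(void)
{
    uint64_t cap = rdmsr(MSR_IA32_MCG_CAP);
    unsigned int bank, nbanks = cap & MCG_CAP_COUNT;

    /* Global enable first, but only if MCG_CTL exists (MCG_CAP[8]). */
    if ( cap & MCG_CTL_P )
        wrmsr(MSR_IA32_MCG_CTL, ~0ULL);

    /* All 1's per bank: every model-specific error source enabled. */
    for ( bank = 0; bank < nbanks; bank++ )
        wrmsr(MSR_IA32_MCx_CTL(bank), ~0ULL);
}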
Liu, Jinsong
2012-Mar-06 03:49 UTC
Re: [PATCH 2/2] x86/vMCE: save/restore MCA capabilities
Luck, Tony wrote:
> Meaning of the MCi_CTL registers: a number of different h/w structures
> and error events within those structures may be reported in a single
> bank. The MCi_CTL register for that bank has a bitmask that allows each
> of the errors to be enabled individually. The detail of which bits in
> the register enable which errors is model specific. Architecturally the
> SDM recommends writing 0xffffff...fffff to them to enable all errors.
>
> MCG_CTL does a similar thing - but at a global scope - for this cpu.
> Again the exact meaning of the bits is model specific, and the
> architectural recommendation is to enable everything with a write of
> 0xffffff....fffff.

Thanks Tony! So MCG_CTL/MCi_CTL are model specific and their main purpose
is debugging. For kernel MCA logic, software sets all 1's by default and
will not change them, right?

> 1) The guests really don't need to see any corrected errors at all. If
> there are any predictive actions to be taken (e.g. too many corrected
> errors from a memory location -> stop using a page) these can be taken
> at the hypervisor level.

For guest corrected errors, currently Xen delivers a CMCI to the guest
and lets the guest itself make the decision.

> 2) Recoverable errors - the hypervisor could remap these from whatever
> bank they occur in to a virtual bank. It already needs to convert real
> physical addresses to guest physical - so this shouldn't be much extra
> effort.
>
> 3) Fatal errors - the whole system (with all guests) is going down -
> really not much value in trying to tell all the guests exactly why.

Yes, that is exactly what Xen does for recoverable and fatal errors.

Thanks,
Jinsong
Jan Beulich
Re: [PATCH 2/2] x86/vMCE: save/restore MCA capabilities

>>> On 05.03.12 at 21:19, "Liu, Jinsong" <jinsong.liu@intel.com> wrote:
> Another question is, why does the patch define mcg_cap as per-vCPU while
> the others (mcg_ctl/mcg_status/mci_ctl) stay per-domain? Semantically it
> looks somewhat weird anyway.

That question goes back to you (as a company), who wrote the original
code: why were these made per-domain in the first place when they're
really per-vCPU? All I did here is make per-vCPU what I needed to touch
anyway for save/restore (in order to not carry the same mistake over
into the save/restore definitions and logic).

Jan
Liu, Jinsong
2012-Mar-06 08:29 UTC
Re: [PATCH 2/2] x86/vMCE: save/restore MCA capabilities
Jan Beulich wrote:
>>>> On 05.03.12 at 21:19, "Liu, Jinsong" <jinsong.liu@intel.com> wrote:
>>> Another question is, why does the patch define mcg_cap as per-vCPU
>>> while the others (mcg_ctl/mcg_status/mci_ctl) stay per-domain?
>>
>> That question goes back to you (as a company), who wrote the original
>> code: why were these made per-domain in the first place when they're
>> really per-vCPU? All I did here is make per-vCPU what I needed to
>> touch anyway for save/restore (in order to not carry the same mistake
>> over into the save/restore definitions and logic).

Per my understanding, the original design may have assumed that per-vCPU
MCA MSRs were not necessary: vCPUs are not 1:1 with pCPUs, and multiple
copies of the same error info (if per-vCPU) would be pointless, so they
were made per-domain, as long as the MCA MSRs were successfully emulated
for the guest.

Thanks,
Jinsong
Jan Beulich
Re: [PATCH 2/2] x86/vMCE: save/restore MCA capabilities

>>> On 06.03.12 at 09:29, "Liu, Jinsong" <jinsong.liu@intel.com> wrote:
> Per my understanding, the original design may have assumed that per-vCPU
> MCA MSRs were not necessary: vCPUs are not 1:1 with pCPUs, and multiple
> copies of the same error info (if per-vCPU) would be pointless, so they
> were made per-domain, as long as the MCA MSRs were successfully emulated
> for the guest.

But as with anything that is done differently when virtualized, this may
come back and bite us when per-vCPU treatment turns out to be needed. As
long as this was purely internal to a particular hypervisor instance,
this was just poor design needing a potentially larger patch to
overcome, but with the involvement of save/restore it needs to be done
properly to allow forward compatibility.

But we're getting farther and farther away from the actual question: do
we need to provide for saving/restoring of any of the _CTL registers?

Jan
Liu, Jinsong
2012-Mar-06 09:28 UTC
Re: [PATCH 2/2] x86/vMCE: save/restore MCA capabilities
Jan Beulich wrote:
> But we're getting farther and farther away from the actual question: do
> we need to provide for saving/restoring of any of the _CTL registers?

Per Tony's elaboration of the _CTL h/w meaning, I understand they are
model specific, mainly used for debug purposes, and the OS by default
sets them to all 1's (if I misunderstand anything, please point it out).
So how about unbinding _CTL from the host (say, a purely
software-emulated MSR, not involving h_mcg_ctl/h_mci_ctrl[bank])? If so,
we don't need to save/restore _CTL. After all, they are model specific,
and emulating them as all 1's to the guest seems reasonable.

Thanks,
Jinsong
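A minimal sketch of what that decoupling might look like (hypothetical,
simplified types, not from any posted patch): guest writes are remembered
verbatim, reads return them without ANDing in the host's h_mci_ctrl[]
mask, and the reset value is all 1's, so nothing host-specific is left
to migrate:

#include <stdint.h>

#define VMCE_MAX_BANKS 32   /* hypothetical fixed bound for the sketch */

/* Simplified stand-in for the per-domain vMCE state. */
struct vmce_ctl_state {
    uint64_t mcg_ctl;
    uint64_t mci_ctl[VMCE_MAX_BANKS];
};

static void vmce_ctl_reset(struct vmce_ctl_state *s)
{
    unsigned int i;

    s->mcg_ctl = ~0ULL;                 /* architectural "enable all" */
    for ( i = 0; i < VMCE_MAX_BANKS; i++ )
        s->mci_ctl[i] = ~0ULL;
}

/* Reads no longer consult host h_mcg_ctl/h_mci_ctrl[]: the value is
 * fully guest-defined, so save/restore is a plain copy of this
 * structure (or can be skipped entirely if writes are ignored). */
static uint64_t vmce_mci_ctl_read(const struct vmce_ctl_state *s,
                                  unsigned int bank)
{
    return s->mci_ctl[bank];
}

static void vmce_mci_ctl_write(struct vmce_ctl_state *s,
                               unsigned int bank, uint64_t val)
{
    s->mci_ctl[bank] = val;             /* remembered verbatim */
}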
>>> On 06.03.12 at 10:28, "Liu, Jinsong" <jinsong.liu@intel.com> wrote:
> Jan Beulich wrote:
>> But we're getting farther and farther away from the actual question:
>> Do we need to provide for saving/restoring of any of the _CTL
>> registers?
>
> Per Tony's elaboration on the _CTL hardware meaning, I understand they
> are model specific, mainly used for debug purposes, and the OS by
> default sets them to all 1's (if I misunderstand anything, please
> point it out).
> So how about unbinding _CTL from the host (say, purely software
> emulated MSRs, not involving h_mcg_ctl/h_mci_ctrl[bank])? Then we
> would not need to save/restore _CTL. After all, they are model
> specific, and emulating them as all 1's to the guest seems reasonable.

If the guest OS considers a particular CPU model to require an
adjustment to any of these, any such adjustment would be lost over
migration. I'm simply uncertain whether all OSes will tolerate that (in
which case ignoring the writes in the first place would probably be
better).

Jan
Liu, Jinsong
2012-Mar-06 11:55 UTC
Re: [PATCH 2/2] x86/vMCE: save/restore MCA capabilities
Jan Beulich wrote:
>>>> On 06.03.12 at 10:28, "Liu, Jinsong" <jinsong.liu@intel.com> wrote:
>> Jan Beulich wrote:
>>> But we're getting farther and farther away from the actual
>>> question: Do we need to provide for saving/restoring of any of the
>>> _CTL registers?
>>
>> Per Tony's elaboration on the _CTL hardware meaning, I understand
>> they are model specific, mainly used for debug purposes, and the OS
>> by default sets them to all 1's (if I misunderstand anything, please
>> point it out).
>> So how about unbinding _CTL from the host (say, purely software
>> emulated MSRs, not involving h_mcg_ctl/h_mci_ctrl[bank])? Then we
>> would not need to save/restore _CTL. After all, they are model
>> specific, and emulating them as all 1's to the guest seems
>> reasonable.
>
> If the guest OS considers a particular CPU model to require an
> adjustment to any of these, any such adjustment would be lost over
> migration. I'm simply uncertain whether all OSes will tolerate that
> (in which case ignoring the writes in the first place would probably
> be better).

I'm unsure of the risk, but if OS tolerance is the concern, it would be
better to avoid such an inconsistent case. An updated approach is
purely software emulated _CTL plus save/restore, which would get rid of
the hardware heterogeneity and keep things consistent across migration.
Does that make sense?

Thanks,
Jinsong
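A minimal sketch of that variant (assumed record layout; nothing like
this was committed): keep the guest's _CTL writes purely in software
and carry them in a save record, so the guest reads back exactly what
it wrote on either host.

    #include <stdint.h>
    #include <string.h>

    #define GUEST_NR_BANKS 6           /* assumed per-guest bank count */

    /* Hypothetical save record: the guest's _CTL values, kept purely in
     * software and never applied to the host MSRs. */
    struct vmce_ctl_record {
        uint64_t mcg_ctl;
        uint64_t mci_ctl[GUEST_NR_BANKS];
    };

    /* Reset state: everything enabled, matching what OSes write anyway. */
    static void vmce_ctl_init(struct vmce_ctl_record *r)
    {
        unsigned int i;

        r->mcg_ctl = ~0ULL;
        for ( i = 0; i < GUEST_NR_BANKS; i++ )
            r->mci_ctl[i] = ~0ULL;
    }

    /* Save/restore is then a plain copy of the record: the guest sees a
     * consistent view regardless of host heterogeneity. */
    static void vmce_ctl_copy(struct vmce_ctl_record *dst,
                              const struct vmce_ctl_record *src)
    {
        memcpy(dst, src, sizeof(*dst));
    }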
>>> On 06.03.12 at 12:55, "Liu, Jinsong" <jinsong.liu@intel.com> wrote:
> Jan Beulich wrote:
>>>>> On 06.03.12 at 10:28, "Liu, Jinsong" <jinsong.liu@intel.com> wrote:
>>> Jan Beulich wrote:
>>>> But we're getting farther and farther away from the actual
>>>> question: Do we need to provide for saving/restoring of any of the
>>>> _CTL registers?
>>>
>>> Per Tony's elaboration on the _CTL hardware meaning, I understand
>>> they are model specific, mainly used for debug purposes, and the OS
>>> by default sets them to all 1's (if I misunderstand anything, please
>>> point it out).
>>> So how about unbinding _CTL from the host (say, purely software
>>> emulated MSRs, not involving h_mcg_ctl/h_mci_ctrl[bank])? Then we
>>> would not need to save/restore _CTL. After all, they are model
>>> specific, and emulating them as all 1's to the guest seems
>>> reasonable.
>>
>> If the guest OS considers a particular CPU model to require an
>> adjustment to any of these, any such adjustment would be lost over
>> migration. I'm simply uncertain whether all OSes will tolerate that
>> (in which case ignoring the writes in the first place would probably
>> be better).
>
> I'm unsure of the risk, but if OS tolerance is the concern, it would
> be better to avoid such an inconsistent case.
> An updated approach is purely software emulated _CTL plus
> save/restore, which would get rid of the hardware heterogeneity and
> keep things consistent across migration.
> Does that make sense?

That would be an option, but again only if OSes don't make assumptions
about the number of banks for certain CPU models.

Jan
Liu, Jinsong
2012-Mar-06 14:38 UTC
Re: [PATCH 2/2] x86/vMCE: save/restore MCA capabilities
Jan Beulich wrote:
>>>> On 06.03.12 at 12:55, "Liu, Jinsong" <jinsong.liu@intel.com> wrote:
>> Jan Beulich wrote:
>>>>>> On 06.03.12 at 10:28, "Liu, Jinsong" <jinsong.liu@intel.com>
>>>>>> wrote:
>>>> Jan Beulich wrote:
>>>>> But we're getting farther and farther away from the actual
>>>>> question: Do we need to provide for saving/restoring of any of
>>>>> the _CTL registers?
>>>>
>>>> Per Tony's elaboration on the _CTL hardware meaning, I understand
>>>> they are model specific, mainly used for debug purposes, and the
>>>> OS by default sets them to all 1's (if I misunderstand anything,
>>>> please point it out).
>>>> So how about unbinding _CTL from the host (say, purely software
>>>> emulated MSRs, not involving h_mcg_ctl/h_mci_ctrl[bank])? Then we
>>>> would not need to save/restore _CTL. After all, they are model
>>>> specific, and emulating them as all 1's to the guest seems
>>>> reasonable.
>>>
>>> If the guest OS considers a particular CPU model to require an
>>> adjustment to any of these, any such adjustment would be lost over
>>> migration. I'm simply uncertain whether all OSes will tolerate that
>>> (in which case ignoring the writes in the first place would
>>> probably be better).
>>
>> I'm unsure of the risk, but if OS tolerance is the concern, it would
>> be better to avoid such an inconsistent case. An updated approach is
>> purely software emulated _CTL plus save/restore, which would get rid
>> of the hardware heterogeneity and keep things consistent across
>> migration.
>> Does that make sense?
>
> That would be an option, but again only if OSes don't make
> assumptions about the number of banks for certain CPU models.

AFAICT the only way the SDM recommends to get the bank number is via
mcg_cap, so if an OS assumes a bank number based on the CPU model, it
would either get the same number as via mcg_cap, or get a wrong number
- which is an OS problem, not Xen's.

Thanks,
Jinsong
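For reference, the interface in question is architectural: the bank
count is the low 8 bits ("Count") of IA32_MCG_CAP. A sketch of the
discovery an OS is expected to do (rdmsr64() is a placeholder for the
platform's MSR read):

    #include <stdint.h>

    #define MSR_IA32_MCG_CAP  0x179

    /* Assumed helper: read a 64-bit MSR. */
    extern uint64_t rdmsr64(uint32_t msr);

    static unsigned int mca_bank_count(void)
    {
        uint64_t cap = rdmsr64(MSR_IA32_MCG_CAP);

        /* IA32_MCG_CAP[7:0] is the number of reporting banks; an OS
         * that instead hardcodes a count per CPU model relies on
         * something the SDM does not guarantee. */
        return (unsigned int)(cap & 0xff);
    }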
> Thanks Tony! So MCG_CTL/MCi_CTL are model specific and their main
> purpose is debugging. For the kernel MCA logic, software sets them to
> all 1's by default and does not change them, right?

Linux sets all ones - I don't know what other OS guests might do.

-Tony
Liu, Jinsong
2012-Mar-23 08:55 UTC
Re: [PATCH 2/2] x86/vMCE: save/restore MCA capabilities
Liu, Jinsong wrote:
> Jan Beulich wrote:
>>>>> On 06.03.12 at 12:55, "Liu, Jinsong" <jinsong.liu@intel.com>
>>>>> wrote:
>>> Jan Beulich wrote:
>>>>>>> On 06.03.12 at 10:28, "Liu, Jinsong" <jinsong.liu@intel.com>
>>>>>>> wrote:
>>>>> Jan Beulich wrote:
>>>>>> But we're getting farther and farther away from the actual
>>>>>> question: Do we need to provide for saving/restoring of any of
>>>>>> the _CTL registers?
>>>>>
>>>>> Per Tony's elaboration on the _CTL hardware meaning, I understand
>>>>> they are model specific, mainly used for debug purposes, and the
>>>>> OS by default sets them to all 1's (if I misunderstand anything,
>>>>> please point it out).
>>>>> So how about unbinding _CTL from the host (say, purely software
>>>>> emulated MSRs, not involving h_mcg_ctl/h_mci_ctrl[bank])? Then we
>>>>> would not need to save/restore _CTL. After all, they are model
>>>>> specific, and emulating them as all 1's to the guest seems
>>>>> reasonable.
>>>>
>>>> If the guest OS considers a particular CPU model to require an
>>>> adjustment to any of these, any such adjustment would be lost over
>>>> migration. I'm simply uncertain whether all OSes will tolerate
>>>> that (in which case ignoring the writes in the first place would
>>>> probably be better).
>>>
>>> I'm unsure of the risk, but if OS tolerance is the concern, it
>>> would be better to avoid such an inconsistent case. An updated
>>> approach is purely software emulated _CTL plus save/restore, which
>>> would get rid of the hardware heterogeneity and keep things
>>> consistent across migration.
>>> Does that make sense?
>>
>> That would be an option, but again only if OSes don't make
>> assumptions about the number of banks for certain CPU models.
>
> AFAICT the only way the SDM recommends to get the bank number is via
> mcg_cap, so if an OS assumes a bank number based on the CPU model, it
> would either get the same number as via mcg_cap, or get a wrong
> number - which is an OS problem, not Xen's.
>
> Thanks,
> Jinsong

Jan, any more concerns about this thread (_CTL)?

And as for vMCE live migration, there are indeed some issues when
migrating. We are currently discussing them internally and will present
an approach/patches when available.

Thanks,
Jinsong
>>> On 23.03.12 at 09:55, "Liu, Jinsong" <jinsong.liu@intel.com> wrote:
> Liu, Jinsong wrote:
>> Jan Beulich wrote:
>>>>>> On 06.03.12 at 12:55, "Liu, Jinsong" <jinsong.liu@intel.com>
>>>>>> wrote:
>>>> Jan Beulich wrote:
>>>>>>>> On 06.03.12 at 10:28, "Liu, Jinsong" <jinsong.liu@intel.com>
>>>>>>>> wrote:
>>>>>> Jan Beulich wrote:
>>>>>>> But we're getting farther and farther away from the actual
>>>>>>> question: Do we need to provide for saving/restoring of any of
>>>>>>> the _CTL registers?
>>>>>>
>>>>>> Per Tony's elaboration on the _CTL hardware meaning, I
>>>>>> understand they are model specific, mainly used for debug
>>>>>> purposes, and the OS by default sets them to all 1's (if I
>>>>>> misunderstand anything, please point it out).
>>>>>> So how about unbinding _CTL from the host (say, purely software
>>>>>> emulated MSRs, not involving h_mcg_ctl/h_mci_ctrl[bank])? Then
>>>>>> we would not need to save/restore _CTL. After all, they are
>>>>>> model specific, and emulating them as all 1's to the guest seems
>>>>>> reasonable.
>>>>>
>>>>> If the guest OS considers a particular CPU model to require an
>>>>> adjustment to any of these, any such adjustment would be lost
>>>>> over migration. I'm simply uncertain whether all OSes will
>>>>> tolerate that (in which case ignoring the writes in the first
>>>>> place would probably be better).
>>>>
>>>> I'm unsure of the risk, but if OS tolerance is the concern, it
>>>> would be better to avoid such an inconsistent case. An updated
>>>> approach is purely software emulated _CTL plus save/restore, which
>>>> would get rid of the hardware heterogeneity and keep things
>>>> consistent across migration.
>>>> Does that make sense?
>>>
>>> That would be an option, but again only if OSes don't make
>>> assumptions about the number of banks for certain CPU models.
>>
>> AFAICT the only way the SDM recommends to get the bank number is via
>> mcg_cap, so if an OS assumes a bank number based on the CPU model,
>> it would either get the same number as via mcg_cap, or get a wrong
>> number - which is an OS problem, not Xen's.
>
> Jan, any more concerns about this thread (_CTL)?

While I accept your statement above as valid from a theoretical point
of view, it's not going to work in practice if someone comes up with a
case where an OS works flawlessly on real hardware, yet has a problem
when virtualized - it will be the virtualization software that gets
blamed, not the OS. That said, in this matter I'm fine with not doing
anything until we get an actual report of a problem (at which point
working around it by enforcing the bank count via a config setting is
probably the most viable option).

> And as for vMCE live migration, there are indeed some issues when
> migrating. We are currently discussing them internally and will
> present an approach/patches when available.

Thanks - this is the part that we really need to deal with (and we
should at least settle on the migration _interface_ before 4.2 goes
out).

Jan
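One hedged sketch of what "enforcing the bank count via a config
setting" might look like at domain creation (purely hypothetical - no
such guest-config knob existed at the time of this thread): clamp the
count field of the guest-visible MCG_CAP to the configured value before
the first vCPU runs.

    #include <stdint.h>

    /* Hypothetical: value of a guest-config knob such as "vmce_banks=";
     * zero means "inherit the host's count". */
    static unsigned int cfg_vmce_banks;

    /* Build the guest-visible MCG_CAP from the host value, overriding
     * the bank count (bits 7:0) with the configured one so it stays
     * stable across hosts. Keeping the host's feature bits unchanged is
     * a simplification for this sketch. */
    static uint64_t guest_mcg_cap(uint64_t host_mcg_cap)
    {
        uint64_t cap = host_mcg_cap & ~0xffULL;   /* keep feature bits */

        if ( cfg_vmce_banks != 0 && cfg_vmce_banks <= 0xff )
            cap |= cfg_vmce_banks;                /* configured count */
        else
            cap |= host_mcg_cap & 0xff;           /* inherit host count */

        return cap;
    }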