x86's vMCE implementation lets a guest know of as many MCE reporting banks as there are in the host. While a PV guest could be expected to deal with this number changing (particularly decreasing) during migration (not currently handled anywhere, afaict), for HVM guests this is certainly wrong.

It isn't clear to me, however, how to handle this properly. The easiest approach would appear to be to save and restore the number of banks the guest was made to believe it can access, making vmce_{rd,wr}msr() silently tolerate accesses between the host and guest values.

Any thoughts appreciated,
Jan
On Mon, Jan 23, 2012 at 11:08 AM, Jan Beulich <JBeulich@suse.com> wrote:
> x86's vMCE implementation lets a guest know of as many MCE reporting
> banks as there are in the host. While a PV guest could be expected to
> deal with this number changing (particularly decreasing) during migration
> (not currently handled anywhere afaict), for HVM guests this is certainly
> wrong.
>
> At least to me it isn't, however, clear how to properly handle this. The
> easiest would appear to be to save and restore the number of banks
> the guest was made believe it can access, making vmce_{rd,wr}msr()
> silently tolerate accesses between the host and guest values.

We ran into this in the XS 6.0 release as well. I think that the ideal thing to do would be to have a parameter that can be set at boot, to say how many vMCE banks a guest has, defaulting to the number of MCE banks on the host. This parameter would be preserved across migration. Ideally, a pool-aware toolstack like xapi would then set this value to be the value of the host in the pool with the largest number of banks, allowing a guest to access all the banks on any host to which it migrates.

What do you think?
 -George
>>> On 24.01.12 at 11:29, George Dunlap <George.Dunlap@eu.citrix.com> wrote:
> On Mon, Jan 23, 2012 at 11:08 AM, Jan Beulich <JBeulich@suse.com> wrote:
>> [original problem statement snipped]
>
> We ran into this in the XS 6.0 release as well. I think that the
> ideal thing to do would be to have a parameter that can be set at
> boot, to say how many vMCE banks a guest has, defaulting to the number
> of MCE banks on the host. This parameter would be preserved across
> migration. Ideally, a pool-aware toolstack like xapi would then set
> this value to be the value of the host in the pool with the largest
> number of banks, allowing a guest to access all the banks on any host
> to which it migrates.
>
> What do you think?

That sounds like the way to go.

Jan
On Tue, 2012-01-24 at 11:08 +0000, Jan Beulich wrote:
> >>> On 24.01.12 at 11:29, George Dunlap <George.Dunlap@eu.citrix.com> wrote:
> > [earlier discussion snipped]
> >
> > What do you think?
>
> That sounds like the way to go.

So should we put this on IanC's to-do-be-done list? Are you going to put it on your to-do list? :-)

 -George
>>> On 26.01.12 at 17:54, George Dunlap <george.dunlap@citrix.com> wrote:
> On Tue, 2012-01-24 at 11:08 +0000, Jan Beulich wrote:
> > [earlier discussion snipped]
> >
> > That sounds like the way to go.
>
> So should we put this on IanC's to-do-be-done list?

Would certainly be desirable to get fixed.

> Are you going to put it on your to-do list? :-)

I'll first check with Intel whether the engineers who wrote that code could make an attempt.

Jan
>>> On 26.01.12 at 17:54, George Dunlap <george.dunlap@citrix.com> wrote:
> On Tue, 2012-01-24 at 11:08 +0000, Jan Beulich wrote:
> > [earlier discussion snipped]
> >
> > That sounds like the way to go.
>
> So should we put this on IanC's to-do-be-done list? Are you going to
> put it on your to-do list? :-)

Below/attached a draft patch (compile tested only), handling save/restore of the bank count, but not allowing for a config setting to specify its initial value (yet).
Jan

--- a/tools/libxc/xc_domain_save.c
+++ b/tools/libxc/xc_domain_save.c
@@ -1879,6 +1879,7 @@ int xc_domain_save(xc_interface *xch, in

         domctl.cmd = XEN_DOMCTL_get_ext_vcpucontext;
         domctl.domain = dom;
+        memset(&domctl.u, 0, sizeof(domctl.u));
         domctl.u.ext_vcpucontext.vcpu = i;
         if ( xc_domctl(xch, &domctl) < 0 )
         {
--- a/tools/misc/xen-hvmctx.c
+++ b/tools/misc/xen-hvmctx.c
@@ -383,6 +383,13 @@ static void dump_viridian_vcpu(void)
            (unsigned long long) p.apic_assist);
 }

+static void dump_vmce_vcpu(void)
+{
+    HVM_SAVE_TYPE(VMCE_VCPU) p;
+    READ(p);
+    printf("    VMCE_VCPU: nr_banks %u\n", p.nr_banks);
+}
+
 int main(int argc, char **argv)
 {
     int entry, domid;
@@ -449,6 +456,7 @@ int main(int argc, char **argv)
         case HVM_SAVE_CODE(MTRR): dump_mtrr(); break;
         case HVM_SAVE_CODE(VIRIDIAN_DOMAIN): dump_viridian_domain(); break;
         case HVM_SAVE_CODE(VIRIDIAN_VCPU): dump_viridian_vcpu(); break;
+        case HVM_SAVE_CODE(VMCE_VCPU): dump_vmce_vcpu(); break;
         case HVM_SAVE_CODE(END): break;
         default: printf(" ** Don't understand type %u: skipping\n",
--- a/xen/arch/x86/cpu/mcheck/mce.h
+++ b/xen/arch/x86/cpu/mcheck/mce.h
@@ -3,6 +3,7 @@
 #define _MCE_H

 #include <xen/init.h>
+#include <xen/sched.h>
 #include <xen/smp.h>
 #include <asm/types.h>
 #include <asm/traps.h>
@@ -54,8 +55,8 @@ int unmmap_broken_page(struct domain *d,
 u64 mce_cap_init(void);
 extern int firstbank;

-int intel_mce_rdmsr(uint32_t msr, uint64_t *val);
-int intel_mce_wrmsr(uint32_t msr, uint64_t val);
+int intel_mce_rdmsr(const struct vcpu *, uint32_t msr, uint64_t *val);
+int intel_mce_wrmsr(struct vcpu *, uint32_t msr, uint64_t val);

 struct mcinfo_extended *intel_get_extended_msrs(
     struct mcinfo_global *mig, struct mc_info *mi);
@@ -171,18 +172,20 @@ int vmce_domain_inject(struct mcinfo_ban

 extern int vmce_init(struct cpuinfo_x86 *c);

-static inline int mce_vendor_bank_msr(uint32_t msr)
+static inline int mce_vendor_bank_msr(const struct vcpu *v, uint32_t msr)
 {
     if ( boot_cpu_data.x86_vendor == X86_VENDOR_INTEL &&
-         msr >= MSR_IA32_MC0_CTL2 && msr < (MSR_IA32_MC0_CTL2 + nr_mce_banks) )
+         msr >= MSR_IA32_MC0_CTL2 &&
+         msr < MSR_IA32_MCx_CTL2(v->arch.nr_vmce_banks) )
         return 1;
     return 0;
 }

-static inline int mce_bank_msr(uint32_t msr)
+static inline int mce_bank_msr(const struct vcpu *v, uint32_t msr)
 {
-    if ( (msr >= MSR_IA32_MC0_CTL && msr < MSR_IA32_MCx_CTL(nr_mce_banks)) ||
-         mce_vendor_bank_msr(msr) )
+    if ( (msr >= MSR_IA32_MC0_CTL &&
+          msr < MSR_IA32_MCx_CTL(v->arch.nr_vmce_banks)) ||
+         mce_vendor_bank_msr(v, msr) )
         return 1;
     return 0;
 }
--- a/xen/arch/x86/cpu/mcheck/mce_intel.c
+++ b/xen/arch/x86/cpu/mcheck/mce_intel.c
@@ -1421,11 +1421,12 @@ enum mcheck_type intel_mcheck_init(struc
 }

 /* intel specific MCA MSR */
-int intel_mce_wrmsr(uint32_t msr, uint64_t val)
+int intel_mce_wrmsr(struct vcpu *v, uint32_t msr, uint64_t val)
 {
     int ret = 0;

-    if (msr >= MSR_IA32_MC0_CTL2 && msr < (MSR_IA32_MC0_CTL2 + nr_mce_banks))
+    if ( msr >= MSR_IA32_MC0_CTL2 &&
+         msr < MSR_IA32_MCx_CTL2(v->arch.nr_vmce_banks) )
     {
         mce_printk(MCE_QUIET, "We have disabled CMCI capability, "
                    "Guest should not write this MSR!\n");
@@ -1435,11 +1436,12 @@ int intel_mce_wrmsr(uint32_t msr, uint64
     return ret;
 }

-int intel_mce_rdmsr(uint32_t msr, uint64_t *val)
+int intel_mce_rdmsr(const struct vcpu *v, uint32_t msr, uint64_t *val)
 {
     int ret = 0;

-    if (msr >= MSR_IA32_MC0_CTL2 && msr < (MSR_IA32_MC0_CTL2 + nr_mce_banks))
+    if ( msr >= MSR_IA32_MC0_CTL2 &&
+         msr < MSR_IA32_MCx_CTL2(v->arch.nr_vmce_banks) )
     {
         mce_printk(MCE_QUIET, "We have disabled CMCI capability, "
                    "Guest should not read this MSR!\n");
--- a/xen/arch/x86/cpu/mcheck/vmce.c
+++ b/xen/arch/x86/cpu/mcheck/vmce.c
@@ -10,6 +10,7 @@
 #include <xen/delay.h>
 #include <xen/smp.h>
 #include <xen/mm.h>
+#include <xen/hvm/save.h>
 #include <asm/processor.h>
 #include <public/sysctl.h>
 #include <asm/system.h>
@@ -61,21 +62,26 @@ void vmce_destroy_msr(struct domain *d)
     dom_vmce(d) = NULL;
 }

-static int bank_mce_rdmsr(struct domain *d, uint32_t msr, uint64_t *val)
+void vmce_init_vcpu(struct vcpu *v)
 {
-    int bank, ret = 1;
-    struct domain_mca_msrs *vmce = dom_vmce(d);
+    v->arch.nr_vmce_banks = nr_mce_banks;
+}
+
+static int bank_mce_rdmsr(const struct vcpu *v, uint32_t msr, uint64_t *val)
+{
+    int ret = 1;
+    unsigned int bank = (msr - MSR_IA32_MC0_CTL) / 4;
+    struct domain_mca_msrs *vmce = dom_vmce(v->domain);
     struct bank_entry *entry;

-    bank = (msr - MSR_IA32_MC0_CTL) / 4;
-    if ( bank >= nr_mce_banks )
-        return -1;
+    *val = 0;

     switch ( msr & (MSR_IA32_MC0_CTL | 3) )
     {
     case MSR_IA32_MC0_CTL:
-        *val = vmce->mci_ctl[bank] &
-               (h_mci_ctrl ? h_mci_ctrl[bank] : ~0UL);
+        if ( bank < nr_mce_banks )
+            *val = vmce->mci_ctl[bank] &
+                   (h_mci_ctrl ? h_mci_ctrl[bank] : ~0UL);
         mce_printk(MCE_VERBOSE, "MCE: rdmsr MC%u_CTL 0x%"PRIx64"\n",
                    bank, *val);
         break;
@@ -126,7 +132,7 @@ static int bank_mce_rdmsr(struct domain
         switch ( boot_cpu_data.x86_vendor )
         {
         case X86_VENDOR_INTEL:
-            ret = intel_mce_rdmsr(msr, val);
+            ret = intel_mce_rdmsr(v, msr, val);
             break;
         default:
             ret = 0;
@@ -145,13 +151,13 @@ static int bank_mce_rdmsr(struct domain
  */
 int vmce_rdmsr(uint32_t msr, uint64_t *val)
 {
-    struct domain *d = current->domain;
-    struct domain_mca_msrs *vmce = dom_vmce(d);
+    const struct vcpu *cur = current;
+    struct domain_mca_msrs *vmce = dom_vmce(cur->domain);
     int ret = 1;

     *val = 0;

-    spin_lock(&dom_vmce(d)->lock);
+    spin_lock(&vmce->lock);

     switch ( msr )
     {
@@ -173,28 +179,26 @@ int vmce_rdmsr(uint32_t msr, uint64_t *v
                    *val);
         break;
     default:
-        ret = mce_bank_msr(msr) ? bank_mce_rdmsr(d, msr, val) : 0;
+        ret = mce_bank_msr(cur, msr) ? bank_mce_rdmsr(cur, msr, val) : 0;
         break;
     }

-    spin_unlock(&dom_vmce(d)->lock);
+    spin_unlock(&vmce->lock);
     return ret;
 }

-static int bank_mce_wrmsr(struct domain *d, u32 msr, u64 val)
+static int bank_mce_wrmsr(struct vcpu *v, u32 msr, u64 val)
 {
-    int bank, ret = 1;
-    struct domain_mca_msrs *vmce = dom_vmce(d);
+    int ret = 1;
+    unsigned int bank = (msr - MSR_IA32_MC0_CTL) / 4;
+    struct domain_mca_msrs *vmce = dom_vmce(v->domain);
     struct bank_entry *entry = NULL;

-    bank = (msr - MSR_IA32_MC0_CTL) / 4;
-    if ( bank >= nr_mce_banks )
-        return -EINVAL;
-
     switch ( msr & (MSR_IA32_MC0_CTL | 3) )
     {
     case MSR_IA32_MC0_CTL:
-        vmce->mci_ctl[bank] = val;
+        if ( bank < nr_mce_banks )
+            vmce->mci_ctl[bank] = val;
         break;
     case MSR_IA32_MC0_STATUS:
         /* Give the first entry of the list, it corresponds to current
@@ -202,9 +206,9 @@ static int bank_mce_wrmsr(struct domain
          * the guest, this node will be deleted.
          * Only error bank is written. Non-error banks simply return.
          */
-        if ( !list_empty(&dom_vmce(d)->impact_header) )
+        if ( !list_empty(&vmce->impact_header) )
         {
-            entry = list_entry(dom_vmce(d)->impact_header.next,
+            entry = list_entry(vmce->impact_header.next,
                                struct bank_entry, list);
             if ( entry->bank == bank )
                 entry->mci_status = val;
@@ -228,7 +232,7 @@ static int bank_mce_wrmsr(struct domain
         switch ( boot_cpu_data.x86_vendor )
         {
         case X86_VENDOR_INTEL:
-            ret = intel_mce_wrmsr(msr, val);
+            ret = intel_mce_wrmsr(v, msr, val);
             break;
         default:
             ret = 0;
@@ -247,9 +251,9 @@ static int bank_mce_wrmsr(struct domain
  */
 int vmce_wrmsr(u32 msr, u64 val)
 {
-    struct domain *d = current->domain;
+    struct vcpu *cur = current;
     struct bank_entry *entry = NULL;
-    struct domain_mca_msrs *vmce = dom_vmce(d);
+    struct domain_mca_msrs *vmce = dom_vmce(cur->domain);
     int ret = 1;

     if ( !g_mcg_cap )
@@ -266,7 +270,7 @@ int vmce_wrmsr(u32 msr, u64 val)
         vmce->mcg_status = val;
         mce_printk(MCE_VERBOSE, "MCE: wrmsr MCG_STATUS %"PRIx64"\n", val);
         /* For HVM guest, this is the point for deleting vMCE injection node */
-        if ( d->is_hvm && (vmce->nr_injection > 0) )
+        if ( is_hvm_vcpu(cur) && (vmce->nr_injection > 0) )
         {
             vmce->nr_injection--; /* Should be 0 */
             if ( !list_empty(&vmce->impact_header) )
@@ -293,7 +297,7 @@ int vmce_wrmsr(u32 msr, u64 val)
         ret = -1;
         break;
     default:
-        ret = mce_bank_msr(msr) ? bank_mce_wrmsr(d, msr, val) : 0;
+        ret = mce_bank_msr(cur, msr) ? bank_mce_wrmsr(cur, msr, val) : 0;
         break;
     }

@@ -301,6 +305,47 @@ int vmce_wrmsr(u32 msr, u64 val)
     return ret;
 }

+static int vmce_save_vcpu_ctxt(struct domain *d, hvm_domain_context_t *h)
+{
+    struct vcpu *v;
+    int err = 0;
+
+    for_each_vcpu( d, v ) {
+        struct hvm_vmce_vcpu ctxt = {
+            .nr_banks = v->arch.nr_vmce_banks
+        };
+
+        err = hvm_save_entry(VMCE_VCPU, v->vcpu_id, h, &ctxt);
+        if ( err )
+            break;
+    }
+
+    return err;
+}
+
+static int vmce_load_vcpu_ctxt(struct domain *d, hvm_domain_context_t *h)
+{
+    unsigned int vcpuid = hvm_load_instance(h);
+    struct vcpu *v;
+    struct hvm_vmce_vcpu ctxt;
+    int err;
+
+    if ( vcpuid >= d->max_vcpus || (v = d->vcpu[vcpuid]) == NULL )
+    {
+        gdprintk(XENLOG_ERR, "HVM restore: domain has no vcpu %u\n", vcpuid);
+        return -EINVAL;
+    }
+
+    err = hvm_load_entry(VMCE_VCPU, h, &ctxt);
+    if ( !err )
+        v->arch.nr_vmce_banks = ctxt.nr_banks;
+
+    return err;
+}
+
+HVM_REGISTER_SAVE_RESTORE(VMCE_VCPU, vmce_save_vcpu_ctxt,
+                          vmce_load_vcpu_ctxt, 1, HVMSR_PER_VCPU);
+
 int inject_vmce(struct domain *d)
 {
     int cpu = smp_processor_id();
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -468,6 +468,8 @@ int vcpu_initialise(struct vcpu *v)

     v->arch.pv_vcpu.ctrlreg[4] = real_cr4_to_pv_guest_cr4(mmu_cr4_features);

+    vmce_init_vcpu(v);
+
     rc = is_pv_32on64_vcpu(v) ? setup_compat_l4(v) : 0;
  done:
     if ( rc )
--- a/xen/arch/x86/domctl.c
+++ b/xen/arch/x86/domctl.c
@@ -1027,11 +1027,12 @@ long arch_do_domctl(
                 evc->syscall32_callback_eip = 0;
                 evc->syscall32_disables_events = 0;
             }
+            evc->nr_mce_banks = v->arch.nr_vmce_banks;
         }
         else
         {
             ret = -EINVAL;
-            if ( evc->size != sizeof(*evc) )
+            if ( evc->size < offsetof(typeof(*evc), nr_mce_banks) )
                 goto ext_vcpucontext_out;
 #ifdef __x86_64__
             if ( !is_hvm_domain(d) )
@@ -1059,6 +1060,16 @@ long arch_do_domctl(
                  (evc->syscall32_callback_cs & ~3) ||
                  evc->syscall32_callback_eip )
                 goto ext_vcpucontext_out;
+
+            if ( evc->size >= offsetof(typeof(*evc), mbz[ARRAY_SIZE(evc->mbz)]) )
+            {
+                unsigned int i;
+
+                for ( i = 0; i < ARRAY_SIZE(evc->mbz); ++i )
+                    if ( evc->mbz[i] )
+                        goto ext_vcpucontext_out;
+                v->arch.nr_vmce_banks = evc->nr_mce_banks;
+            }
         }

         ret = 0;
--- a/xen/include/asm-x86/domain.h
+++ b/xen/include/asm-x86/domain.h
@@ -488,6 +488,8 @@ struct arch_vcpu
     /* This variable determines whether nonlazy extended state has been used,
      * and thus should be saved/restored. */
     bool_t nonlazy_xstate_used;
+
+    uint8_t nr_vmce_banks;

     struct paging_vcpu paging;

--- a/xen/include/asm-x86/mce.h
+++ b/xen/include/asm-x86/mce.h
@@ -28,6 +28,7 @@ struct domain_mca_msrs
 /* Guest vMCE MSRs virtualization */
 extern int vmce_init_msr(struct domain *d);
 extern void vmce_destroy_msr(struct domain *d);
+extern void vmce_init_vcpu(struct vcpu *);
 extern int vmce_wrmsr(uint32_t msr, uint64_t val);
 extern int vmce_rdmsr(uint32_t msr, uint64_t *val);

--- a/xen/include/public/arch-x86/hvm/save.h
+++ b/xen/include/public/arch-x86/hvm/save.h
@@ -585,9 +585,15 @@ struct hvm_viridian_vcpu_context {

 DECLARE_HVM_SAVE_TYPE(VIRIDIAN_VCPU, 17, struct hvm_viridian_vcpu_context);

+struct hvm_vmce_vcpu {
+    uint8_t nr_banks;
+};
+
+DECLARE_HVM_SAVE_TYPE(VMCE_VCPU, 18, struct hvm_vmce_vcpu);
+
 /*
  * Largest type-code in use
  */
-#define HVM_SAVE_CODE_MAX 17
+#define HVM_SAVE_CODE_MAX 18

 #endif /* __XEN_PUBLIC_HVM_SAVE_X86_H__ */
--- a/xen/include/public/domctl.h
+++ b/xen/include/public/domctl.h
@@ -559,7 +559,7 @@ struct xen_domctl_ext_vcpucontext {
     uint32_t vcpu;
     /*
      * SET: Size of struct (IN)
-     * GET: Size of struct (OUT)
+     * GET: Size of struct (OUT, up to 128 bytes)
      */
     uint32_t size;
 #if defined(__i386__) || defined(__x86_64__)
@@ -571,6 +571,10 @@ struct xen_domctl_ext_vcpucontext {
     uint16_t sysenter_callback_cs;
     uint8_t syscall32_disables_events;
     uint8_t sysenter_disables_events;
+    uint8_t nr_mce_banks;
+    /* This array can be split and used for future extension. */
+    /* It is there just to grow the structure beyond 4.1's size. */
+    uint8_t mbz[9];
 #endif
 };
 typedef struct xen_domctl_ext_vcpucontext xen_domctl_ext_vcpucontext_t;
On Mon, 2012-01-30 at 13:47 +0000, Jan Beulich wrote:
> >>> On 26.01.12 at 17:54, George Dunlap <george.dunlap@citrix.com> wrote:
> > [earlier discussion snipped]
>
> Below/attached a draft patch (compile tested only), handling save/
> restore of the bank count, but not allowing for a config setting to
> specify its initial value (yet).

Looks pretty good for a first blush. Just one question: Why is the vMCE bank count kept on a per-vcpu basis, rather than on a per-domain basis, like the actual banks are? Is the host MCE stuff per-vcpu?

 -George

[draft patch quoted in full; snipped]
On Tue, 2012-01-31 at 11:27 +0000, George Dunlap wrote:> On Mon, 2012-01-30 at 13:47 +0000, Jan Beulich wrote: > > >>> On 26.01.12 at 17:54, George Dunlap <george.dunlap@citrix.com> wrote: > > > On Tue, 2012-01-24 at 11:08 +0000, Jan Beulich wrote: > > >> >>> On 24.01.12 at 11:29, George Dunlap <George.Dunlap@eu.citrix.com> wrote: > > >> > On Mon, Jan 23, 2012 at 11:08 AM, Jan Beulich <JBeulich@suse.com> wrote: > > >> >> x86''s vMCE implementation lets a guest know of as many MCE reporting > > >> >> banks as there are in the host. While a PV guest could be expected to > > >> >> deal with this number changing (particularly decreasing) during migration > > >> >> (not currently handled anywhere afaict), for HVM guests this is certainly > > >> >> wrong. > > >> >> > > >> >> At least to me it isn''t, however, clear how to properly handle this. The > > >> >> easiest would appear to be to save and restore the number of banks > > >> >> the guest was made believe it can access, making vmce_{rd,wr}msr() > > >> >> silently tolerate accesses between the host and guest values. > > >> > > > >> > We ran into this in the XS 6.0 release as well. I think that the > > >> > ideal thing to do would be to have a parameter that can be set at > > >> > boot, to say how many vMCE banks a guest has, defaulting to the number > > >> > of MCE banks on the host. This parameter would be preserved across > > >> > migration. Ideally, a pool-aware toolstack like xapi would then set > > >> > this value to be the value of the host in the pool with the largest > > >> > number of banks, allowing a guest to access all the banks on any host > > >> > to which it migrates. > > >> > > > >> > What do you think? > > >> > > >> That sounds like the way to go. > > > > > > So should we put this on IanC''s to-do-be-done list? Are you going to > > > put it on your to-do list? 
:-)
> >
> > Below/attached a draft patch (compile tested only), handling save/
> > restore of the bank count, but not allowing for a config setting to
> > specify its initial value (yet).
>
> Looks pretty good for a first blush. Just one question: Why is the vmce
> count made on a per-vcpu basis, rather than on a per-domain basis, like
> the actual banks are? Is the host MCE stuff per-vcpu?

(Per-pcpu, that is...)

> -George
>
> > [...]
>>> On 31.01.12 at 12:27, George Dunlap <george.dunlap@citrix.com> wrote:
> On Mon, 2012-01-30 at 13:47 +0000, Jan Beulich wrote:
>> >>> On 26.01.12 at 17:54, George Dunlap <george.dunlap@citrix.com> wrote:
>> > [...]
>> > So should we put this on IanC's to-do-be-done list? Are you going to
>> > put it on your to-do list?
:-)
>>
>> Below/attached a draft patch (compile tested only), handling save/
>> restore of the bank count, but not allowing for a config setting to
>> specify its initial value (yet).
>
> Looks pretty good for a first blush. Just one question: Why is the vmce
> count made on a per-vcpu basis, rather than on a per-domain basis, like
> the actual banks are? Is the host MCE stuff per-vcpu?

The question should probably be the other way around - why is the
vMCE implementation using global (fake) MSRs rather than per-vCPU
ones (as they would be on hardware). If the change here was
implemented as per-domain MSRs, a future move of the vMCE
implementation to a more natural model would be impossible. Also,
for the PV case the save/restore logic is much simpler this way.

Jan
On Tue, 2012-01-31 at 13:17 +0000, Jan Beulich wrote:
> >>> On 31.01.12 at 12:27, George Dunlap <george.dunlap@citrix.com> wrote:
> > [...]
> > Looks pretty good for a first blush. Just one question: Why is the vmce
> > count made on a per-vcpu basis, rather than on a per-domain basis, like
> > the actual banks are? Is the host MCE stuff per-vcpu?
>
> The question should probably be the other way around - why is the
> vMCE implementation using global (fake) MSRs rather than per-vCPU
> ones (as they would be on hardware). If the change here was
> implemented as per-domain MSRs, a future move of the vMCE
> implementation to a more natural model would be impossible. Also,
> for the PV case the save/restore logic is much simpler this way.

Ah, that makes sense. Good spot on the impossibility of changing to
per-vcpu later.

 -George
Jan Beulich wrote:
>>>> On 26.01.12 at 17:54, George Dunlap <george.dunlap@citrix.com> wrote:
>> [...]
>> So should we put this on IanC's to-do-be-done list? Are you going to
>> put it on your to-do list? :-)
>
> Below/attached a draft patch (compile tested only), handling save/
> restore of the bank count, but not allowing for a config setting to
> specify its initial value (yet).
> Jan
>
> [...]

Jan,

How about using a static bank count, say 256, the maximum bank number
the architecture defines (via the IA32_MCG_CAP MSR)?

Thanks,
Jinsong
>>> On 03.02.12 at 08:18, "Liu, Jinsong" <jinsong.liu@intel.com> wrote:
> How about using a static bank count, say 256, the maximum bank number
> the architecture defines (via the IA32_MCG_CAP MSR)?

The maximum, if any, is 32 (beyond which collision with other defined
MSRs would arise). But with the possibility of having a discontiguous
or relocated MSR space, I don't think any static maximum would be a
good idea.

Jan
Jan Beulich wrote:
>>>> On 03.02.12 at 08:18, "Liu, Jinsong" <jinsong.liu@intel.com> wrote:
>> How about using a static bank count, say 256, the maximum bank number
>> the architecture defines (via the IA32_MCG_CAP MSR)?
>
> The maximum, if any, is 32 (beyond which collision with other defined
> MSRs would arise). But with the possibility of having a discontiguous
> or relocated MSR space, I don't think any static maximum would be a
> good idea.
>
> Jan

The advantage of a static max bank count is simplicity - are there any
disadvantages to the static solution? Or, could we use the bank_entry
list in vmce->impact_header for mci_ctl, like what
mci_status/mci_addr/mci_misc do?

I just think the patch may be too complicated for a minor issue.

Thanks,
Jinsong
>>> On 03.02.12 at 13:34, "Liu, Jinsong" <jinsong.liu@intel.com> wrote:
> Jan Beulich wrote:
>> [...]
>
> The advantage of a static max bank count is simplicity - are there any
> disadvantages to the static solution?

Lack of forward compatibility. It may even be that the use of uint8_t
wasn't really a good choice of mine.

> Or, could we use the bank_entry list in vmce->impact_header for mci_ctl,
> like what mci_status/mci_addr/mci_misc do?

I don't think so - the control register (obviously) must be accessible
independent of any ongoing (latched) MCE.

> I just think the patch may be too complicated for a minor issue.

Is a guest crash minor? After all there is no guarantee that the
accesses to these MSRs are exception handler protected in a way
similar to what current Linux does.

Jan
On Fri, Feb 3, 2012 at 2:04 PM, Jan Beulich <JBeulich@suse.com> wrote:
>> I just think the patch may be too complicated for a minor issue.
>
> Is a guest crash minor? After all there is no guarantee that the
> accesses to these MSRs are exception handler protected in a way
> similar to what current Linux does.

In fact, I can tell you that newer versions of Windows do *not* handle
the exception properly, and will crash if an MCE MSR access that worked
before now generates an exception (for example, after live-migrating to
a host with fewer MCE banks).

 -George
On Mon, Jan 30, Jan Beulich wrote:

> Below/attached a draft patch (compile tested only), handling save/
> restore of the bank count, but not allowing for a config setting to
> specify its initial value (yet).

Does it take more than just applying this patch on the src+dst hosts
and migrating an HVM guest? I see no difference, the MCE warning is
still there.

Olaf
>>> On 09.02.12 at 19:02, Olaf Hering <olaf@aepfle.de> wrote:
> On Mon, Jan 30, Jan Beulich wrote:
>
>> Below/attached a draft patch (compile tested only), handling save/
>> restore of the bank count, but not allowing for a config setting to
>> specify its initial value (yet).
>
> Does it take more than just applying this patch for src+dst host and
> migrate a hvm guest? I see no difference, the mce warning is still
> there.

No, it shouldn't require anything else. Could you add a printk() each
to vmce_{save,load}_vcpu_ctxt(), printing what gets saved/restored
(and at once checking that they actually get executed)? I was under
the impression that adding save records for HVM is a simple drop-in
exercise these days...

Jan
On Fri, Feb 10, Jan Beulich wrote:

> No, it shouldn't require anything else. Could you add a printk() each
> to vmce_{save,load}_vcpu_ctxt() printing what gets saved/restored
> (and at once checking that they actually get executed? I was under
> the impression that adding save records for HVM is a simple drop-in
> exercise these days...

The functions are called on both sides with no errors.

Regarding these messages, are they guest addresses?

(XEN) traps.c:3131: GPF (0000): ffff82c4801d4c6c -> ffff82c480226eb4

Olaf
>>> On 10.02.12 at 17:53, Olaf Hering <olaf@aepfle.de> wrote:
> On Fri, Feb 10, Jan Beulich wrote:
>
>> No, it shouldn't require anything else. Could you add a printk() each
>> to vmce_{save,load}_vcpu_ctxt() printing what gets saved/restored
>> (and at once checking that they actually get executed? I was under
>> the impression that adding save records for HVM is a simple drop-in
>> exercise these days...
>
> The functions are called on both sides with no errors.

And the bank count also is correctly reflecting the original, save-side
value on the restore side?

> Regarding these messages, are they guest addresses?
>
> (XEN) traps.c:3131: GPF (0000): ffff82c4801d4c6c -> ffff82c480226eb4

No, these are hypervisor ones.

Jan
On Fri, Feb 10, Jan Beulich wrote:

> >>> On 10.02.12 at 17:53, Olaf Hering <olaf@aepfle.de> wrote:
> > On Fri, Feb 10, Jan Beulich wrote:
> >
> >> No, it shouldn't require anything else. Could you add a printk() each
> >> to vmce_{save,load}_vcpu_ctxt() printing what gets saved/restored
> >> (and at once checking that they actually get executed? I was under
> >> the impression that adding save records for HVM is a simple drop-in
> >> exercise these days...
> >
> > The functions are called on both sides with no errors.
>
> And the bank count also is correctly reflecting the original, save-side
> value on the restore side?

Oh, the count is zero on both sides? I will double check my dbg patch.

Olaf
On Fri, Feb 10, Jan Beulich wrote:

> >>> On 09.02.12 at 19:02, Olaf Hering <olaf@aepfle.de> wrote:
> > On Mon, Jan 30, Jan Beulich wrote:
> >
> >> Below/attached a draft patch (compile tested only), handling save/
> >> restore of the bank count, but not allowing for a config setting to
> >> specify its initial value (yet).
> >
> > Does it take more than just applying this patch for src+dst host and
> > migrate a hvm guest? I see no difference, the mce warning is still
> > there.
>
> No, it shouldn't require anything else. Could you add a printk() each
> to vmce_{save,load}_vcpu_ctxt() printing what gets saved/restored
> (and at once checking that they actually get executed? I was under
> the impression that adding save records for HVM is a simple drop-in
> exercise these days...

These functions are called for dom0, but not for domU. And as a result
arch.nr_vmce_banks remains zero. I assume the guest needs to be
initialized in some way as well, and that does not happen?

Olaf
On Fri, Feb 10, Jan Beulich wrote:

> > Regarding these messages, are they guest addresses?
> >
> > (XEN) traps.c:3131: GPF (0000): ffff82c4801d4c6c -> ffff82c480226eb4
>
> No, these are hypervisor ones.

These are the functions they belong to:

(XEN) traps.c:3131: GPF (0000): ffff82c4801d61ec -> ffff82c48022aa2e
(XEN) do_general_protection: ffff82c4801d61ec vmx_msr_read_intercept+0x2d8/0x358
(XEN) do_general_protection: ffff82c48022aa2e vmac+0x6c8/0x99a

Olaf
>>> On 10.02.12 at 22:28, Olaf Hering <olaf@aepfle.de> wrote:
> On Fri, Feb 10, Jan Beulich wrote:
>
>> >>> On 09.02.12 at 19:02, Olaf Hering <olaf@aepfle.de> wrote:
>> > On Mon, Jan 30, Jan Beulich wrote:
>> >
>> >> Below/attached a draft patch (compile tested only), handling save/
>> >> restore of the bank count, but not allowing for a config setting to
>> >> specify its initial value (yet).
>> >
>> > Does it take more than just applying this patch for src+dst host and
>> > migrate a hvm guest? I see no difference, the mce warning is still
>> > there.
>>
>> No, it shouldn't require anything else. Could you add a printk() each
>> to vmce_{save,load}_vcpu_ctxt() printing what gets saved/restored
>> (and at once checking that they actually get executed? I was under
>> the impression that adding save records for HVM is a simple drop-in
>> exercise these days...
>
> These functions are called for dom0, but not for domU. And as a result
> arch.nr_vmce_banks remains zero. I assume the guest needs to be
> initialized in some way as well, and that does not happen?

These functions should be called with Dom0 being the current domain,
but the struct domain * argument should certainly be that of the DomU
being saved/restored.

Guest initialization happens in vmce_init_vcpu(), called from
vcpu_initialise() (irrespective of the kind of domain, i.e. equally for
PV and HVM).

I spotted another problem with the patch though - MCG_CAP reads aren't
reflecting the possibly non-host bank count. I'm in the process of
addressing this, but the whole MCG_* handling is bogus as being
per-domain instead of per-vCPU (and at least MCG_CAP lacking
save/restore too).

Jan
> Guest initialization happens in vmce_init_vcpu(), called from
> vcpu_initialise() (irrespective of the kind of domain, i.e. equally for
> PV and HVM).

And wrong I was - HVM guests get filtered much earlier in that
function. I'll get back with an updated patch hopefully soon.

Jan
>>> On 10.02.12 at 22:28, Olaf Hering <olaf@aepfle.de> wrote:
> These functions are called for dom0, but not for domU. And as a result
> arch.nr_vmce_banks remains zero. I assume the guest needs to be
> initialized in some way as well, and that does not happen?

Below/attached a fixed version of the patch.

Jan

--- a/tools/libxc/xc_domain_save.c
+++ b/tools/libxc/xc_domain_save.c
@@ -1879,6 +1879,7 @@ int xc_domain_save(xc_interface *xch, in
 
         domctl.cmd = XEN_DOMCTL_get_ext_vcpucontext;
         domctl.domain = dom;
+        memset(&domctl.u, 0, sizeof(domctl.u));
         domctl.u.ext_vcpucontext.vcpu = i;
         if ( xc_domctl(xch, &domctl) < 0 )
         {
--- a/tools/misc/xen-hvmctx.c
+++ b/tools/misc/xen-hvmctx.c
@@ -383,6 +383,13 @@ static void dump_viridian_vcpu(void)
            (unsigned long long) p.apic_assist);
 }
 
+static void dump_vmce_vcpu(void)
+{
+    HVM_SAVE_TYPE(VMCE_VCPU) p;
+    READ(p);
+    printf("    VMCE_VCPU: caps %" PRIx64 "\n",
+           (unsigned long long) p.caps);
+}
+
 int main(int argc, char **argv)
 {
     int entry, domid;
@@ -449,6 +456,7 @@ int main(int argc, char **argv)
         case HVM_SAVE_CODE(MTRR): dump_mtrr(); break;
         case HVM_SAVE_CODE(VIRIDIAN_DOMAIN): dump_viridian_domain(); break;
         case HVM_SAVE_CODE(VIRIDIAN_VCPU): dump_viridian_vcpu(); break;
+        case HVM_SAVE_CODE(VMCE_VCPU): dump_vmce_vcpu(); break;
         case HVM_SAVE_CODE(END): break;
         default: printf(" ** Don't understand type %u: skipping\n",
--- a/xen/arch/x86/cpu/mcheck/mce.h
+++ b/xen/arch/x86/cpu/mcheck/mce.h
@@ -3,6 +3,7 @@
 #define _MCE_H
 
 #include <xen/init.h>
+#include <xen/sched.h>
 #include <xen/smp.h>
 #include <asm/types.h>
 #include <asm/traps.h>
@@ -54,8 +55,8 @@ int unmmap_broken_page(struct domain *d,
 u64 mce_cap_init(void);
 extern int firstbank;
 
-int intel_mce_rdmsr(uint32_t msr, uint64_t *val);
-int intel_mce_wrmsr(uint32_t msr, uint64_t val);
+int intel_mce_rdmsr(const struct vcpu *, uint32_t msr, uint64_t *val);
+int intel_mce_wrmsr(struct vcpu *, uint32_t msr, uint64_t val);
 
 struct mcinfo_extended *intel_get_extended_msrs(
     struct mcinfo_global *mig, struct mc_info *mi);
@@ -171,18 +172,20 @@ int vmce_domain_inject(struct mcinfo_ban
 
 extern int vmce_init(struct cpuinfo_x86 *c);
 
-static inline int mce_vendor_bank_msr(uint32_t msr)
+static inline int mce_vendor_bank_msr(const struct vcpu *v, uint32_t msr)
 {
     if ( boot_cpu_data.x86_vendor == X86_VENDOR_INTEL &&
-         msr >= MSR_IA32_MC0_CTL2 && msr < (MSR_IA32_MC0_CTL2 + nr_mce_banks) )
+         msr >= MSR_IA32_MC0_CTL2 &&
+         msr < MSR_IA32_MCx_CTL2(v->arch.mcg_cap & MCG_CAP_COUNT) )
         return 1;
     return 0;
 }
 
-static inline int mce_bank_msr(uint32_t msr)
+static inline int mce_bank_msr(const struct vcpu *v, uint32_t msr)
 {
-    if ( (msr >= MSR_IA32_MC0_CTL && msr < MSR_IA32_MCx_CTL(nr_mce_banks)) ||
-         mce_vendor_bank_msr(msr) )
+    if ( (msr >= MSR_IA32_MC0_CTL &&
+          msr < MSR_IA32_MCx_CTL(v->arch.mcg_cap & MCG_CAP_COUNT)) ||
+         mce_vendor_bank_msr(v, msr) )
         return 1;
     return 0;
 }
--- a/xen/arch/x86/cpu/mcheck/mce_intel.c
+++ b/xen/arch/x86/cpu/mcheck/mce_intel.c
@@ -1421,11 +1421,12 @@ enum mcheck_type intel_mcheck_init(struc
 }
 
 /* intel specific MCA MSR */
-int intel_mce_wrmsr(uint32_t msr, uint64_t val)
+int intel_mce_wrmsr(struct vcpu *v, uint32_t msr, uint64_t val)
 {
     int ret = 0;
 
-    if (msr >= MSR_IA32_MC0_CTL2 && msr < (MSR_IA32_MC0_CTL2 + nr_mce_banks))
+    if ( msr >= MSR_IA32_MC0_CTL2 &&
+         msr < MSR_IA32_MCx_CTL2(v->arch.mcg_cap & MCG_CAP_COUNT) )
     {
         mce_printk(MCE_QUIET, "We have disabled CMCI capability, "
                    "Guest should not write this MSR!\n");
@@ -1435,11 +1436,12 @@ int intel_mce_wrmsr(uint32_t msr, uint64
     return ret;
 }
 
-int intel_mce_rdmsr(uint32_t msr, uint64_t *val)
+int intel_mce_rdmsr(const struct vcpu *v, uint32_t msr, uint64_t *val)
 {
     int ret = 0;
 
-    if (msr >= MSR_IA32_MC0_CTL2 && msr < (MSR_IA32_MC0_CTL2 + nr_mce_banks))
+    if ( msr >= MSR_IA32_MC0_CTL2 &&
+         msr < MSR_IA32_MCx_CTL2(v->arch.mcg_cap & MCG_CAP_COUNT) )
    {
         mce_printk(MCE_QUIET, "We have disabled CMCI capability, "
                    "Guest should not read this MSR!\n");
--- a/xen/arch/x86/cpu/mcheck/vmce.c
+++ b/xen/arch/x86/cpu/mcheck/vmce.c
@@ -10,6 +10,7 @@
 #include <xen/delay.h>
 #include <xen/smp.h>
 #include <xen/mm.h>
+#include <xen/hvm/save.h>
 #include <asm/processor.h>
 #include <public/sysctl.h>
 #include <asm/system.h>
@@ -42,7 +43,6 @@ int vmce_init_msr(struct domain *d)
            sizeof(dom_vmce(d)->mci_ctl));
 
     dom_vmce(d)->mcg_status = 0x0;
-    dom_vmce(d)->mcg_cap = g_mcg_cap;
     dom_vmce(d)->mcg_ctl = ~(uint64_t)0x0;
     dom_vmce(d)->nr_injection = 0;
 
@@ -61,21 +61,41 @@ void vmce_destroy_msr(struct domain *d)
     dom_vmce(d) = NULL;
 }
 
-static int bank_mce_rdmsr(struct domain *d, uint32_t msr, uint64_t *val)
+void vmce_init_vcpu(struct vcpu *v)
 {
-    int bank, ret = 1;
-    struct domain_mca_msrs *vmce = dom_vmce(d);
+    v->arch.mcg_cap = g_mcg_cap;
+}
+
+int vmce_restore_vcpu(struct vcpu *v, uint64_t caps)
+{
+    if ( caps & ~g_mcg_cap & ~MCG_CAP_COUNT & ~MCG_CTL_P )
+    {
+        dprintk(XENLOG_G_ERR, "%s restore: unsupported MCA capabilities"
+                " %#" PRIx64 " for d%d:v%u (supported: %#Lx)\n",
+                is_hvm_vcpu(v) ? "HVM" : "PV", caps, v->domain->domain_id,
+                v->vcpu_id, g_mcg_cap & ~MCG_CAP_COUNT);
+        return -EPERM;
+    }
+
+    v->arch.mcg_cap = caps;
+    return 0;
+}
+
+static int bank_mce_rdmsr(const struct vcpu *v, uint32_t msr, uint64_t *val)
+{
+    int ret = 1;
+    unsigned int bank = (msr - MSR_IA32_MC0_CTL) / 4;
+    struct domain_mca_msrs *vmce = dom_vmce(v->domain);
     struct bank_entry *entry;
 
-    bank = (msr - MSR_IA32_MC0_CTL) / 4;
-    if ( bank >= nr_mce_banks )
-        return -1;
+    *val = 0;
 
     switch ( msr & (MSR_IA32_MC0_CTL | 3) )
     {
     case MSR_IA32_MC0_CTL:
-        *val = vmce->mci_ctl[bank] &
-               (h_mci_ctrl ? h_mci_ctrl[bank] : ~0UL);
+        if ( bank < nr_mce_banks )
+            *val = vmce->mci_ctl[bank] &
                   (h_mci_ctrl ? h_mci_ctrl[bank] : ~0UL);
         mce_printk(MCE_VERBOSE, "MCE: rdmsr MC%u_CTL 0x%"PRIx64"\n",
                    bank, *val);
         break;
@@ -126,7 +146,7 @@ static int bank_mce_rdmsr(struct domain
         switch ( boot_cpu_data.x86_vendor )
         {
         case X86_VENDOR_INTEL:
-            ret = intel_mce_rdmsr(msr, val);
+            ret = intel_mce_rdmsr(v, msr, val);
             break;
         default:
             ret = 0;
@@ -145,13 +165,13 @@ static int bank_mce_rdmsr(struct domain
  */
 int vmce_rdmsr(uint32_t msr, uint64_t *val)
 {
-    struct domain *d = current->domain;
-    struct domain_mca_msrs *vmce = dom_vmce(d);
+    const struct vcpu *cur = current;
+    struct domain_mca_msrs *vmce = dom_vmce(cur->domain);
     int ret = 1;
 
     *val = 0;
 
-    spin_lock(&dom_vmce(d)->lock);
+    spin_lock(&vmce->lock);
 
     switch ( msr )
     {
@@ -162,39 +182,38 @@ int vmce_rdmsr(uint32_t msr, uint64_t *v
                    "MCE: rdmsr MCG_STATUS 0x%"PRIx64"\n", *val);
         break;
     case MSR_IA32_MCG_CAP:
-        *val = vmce->mcg_cap;
+        *val = cur->arch.mcg_cap;
         mce_printk(MCE_VERBOSE, "MCE: rdmsr MCG_CAP 0x%"PRIx64"\n",
                    *val);
         break;
     case MSR_IA32_MCG_CTL:
         /* Always 0 if no CTL support */
-        *val = vmce->mcg_ctl & h_mcg_ctl;
+        if ( cur->arch.mcg_cap & MCG_CTL_P )
+            *val = vmce->mcg_ctl & h_mcg_ctl;
         mce_printk(MCE_VERBOSE, "MCE: rdmsr MCG_CTL 0x%"PRIx64"\n",
                    *val);
         break;
     default:
-        ret = mce_bank_msr(msr) ? bank_mce_rdmsr(d, msr, val) : 0;
+        ret = mce_bank_msr(cur, msr) ? bank_mce_rdmsr(cur, msr, val) : 0;
         break;
     }
 
-    spin_unlock(&dom_vmce(d)->lock);
+    spin_unlock(&vmce->lock);
     return ret;
 }
 
-static int bank_mce_wrmsr(struct domain *d, u32 msr, u64 val)
+static int bank_mce_wrmsr(struct vcpu *v, u32 msr, u64 val)
 {
-    int bank, ret = 1;
-    struct domain_mca_msrs *vmce = dom_vmce(d);
+    int ret = 1;
+    unsigned int bank = (msr - MSR_IA32_MC0_CTL) / 4;
+    struct domain_mca_msrs *vmce = dom_vmce(v->domain);
     struct bank_entry *entry = NULL;
 
-    bank = (msr - MSR_IA32_MC0_CTL) / 4;
-    if ( bank >= nr_mce_banks )
-        return -EINVAL;
-
     switch ( msr & (MSR_IA32_MC0_CTL | 3) )
     {
     case MSR_IA32_MC0_CTL:
-        vmce->mci_ctl[bank] = val;
+        if ( bank < nr_mce_banks )
+            vmce->mci_ctl[bank] = val;
         break;
     case MSR_IA32_MC0_STATUS:
         /* Give the first entry of the list, it corresponds to current
@@ -202,9 +221,9 @@ static int bank_mce_wrmsr(struct domain
          * the guest, this node will be deleted.
          * Only error bank is written. Non-error banks simply return.
          */
-        if ( !list_empty(&dom_vmce(d)->impact_header) )
+        if ( !list_empty(&vmce->impact_header) )
         {
-            entry = list_entry(dom_vmce(d)->impact_header.next,
+            entry = list_entry(vmce->impact_header.next,
                                struct bank_entry, list);
             if ( entry->bank == bank )
                 entry->mci_status = val;
@@ -228,7 +247,7 @@ static int bank_mce_wrmsr(struct domain
         switch ( boot_cpu_data.x86_vendor )
        {
         case X86_VENDOR_INTEL:
-            ret = intel_mce_wrmsr(msr, val);
+            ret = intel_mce_wrmsr(v, msr, val);
             break;
         default:
             ret = 0;
@@ -247,9 +266,9 @@ static int bank_mce_wrmsr(struct domain
  */
 int vmce_wrmsr(u32 msr, u64 val)
 {
-    struct domain *d = current->domain;
+    struct vcpu *cur = current;
     struct bank_entry *entry = NULL;
-    struct domain_mca_msrs *vmce = dom_vmce(d);
+    struct domain_mca_msrs *vmce = dom_vmce(cur->domain);
     int ret = 1;
 
     if ( !g_mcg_cap )
@@ -266,7 +285,7 @@ int vmce_wrmsr(u32 msr, u64 val)
         vmce->mcg_status = val;
         mce_printk(MCE_VERBOSE, "MCE: wrmsr MCG_STATUS %"PRIx64"\n", val);
         /* For HVM guest, this is the point for deleting vMCE injection node */
-        if ( d->is_hvm && (vmce->nr_injection > 0) )
+        if ( is_hvm_vcpu(cur) && (vmce->nr_injection > 0) )
         {
             vmce->nr_injection--; /* Should be 0 */
             if ( !list_empty(&vmce->impact_header) )
@@ -293,7 +312,7 @@ int vmce_wrmsr(u32 msr, u64 val)
         ret = -1;
         break;
     default:
-        ret = mce_bank_msr(msr) ? bank_mce_wrmsr(d, msr, val) : 0;
+        ret = mce_bank_msr(cur, msr) ? bank_mce_wrmsr(cur, msr, val) : 0;
         break;
     }
 
@@ -301,6 +320,46 @@ int vmce_wrmsr(u32 msr, u64 val)
     return ret;
 }
 
+static int vmce_save_vcpu_ctxt(struct domain *d, hvm_domain_context_t *h)
+{
+    struct vcpu *v;
+    int err = 0;
+
+    for_each_vcpu( d, v ) {
+        struct hvm_vmce_vcpu ctxt = {
+            .caps = v->arch.mcg_cap
+        };
+
+        err = hvm_save_entry(VMCE_VCPU, v->vcpu_id, h, &ctxt);
+        if ( err )
+            break;
+    }
+
+    return err;
+}
+
+static int vmce_load_vcpu_ctxt(struct domain *d, hvm_domain_context_t *h)
+{
+    unsigned int vcpuid = hvm_load_instance(h);
+    struct vcpu *v;
+    struct hvm_vmce_vcpu ctxt;
+    int err;
+
+    if ( vcpuid >= d->max_vcpus || (v = d->vcpu[vcpuid]) == NULL )
+    {
+        dprintk(XENLOG_G_ERR, "HVM restore: dom%d has no vcpu%u\n",
+                d->domain_id, vcpuid);
+        err = -EINVAL;
+    }
+    else
+        err = hvm_load_entry(VMCE_VCPU, h, &ctxt);
+
+    return err ?: vmce_restore_vcpu(v, ctxt.caps);
+}
+
+HVM_REGISTER_SAVE_RESTORE(VMCE_VCPU, vmce_save_vcpu_ctxt,
+                          vmce_load_vcpu_ctxt, 1, HVMSR_PER_VCPU);
+
 int inject_vmce(struct domain *d)
 {
     int cpu = smp_processor_id();
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -422,6 +422,8 @@ int vcpu_initialise(struct vcpu *v)
     if ( (rc = vcpu_init_fpu(v)) != 0 )
         return rc;
 
+    vmce_init_vcpu(v);
+
     if ( is_hvm_domain(d) )
     {
         rc = hvm_vcpu_initialise(v);
--- a/xen/arch/x86/domctl.c
+++ b/xen/arch/x86/domctl.c
@@ -1027,11 +1027,12 @@ long arch_do_domctl(
                 evc->syscall32_callback_eip = 0;
                 evc->syscall32_disables_events = 0;
             }
+            evc->mcg_cap = v->arch.mcg_cap;
         }
         else
         {
             ret = -EINVAL;
-            if ( evc->size != sizeof(*evc) )
+            if ( evc->size < offsetof(typeof(*evc), mcg_cap) )
                 goto ext_vcpucontext_out;
 #ifdef __x86_64__
             if ( !is_hvm_domain(d) )
@@ -1059,6 +1060,10 @@ long arch_do_domctl(
                  (evc->syscall32_callback_cs & ~3) ||
                  evc->syscall32_callback_eip )
                 goto ext_vcpucontext_out;
+
+            if ( evc->size >= offsetof(typeof(*evc), mcg_cap) +
+                              sizeof(evc->mcg_cap) )
+                ret = vmce_restore_vcpu(v, evc->mcg_cap);
         }
 
         ret = 0;
--- a/xen/include/asm-x86/domain.h
+++ b/xen/include/asm-x86/domain.h
@@ -488,6 +488,8 @@ struct arch_vcpu
     /* This variable determines whether nonlazy extended state has been used,
      * and thus should be saved/restored. */
     bool_t nonlazy_xstate_used;
+
+    uint64_t mcg_cap;
 
     struct paging_vcpu paging;
--- a/xen/include/asm-x86/mce.h
+++ b/xen/include/asm-x86/mce.h
@@ -16,7 +16,6 @@ struct bank_entry {
 struct domain_mca_msrs
 {
     /* Guest should not change below values after DOM boot up */
-    uint64_t mcg_cap;
     uint64_t mcg_ctl;
     uint64_t mcg_status;
     uint64_t *mci_ctl;
@@ -28,6 +27,8 @@ struct domain_mca_msrs
 /* Guest vMCE MSRs virtualization */
 extern int vmce_init_msr(struct domain *d);
 extern void vmce_destroy_msr(struct domain *d);
+extern void vmce_init_vcpu(struct vcpu *);
+extern int vmce_restore_vcpu(struct vcpu *, uint64_t caps);
 extern int vmce_wrmsr(uint32_t msr, uint64_t val);
 extern int vmce_rdmsr(uint32_t msr, uint64_t *val);
--- a/xen/include/public/arch-x86/hvm/save.h
+++ b/xen/include/public/arch-x86/hvm/save.h
@@ -575,9 +575,15 @@ struct hvm_viridian_vcpu_context {
 
 DECLARE_HVM_SAVE_TYPE(VIRIDIAN_VCPU, 17, struct hvm_viridian_vcpu_context);
 
+struct hvm_vmce_vcpu {
+    uint64_t caps;
+};
+
+DECLARE_HVM_SAVE_TYPE(VMCE_VCPU, 18, struct hvm_vmce_vcpu);
+
 /*
  * Largest type-code in use
  */
-#define HVM_SAVE_CODE_MAX 17
+#define HVM_SAVE_CODE_MAX 18
 
 #endif /* __XEN_PUBLIC_HVM_SAVE_X86_H__ */
--- a/xen/include/public/domctl.h
+++ b/xen/include/public/domctl.h
@@ -559,7 +559,7 @@ struct xen_domctl_ext_vcpucontext {
     uint32_t         vcpu;
     /*
      * SET: Size of struct (IN)
-     * GET: Size of struct (OUT)
+     * GET: Size of struct (OUT, up to 128 bytes)
      */
     uint32_t         size;
 #if defined(__i386__) || defined(__x86_64__)
@@ -571,6 +571,7 @@ struct xen_domctl_ext_vcpucontext {
     uint16_t         sysenter_callback_cs;
     uint8_t          syscall32_disables_events;
     uint8_t          sysenter_disables_events;
+    uint64_aligned_t mcg_cap;
 #endif
 };
 typedef struct xen_domctl_ext_vcpucontext xen_domctl_ext_vcpucontext_t;

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
>>> On 13.02.12 at 09:30, Olaf Hering <olaf@aepfle.de> wrote:
> On Fri, Feb 10, Jan Beulich wrote:
>
>> > Regarding these messages, are they guest addresses?
>> >
>> > (XEN) traps.c:3131: GPF (0000): ffff82c4801d4c6c -> ffff82c480226eb4
>>
>> No, these are hypervisor ones.
>
> These are the functions they belong to:
>
> (XEN) traps.c:3131: GPF (0000): ffff82c4801d61ec -> ffff82c48022aa2e
> (XEN) do_general_protection: ffff82c4801d61ec vmx_msr_read_intercept+0x2d8/0x358
> (XEN) do_general_protection: ffff82c48022aa2e vmac+0x6c8/0x99a

This appears to be the rdmsr_safe() at the bottom of the default case
in vmx_msr_read_intercept(), and I would expect those to go away once
the vMCE save/restore patch is working properly.

Jan
On Mon, Feb 13, Jan Beulich wrote:

> >>> On 10.02.12 at 22:28, Olaf Hering <olaf@aepfle.de> wrote:
> > These functions are called for dom0, but not for domU. And as a result
> > arch.nr_vmce_banks remains zero. I assume the guest needs to be
> > initialized in some way as well, and that does not happen?
>
> Below/attached a fixed version of the patch.

I get some mismatch after migration, both hosts run the same xen binary.
The small debug patch I use is attached.

Also: The tools do not catch the restore error, so that the guest
continues to run on the source host.

Olaf

nicolai login: (XEN) vmce_init_vcpu 0 o 0 n 806
(XEN) vmce_init_vcpu 1 o 0 n 806
(XEN) vmce_init_vcpu 2 o 0 n 806
(XEN) vmce_init_vcpu 3 o 0 n 806
(XEN) save.c:62:d0 HVM restore (1): VM saved on one CPU (0x206c2) and restored on another (0x10676).
(XEN) save.c:234:d0 HVM restore: CPU 0
(XEN) save.c:234:d0 HVM restore: CPU 1
(XEN) save.c:234:d0 HVM restore: CPU 2
(XEN) save.c:234:d0 HVM restore: CPU 3
(XEN) save.c:234:d0 HVM restore: PIC 0
(XEN) save.c:234:d0 HVM restore: PIC 1
(XEN) save.c:234:d0 HVM restore: IOAPIC 0
(XEN) save.c:234:d0 HVM restore: LAPIC 0
(XEN) save.c:234:d0 HVM restore: LAPIC 1
(XEN) save.c:234:d0 HVM restore: LAPIC 2
(XEN) save.c:234:d0 HVM restore: LAPIC 3
(XEN) save.c:234:d0 HVM restore: LAPIC_REGS 0
(XEN) save.c:234:d0 HVM restore: LAPIC_REGS 1
(XEN) save.c:234:d0 HVM restore: LAPIC_REGS 2
(XEN) save.c:234:d0 HVM restore: LAPIC_REGS 3
(XEN) save.c:234:d0 HVM restore: PCI_IRQ 0
(XEN) save.c:234:d0 HVM restore: ISA_IRQ 0
(XEN) save.c:234:d0 HVM restore: PCI_LINK 0
(XEN) save.c:234:d0 HVM restore: PIT 0
(XEN) save.c:234:d0 HVM restore: RTC 0
(XEN) save.c:234:d0 HVM restore: HPET 0
(XEN) save.c:234:d0 HVM restore: PMTIMER 0
(XEN) save.c:234:d0 HVM restore: MTRR 0
(XEN) save.c:234:d0 HVM restore: MTRR 1
(XEN) save.c:234:d0 HVM restore: MTRR 2
(XEN) save.c:234:d0 HVM restore: MTRR 3
(XEN) save.c:234:d0 HVM restore: VMCE_VCPU 0
(XEN) save.c:291:d0 HVM restore mismatch: expected type 18 length 8, saw type 18 length 1
(XEN) vmce.c:360:d0 vmce_load_vcpu_ctxt ffff82c4802c7d28 ffffffff -1 o 806 n ea
(XEN) save.c:239:d0 HVM restore: failed to load entry 18/0
(XEN) vmce_init_vcpu 0 o 0 n 806
(XEN) vmce_init_vcpu 1 o 0 n 806
(XEN) vmce_init_vcpu 2 o 0 n 806
(XEN) vmce_init_vcpu 3 o 0 n 806
(XEN) save.c:62:d0 HVM restore (2): VM saved on one CPU (0x206c2) and restored on another (0x10676).
(XEN) save.c:234:d0 HVM restore: CPU 0
(XEN) save.c:234:d0 HVM restore: CPU 1
(XEN) save.c:234:d0 HVM restore: CPU 2
(XEN) save.c:234:d0 HVM restore: CPU 3
(XEN) save.c:234:d0 HVM restore: PIC 0
(XEN) save.c:234:d0 HVM restore: PIC 1
(XEN) save.c:234:d0 HVM restore: IOAPIC 0
(XEN) save.c:234:d0 HVM restore: LAPIC 0
(XEN) save.c:234:d0 HVM restore: LAPIC 1
(XEN) save.c:234:d0 HVM restore: LAPIC 2
(XEN) save.c:234:d0 HVM restore: LAPIC 3
(XEN) save.c:234:d0 HVM restore: LAPIC_REGS 0
(XEN) save.c:234:d0 HVM restore: LAPIC_REGS 1
(XEN) save.c:234:d0 HVM restore: LAPIC_REGS 2
(XEN) save.c:234:d0 HVM restore: LAPIC_REGS 3
(XEN) save.c:234:d0 HVM restore: PCI_IRQ 0
(XEN) save.c:234:d0 HVM restore: ISA_IRQ 0
(XEN) save.c:234:d0 HVM restore: PCI_LINK 0
(XEN) save.c:234:d0 HVM restore: PIT 0
(XEN) save.c:234:d0 HVM restore: RTC 0
(XEN) save.c:234:d0 HVM restore: HPET 0
(XEN) save.c:234:d0 HVM restore: PMTIMER 0
(XEN) save.c:234:d0 HVM restore: MTRR 0
(XEN) save.c:234:d0 HVM restore: MTRR 1
(XEN) save.c:234:d0 HVM restore: MTRR 2
(XEN) save.c:234:d0 HVM restore: MTRR 3
(XEN) save.c:234:d0 HVM restore: VMCE_VCPU 0
(XEN) vmce.c:360:d0 vmce_load_vcpu_ctxt ffff83082e377d28 0 0 o 806 n 1809
(XEN) vmce.c:77: HVM restore: unsupported MCA capabilities 0x1809 for d2:v0 (supported: 0x800)
(XEN) save.c:239:d0 HVM restore: failed to load entry 18/0

diff -r cbb1cce5fac0 xen/arch/x86/cpu/mcheck/mce.c
--- a/xen/arch/x86/cpu/mcheck/mce.c
+++ b/xen/arch/x86/cpu/mcheck/mce.c
@@ -743,6 +743,7 @@ int mca_cap_init(void)
 {
     int i;
 
+    printk("%s: nr_mce_banks %x\n", __func__, nr_mce_banks);
     mca_allbanks = mcabanks_alloc();
     for ( i = 0; i < nr_mce_banks; i++)
         mcabanks_set(i, mca_allbanks);
diff -r cbb1cce5fac0 xen/arch/x86/cpu/mcheck/vmce.c
--- a/xen/arch/x86/cpu/mcheck/vmce.c
+++ b/xen/arch/x86/cpu/mcheck/vmce.c
@@ -63,6 +63,7 @@ void vmce_destroy_msr(struct domain *d)
 
 void vmce_init_vcpu(struct vcpu *v)
 {
+    printk("%s %u o %lx n %lx\n", __func__, v->vcpu_id, v->arch.mcg_cap, g_mcg_cap);
     v->arch.mcg_cap = g_mcg_cap;
 }
 
@@ -331,6 +332,7 @@ static int vmce_save_vcpu_ctxt(struct do
         };
 
         err = hvm_save_entry(VMCE_VCPU, v->vcpu_id, h, &ctxt);
+        gdprintk(XENLOG_ERR, "%s %p %u %x %d o %lx\n", __func__, h, v->vcpu_id, err, err, ctxt.caps);
         if ( err )
             break;
     }
@@ -352,8 +354,11 @@ static int vmce_load_vcpu_ctxt(struct do
         err = -EINVAL;
     }
     else
+    {
         err = hvm_load_entry(VMCE_VCPU, h, &ctxt);
+        gdprintk(XENLOG_ERR, "%s %p %x %d o %lx n %lx\n", __func__, h, err, err, v->arch.mcg_cap, ctxt.caps);
+    }
 
     return err ?: vmce_restore_vcpu(v, ctxt.caps);
 }
diff -r cbb1cce5fac0 xen/arch/x86/domctl.c
--- a/xen/arch/x86/domctl.c
+++ b/xen/arch/x86/domctl.c
@@ -1027,6 +1027,7 @@ long arch_do_domctl(
                 evc->syscall32_callback_eip = 0;
                 evc->syscall32_disables_events = 0;
             }
+            gdprintk(XENLOG_ERR, "%s %u n %lx o %lx\n", __func__, v->vcpu_id, v->arch.mcg_cap, evc->mcg_cap);
             evc->mcg_cap = v->arch.mcg_cap;
         }
         else
@@ -1061,6 +1062,7 @@ long arch_do_domctl(
                  evc->syscall32_callback_eip )
                 goto ext_vcpucontext_out;
 
+            gdprintk(XENLOG_ERR, "%s %u n %lx o %lx\n", __func__, v->vcpu_id, v->arch.mcg_cap, evc->mcg_cap);
             if ( evc->size >= offsetof(typeof(*evc), mcg_cap) +
                               sizeof(evc->mcg_cap) )
                 ret = vmce_restore_vcpu(v, evc->mcg_cap);
diff -r cbb1cce5fac0 xen/arch/x86/traps.c
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -3129,6 +3129,8 @@ void do_general_protection(struct cpu_us
     {
         dprintk(XENLOG_INFO, "GPF (%04x): %p -> %p\n",
                 regs->error_code, _p(regs->eip), _p(fixup));
+        printk ("%s: %p ", __func__, _p(regs->eip)); print_symbol("%s\n", regs->eip);
+        printk ("%s: %p ", __func__, _p(fixup)); print_symbol("%s\n", fixup);
         regs->eip = fixup;
         return;
     }
>>> On 13.02.12 at 15:20, Olaf Hering <olaf@aepfle.de> wrote:
> On Mon, Feb 13, Jan Beulich wrote:
>
>> >>> On 10.02.12 at 22:28, Olaf Hering <olaf@aepfle.de> wrote:
>> > These functions are called for dom0, but not for domU. And as a result
>> > arch.nr_vmce_banks remains zero. I assume the guest needs to be
>> > initialized in some way as well, and that does not happen?
>>
>> Below/attached a fixed version of the patch.
>
> I get some mismatch after migration, both hosts run the same xen binary.

Are you sure about that? I'm asking because ...

> The small debug patch I use is attached.
>
> Also: The tools do not catch the restore error, so that the guest
> continues to run on the source host.
>
> Olaf
>
> nicolai login: (XEN) vmce_init_vcpu 0 o 0 n 806
> (XEN) vmce_init_vcpu 1 o 0 n 806
> (XEN) vmce_init_vcpu 2 o 0 n 806
> (XEN) vmce_init_vcpu 3 o 0 n 806
> (XEN) save.c:62:d0 HVM restore (1): VM saved on one CPU (0x206c2) and
> restored on another (0x10676).
> (XEN) save.c:234:d0 HVM restore: CPU 0
> (XEN) save.c:234:d0 HVM restore: CPU 1
> (XEN) save.c:234:d0 HVM restore: CPU 2
> (XEN) save.c:234:d0 HVM restore: CPU 3
> (XEN) save.c:234:d0 HVM restore: PIC 0
> (XEN) save.c:234:d0 HVM restore: PIC 1
> (XEN) save.c:234:d0 HVM restore: IOAPIC 0
> (XEN) save.c:234:d0 HVM restore: LAPIC 0
> (XEN) save.c:234:d0 HVM restore: LAPIC 1
> (XEN) save.c:234:d0 HVM restore: LAPIC 2
> (XEN) save.c:234:d0 HVM restore: LAPIC 3
> (XEN) save.c:234:d0 HVM restore: LAPIC_REGS 0
> (XEN) save.c:234:d0 HVM restore: LAPIC_REGS 1
> (XEN) save.c:234:d0 HVM restore: LAPIC_REGS 2
> (XEN) save.c:234:d0 HVM restore: LAPIC_REGS 3
> (XEN) save.c:234:d0 HVM restore: PCI_IRQ 0
> (XEN) save.c:234:d0 HVM restore: ISA_IRQ 0
> (XEN) save.c:234:d0 HVM restore: PCI_LINK 0
> (XEN) save.c:234:d0 HVM restore: PIT 0
> (XEN) save.c:234:d0 HVM restore: RTC 0
> (XEN) save.c:234:d0 HVM restore: HPET 0
> (XEN) save.c:234:d0 HVM restore: PMTIMER 0
> (XEN) save.c:234:d0 HVM restore: MTRR 0
> (XEN) save.c:234:d0 HVM restore: MTRR 1
> (XEN) save.c:234:d0 HVM restore: MTRR 2
> (XEN) save.c:234:d0 HVM restore: MTRR 3
> (XEN) save.c:234:d0 HVM restore: VMCE_VCPU 0
> (XEN) save.c:291:d0 HVM restore mismatch: expected type 18 length 8, saw
> type 18 length 1

... this suggests the source host was still running with the old binary
(where the save record was indeed just 1 byte in size) ...

> (XEN) vmce.c:360:d0 vmce_load_vcpu_ctxt ffff82c4802c7d28 ffffffff -1 o 806 n ea
> (XEN) save.c:239:d0 HVM restore: failed to load entry 18/0
> (XEN) vmce_init_vcpu 0 o 0 n 806
> (XEN) vmce_init_vcpu 1 o 0 n 806
> (XEN) vmce_init_vcpu 2 o 0 n 806
> (XEN) vmce_init_vcpu 3 o 0 n 806
> (XEN) save.c:62:d0 HVM restore (2): VM saved on one CPU (0x206c2) and
> restored on another (0x10676).
> (XEN) save.c:234:d0 HVM restore: CPU 0
> (XEN) save.c:234:d0 HVM restore: CPU 1
> (XEN) save.c:234:d0 HVM restore: CPU 2
> (XEN) save.c:234:d0 HVM restore: CPU 3
> (XEN) save.c:234:d0 HVM restore: PIC 0
> (XEN) save.c:234:d0 HVM restore: PIC 1
> (XEN) save.c:234:d0 HVM restore: IOAPIC 0
> (XEN) save.c:234:d0 HVM restore: LAPIC 0
> (XEN) save.c:234:d0 HVM restore: LAPIC 1
> (XEN) save.c:234:d0 HVM restore: LAPIC 2
> (XEN) save.c:234:d0 HVM restore: LAPIC 3
> (XEN) save.c:234:d0 HVM restore: LAPIC_REGS 0
> (XEN) save.c:234:d0 HVM restore: LAPIC_REGS 1
> (XEN) save.c:234:d0 HVM restore: LAPIC_REGS 2
> (XEN) save.c:234:d0 HVM restore: LAPIC_REGS 3
> (XEN) save.c:234:d0 HVM restore: PCI_IRQ 0
> (XEN) save.c:234:d0 HVM restore: ISA_IRQ 0
> (XEN) save.c:234:d0 HVM restore: PCI_LINK 0
> (XEN) save.c:234:d0 HVM restore: PIT 0
> (XEN) save.c:234:d0 HVM restore: RTC 0
> (XEN) save.c:234:d0 HVM restore: HPET 0
> (XEN) save.c:234:d0 HVM restore: PMTIMER 0
> (XEN) save.c:234:d0 HVM restore: MTRR 0
> (XEN) save.c:234:d0 HVM restore: MTRR 1
> (XEN) save.c:234:d0 HVM restore: MTRR 2
> (XEN) save.c:234:d0 HVM restore: MTRR 3
> (XEN) save.c:234:d0 HVM restore: VMCE_VCPU 0
> (XEN) vmce.c:360:d0 vmce_load_vcpu_ctxt ffff83082e377d28 0 0 o 806 n 1809
> (XEN) vmce.c:77: HVM restore: unsupported MCA capabilities 0x1809 for d2:v0 (supported: 0x800)

... whereas this one suggests that the size check passed (same binary
on both ends), but the feature sets of the hosts are different (sorry,
I can't suggest a way around this, the hosts must be sufficiently
matching in MCA features - those differences that are tolerable are
already being handled). I'm confused as to what was running on the
source host.

Let me check what bit 12 is: the latest SDM considers this reserved, so
the only way I can see how to deal with this is to switch the setting
up of g_mcg_cap from a black-listing approach to a white-listing one.

> (XEN) save.c:239:d0 HVM restore: failed to load entry 18/0

Bottom line - it appears to work as intended considering the second
attempt.

Jan
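The 0x1809 value in the log decodes to a bank count of 9 plus MCG_TES_P
(bit 11) and the reserved bit 12. A white-listing setup keeps only the
capability bits the vMCE code knows how to virtualize, so reserved bits
can never leak into g_mcg_cap. A rough sketch of the idea (bit positions
per the SDM; the mask choice and helper name are illustrative, not Xen's
actual code):

```c
#include <stdint.h>

/* IA32_MCG_CAP fields (per the SDM). */
#define MCG_CAP_COUNT 0xffULL        /* bits 7:0 - number of banks */
#define MCG_CTL_P     (1ULL << 8)
#define MCG_TES_P     (1ULL << 11)
#define MCG_SER_P     (1ULL << 24)

/* Keep only the capabilities we know how to virtualize (white-listing),
 * rather than stripping individually known-bad bits (black-listing). */
static uint64_t whitelist_mcg_cap(uint64_t host_cap)
{
    return host_cap & (MCG_CAP_COUNT | MCG_CTL_P | MCG_TES_P | MCG_SER_P);
}
```

Applied to the value from the log, the reserved bit is dropped (0x1809
becomes 0x809), which a restore-side check against the local g_mcg_cap
can then meaningfully compare.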
>>> On 13.02.12 at 15:20, Olaf Hering <olaf@aepfle.de> wrote:
> (XEN) vmce.c:360:d0 vmce_load_vcpu_ctxt ffff83082e377d28 0 0 o 806 n 1809
> (XEN) vmce.c:77: HVM restore: unsupported MCA capabilities 0x1809 for d2:v0
> (supported: 0x800)
> (XEN) save.c:239:d0 HVM restore: failed to load entry 18/0

Attached an updated patch that sets up g_mcg_cap using a white-listing
bit mask, as mentioned in the previous response.

Note that this takes -unstable c/s 24781:6ae5506e49ab as prerequisite.

Jan
On Tue, Feb 14, Jan Beulich wrote:

> >>> On 13.02.12 at 15:20, Olaf Hering <olaf@aepfle.de> wrote:
> >> Below/attached a fixed version of the patch.
> >
> > I get some mismatch after migration, both hosts run the same xen binary.
>
> Are you sure about that? I'm asking because ...

Yes. But I will move to another source host since the current one has
now only a 10MBit/sec connection for some reason, so I will double
check that both run indeed the same code.

Olaf
On Tue, Feb 14, Jan Beulich wrote:

> >>> On 13.02.12 at 15:20, Olaf Hering <olaf@aepfle.de> wrote:
> > (XEN) vmce.c:360:d0 vmce_load_vcpu_ctxt ffff83082e377d28 0 0 o 806 n 1809
> > (XEN) vmce.c:77: HVM restore: unsupported MCA capabilities 0x1809 for d2:v0
> > (supported: 0x800)
> > (XEN) save.c:239:d0 HVM restore: failed to load entry 18/0
>
> Attached an updated patch that sets up g_mcg_cap using a white-listing
> bit mask, as mentioned in the previous response.
>
> Note that this takes -unstable c/s 24781:6ae5506e49ab as prerequisite.

Thanks, this patch works OK for me, migrating from a host with 9 MCE
banks to a host with 6 MCE banks.

Olaf