x86's vMCE implementation lets a guest know of as many MCE reporting banks as there are in the host. While a PV guest could be expected to deal with this number changing (particularly decreasing) during migration (not currently handled anywhere, afaict), for HVM guests this is certainly wrong.

It isn't clear to me, however, how to handle this properly. The easiest approach would appear to be to save and restore the number of banks the guest was made to believe it can access, making vmce_{rd,wr}msr() silently tolerate accesses between the host and guest values.

Any thoughts appreciated,
Jan
On Mon, Jan 23, 2012 at 11:08 AM, Jan Beulich <JBeulich@suse.com> wrote:
> x86's vMCE implementation lets a guest know of as many MCE reporting
> banks as there are in the host. While a PV guest could be expected to
> deal with this number changing (particularly decreasing) during migration
> (not currently handled anywhere afaict), for HVM guests this is certainly
> wrong.
>
> At least to me it isn't, however, clear how to properly handle this. The
> easiest would appear to be to save and restore the number of banks
> the guest was made believe it can access, making vmce_{rd,wr}msr()
> silently tolerate accesses between the host and guest values.

We ran into this in the XS 6.0 release as well. I think that the ideal thing to do would be to have a parameter that can be set at boot, to say how many vMCE banks a guest has, defaulting to the number of MCE banks on the host. This parameter would be preserved across migration. Ideally, a pool-aware toolstack like xapi would then set this value to be the value of the host in the pool with the largest number of banks, allowing a guest to access all the banks on any host to which it migrates.

What do you think?
 -George
>>> On 24.01.12 at 11:29, George Dunlap <George.Dunlap@eu.citrix.com> wrote:
> On Mon, Jan 23, 2012 at 11:08 AM, Jan Beulich <JBeulich@suse.com> wrote:
>> [original problem statement snipped]
>
> We ran into this in the XS 6.0 release as well. I think that the
> ideal thing to do would be to have a parameter that can be set at
> boot, to say how many vMCE banks a guest has, defaulting to the number
> of MCE banks on the host. This parameter would be preserved across
> migration. Ideally, a pool-aware toolstack like xapi would then set
> this value to be the value of the host in the pool with the largest
> number of banks, allowing a guest to access all the banks on any host
> to which it migrates.
>
> What do you think?

That sounds like the way to go.

Jan
On Tue, 2012-01-24 at 11:08 +0000, Jan Beulich wrote:
> >>> On 24.01.12 at 11:29, George Dunlap <George.Dunlap@eu.citrix.com> wrote:
> > [earlier discussion snipped]
> >
> > What do you think?
>
> That sounds like the way to go.

So should we put this on IanC's to-do-be-done list? Are you going to put it on your to-do list? :-)

 -George
>>> On 26.01.12 at 17:54, George Dunlap <george.dunlap@citrix.com> wrote:
> On Tue, 2012-01-24 at 11:08 +0000, Jan Beulich wrote:
> > [earlier discussion snipped]
> >
> > That sounds like the way to go.
>
> So should we put this on IanC's to-do-be-done list?

Would certainly be desirable to get fixed.

> Are you going to put it on your to-do list? :-)

I'll first check with Intel whether the engineers who wrote that code could make an attempt.

Jan
>>> On 26.01.12 at 17:54, George Dunlap <george.dunlap@citrix.com> wrote:
> On Tue, 2012-01-24 at 11:08 +0000, Jan Beulich wrote:
> > [earlier discussion snipped]
> >
> > That sounds like the way to go.
>
> So should we put this on IanC's to-do-be-done list? Are you going to
> put it on your to-do list? :-)

Below/attached a draft patch (compile tested only), handling save/restore of the bank count, but not allowing for a config setting to specify its initial value (yet).
Jan

--- a/tools/libxc/xc_domain_save.c
+++ b/tools/libxc/xc_domain_save.c
@@ -1879,6 +1879,7 @@ int xc_domain_save(xc_interface *xch, in

         domctl.cmd = XEN_DOMCTL_get_ext_vcpucontext;
         domctl.domain = dom;
+        memset(&domctl.u, 0, sizeof(domctl.u));
         domctl.u.ext_vcpucontext.vcpu = i;
         if ( xc_domctl(xch, &domctl) < 0 )
         {
--- a/tools/misc/xen-hvmctx.c
+++ b/tools/misc/xen-hvmctx.c
@@ -383,6 +383,13 @@ static void dump_viridian_vcpu(void)
            (unsigned long long) p.apic_assist);
 }

+static void dump_vmce_vcpu(void)
+{
+    HVM_SAVE_TYPE(VMCE_VCPU) p;
+    READ(p);
+    printf("    VMCE_VCPU: nr_banks %u\n", p.nr_banks);
+}
+
 int main(int argc, char **argv)
 {
     int entry, domid;
@@ -449,6 +456,7 @@ int main(int argc, char **argv)
         case HVM_SAVE_CODE(MTRR): dump_mtrr(); break;
         case HVM_SAVE_CODE(VIRIDIAN_DOMAIN): dump_viridian_domain(); break;
         case HVM_SAVE_CODE(VIRIDIAN_VCPU): dump_viridian_vcpu(); break;
+        case HVM_SAVE_CODE(VMCE_VCPU): dump_vmce_vcpu(); break;
         case HVM_SAVE_CODE(END): break;
         default: printf(" ** Don't understand type %u: skipping\n",
--- a/xen/arch/x86/cpu/mcheck/mce.h
+++ b/xen/arch/x86/cpu/mcheck/mce.h
@@ -3,6 +3,7 @@
 #define _MCE_H

 #include <xen/init.h>
+#include <xen/sched.h>
 #include <xen/smp.h>
 #include <asm/types.h>
 #include <asm/traps.h>
@@ -54,8 +55,8 @@ int unmmap_broken_page(struct domain *d,
 u64 mce_cap_init(void);
 extern int firstbank;

-int intel_mce_rdmsr(uint32_t msr, uint64_t *val);
-int intel_mce_wrmsr(uint32_t msr, uint64_t val);
+int intel_mce_rdmsr(const struct vcpu *, uint32_t msr, uint64_t *val);
+int intel_mce_wrmsr(struct vcpu *, uint32_t msr, uint64_t val);

 struct mcinfo_extended *intel_get_extended_msrs(
     struct mcinfo_global *mig, struct mc_info *mi);
@@ -171,18 +172,20 @@ int vmce_domain_inject(struct mcinfo_ban

 extern int vmce_init(struct cpuinfo_x86 *c);

-static inline int mce_vendor_bank_msr(uint32_t msr)
+static inline int mce_vendor_bank_msr(const struct vcpu *v, uint32_t msr)
 {
     if ( boot_cpu_data.x86_vendor == X86_VENDOR_INTEL &&
-         msr >= MSR_IA32_MC0_CTL2 && msr < (MSR_IA32_MC0_CTL2 + nr_mce_banks) )
+         msr >= MSR_IA32_MC0_CTL2 &&
+         msr < MSR_IA32_MCx_CTL2(v->arch.nr_vmce_banks) )
         return 1;
     return 0;
 }

-static inline int mce_bank_msr(uint32_t msr)
+static inline int mce_bank_msr(const struct vcpu *v, uint32_t msr)
 {
-    if ( (msr >= MSR_IA32_MC0_CTL && msr < MSR_IA32_MCx_CTL(nr_mce_banks)) ||
-         mce_vendor_bank_msr(msr) )
+    if ( (msr >= MSR_IA32_MC0_CTL &&
+          msr < MSR_IA32_MCx_CTL(v->arch.nr_vmce_banks)) ||
+         mce_vendor_bank_msr(v, msr) )
         return 1;
     return 0;
 }
--- a/xen/arch/x86/cpu/mcheck/mce_intel.c
+++ b/xen/arch/x86/cpu/mcheck/mce_intel.c
@@ -1421,11 +1421,12 @@ enum mcheck_type intel_mcheck_init(struc
 }

 /* intel specific MCA MSR */
-int intel_mce_wrmsr(uint32_t msr, uint64_t val)
+int intel_mce_wrmsr(struct vcpu *v, uint32_t msr, uint64_t val)
 {
     int ret = 0;

-    if (msr >= MSR_IA32_MC0_CTL2 && msr < (MSR_IA32_MC0_CTL2 + nr_mce_banks))
+    if ( msr >= MSR_IA32_MC0_CTL2 &&
+         msr < MSR_IA32_MCx_CTL2(v->arch.nr_vmce_banks) )
     {
         mce_printk(MCE_QUIET, "We have disabled CMCI capability, "
                    "Guest should not write this MSR!\n");
@@ -1435,11 +1436,12 @@ int intel_mce_wrmsr(uint32_t msr, uint64
     return ret;
 }

-int intel_mce_rdmsr(uint32_t msr, uint64_t *val)
+int intel_mce_rdmsr(const struct vcpu *v, uint32_t msr, uint64_t *val)
 {
     int ret = 0;

-    if (msr >= MSR_IA32_MC0_CTL2 && msr < (MSR_IA32_MC0_CTL2 + nr_mce_banks))
+    if ( msr >= MSR_IA32_MC0_CTL2 &&
+         msr < MSR_IA32_MCx_CTL2(v->arch.nr_vmce_banks) )
     {
         mce_printk(MCE_QUIET, "We have disabled CMCI capability, "
                    "Guest should not read this MSR!\n");
--- a/xen/arch/x86/cpu/mcheck/vmce.c
+++ b/xen/arch/x86/cpu/mcheck/vmce.c
@@ -10,6 +10,7 @@
 #include <xen/delay.h>
 #include <xen/smp.h>
 #include <xen/mm.h>
+#include <xen/hvm/save.h>
 #include <asm/processor.h>
 #include <public/sysctl.h>
 #include <asm/system.h>
@@ -61,21 +62,26 @@ void vmce_destroy_msr(struct domain *d)
     dom_vmce(d) = NULL;
 }

-static int bank_mce_rdmsr(struct domain *d, uint32_t msr, uint64_t *val)
+void vmce_init_vcpu(struct vcpu *v)
 {
-    int bank, ret = 1;
-    struct domain_mca_msrs *vmce = dom_vmce(d);
+    v->arch.nr_vmce_banks = nr_mce_banks;
+}
+
+static int bank_mce_rdmsr(const struct vcpu *v, uint32_t msr, uint64_t *val)
+{
+    int ret = 1;
+    unsigned int bank = (msr - MSR_IA32_MC0_CTL) / 4;
+    struct domain_mca_msrs *vmce = dom_vmce(v->domain);
     struct bank_entry *entry;

-    bank = (msr - MSR_IA32_MC0_CTL) / 4;
-    if ( bank >= nr_mce_banks )
-        return -1;
+    *val = 0;

     switch ( msr & (MSR_IA32_MC0_CTL | 3) )
     {
     case MSR_IA32_MC0_CTL:
-        *val = vmce->mci_ctl[bank] &
-               (h_mci_ctrl ? h_mci_ctrl[bank] : ~0UL);
+        if ( bank < nr_mce_banks )
+            *val = vmce->mci_ctl[bank] &
+                   (h_mci_ctrl ? h_mci_ctrl[bank] : ~0UL);
         mce_printk(MCE_VERBOSE, "MCE: rdmsr MC%u_CTL 0x%"PRIx64"\n",
                    bank, *val);
         break;
@@ -126,7 +132,7 @@ static int bank_mce_rdmsr(struct domain
         switch ( boot_cpu_data.x86_vendor )
         {
         case X86_VENDOR_INTEL:
-            ret = intel_mce_rdmsr(msr, val);
+            ret = intel_mce_rdmsr(v, msr, val);
             break;
         default:
             ret = 0;
@@ -145,13 +151,13 @@ static int bank_mce_rdmsr(struct domain
  */
 int vmce_rdmsr(uint32_t msr, uint64_t *val)
 {
-    struct domain *d = current->domain;
-    struct domain_mca_msrs *vmce = dom_vmce(d);
+    const struct vcpu *cur = current;
+    struct domain_mca_msrs *vmce = dom_vmce(cur->domain);
     int ret = 1;

     *val = 0;

-    spin_lock(&dom_vmce(d)->lock);
+    spin_lock(&vmce->lock);

     switch ( msr )
     {
@@ -173,28 +179,26 @@ int vmce_rdmsr(uint32_t msr, uint64_t *v
                    *val);
         break;
     default:
-        ret = mce_bank_msr(msr) ? bank_mce_rdmsr(d, msr, val) : 0;
+        ret = mce_bank_msr(cur, msr) ? bank_mce_rdmsr(cur, msr, val) : 0;
         break;
     }

-    spin_unlock(&dom_vmce(d)->lock);
+    spin_unlock(&vmce->lock);
     return ret;
 }

-static int bank_mce_wrmsr(struct domain *d, u32 msr, u64 val)
+static int bank_mce_wrmsr(struct vcpu *v, u32 msr, u64 val)
 {
-    int bank, ret = 1;
-    struct domain_mca_msrs *vmce = dom_vmce(d);
+    int ret = 1;
+    unsigned int bank = (msr - MSR_IA32_MC0_CTL) / 4;
+    struct domain_mca_msrs *vmce = dom_vmce(v->domain);
     struct bank_entry *entry = NULL;

-    bank = (msr - MSR_IA32_MC0_CTL) / 4;
-    if ( bank >= nr_mce_banks )
-        return -EINVAL;
-
     switch ( msr & (MSR_IA32_MC0_CTL | 3) )
     {
     case MSR_IA32_MC0_CTL:
-        vmce->mci_ctl[bank] = val;
+        if ( bank < nr_mce_banks )
+            vmce->mci_ctl[bank] = val;
         break;
     case MSR_IA32_MC0_STATUS:
         /* Give the first entry of the list, it corresponds to current
@@ -202,9 +206,9 @@ static int bank_mce_wrmsr(struct domain
          * the guest, this node will be deleted.
          * Only error bank is written. Non-error banks simply return.
          */
-        if ( !list_empty(&dom_vmce(d)->impact_header) )
+        if ( !list_empty(&vmce->impact_header) )
         {
-            entry = list_entry(dom_vmce(d)->impact_header.next,
+            entry = list_entry(vmce->impact_header.next,
                                struct bank_entry, list);
             if ( entry->bank == bank )
                 entry->mci_status = val;
@@ -228,7 +232,7 @@ static int bank_mce_wrmsr(struct domain
         switch ( boot_cpu_data.x86_vendor )
         {
         case X86_VENDOR_INTEL:
-            ret = intel_mce_wrmsr(msr, val);
+            ret = intel_mce_wrmsr(v, msr, val);
             break;
         default:
             ret = 0;
@@ -247,9 +251,9 @@ static int bank_mce_wrmsr(struct domain
  */
 int vmce_wrmsr(u32 msr, u64 val)
 {
-    struct domain *d = current->domain;
+    struct vcpu *cur = current;
     struct bank_entry *entry = NULL;
-    struct domain_mca_msrs *vmce = dom_vmce(d);
+    struct domain_mca_msrs *vmce = dom_vmce(cur->domain);
     int ret = 1;

     if ( !g_mcg_cap )
@@ -266,7 +270,7 @@ int vmce_wrmsr(u32 msr, u64 val)
         vmce->mcg_status = val;
         mce_printk(MCE_VERBOSE, "MCE: wrmsr MCG_STATUS %"PRIx64"\n", val);
         /* For HVM guest, this is the point for deleting vMCE injection node */
-        if ( d->is_hvm && (vmce->nr_injection > 0) )
+        if ( is_hvm_vcpu(cur) && (vmce->nr_injection > 0) )
         {
             vmce->nr_injection--; /* Should be 0 */
             if ( !list_empty(&vmce->impact_header) )
@@ -293,7 +297,7 @@ int vmce_wrmsr(u32 msr, u64 val)
         ret = -1;
         break;
     default:
-        ret = mce_bank_msr(msr) ? bank_mce_wrmsr(d, msr, val) : 0;
+        ret = mce_bank_msr(cur, msr) ? bank_mce_wrmsr(cur, msr, val) : 0;
         break;
     }

@@ -301,6 +305,47 @@ int vmce_wrmsr(u32 msr, u64 val)
     return ret;
 }

+static int vmce_save_vcpu_ctxt(struct domain *d, hvm_domain_context_t *h)
+{
+    struct vcpu *v;
+    int err = 0;
+
+    for_each_vcpu( d, v ) {
+        struct hvm_vmce_vcpu ctxt = {
+            .nr_banks = v->arch.nr_vmce_banks
+        };
+
+        err = hvm_save_entry(VMCE_VCPU, v->vcpu_id, h, &ctxt);
+        if ( err )
+            break;
+    }
+
+    return err;
+}
+
+static int vmce_load_vcpu_ctxt(struct domain *d, hvm_domain_context_t *h)
+{
+    unsigned int vcpuid = hvm_load_instance(h);
+    struct vcpu *v;
+    struct hvm_vmce_vcpu ctxt;
+    int err;
+
+    if ( vcpuid >= d->max_vcpus || (v = d->vcpu[vcpuid]) == NULL )
+    {
+        gdprintk(XENLOG_ERR, "HVM restore: domain has no vcpu %u\n", vcpuid);
+        return -EINVAL;
+    }
+
+    err = hvm_load_entry(VMCE_VCPU, h, &ctxt);
+    if ( !err )
+        v->arch.nr_vmce_banks = ctxt.nr_banks;
+
+    return err;
+}
+
+HVM_REGISTER_SAVE_RESTORE(VMCE_VCPU, vmce_save_vcpu_ctxt,
+                          vmce_load_vcpu_ctxt, 1, HVMSR_PER_VCPU);
+
 int inject_vmce(struct domain *d)
 {
     int cpu = smp_processor_id();
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -468,6 +468,8 @@ int vcpu_initialise(struct vcpu *v)

     v->arch.pv_vcpu.ctrlreg[4] = real_cr4_to_pv_guest_cr4(mmu_cr4_features);

+    vmce_init_vcpu(v);
+
     rc = is_pv_32on64_vcpu(v) ? setup_compat_l4(v) : 0;
  done:
     if ( rc )
--- a/xen/arch/x86/domctl.c
+++ b/xen/arch/x86/domctl.c
@@ -1027,11 +1027,12 @@ long arch_do_domctl(
                 evc->syscall32_callback_eip = 0;
                 evc->syscall32_disables_events = 0;
             }
+            evc->nr_mce_banks = v->arch.nr_vmce_banks;
         }
         else
         {
             ret = -EINVAL;
-            if ( evc->size != sizeof(*evc) )
+            if ( evc->size < offsetof(typeof(*evc), nr_mce_banks) )
                 goto ext_vcpucontext_out;
 #ifdef __x86_64__
             if ( !is_hvm_domain(d) )
@@ -1059,6 +1060,16 @@ long arch_do_domctl(
                  (evc->syscall32_callback_cs & ~3) ||
                  evc->syscall32_callback_eip )
                 goto ext_vcpucontext_out;
+
+            if ( evc->size >= offsetof(typeof(*evc), mbz[ARRAY_SIZE(evc->mbz)]) )
+            {
+                unsigned int i;
+
+                for ( i = 0; i < ARRAY_SIZE(evc->mbz); ++i )
+                    if ( evc->mbz[i] )
+                        goto ext_vcpucontext_out;
+                v->arch.nr_vmce_banks = evc->nr_mce_banks;
+            }
         }

         ret = 0;
--- a/xen/include/asm-x86/domain.h
+++ b/xen/include/asm-x86/domain.h
@@ -488,6 +488,8 @@ struct arch_vcpu
     /* This variable determines whether nonlazy extended state has been used,
      * and thus should be saved/restored. */
     bool_t nonlazy_xstate_used;
+
+    uint8_t nr_vmce_banks;

     struct paging_vcpu paging;

--- a/xen/include/asm-x86/mce.h
+++ b/xen/include/asm-x86/mce.h
@@ -28,6 +28,7 @@ struct domain_mca_msrs
 /* Guest vMCE MSRs virtualization */
 extern int vmce_init_msr(struct domain *d);
 extern void vmce_destroy_msr(struct domain *d);
+extern void vmce_init_vcpu(struct vcpu *);
 extern int vmce_wrmsr(uint32_t msr, uint64_t val);
 extern int vmce_rdmsr(uint32_t msr, uint64_t *val);

--- a/xen/include/public/arch-x86/hvm/save.h
+++ b/xen/include/public/arch-x86/hvm/save.h
@@ -585,9 +585,15 @@ struct hvm_viridian_vcpu_context {

 DECLARE_HVM_SAVE_TYPE(VIRIDIAN_VCPU, 17, struct hvm_viridian_vcpu_context);

+struct hvm_vmce_vcpu {
+    uint8_t nr_banks;
+};
+
+DECLARE_HVM_SAVE_TYPE(VMCE_VCPU, 18, struct hvm_vmce_vcpu);
+
 /*
  * Largest type-code in use
  */
-#define HVM_SAVE_CODE_MAX 17
+#define HVM_SAVE_CODE_MAX 18

 #endif /* __XEN_PUBLIC_HVM_SAVE_X86_H__ */
--- a/xen/include/public/domctl.h
+++ b/xen/include/public/domctl.h
@@ -559,7 +559,7 @@ struct xen_domctl_ext_vcpucontext {
     uint32_t vcpu;
     /*
      * SET: Size of struct (IN)
-     * GET: Size of struct (OUT)
+     * GET: Size of struct (OUT, up to 128 bytes)
      */
     uint32_t size;
 #if defined(__i386__) || defined(__x86_64__)
@@ -571,6 +571,10 @@ struct xen_domctl_ext_vcpucontext {
     uint16_t sysenter_callback_cs;
     uint8_t syscall32_disables_events;
     uint8_t sysenter_disables_events;
+    uint8_t nr_mce_banks;
+    /* This array can be split and used for future extension. */
+    /* It is there just to grow the structure beyond 4.1's size. */
+    uint8_t mbz[9];
 #endif
 };
 typedef struct xen_domctl_ext_vcpucontext xen_domctl_ext_vcpucontext_t;
On Mon, 2012-01-30 at 13:47 +0000, Jan Beulich wrote:
> >>> On 26.01.12 at 17:54, George Dunlap <george.dunlap@citrix.com> wrote:
> > [earlier discussion snipped]
>
> Below/attached a draft patch (compile tested only), handling save/
> restore of the bank count, but not allowing for a config setting to
> specify its initial value (yet).

Looks pretty good for a first blush. Just one question: Why is the vMCE bank count kept on a per-vcpu basis, rather than on a per-domain basis, like the actual banks are? Is the host MCE stuff per-vcpu?

 -George

[draft patch quoted in full; snipped]
On Tue, 2012-01-31 at 11:27 +0000, George Dunlap wrote:> On Mon, 2012-01-30 at 13:47 +0000, Jan Beulich wrote: > > >>> On 26.01.12 at 17:54, George Dunlap <george.dunlap@citrix.com> wrote: > > > On Tue, 2012-01-24 at 11:08 +0000, Jan Beulich wrote: > > >> >>> On 24.01.12 at 11:29, George Dunlap <George.Dunlap@eu.citrix.com> wrote: > > >> > On Mon, Jan 23, 2012 at 11:08 AM, Jan Beulich <JBeulich@suse.com> wrote: > > >> >> x86''s vMCE implementation lets a guest know of as many MCE reporting > > >> >> banks as there are in the host. While a PV guest could be expected to > > >> >> deal with this number changing (particularly decreasing) during migration > > >> >> (not currently handled anywhere afaict), for HVM guests this is certainly > > >> >> wrong. > > >> >> > > >> >> At least to me it isn''t, however, clear how to properly handle this. The > > >> >> easiest would appear to be to save and restore the number of banks > > >> >> the guest was made believe it can access, making vmce_{rd,wr}msr() > > >> >> silently tolerate accesses between the host and guest values. > > >> > > > >> > We ran into this in the XS 6.0 release as well. I think that the > > >> > ideal thing to do would be to have a parameter that can be set at > > >> > boot, to say how many vMCE banks a guest has, defaulting to the number > > >> > of MCE banks on the host. This parameter would be preserved across > > >> > migration. Ideally, a pool-aware toolstack like xapi would then set > > >> > this value to be the value of the host in the pool with the largest > > >> > number of banks, allowing a guest to access all the banks on any host > > >> > to which it migrates. > > >> > > > >> > What do you think? > > >> > > >> That sounds like the way to go. > > > > > > So should we put this on IanC''s to-do-be-done list? Are you going to > > > put it on your to-do list? 
:-)
> >
> > Below/attached a draft patch (compile tested only), handling save/
> > restore of the bank count, but not allowing for a config setting to
> > specify its initial value (yet).
>
> Looks pretty good for a first blush. Just one question: Why is the vmce
> count made on a per-vcpu basis, rather than on a per-domain basis, like
> the actual banks are? Is the host MCE stuff per-vcpu?

(Per-pcpu, that is...)

> -George
>
> > [...]
>>> On 31.01.12 at 12:27, George Dunlap <george.dunlap@citrix.com> wrote:
> On Mon, 2012-01-30 at 13:47 +0000, Jan Beulich wrote:
>> >>> On 26.01.12 at 17:54, George Dunlap <george.dunlap@citrix.com> wrote:
>> > [...]
>> > So should we put this on IanC's to-do-be-done list? Are you going to
>> > put it on your to-do list?
:-)
>>
>> Below/attached a draft patch (compile tested only), handling save/
>> restore of the bank count, but not allowing for a config setting to
>> specify its initial value (yet).
>
> Looks pretty good for a first blush. Just one question: Why is the vmce
> count made on a per-vcpu basis, rather than on a per-domain basis, like
> the actual banks are? Is the host MCE stuff per-vcpu?

The question should probably be the other way around - why is the
vMCE implementation using global (fake) MSRs rather than per-vCPU
ones (as they would be on hardware). If the change here was
implemented as per-domain MSRs, a future move of the vMCE
implementation to a more natural model would be impossible. Also,
for the PV case the save/restore logic is much simpler this way.

Jan
On Tue, 2012-01-31 at 13:17 +0000, Jan Beulich wrote:
> >>> On 31.01.12 at 12:27, George Dunlap <george.dunlap@citrix.com> wrote:
> > [...]
> > Looks pretty good for a first blush. Just one question: Why is the vmce
> > count made on a per-vcpu basis, rather than on a per-domain basis, like
> > the actual banks are? Is the host MCE stuff per-vcpu?
>
> The question should probably be the other way around - why is the
> vMCE implementation using global (fake) MSRs rather than per-vCPU
> ones (as they would be on hardware). If the change here was
> implemented as per-domain MSRs, a future move of the vMCE
> implementation to a more natural model would be impossible. Also,
> for the PV case the save/restore logic is much simpler this way.

Ah, that makes sense. Good spot on the impossibility of changing to
per-vcpu later.

 -George
Jan Beulich wrote:
>>>> On 26.01.12 at 17:54, George Dunlap <george.dunlap@citrix.com> wrote:
>> [...]
>> So should we put this on IanC's to-do-be-done list? Are you going to
>> put it on your to-do list? :-)
>
> Below/attached a draft patch (compile tested only), handling save/
> restore of the bank count, but not allowing for a config setting to
> specify its initial value (yet).
> Jan
>
> [...]

Jan,

How about using a static bank count, say 256, the maximum bank number
the architecture defines (via the IA32_MCG_CAP MSR)?

Thanks,
Jinsong
>>> On 03.02.12 at 08:18, "Liu, Jinsong" <jinsong.liu@intel.com> wrote:
> How about using a static bank count, say 256, the maximum bank number
> the architecture defines (via the IA32_MCG_CAP MSR)?

The maximum, if any, is 32 (beyond which collision with other defined
MSRs would arise). But with the possibility of having a discontiguous
or relocated MSR space, I don't think any static maximum would be a
good idea.

Jan
Jan Beulich wrote:
>>>> On 03.02.12 at 08:18, "Liu, Jinsong" <jinsong.liu@intel.com> wrote:
>> How about using a static bank count, say 256, the maximum bank number
>> the architecture defines (via the IA32_MCG_CAP MSR)?
>
> The maximum, if any, is 32 (beyond which collision with other defined
> MSRs would arise). But with the possibility of having a discontiguous
> or relocated MSR space, I don't think any static maximum would be a
> good idea.
>
> Jan

The advantage of a static max bank count is simplicity - are there any
disadvantages to the static solution? Or, could we use the bank_entry
list in vmce->impact_header for mci_ctl, like what
mci_status/mci_addr/mci_misc do?

I just think the patch may be too complicated for a minor issue.

Thanks,
Jinsong
>>> On 03.02.12 at 13:34, "Liu, Jinsong" <jinsong.liu@intel.com> wrote:
> Jan Beulich wrote:
>> [...]
>
> The advantage of a static max bank count is simplicity - are there any
> disadvantages to the static solution?

Lack of forward compatibility. It may even be that the use of uint8_t
wasn't really a good choice of mine.

> Or, could we use the bank_entry list in vmce->impact_header for mci_ctl,
> like what mci_status/mci_addr/mci_misc do?

I don't think so - the control register (obviously) must be accessible
independent of any ongoing (latched) MCE.

> I just think the patch may be too complicated for a minor issue.

Is a guest crash minor? After all there is no guarantee that the
accesses to these MSRs are exception handler protected in a way
similar to what current Linux does.

Jan
On Fri, Feb 3, 2012 at 2:04 PM, Jan Beulich <JBeulich@suse.com> wrote:
>> I just think the patch may be too complicated for a minor issue.
>
> Is a guest crash minor? After all there is no guarantee that the
> accesses to these MSRs are exception handler protected in a way
> similar to what current Linux does.

In fact, I can tell you that newer versions of Windows do *not* handle
the exception properly, and will crash if an MCE MSR access that worked
before now generates an exception (for example, after live-migrating to
a host with fewer MCE banks).

 -George
On Mon, Jan 30, Jan Beulich wrote:

> Below/attached a draft patch (compile tested only), handling save/
> restore of the bank count, but not allowing for a config setting to
> specify its initial value (yet).

Does it take more than just applying this patch on the src+dst hosts
and migrating an HVM guest? I see no difference, the MCE warning is
still there.

Olaf
>>> On 09.02.12 at 19:02, Olaf Hering <olaf@aepfle.de> wrote:
> On Mon, Jan 30, Jan Beulich wrote:
>
>> Below/attached a draft patch (compile tested only), handling save/
>> restore of the bank count, but not allowing for a config setting to
>> specify its initial value (yet).
>
> Does it take more than just applying this patch for src+dst host and
> migrate a hvm guest? I see no difference, the mce warning is still
> there.

No, it shouldn't require anything else. Could you add a printk() each
to vmce_{save,load}_vcpu_ctxt(), printing what gets saved/restored
(and at once checking that they actually get executed)? I was under
the impression that adding save records for HVM is a simple drop-in
exercise these days...

Jan
On Fri, Feb 10, Jan Beulich wrote:

> No, it shouldn't require anything else. Could you add a printk() each
> to vmce_{save,load}_vcpu_ctxt() printing what gets saved/restored
> (and at once checking that they actually get executed? I was under
> the impression that adding save records for HVM is a simple drop-in
> exercise these days...

The functions are called on both sides with no errors.

Regarding these messages, are they guest addresses?

(XEN) traps.c:3131: GPF (0000): ffff82c4801d4c6c -> ffff82c480226eb4

Olaf
>>> On 10.02.12 at 17:53, Olaf Hering <olaf@aepfle.de> wrote:
> On Fri, Feb 10, Jan Beulich wrote:
>
>> No, it shouldn't require anything else. Could you add a printk() each
>> to vmce_{save,load}_vcpu_ctxt() printing what gets saved/restored
>> (and at once checking that they actually get executed? I was under
>> the impression that adding save records for HVM is a simple drop-in
>> exercise these days...
>
> The functions are called on both sides with no errors.

And the bank count also is correctly reflecting the original, save-side
value on the restore side?

> Regarding these messages, are they guest addresses?
>
> (XEN) traps.c:3131: GPF (0000): ffff82c4801d4c6c -> ffff82c480226eb4

No, these are hypervisor ones.

Jan
On Fri, Feb 10, Jan Beulich wrote:

> >>> On 10.02.12 at 17:53, Olaf Hering <olaf@aepfle.de> wrote:
> > On Fri, Feb 10, Jan Beulich wrote:
> >
> >> No, it shouldn't require anything else. Could you add a printk() each
> >> to vmce_{save,load}_vcpu_ctxt() printing what gets saved/restored
> >> (and at once checking that they actually get executed? I was under
> >> the impression that adding save records for HVM is a simple drop-in
> >> exercise these days...
> >
> > The functions are called on both sides with no errors.
>
> And the bank count also is correctly reflecting the original, save-side
> value on the restore side?

Oh, the count is zero on both sides? I will double check my dbg patch.

Olaf
On Fri, Feb 10, Jan Beulich wrote:

> >>> On 09.02.12 at 19:02, Olaf Hering <olaf@aepfle.de> wrote:
> > On Mon, Jan 30, Jan Beulich wrote:
> >
> >> Below/attached a draft patch (compile tested only), handling save/
> >> restore of the bank count, but not allowing for a config setting to
> >> specify its initial value (yet).
> >
> > Does it take more than just applying this patch for src+dst host and
> > migrate a hvm guest? I see no difference, the mce warning is still
> > there.
>
> No, it shouldn't require anything else. Could you add a printk() each
> to vmce_{save,load}_vcpu_ctxt() printing what gets saved/restored
> (and at once checking that they actually get executed? I was under
> the impression that adding save records for HVM is a simple drop-in
> exercise these days...

These functions are called for dom0, but not for domU. And as a result
arch.nr_vmce_banks remains zero. I assume the guest needs to be
initialized in some way as well, and that does not happen?

Olaf
On Fri, Feb 10, Jan Beulich wrote:

> > Regarding these messages, are they guest addresses?
> >
> > (XEN) traps.c:3131: GPF (0000): ffff82c4801d4c6c -> ffff82c480226eb4
>
> No, these are hypervisor ones.

These are the functions they belong to:

(XEN) traps.c:3131: GPF (0000): ffff82c4801d61ec -> ffff82c48022aa2e
(XEN) do_general_protection: ffff82c4801d61ec vmx_msr_read_intercept+0x2d8/0x358
(XEN) do_general_protection: ffff82c48022aa2e vmac+0x6c8/0x99a

Olaf
>>> On 10.02.12 at 22:28, Olaf Hering <olaf@aepfle.de> wrote:
> On Fri, Feb 10, Jan Beulich wrote:
>
>> >>> On 09.02.12 at 19:02, Olaf Hering <olaf@aepfle.de> wrote:
>> > On Mon, Jan 30, Jan Beulich wrote:
>> >
>> >> Below/attached a draft patch (compile tested only), handling save/
>> >> restore of the bank count, but not allowing for a config setting to
>> >> specify its initial value (yet).
>> >
>> > Does it take more than just applying this patch for src+dst host and
>> > migrate a hvm guest? I see no difference, the mce warning is still
>> > there.
>>
>> No, it shouldn't require anything else. Could you add a printk() each
>> to vmce_{save,load}_vcpu_ctxt() printing what gets saved/restored
>> (and at once checking that they actually get executed? I was under
>> the impression that adding save records for HVM is a simple drop-in
>> exercise these days...
>
> These functions are called for dom0, but not for domU. And as a result
> arch.nr_vmce_banks remains zero. I assume the guest needs to be
> initialized in some way as well, and that does not happen?

These functions should be called with Dom0 being the current domain,
but the struct domain * argument should certainly be that of the DomU
being saved/restored.

Guest initialization happens in vmce_init_vcpu(), called from
vcpu_initialise() (irrespective of the kind of domain, i.e. equally for
PV and HVM).

I spotted another problem with the patch though - MCG_CAP reads aren't
reflecting the possibly non-host bank count. I'm in the process of
addressing this, but the whole MCG_* handling is bogus as being
per-domain instead of per-vCPU (and at least MCG_CAP lacking
save/restore too).

Jan
> Guest initialization happens in vmce_init_vcpu(), called from
> vcpu_initialise() (irrespective of the kind of domain, i.e. equally for
> PV and HVM).

And wrong I was - HVM guests get filtered much earlier in that
function. I'll get back with an updated patch hopefully soon.

Jan
>>> On 10.02.12 at 22:28, Olaf Hering <olaf@aepfle.de> wrote:
> These functions are called for dom0, but not for domU. And as a result
> arch.nr_vmce_banks remains zero. I assume the guest needs to be
> initialized in some way as well, and that does not happen?

Below/attached a fixed version of the patch.

Jan

--- a/tools/libxc/xc_domain_save.c
+++ b/tools/libxc/xc_domain_save.c
@@ -1879,6 +1879,7 @@ int xc_domain_save(xc_interface *xch, in
 
         domctl.cmd = XEN_DOMCTL_get_ext_vcpucontext;
         domctl.domain = dom;
+        memset(&domctl.u, 0, sizeof(domctl.u));
         domctl.u.ext_vcpucontext.vcpu = i;
         if ( xc_domctl(xch, &domctl) < 0 )
         {
--- a/tools/misc/xen-hvmctx.c
+++ b/tools/misc/xen-hvmctx.c
@@ -383,6 +383,13 @@ static void dump_viridian_vcpu(void)
            (unsigned long long) p.apic_assist);
 }
 
+static void dump_vmce_vcpu(void)
+{
+    HVM_SAVE_TYPE(VMCE_VCPU) p;
+    READ(p);
+    printf("    VMCE_VCPU: caps %" PRIx64 "\n",
+           (unsigned long long) p.caps);
+}
+
 int main(int argc, char **argv)
 {
     int entry, domid;
@@ -449,6 +456,7 @@ int main(int argc, char **argv)
         case HVM_SAVE_CODE(MTRR): dump_mtrr(); break;
         case HVM_SAVE_CODE(VIRIDIAN_DOMAIN): dump_viridian_domain(); break;
         case HVM_SAVE_CODE(VIRIDIAN_VCPU): dump_viridian_vcpu(); break;
+        case HVM_SAVE_CODE(VMCE_VCPU): dump_vmce_vcpu(); break;
         case HVM_SAVE_CODE(END): break;
         default: printf(" ** Don't understand type %u: skipping\n",
--- a/xen/arch/x86/cpu/mcheck/mce.h
+++ b/xen/arch/x86/cpu/mcheck/mce.h
@@ -3,6 +3,7 @@
 #define _MCE_H
 
 #include <xen/init.h>
+#include <xen/sched.h>
 #include <xen/smp.h>
 #include <asm/types.h>
 #include <asm/traps.h>
@@ -54,8 +55,8 @@ int unmmap_broken_page(struct domain *d,
 u64 mce_cap_init(void);
 extern int firstbank;
 
-int intel_mce_rdmsr(uint32_t msr, uint64_t *val);
-int intel_mce_wrmsr(uint32_t msr, uint64_t val);
+int intel_mce_rdmsr(const struct vcpu *, uint32_t msr, uint64_t *val);
+int intel_mce_wrmsr(struct vcpu *, uint32_t msr, uint64_t val);
 
 struct mcinfo_extended *intel_get_extended_msrs(
     struct mcinfo_global *mig, struct mc_info *mi);
@@ -171,18 +172,20 @@ int vmce_domain_inject(struct mcinfo_ban
 
 extern int vmce_init(struct cpuinfo_x86 *c);
 
-static inline int mce_vendor_bank_msr(uint32_t msr)
+static inline int mce_vendor_bank_msr(const struct vcpu *v, uint32_t msr)
 {
     if ( boot_cpu_data.x86_vendor == X86_VENDOR_INTEL &&
-         msr >= MSR_IA32_MC0_CTL2 && msr < (MSR_IA32_MC0_CTL2 + nr_mce_banks) )
+         msr >= MSR_IA32_MC0_CTL2 &&
+         msr < MSR_IA32_MCx_CTL2(v->arch.mcg_cap & MCG_CAP_COUNT) )
         return 1;
     return 0;
 }
 
-static inline int mce_bank_msr(uint32_t msr)
+static inline int mce_bank_msr(const struct vcpu *v, uint32_t msr)
 {
-    if ( (msr >= MSR_IA32_MC0_CTL && msr < MSR_IA32_MCx_CTL(nr_mce_banks)) ||
-         mce_vendor_bank_msr(msr) )
+    if ( (msr >= MSR_IA32_MC0_CTL &&
+          msr < MSR_IA32_MCx_CTL(v->arch.mcg_cap & MCG_CAP_COUNT)) ||
+         mce_vendor_bank_msr(v, msr) )
         return 1;
     return 0;
 }
--- a/xen/arch/x86/cpu/mcheck/mce_intel.c
+++ b/xen/arch/x86/cpu/mcheck/mce_intel.c
@@ -1421,11 +1421,12 @@ enum mcheck_type intel_mcheck_init(struc
 }
 
 /* intel specific MCA MSR */
-int intel_mce_wrmsr(uint32_t msr, uint64_t val)
+int intel_mce_wrmsr(struct vcpu *v, uint32_t msr, uint64_t val)
 {
     int ret = 0;
 
-    if (msr >= MSR_IA32_MC0_CTL2 && msr < (MSR_IA32_MC0_CTL2 + nr_mce_banks))
+    if ( msr >= MSR_IA32_MC0_CTL2 &&
+         msr < MSR_IA32_MCx_CTL2(v->arch.mcg_cap & MCG_CAP_COUNT) )
     {
         mce_printk(MCE_QUIET, "We have disabled CMCI capability, "
                    "Guest should not write this MSR!\n");
@@ -1435,11 +1436,12 @@ int intel_mce_wrmsr(uint32_t msr, uint64
     return ret;
 }
 
-int intel_mce_rdmsr(uint32_t msr, uint64_t *val)
+int intel_mce_rdmsr(const struct vcpu *v, uint32_t msr, uint64_t *val)
 {
     int ret = 0;
 
-    if (msr >= MSR_IA32_MC0_CTL2 && msr < (MSR_IA32_MC0_CTL2 + nr_mce_banks))
+    if ( msr >= MSR_IA32_MC0_CTL2 &&
+         msr < MSR_IA32_MCx_CTL2(v->arch.mcg_cap & MCG_CAP_COUNT) )
    {
         mce_printk(MCE_QUIET, "We have disabled CMCI capability, "
                    "Guest should not read this MSR!\n");
--- a/xen/arch/x86/cpu/mcheck/vmce.c
+++ b/xen/arch/x86/cpu/mcheck/vmce.c
@@ -10,6 +10,7 @@
 #include <xen/delay.h>
 #include <xen/smp.h>
 #include <xen/mm.h>
+#include <xen/hvm/save.h>
 #include <asm/processor.h>
 #include <public/sysctl.h>
 #include <asm/system.h>
@@ -42,7 +43,6 @@ int vmce_init_msr(struct domain *d)
            sizeof(dom_vmce(d)->mci_ctl));
 
     dom_vmce(d)->mcg_status = 0x0;
-    dom_vmce(d)->mcg_cap = g_mcg_cap;
     dom_vmce(d)->mcg_ctl = ~(uint64_t)0x0;
     dom_vmce(d)->nr_injection = 0;
 
@@ -61,21 +61,41 @@ void vmce_destroy_msr(struct domain *d)
     dom_vmce(d) = NULL;
 }
 
-static int bank_mce_rdmsr(struct domain *d, uint32_t msr, uint64_t *val)
+void vmce_init_vcpu(struct vcpu *v)
 {
-    int bank, ret = 1;
-    struct domain_mca_msrs *vmce = dom_vmce(d);
+    v->arch.mcg_cap = g_mcg_cap;
+}
+
+int vmce_restore_vcpu(struct vcpu *v, uint64_t caps)
+{
+    if ( caps & ~g_mcg_cap & ~MCG_CAP_COUNT & ~MCG_CTL_P )
+    {
+        dprintk(XENLOG_G_ERR, "%s restore: unsupported MCA capabilities"
+                " %#" PRIx64 " for d%d:v%u (supported: %#Lx)\n",
+                is_hvm_vcpu(v) ? "HVM" : "PV", caps, v->domain->domain_id,
+                v->vcpu_id, g_mcg_cap & ~MCG_CAP_COUNT);
+        return -EPERM;
+    }
+
+    v->arch.mcg_cap = caps;
+    return 0;
+}
+
+static int bank_mce_rdmsr(const struct vcpu *v, uint32_t msr, uint64_t *val)
+{
+    int ret = 1;
+    unsigned int bank = (msr - MSR_IA32_MC0_CTL) / 4;
+    struct domain_mca_msrs *vmce = dom_vmce(v->domain);
     struct bank_entry *entry;
 
-    bank = (msr - MSR_IA32_MC0_CTL) / 4;
-    if ( bank >= nr_mce_banks )
-        return -1;
+    *val = 0;
 
     switch ( msr & (MSR_IA32_MC0_CTL | 3) )
     {
     case MSR_IA32_MC0_CTL:
-        *val = vmce->mci_ctl[bank] &
-               (h_mci_ctrl ? h_mci_ctrl[bank] : ~0UL);
+        if ( bank < nr_mce_banks )
+            *val = vmce->mci_ctl[bank] &
                   (h_mci_ctrl ? h_mci_ctrl[bank] : ~0UL);
         mce_printk(MCE_VERBOSE, "MCE: rdmsr MC%u_CTL 0x%"PRIx64"\n",
                    bank, *val);
         break;
@@ -126,7 +146,7 @@ static int bank_mce_rdmsr(struct domain
         switch ( boot_cpu_data.x86_vendor )
         {
         case X86_VENDOR_INTEL:
-            ret = intel_mce_rdmsr(msr, val);
+            ret = intel_mce_rdmsr(v, msr, val);
             break;
         default:
             ret = 0;
@@ -145,13 +165,13 @@ static int bank_mce_rdmsr(struct domain
  */
 int vmce_rdmsr(uint32_t msr, uint64_t *val)
 {
-    struct domain *d = current->domain;
-    struct domain_mca_msrs *vmce = dom_vmce(d);
+    const struct vcpu *cur = current;
+    struct domain_mca_msrs *vmce = dom_vmce(cur->domain);
     int ret = 1;
 
     *val = 0;
 
-    spin_lock(&dom_vmce(d)->lock);
+    spin_lock(&vmce->lock);
 
     switch ( msr )
     {
@@ -162,39 +182,38 @@ int vmce_rdmsr(uint32_t msr, uint64_t *v
                    "MCE: rdmsr MCG_STATUS 0x%"PRIx64"\n", *val);
         break;
     case MSR_IA32_MCG_CAP:
-        *val = vmce->mcg_cap;
+        *val = cur->arch.mcg_cap;
         mce_printk(MCE_VERBOSE, "MCE: rdmsr MCG_CAP 0x%"PRIx64"\n",
                    *val);
         break;
     case MSR_IA32_MCG_CTL:
         /* Always 0 if no CTL support */
-        *val = vmce->mcg_ctl & h_mcg_ctl;
+        if ( cur->arch.mcg_cap & MCG_CTL_P )
+            *val = vmce->mcg_ctl & h_mcg_ctl;
         mce_printk(MCE_VERBOSE, "MCE: rdmsr MCG_CTL 0x%"PRIx64"\n",
                    *val);
         break;
     default:
-        ret = mce_bank_msr(msr) ? bank_mce_rdmsr(d, msr, val) : 0;
+        ret = mce_bank_msr(cur, msr) ? bank_mce_rdmsr(cur, msr, val) : 0;
         break;
     }
 
-    spin_unlock(&dom_vmce(d)->lock);
+    spin_unlock(&vmce->lock);
     return ret;
 }
 
-static int bank_mce_wrmsr(struct domain *d, u32 msr, u64 val)
+static int bank_mce_wrmsr(struct vcpu *v, u32 msr, u64 val)
 {
-    int bank, ret = 1;
-    struct domain_mca_msrs *vmce = dom_vmce(d);
+    int ret = 1;
+    unsigned int bank = (msr - MSR_IA32_MC0_CTL) / 4;
+    struct domain_mca_msrs *vmce = dom_vmce(v->domain);
     struct bank_entry *entry = NULL;
 
-    bank = (msr - MSR_IA32_MC0_CTL) / 4;
-    if ( bank >= nr_mce_banks )
-        return -EINVAL;
-
     switch ( msr & (MSR_IA32_MC0_CTL | 3) )
     {
     case MSR_IA32_MC0_CTL:
-        vmce->mci_ctl[bank] = val;
+        if ( bank < nr_mce_banks )
+            vmce->mci_ctl[bank] = val;
         break;
     case MSR_IA32_MC0_STATUS:
         /* Give the first entry of the list, it corresponds to current
@@ -202,9 +221,9 @@ static int bank_mce_wrmsr(struct domain
          * the guest, this node will be deleted.
          * Only error bank is written. Non-error banks simply return.
          */
-        if ( !list_empty(&dom_vmce(d)->impact_header) )
+        if ( !list_empty(&vmce->impact_header) )
         {
-            entry = list_entry(dom_vmce(d)->impact_header.next,
+            entry = list_entry(vmce->impact_header.next,
                                struct bank_entry, list);
             if ( entry->bank == bank )
                 entry->mci_status = val;
@@ -228,7 +247,7 @@ static int bank_mce_wrmsr(struct domain
         switch ( boot_cpu_data.x86_vendor )
        {
         case X86_VENDOR_INTEL:
-            ret = intel_mce_wrmsr(msr, val);
+            ret = intel_mce_wrmsr(v, msr, val);
             break;
         default:
             ret = 0;
@@ -247,9 +266,9 @@ static int bank_mce_wrmsr(struct domain
  */
 int vmce_wrmsr(u32 msr, u64 val)
 {
-    struct domain *d = current->domain;
+    struct vcpu *cur = current;
     struct bank_entry *entry = NULL;
-    struct domain_mca_msrs *vmce = dom_vmce(d);
+    struct domain_mca_msrs *vmce = dom_vmce(cur->domain);
     int ret = 1;
 
     if ( !g_mcg_cap )
@@ -266,7 +285,7 @@ int vmce_wrmsr(u32 msr, u64 val)
         vmce->mcg_status = val;
         mce_printk(MCE_VERBOSE, "MCE: wrmsr MCG_STATUS %"PRIx64"\n", val);
         /* For HVM guest, this is the point for deleting vMCE injection node */
-        if ( d->is_hvm && (vmce->nr_injection > 0) )
+        if ( is_hvm_vcpu(cur) && (vmce->nr_injection > 0) )
         {
             vmce->nr_injection--; /* Should be 0 */
             if ( !list_empty(&vmce->impact_header) )
@@ -293,7 +312,7 @@ int vmce_wrmsr(u32 msr, u64 val)
         ret = -1;
         break;
     default:
-        ret = mce_bank_msr(msr) ? bank_mce_wrmsr(d, msr, val) : 0;
+        ret = mce_bank_msr(cur, msr) ? bank_mce_wrmsr(cur, msr, val) : 0;
         break;
     }
 
@@ -301,6 +320,46 @@ int vmce_wrmsr(u32 msr, u64 val)
     return ret;
 }
 
+static int vmce_save_vcpu_ctxt(struct domain *d, hvm_domain_context_t *h)
+{
+    struct vcpu *v;
+    int err = 0;
+
+    for_each_vcpu( d, v ) {
+        struct hvm_vmce_vcpu ctxt = {
+            .caps = v->arch.mcg_cap
+        };
+
+        err = hvm_save_entry(VMCE_VCPU, v->vcpu_id, h, &ctxt);
+        if ( err )
+            break;
+    }
+
+    return err;
+}
+
+static int vmce_load_vcpu_ctxt(struct domain *d, hvm_domain_context_t *h)
+{
+    unsigned int vcpuid = hvm_load_instance(h);
+    struct vcpu *v;
+    struct hvm_vmce_vcpu ctxt;
+    int err;
+
+    if ( vcpuid >= d->max_vcpus || (v = d->vcpu[vcpuid]) == NULL )
+    {
+        dprintk(XENLOG_G_ERR, "HVM restore: dom%d has no vcpu%u\n",
+                d->domain_id, vcpuid);
+        err = -EINVAL;
+    }
+    else
+        err = hvm_load_entry(VMCE_VCPU, h, &ctxt);
+
+    return err ?: vmce_restore_vcpu(v, ctxt.caps);
+}
+
+HVM_REGISTER_SAVE_RESTORE(VMCE_VCPU, vmce_save_vcpu_ctxt,
+                          vmce_load_vcpu_ctxt, 1, HVMSR_PER_VCPU);
+
 int inject_vmce(struct domain *d)
 {
     int cpu = smp_processor_id();
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -422,6 +422,8 @@ int vcpu_initialise(struct vcpu *v)
     if ( (rc = vcpu_init_fpu(v)) != 0 )
         return rc;
 
+    vmce_init_vcpu(v);
+
     if ( is_hvm_domain(d) )
     {
         rc = hvm_vcpu_initialise(v);
--- a/xen/arch/x86/domctl.c
+++ b/xen/arch/x86/domctl.c
@@ -1027,11 +1027,12 @@ long arch_do_domctl(
                 evc->syscall32_callback_eip = 0;
                 evc->syscall32_disables_events = 0;
             }
+            evc->mcg_cap = v->arch.mcg_cap;
         }
         else
         {
             ret = -EINVAL;
-            if ( evc->size != sizeof(*evc) )
+            if ( evc->size < offsetof(typeof(*evc), mcg_cap) )
                 goto ext_vcpucontext_out;
 #ifdef __x86_64__
             if ( !is_hvm_domain(d) )
@@ -1059,6 +1060,10 @@ long arch_do_domctl(
                  (evc->syscall32_callback_cs & ~3) ||
                  evc->syscall32_callback_eip )
                 goto ext_vcpucontext_out;
+
+            if ( evc->size >= offsetof(typeof(*evc), mcg_cap) +
+                              sizeof(evc->mcg_cap) )
+                ret = vmce_restore_vcpu(v, evc->mcg_cap);
         }
 
         ret = 0;
--- a/xen/include/asm-x86/domain.h
+++ b/xen/include/asm-x86/domain.h
@@ -488,6 +488,8 @@ struct arch_vcpu
     /* This variable determines whether nonlazy extended state has been used,
      * and thus should be saved/restored. */
     bool_t nonlazy_xstate_used;
+
+    uint64_t mcg_cap;
 
     struct paging_vcpu paging;
--- a/xen/include/asm-x86/mce.h
+++ b/xen/include/asm-x86/mce.h
@@ -16,7 +16,6 @@ struct bank_entry {
 struct domain_mca_msrs
 {
     /* Guest should not change below values after DOM boot up */
-    uint64_t mcg_cap;
     uint64_t mcg_ctl;
     uint64_t mcg_status;
     uint64_t *mci_ctl;
@@ -28,6 +27,8 @@ struct domain_mca_msrs
 /* Guest vMCE MSRs virtualization */
 extern int vmce_init_msr(struct domain *d);
 extern void vmce_destroy_msr(struct domain *d);
+extern void vmce_init_vcpu(struct vcpu *);
+extern int vmce_restore_vcpu(struct vcpu *, uint64_t caps);
 extern int vmce_wrmsr(uint32_t msr, uint64_t val);
 extern int vmce_rdmsr(uint32_t msr, uint64_t *val);
--- a/xen/include/public/arch-x86/hvm/save.h
+++ b/xen/include/public/arch-x86/hvm/save.h
@@ -575,9 +575,15 @@ struct hvm_viridian_vcpu_context {
 
 DECLARE_HVM_SAVE_TYPE(VIRIDIAN_VCPU, 17, struct hvm_viridian_vcpu_context);
 
+struct hvm_vmce_vcpu {
+    uint64_t caps;
+};
+
+DECLARE_HVM_SAVE_TYPE(VMCE_VCPU, 18, struct hvm_vmce_vcpu);
+
 /*
  * Largest type-code in use
  */
-#define HVM_SAVE_CODE_MAX 17
+#define HVM_SAVE_CODE_MAX 18
 
 #endif /* __XEN_PUBLIC_HVM_SAVE_X86_H__ */
--- a/xen/include/public/domctl.h
+++ b/xen/include/public/domctl.h
@@ -559,7 +559,7 @@ struct xen_domctl_ext_vcpucontext {
     uint32_t         vcpu;
     /*
      * SET: Size of struct (IN)
-     * GET: Size of struct (OUT)
+     * GET: Size of struct (OUT, up to 128 bytes)
      */
     uint32_t         size;
 #if defined(__i386__) || defined(__x86_64__)
@@ -571,6 +571,7 @@ struct xen_domctl_ext_vcpucontext {
     uint16_t         sysenter_callback_cs;
     uint8_t          syscall32_disables_events;
     uint8_t          sysenter_disables_events;
+    uint64_aligned_t mcg_cap;
 #endif
 };
 typedef struct xen_domctl_ext_vcpucontext xen_domctl_ext_vcpucontext_t;

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
>>> On 13.02.12 at 09:30, Olaf Hering <olaf@aepfle.de> wrote:
> On Fri, Feb 10, Jan Beulich wrote:
>
>> > Regarding these messages, are they guest addresses?
>> >
>> > (XEN) traps.c:3131: GPF (0000): ffff82c4801d4c6c -> ffff82c480226eb4
>>
>> No, these are hypervisor ones.
>
> These are the functions they belong to:
>
> (XEN) traps.c:3131: GPF (0000): ffff82c4801d61ec -> ffff82c48022aa2e
> (XEN) do_general_protection: ffff82c4801d61ec vmx_msr_read_intercept+0x2d8/0x358
> (XEN) do_general_protection: ffff82c48022aa2e vmac+0x6c8/0x99a

This appears to be the rdmsr_safe() at the bottom of the default case
in vmx_msr_read_intercept(), and I would expect those to go away once
the vMCE save/restore patch is working properly.

Jan
On Mon, Feb 13, Jan Beulich wrote:

> >>> On 10.02.12 at 22:28, Olaf Hering <olaf@aepfle.de> wrote:
> > These functions are called for dom0, but not for domU. And as a result
> > arch.nr_vmce_banks remains zero. I assume the guest needs to be
> > initialized in some way as well, and that does not happen?
>
> Below/attached a fixed version of the patch.

I get some mismatch after migration, both hosts run the same xen binary.
The small debug patch I use is attached.

Also: The tools do not catch the restore error, so that the guest
continues to run on the source host.

Olaf

nicolai login: (XEN) vmce_init_vcpu 0 o 0 n 806
(XEN) vmce_init_vcpu 1 o 0 n 806
(XEN) vmce_init_vcpu 2 o 0 n 806
(XEN) vmce_init_vcpu 3 o 0 n 806
(XEN) save.c:62:d0 HVM restore (1): VM saved on one CPU (0x206c2) and restored on another (0x10676).
(XEN) save.c:234:d0 HVM restore: CPU 0
(XEN) save.c:234:d0 HVM restore: CPU 1
(XEN) save.c:234:d0 HVM restore: CPU 2
(XEN) save.c:234:d0 HVM restore: CPU 3
(XEN) save.c:234:d0 HVM restore: PIC 0
(XEN) save.c:234:d0 HVM restore: PIC 1
(XEN) save.c:234:d0 HVM restore: IOAPIC 0
(XEN) save.c:234:d0 HVM restore: LAPIC 0
(XEN) save.c:234:d0 HVM restore: LAPIC 1
(XEN) save.c:234:d0 HVM restore: LAPIC 2
(XEN) save.c:234:d0 HVM restore: LAPIC 3
(XEN) save.c:234:d0 HVM restore: LAPIC_REGS 0
(XEN) save.c:234:d0 HVM restore: LAPIC_REGS 1
(XEN) save.c:234:d0 HVM restore: LAPIC_REGS 2
(XEN) save.c:234:d0 HVM restore: LAPIC_REGS 3
(XEN) save.c:234:d0 HVM restore: PCI_IRQ 0
(XEN) save.c:234:d0 HVM restore: ISA_IRQ 0
(XEN) save.c:234:d0 HVM restore: PCI_LINK 0
(XEN) save.c:234:d0 HVM restore: PIT 0
(XEN) save.c:234:d0 HVM restore: RTC 0
(XEN) save.c:234:d0 HVM restore: HPET 0
(XEN) save.c:234:d0 HVM restore: PMTIMER 0
(XEN) save.c:234:d0 HVM restore: MTRR 0
(XEN) save.c:234:d0 HVM restore: MTRR 1
(XEN) save.c:234:d0 HVM restore: MTRR 2
(XEN) save.c:234:d0 HVM restore: MTRR 3
(XEN) save.c:234:d0 HVM restore: VMCE_VCPU 0
(XEN) save.c:291:d0 HVM restore mismatch: expected type 18 length 8, saw type 18 length 1
(XEN) vmce.c:360:d0 vmce_load_vcpu_ctxt ffff82c4802c7d28 ffffffff -1 o 806 n ea
(XEN) save.c:239:d0 HVM restore: failed to load entry 18/0
(XEN) vmce_init_vcpu 0 o 0 n 806
(XEN) vmce_init_vcpu 1 o 0 n 806
(XEN) vmce_init_vcpu 2 o 0 n 806
(XEN) vmce_init_vcpu 3 o 0 n 806
(XEN) save.c:62:d0 HVM restore (2): VM saved on one CPU (0x206c2) and restored on another (0x10676).
(XEN) save.c:234:d0 HVM restore: CPU 0
(XEN) save.c:234:d0 HVM restore: CPU 1
(XEN) save.c:234:d0 HVM restore: CPU 2
(XEN) save.c:234:d0 HVM restore: CPU 3
(XEN) save.c:234:d0 HVM restore: PIC 0
(XEN) save.c:234:d0 HVM restore: PIC 1
(XEN) save.c:234:d0 HVM restore: IOAPIC 0
(XEN) save.c:234:d0 HVM restore: LAPIC 0
(XEN) save.c:234:d0 HVM restore: LAPIC 1
(XEN) save.c:234:d0 HVM restore: LAPIC 2
(XEN) save.c:234:d0 HVM restore: LAPIC 3
(XEN) save.c:234:d0 HVM restore: LAPIC_REGS 0
(XEN) save.c:234:d0 HVM restore: LAPIC_REGS 1
(XEN) save.c:234:d0 HVM restore: LAPIC_REGS 2
(XEN) save.c:234:d0 HVM restore: LAPIC_REGS 3
(XEN) save.c:234:d0 HVM restore: PCI_IRQ 0
(XEN) save.c:234:d0 HVM restore: ISA_IRQ 0
(XEN) save.c:234:d0 HVM restore: PCI_LINK 0
(XEN) save.c:234:d0 HVM restore: PIT 0
(XEN) save.c:234:d0 HVM restore: RTC 0
(XEN) save.c:234:d0 HVM restore: HPET 0
(XEN) save.c:234:d0 HVM restore: PMTIMER 0
(XEN) save.c:234:d0 HVM restore: MTRR 0
(XEN) save.c:234:d0 HVM restore: MTRR 1
(XEN) save.c:234:d0 HVM restore: MTRR 2
(XEN) save.c:234:d0 HVM restore: MTRR 3
(XEN) save.c:234:d0 HVM restore: VMCE_VCPU 0
(XEN) vmce.c:360:d0 vmce_load_vcpu_ctxt ffff83082e377d28 0 0 o 806 n 1809
(XEN) vmce.c:77: HVM restore: unsupported MCA capabilities 0x1809 for d2:v0 (supported: 0x800)
(XEN) save.c:239:d0 HVM restore: failed to load entry 18/0

diff -r cbb1cce5fac0 xen/arch/x86/cpu/mcheck/mce.c
--- a/xen/arch/x86/cpu/mcheck/mce.c
+++ b/xen/arch/x86/cpu/mcheck/mce.c
@@ -743,6 +743,7 @@ int mca_cap_init(void)
 {
     int i;
 
+    printk("%s: nr_mce_banks %x\n", __func__, nr_mce_banks);
     mca_allbanks = mcabanks_alloc();
     for ( i = 0; i < nr_mce_banks; i++)
         mcabanks_set(i, mca_allbanks);
diff -r cbb1cce5fac0 xen/arch/x86/cpu/mcheck/vmce.c
--- a/xen/arch/x86/cpu/mcheck/vmce.c
+++ b/xen/arch/x86/cpu/mcheck/vmce.c
@@ -63,6 +63,7 @@ void vmce_destroy_msr(struct domain *d)
 
 void vmce_init_vcpu(struct vcpu *v)
 {
+    printk("%s %u o %lx n %lx\n", __func__, v->vcpu_id, v->arch.mcg_cap, g_mcg_cap);
     v->arch.mcg_cap = g_mcg_cap;
 }
 
@@ -331,6 +332,7 @@ static int vmce_save_vcpu_ctxt(struct do
         };
 
         err = hvm_save_entry(VMCE_VCPU, v->vcpu_id, h, &ctxt);
+        gdprintk(XENLOG_ERR, "%s %p %u %x %d o %lx\n", __func__, h, v->vcpu_id, err, err, ctxt.caps);
         if ( err )
             break;
     }
@@ -352,8 +354,11 @@ static int vmce_load_vcpu_ctxt(struct do
         err = -EINVAL;
     }
     else
+    {
         err = hvm_load_entry(VMCE_VCPU, h, &ctxt);
+        gdprintk(XENLOG_ERR, "%s %p %x %d o %lx n %lx\n", __func__, h, err, err, v->arch.mcg_cap, ctxt.caps);
+    }
 
     return err ?: vmce_restore_vcpu(v, ctxt.caps);
 }
diff -r cbb1cce5fac0 xen/arch/x86/domctl.c
--- a/xen/arch/x86/domctl.c
+++ b/xen/arch/x86/domctl.c
@@ -1027,6 +1027,7 @@ long arch_do_domctl(
                 evc->syscall32_callback_eip = 0;
                 evc->syscall32_disables_events = 0;
             }
+            gdprintk(XENLOG_ERR, "%s %u n %lx o %lx\n", __func__, v->vcpu_id, v->arch.mcg_cap, evc->mcg_cap);
             evc->mcg_cap = v->arch.mcg_cap;
         }
         else
@@ -1061,6 +1062,7 @@ long arch_do_domctl(
                  evc->syscall32_callback_eip )
                 goto ext_vcpucontext_out;
 
+            gdprintk(XENLOG_ERR, "%s %u n %lx o %lx\n", __func__, v->vcpu_id, v->arch.mcg_cap, evc->mcg_cap);
             if ( evc->size >= offsetof(typeof(*evc), mcg_cap) +
                               sizeof(evc->mcg_cap) )
                 ret = vmce_restore_vcpu(v, evc->mcg_cap);
diff -r cbb1cce5fac0 xen/arch/x86/traps.c
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -3129,6 +3129,8 @@ void do_general_protection(struct cpu_us
     {
         dprintk(XENLOG_INFO, "GPF (%04x): %p -> %p\n",
                 regs->error_code, _p(regs->eip), _p(fixup));
+        printk ("%s: %p ", __func__, _p(regs->eip)); print_symbol("%s\n", regs->eip);
+        printk ("%s: %p ", __func__, _p(fixup)); print_symbol("%s\n", fixup);
         regs->eip = fixup;
         return;
     }
>>> On 13.02.12 at 15:20, Olaf Hering <olaf@aepfle.de> wrote:
> On Mon, Feb 13, Jan Beulich wrote:
>
>> >>> On 10.02.12 at 22:28, Olaf Hering <olaf@aepfle.de> wrote:
>> > These functions are called for dom0, but not for domU. And as a result
>> > arch.nr_vmce_banks remains zero. I assume the guest needs to be
>> > initialized in some way as well, and that does not happen?
>>
>> Below/attached a fixed version of the patch.
>
> I get some mismatch after migration, both hosts run the same xen binary.

Are you sure about that? I'm asking because ...

> The small debug patch I use is attached.
>
> Also: The tools do not catch the restore error, so that the guest
> continues to run on the source host.
>
> Olaf
>
> nicolai login: (XEN) vmce_init_vcpu 0 o 0 n 806
> (XEN) vmce_init_vcpu 1 o 0 n 806
> (XEN) vmce_init_vcpu 2 o 0 n 806
> (XEN) vmce_init_vcpu 3 o 0 n 806
> (XEN) save.c:62:d0 HVM restore (1): VM saved on one CPU (0x206c2) and
> restored on another (0x10676).
> (XEN) save.c:234:d0 HVM restore: CPU 0
> (XEN) save.c:234:d0 HVM restore: CPU 1
> (XEN) save.c:234:d0 HVM restore: CPU 2
> (XEN) save.c:234:d0 HVM restore: CPU 3
> (XEN) save.c:234:d0 HVM restore: PIC 0
> (XEN) save.c:234:d0 HVM restore: PIC 1
> (XEN) save.c:234:d0 HVM restore: IOAPIC 0
> (XEN) save.c:234:d0 HVM restore: LAPIC 0
> (XEN) save.c:234:d0 HVM restore: LAPIC 1
> (XEN) save.c:234:d0 HVM restore: LAPIC 2
> (XEN) save.c:234:d0 HVM restore: LAPIC 3
> (XEN) save.c:234:d0 HVM restore: LAPIC_REGS 0
> (XEN) save.c:234:d0 HVM restore: LAPIC_REGS 1
> (XEN) save.c:234:d0 HVM restore: LAPIC_REGS 2
> (XEN) save.c:234:d0 HVM restore: LAPIC_REGS 3
> (XEN) save.c:234:d0 HVM restore: PCI_IRQ 0
> (XEN) save.c:234:d0 HVM restore: ISA_IRQ 0
> (XEN) save.c:234:d0 HVM restore: PCI_LINK 0
> (XEN) save.c:234:d0 HVM restore: PIT 0
> (XEN) save.c:234:d0 HVM restore: RTC 0
> (XEN) save.c:234:d0 HVM restore: HPET 0
> (XEN) save.c:234:d0 HVM restore: PMTIMER 0
> (XEN) save.c:234:d0 HVM restore: MTRR 0
> (XEN) save.c:234:d0 HVM restore: MTRR 1
> (XEN) save.c:234:d0 HVM restore: MTRR 2
> (XEN) save.c:234:d0 HVM restore: MTRR 3
> (XEN) save.c:234:d0 HVM restore: VMCE_VCPU 0
> (XEN) save.c:291:d0 HVM restore mismatch: expected type 18 length 8, saw
> type 18 length 1

... this suggests the source host was still running with the old binary
(where the save record was indeed just 1 byte in size) ...

> (XEN) vmce.c:360:d0 vmce_load_vcpu_ctxt ffff82c4802c7d28 ffffffff -1 o 806 n ea
> (XEN) save.c:239:d0 HVM restore: failed to load entry 18/0
> (XEN) vmce_init_vcpu 0 o 0 n 806
> (XEN) vmce_init_vcpu 1 o 0 n 806
> (XEN) vmce_init_vcpu 2 o 0 n 806
> (XEN) vmce_init_vcpu 3 o 0 n 806
> (XEN) save.c:62:d0 HVM restore (2): VM saved on one CPU (0x206c2) and
> restored on another (0x10676).
> (XEN) save.c:234:d0 HVM restore: CPU 0
> (XEN) save.c:234:d0 HVM restore: CPU 1
> (XEN) save.c:234:d0 HVM restore: CPU 2
> (XEN) save.c:234:d0 HVM restore: CPU 3
> (XEN) save.c:234:d0 HVM restore: PIC 0
> (XEN) save.c:234:d0 HVM restore: PIC 1
> (XEN) save.c:234:d0 HVM restore: IOAPIC 0
> (XEN) save.c:234:d0 HVM restore: LAPIC 0
> (XEN) save.c:234:d0 HVM restore: LAPIC 1
> (XEN) save.c:234:d0 HVM restore: LAPIC 2
> (XEN) save.c:234:d0 HVM restore: LAPIC 3
> (XEN) save.c:234:d0 HVM restore: LAPIC_REGS 0
> (XEN) save.c:234:d0 HVM restore: LAPIC_REGS 1
> (XEN) save.c:234:d0 HVM restore: LAPIC_REGS 2
> (XEN) save.c:234:d0 HVM restore: LAPIC_REGS 3
> (XEN) save.c:234:d0 HVM restore: PCI_IRQ 0
> (XEN) save.c:234:d0 HVM restore: ISA_IRQ 0
> (XEN) save.c:234:d0 HVM restore: PCI_LINK 0
> (XEN) save.c:234:d0 HVM restore: PIT 0
> (XEN) save.c:234:d0 HVM restore: RTC 0
> (XEN) save.c:234:d0 HVM restore: HPET 0
> (XEN) save.c:234:d0 HVM restore: PMTIMER 0
> (XEN) save.c:234:d0 HVM restore: MTRR 0
> (XEN) save.c:234:d0 HVM restore: MTRR 1
> (XEN) save.c:234:d0 HVM restore: MTRR 2
> (XEN) save.c:234:d0 HVM restore: MTRR 3
> (XEN) save.c:234:d0 HVM restore: VMCE_VCPU 0
> (XEN) vmce.c:360:d0 vmce_load_vcpu_ctxt ffff83082e377d28 0 0 o 806 n 1809
> (XEN) vmce.c:77: HVM restore: unsupported MCA capabilities 0x1809 for d2:v0 (supported: 0x800)

... whereas this one suggests that the size check passed (same binary
on both ends), but the feature sets of the hosts are different (sorry,
I can't suggest a way around this, the hosts must be sufficiently
matching in MCA features - those differences that are tolerable are
already being handled). I'm confused as to what was running on the
source host.

Let me check what bit 12 is: the latest SDM considers this reserved, so
the only way I can see how to deal with this is to switch the setting
up of g_mcg_cap from a black-listing approach to a white-listing one.

> (XEN) save.c:239:d0 HVM restore: failed to load entry 18/0

Bottom line - it appears to work as intended considering the second
attempt.

Jan
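The 0x1809 value in the log decodes to a bank count of 9 plus MCG_TES_P
(bit 11) and the reserved bit 12. A white-listing setup keeps only the
capability bits the vMCE code knows how to virtualize, so reserved bits
can never leak into g_mcg_cap. A rough sketch of the idea (bit positions
per the SDM; the mask choice and helper name are illustrative, not Xen's
actual code):

```c
#include <stdint.h>

/* IA32_MCG_CAP fields (per the SDM). */
#define MCG_CAP_COUNT 0xffULL        /* bits 7:0 - number of banks */
#define MCG_CTL_P     (1ULL << 8)
#define MCG_TES_P     (1ULL << 11)
#define MCG_SER_P     (1ULL << 24)

/* Keep only the capabilities we know how to virtualize (white-listing),
 * rather than stripping individually known-bad bits (black-listing). */
static uint64_t whitelist_mcg_cap(uint64_t host_cap)
{
    return host_cap & (MCG_CAP_COUNT | MCG_CTL_P | MCG_TES_P | MCG_SER_P);
}
```

Applied to the value from the log, the reserved bit is dropped (0x1809
becomes 0x809), which a restore-side check against the local g_mcg_cap
can then meaningfully compare.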
>>> On 13.02.12 at 15:20, Olaf Hering <olaf@aepfle.de> wrote:
> (XEN) vmce.c:360:d0 vmce_load_vcpu_ctxt ffff83082e377d28 0 0 o 806 n 1809
> (XEN) vmce.c:77: HVM restore: unsupported MCA capabilities 0x1809 for d2:v0
> (supported: 0x800)
> (XEN) save.c:239:d0 HVM restore: failed to load entry 18/0

Attached an updated patch that sets up g_mcg_cap using a white-listing
bit mask, as mentioned in the previous response.

Note that this takes -unstable c/s 24781:6ae5506e49ab as prerequisite.

Jan
On Tue, Feb 14, Jan Beulich wrote:

> >>> On 13.02.12 at 15:20, Olaf Hering <olaf@aepfle.de> wrote:
> >> Below/attached a fixed version of the patch.
> >
> > I get some mismatch after migration, both hosts run the same xen binary.
>
> Are you sure about that? I'm asking because ...

Yes. But I will move to another source host since the current one has
now only a 10MBit/sec connection for some reason, so I will double
check that both run indeed the same code.

Olaf
On Tue, Feb 14, Jan Beulich wrote:

> >>> On 13.02.12 at 15:20, Olaf Hering <olaf@aepfle.de> wrote:
> > (XEN) vmce.c:360:d0 vmce_load_vcpu_ctxt ffff83082e377d28 0 0 o 806 n 1809
> > (XEN) vmce.c:77: HVM restore: unsupported MCA capabilities 0x1809 for d2:v0
> > (supported: 0x800)
> > (XEN) save.c:239:d0 HVM restore: failed to load entry 18/0
>
> Attached an updated patch that sets up g_mcg_cap using a white-listing
> bit mask, as mentioned in the previous response.
>
> Note that this takes -unstable c/s 24781:6ae5506e49ab as prerequisite.

Thanks, this patch works OK for me, migrating from a host with 9 MCE
banks to a host with 6 MCE banks.

Olaf