Xen vMCE bugfix: inject vMCE# to all vcpus

In our tests of Win8 guest MCE we found a bug: no matter what SRAO/SRAR
error Xen injects into a Win8 guest, the guest always reboots.

The root cause is that the current Xen vMCE logic injects vMCE# only to
vcpu0, which is not correct for Intel MCE (on Intel hardware, MCE# is
generated on all CPUs).

This patch fixes the vMCE injection bug by injecting vMCE# to all vcpus.

Signed-off-by: Liu, Jinsong <jinsong.liu@intel.com>

diff -r 19c15f3dfe1f xen/arch/x86/cpu/mcheck/mce.h
--- a/xen/arch/x86/cpu/mcheck/mce.h	Tue Jun 05 03:18:00 2012 +0800
+++ b/xen/arch/x86/cpu/mcheck/mce.h	Wed Jun 13 23:40:45 2012 +0800
@@ -167,7 +167,6 @@
 int fill_vmsr_data(struct mcinfo_bank *mc_bank,
         struct domain *d, uint64_t gstatus);
-int inject_vmce(struct domain *d);
 int vmce_domain_inject(struct mcinfo_bank *bank,
         struct domain *d, struct mcinfo_global *global);
 
 extern int vmce_init(struct cpuinfo_x86 *c);

diff -r 19c15f3dfe1f xen/arch/x86/cpu/mcheck/mce_intel.c
--- a/xen/arch/x86/cpu/mcheck/mce_intel.c	Tue Jun 05 03:18:00 2012 +0800
+++ b/xen/arch/x86/cpu/mcheck/mce_intel.c	Wed Jun 13 23:40:45 2012 +0800
@@ -638,6 +638,32 @@
     return rec;
 }
 
+static int inject_vmce(struct domain *d)
+{
+    struct vcpu *v;
+
+    /* inject vMCE to all vcpus */
+    for_each_vcpu(d, v)
+    {
+        if ( !test_and_set_bool(v->mce_pending) &&
+             ((d->is_hvm) ? 1 :
+              guest_has_trap_callback(d, v->vcpu_id, TRAP_machine_check)) )
+        {
+            mce_printk(MCE_VERBOSE, "MCE: inject vMCE to dom%d vcpu%d\n",
+                       d->domain_id, v->vcpu_id);
+            vcpu_kick(v);
+        }
+        else
+        {
+            mce_printk(MCE_QUIET, "Fail to inject vMCE to dom%d vcpu%d\n",
+                       d->domain_id, v->vcpu_id);
+            return -1;
+        }
+    }
+
+    return 0;
+}
+
 static void intel_memerr_dhandler(
     struct mca_binfo *binfo,
     enum mce_result *result,
@@ -718,11 +744,8 @@
 
                 /* We will inject vMCE to DOMU*/
                 if ( inject_vmce(d) < 0 )
-                {
-                    mce_printk(MCE_QUIET, "inject vMCE to DOM%d"
-                               " failed\n", d->domain_id);
                     goto vmce_failed;
-                }
+
                 /* Impacted domain go on with domain's recovery job
                  * if the domain has its own MCA handler.
                  * For xen, it has contained the error and finished

diff -r 19c15f3dfe1f xen/arch/x86/cpu/mcheck/vmce.c
--- a/xen/arch/x86/cpu/mcheck/vmce.c	Tue Jun 05 03:18:00 2012 +0800
+++ b/xen/arch/x86/cpu/mcheck/vmce.c	Wed Jun 13 23:40:45 2012 +0800
@@ -389,53 +389,6 @@
 HVM_REGISTER_SAVE_RESTORE(VMCE_VCPU, vmce_save_vcpu_ctxt,
                           vmce_load_vcpu_ctxt, 1, HVMSR_PER_VCPU);
 
-int inject_vmce(struct domain *d)
-{
-    int cpu = smp_processor_id();
-
-    /* PV guest and HVM guest have different vMCE# injection methods. */
-    if ( !test_and_set_bool(d->vcpu[0]->mce_pending) )
-    {
-        if ( d->is_hvm )
-        {
-            mce_printk(MCE_VERBOSE, "MCE: inject vMCE to HVM DOM %d\n",
-                       d->domain_id);
-            vcpu_kick(d->vcpu[0]);
-        }
-        else
-        {
-            mce_printk(MCE_VERBOSE, "MCE: inject vMCE to PV DOM%d\n",
-                       d->domain_id);
-            if ( guest_has_trap_callback(d, 0, TRAP_machine_check) )
-            {
-                cpumask_copy(d->vcpu[0]->cpu_affinity_tmp,
-                             d->vcpu[0]->cpu_affinity);
-                mce_printk(MCE_VERBOSE, "MCE: CPU%d set affinity, old %d\n",
-                           cpu, d->vcpu[0]->processor);
-                vcpu_set_affinity(d->vcpu[0], cpumask_of(cpu));
-                vcpu_kick(d->vcpu[0]);
-            }
-            else
-            {
-                mce_printk(MCE_VERBOSE,
-                           "MCE: Kill PV guest with No MCE handler\n");
-                domain_crash(d);
-            }
-        }
-    }
-    else
-    {
-        /* new vMCE comes while first one has not been injected yet,
-         * in this case, inject fail. [We can't lose this vMCE for
-         * the mce node's consistency].
-         */
-        mce_printk(MCE_QUIET, "There's a pending vMCE waiting to be injected "
-                   " to this DOM%d!\n", d->domain_id);
-        return -1;
-    }
-    return 0;
-}
-
 /* This node list records errors impacting a domain. when one
  * MCE# happens, one error bank impacts a domain. This error node
  * will be inserted to the tail of the per_dom data for vMCE# MSR
>>> On 13.06.12 at 10:05, "Liu, Jinsong" <jinsong.liu@intel.com> wrote:
> Xen vMCE bugfix: inject vMCE# to all vcpus
>
> In our tests of Win8 guest MCE we found a bug: no matter what
> SRAO/SRAR error Xen injects into a Win8 guest, the guest always
> reboots.
>
> The root cause is that the current Xen vMCE logic injects vMCE# only
> to vcpu0, which is not correct for Intel MCE (on Intel hardware, MCE#
> is generated on all CPUs).
>
> This patch fixes the vMCE injection bug by injecting vMCE# to all
> vcpus.

I see no correlation between the fix (and its description) and the
problem at hand: Why would Win8 reboot if it doesn't receive a
particular MCE on all CPUs? Isn't that model specific behavior?

Furthermore I doubt that an MCE on one socket indeed causes MCE-s on
all other sockets, not to speak of distinct NUMA nodes (it would
already surprise me if MCE-s got broadcast across cores within a
socket, unless they are caused by a resource shared across cores).

> --- a/xen/arch/x86/cpu/mcheck/mce_intel.c	Tue Jun 05 03:18:00 2012 +0800
> +++ b/xen/arch/x86/cpu/mcheck/mce_intel.c	Wed Jun 13 23:40:45 2012 +0800
> @@ -638,6 +638,32 @@
>      return rec;
>  }
>
> +static int inject_vmce(struct domain *d)

Is it really necessary to move this vendor independent function into a
vendor specific source file?

> +{
> +    struct vcpu *v;
> +
> +    /* inject vMCE to all vcpus */
> +    for_each_vcpu(d, v)
> +    {
> +        if ( !test_and_set_bool(v->mce_pending) &&
> +             ((d->is_hvm) ? 1 :
> +              guest_has_trap_callback(d, v->vcpu_id, TRAP_machine_check)) )

Quite strange a way to say

    (d->is_hvm || guest_has_trap_callback(d, v->vcpu_id, TRAP_machine_check))

> +        {
> +            mce_printk(MCE_VERBOSE, "MCE: inject vMCE to dom%d vcpu%d\n",
> +                       d->domain_id, v->vcpu_id);
> +            vcpu_kick(v);
> +        }
> +        else
> +        {
> +            mce_printk(MCE_QUIET, "Fail to inject vMCE to dom%d vcpu%d\n",
> +                       d->domain_id, v->vcpu_id);
> +            return -1;

Why do you bail here? This is particularly bad if v->mce_pending was
already set on some vCPU (as that could simply mean the guest just
didn't get around to handle the vMCE yet).

> +        }
> +    }
> +
> +    return 0;
> +}
> +
>  static void intel_memerr_dhandler(
>      struct mca_binfo *binfo,
>      enum mce_result *result,

Also, how does this whole change interact with vmce_{rd,wr}msr()? The
struct bank_entry instances live on a per-domain list, so the vMCE
being delivered to all vCPU-s means they will all race for the single
entry (and might erroneously access others, particularly in
vmce_wrmsr()'s MCG_STATUS handling).

Jan
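For reference, the check Jan suggests would sit in the patch's loop
roughly as below; this is an illustrative, untested fragment of that
loop only (printks and error handling omitted), not a revised
submission:

    /* Sketch: same test as in the patch, written with a plain || . */
    for_each_vcpu(d, v)
    {
        if ( !test_and_set_bool(v->mce_pending) &&
             (d->is_hvm ||
              guest_has_trap_callback(d, v->vcpu_id, TRAP_machine_check)) )
            vcpu_kick(v);
    }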
Jan Beulich wrote:
>>>> On 13.06.12 at 10:05, "Liu, Jinsong" <jinsong.liu@intel.com> wrote:
>> Xen vMCE bugfix: inject vMCE# to all vcpus
>>
>> In our tests of Win8 guest MCE we found a bug: no matter what
>> SRAO/SRAR error Xen injects into a Win8 guest, the guest always
>> reboots.
>>
>> The root cause is that the current Xen vMCE logic injects vMCE# only
>> to vcpu0, which is not correct for Intel MCE (on Intel hardware,
>> MCE# is generated on all CPUs).
>>
>> This patch fixes the vMCE injection bug by injecting vMCE# to all
>> vcpus.
>
> I see no correlation between the fix (and its description) and the
> problem at hand: Why would Win8 reboot if it doesn't receive a
> particular MCE on all CPUs? Isn't that model specific behavior?

It's not model specific. On Intel processors, MCE# is broadcast to all
logical processors on systems where UCR errors are supported (refer to
Intel SDM 15.9.3.1 & 15.9.3.2). So for vMCE injection under the Intel
architecture, it's a bug to inject vMCE# only to vcpu0.

As for why Win8 reboots, I guess it waits for all cpus to enter the MCE
handler before going ahead, and reboots if that times out. I have no
Win8 code, so I can only infer this, and the test results support it.
(It's reasonable for Win8 to do so: as the Intel SDM suggests, due to
the potentially shared machine check MSR resources among the logical
processors on the same package/core, the MCE handler may be required to
synchronize with the other processors that received a machine check
error and serialize access to the machine check registers when
analyzing, logging and clearing the information in them.)

>
> Furthermore I doubt that an MCE on one socket indeed causes MCE-s on
> all other sockets, not to speak of distinct NUMA nodes (it would
> already surprise me if MCE-s got broadcast across cores within a
> socket, unless they are caused by a resource shared across cores).

I somewhat agree. But at least according to the current Intel SDM,
architecturally it is broadcast.

>
>> --- a/xen/arch/x86/cpu/mcheck/mce_intel.c	Tue Jun 05 03:18:00 2012 +0800
>> +++ b/xen/arch/x86/cpu/mcheck/mce_intel.c	Wed Jun 13 23:40:45 2012 +0800
>> @@ -638,6 +638,32 @@
>>      return rec;
>>  }
>>
>> +static int inject_vmce(struct domain *d)
>
> Is it really necessary to move this vendor independent function into
> a vendor specific source file?

For AMD, I'm not quite sure whether injecting vMCE to all vcpus is
proper for its architecture. I recall AMD uses an NMI / a single MCE# /
a broadcast MCE# on different platforms.

BTW, in the current Xen MCE logic inject_vmce() is only used by the
Intel MCE code, so I put it into the Intel MCE logic to avoid potential
issues. Intel MCE and AMD MCE differ on some points (including MCE#
injection), so we'd better not lump them together. Only if we are
totally sure some logic is safe to share between Intel and AMD should
we put it into the common MCE code.

>
>> +{
>> +    struct vcpu *v;
>> +
>> +    /* inject vMCE to all vcpus */
>> +    for_each_vcpu(d, v)
>> +    {
>> +        if ( !test_and_set_bool(v->mce_pending) &&
>> +             ((d->is_hvm) ? 1 :
>> +              guest_has_trap_callback(d, v->vcpu_id, TRAP_machine_check)) )
>
> Quite strange a way to say
>
>     (d->is_hvm || guest_has_trap_callback(d, v->vcpu_id, TRAP_machine_check))

For HVM it's OK to just set v->mce_pending, while for PV it also needs
to check whether the guest callback has been registered.

>
>> +        {
>> +            mce_printk(MCE_VERBOSE, "MCE: inject vMCE to dom%d vcpu%d\n",
>> +                       d->domain_id, v->vcpu_id);
>> +            vcpu_kick(v);
>> +        }
>> +        else
>> +        {
>> +            mce_printk(MCE_QUIET, "Fail to inject vMCE to dom%d vcpu%d\n",
>> +                       d->domain_id, v->vcpu_id);
>> +            return -1;
>
> Why do you bail here? This is particularly bad if v->mce_pending was
> already set on some vCPU (as that could simply mean the guest just
> didn't get around to handle the vMCE yet).

The case where v->mce_pending is already set when a new vMCE arrives
will result in a domain crash. I don't think the guest can still handle
this case; e.g. on bare metal, if a 2nd MCE occurs while the 1st MCE
has not been handled, the OS reboots directly.

>
>> +        }
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>>  static void intel_memerr_dhandler(
>>      struct mca_binfo *binfo,
>>      enum mce_result *result,
>
> Also, how does this whole change interact with vmce_{rd,wr}msr()?
> The struct bank_entry instances live on a per-domain list, so the
> vMCE being delivered to all vCPU-s means they will all race for the
> single entry (and might erroneously access others, particularly in
> vmce_wrmsr()'s MCG_STATUS handling).

Yes, I agree.
However, we have recently done some vMCE re-design work (currently
under internal review) and will present it soon. In the new approach,
MSRs are per-vcpu instead of per-domain, so the MSR race issue (and the
effort to clean it up) becomes moot. So in this bugfix patch I only
touch vMCE# injection itself.

Thanks,
Jinsong
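To make the bare-metal analogy concrete, one hypothetical way to model
the "second MCE while the first is unhandled" rule on the vMCE side is
to treat a still-pending vMCE like MCG_STATUS.MCIP being set on real
hardware, i.e. fatal for the guest. A rough sketch against the loop in
the patch only, not part of the posted patch:

    /* Hypothetical: a vcpu that still has an undelivered vMCE pending
     * is treated like a #MC arriving with MCG_STATUS.MCIP set on bare
     * metal, so the guest is crashed. */
    for_each_vcpu(d, v)
    {
        if ( test_and_set_bool(v->mce_pending) )
        {
            mce_printk(MCE_QUIET, "dom%d vcpu%d: nested vMCE, crash guest\n",
                       d->domain_id, v->vcpu_id);
            domain_crash(d);
            return -1;
        }
        vcpu_kick(v);
    }
    return 0;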
>>> On 13.06.12 at 12:49, "Liu, Jinsong" <jinsong.liu@intel.com> wrote:
> Jan Beulich wrote:
>>>>> On 13.06.12 at 10:05, "Liu, Jinsong" <jinsong.liu@intel.com> wrote:
>>> Xen vMCE bugfix: inject vMCE# to all vcpus
>>>
>>> In our tests of Win8 guest MCE we found a bug: no matter what
>>> SRAO/SRAR error Xen injects into a Win8 guest, the guest always
>>> reboots.
>>>
>>> The root cause is that the current Xen vMCE logic injects vMCE#
>>> only to vcpu0, which is not correct for Intel MCE (on Intel
>>> hardware, MCE# is generated on all CPUs).
>>>
>>> This patch fixes the vMCE injection bug by injecting vMCE# to all
>>> vcpus.
>>
>> I see no correlation between the fix (and its description) and the
>> problem at hand: Why would Win8 reboot if it doesn't receive a
>> particular MCE on all CPUs? Isn't that model specific behavior?
>
> It's not model specific. On Intel processors, MCE# is broadcast to
> all logical processors on systems where UCR errors are supported
> (refer to Intel SDM 15.9.3.1 & 15.9.3.2). So for vMCE injection under
> the Intel architecture, it's a bug to inject vMCE# only to vcpu0.

This is for a certain group of errors, but I can't find any statement
that this is general architectural behavior.

>> Furthermore I doubt that an MCE on one socket indeed causes MCE-s on
>> all other sockets, not to speak of distinct NUMA nodes (it would
>> already surprise me if MCE-s got broadcast across cores within a
>> socket, unless they are caused by a resource shared across cores).
>
> I somewhat agree. But at least according to the current Intel SDM,
> architecturally it is broadcast.

I can't even see how this would work reliably across packages or
nodes: Suppose two entities each encounter a UCR - they would each
signal the other one, and the handling of this signal would need to be
synchronized with the handling of the local UCR (irrespective of the
order in which the events arrive).

>>> +        if ( !test_and_set_bool(v->mce_pending) &&
>>> +             ((d->is_hvm) ? 1 :
>>> +              guest_has_trap_callback(d, v->vcpu_id, TRAP_machine_check)) )
>>
>> Quite strange a way to say
>>
>>     (d->is_hvm || guest_has_trap_callback(d, v->vcpu_id, TRAP_machine_check))
>
> For HVM it's OK to just set v->mce_pending, while for PV it also
> needs to check whether the guest callback has been registered.

Sure. I was just asking why you use a conditional operator when the
simpler || would do.

>>> +        {
>>> +            mce_printk(MCE_VERBOSE, "MCE: inject vMCE to dom%d vcpu%d\n",
>>> +                       d->domain_id, v->vcpu_id);
>>> +            vcpu_kick(v);
>>> +        }
>>> +        else
>>> +        {
>>> +            mce_printk(MCE_QUIET, "Fail to inject vMCE to dom%d vcpu%d\n",
>>> +                       d->domain_id, v->vcpu_id);
>>> +            return -1;
>>
>> Why do you bail here? This is particularly bad if v->mce_pending was
>> already set on some vCPU (as that could simply mean the guest just
>> didn't get around to handle the vMCE yet).
>
> The case where v->mce_pending is already set when a new vMCE arrives
> will result in a domain crash. I don't think the guest can still
> handle this case; e.g. on bare metal, if a 2nd MCE occurs while the
> 1st MCE has not been handled, the OS reboots directly.

Sorry, no. This goes back to the questionable nature of the
broadcasting above - if a CPU encounters a UCR itself and gets signaled
of a remote CPU having encountered one, it should be able to cope.
Otherwise the risk of (pointless!) system death would increase with the
number of CPUs.

>> Also, how does this whole change interact with vmce_{rd,wr}msr()?
>> The struct bank_entry instances live on a per-domain list, so the
>> vMCE being delivered to all vCPU-s means they will all race for the
>> single entry (and might erroneously access others, particularly in
>> vmce_wrmsr()'s MCG_STATUS handling).
>
> Yes, I agree.
> However, we have recently done some vMCE re-design work (currently
> under internal review) and will present it soon. In the new approach,
> MSRs are per-vcpu instead of per-domain, so the MSR race issue (and
> the effort to clean it up) becomes moot. So in this bugfix patch I
> only touch vMCE# injection itself.

I understand that you want the simpler version as a bug fix, but you
didn't answer whether it was at all verified that the single per-domain
entry logic would actually cope with the broadcast delivery logic
you're adding here. I'm afraid the per-vCPU entry logic is a
prerequisite to this change (in whatever form it may end up going in).

Jan
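For contrast, the behaviour Jan argues for could look roughly like the
fragment below, where a vcpu that already has a vMCE pending is simply
skipped instead of failing the whole injection. Again only an
illustrative sketch against the patch's loop (the PV no-callback case
is handled as the old vmce.c code did), not a submitted patch:

    for_each_vcpu(d, v)
    {
        /* Earlier vMCE not yet delivered: skip, don't fail the domain. */
        if ( test_and_set_bool(v->mce_pending) )
            continue;

        if ( d->is_hvm ||
             guest_has_trap_callback(d, v->vcpu_id, TRAP_machine_check) )
            vcpu_kick(v);
        else
        {
            mce_printk(MCE_QUIET, "MCE: dom%d vcpu%d has no #MC handler\n",
                       d->domain_id, v->vcpu_id);
            domain_crash(d);    /* as the old PV path did */
            return -1;
        }
    }
    return 0;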
Jan Beulich wrote:
>>>> On 13.06.12 at 12:49, "Liu, Jinsong" <jinsong.liu@intel.com> wrote:
>> Jan Beulich wrote:
>>>>>> On 13.06.12 at 10:05, "Liu, Jinsong" <jinsong.liu@intel.com> wrote:
>>>> Xen vMCE bugfix: inject vMCE# to all vcpus
>>>>
>>>> In our tests of Win8 guest MCE we found a bug: no matter what
>>>> SRAO/SRAR error Xen injects into a Win8 guest, the guest always
>>>> reboots.
>>>>
>>>> The root cause is that the current Xen vMCE logic injects vMCE#
>>>> only to vcpu0, which is not correct for Intel MCE (on Intel
>>>> hardware, MCE# is generated on all CPUs).
>>>>
>>>> This patch fixes the vMCE injection bug by injecting vMCE# to all
>>>> vcpus.
>>>
>>> I see no correlation between the fix (and its description) and the
>>> problem at hand: Why would Win8 reboot if it doesn't receive a
>>> particular MCE on all CPUs? Isn't that model specific behavior?
>>
>> It's not model specific. On Intel processors, MCE# is broadcast to
>> all logical processors on systems where UCR errors are supported
>> (refer to Intel SDM 15.9.3.1 & 15.9.3.2). So for vMCE injection
>> under the Intel architecture, it's a bug to inject vMCE# only to
>> vcpu0.
>
> This is for a certain group of errors, but I can't find any statement
> that this is general architectural behavior.

Xen MCE only injects SRAO/SRAR errors into the guest, both of which are
UCR errors. For other errors like CE and UCNA, the hypervisor just
raises a virq to the dom0 mcelog for logging.

>
>>> Furthermore I doubt that an MCE on one socket indeed causes MCE-s
>>> on all other sockets, not to speak of distinct NUMA nodes (it would
>>> already surprise me if MCE-s got broadcast across cores within a
>>> socket, unless they are caused by a resource shared across cores).
>>
>> I somewhat agree. But at least according to the current Intel SDM,
>> architecturally it is broadcast.
>
> I can't even see how this would work reliably across packages or
> nodes: Suppose two entities each encounter a UCR - they would each
> signal the other one, and the handling of this signal would need to
> be synchronized with the handling of the local UCR (irrespective of
> the order in which the events arrive).

I don't think the hardware works that way, though I don't have all the
details. Basically, a processor is just a 'report point' (via its
banks) indicating what error occurred. For example, an error is
(mostly) generated at memory and travels along the path
memory --> memory controller --> QPI --> cache controller --> ... -->
cpu core. It's not one processor signaling the other CPUs; the MCA
mechanism monitors the different units, decides when and which errors
get reported, and notifies all processors (that's what broadcast means).

>
>>>> +        if ( !test_and_set_bool(v->mce_pending) &&
>>>> +             ((d->is_hvm) ? 1 :
>>>> +              guest_has_trap_callback(d, v->vcpu_id, TRAP_machine_check)) )
>>>
>>> Quite strange a way to say
>>>
>>>     (d->is_hvm || guest_has_trap_callback(d, v->vcpu_id, TRAP_machine_check))
>>
>> For HVM it's OK to just set v->mce_pending, while for PV it also
>> needs to check whether the guest callback has been registered.
>
> Sure. I was just asking why you use a conditional operator when
> the simpler || would do.

Ah, I see. That's better.

>
>>>> +        {
>>>> +            mce_printk(MCE_VERBOSE, "MCE: inject vMCE to dom%d vcpu%d\n",
>>>> +                       d->domain_id, v->vcpu_id);
>>>> +            vcpu_kick(v);
>>>> +        }
>>>> +        else
>>>> +        {
>>>> +            mce_printk(MCE_QUIET, "Fail to inject vMCE to dom%d vcpu%d\n",
>>>> +                       d->domain_id, v->vcpu_id);
>>>> +            return -1;
>>>
>>> Why do you bail here? This is particularly bad if v->mce_pending
>>> was already set on some vCPU (as that could simply mean the guest
>>> just didn't get around to handle the vMCE yet).
>>
>> The case where v->mce_pending is already set when a new vMCE arrives
>> will result in a domain crash. I don't think the guest can still
>> handle this case; e.g. on bare metal, if a 2nd MCE occurs while the
>> 1st MCE has not been handled, the OS reboots directly.
>
> Sorry, no. This goes back to the questionable nature of the
> broadcasting above - if a CPU encounters a UCR itself and gets
> signaled of a remote CPU having encountered one, it should be able to
> cope. Otherwise the risk of (pointless!) system death would increase
> with the number of CPUs.
>
>>> Also, how does this whole change interact with vmce_{rd,wr}msr()?
>>> The struct bank_entry instances live on a per-domain list, so the
>>> vMCE being delivered to all vCPU-s means they will all race for the
>>> single entry (and might erroneously access others, particularly in
>>> vmce_wrmsr()'s MCG_STATUS handling).
>>
>> Yes, I agree.
>> However, we have recently done some vMCE re-design work (currently
>> under internal review) and will present it soon. In the new approach,
>> MSRs are per-vcpu instead of per-domain, so the MSR race issue (and
>> the effort to clean it up) becomes moot. So in this bugfix patch I
>> only touch vMCE# injection itself.
>
> I understand that you want the simpler version as a bug fix, but you
> didn't answer whether it was at all verified that the single
> per-domain entry logic would actually cope with the broadcast
> delivery logic you're adding here. I'm afraid the per-vCPU entry
> logic is a prerequisite to this change (in whatever form it may end
> up going in).
>
> Jan

Agreed, let's hold it. After we present the new vMCE approach, we can
do this on top of it, with no race issue then.

Thanks,
Jinsong