Ke, Liping
2008-Dec-22 03:54 UTC
[Xen-devel] [patch 0/3]Enable CMCI (Corrected Machine Check Error Interrupt) for Intel CPUs
Hi, All According to Keir''s suggestion, I re-organize those patches and make sure that each intermediate patch could be compiled and running. Since remove old p4 and p6 files are hard to be split, I now put the remove and replace in the one patch. Is it ok? Any comment, just let me know. Thanks a lot for your help! Criping Following 3 patches are for enabling CMCI of Intel CPUs in XEN. --------------------------------------------------------------------------- Patch 1: change_stop_machine_run.patch changes stopmachine_run interface so that we can designate the callback function running on the cpu_map instead of single one. We do this change because *cmci owner change* callback (please refer to note 3 below) needs to be executed on each of online cpus when do CPU hotplug. And also, we add a callback function for CMCI use before taking CPU down. Patch 2: apic_cmci.patch adds the new CMCI LVT entry. And also it did small clean up jobs for thermal since thermal is not only P4 specific, but also later Intel CPUs common features. The old thermal removal will also be finished in patch 3 combined with old P4/p6 removals. Patch 3: clean_and_cmci.patch contains the main CMCI support logic including removing old duplicated P4/P6 files, add new MCA/CMCI common init/interrupt processing, *CMCI owner judge algorithms* when (bring up CPUs or do CPU hotplug), polling mechanism, etc. ---------------------------------------------------------------------------- About Test We wrote CMCI injection tool to test the patch. Also we wrote DOM0 test patches to see whether logs are accepted by DOM0. We test the patches on both CMCI-support and NO-CMCI-support machine. We test the patches combined with CPU online/offline ops and S3&S5 ops which cause banks owner changing ---------------------------------------------------------------------------- Below notes might be helpful 1) CMCI use another apic_lvt entry (now max_lvt could be 6 in newer Intel platform) 2) CMCI is combined with polling mechanism since some CPUs don''t support CMCI. And for supporting CPUs, still some banks don''t support CMCI. So we keep polling mechanism. Also add simple policy, when find errors, shorten polling interval. Otherwise, lengthen intervals. 3) MCA banks shares between cores/threads. For avoiding mis-ops, we need to judge *owner* for each bank. Each bank only has one owner, the owner manages the bank. The owner will be set up when cpu_up. Owners might be changed when doing cpu-hotplug. 4) When one CMCI interrupt is accepted and logged by XEN, XEN will notify DOM0. We don''t plan to check in DOM0 code now. For more info, please refer to latest Intel software development manual, Chapter 14. -------------------------------------------------------------------------- Machine check (Uncorrectable Error) support will be sent in later patches, so we keep old machine check handling code. Any comment, just let us know. Thanks a lot for your help! Regards, Criping _______________________________________________ _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Keir Fraser
2008-Dec-22 12:06 UTC
[Xen-devel] Re: [patch 0/3]Enable CMCI (Corrected Machine Check Error Interrupt) for Intel CPUs
On 22/12/2008 03:54, "Ke, Liping" <liping.ke@intel.com> wrote:> Hi, All > > According to Keir''s suggestion, I re-organize those patches and make sure that > each intermediate patch could be compiled and running. > > Since remove old p4 and p6 files are hard to be split, I now put the remove > and replace in the one patch. Is it ok? > > Any comment, just let me know.I checked in the patches as c/s 18938 and then cleaned up significantly as c/s 18939. Let me know if this stops things working for you. Notice I deleted more than I added. ;-) -- Keir _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Ke, Liping
2008-Dec-23 05:17 UTC
[Xen-devel] RE: [patch 0/3]Enable CMCI (Corrected Machine Check Error Interrupt) for Intel CPUs
Hi, Keir Thanks a lot for the clean-up and format refine! I will pay more attention to this too.And also I tested on the related platform. It's fine. As for moving *cmci_owner_set* out of stopmachine_run is basically ok for us. Just one thing: CMCI might happen and lost during the very small window (old owner is cleared while new owner is not set). In order to make sure that CMCI could be triggered an on the new owner, we need to clear MSR Bank(i) status register [Corrected Error Counter] field ( We normally do this @ CMCI interrupt handler, according to spec, if the counter is not cleared, CMCI will not be triggered any more). I made a small patch for it in the attachment. How do you think? Thanks a lot! Criping -----Original Message----- From: Keir Fraser [mailto:keir.fraser@eu.citrix.com] Sent: 2008年12月22日 20:07 To: Ke, Liping Cc: xen-devel@lists.xensource.com Subject: Re: [patch 0/3]Enable CMCI (Corrected Machine Check Error Interrupt) for Intel CPUs On 22/12/2008 03:54, "Ke, Liping" <liping.ke@intel.com> wrote:> Hi, All > > According to Keir's suggestion, I re-organize those patches and make sure that > each intermediate patch could be compiled and running. > > Since remove old p4 and p6 files are hard to be split, I now put the remove > and replace in the one patch. Is it ok? > > Any comment, just let me know.I checked in the patches as c/s 18938 and then cleaned up significantly as c/s 18939. Let me know if this stops things working for you. Notice I deleted more than I added. ;-) -- Keir _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Keir Fraser
2008-Dec-23 08:40 UTC
[Xen-devel] Re: [patch 0/3]Enable CMCI (Corrected Machine Check Error Interrupt) for Intel CPUs
On 23/12/2008 05:17, "Ke, Liping" <liping.ke@intel.com> wrote:> Thanks a lot for the clean-up and format refine! I will pay more attention to > this too.And also I tested on the related platform. It''s fine. > > As for moving *cmci_owner_set* out of stopmachine_run is basically ok for us. > Just one thing: > CMCI might happen and lost during the very small window (old owner is cleared > while new owner is not set). In order to make sure that CMCI could be > triggered an on the new owner, we need to clear MSR Bank(i) status register > [Corrected Error Counter] field ( We normally do this @ CMCI interrupt > handler, according to spec, if the counter is not cleared, CMCI will not be > triggered any more). > I made a small patch for it in the attachment. How do you think?I don''t know very much about CMCI. If you think this is required I will certainly check it in. -- Keir _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Keir Fraser
2008-Dec-23 09:00 UTC
[Xen-devel] Re: [patch 0/3]Enable CMCI (Corrected Machine Check Error Interrupt) for Intel CPUs
On 23/12/2008 08:40, "Keir Fraser" <keir.fraser@eu.citrix.com> wrote:>> As for moving *cmci_owner_set* out of stopmachine_run is basically ok for us. >> Just one thing: >> CMCI might happen and lost during the very small window (old owner is cleared >> while new owner is not set). In order to make sure that CMCI could be >> triggered an on the new owner, we need to clear MSR Bank(i) status register >> [Corrected Error Counter] field ( We normally do this @ CMCI interrupt >> handler, according to spec, if the counter is not cleared, CMCI will not be >> triggered any more). >> I made a small patch for it in the attachment. How do you think? > > I don''t know very much about CMCI. If you think this is required I will > certainly check it in.Actually I think this is a good idea, even if we''d stayed with your original CMCI patches. I will apply it. One thing -- if you want to reduce the window between release of a band by its old owner and acquisition by a new owner, we could do the whole lot before stop_machine_run()? Maybe cmci_cpu_down(cpu) which would IPI ''cpu'' to clear its CMCI state and then IPI all other CPUs to pick up the released banks. This would be neatly hooked off CPU_DOWN_PREPARE or similar in Linux, but Xen doesn''t have cpu notifiers. :-) You''d have to call cmci_cpu_down() explicitly in cpu_down(). Or perhaps we should have cpu notifier chains in Xen too... If we do the above I don''t think we need to re-introduce your rollback logic. If you think about it, there''s no reason to prefer the old owner over the new owner, so no reason to roll back. I believe? -- Keir _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Ke, Liping
2008-Dec-23 12:32 UTC
RE: [Xen-devel] Re: [patch 0/3]Enable CMCI (Corrected Machine Check Error Interrupt) for Intel CPUs
> > One thing -- if you want to reduce the window between release of a > band by its old owner and acquisition by a new owner, we could do the > whole lot before stop_machine_run()? Maybe cmci_cpu_down(cpu) which > would IPI ''cpu'' to clear its CMCI state and then IPI all other CPUs > to pick up the released banks. This would be neatly hooked off > CPU_DOWN_PREPARE or similar in Linux, but Xen doesn''t have cpu > notifiers. :-) You''d have to call cmci_cpu_down() explicitly in > cpu_down(). Or perhaps we should have cpu notifier chains in Xen > too...Hi, Keir When we wrote the patch, yunhong also mentioned similar thoughts, I will have some discussion with him tomorrow. Thanks a lot! Criping> > If we do the above I don''t think we need to re-introduce your rollback > logic. If you think about it, there''s no reason to prefer the old > owner over the new owner, so no reason to roll back. I believe? > > -- Keir > > > > _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xensource.com > http://lists.xensource.com/xen-devel_______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Jiang, Yunhong
2008-Dec-24 02:57 UTC
RE: [Xen-devel] Re: [patch 0/3]Enable CMCI (Corrected Machine Check Error Interrupt) for Intel CPUs
>-----Original Message----- >From: xen-devel-bounces@lists.xensource.com >[mailto:xen-devel-bounces@lists.xensource.com] On Behalf Of Keir Fraser >Sent: 2008年12月23日 17:00 >To: Ke, Liping >Cc: xen-devel@lists.xensource.com >Subject: [Xen-devel] Re: [patch 0/3]Enable CMCI (Corrected >Machine Check Error Interrupt) for Intel CPUs > >On 23/12/2008 08:40, "Keir Fraser" <keir.fraser@eu.citrix.com> wrote: > >>> As for moving *cmci_owner_set* out of stopmachine_run is >basically ok for us. >>> Just one thing: >>> CMCI might happen and lost during the very small window >(old owner is cleared >>> while new owner is not set). In order to make sure that >CMCI could be >>> triggered an on the new owner, we need to clear MSR Bank(i) >status register >>> [Corrected Error Counter] field ( We normally do this @ >CMCI interrupt >>> handler, according to spec, if the counter is not cleared, >CMCI will not be >>> triggered any more). >>> I made a small patch for it in the attachment. How do you think? >> >> I don't know very much about CMCI. If you think this is >required I will >> certainly check it in. > >Actually I think this is a good idea, even if we'd stayed with >your original >CMCI patches. I will apply it. > >One thing -- if you want to reduce the window between release >of a band by >its old owner and acquisition by a new owner, we could do the whole lot >before stop_machine_run()? Maybe cmci_cpu_down(cpu) which >would IPI 'cpu' to >clear its CMCI state and then IPI all other CPUs to pick up >the released >banks. This would be neatly hooked off CPU_DOWN_PREPARE or >similar in Linux, >but Xen doesn't have cpu notifiers. :-) You'd have to call >cmci_cpu_down() >explicitly in cpu_down(). Or perhaps we should have cpu >notifier chains in >Xen too...Yes, we discussed this when working on the patch. Two target for CMCI when CPU offline: a) We'd better not lost CMCI interrupt; b)if we do lost CMCI interrupt, we should not block further CMCI interrupt. Since CMCI is correctaed error, so target a) is not so strong. When we place the __cpu_clear_cmci in the stop_machine_run() logic, because the interrupt is disabled when the __cpu_clear_cmci() called, it is sure no CMCI interrupt is lost and no blocking will happen. In current Xen implementation, cpu_mcheck_distribute_cmci() is called after stop_machine_run(), so although there may be CMCI interrupt lost, but with Criping's patch, the CMCI will not be blocked anymore, and the solution is very clear. As for your proposal of do it before the stop_machine_run(), it may reduce the window, but still can't eliminate the window of lost CMCI interrupt, unless we do similar thing in the cmci_cpu_down() (i.e. all CPU is irq_disabled before update the CMCI status). It is the same if we pull the notifier chain to Xen. How is your idea?> >If we do the above I don't think we need to re-introduce your rollback >logic. If you think about it, there's no reason to prefer the >old owner over >the new owner, so no reason to roll back. I believe?Yes, currently we don't need rollback anymore. But if we do it before stop_machine_run(), we may still need the rollback for bank owned by the down cpu exclusively (not all back is shared between CPUs). Thanks Yunhong Jiang> > -- Keir > > > >_______________________________________________ >Xen-devel mailing list >Xen-devel@lists.xensource.com >http://lists.xensource.com/xen-devel >_______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Keir Fraser
2008-Dec-24 08:01 UTC
Re: [Xen-devel] Re: [patch 0/3]Enable CMCI (Corrected Machine Check Error Interrupt) for Intel CPUs
On 24/12/2008 02:57, "Jiang, Yunhong" <yunhong.jiang@intel.com> wrote:> As for your proposal of do it before the stop_machine_run(), it may reduce the > window, but still can''t eliminate the window of lost CMCI interrupt, unless we > do similar thing in the cmci_cpu_down() (i.e. all CPU is irq_disabled before > update the CMCI status). It is the same if we pull the notifier chain to Xen. > How is your idea?You only need hardirqs disabled, so rendezvous within an on_each_cpu() callback is sufficient for you. You don''t need to get intimate with stop_machine_run()! -- Keir _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Jiang, Yunhong
2008-Dec-24 08:46 UTC
RE: [Xen-devel] Re: [patch 0/3]Enable CMCI (Corrected Machine Check Error Interrupt) for Intel CPUs
>-----Original Message----- >From: Keir Fraser [mailto:keir.fraser@eu.citrix.com] >Sent: 2008年12月24日 16:02 >To: Jiang, Yunhong; Ke, Liping >Cc: xen-devel@lists.xensource.com >Subject: Re: [Xen-devel] Re: [patch 0/3]Enable CMCI (Corrected >Machine Check Error Interrupt) for Intel CPUs > >On 24/12/2008 02:57, "Jiang, Yunhong" <yunhong.jiang@intel.com> wrote: > >> As for your proposal of do it before the stop_machine_run(), >it may reduce the >> window, but still can't eliminate the window of lost CMCI >interrupt, unless we >> do similar thing in the cmci_cpu_down() (i.e. all CPU is >irq_disabled before >> update the CMCI status). It is the same if we pull the >notifier chain to Xen. >> How is your idea? > >You only need hardirqs disabled, so rendezvous within an on_each_cpu() >callback is sufficient for you. You don't need to get intimate with >stop_machine_run()!Yes, that's what I mean of "do similar thing in the cmci_cpu_down()". We placed it in stop_machine_run() because it has rendezvous all CPU already :$> > -- Keir > > >_______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Keir Fraser
2008-Dec-24 08:54 UTC
Re: [Xen-devel] Re: [patch 0/3]Enable CMCI (Corrected Machine Check Error Interrupt) for Intel CPUs
On 24/12/2008 08:46, "Jiang, Yunhong" <yunhong.jiang@intel.com> wrote:>> You only need hardirqs disabled, so rendezvous within an on_each_cpu() >> callback is sufficient for you. You don''t need to get intimate with >> stop_machine_run()! > > Yes, that''s what I mean of "do similar thing in the cmci_cpu_down()". We > placed it in stop_machine_run() because it has rendezvous all CPU already :$Yeah, on_each_cpu() rendezvous is pretty cheap. May as well have cleaner partitioned code than avoid an extra rendezvous. -- Keir _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Jiang, Yunhong
2008-Dec-24 14:51 UTC
RE: [Xen-devel] Re: [patch 0/3]Enable CMCI (Corrected Machine Check Error Interrupt) for Intel CPUs
>-----Original Message----- >From: xen-devel-bounces@lists.xensource.com >[mailto:xen-devel-bounces@lists.xensource.com] On Behalf Of Keir Fraser >Sent: 2008年12月24日 16:54 >To: Jiang, Yunhong; Ke, Liping >Cc: xen-devel@lists.xensource.com >Subject: Re: [Xen-devel] Re: [patch 0/3]Enable CMCI (Corrected >Machine Check Error Interrupt) for Intel CPUs > >On 24/12/2008 08:46, "Jiang, Yunhong" <yunhong.jiang@intel.com> wrote: > >>> You only need hardirqs disabled, so rendezvous within an >on_each_cpu() >>> callback is sufficient for you. You don't need to get intimate with >>> stop_machine_run()! >> >> Yes, that's what I mean of "do similar thing in the >cmci_cpu_down()". We >> placed it in stop_machine_run() because it has rendezvous >all CPU already :$ > >Yeah, on_each_cpu() rendezvous is pretty cheap. May as well >have cleaner >partitioned code than avoid an extra rendezvous.I'd have a bit clarification here. Simply harirqs disabled in the on_each_cpu() is not enough. What we need is, firstly make sure every CPU has irq disabled, then every CPU begin update the CMCI status. If this is what you mean, we will take a patch for it later. Thanks Yunhong Jiang> > -- Keir > > > >_______________________________________________ >Xen-devel mailing list >Xen-devel@lists.xensource.com >http://lists.xensource.com/xen-devel >_______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Keir Fraser
2008-Dec-24 14:55 UTC
Re: [Xen-devel] Re: [patch 0/3]Enable CMCI (Corrected Machine Check Error Interrupt) for Intel CPUs
On 24/12/2008 14:51, "Jiang, Yunhong" <yunhong.jiang@intel.com> wrote:>>> Yes, that''s what I mean of "do similar thing in the >> cmci_cpu_down()". We >>> placed it in stop_machine_run() because it has rendezvous >> all CPU already :$ >> >> Yeah, on_each_cpu() rendezvous is pretty cheap. May as well >> have cleaner >> partitioned code than avoid an extra rendezvous. > > I''d have a bit clarification here. Simply harirqs disabled in the > on_each_cpu() is not enough. What we need is, firstly make sure every CPU has > irq disabled, then every CPU begin update the CMCI status. If this is what you > mean, we will take a patch for it later.Every CPU will have IRQs disabled on entry to the on_each_cpu() IPI function. I couldn''t mean much else -- for example, it''s not even valid for the caller of on_each_cpu() to enter it with IRQs disabled. -- Keir _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Jiang, Yunhong
2008-Dec-24 15:08 UTC
RE: [Xen-devel] Re: [patch 0/3]Enable CMCI (Corrected Machine Check Error Interrupt) for Intel CPUs
>-----Original Message----- >From: Keir Fraser [mailto:keir.fraser@eu.citrix.com] >Sent: 2008年12月24日 22:55 >To: Jiang, Yunhong; Ke, Liping >Cc: xen-devel@lists.xensource.com >Subject: Re: [Xen-devel] Re: [patch 0/3]Enable CMCI (Corrected >Machine Check Error Interrupt) for Intel CPUs > >On 24/12/2008 14:51, "Jiang, Yunhong" <yunhong.jiang@intel.com> wrote: > >>>> Yes, that's what I mean of "do similar thing in the >>> cmci_cpu_down()". We >>>> placed it in stop_machine_run() because it has rendezvous >>> all CPU already :$ >>> >>> Yeah, on_each_cpu() rendezvous is pretty cheap. May as well >>> have cleaner >>> partitioned code than avoid an extra rendezvous. >> >> I'd have a bit clarification here. Simply harirqs disabled in the >> on_each_cpu() is not enough. What we need is, firstly make >sure every CPU has >> irq disabled, then every CPU begin update the CMCI status. >If this is what you >> mean, we will take a patch for it later. > >Every CPU will have IRQs disabled on entry to the on_each_cpu() IPI >function. I couldn't mean much else -- for example, it's not >even valid for >the caller of on_each_cpu() to enter it with IRQs disabled.Currently on_each_cpu() has no gurantee that the fn() will be called with all CPU has entered the IPI. For example, maybe on CPU has been working on the fn() while the IPI is still pending on other CPU.> > -- Keir > > >_______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Keir Fraser
2008-Dec-24 15:38 UTC
Re: [Xen-devel] Re: [patch 0/3]Enable CMCI (Corrected Machine Check Error Interrupt) for Intel CPUs
On 24/12/2008 15:08, "Jiang, Yunhong" <yunhong.jiang@intel.com> wrote:>> Every CPU will have IRQs disabled on entry to the on_each_cpu() IPI >> function. I couldn''t mean much else -- for example, it''s not >> even valid for >> the caller of on_each_cpu() to enter it with IRQs disabled. > > Currently on_each_cpu() has no gurantee that the fn() will be called with all > CPU has entered the IPI. For example, maybe on CPU has been working on the > fn() while the IPI is still pending on other CPU.In which case you can explicitly rendezvous in the handler, as we do for time_calibration_rendezvous(). Notice there that we snapshot cpu_online_map and use that as cpumask argument to on_selected_cpus() and to count CPUs into a barrier. If you absolutely need all CPUs to have IRQs disabled before you start your work on any CPU, that''s the way to do it. -- Keir _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel