Ke, Liping
2008-Dec-19 06:19 UTC
[Xen-devel] [patch 4/4]Enable CMCI (Corrected Machine Check Error Interrupt) for Intel CPUs
Hi, all Patch 4 is the main patch for CMCI enabling in XEN. It adds the CMCI interrupt handler, new common CMCI/MCA init process,CMCI owner judge algorithm when bring_up CPUs, CPU on/offline and polling mechanisms, etc Thanks & Regards, Criping _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Christoph Egger
2009-Jan-12 14:03 UTC
Re: [Xen-devel] [patch 4/4]Enable CMCI (Corrected Machine Check Error Interrupt) for Intel CPUs
On Friday 19 December 2008 07:19:16 Ke, Liping wrote:> Hi, all > > Patch 4 is the main patch for CMCI enabling in XEN. It adds the CMCI > interrupt handler, new common CMCI/MCA init process,CMCI owner judge > algorithm when bring_up CPUs, CPU on/offline and polling mechanisms, etc > > Thanks & Regards, > CripingThis patch changes the public API. There''s no need to change struct mcinfo_extended. The marshalling technique allows to use this structure as often as needed. Please undo this change. struct mcinfo_global is also changed. Why can''t mc_coreid not be filled with the apicid ? Adding the apicid field breaks the alignment causing troubles on 32bit-guest-on-64bit-hypervisor. Christoph -- Advanced Micro Devices GmbH Karl-Hammerschmidt-Str. 34 85609 Dornach b. München Geschäftsführer: Jochen Polster, Thomas M. McCoy, Giuliano Meroni Sitz: Dornach, Gemeinde Aschheim, Landkreis München Registergericht München, HRB Nr. 43632 _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Ke, Liping
2009-Jan-13 02:12 UTC
RE: [Xen-devel] [patch 4/4]Enable CMCI (Corrected Machine Check Error Interrupt) for Intel CPUs
Hi, Christoph Please see our comments below Christoph Egger wrote:> On Friday 19 December 2008 07:19:16 Ke, Liping wrote: >> Hi, all >> >> Patch 4 is the main patch for CMCI enabling in XEN. It adds the CMCI >> interrupt handler, new common CMCI/MCA init process,CMCI owner judge >> algorithm when bring_up CPUs, CPU on/offline and polling >> mechanisms, etc >> >> Thanks & Regards, >> Criping > > This patch changes the public API. There''s no need to change > struct mcinfo_extended. The marshalling technique allows to use > this structure as often as needed. Please undo this change.Since Intel extended msrs'' number is bigger than AMD, and we can''t use pointer and allocate memory in mca handler, so we extended it from 5 -> 10. We think it should not impact any of your usage. Else, we can change boundary 5->0, use a globally allocated pointer instead. But it seems not that necessary. How do you think about?> > struct mcinfo_global is also changed. Why can''t mc_coreid not be > filled with the apicid ? Adding the apicid field breaks the alignment > causing troubles on 32bit-guest-on-64bit-hypervisor.We plan to extend the apic_id to 32 bit since 8 bit is not enough for new apic_id. Besides, for this problem, before sending the patch, we actually talked about it. Seems original structure is not 32/64 alligned. How about below changes? struct mcinfo_global { struct mcinfo_common common; 4 byte /* running domain at the time in error (most likely the impacted one) */ uint16_t mc_domid; 2 byte uint32_t mc_socketid; /* physical socket of the physical core */ 4 byte uint16_t mc_coreid; /* physical impacted core */ 2 byte uint8_t mc_apicid; ---change it to 4 byte uint16_t mc_core_threadid; /* core thread of physical core */ 2 byte uint16_t mc_vcpuid; /* virtual cpu scheduled for mc_domid */ 2 byte uint64_t mc_gstatus; /* global status */ 8 byte uint32_t mc_flags; 4 byte }; Change to ------------------------>>>>> struct mcinfo_global { struct mcinfo_common common; 4 byte /* running domain at the time in error (most likely the impacted one) */ uint32_t mc_socketid; /* physical socket of the physical core */ 4 byte ----------------------------------------------------------------- uint16_t mc_domid; 2 byte uint16_t mc_coreid; /* physical impacted core */ 2 byte uint32_t mc_apicid; ---change it to 4 byte ------------------------------------------------------------------- uint16_t mc_core_threadid; /* core thread of physical core */ 2 byte uint16_t mc_vcpuid; /* virtual cpu scheduled for mc_domid */ 2 byte uint32_t mc_flags; 4 byte -------------------------------------------------------------------------- uint64_t mc_gstatus; /* global status */ 8 byte }; Thanks & Regards, Criping> > Christoph_______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Christoph Egger
2009-Jan-13 11:21 UTC
Re: [Xen-devel] [patch 4/4]Enable CMCI (Corrected Machine Check Error Interrupt) for Intel CPUs
On Tuesday 13 January 2009 03:12:51 Ke, Liping wrote:> Hi, Christoph > Please see our comments below > > Christoph Egger wrote: > > On Friday 19 December 2008 07:19:16 Ke, Liping wrote: > >> Hi, all > >> > >> Patch 4 is the main patch for CMCI enabling in XEN. It adds the CMCI > >> interrupt handler, new common CMCI/MCA init process,CMCI owner judge > >> algorithm when bring_up CPUs, CPU on/offline and polling > >> mechanisms, etc > >> > >> Thanks & Regards, > >> Criping > > > > This patch changes the public API. There''s no need to change > > struct mcinfo_extended. The marshalling technique allows to use > > this structure as often as needed. Please undo this change. > > Since Intel extended msrs'' number is bigger than AMD, and we can''t use > pointer and allocate memory in mca handler, so we extended it from 5 -> 10. > We think it should not impact any of your usage. > > Else, we can change boundary 5->0, use a globally allocated pointer > instead. But it seems not that necessary. How do you think about?When I defined the API, I already knew that 5 is not enough for Intel. So I made struct mc_info large enough to keep multiple entities in any combination. I expected from you to fill struct mcinfo_extended two or three times into struct mc_info the same way as you do with struct mcinfo_bank. There''s no (additional) allocation needed. From your description I just read, that you don''t know this marshalling technique.> > struct mcinfo_global is also changed. Why can''t mc_coreid not be > > filled with the apicid ? Adding the apicid field breaks the alignment > > causing troubles on 32bit-guest-on-64bit-hypervisor. > > We plan to extend the apic_id to 32 bit since 8 bit is not enough for new > apic_id. Besides, for this problem, before sending the patch, we actually > talked about it. Seems original structure is not 32/64 alligned. How about > below changes? > > struct mcinfo_global { > struct mcinfo_common common; 4 byte > > /* running domain at the time in error (most likely the impacted one) > */ uint16_t mc_domid; 2 byte > uint32_t mc_socketid; /* physical socket of the physical core */ 4 byte > uint16_t mc_coreid; /* physical impacted core */ 2 byte > uint8_t mc_apicid; ---change it to 4 byte > uint16_t mc_core_threadid; /* core thread of physical core */ 2 byte > uint16_t mc_vcpuid; /* virtual cpu scheduled for mc_domid */ 2 byte > uint64_t mc_gstatus; /* global status */ 8 byte > uint32_t mc_flags; 4 byte > }; > > > Change to > ------------------------>>>>> > > struct mcinfo_global { > struct mcinfo_common common; 4 byte > > /* running domain at the time in error (most likely the impacted one) > */ uint32_t mc_socketid; /* physical socket of the physical core */ 4 byte > ----------------------------------------------------------------- > uint16_t mc_domid; 2 byte > uint16_t mc_coreid; /* physical impacted core */ 2 byte > uint32_t mc_apicid; ---change it to 4 byte > ------------------------------------------------------------------- > uint16_t mc_core_threadid; /* core thread of physical core */ 2 byte > uint16_t mc_vcpuid; /* virtual cpu scheduled for mc_domid */ 2 byte > uint32_t mc_flags; 4 byte > -------------------------------------------------------------------------- > uint64_t mc_gstatus; /* global status */ 8 byte > };Your proposed change fixes the alignment problem, but it doesn''t answer my question: Why is mc_apicid needed at all since the CPU core is already identified by mc_coreid ? Christoph -- Advanced Micro Devices GmbH Karl-Hammerschmidt-Str. 34 85609 Dornach b. München Geschäftsführer: Jochen Polster, Thomas M. McCoy, Giuliano Meroni Sitz: Dornach, Gemeinde Aschheim, Landkreis München Registergericht München, HRB Nr. 43632 _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Frank Van Der Linden
2009-Jan-13 16:48 UTC
Re: [Xen-devel] [patch 4/4]Enable CMCI (Corrected Machine Check Error Interrupt) for Intel CPUs
Hi Christoph, This patch contains part of our work to extend the MCE interface, as part of our FMA efforts. I know we changed some structures, and I also remember some alignment issues that we had to fix. I think these patches are partly based on a slightly older version of what we made available, not our final version. I''m hoping to bring all our changes up to xen-unstable level soon (they were originally for 3.1.4), and post them. That should refresh my mind about our changes, and hopefully we can discuss things then (with Gavin too, who made a lot of these changes). We (Sun) should probably also re-synch with some of the Intel efforts. Again, hopefully we can do this soon, - Frank _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Christoph Egger
2009-Jan-13 17:11 UTC
Re: [Xen-devel] [patch 4/4]Enable CMCI (Corrected Machine Check Error Interrupt) for Intel CPUs
On Tuesday 13 January 2009 17:48:48 Frank Van Der Linden wrote:> Hi Christoph, > > This patch contains part of our work to extend the MCE interface, as > part of our FMA efforts. I know we changed some structures, and I also > remember some alignment issues that we had to fix. > > I think these patches are partly based on a slightly older version of > what we made available, not our final version. > > I''m hoping to bring all our changes up to xen-unstable level soon (they > were originally for 3.1.4), and post them. That should refresh my mind > about our changes, and hopefully we can discuss things then (with Gavin > too, who made a lot of these changes).The patches I have seen so far, don''t touch the public API at all. But Intel did. That''s the point in this discussion.> We (Sun) should probably also re-synch with some of the Intel efforts. > > Again, hopefully we can do this soon, > > - Frank-- ---to satisfy European Law for business letters: Advanced Micro Devices GmbH Karl-Hammerschmidt-Str. 34, 85609 Dornach b. Muenchen Geschaeftsfuehrer: Jochen Polster, Thomas M. McCoy, Giuliano Meroni Sitz: Dornach, Gemeinde Aschheim, Landkreis Muenchen Registergericht Muenchen, HRB Nr. 43632 _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Frank van der Linden
2009-Jan-13 17:29 UTC
Re: [Xen-devel] [patch 4/4]Enable CMCI (Corrected Machine Check Error Interrupt) for Intel CPUs
Christoph Egger wrote:> The patches I have seen so far, don''t touch the public API at all. > But Intel did. That''s the point in this discussion.Actually, it seems we did add a few things that changed it. But looking at our changes now, we might not need them (like mc_apicid; it appears we don''t actually use it anywhere). But, I''ll have to get back to you on that after I''ve brought the changes up to xen-unstable level. - Frank _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Ke, Liping
2009-Jan-14 01:35 UTC
RE: [Xen-devel] [patch 4/4]Enable CMCI (Corrected Machine Check Error Interrupt) for Intel CPUs
Christoph Egger wrote:> On Tuesday 13 January 2009 03:12:51 Ke, Liping wrote: >> Hi, Christoph >> Please see our comments below >> >> Christoph Egger wrote: >>> On Friday 19 December 2008 07:19:16 Ke, Liping wrote: >>>> Hi, all >>>> >>>> Patch 4 is the main patch for CMCI enabling in XEN. It adds the >>>> CMCI interrupt handler, new common CMCI/MCA init process,CMCI >>>> owner judge algorithm when bring_up CPUs, CPU on/offline and >>>> polling mechanisms, etc >>>> >>>> Thanks & Regards, >>>> Criping >>> >>> This patch changes the public API. There''s no need to change >>> struct mcinfo_extended. The marshalling technique allows to use >>> this structure as often as needed. Please undo this change. >> >> Since Intel extended msrs'' number is bigger than AMD, and we can''t >> use pointer and allocate memory in mca handler, so we extended it >> from 5 -> 10. We think it should not impact any of your usage. >> >> Else, we can change boundary 5->0, use a globally allocated pointer >> instead. But it seems not that necessary. How do you think about? > > When I defined the API, I already knew that 5 is not enough for Intel. > So I made struct mc_info large enough to keep multiple entities in any > combination. I expected from you to fill struct mcinfo_extended two or > three times into struct mc_info the same way as you do with > struct mcinfo_bank. There''s no (additional) allocation needed. > > From your description I just read, that you don''t know this > marshalling technique.Hi, Christoph We do understand and actually are using your *marshalling* techniques, you see we push each bank infor to one entry of your *mcinfo_bank*. It''s flexible and just good. But split one integral information into several entries seems not necessary. For me it is not good programming style for me. If we have several kinds of extended Infor, surely I will use your marshalling techniques. It should not *impact* public interface(We noticed you have calculate data structure size dynamically we think). How do you think? For the reason of keeping apic_id, partly because we''re not sure whether the information would be useful to recovery action(It brings more Information than core_id), so we just want to *keep* it. If you don''t use it, just let it be blank. It should not impact your interface. Your opinion? Thanks a lot for your help! Regards, Criping> > >>> struct mcinfo_global is also changed. Why can''t mc_coreid not be >>> filled with the apicid ? Adding the apicid field breaks the >>> alignment causing troubles on 32bit-guest-on-64bit-hypervisor. >> >> We plan to extend the apic_id to 32 bit since 8 bit is not enough >> for new apic_id. Besides, for this problem, before sending the >> patch, we actually talked about it. Seems original structure is not >> 32/64 alligned. How about below changes? >> >> struct mcinfo_global { >> struct mcinfo_common common; 4 byte >> >> /* running domain at the time in error (most likely the impacted >> one) */ uint16_t mc_domid; 2 byte uint32_t mc_socketid; /* >> physical socket of the physical core */ 4 byte uint16_t >> mc_coreid; /* physical impacted core */ 2 byte uint8_t >> mc_apicid; ---change it to 4 byte uint16_t mc_core_threadid; /* >> core thread of physical core */ 2 byte uint16_t mc_vcpuid; /* >> virtual cpu scheduled for mc_domid */ 2 byte uint64_t >> mc_gstatus; /* global status */ 8 byte uint32_t mc_flags; 4 byte >> }; >> >> >> Change to >> ------------------------>>>>> >> >> struct mcinfo_global { >> struct mcinfo_common common; 4 byte >> >> /* running domain at the time in error (most likely the impacted >> one) */ uint32_t mc_socketid; /* physical socket of the physical >> core */ 4 byte >> >> >> ----------------------------------------------------------------- >> uint16_t mc_domid; 2 byte uint16_t mc_coreid; /* physical >> impacted core */ 2 byte uint32_t mc_apicid; ---change it to 4 >> byte >> >> ------------------------------------------------------------------- >> uint16_t mc_core_threadid; /* core thread of physical core */ 2 byte >> uint16_t mc_vcpuid; /* virtual cpu scheduled for mc_domid */ 2 byte >> uint32_t mc_flags; 4 byte >> -------------------------------------------------------------------------- >> uint64_t mc_gstatus; /* global status */ 8 byte }; > > Your proposed change fixes the alignment problem, but it doesn''t > answer my question: Why is mc_apicid needed at all since the CPU core > is already identified by mc_coreid ? > > Christoph_______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Jiang, Yunhong
2009-Jan-14 02:11 UTC
RE: [Xen-devel] [patch 4/4]Enable CMCI (Corrected Machine Check Error Interrupt) for Intel CPUs
Frank.Vanderlinden@Sun.COM <mailto:Frank.Vanderlinden@Sun.COM> wrote:> Christoph Egger wrote: > >> The patches I have seen so far, don''t touch the public API at all. >> But Intel did. That''s the point in this discussion. > > Actually, it seems we did add a few things that changed it. > But looking > at our changes now, we might not need them (like mc_apicid; it appears > we don''t actually use it anywhere).For the mc_apicid, I think it gives informatoin for the whole topology information of the system, and will be helpful if needed in future. That''s the reason we adopt the changes in our patch. Of course, we can get them through translate the mc_coreid to the apic_id, requiring dom0 cache the translation or have a hypercall for it.> > But, I''ll have to get back to you on that after I''ve brought > the changes > up to xen-unstable level.Yes, I suggest we hold the discussion till your changes are submited to upstream.> > - Frank_______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Jiang, Yunhong
2009-Jan-14 02:58 UTC
RE: [Xen-devel] [patch 4/4]Enable CMCI (Corrected Machine Check Error Interrupt) for Intel CPUs
xen-devel-bounces@lists.xensource.com <> wrote:> On Tuesday 13 January 2009 03:12:51 Ke, Liping wrote: > > When I defined the API, I already knew that 5 is not enough for Intel. > So I made struct mc_info large enough to keep multiple entities in any > combination. I expected from you to fill struct mcinfo_extended two or > three times into struct mc_info the same way as you do with > struct mcinfo_bank. There''s no (additional) allocation needed.Just as Liping has pointed out, since there is size caculated in mcinfo_common, it should be ok for this changes, especially considering just increase the size, not decrease. Anyway, let''s wait for Frank and Gavin''s changes. BTW, do you know any usage for this MCA mechanism except Sun? We are considering more changes to Xen''s MCA and some suggestion is send to mailing list (http://lists.xensource.com/archives/html/xen-devel/2008-12/msg00643.html gives three options ), we want get feedback from them and you, to avoid suprise in last minutes again. Thanks Yunhong Jiang _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Frank van der Linden
2009-Jan-14 04:13 UTC
Re: [Xen-devel] [patch 4/4]Enable CMCI (Corrected Machine Check Error Interrupt) for Intel CPUs
Jiang, Yunhong wrote:> Just as Liping has pointed out, since there is size caculated in mcinfo_common, it should be ok for this changes, especially considering just increase the size, not decrease. > Anyway, let''s wait for Frank and Gavin''s changes.Thanks! I''ll start re-synching our changes with xen-unstable soon (later this week). Hopefully I''ll be done with that somewhere next week. Then I''ll send our changes to both you and Christoph for review. Then we can discuss them, and present them for integration, so that there''s a common basis for future work from all sides. - Frank _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel