Christoph Egger
2007-Aug-21 13:31 UTC
[Xen-devel] [PATCH] 3/3: MCA/MCE correctable error handling
This is patch 3/3. Signed-off-by: Christoph Egger <Christoph.Egger@amd.com> -- AMD Saxony, Dresden, Germany Operating System Research Center Legal Information: AMD Saxony Limited Liability Company & Co. KG Sitz (Geschäftsanschrift): Wilschdorfer Landstr. 101, 01109 Dresden, Deutschland Registergericht Dresden: HRA 4896 vertretungsberechtigter Komplementär: AMD Saxony LLC (Sitz Wilmington, Delaware, USA) Geschäftsführer der AMD Saxony LLC: Dr. Hans-R. Deppe, Thomas McCoy _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Jan Beulich
2007-Aug-21 16:02 UTC
Re: [Xen-devel] [PATCH] 3/3: MCA/MCE correctable error handling
>+ if (mc_global->mc_flags & MC_FLAG_UNCORRECTABLE)
>+     printk(KERN_EMERG);
>+ else
>+     printk(KERN_INFO);

KERN_INFO seems gross understatement here - generally, correctable MCs are considered indicators that within not too distant future uncorrectable MCs might result, so this generally is a call for action (and hence shouldn't be hidden with default log level settings).

Also, I'm not sure adjusting the polling frequency makes much sense - 30s seems an awful lot of time to me.

Jan
Christoph Egger
2007-Aug-22 09:00 UTC
Re: [Xen-devel] [PATCH] 3/3: MCA/MCE correctable error handling
On Tuesday 21 August 2007 18:02:54 Jan Beulich wrote:
> >+ if (mc_global->mc_flags & MC_FLAG_UNCORRECTABLE)
> >+     printk(KERN_EMERG);
> >+ else
> >+     printk(KERN_INFO);
>
> KERN_INFO seems gross understatement here - generally, correctable MCs are
> considered indicators that within not too distant future uncorrectable MCs
> might result, so this generally is a call for action (and hence shouldn't
> be hidden with default log level settings).

Well, that is what the "old" code did. It used KERN_EMERG for fatal errors and KERN_INFO in the polling service routine. What do you want me to suggest?

> Also, I'm not sure adjusting the polling frequency makes much sense - 30s
> seems an awful lot of time to me.

It's not clear to me what you are trying to tell me. Please explain/elaborate.

Christoph
Jan Beulich
2007-Aug-22 10:09 UTC
Re: [Xen-devel] [PATCH] 3/3: MCA/MCE correctable error handling
>>> "Christoph Egger" <Christoph.Egger@amd.com> 22.08.07 11:00 >>>
>On Tuesday 21 August 2007 18:02:54 Jan Beulich wrote:
>> >+ if (mc_global->mc_flags & MC_FLAG_UNCORRECTABLE)
>> >+     printk(KERN_EMERG);
>> >+ else
>> >+     printk(KERN_INFO);
>>
>> KERN_INFO seems gross understatement here - generally, correctable MCs are
>> considered indicators that within not too distant future uncorrectable MCs
>> might result, so this generally is a call for action (and hence shouldn't
>> be hidden with default log level settings).
>
>Well, that is what the "old" code did. It used KERN_EMERG for fatal errors
>and KERN_INFO in the polling service routine. What do you want me to suggest?

This should be at least KERN_WARNING, probably even KERN_ERR (note though that KERN_ERR and KERN_EMERG both resolve to XENLOG_ERR).

>> Also, I'm not sure adjusting the polling frequency makes much sense - 30s
>> seems an awful lot of time to me.
>
>It's not clear to me what you are trying to tell me. Please explain/elaborate.

What I'm trying to say is that I'd think this should be polled at a much higher frequency (I'd suggest 1Hz), without adjustments. Typically, a healthy system will not encounter problems soon after boot, but after running for perhaps a very long time (and a system in bad condition is likely to encounter problems right away, so wouldn't be affected by changing the polling rate). Thus, in the general case, you'd have a comparably long latency, during which some kind of (automated) action could already be taken to preserve data consistency.

Jan
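The log-level suggestion above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the patch: the enum `mc_loglevel`, the function `mc_pick_loglevel`, and the bit value given to `MC_FLAG_UNCORRECTABLE` are hypothetical. Only the flag name and the KERN_* level names come from the thread.

```c
/* Hypothetical severity levels named after the KERN_* levels in the thread. */
enum mc_loglevel {
    MC_LOG_INFO,     /* what the old polling code used for correctable errors */
    MC_LOG_WARNING,  /* the minimum Jan asks for */
    MC_LOG_EMERG     /* used for uncorrectable (fatal) errors */
};

/* Assumed bit value; the real definition lives in the patch, which is not shown. */
#define MC_FLAG_UNCORRECTABLE 0x1u

/*
 * Pick a severity for a machine check report.
 * Correctable errors are raised from INFO to WARNING so that they are not
 * hidden by default log level settings.
 */
enum mc_loglevel mc_pick_loglevel(unsigned int mc_flags)
{
    if (mc_flags & MC_FLAG_UNCORRECTABLE)
        return MC_LOG_EMERG;
    return MC_LOG_WARNING;
}
```

If both levels end up resolving to the same Xen log level, as Jan notes for KERN_ERR and KERN_EMERG, the branch stops carrying information and can be dropped.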
Christoph Egger
2007-Aug-22 15:56 UTC
Re: [Xen-devel] [PATCH] 3/3: MCA/MCE correctable error handling
On Wednesday 22 August 2007 12:09:41 Jan Beulich wrote:
> >>> "Christoph Egger" <Christoph.Egger@amd.com> 22.08.07 11:00 >>>
> >
> >On Tuesday 21 August 2007 18:02:54 Jan Beulich wrote:
> >> >+ if (mc_global->mc_flags & MC_FLAG_UNCORRECTABLE)
> >> >+     printk(KERN_EMERG);
> >> >+ else
> >> >+     printk(KERN_INFO);
> >>
> >> KERN_INFO seems gross understatement here - generally, correctable MCs
> >> are considered indicators that within not too distant future
> >> uncorrectable MCs might result, so this generally is a call for action
> >> (and hence shouldn't be hidden with default log level settings).
> >
> >Well, that is what the "old" code did. It used KERN_EMERG for fatal errors
> >and KERN_INFO in the polling service routine. What do you want me to
> > suggest?
>
> This should be at least KERN_WARNING, probably even KERN_ERR (note
> though that KERN_ERR and KERN_EMERG both resolve to XENLOG_ERR).

I changed to KERN_WARNING. This made the above if block superfluous. Thanks. I will re-submit this patch as well.

> >> Also, I'm not sure adjusting the polling frequency makes much sense -
> >> 30s seems an awful lot of time to me.
> >
> >It's not clear to me what you are trying to tell me. Please
> > explain/elaborate.
>
> What I'm trying to say is that I'd think this should be polled at a much
> higher frequency (I'd suggest 1Hz), without adjustments. Typically, a
> healthy system will not encounter problems soon after boot, but after
> running for perhaps a very long time (and a system in bad condition is
> likely to encounter problems right away, so wouldn't be affected by
> changing the polling rate). Thus, in the general case, you'd have a
> comparably long latency, during which some kind of (automated) action could
> already be taken to preserve data consistency.

The polling routine that is in the -unstable tree (the version taken from Linux) runs every 15 seconds without adjustments. 1Hz causes too much system load for a healthy system IMO. That's why I introduced the adjustments with use of hw threshold registers to come to a compromise solution.
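To make the trade-off between a fixed 1Hz poll and an adjusted interval concrete, here is a minimal sketch of one possible adjustment policy. It is an assumption, not the algorithm in the patch: the thread does not show how the hw threshold registers feed into the interval. The halve/double rule, the function name `mc_next_poll_interval`, and both bounds are hypothetical; the 1 s and 30 s figures are simply the values mentioned in the discussion.

```c
/* Assumed bounds: 1 s matches the 1Hz Jan suggests, 30 s is the figure he objects to. */
#define MC_POLL_MIN_MS  1000u
#define MC_POLL_MAX_MS 30000u

/*
 * Compute the next polling interval in milliseconds.
 * If the last poll found correctable errors, poll twice as often;
 * if it found none, back off by doubling the interval.
 * The result is always clamped to [MC_POLL_MIN_MS, MC_POLL_MAX_MS].
 */
unsigned int mc_next_poll_interval(unsigned int cur_ms, unsigned int errors_found)
{
    unsigned int next;

    /* Clamp the input first so the doubling below cannot overflow. */
    if (cur_ms < MC_POLL_MIN_MS)
        cur_ms = MC_POLL_MIN_MS;
    if (cur_ms > MC_POLL_MAX_MS)
        cur_ms = MC_POLL_MAX_MS;

    next = errors_found ? cur_ms / 2 : cur_ms * 2;

    if (next < MC_POLL_MIN_MS)
        next = MC_POLL_MIN_MS;
    if (next > MC_POLL_MAX_MS)
        next = MC_POLL_MAX_MS;
    return next;
}
```

Under this policy a healthy system drifts to the slow end of the range, which is exactly the latency concern Jan raises: the first error after a long quiet period is seen only after up to `MC_POLL_MAX_MS`.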
Keir Fraser
2007-Aug-22 16:05 UTC
Re: [Xen-devel] [PATCH] 3/3: MCA/MCE correctable error handling
On 22/8/07 16:56, "Christoph Egger" <Christoph.Egger@amd.com> wrote:

>> What I'm trying to say is that I'd think this should be polled at a much
>> higher frequency (I'd suggest 1Hz), without adjustments. Typically, a
>> healthy system will not encounter problems soon after boot, but after
>> running for perhaps a very long time (and a system in bad condition is
>> likely to encounter problems right away, so wouldn't be affected by
>> changing the polling rate). Thus, in the general case, you'd have a
>> comparably long latency, during which some kind of (automated) action could
>> already be taken to preserve data consistency.
>
> The polling routine that is in the -unstable tree (the version taken from
> Linux) runs every 15 seconds without adjustments.
> 1Hz causes too much system load for a healthy system IMO.
> That's why I introduced the adjustments with use of hw threshold registers
> to come to a compromise solution.

What's the deal here? Do correctable errors not cause an MCE, yet are still detected via the machine-check architecture (albeit by a polling method)?

Are there going to be patches on the Linux side to pick up this MCA info? What is Linux going to do with it, apart from log it (which Xen can already do itself)? Or is this all Solaris-specific?

 -- Keir
Keir Fraser
2007-Aug-22 16:10 UTC
Re: [Xen-devel] [PATCH] 3/3: MCA/MCE correctable error handling
On 22/8/07 17:05, "Keir Fraser" <keir@xensource.com> wrote:

>> The polling routine that is in the -unstable tree (the version taken from
>> Linux) runs every 15 seconds without adjustments.
>> 1Hz causes too much system load for a healthy system IMO.
>> That's why I introduced the adjustments with use of hw threshold registers
>> to come to a compromise solution.
>
> What's the deal here? Do correctable errors not cause an MCE, yet are still
> detected via the machine-check architecture (albeit by a polling method)?
>
> Are there going to be patches on the Linux side to pick up this MCA info?
> What is Linux going to do with it, apart from log it (which Xen can already
> do itself)? Or is this all Solaris-specific?

Oh, and is AMD-specific code really needed in non-fatal.c? I thought the MCA stuff was architectural now rather than vendor specific? If there are vendor-specific extensions then they belong in the vendor's .c file.

 -- Keir
Christoph Egger
2007-Aug-23 06:57 UTC
Re: [Xen-devel] [PATCH] 3/3: MCA/MCE correctable error handling
On Wednesday 22 August 2007 18:10:24 Keir Fraser wrote:
> On 22/8/07 17:05, "Keir Fraser" <keir@xensource.com> wrote:
> >> The polling routine that is in the -unstable tree (the version taken
> >> from Linux) runs every 15 seconds without adjustments.
> >> 1Hz causes too much system load for a healthy system IMO.
> >> That's why I introduced the adjustments with use of hw threshold
> >> registers to come to a compromise solution.
> >
> > What's the deal here? Do correctable errors not cause an MCE, yet are
> > still detected via the machine-check architecture (albeit by a polling
> > method)?

The deal here is: detect correctable errors via polling and uncorrectable errors via MCE. This patchset is about correctable errors.

> > Are there going to be patches on the Linux side to pick up this MCA info?
> > What is Linux going to do with it, apart from log it (which Xen can
> > already do itself)? Or is this all Solaris-specific?

The general idea is that the Dom0 picks up this MCA info and a) uses the error-handling infrastructure provided for the non-virtualized form and b) will use hypercalls to tell Xen to also report MCA to a DomU and/or kill a DomU. Some hw features for self-healing can only be used from Dom0 (because the registers sit in the PCI extended config space, which Xen doesn't have access to) and some can be used by Xen itself. I wrote a demo driver for NetBSD/Xen that mainly tests that the Dom0 actually receives the MCA info (Sun prefers to look into BSD licensed code). It should be easy to port it to Linux.

> Oh, and is AMD-specific code really needed in non-fatal.c? I thought the MCA
> stuff was architectural now rather than vendor specific? If there are
> vendor-specific extensions then they belong in the vendor's .c file.

The AMD-specific part is the use of the hw register code. Intel has some additional machine check MSRs containing the register set. Intel may add a structure to patch 2/3 that makes use of them. Should I move the amd polling handler to amd.c?
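The split described above (correctable errors found by polling, uncorrectable errors delivered via MCE) can be illustrated by how a poller might classify a bank status value. This is a sketch, not the patch's code: `mc_classify`, `mc_count_correctable`, and the enum are hypothetical names. The VAL (bit 63) and UC (bit 61) positions are the architectural MCi_STATUS bits; how the patch actually reads the banks is not shown in the thread, so the status values are passed in as a plain array here.

```c
#include <stddef.h>
#include <stdint.h>

#define MCi_STATUS_VAL (1ULL << 63)  /* bank holds a valid error record */
#define MCi_STATUS_UC  (1ULL << 61)  /* error was not corrected by hardware */

enum mc_kind { MC_NONE, MC_CORRECTABLE, MC_UNCORRECTABLE };

/* Classify one bank's MCi_STATUS value. */
enum mc_kind mc_classify(uint64_t status)
{
    if (!(status & MCi_STATUS_VAL))
        return MC_NONE;
    return (status & MCi_STATUS_UC) ? MC_UNCORRECTABLE : MC_CORRECTABLE;
}

/*
 * Count the correctable errors in a snapshot of bank status values.
 * A poller would care only about these; uncorrectable errors are expected
 * to arrive through the MCE exception path instead.
 */
size_t mc_count_correctable(const uint64_t *bank_status, size_t nr_banks)
{
    size_t i, count = 0;

    for (i = 0; i < nr_banks; i++)
        if (mc_classify(bank_status[i]) == MC_CORRECTABLE)
            count++;
    return count;
}
```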
Christoph Egger
2007-Aug-23 09:27 UTC
[Xen-devel] [PATCH] resend 3/3: MCA/MCE correctable error handling
Yesterday I said I would re-send this patch. Here it is. It incorporates feedback from Jan Beulich.

Signed-off-by: Christoph Egger <Christoph.Egger@amd.com>
Keir Fraser
2007-Aug-23 14:07 UTC
Re: [Xen-devel] [PATCH] 3/3: MCA/MCE correctable error handling
On 23/8/07 07:57, "Christoph Egger" <Christoph.Egger@amd.com> wrote:

>> Oh, and is AMD-specific code really needed in non-fatal.c? I thought the MCA
>> stuff was architectural now rather than vendor specific? If there are
>> vendor-specific extensions then they belong in the vendor's .c file.
>
> AMD-specific is the use of the hw register code. Intel has some additional
> machine check MSR's containing the register set. Intel may add a structure
> to patch 2/3 that make use of them. Should I move the amd polling handler to
> amd.c ?

I think so.

 -- Keir
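One way the agreed split could look, as a sketch only: the generic non-fatal code selects a polling handler, and the vendor-specific handler lives in the vendor's file. All names here (`cpu_vendor`, `mc_poll_fn`, `generic_mc_poll`, `amd_mc_poll`, `mc_select_poll_handler`) are hypothetical, and the handler bodies are placeholders; the thread does not show how the resubmitted patch wires this up.

```c
enum cpu_vendor { VENDOR_AMD, VENDOR_INTEL, VENDOR_UNKNOWN };

/* A polling handler returns the number of errors it found (placeholder contract). */
typedef int (*mc_poll_fn)(void);

/* Would live in non-fatal.c: architectural polling only. */
int generic_mc_poll(void)
{
    return 0; /* placeholder body */
}

/* Would live in amd.c: polling that also uses the AMD hw threshold registers. */
int amd_mc_poll(void)
{
    return 0; /* placeholder body */
}

/* Generic code picks the handler once, keeping vendor details out of non-fatal.c. */
mc_poll_fn mc_select_poll_handler(enum cpu_vendor vendor)
{
    return (vendor == VENDOR_AMD) ? amd_mc_poll : generic_mc_poll;
}
```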