Elliott Mitchell
2017-May-16 03:47 UTC
[Pkg-xen-devel] Bug#810964: [Xen-devel] [BUG] EDAC infomation partially missing
On Mon, May 15, 2017 at 02:02:53AM -0600, Jan Beulich wrote:
> >>> On 14.05.17 at 00:36, <ehem+debian at m5p.com> wrote:
> > I haven't yet done as much experimentation as Andreas Pflug has, but I
> > can confirm I'm also running into this bug with Xen 4.4.1.
> >
> > I've only tried Linux kernel 3.16.43, but as Dom0:
> >
> > EDAC MC: Ver: 3.0.0
> > AMD64 EDAC driver v3.4.0
> > EDAC amd64: DRAM ECC enabled.
> > EDAC amd64: NB MCE bank disabled, set MSR 0x0000017b[4] on node 0 to enable.
> > EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not
> > load.
> > AMD64 EDAC driver v3.4.0
> > EDAC amd64: DRAM ECC enabled.
> > EDAC amd64: NB MCE bank disabled, set MSR 0x0000017b[4] on node 0 to enable.
> > EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not
> > load.
>
> Afaict the driver as is simply can't work in a Xen Dom0; it needs
> enabling (read: para-virtualizing). I'm actually glad to see it doesn't
> load (the worse alternative would be for it to load and then do the
> wrong thing or give you a false sense of safety of your data).

I'm unsure of how to evaluate the situation. Since ECC is enabled in the
BIOS, data should be safe whether or not the EDAC driver loads. I
/suspect/ the EDAC driver failing to load merely means reporting of ECC
errors won't happen.

I suspect the only paravirtualization needed is to map the physical
address of the soft|hard errors to which VM's memory range was affected.
What this affects is which VM should panic in case of hard errors.
Depending upon the environment there may or may not be cause to report
soft errors anywhere besides Dom0. In most cases a soft error will at
worst trigger a desire to replace the memory module, but not trigger a
panic for the affected VM. It is only once a hard error occurs that it
is urgent to warn the affected VM and cause a panic; in this case it may
also be desirable to first alert Dom0 anyway.

As such I'm inclined to think force-enabling ECC EDAC monitoring in Dom0
is the best approach for now. As long as a hard error doesn't occur in
Dom0's address range, Dom0 is in the best position to deal with the
situation. The worst case is a hard error occurring in Xen's address
range, since that will mean all VMs on the machine are likely to be
toast.

I think this should be a fairly high priority for Xen since ECC memory is
a feature very common on systems running with a hypervisor.

-- 
(\___(\___(\______ --=> 8-) EHM <=-- ______/)___/)___/)
 \BS ( | EHeM+sigmsg at m5p.com PGP 87145445 | ) /
  \_CS\ | _____ -O #include <stddisclaimer.h> O- _____ | / _/
8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445
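
A minimal sketch of the handling policy described in the message above, assuming only two error classes (soft, meaning corrected by ECC, and hard, meaning uncorrected) and that the owner of the failing memory is already known. The names `Severity`, `route_error` and the action strings are hypothetical and exist only for illustration; nothing here is Xen or Linux code.

```python
from enum import Enum


class Severity(Enum):
    SOFT = "corrected"      # ECC corrected the error
    HARD = "uncorrected"    # ECC could not correct the error


XEN = "Xen"
DOM0 = "Dom0"


def route_error(severity, owner):
    """Return the ordered list of actions for an ECC error in `owner`'s memory.

    Hypothetical policy: Dom0 is always told first; only a hard error
    escalates, and how it escalates depends on who owns the memory.
    """
    actions = ["log in Dom0"]
    if severity is Severity.SOFT:
        # A soft error at worst motivates replacing the module.
        return actions
    if owner == XEN:
        # A hard error in the hypervisor's own range is the worst case.
        actions.append("assume all VMs are lost")
    else:
        actions.append(f"panic {owner}")
    return actions


if __name__ == "__main__":
    print(route_error(Severity.SOFT, "domU-3"))
    print(route_error(Severity.HARD, "domU-3"))
    print(route_error(Severity.HARD, XEN))
```

The sketch leaves open whether soft errors should also be forwarded to the affected guest, which the message treats as environment-dependent.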
Jan Beulich
2017-May-16 09:54 UTC
[Pkg-xen-devel] Bug#810964: [Xen-devel] [BUG] EDAC infomation partially missing
>>> On 16.05.17 at 05:47, <ehem+debian at m5p.com> wrote:
> On Mon, May 15, 2017 at 02:02:53AM -0600, Jan Beulich wrote:
>> >>> On 14.05.17 at 00:36, <ehem+debian at m5p.com> wrote:
>> > I haven't yet done as much experimentation as Andreas Pflug has, but I
>> > can confirm I'm also running into this bug with Xen 4.4.1.
>> >
>> > I've only tried Linux kernel 3.16.43, but as Dom0:
>> >
>> > EDAC MC: Ver: 3.0.0
>> > AMD64 EDAC driver v3.4.0
>> > EDAC amd64: DRAM ECC enabled.
>> > EDAC amd64: NB MCE bank disabled, set MSR 0x0000017b[4] on node 0 to enable.
>> > EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not load.
>> > AMD64 EDAC driver v3.4.0
>> > EDAC amd64: DRAM ECC enabled.
>> > EDAC amd64: NB MCE bank disabled, set MSR 0x0000017b[4] on node 0 to enable.
>> > EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not
>> > load.
>>
>> Afaict the driver as is simply can't work in a Xen Dom0; it needs
>> enabling (read: para-virtualizing). I'm actually glad to see it doesn't
>> load (the worse alternative would be for it to load and then do the
>> wrong thing or give you a false sense of safety of your data).
>
> I'm unsure of how to evaluate the situation. Since ECC is enabled in the
> BIOS, data should be safe whether or not the EDAC driver loads. I
> /suspect/ the EDAC driver failing to load merely means reporting of ECC
> errors won't happen.

"Merely" being relative here: The missing reports mean a false feeling
of safety, as they may be early indications of later double-bit errors.

> I suspect the only paravirtualization needed is to
> map the physical address of the soft|hard errors to which VM's memory
> range was affected. What this affects is which VM should panic in case
> of hard errors.

Which in turn obviously requires hypervisor interaction. It's not really
clear to me whether perhaps the driver would better live in the
hypervisor in the first place for that reason.

And there's a second piece of paravirtualization needed: The driver
doesn't distinguish physical and machine address spaces, yet the
addresses reported by hardware are machine ones and hence would
generally need translation to physical ones in order to assign
Dom0-local meaning to them (or to determine that the address belongs to
another VM or the hypervisor).

Jan
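
The translation step described above can be sketched as follows, assuming 4 KiB pages and a flat lookup table from machine frame number to owner and guest-physical frame number. The table `M2P`, its contents, and the function `translate` are hypothetical stand-ins chosen for illustration; they are not Xen's actual data structures or interfaces.

```python
PAGE_SHIFT = 12  # assumption: 4 KiB pages

# Hypothetical ownership table:
# machine frame number -> (owner, guest-physical frame number or None)
M2P = {
    0x1A2B3: ("Dom0", 0x00010),
    0x1A2B4: ("domU-3", 0x00400),
    0x00080: ("Xen", None),  # hypervisor memory has no guest-physical view
}


def translate(machine_addr, m2p=M2P):
    """Map a machine address to (owner, owner-local physical address).

    Returns (None, None) if the frame is unknown, and (owner, None) if the
    frame has an owner but no guest-physical address.
    """
    mfn = machine_addr >> PAGE_SHIFT
    offset = machine_addr & ((1 << PAGE_SHIFT) - 1)
    owner, gfn = m2p.get(mfn, (None, None))
    if owner is None:
        return None, None
    if gfn is None:
        return owner, None
    return owner, (gfn << PAGE_SHIFT) | offset


if __name__ == "__main__":
    owner, addr = translate(0x1A2B35C8)
    print(owner, hex(addr))
```

An unmodified driver in Dom0 would in effect skip this lookup and treat the machine address as if it were already Dom0-local, which is the problem the message points out.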
Andrew Cooper
2017-May-16 10:08 UTC
[Pkg-xen-devel] Bug#810964: [Xen-devel] [BUG] EDAC infomation partially missing
On 16/05/17 10:54, Jan Beulich wrote:
>>>> On 16.05.17 at 05:47, <ehem+debian at m5p.com> wrote:
>> On Mon, May 15, 2017 at 02:02:53AM -0600, Jan Beulich wrote:
>>>>>> On 14.05.17 at 00:36, <ehem+debian at m5p.com> wrote:
>>>> I haven't yet done as much experimentation as Andreas Pflug has, but I
>>>> can confirm I'm also running into this bug with Xen 4.4.1.
>>>>
>>>> I've only tried Linux kernel 3.16.43, but as Dom0:
>>>>
>>>> EDAC MC: Ver: 3.0.0
>>>> AMD64 EDAC driver v3.4.0
>>>> EDAC amd64: DRAM ECC enabled.
>>>> EDAC amd64: NB MCE bank disabled, set MSR 0x0000017b[4] on node 0 to enable.
>>>> EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not load.
>>>> AMD64 EDAC driver v3.4.0
>>>> EDAC amd64: DRAM ECC enabled.
>>>> EDAC amd64: NB MCE bank disabled, set MSR 0x0000017b[4] on node 0 to enable.
>>>> EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not
>>>> load.
>>> Afaict the driver as is simply can't work in a Xen Dom0; it needs
>>> enabling (read: para-virtualizing). I'm actually glad to see it doesn't
>>> load (the worse alternative would be for it to load and then do the
>>> wrong thing or give you a false sense of safety of your data).
>> I'm unsure of how to evaluate the situation. Since ECC is enabled in the
>> BIOS, data should be safe whether or not the EDAC driver loads. I
>> /suspect/ the EDAC driver failing to load merely means reporting of ECC
>> errors won't happen.
> "Merely" being relative here: The missing reports mean a false feeling
> of safety, as they may be early indications of later double-bit errors.
>
>> I suspect the only paravirtualization needed is to
>> map the physical address of the soft|hard errors to which VM's memory
>> range was affected. What this affects is which VM should panic in case
>> of hard errors.
> Which in turn obviously requires hypervisor interaction. It's not really
> clear to me whether perhaps the driver would better live in the
> hypervisor in the first place for that reason.

The driver should probably live directly in Xen; it needs to program a
number of northbridge and CPU registers including interrupt information.

For the reporting side of things, it looks like it would require vMCE to
pass on fault information to guests.

~Andrew
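
As one concrete instance of the register programming mentioned above, the log quoted earlier in the thread asks for bit 4 of MSR 0x0000017b to be set. A minimal sketch of that bit test follows; it operates on a plain integer and does not read or write any hardware register. The constant and function names are hypothetical, and the sample values are made up for illustration.

```python
MSR_ADDR = 0x17B   # MSR number taken from the quoted log line
NB_BANK_BIT = 4    # bit index taken from the same log line


def nb_mce_bank_enabled(msr_value):
    """Return True if bit 4 is set in a value read from the MSR."""
    return bool((msr_value >> NB_BANK_BIT) & 1)


def with_nb_bank_enabled(msr_value):
    """Return the value with bit 4 set and all other bits unchanged."""
    return msr_value | (1 << NB_BANK_BIT)


if __name__ == "__main__":
    sample = 0x0F  # illustrative value only, not taken from real hardware
    print(nb_mce_bank_enabled(sample))
    print(hex(with_nb_bank_enabled(sample)))
```

Whether a Dom0 kernel sees or can change the real register contents under Xen is exactly what the thread leaves in question, so the sketch says nothing about where the value comes from.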
Elliott Mitchell
2017-May-16 18:02 UTC
[Pkg-xen-devel] Bug#810964: [Xen-devel] [BUG] EDAC infomation partially missing
On Tue, May 16, 2017 at 03:54:37AM -0600, Jan Beulich wrote:
> >>> On 16.05.17 at 05:47, <ehem+debian at m5p.com> wrote:
> > I suspect the only paravirtualization needed is to
> > map the physical address of the soft|hard errors to which VM's memory
> > range was affected. What this affects is which VM should panic in case
> > of hard errors.
>
> Which in turn obviously requires hypervisor interaction. It's not really
> clear to me whether perhaps the driver would better live in the
> hypervisor in the first place for that reason.
>
> And there's a second piece of paravirtualization needed: The driver
> doesn't distinguish physical and machine address spaces, yet the
> addresses reported by hardware are machine ones and hence would
> generally need translation to physical ones in order to assign
> Dom0-local meaning to them (or to determine that the address belongs to
> another VM or the hypervisor).

Merely reporting the machine address to Dom0 is already high value since
it lets you attribute the failure to a memory module. Without that you
may have a VM or whole machine randomly crash for a completely unknown
reason.

-- 
(\___(\___(\______ --=> 8-) EHM <=-- ______/)___/)___/)
 \BS ( | EHeM+sigmsg at m5p.com PGP 87145445 | ) /
  \_CS\ | _____ -O #include <stddisclaimer.h> O- _____ | / _/
8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445
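
A minimal sketch of attributing a machine address to a memory module, assuming each module covers one contiguous machine address range. That assumption is a deliberate simplification: real memory controllers may interleave addresses across channels and ranks, so this is not how the EDAC driver decodes addresses. The table `DIMM_RANGES`, the module labels, and the function `dimm_for` are hypothetical.

```python
# Hypothetical layout: (start, end_exclusive, module label)
DIMM_RANGES = [
    (0x0_0000_0000, 0x1_0000_0000, "DIMM_A1"),  # first 4 GiB
    (0x1_0000_0000, 0x2_0000_0000, "DIMM_A2"),  # second 4 GiB
]


def dimm_for(machine_addr, ranges=DIMM_RANGES):
    """Return the label of the module whose range holds the address, or None."""
    for start, end, label in ranges:
        if start <= machine_addr < end:
            return label
    return None


if __name__ == "__main__":
    print(dimm_for(0x1A2B35C8))
    print(dimm_for(0x1_8000_0000))
    print(dimm_for(0x3_0000_0000))
```

Note that the lookup needs only the machine address, which is why reporting it to Dom0 is useful even without any translation to guest-physical addresses.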