Andreas Pflug
2016-Jan-14 09:39 UTC
[Pkg-xen-devel] Bug#810964: only partial EDAC information with Xen
Package: xen-hypervisor-4.4-amd64
Version: 4.4.1-9+deb8u3

Debian 8.2 installed on a Supermicro H8SGL board, AMD 6128 with 4x4GB ECC RAM.

When booting the plain kernel (stock Jessie 3.16 or backport 4.1 or 4.3), both memory controllers (mc0 and mc1) appear under /sys/devices/system/edac/mc with two csrow* each, as expected. The same happens when booted with Xen 4.1.4-3+deb7u1.

When booted with Xen 4.4.1, only mc1 with two RAM modules is visible, although all 16GB RAM is available in the OS (xl info).
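The comparison described in the report (which mc* controllers exist, and how many csrow* entries each one has) can be captured with a short script and run once per boot configuration. This is only a minimal sketch: the function name `list_edac_controllers` is made up for illustration, and it assumes nothing beyond the sysfs path and the mc*/csrow* naming given in the report.

```python
from pathlib import Path


def list_edac_controllers(edac_root="/sys/devices/system/edac/mc"):
    """Return {controller_name: [csrow_names]} for every mc* directory found.

    Hypothetical helper: it only walks the directory layout named in the
    report (mcN directories containing csrowN directories).
    """
    root = Path(edac_root)
    controllers = {}
    if not root.is_dir():
        # EDAC driver not loaded, or sysfs not available at this path.
        return controllers
    for mc in sorted(root.glob("mc[0-9]*")):
        if mc.is_dir():
            controllers[mc.name] = sorted(
                p.name for p in mc.glob("csrow[0-9]*") if p.is_dir()
            )
    return controllers


if __name__ == "__main__":
    for name, csrows in list_edac_controllers().items():
        print(f"{name}: {len(csrows)} csrow(s): {', '.join(csrows)}")
```

Saving the output under the plain kernel, under Xen 4.1.4 and under Xen 4.4.1 would give a directly comparable record of which controllers went missing.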
Ian Campbell
2016-Jan-20 11:33 UTC
[Pkg-xen-devel] Bug#810964: Bug#810964: only partial EDAC information with Xen
On Thu, 2016-01-14 at 10:39 +0100, Andreas Pflug wrote:
> Package: xen-hypervisor-4.4-amd64
> Version: 4.4.1-9+deb8u3
>
> Debian 8.2 installed on a supermicro H8SGL Board, AMD 6128 with 4x4GB
> ECC RAM.
>
> When booting the plain kernel (stock Jessie 3.16 or backport 4.1 or
> 4.3), both memory controllers (mc0 and mc1) appear under
> /sys/devices/system/edac/mc with two csrow* each as expected. Same
> happens, when booted with Xen 4.1.4-3+deb7u1.

Thanks for the report.

I think this is as likely to be a dom0/domU kernel issue as a hypervisor one, but I'm not sure. Would you mind reporting this to upstream per:

    http://wiki.xen.org/wiki/Reporting_Bugs_against_Xen

please. We could forward it but I expect there will need to be some back and forth with the maintainers so it makes sense for you to speak to them directly. You can CC 810964 at bugs.debian.org to keep this bug in the loop.

In addition to the information you provide here I would expect upstream to want to see the full "xl dmesg" and "dmesg" from dom0 with and without Xen and with 4.1.4 as well as something exhibiting the issue (4.6 is probably best).

Thanks,
Ian.
Elliott Mitchell
2019-Feb-09 05:37 UTC
[Pkg-xen-devel] Bug#810964: EDAC bug #810964 effects 4.8 too
I'm seeing bug #810964 occur in Xen 4.8 as well. Perhaps #810964 should be reassigned to xen-hypervisor-common or src:xen?

I don't know whether it affects Xen 4.11 yet...

--
(\___(\___(\______ --=> 8-) EHM <=-- ______/)___/)___/)
\BS ( | ehem+sigmsg at m5p.com PGP 87145445 | ) /
\_CS\ | _____ -O #include <stddisclaimer.h> O- _____ | / _/
8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445
Hans van Kranenburg
2019-Feb-11 23:11 UTC
[Pkg-xen-devel] Bug#810964: Bug#810964: EDAC bug #810964 effects 4.8 too
Hi,

On 2/9/19 6:37 AM, Elliott Mitchell wrote:
> I'm seeing bug #810964 occur in Xen 4.8 as well. Perhaps #810964
> should be reassigned to xen-hypervisor-common or src:xen ?
>
> I don't know whether it effects Xen 4.11 yet...

Since the issue seems to be a lack of functionality to support certain hardware in the upstream Xen product, I would recommend not having a bug open against Debian at all.

It's not that I don't value your use case, but I just think that as a package maintainer team that ships the currently released upstream software, *we* cannot be of any value to you for this, sitting in between you and upstream developers.

Sometimes we can work around some things by tweaking the build scripts or other things, but I just need to be honest here about the fact that we will not be able to help you get low-level hypervisor features implemented.

This means you will have to do things like hop on the upstream development mailing list, build a reproducible failure case, search for a developer that has similar hardware and wants to spend time on it, donate hardware to someone to reproduce the error scenarios or learn how to do it yourself, or whatever it takes. :)

Hans
Elliott Mitchell
2019-Feb-13 06:36 UTC
[Pkg-xen-devel] Bug#810964: Bug#810964: EDAC bug #810964 effects 4.8 too
On Tue, Feb 12, 2019 at 12:11:11AM +0100, Hans van Kranenburg wrote:
> This means you will have to do things like hop on the upstream
> development mailing list, build a reproducable failure case, search for
> a developer that has similar hardware and wants to spend time on it,
> donate hardware to someone to reproduce the error scenarios or learn how
> to do it yourself, or whatever it takes. :)

I had hopes of avoiding doing such. Problem is there are so many pieces of software I have to use that jumping on the mailing lists of each of them would be akin to trying to read all of Usenet. I may not be able to avoid that here, but...

Looks like Xen's MCE support is in near-useless shape. The code in the git repository mentions documentation for family 10h; problem is that is almost entirely decade-old processors. The last apparently significant change was in 2014. The copyright is to AMD, so I guess that means they need more funding. Looks like Intel has been offering more support to Xen. :-(

I'm surprised at Xen's handling of MCE. Given Xen's approach to things I would expect MCE handling to be done more by Domain 0. Let Domain 0 handle talking to the memory controller and merely have Xen map the physical address to a domain and domain address. Domain 0 can log all correctable memory errors to a single location, and in case of an uncorrectable error it can panic the machine. (Plus Linux's MCE support is in better shape.)

Handling MCE errors in non-Domain 0 only seems to make sense in HVM where you want to simulate memory errors.

--
(\___(\___(\______ --=> 8-) EHM <=-- ______/)___/)___/)
\BS ( | ehem+sigmsg at m5p.com PGP 87145445 | ) /
\_CS\ | _____ -O #include <stddisclaimer.h> O- _____ | / _/
8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445
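The division of labour proposed in the message above (Domain 0 decodes the error, Xen only translates a physical address into a domain and a domain address, Domain 0 logs correctable errors and panics on uncorrectable ones) could be modelled roughly as follows. Everything here is hypothetical and purely illustrative: `Region`, `lookup` and `handle_memory_error` are invented names, and nothing below corresponds to an existing Xen or Linux interface.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Region:
    """A contiguous range of machine memory owned by one domain (hypothetical)."""
    start: int       # first machine address in the range
    size: int        # length of the range in bytes
    domain: int      # ID of the owning domain
    guest_base: int  # domain address that `start` corresponds to


def lookup(regions: List[Region], machine_addr: int) -> Optional[Tuple[int, int]]:
    """Stand-in for the hypervisor query: machine address -> (domain, domain address)."""
    for r in regions:
        if r.start <= machine_addr < r.start + r.size:
            return r.domain, r.guest_base + (machine_addr - r.start)
    return None  # address not assigned to any domain


def handle_memory_error(regions, machine_addr, correctable, log):
    """Domain 0 policy from the message: log correctable errors, panic otherwise."""
    if correctable:
        # Single central log of (machine address, owner) pairs.
        log.append((machine_addr, lookup(regions, machine_addr)))
        return "logged"
    return "panic"
```

In this model the hypervisor side is reduced to one address translation; all decoding and policy sits with Domain 0, which is the point the message argues for.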
Elliott Mitchell
2019-Feb-22 05:11 UTC
[Pkg-xen-devel] Bug#810964: [Xen-devel] MCE/EDAC Status/Updating?
On Mon, Feb 18, 2019 at 02:37:48AM -0700, Jan Beulich wrote:
> >>> On 18.02.19 at 09:42, <ehem+xen at m5p.com> wrote:
> > On Mon, Feb 18, 2019 at 01:12:16AM -0700, Jan Beulich wrote:
> >> >>> On 15.02.19 at 19:20, <ehem+xen at m5p.com> wrote:
> >> > On Fri, Feb 15, 2019 at 03:58:49AM -0700, Jan Beulich wrote:
> >> >> Well, Fam10 is mentioned explicitly, but as per the use of e.g.
> >> >> mcheck_amd_famXX newer ones are supported by this code
> >> >> as well.
> >> >
> >> > In that case sometime between Xen 4.1 and Xen 4.4, the AMD MCE/EDAC code
> >> > was completely broken and hasn't been fixed.
> >>
> >> I can't say I'm surprised, but details of the breakage would still
> >> be appreciated.
> >
> > Originally noticed with Debian: https://bugs.debian.org/810964
> >
> > Original observer noticed that half the memory controllers were missing
> > from Linux's Domain-0 dmesg with Xen 4.4. EDAC capability flags are
> > missing with Xen 4.4.
>
> And I had been commenting in this bug. I don't recall technical data
> ever having emerged on the list here as to what is really going on,
> and what the root of this perceived regression is.

I've been having an interesting time trying to figure out where to look to find appropriate information. I'm thinking Debian's default Xen log level is a little too high. Adding "loglvl=info" doesn't put all that much in Xen's dmesg. I'm suspecting "mce_verbosity=verbose" may be a different story though.

"loglvl=info" gets me "AMD Fam10h machine check reporting enabled", so it looks like Xen is successfully getting its MCE support operational.

Taking a closer look at Dom0's dmesg though:

    MCE: In-kernel MCE decoding enabled.
    EDAC amd64: DRAM ECC enabled.
    EDAC amd64: NB MCE bank disabled, set MSR 0x0000017b[4] on node 0 to enable.
    EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not load.
     Either enable ECC checking or force module loading by setting 'ecc_enable_override'.
     (Note that use of the override may cause unknown side effects.)

So it seems Linux wants bit 4 of MSR_IA32_MCG_CTL set before it will willingly enable MCE support (I've no idea what this does). This was done in commit b272353fe98db5bdc73fff3c60a0574835df4c87.

> > I'd been working with a processor Linux was reporting as
> > "cpu family : 16" (ah yes "10h", that funky olde way of refering to
> > things) and noticing Linux's EDAC support failing on kernel start. In
> > which case the EDAC support on AMD processors was completely broken
> > between 4.1 and 4.4 (hadn't realized that processor was just old enough
> > to be interesting).
>
> While there's a relation, I think we need to keep #MC handling
> and EDAC separate here: The latter lives entirely in Dom0. And
> as said in the Debian bug, at least back at the time there was no
> reason to believe the driver would work on Xen other than by
> accident.

True, they might have merely been noticed at the same time and in fact be two distinct issues. Having EDAC reporting broken is *very* bad.

I am left though noticing how the state of Xen's EDAC support looks rather odd compared with how other bits of Xen are evolving. Rather than going more in a direction of para-virtualization, this code looks to be heading more towards true virtualization.

A more PV type approach might be to let Dom0 handle decoding the machine check registers. Then Dom0 asks Xen for what is at physical address X, then potentially turns this into a PV message to the appropriate domain and potentially logs the event.

Such an approach could be used to synthesize machine check events for testing VMs. Qemu would then need code to simulate the appropriate register values for an HVM.

--
(\___(\___(\______ --=> 8-) EHM <=-- ______/)___/)___/)
\BS ( | ehem+sigmsg at m5p.com PGP 87145445 | ) /
\_CS\ | _____ -O #include <stddisclaimer.h> O- _____ | / _/
8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445
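The "set MSR 0x0000017b[4]" hint in the Dom0 dmesg quoted in this message refers to bit 4 of the value held in MSR 0x17b, the register the message calls MSR_IA32_MCG_CTL. As a minimal sketch, the bit test can be written as below. The function names are invented for illustration. `read_msr` assumes the Linux `msr` module is loaded and root access to /dev/cpu/N/msr, and the value Dom0 reads that way under Xen may not reflect the real hardware register.

```python
import os
import struct

MSR_IA32_MCG_CTL = 0x17B  # MSR number from the dmesg line
NB_MCE_BANK_BIT = 4       # the "[4]" in "MSR 0x0000017b[4]"


def nb_mce_bank_enabled(mcg_ctl_value: int) -> bool:
    """True if the bit the EDAC driver complains about is set."""
    return bool(mcg_ctl_value & (1 << NB_MCE_BANK_BIT))


def with_nb_mce_bank_enabled(mcg_ctl_value: int) -> int:
    """Return the value with bit 4 set, leaving all other bits unchanged."""
    return mcg_ctl_value | (1 << NB_MCE_BANK_BIT)


def read_msr(cpu: int, msr: int = MSR_IA32_MCG_CTL) -> int:
    """Read one 64-bit MSR through the msr character device.

    Assumption: the `msr` kernel module is loaded and the caller is root.
    """
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDONLY)
    try:
        # The device uses the file offset as the MSR number.
        return struct.unpack("<Q", os.pread(fd, 8, msr))[0]
    finally:
        os.close(fd)
```

For example, `nb_mce_bank_enabled(read_msr(0))` would report whether the bit is visible as set from CPU 0 in the environment where it runs.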
Elliott Mitchell
2023-Oct-02 17:18 UTC
[Pkg-xen-devel] Bug#810964: #810964 is more kernel driver than Xen
reassign 810964 src:linux
tags 810964 -moreinfo
affects 810964 src:xen
found 810964 5.10.191-1
found 810964 6.1.52-1
found 810964 6.5.3-1
found 810964 5.10.127-2~bpo10+1
found 810964 6.1.38-4~bpo11+1
found 810964 6.4.4-3~bpo12+1
quit

Upon further investigation, while some part of #810964 may be in Xen, the biggest issue is in the Linux kernel.

Appears MCE/EDAC support for Xen was implemented around 2008-2012. Since that time the maintainer has changed and the new maintainer was unaware the driver was supposed to function on Xen. As such the current maintainer has been adding in constructs which are incompatible with operation on Xen, and at 767f4b620eda overtly broke Xen support.

Part of the fix may require adjustments to Xen, but right now the immediate source of breakage is the Linux kernel. As such I'm reassigning this to src:linux.

--
(\___(\___(\______ --=> 8-) EHM <=-- ______/)___/)___/)
\BS ( | ehem+sigmsg at m5p.com PGP 87145445 | ) /
\_CS\ | _____ -O #include <stddisclaimer.h> O- _____ | / _/
8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445