Diederik de Haas
2021-Sep-19 17:10 UTC
[Pkg-xen-devel] Bug#991967: Simply ACPI powerdown/reset issue?
Adding pkg-xen-devel at lists.alioth.debian.org into the loop.

Chuck Zmudzinski replied to the bug and later replied to his own reply. To give full context, I've added the original reply in full and Chuck's reply to that (as it only quoted part of the context there).

On Sunday 19 September 2021 07:05:56 CEST Chuck Zmudzinski wrote:
> On Sat, 11 Sep 2021 13:29:12 +0200 Salvatore Bonaccorso <carnil at debian.org> wrote:
> > Hi Elliott,
> >
> > On Fri, Sep 10, 2021 at 06:47:12PM -0700, Elliott Mitchell wrote:
> > > An experiment led to a potential alternative explanation for #991967.
> > > The issue may be ACPI (non-UEFI) powerdown/reset was broken at
> > > 4.19.194-3. Presence of Xen on the system may be unrelated.
> > >
> > > Failing that, it could be Xen and non-UEFI systems are affected. (Xen
> > > was tried on a UEFI system and the issue wasn't observed)
> >
> > Following up on https://bugs.debian.org/991967#12
> >
> > Did you succeed in bisecting the issue as you seem to have it
> > reproducible?
> >
> > Regards,
> > Salvatore
>
> Hello Elliott and Salvatore,
>
> I noticed this bug on bullseye ever since I have been
> running bullseye as a dom0, but my testing indicates
> there is no problem with src:linux but the problem
> appeared in src:xen with the 4.14 version of xen on
> bullseye.
>
> I ask Elliott if you are only seeing the problem on Debian's
> xen-4.14 hypervisor? Also, which architecture, arm or
> amd64?
> I only see the problem on the Debian xen-4.14
> hypervisor, and I have only tested on amd64, and I
> have found a fix for my amd64 system which is as
> follows:
>
> Motherboard: ASRock B85M Pro4, BIOS P2.50 12/11/2015,
> with a Haswell CPU (core i5-4590S)
>
> xen hypervisor version: 4.14.2+25-gb6a8c4f72d-2, amd64
>
> linux kernel version: 5.10.46-4 (the current amd64 kernel
> for bullseye)
>
> Boot system: EFI, not using secure boot, booting xen
> hypervisor and dom0 bullseye with grub-efi package for
> bullseye, and it boots the xen-4.14-amd64.gz file, not
> the xen-4.14-amd64.efi file.
>
> I also tested a buster dom0 with the 4.19 series kernel
> on the xen-4.14 hypervisor from bullseye and saw the
> problem, but I did not see the problem with either
> a buster (linux 4.19) or bullseye (linux 5.10) dom0 on
> the xen-4.11 hypervisor, so I think the problem is
> with the Debian version of the xen-4.14 hypervisor,
> not with src:linux.
>
> I also found a fix in src:xen:
>
> I noticed the series of patches in debian/patches of the
> 4.14.2+25-gb6a8c4f72d-2 version of src:xen (and
> earlier versions of xen-4.14 on Debian) have several patches
> backported from the unstable branch of xen upstream. By
> removing some of these patches from the patches
> series of the src:xen package, the dom0 shuts down
> as expected on my ASRock Haswell motherboard.
>
> I rebuilt the src:xen package after removing the following
> patches from the debian/patches series and the result
> was that the computer shuts down as expected if I boot
> using the patched hypervisor:
>
> 0027-xen-rpi4-implement-watchdog-based-reset.patch
> 0028-tools-python-Pass-linker-to-Python-build-process.patch
> 0029-xen-arm-acpi-Don-t-fail-if-SPCR-table-is-absent.patch
> 0030-xen-acpi-Rework-acpi_os_map_memory-and-acpi_os_unmap.patch
> 0031-xen-arm-acpi-The-fixmap-area-should-always-be-cleare.patch
> 0032-xen-arm-Check-if-the-platform-is-not-using-ACPI-befo.patch
> 0033-xen-arm-Introduce-fw_unreserved_regions-and-use-it.patch
> 0034-xen-arm-acpi-add-BAD_MADT_GICC_ENTRY-macro.patch
> 0035-xen-arm-traps-Don-t-panic-when-receiving-an-unknown-.patch
>
> Most of these patches seem unrelated to the amd64
> architecture and instead affect the arm architecture, and
> removing all these patches is probably more than is needed to
> fix this bug, but I removed them all because I could not find
> them upstream on the 4.14 branch but instead only saw them
> on the xen unstable branch upstream (I did not check if they are
> on the 4.15 branch upstream), and I wanted to test
> a true upstream 4.14 version without these seemingly
> aggressive patches added by Debian from the unstable
> branch of xen upstream, and I discovered by being
> more conservative and not adding these patches from the
> unstable branch upstream fixed the problem!
>
> I suspect the following patch is the culprit for problems
> shutting down on the amd64 architecture:
>
> 0030-xen-acpi-Rework-acpi_os_map_memory-and-acpi_os_unmap.patch
>
> The commit log for this patch states:
>
> From: Julien Grall <jgrall at amazon.com>
> Date: Sat, 26 Sep 2020 17:44:29 +0100
> Subject: xen/acpi: Rework acpi_os_map_memory() and acpi_os_unmap_memory()
>
> The functions acpi_os_{un,}map_memory() are meant to be arch-agnostic
> while the __acpi_os_{un,}map_memory() are meant to be arch-specific.
>
> Currently, the former are still containing x86 specific code.
>
> To avoid this rather strange split, the generic helpers are reworked so
> they are arch-agnostic. This requires the introduction of a new helper
> __acpi_os_unmap_memory() that will undo any mapping done by
> __acpi_os_map_memory().
>
> Currently, the arch-helper for unmap is basically a no-op so it only
> returns whether the mapping was arch specific. But this will change
> in the future.
>
> Note that the x86 version of acpi_os_map_memory() was already able to
> able the 1MB region. Hence why there is no addition of new code.
>
> Signed-off-by: Julien Grall <jgrall at amazon.com>
> Reviewed-by: Rahul Singh <rahul.singh at arm.com>
> Reviewed-by: Jan Beulich <jbeulich at suse.com>
> Acked-by: Stefano Stabellini <sstabellini at kernel.org>
> Tested-by: Rahul Singh <rahul.singh at arm.com>
> Tested-by: Elliott Mitchell <ehem+xen at m5p.com>
> (cherry picked from commit 1c4aa69ca1e1fad20b2158051eb152276d1eb973)
> ---------------------------------------------------
>
> This patch does affect amd64 acpi code, and is probably causing
> the problem on my amd64 system, so my build of the xen-4.14
> hypervisor without this patch fixed the problem.
>
> I think this bug should be re-classified as a bug in src:xen.
>
> I also would inquire with the Debian Xen Team about why they
> are backporting patches from the upstream xen unstable
> branch into Debian's 4.14 package that is currently shipping
> on Debian stable (bullseye). IMHO, the aforementioned
> patches that are not in the stable 4.14 branch upstream
> should not be included in the xen package for Debian stable.
>
> Regards,
>
> Chuck Zmudzinski

On Sunday 19 September 2021 14:44:01 CEST Chuck Zmudzinski wrote:
> As a follow-up to my last comment on this bug, the
> problems I see with my bullseye amd64 dom0 point to
> problems with ACPI powerdown/reset issue, but only on
> the Debian version of Xen-4.14.
> I do not see the problem
> on any version of the linux kernel, neither on bare metal
> nor on the Debian version of the Xen-4.11 hypervisor
> from buster. For example, the problem manifests itself
> on the Debian Xen-4.14 hypervisor with the Debian
> dom0 reaching the systemd power off target but the
> power does not actually turn off. Moreover, I can only
> recover by manually resetting the computer by pressing
> the physical reset button on the computer or removing
> power by physically unplugging the computer.
>
> One slight difference I see from what Elliott reported -
> not only does the power supply remain powered after
> shutdown, but also messages on the console about
> powering down remain on the display monitor after
> reaching the systemd power down target and power
> to the display/monitor also persists.
>
> For my amd64 system, this bug would be probably fixed
> on Debian stable by having a separate Xen-4.14 package
> for Debian stable that removes at least the following
> patches from the debian/patches series of the current
> Xen-4.14 package for stable:
>
> 0027-xen-rpi4-implement-watchdog-based-reset.patch
> 0029-xen-arm-acpi-Don-t-fail-if-SPCR-table-is-absent.patch
> 0030-xen-acpi-Rework-acpi_os_map_memory-and-acpi_os_unmap.patch
> 0031-xen-arm-acpi-The-fixmap-area-should-always-be-cleare.patch
> 0032-xen-arm-Check-if-the-platform-is-not-using-ACPI-befo.patch
> 0033-xen-arm-Introduce-fw_unreserved_regions-and-use-it.patch
> 0034-xen-arm-acpi-add-BAD_MADT_GICC_ENTRY-macro.patch
> 0035-xen-arm-traps-Don-t-panic-when-receiving-an-unknown-.patch
>
> The 0028-tools-python-Pass-linker-to-Python-build-process.patch
> is probably not related to this bug, but I have not verified that
> the bug is fixed without removing that patch also. I would
> defer to more knowledgeable people about the problems with
> building Xen on Debian using various versions of python to decide
> whether or not to remove the 0028-tools-python... patch.
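
For anyone who wants to repeat the experiment, the patch-removal step described in the quoted message could be sketched roughly as below. This is only an illustration, not the procedure Chuck actually used. It assumes debian/patches/series is a plain quilt-style series file with one patch name per line (optionally followed by options such as -p1). The helper names filter_series and rewrite_series are hypothetical; the patch names are copied from the quoted list.

```python
from pathlib import Path

# Patch names copied from the list quoted above (0028 is deliberately kept).
PATCHES_TO_DROP = {
    "0027-xen-rpi4-implement-watchdog-based-reset.patch",
    "0029-xen-arm-acpi-Don-t-fail-if-SPCR-table-is-absent.patch",
    "0030-xen-acpi-Rework-acpi_os_map_memory-and-acpi_os_unmap.patch",
    "0031-xen-arm-acpi-The-fixmap-area-should-always-be-cleare.patch",
    "0032-xen-arm-Check-if-the-platform-is-not-using-ACPI-befo.patch",
    "0033-xen-arm-Introduce-fw_unreserved_regions-and-use-it.patch",
    "0034-xen-arm-acpi-add-BAD_MADT_GICC_ENTRY-macro.patch",
    "0035-xen-arm-traps-Don-t-panic-when-receiving-an-unknown-.patch",
}


def filter_series(series_text, drop):
    """Return the series file content without the entries named in `drop`.

    Assumption: each entry is a patch file name, optionally followed by
    whitespace-separated options. Comments and blank lines are kept.
    """
    kept = []
    for line in series_text.splitlines():
        fields = line.split()
        if fields and fields[0] in drop:
            continue  # skip this patch entry
        kept.append(line)
    return "".join(line + "\n" for line in kept)


def rewrite_series(path="debian/patches/series", drop=PATCHES_TO_DROP):
    """Rewrite the series file in place (hypothetical helper, not called here)."""
    series = Path(path)
    series.write_text(filter_series(series.read_text(), drop))
```

After rewriting the series file, the package would still have to be rebuilt in the usual way (for example with dpkg-buildpackage) and the resulting hypervisor booted to see whether the machine powers off.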
>
> I think perhaps the aforementioned patches to xen/arm and
> xen/acpi would be suitable for testing a Debian Xen package
> targeting bookworm/testing or sid/unstable, but not for
> Debian bullseye/stable. As it is now, it appears the Debian Xen
> Team is not making any distinction between stable, testing,
> and unstable for its current Xen-4.14 package, and IMHO
> that is the root cause of this bug on Debian stable.
>
> If the Debian Xen Team wants to experiment with patches
> from the unstable branch of upstream Xen on a Debian
> version of Xen-4.14, I respectfully ask that it do so only on
> bookworm/testing or unstable/sid and ship a separate
> more conservative package for bullseye/stable that is
> closer to the official upstream Xen 4.14.x version than
> the package that is currently shipping on bullseye/stable.

I don't have an opinion on whether the analysis is correct. What I can say is that after I upgraded my server to Testing, which is now Bullseye, my server running Xen (4.14) no longer powers off. It seems to complete the whole shutdown procedure successfully, but it does not switch the machine off. I have an iKVM module in my server with which I can forcefully shut it off (remotely), and I use that as a workaround.

I upgraded the whole machine back then, so there were a LOT of potential causes. As the machine is off most of the time and I didn't/don't know how to debug or bisect the issue, I resorted to my workaround. But it is a workaround, and a regression from how the machine behaved when it ran Buster.

Cheers,
  Diederik