Imre Szőllősi
2021-May-13 19:13 UTC
[Pkg-xen-devel] Bug#988477: xen-hypervisor-4.14-amd64: xen dmesg shows (XEN) AMD-Vi: IO_PAGE_FAULT on sata pci device
Package: src:xen
Version: 4.14.1+11-gb0b734a8b3-1
Severity: critical
Justification: causes serious data loss
X-Debbugs-Cc: debianbts at virtualzone.hu

Dear Maintainer,

after a clean install of bullseye/testing, the Xen dmesg shows the following message:

(XEN) AMD-Vi: IO_PAGE_FAULT: 0000:01:00.1 d0 addr fffffffdf8000000 flags 0x8

This is the SATA device:

01:00.1 SATA controller: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset SATA Controller (rev 01)

or, on another motherboard:

01:00.1 SATA controller: Advanced Micro Devices, Inc. [AMD] Device 43eb

During write operations (e.g. dbench or a Windows guest) there are a lot of these messages. Sometimes the filesystem goes read-only and the Windows guest crashes with a BSOD.

Tested on 3 hardware setups:

1. asus prime b450m-a, ryzen 5 2600x, md raid1, 2x samsung 1TB 860evo, lvm: problem does appear
2. asus prime b550m-k, ryzen 5 5600x, md raid1, 2x samsung 1TB 870evo, lvm: problem does appear
3. asus prime b550m-k, ryzen 5 5600x, 1x samsung 1TB 850evo, lvm: problem does not appear
3. asus prime b550m-k, ryzen 5 5600x, 1x samsung 128GB 840pro, lvm: problem does not appear
3. asus prime b550m-k, ryzen 5 5600x, samsung 1TB 850evo + samsung 128GB 840pro, lvm, dbench on 2 ssds in parallel: problem does appear

As far as I can see, the problem appears when data is written to 2 SSDs in parallel.

Thanks!

-- System Information:
Debian Release: bullseye/sid
  APT prefers testing-security
  APT policy: (500, 'testing-security'), (500, 'testing')
Architecture: amd64 (x86_64)

Kernel: Linux 5.10.0-6-amd64 (SMP w/12 CPU threads)
Locale: LANG=hu_HU.UTF-8, LC_CTYPE=hu_HU.UTF-8 (charmap=UTF-8), LANGUAGE not set
Shell: /bin/sh linked to /usr/bin/dash
Init: systemd (via /run/systemd/system)
LSM: AppArmor: enabled

xen-hypervisor-4.14-amd64 depends on no packages.

Versions of packages xen-hypervisor-4.14-amd64 recommends:
ii  xen-hypervisor-common  4.14.1+11-gb0b734a8b3-1
ii  xen-utils-4.14         4.14.1+11-gb0b734a8b3-1

xen-hypervisor-4.14-amd64 suggests no packages.

-- no debconf information
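
For anyone wanting to quantify "a lot of messages", a minimal sketch follows that counts IO_PAGE_FAULT lines per PCI device in saved hypervisor log text. The regular expression is modeled only on the single log line quoted in the report, so it is an assumption that other Xen versions print the same layout; the function name `count_faults` is made up for this illustration.

```python
import re
from collections import Counter

# Pattern modeled on the one log line quoted in the report (assumption:
# other Xen versions may format this message differently).
FAULT_RE = re.compile(
    r"\(XEN\) AMD-Vi: IO_PAGE_FAULT: "
    r"(?P<bdf>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]) "
    r"d(?P<domain>\d+) addr (?P<addr>[0-9a-f]+) flags (?P<flags>0x[0-9a-f]+)"
)


def count_faults(log_text):
    """Return a Counter mapping PCI address -> number of IO_PAGE_FAULT lines."""
    counts = Counter()
    for line in log_text.splitlines():
        match = FAULT_RE.search(line)
        if match:
            counts[match.group("bdf")] += 1
    return counts
```

One way to use it would be to save the hypervisor log to a text file first and pass the file contents to `count_faults`; the per-device totals can then go into the bug report alongside the raw log excerpt.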
Imre Szőllősi
2021-Jun-13 13:58 UTC
[Pkg-xen-devel] Bug#988477: Acknowledgement (xen-hypervisor-4.14-amd64: xen dmesg shows (XEN) AMD-Vi: IO_PAGE_FAULT on sata pci device)
I tested on a 4th hardware setup:

4. asus m4n78 pro, phenom ii x4 905e, md raid1, 2x samsung 1TB 860evo, lvm: problem does not appear

As far as I can see, not all motherboard/chipset/SATA PCIe devices are affected.

Thanks!
Hans van Kranenburg
2021-Aug-05 20:46 UTC
[Pkg-xen-devel] Bug#988477: Bug#988477: Acknowledgement (xen-hypervisor-4.14-amd64: xen dmesg shows (XEN) AMD-Vi: IO_PAGE_FAULT on sata pci device)
severity 988477 normal
tags 988477 + moreinfo + upstream - bullseye-ignore
thanks

Hi!

On 6/13/21 3:58 PM, Imre Szőllősi wrote:
> i tested on 4th hw
>
> 4. asus m4n78 pro, phenom ii x4 905e, md raid1, 2x samsung 1TB 860evo,
> lvm: problem does not appear
>
> as i see, not all mb/chipset/sata pcie device affected

Thanks for your report, and for trying out different combinations of hardware.

After doing a short internet search about the problems you're seeing while using AMD ryzen, sata, nvme and iommu, I suspect this problem does not have a lot to do with Xen specifically, but more with the hardware and its firmware. This also means that it's not a Debian packaging problem, and it cannot be fixed by me (or the Debian Xen team).

If you want to research this problem more, I can maybe be of some help by providing suggestions. Still, you will have to do all of the actual work, since I do not have your hardware here.

The first thing I would suggest is to try to reproduce the problem when booting with just Linux without Xen, and then trying the dbench test.

If you don't actually need to directly pass-through hardware to a Xen guest, you can also try disabling iommu, or researching other iommu options that can serve as a workaround.

In any case, further reports will need to have more detailed information. For example, instead of "there are a lot of messages", provide a text attachment with a piece of logging that shows these messages.

I'm tagging this bug 'moreinfo' now, since it will depend on your availability and abilities to work on it to have it advance.

Have fun,
Hans van Kranenburg
Imre Szőllősi
2021-Aug-08 13:34 UTC
[Pkg-xen-devel] Bug#988477: Bug#988477: Acknowledgement (xen-hypervisor-4.14-amd64: xen dmesg shows (XEN) AMD-Vi: IO_PAGE_FAULT on sata pci device)
An HTML attachment was scrubbed... URL: <http://alioth-lists.debian.net/pipermail/pkg-xen-devel/attachments/20210808/f65cb55f/attachment.htm>
tags 988477 - moreinfo
found 988477 4.17.2+76-ge1f9cb16e2-1~deb12u1
affects 988477 src:linux
severity 988477 critical
quit

I am also observing #988477 occur. This machine has an AMD Zen 4 processor. The first observation came when the motherboard/processor was swapped out; the older motherboard/processor was several generations old.

The pattern which is emerging is Linux MD RAID1 plus a recent AMD processor which has full IOMMU functionality. The older machine was believed to have an IOMMU, but the BIOS wasn't creating appropriate ACPI tables (IVRS) and thus Xen was unable to utilize it.

This seems to be occurring with a small percentage of write operations. Subsequent read operations appear to be fine.

I am not convinced this is a Xen bug. I suspect this is instead a bug in the Linux MD subsystem. In particular, the DMA interface may have been designed assuming only a single device would ever access any page, while the MD RAID1 driver reuses the same page for both devices.

IOMMU page release could be handled by marking the page unused in a device data structure and later removed by sweeping a table. In such a case, if the MD-RAID1 driver was to redirect the page to another device between these two steps, the entry for a subsequent device could be wiped out when trying to invalidate an entry for a prior device.

Anyway, I'm also observing bug #988477. This could also be a kernel bug. So far no crashes/confirmed data loss have occurred, but sweeping the mirror does turn up small numbers of inconsistencies.

-- (\___(\___(\______ --=> 8-) EHM <=-- ______/)___/)___/) \BS ( | ehem+sigmsg at m5p.com PGP 87145445 | ) / \_CS\ | _____ -O #include <stddisclaimer.h> O- _____ | / _/ 8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445
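
The two-step release hypothesis in the message above can be made concrete with a toy model. This is only a sketch of the suspected ordering problem: the class `ToyIommuTable`, the function `mirrored_write`, and the device names `sda`/`sdb` are all invented for illustration, and nothing here reflects how Xen or the Linux DMA/MD code is actually implemented.

```python
class ToyIommuTable:
    """Toy model of the hypothesis only; not real Xen or Linux behaviour."""

    def __init__(self):
        self.entries = {}   # page -> device currently allowed to access it
        self.pending = []   # pages marked unused, waiting for the sweep

    def map(self, device, page):
        self.entries[page] = device

    def release(self, device, page):
        # Step 1: only mark the page; the entry stays until sweep() runs.
        self.pending.append(page)

    def sweep(self):
        # Step 2: drop entries by page alone, without checking which
        # device owns the entry by now (the suspected flaw).
        for page in self.pending:
            self.entries.pop(page, None)
        self.pending.clear()

    def dma(self, device, page):
        """Return True if the access is translated, False for a page fault."""
        return self.entries.get(page) == device


def mirrored_write(table, page):
    """Write one page to both mirror legs, reusing the page between them."""
    table.map("sda", page)
    ok_first = table.dma("sda", page)
    table.release("sda", page)
    table.map("sdb", page)   # same page redirected to the second device
    table.sweep()            # deferred invalidation for sda runs only now
    ok_second = table.dma("sdb", page)
    return ok_first, ok_second
```

In this model the first leg of the mirror is written normally, while the second leg faults because the late sweep removes the entry that was just installed for it. Whether anything like this happens in the real code paths is exactly what remains unconfirmed in the thread.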
Elliott Mitchell
2024-Jul-10 19:25 UTC
[Pkg-xen-devel] Bug#988477: Potential Mitigation for #988477
It was suggested as a debugging step, but adding the option "iommu=no-intremap" to Xen's command-line may work as a short-term mitigation for #988477.
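
As a sketch of how the option above could be added to a GRUB configuration fragment, the helper below appends a word to a `GRUB_CMDLINE_XEN_DEFAULT="..."` assignment in a piece of config text. It assumes the Xen command line is set through that variable in a GRUB defaults file, which should be checked against the local setup; the function name `add_xen_option` is made up for this illustration, and the helper only transforms text, it does not touch any file.

```python
import re


def add_xen_option(config_text, option, variable="GRUB_CMDLINE_XEN_DEFAULT"):
    """Return config_text with `option` present in the given GRUB variable."""
    pattern = re.compile(
        rf'^{re.escape(variable)}="(?P<value>[^"]*)"[ \t]*$', re.MULTILINE
    )
    match = pattern.search(config_text)
    if match is None:
        # Variable not set yet: append a new assignment at the end.
        sep = "" if (not config_text or config_text.endswith("\n")) else "\n"
        return f'{config_text}{sep}{variable}="{option}"\n'
    words = match.group("value").split()
    if option in words:
        return config_text  # already present, leave the text unchanged
    words.append(option)
    joined = " ".join(words)
    new_line = f'{variable}="{joined}"'
    return config_text[:match.start()] + new_line + config_text[match.end():]
```

After writing the changed text back, the GRUB configuration would need to be regenerated and the machine rebooted for the hypervisor to see the option. Later messages in this thread report that the option changed the messages but did not stop the corruption.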
Maximilian Engelhardt
2024-Aug-25 21:41 UTC
[Pkg-xen-devel] Bug#988477: xen-hypervisor-4.14-amd64: xen dmesg shows (XEN) AMD-Vi: IO_PAGE_FAULT on sata pci device
Control: severity -1 normal

Hi Elliott,

I am changing the severity back to normal as the xen package works fine for many people without any serious issues. From your last message it also seems you found a workaround for your problem. Please don't change the bug severity without at least giving an explanation why you think the new severity is justified.

From the few log lines in this bug report this seems to be an upstream issue with xen or the linux kernel. Please report your observations upstream. The Debian xen team does not have the resources and knowledge to debug or fix such problems. Once the issue has been identified and fixed upstream we can see if we can backport a fix to our Debian packages, but this is only possible once an upstream fix has landed.

Thanks,
Maxi
Debian Bug Tracking System
2024-Aug-25 21:54 UTC
[Pkg-xen-devel] Processed: Re: xen-hypervisor-4.14-amd64: xen dmesg shows (XEN) AMD-Vi: IO_PAGE_FAULT on sata pci device
Processing control commands:

> severity -1 normal
Bug #988477 [src:xen] xen-hypervisor-4.14-amd64: xen dmesg shows (XEN) AMD-Vi: IO_PAGE_FAULT on sata pci device
Severity set to 'normal' from 'critical'

--
988477: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=988477
Debian Bug Tracking System
Contact owner at bugs.debian.org with problems
Elliott Mitchell
2024-Aug-25 22:58 UTC
[Pkg-xen-devel] Bug#988477: xen-hypervisor-4.14-amd64: xen dmesg shows (XEN) AMD-Vi: IO_PAGE_FAULT on sata pci device
On Sun, Aug 25, 2024 at 11:41:44PM +0200, Maximilian Engelhardt wrote:
> I am changing the severity back to normal as the xen package works fine for
> many people without any serious issues. From your last message it also seems

Yet for some lucky people data is corrupted/lost. There could be other people who reproduce this, but don't send e-mail saying "me too" to this bug report.

Presently the main reason there aren't very many reproductions is that few people are bothering to use RAID with flash. The initial reports are that SSDs have a lower failure rate than disks, but the failure rate isn't even close to zero. Whereas the data loss/corruption easily reproduces.

While both cases in #988477 were on systems with AMD hardware, I am presently doubtful that is a requirement. The most similar known bug was found to be more severe on AMD hardware, but also to occur on Intel hardware. I suspect this issue may be similar, simply no one has noticed the problem yet...

> you found a workaround for your problem. Please don't change the bug severity

Something was found which seems to have made another issue more prominent. It may reduce the rate at which data corruption occurs, but I've since confirmed data loss/corruption continues to occur.

> without at least giving an explanation why you think the new severity is
> justified.

I had thought the original reporter's justification was sufficient.

This appears to have some specific requirements, but if you meet them you may be in trouble before alerts trigger. So far both reports are with AMD machines with IOMMUv2 functionality (I tried on a machine with IOMMUv1/GART and it didn't reproduce). Both reports feature Samsung SATA devices. An NVMe device from another manufacturer also showed the issue (I'm almost certain Samsung NVMe devices will also show the issue).

I suspect Intel machines may also be affected by this issue, but it may not manifest as severely. I suspect this is a case of people with AMD machines being a bit more wary of hardware failure (thus actually bothering to use RAID1 even with flash devices).

> From the few log lines in this bug report this seems to be an upstream issue
> with xen or the linux kernel. Please report your observations upstream. The
> Debian xen team does not have the resources and knowledge to debug or fix such
> problems. Once the issue has been identified and fixed upstream we can see if
> we can backport a fix to our Debian packages, but this is only possible once
> an upstream fix has landed.

Perhaps it has become easier to report things upstream, but the original procedure was that reporters were supposed to report to bugs.debian.org and NOT forward upstream.

The other problem is I've run into a chasm with upstream and no way to build a bridge across. I do have one more thing to try, but don't yet have a time-frame for when I'll check that.
Elliott Mitchell
2024-Sep-03 21:58 UTC
[Pkg-xen-devel] Bug#988477: xen-hypervisor-4.14-amd64: xen dmesg shows (XEN) AMD-Vi: IO_PAGE_FAULT on sata pci device
found 988477 4.17.3+10-g091466ba55-1~deb12u1
severity 988477 critical
quit

Justification is the same as the original: data loss. I'm unsure where the border between "data loss" and "serious data loss" is, but the original reporter declared it so and I don't disagree.

On Sun, Aug 25, 2024 at 11:41:44PM +0200, Maximilian Engelhardt wrote:
> I am changing the severity back to normal as the xen package works fine for
> many people without any serious issues. From your last message it also seems

critical
    makes unrelated software on the system (or the whole system) break, or causes serious data loss, or introduces a security hole on systems where you install the package.
grave
    makes the package in question unusable or mostly so, or causes data loss, or introduces a security hole allowing access to the accounts of users who use the package.

Both of those are lists of conditions. Since the conditions are "causes serious data loss" and "causes data loss", those have been met, as there is no mention of "and cannot work acceptably for anyone".

> you found a workaround for your problem. Please don't change the bug severity
> without at least giving an explanation why you think the new severity is
> justified.

The key word was "may". I was being cautious when testing due to the severity of the issue. As stated in the previous message, it was found to merely mildly change the messages and not fix the issue.

> From the few log lines in this bug report this seems to be an upstream issue
> with xen or the linux kernel. Please report your observations upstream. The
> Debian xen team does not have the resources and knowledge to debug or fix such
> problems. Once the issue has been identified and fixed upstream we can see if
> we can backport a fix to our Debian packages, but this is only possible once
> an upstream fix has landed.

My understanding is that being an upstream issue has no effect on severity. It allows tagging as "upstream", but does not allow reducing severity. The severity is meant as an alert to others that there is a *severe* problem lurking.

I've tried interacting with upstream, yet there has been a demand to release `xl dmesg` to a public area. While I cannot state that any information in `xl dmesg` can be used to compromise systems, nor point to hardware serial numbers or other private data which leak in, it still triggers the TMI detector. As such I'm uncomfortable with that being public and I don't know any way to bridge that chasm. If I were running an installation of 10K nodes I wouldn't be too bothered with details of a single test machine leaking; alas, I'm not in that category.

I could also send someone a pair of SATA devices known to manifest the issue, but that has failed to generate interest. As such I'm stuck.

Question for the original submitter, Imre Szőllősi: what was your situation prior to seeing #988477 manifest? Were you installing Xen 4.14 for the first time on Debian 11/bullseye? Had you previously used Xen 4.11 with Debian 10/buster or earlier?

Knowing whether the bug was introduced between Xen 4.11 and Xen 4.14 would be valuable knowledge if you have it. I had been using an older processor with 4.14, so I hadn't observed it until 4.17.