thr3ads.net - Pkg xen devel - [Pkg-xen-devel] Bug#988477: Also observing #988477 [Jan 2024]

Imre Szőllősi

2021-May-13 19:13 UTC

[Pkg-xen-devel] Bug#988477: xen-hypervisor-4.14-amd64: xen dmesg shows (XEN) AMD-Vi: IO_PAGE_FAULT on sata pci device

Package: src:xen
Version: 4.14.1+11-gb0b734a8b3-1
Severity: critical
Justification: causes serious data loss
X-Debbugs-Cc: debianbts at virtualzone.hu

Dear Maintainer,

after a clean install of bullseye/testing the xen dmesg shows the following
message:
(XEN) AMD-Vi: IO_PAGE_FAULT: 0000:01:00.1 d0 addr fffffffdf8000000 flags 0x8 I
this is the sata device:
01:00.1 SATA controller: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset
SATA Controller (rev 01)
or on another mb
01:00.1 SATA controller: Advanced Micro Devices, Inc. [AMD] Device 43eb
during write operations (e.g. dbench or a windows guest) there are a lot
of these messages
sometimes the filesystem goes into a read-only state, and the windows guest
crashes with a bsod
tested on 3 hw:
1. asus prime b450m-a, ryzen 5 2600x, md raid1, 2x samsung 1TB 860evo, lvm:
problem does appear
2. asus prime b550m-k, ryzen 5 5600x, md raid1, 2x samsung 1TB 870evo, lvm:
problem does appear
3. asus prime b550m-k, ryzen 5 5600x, 1x samsung 1TB 850evo, lvm: problem does
not appear
3. asus prime b550m-k, ryzen 5 5600x, 1x samsung 128GB 840pro, lvm: problem does
not appear
3. asus prime b550m-k, ryzen 5 5600x, samsung 1TB 850evo + samsung 128GB 840pro,
lvm, dbench on 2 ssds in parallel: problem does appear

as far as i can see, the problem appears when data is written to 2 ssds in parallel

Thanks!

-- System Information:
Debian Release: bullseye/sid
  APT prefers testing-security
  APT policy: (500, 'testing-security'), (500, 'testing')
Architecture: amd64 (x86_64)

Kernel: Linux 5.10.0-6-amd64 (SMP w/12 CPU threads)
Locale: LANG=hu_HU.UTF-8, LC_CTYPE=hu_HU.UTF-8 (charmap=UTF-8), LANGUAGE not set
Shell: /bin/sh linked to /usr/bin/dash
Init: systemd (via /run/systemd/system)
LSM: AppArmor: enabled

xen-hypervisor-4.14-amd64 depends on no packages.

Versions of packages xen-hypervisor-4.14-amd64 recommends:
ii  xen-hypervisor-common  4.14.1+11-gb0b734a8b3-1
ii  xen-utils-4.14         4.14.1+11-gb0b734a8b3-1

xen-hypervisor-4.14-amd64 suggests no packages.

-- no debconf information

Imre Szőllősi

2021-Jun-13 13:58 UTC

[Pkg-xen-devel] Bug#988477: Acknowledgement (xen-hypervisor-4.14-amd64: xen dmesg shows (XEN) AMD-Vi: IO_PAGE_FAULT on sata pci device)

i tested on a 4th hw:

4. asus m4n78 pro, phenom ii x4 905e, md raid1, 2x samsung 1TB 860evo,
lvm: problem does not appear

as far as i can see, not every mb/chipset/sata pcie device is affected

Thanks!

Hans van Kranenburg

2021-Aug-05 20:46 UTC

[Pkg-xen-devel] Bug#988477: Bug#988477: Acknowledgement (xen-hypervisor-4.14-amd64: xen dmesg shows (XEN) AMD-Vi: IO_PAGE_FAULT on sata pci device)

severity 988477 normal
tags 988477 + moreinfo + upstream - bullseye-ignore
thanks

Hi!

On 6/13/21 3:58 PM, Imre Szőllősi wrote:
> i tested on 4th hw
> 
> 4. asus m4n78 pro, phenom ii x4 905e, md raid1, 2x samsung 1TB 860evo, 
> lvm: problem does not appear
> 
> as i see, not all mb/chipset/sata pcie device affected
Thanks for your report, and for trying out different combinations of
hardware.

While doing a short internet search about the problems you're seeing
while using AMD ryzen, sata, nvme and iommu, I suspect this problem does
not have a lot to do with Xen specifically, but more with the hardware
and its firmware.

This also means that it's not a Debian packaging problem, and it cannot
be fixed by me (or the Debian Xen team). If you want to research this
problem more, I can maybe be of some help by providing suggestions.
Still, you will have to do all of the actual work, since I do not have
your hardware here.

The first thing I would suggest is to try to reproduce the problem when
booting with just Linux without Xen, and then running the dbench test.

If you don't actually need to directly pass through hardware to a Xen
guest, you can also try disabling the IOMMU, or researching other IOMMU
options that can serve as a workaround.

In any case, further reports will need to have more detailed
information. For example, instead of "there are a lot of messages",
provide a text attachment with a piece of logging that shows these messages.
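
One way to turn "a lot of messages" into something attachable is to save the
hypervisor log to a text file and count the fault lines per device. The
following is a minimal Python sketch, assuming the log lines look like the
single IO_PAGE_FAULT line quoted in the original report; other Xen versions
may format the message differently. The function name `summarize_faults` and
the regular expression are written for this example and are not part of Xen.

```python
import re
from collections import Counter

# Pattern modelled on the one log line quoted in the original report:
# (XEN) AMD-Vi: IO_PAGE_FAULT: 0000:01:00.1 d0 addr fffffffdf8000000 flags 0x8 I
FAULT_RE = re.compile(
    r"\(XEN\) AMD-Vi: IO_PAGE_FAULT: "
    r"(?P<bdf>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]) "
    r"d(?P<domain>\d+) addr (?P<addr>[0-9a-f]+) flags (?P<flags>0x[0-9a-f]+)"
)


def summarize_faults(log_text):
    """Count IO_PAGE_FAULT lines per PCI device (segment:bus:device.function)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = FAULT_RE.search(line)
        if match:
            counts[match.group("bdf")] += 1
    return counts
```

The input would typically be the saved output of `xl dmesg`. A summary like
this does not replace the log itself; the matching lines should still be
attached in full.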

I'm tagging this bug 'moreinfo' now, since it will depend on your
availability and abilities to work on it to have it advance.

Have fun,
Hans van Kranenburg

Imre Szőllősi

2021-Aug-08 13:34 UTC

[Pkg-xen-devel] Bug#988477: Bug#988477: Acknowledgement (xen-hypervisor-4.14-amd64: xen dmesg shows (XEN) AMD-Vi: IO_PAGE_FAULT on sata pci device)

An HTML attachment was scrubbed...
URL:
<http://alioth-lists.debian.net/pipermail/pkg-xen-devel/attachments/20210808/f65cb55f/attachment.htm>

Elliott Mitchell

2024-Jan-18 16:04 UTC

[Pkg-xen-devel] Bug#988477: Also observing #988477

tags 988477 - moreinfo
found 988477 4.17.2+76-ge1f9cb16e2-1~deb12u1
affects 988477 src:linux
severity 988477 critical
quit

I am also observing #988477 occur.  This machine has an AMD Zen 4
processor.  The first observation was when motherboard/processor was
swapped out, the older motherboard/processor was several generations old.

The pattern which is emerging is Linux MD RAID1 plus recent AMD processor
which has full IOMMU functionality.  The older machine was believed to
have an IOMMU, but the BIOS wasn't creating appropriate ACPI tables
(IVRS) and thus Xen was unable to utilize it.

This seems to be occurring with a small percentage of write operations.
Subsequent read operations appear to be fine.

I am not convinced this is a Xen bug.  I suspect this is instead a bug
in the Linux MD subsystem.  In particular, the DMA interface may have been
designed assuming only a single device would ever access any page, while
the MD RAID1 driver is reusing the same page for both devices.

IOMMU page release could be handled by marking the page unused in a
device data structure and later removed by sweeping a table.  In such
case if the MD-RAID1 driver was to redirect the page to another device
between these two steps, the entry for a subsequent device could be wiped
out when trying to invalidate an entry for a prior device.
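
The ordering problem hypothesized above can be illustrated with a toy model.
This is only a sketch of the hypothesis, assuming a table with one entry per
page and a deferred sweep; it is not taken from Xen or Linux code, and the
class, function, and device names (`LazyIommuTable`, `raid1_write`, "sda",
"sdb") are invented for illustration.

```python
class LazyIommuTable:
    """Toy model: one mapping entry per page, removed lazily by a sweep."""

    def __init__(self):
        self.mapping = {}   # page -> device currently allowed to DMA to it
        self.pending = []   # pages marked unused, waiting for the sweep

    def map_page(self, page, device):
        self.mapping[page] = device

    def unmap_page(self, page):
        # Step 1: only mark the page unused; the entry stays until the sweep.
        self.pending.append(page)

    def sweep(self):
        # Step 2: drop every entry marked unused, without checking whether
        # the page was handed to another device in the meantime.
        for page in self.pending:
            self.mapping.pop(page, None)
        self.pending.clear()

    def dma_allowed(self, page, device):
        return self.mapping.get(page) == device


def raid1_write(table, page, sweep_between):
    """Write one page to two mirror legs; report whether leg 2 may still DMA."""
    table.map_page(page, "sda")
    table.unmap_page(page)          # first leg is done with the page
    if sweep_between:
        table.sweep()               # sweep runs before the page is reused
    table.map_page(page, "sdb")     # same page redirected to the second leg
    if not sweep_between:
        table.sweep()               # late sweep wipes the second leg's entry
    return table.dma_allowed(page, "sdb")
```

In this model `raid1_write(LazyIommuTable(), 0x1000, sweep_between=True)`
leaves the second device's entry in place, while `sweep_between=False` removes
it, which is the kind of situation that would show up as a page fault for the
second device.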


Anyway, I'm also observing bug #988477.  This could also be a kernel bug.
So far no crashes/confirmed data loss have occurred, but sweeping the
mirror does turn up small numbers of inconsistencies.
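
For anyone wanting to compare numbers: assuming the "sweep" here is the MD
`check` action, the kernel reports its result in the array's `mismatch_cnt`
file under sysfs. A small Python sketch for reading it follows; the function
names are made up for this example and the array name "md0" is hypothetical.

```python
from pathlib import Path


def parse_mismatch_count(text):
    """Convert the contents of an md mismatch_cnt file to an integer."""
    return int(text.strip())


def read_mismatch_count(md_name, sysfs_root="/sys/block"):
    """Read mismatch_cnt for one array, e.g. md_name="md0" (hypothetical)."""
    path = Path(sysfs_root) / md_name / "md" / "mismatch_cnt"
    return parse_mismatch_count(path.read_text())
```

The value is only meaningful after a check has completed on the array, and a
non-zero count says that the mirror halves differ somewhere, not which half
holds the correct data.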


-- 
(\___(\___(\______          --=> 8-) EHM <=--          ______/)___/)___/)
 \BS (    |         ehem+sigmsg at m5p.com  PGP 87145445         |    )   /
  \_CS\   |  _____  -O #include <stddisclaimer.h> O-   _____  |   /  _/
8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445

Elliott Mitchell

2024-Jul-10 19:25 UTC

[Pkg-xen-devel] Bug#988477: Potential Mitigation for #988477

It was suggested as a debugging step, but adding the option
"iommu=no-intremap" to Xen's command-line may work as a short-term
mitigation for #988477.



Maximilian Engelhardt

2024-Aug-25 21:41 UTC

[Pkg-xen-devel] Bug#988477: xen-hypervisor-4.14-amd64: xen dmesg shows (XEN) AMD-Vi: IO_PAGE_FAULT on sata pci device

Control: severity -1 normal

Hi Elliott,

I am changing the severity back to normal as the xen package works fine for 
many people without any serious issues. From your last message it also seems 
you found a workaround for your problem. Please don't change the bug severity
without at least giving an explanation why you think the new severity is 
justified.

From the few log lines in this bug report this seems to be an upstream issue
with xen or the linux kernel. Please report your observations upstream. The 
Debian xen team does not have the resources and knowledge to debug or fix such 
problems. Once the issue has been identified and fixed upstream we can see if 
we can backport a fix to our Debian packages, but this is only possible once 
an upstream fix has landed.

Thanks,
Maxi




Debian Bug Tracking System

2024-Aug-25 21:54 UTC

[Pkg-xen-devel] Processed: Re: xen-hypervisor-4.14-amd64: xen dmesg shows (XEN) AMD-Vi: IO_PAGE_FAULT on sata pci device

Processing control commands:

> severity -1 normal
Bug #988477 [src:xen] xen-hypervisor-4.14-amd64: xen dmesg shows (XEN) AMD-Vi: IO_PAGE_FAULT on sata pci device
Severity set to 'normal' from 'critical'

-- 
988477: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=988477
Debian Bug Tracking System
Contact owner at bugs.debian.org with problems

Elliott Mitchell

2024-Aug-25 22:58 UTC

[Pkg-xen-devel] Bug#988477: xen-hypervisor-4.14-amd64: xen dmesg shows (XEN) AMD-Vi: IO_PAGE_FAULT on sata pci device

On Sun, Aug 25, 2024 at 11:41:44PM +0200, Maximilian Engelhardt wrote:
> I am changing the severity back to normal as the xen package works fine for
> many people without any serious issues. From your last message it also seems
Yet for some lucky people data is corrupted/lost.  There could be other
people who reproduce this, but don't send e-mail saying "me too" to this
bug report.

Presently the main reason there aren't very many reproductions is few
people are bothering to use RAID with flash.  The initial reports are
SSDs have a lower failure rate than disks, but the failure rate isn't
even close to zero.  Whereas the data loss/corruption easily reproduces.

While both cases in #988477 were on systems with AMD hardware, I am
presently doubtful that is a requirement.  The most similar known bug was
found to be more severe on AMD hardware, but also occur on Intel
hardware.  I suspect this issue may be similar, simply no one has noticed
the problem yet...
> you found a workaround for your problem. Please don't change the bug severity
Something was found which seems to have made another issue more
prominent.  It may reduce the rate at which data corruption occurs, but
I've since confirmed data loss/corruption continues to occur.
> without at least giving an explanation why you think the new severity is 
> justified.
I had thought the original reporter's justification was sufficient.  This
appears to have some specific requirements to meet, but if you meet them
you may be in trouble before alerts trigger.

So far both reports are with AMD machines with IOMMUv2 functionality (I
tried on a machine with IOMMUv1/GART and it didn't reproduce).  Both
reports feature Samsung SATA devices.  A NVMe device from another
manufacturer also showed the issue (I'm almost certain Samsung NVMe
devices will also show the issue).

I suspect Intel machines may also be affected by this issue, but it may
not manifest as severely.  I suspect this is a case of people with AMD
machines being a bit more wary of hardware failure (thus actually
bothering to use RAID1 even with flash devices).
> From the few log lines in this bug report this seems to be an upstream issue
> with xen or the linux kernel. Please report your observations upstream. The
> Debian xen team does not have the resources and knowledge to debug or fix such
> problems. Once the issue has been identified and fixed upstream we can see if
> we can backport a fix to our Debian packages, but this is only possible once
> an upstream fix has landed.
Perhaps it has become easier to report things upstream, but the original
procedure was that reporters were supposed to report to bugs.debian.org and
NOT forward upstream.

The other problem is I've run into a chasm with upstream and have no way to
build a bridge across.

I do have one more thing to try, but don't yet have a time-frame for
when I'll check that.



Elliott Mitchell

2024-Sep-03 21:58 UTC

[Pkg-xen-devel] Bug#988477: xen-hypervisor-4.14-amd64: xen dmesg shows (XEN) AMD-Vi: IO_PAGE_FAULT on sata pci device

found 988477 4.17.3+10-g091466ba55-1~deb12u1
severity 988477 critical
quit

Justification is the same as the original: data loss.  I'm unsure where the
border between "data loss" and "serious data loss" is, but the original
reporter declared it so and I don't disagree.

On Sun, Aug 25, 2024 at 11:41:44PM +0200, Maximilian Engelhardt wrote:
> I am changing the severity back to normal as the xen package works fine for
> many people without any serious issues. From your last message it also seems
critical
    makes unrelated software on the system (or the whole system) break,
    or causes serious data loss, or introduces a security hole on systems
    where you install the package.

grave
    makes the package in question unusable or mostly so, or causes data
    loss, or introduces a security hole allowing access to the accounts
    of users who use the package.

Both of those are lists of conditions.  Since the conditions are
"causes serious data loss" and "causes data loss", those have been met,
as there is no mention of "and cannot work acceptably for anyone".

> you found a workaround for your problem. Please don't change the bug severity
> without at least giving an explanation why you think the new severity is 
> justified.
The key word was "may".  I was being cautious when testing due to the
severity of the issue.  As stated in the previous message, it was found
to merely mildly change the messages and not fix the issue.
> From the few log lines in this bug report this seems to be an upstream issue
> with xen or the linux kernel. Please report your observations upstream. The
> Debian xen team does not have the resources and knowledge to debug or fix such
> problems. Once the issue has been identified and fixed upstream we can see if
> we can backport a fix to our Debian packages, but this is only possible once
> an upstream fix has landed.
My understanding is being an upstream issue has no effect on severity.
It allows tagging as "upstream", but does not allow reducing severity.
The severity is meant as an alert to others there is a *severe* problem
lurking.

I've tried interacting with upstream, yet there has been a demand to
release `xl dmesg` to a public area.  While I cannot state any
information in `xl dmesg` can be used to compromise systems, nor can
point to hardware serial numbers or other private data which leak in, it
still triggers the TMI detector.

As such I'm uncomfortable with that being public and I don't know any way
to bridge that chasm.  If I were an installation of 10K nodes I wouldn't
be too bothered with details of a single test machine leaking, alas I'm
not in that category.

I could also send someone a pair of SATA devices known to manifest the
issue, but that has failed to generate interest.  As such I'm stuck.

Question for the original submitter, Imre Szőllősi, what was your
situation prior to seeing #988477 manifest?

Were you installing Xen 4.14 for the first time on Debian 11/bullseye?

Had you previously used Xen 4.11 with Debian 10/buster or earlier?

Knowing whether the bug was introduced between Xen 4.11 and Xen 4.14
would be valuable knowledge if you have it.  I had been using an older
processor with 4.14, so I hadn't observed it until 4.17.

