thr3ads.net - Xen devel - [PATCH] amd iommu: Dump flags of IO page faults [Sep 2012]


Wei Wang

2012-Sep-05 14:42 UTC

[PATCH] amd iommu: Dump flags of IO page faults

Hi Jan,
The attached patch dumps the IO page fault flags. The flags show the reason for
the fault and tell us whether it is an unmapped interrupt fault or a DMA fault.

Thanks,
Wei

Signed-off-by: Wei Wang <wei.wang2@amd.com>
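For readers decoding the new `flags` value: it is a 12-bit field taken from the IO_PAGE_FAULT event log entry. The sketch below assumes the bit layout as we read it from the AMD I/O Virtualization Technology (IOMMU) specification, with GN, NX, US, I, PR, RW, PE, RZ and TR running from bit 0 upward. The table and the helper `decode_io_page_fault_flags` are illustrative and do not come from the patch, so check them against the spec revision for your hardware.

```python
# Flag bits of an AMD-Vi IO_PAGE_FAULT event, numbered from bit 0 of the
# 12-bit flags field. ASSUMPTION: layout as read from the AMD IOMMU spec.
IO_PAGE_FAULT_FLAGS = {
    0x001: "GN",  # guest/nested: request carried a PASID
    0x002: "NX",  # no-execute
    0x004: "US",  # user/supervisor
    0x008: "I",   # interrupt request rather than a memory (DMA) access
    0x010: "PR",  # target page was present
    0x020: "RW",  # write access (clear means read)
    0x040: "PE",  # permission indicator
    0x080: "RZ",  # reserved bit was non-zero
    0x100: "TR",  # translation request
}


def decode_io_page_fault_flags(flags):
    """Summarise a flags value such as 0x020 (hypothetical helper)."""
    return {
        "set": [name for bit, name in IO_PAGE_FAULT_FLAGS.items() if flags & bit],
        "kind": "interrupt" if flags & 0x008 else "DMA",
        "access": "write" if flags & 0x020 else "read",
        "page": "present" if flags & 0x010 else "not present",
    }


for value in (0x000, 0x020):
    print(f"{value:#05x}", decode_io_page_fault_flags(value))
```

Under this layout, `0x020` decodes as a DMA write to a page that was not present, and `0x000` decodes as a DMA read to a page that was not present. The I bit is clear in both cases.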



_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

Sander Eikelenboom

2012-Sep-05 22:59 UTC


Re: [PATCH] amd iommu: Dump flags of IO page faults

Wednesday, September 5, 2012, 4:42:42 PM, you wrote:
> Hi Jan,
> Attached patch dumps io page fault flags. The flags show the reason of
> the fault and tell us if this is an unmapped interrupt fault or a DMA fault.
> Thanks,
> Wei
> signed-off-by: Wei Wang <wei.wang2@amd.com>

I have applied the patch and the flags seem to differ between the faults:

AMD-Vi: IO_PAGE_FAULT: domain = 0, device id = 0x0a06, fault address = 0xc2c2c2c0, flags = 0x000
(XEN) [2012-09-05 20:54:16] AMD-Vi: IO_PAGE_FAULT: domain = 0, device id = 0x0a06, fault address = 0xc2c2c2c0, flags = 0x000
(XEN) [2012-09-05 20:54:16] AMD-Vi: IO_PAGE_FAULT: domain = 14, device id = 0x0700, fault address = 0xa8d339e0, flags = 0x020
(XEN) [2012-09-05 20:54:16] AMD-Vi: IO_PAGE_FAULT: domain = 14, device id = 0x0700, fault address = 0xa8d33a40, flags = 0x020
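The `device id` in these messages is a 16-bit PCI requester ID: the bus sits in the high byte, followed by 5 bits of device and 3 bits of function. That makes 0x0a06 correspond to 0a:00.6 and 0x0700 to 07:00.0 in lspci notation. Below is a minimal Python sketch that extracts the fields from such lines. The regular expression is written against the message format shown above; it is an assumption about that format and is not something Xen provides.

```python
import re

# Written against the AMD-Vi message format shown above (an assumption about
# its stability); "flags" is optional because pre-patch logs lack it.
FAULT_RE = re.compile(
    r"IO_PAGE_FAULT: domain = (?P<domain>\d+), device id = (?P<devid>0x[0-9a-f]+), "
    r"fault address = (?P<addr>0x[0-9a-f]+)(?:, flags = (?P<flags>0x[0-9a-f]+))?"
)


def device_id_to_bdf(devid):
    """Split a 16-bit PCI requester ID into lspci's bus:device.function form."""
    bus, device, function = devid >> 8, (devid >> 3) & 0x1F, devid & 0x7
    return f"{bus:02x}:{device:02x}.{function}"


def parse_fault(line):
    """Return the fields of one IO_PAGE_FAULT line, or None if it doesn't match."""
    m = FAULT_RE.search(line)
    if m is None:
        return None
    flags = m.group("flags")
    return {
        "domain": int(m.group("domain")),
        "bdf": device_id_to_bdf(int(m.group("devid"), 16)),
        "address": int(m.group("addr"), 16),
        "flags": int(flags, 16) if flags is not None else None,
    }


line = ("(XEN) [2012-09-05 20:54:16] AMD-Vi: IO_PAGE_FAULT: domain = 14, "
        "device id = 0x0700, fault address = 0xa8d339e0, flags = 0x020")
print(parse_fault(line))
```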

Complete xl dmesg and lspci -vvvknn attached.

Thx

--
Sander



Wei Wang

2012-Sep-06 13:32 UTC


Re: [PATCH] amd iommu: Dump flags of IO page faults

On 09/06/2012 12:59 AM, Sander Eikelenboom wrote:
> I have applied the patch and the flags seem to differ between the faults:
>
> AMD-Vi: IO_PAGE_FAULT: domain = 0, device id = 0x0a06, fault address = 0xc2c2c2c0, flags = 0x000
> (XEN) [2012-09-05 20:54:16] AMD-Vi: IO_PAGE_FAULT: domain = 0, device id = 0x0a06, fault address = 0xc2c2c2c0, flags = 0x000
> (XEN) [2012-09-05 20:54:16] AMD-Vi: IO_PAGE_FAULT: domain = 14, device id = 0x0700, fault address = 0xa8d339e0, flags = 0x020
> (XEN) [2012-09-05 20:54:16] AMD-Vi: IO_PAGE_FAULT: domain = 14, device id = 0x0700, fault address = 0xa8d33a40, flags = 0x020

OK, so they are not interrupt requests. I guess further information from
your system would help to debug this issue:
1) xl info
2) xl list
3) lspci -vvv (NOTE: not in dom0 but in your guest)
4) cat /proc/iomem (in both dom0 and your HVM guest)

* I would also like to know what symptoms device 0x0700 showed when the
IO_PF happened. Did it stop working?

(BTW: I copied a few options from your boot command line, and they worked
on my RD890 system:

dom0_mem=1024M,max:1024M loglvl=all loglvl_guest=all console_timestamps 
cpuidle cpufreq=xen noreboot debug lapic=debug apic_verbosity=debug 
apic=debug iommu=on,verbose,debug,no-sharept

* So, which OEM board do you have?)
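The `iommu=` option takes a comma-separated list in which a `no-` prefix turns a boolean sub-option off. Here `no-sharept` disables sharing of the p2m table with the IOMMU. The sketch below is a rough aid for comparing two command lines like this one. It assumes whitespace-separated `key=value` tokens, the helper names are hypothetical, and it is not Xen's own parser.

```python
def parse_cmdline(cmdline):
    """Map each whitespace-separated option to its value; bare flags map to True."""
    options = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        options[key] = value if sep else True
    return options


def parse_iommu(value):
    """Expand e.g. 'on,verbose,no-sharept' into {'on': True, ..., 'sharept': False}."""
    result = {}
    for item in value.split(","):
        if item.startswith("no-"):
            result[item[len("no-"):]] = False
        else:
            result[item] = True
    return result


cmdline = ("dom0_mem=1024M,max:1024M loglvl=all loglvl_guest=all console_timestamps "
           "cpuidle cpufreq=xen noreboot debug lapic=debug apic_verbosity=debug "
           "apic=debug iommu=on,verbose,debug,no-sharept")
opts = parse_cmdline(cmdline)
print(parse_iommu(opts["iommu"]))
# {'on': True, 'verbose': True, 'debug': True, 'sharept': False}
```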

Also, from your log, these lines look very strange:

(XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to read-only memory page. gfn=0xd5, mfn=0xa4a11
(XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to read-only memory page. gfn=0xd7, mfn=0xa4a0f
(XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to read-only memory page. gfn=0xd9, mfn=0xa4a0d
(XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to read-only memory page. gfn=0xdb, mfn=0xa4a0b
(XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to read-only memory page. gfn=0xdd, mfn=0xa4a09
(XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to read-only memory page. gfn=0xdf, mfn=0xa4a07
(XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to read-only memory page. gfn=0xe1, mfn=0xa4a05
(XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to read-only memory page. gfn=0xe3, mfn=0xa4a03
(XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to read-only memory page. gfn=0xe5, mfn=0xa4a01
(XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to read-only memory page. gfn=0xe7, mfn=0xa463f
(XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to read-only memory page. gfn=0xe9, mfn=0xa463d
(XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to read-only memory page. gfn=0xeb, mfn=0xa463b
(XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to read-only memory page. gfn=0xed, mfn=0xa4639
(XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to read-only memory page. gfn=0xef, mfn=0xa4637
(XEN) [2012-09-04 16:13:56] AMD-Vi: IO_PAGE_FAULT: domain = 0, device id = 0x0a06, fault address = 0xc2c2c2c0
(XEN) [2012-09-04 16:13:56] AMD-Vi: IO_PAGE_FAULT: domain = 14, device id = 0x0700, fault address = 0xa90f8300
(XEN) [2012-09-04 16:13:56] AMD-Vi: IO_PAGE_FAULT: domain = 14, device id = 0x0700, fault address = 0xa90f8340
(XEN) [2012-09-04 16:13:56] AMD-Vi: IO_PAGE_FAULT: domain = 14, device id = 0x0700, fault address = 0xa90f8380
(XEN) [2012-09-04 16:13:56] AMD-Vi: IO_PAGE_FAULT: domain = 14, device id = 0x0700, fault address = 0xa90f83c0

* They are immediately followed by the IO page faults. Do you know where
they come from? Your video card driver, maybe?
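In the hvm.c messages, `gfn` and `mfn` are 4 KiB page frame numbers, so shifting them left by 12 gives addresses. The small sketch below works over the pairs copied from the log above and shows the pattern.

```python
PAGE_SHIFT = 12  # 4 KiB pages on x86

# (gfn, mfn) pairs copied from the hvm.c log lines above.
pairs = [
    (0xd5, 0xa4a11), (0xd7, 0xa4a0f), (0xd9, 0xa4a0d), (0xdb, 0xa4a0b),
    (0xdd, 0xa4a09), (0xdf, 0xa4a07), (0xe1, 0xa4a05), (0xe3, 0xa4a03),
    (0xe5, 0xa4a01), (0xe7, 0xa463f), (0xe9, 0xa463d), (0xeb, 0xa463b),
    (0xed, 0xa4639), (0xef, 0xa4637),
]

guest_addrs = [gfn << PAGE_SHIFT for gfn, _ in pairs]
gfn_steps = {b[0] - a[0] for a, b in zip(pairs, pairs[1:])}
mfn_steps = [b[1] - a[1] for a, b in zip(pairs, pairs[1:])]

last_byte = guest_addrs[-1] + (1 << PAGE_SHIFT) - 1
print(f"guest range: {guest_addrs[0]:#x}-{last_byte:#x}")
print("gfn steps:", gfn_steps)
print("mfn steps:", mfn_steps)
```

With these values the gfns step by 2 through 0xd5000-0xeffff, touching every other page below 1 MiB. The mfns mostly step by -2, with a single jump between 0xa4a01 and 0xa463f.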

Thanks,
Wei


Sander Eikelenboom

2012-Sep-06 13:50 UTC


Re: [PATCH] amd iommu: Dump flags of IO page faults

Thursday, September 6, 2012, 3:32:51 PM, you wrote:
> On 09/06/2012 12:59 AM, Sander Eikelenboom wrote:
> OK, so they are not interrupt requests. I guess further information from 
> your system would be helpful to debug this issue:
> 1) xl info
> 2) xl list
> 3) lscpi -vvv (NOTE: not in dom0 but in your guest)
> 4) cat /proc/iomem (in both dom0 and your hvm guest)
dom14 is not an HVM guest, it's a PV guest.

I will try to put together a complete package, and test with only one PV domain,
to which the devices are passed through, to simplify the setup.

> * I would also like to know the symptoms of device 0x0700 when IO_PF 
> happened. Did it stop working?
Yes, it stops working: the video capture just freezes, but the driver
doesn't bail out.
For the USB controller (0x0a06), usbdev_open starts to return errors in the
guest.
> (BTW: I copied a few options from your boot cmd line and it worked with 
> my RD890 system
> dom0_mem=1024M,max:1024M loglvl=all loglvl_guest=all console_timestamps 
> cpuidle cpufreq=xen noreboot debug lapic=debug apic_verbosity=debug 
> apic=debug iommu=on,verbose,debug,no-sharept
> * so, what OEM board you have?)
MSI 890FXA-GD70
> Also from your log, these lines looks very strange:
> * they are just followed by the IO PAGE fault. Do you know where are 
> they from? Your video card driver maybe?
From an HVM domain with an old (3.0.3) kernel, but the faults also occur when
this domain is not started.


Wei Wang

2012-Sep-06 15:03 UTC


Re: [PATCH] amd iommu: Dump flags of IO page faults

On 09/06/2012 03:50 PM, Sander Eikelenboom wrote:
> dom14 is not a HVM guest, it's a PV guest.
Ah, I see. A PV guest is quite different from an HVM guest: it does not use
the p2m tables as IO page tables, so the no-sharept option has no effect in
this case. PV guests always use separate IO page tables. There might be some
incorrect mappings in those page tables. I will check this on my side.
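Because PV guests always use separate IO page tables, every p2m change has to be mirrored into those tables. A missed or wrong update then surfaces as an IO_PAGE_FAULT. The deliberately tiny, hypothetical model below illustrates that difference. The class and method names are invented for illustration and are not Xen code, and real IO page tables are multi-level structures rather than a dict.

```python
PAGE_SHIFT = 12
PAGE_MASK = (1 << PAGE_SHIFT) - 1


class ToyDomain:
    """Hypothetical model: a p2m (gfn -> mfn) plus the table the IOMMU walks."""

    def __init__(self, share_pt):
        self.p2m = {}
        # With sharing, the IOMMU walks the p2m itself; otherwise it has its own copy.
        self.io_pt = self.p2m if share_pt else {}

    def set_p2m(self, gfn, mfn, update_iommu=True):
        self.p2m[gfn] = mfn
        # A separate IO page table only sees the change if it is mirrored.
        if self.io_pt is not self.p2m and update_iommu:
            self.io_pt[gfn] = mfn

    def dma(self, addr):
        """Translate a device address the way this toy IOMMU would."""
        mfn = self.io_pt.get(addr >> PAGE_SHIFT)
        if mfn is None:
            return "IO_PAGE_FAULT"
        return (mfn << PAGE_SHIFT) | (addr & PAGE_MASK)


shared = ToyDomain(share_pt=True)
separate = ToyDomain(share_pt=False)
for dom in (shared, separate):
    # Simulate a p2m update whose IOMMU-side update was missed.
    dom.set_p2m(0xa8d33, 0x12345, update_iommu=False)
    print(dom.dma(0xa8d339e0))
```

With a shared table the DMA translates. With a separate table the same missed update produces a fault, which is the kind of "incorrect mapping" being checked for here.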

Thanks,
Wei

Sander Eikelenboom

2012-Sep-06 15:08 UTC


Re: [PATCH] amd iommu: Dump flags of IO page faults

Thursday, September 6, 2012, 5:03:05 PM, you wrote:
> On 09/06/2012 03:50 PM, Sander Eikelenboom wrote:
>> dom14 is not a HVM guest, it's a PV guest.
> Ah, I see. PV guest is quite different than hvm, it does use p2m tables 
> as io page tables. So no-sharept option does not work in this case. PV 
> guests always use separated io page tables. There might be some 
> incorrect mappings on the page tables. I will check this on my side.
> Thanks,
> Wei
In that case it's perhaps loosely related to a p2m bug in kernels > 3.4
that freezes guests on my Intel box.
Although guests start fine on the AMD box with kernels > 3.4, perhaps the bug
does cause IOMMU issues if the two are somehow tied together.


Sander Eikelenboom

2012-Sep-07 07:32 UTC


Re: [PATCH] amd iommu: Dump flags of IO page faults

Thursday, September 6, 2012, 5:03:05 PM, you wrote:
> On 09/06/2012 03:50 PM, Sander Eikelenboom wrote:
>> dom14 is not a HVM guest, it's a PV guest.
> Ah, I see. PV guest is quite different than hvm, it does use p2m tables 
> as io page tables. So no-sharept option does not work in this case. PV 
> guests always use separated io page tables. There might be some 
> incorrect mappings on the page tables. I will check this on my side.
I have reverted the machine to xen-4.1.4-pre (changeset 23353) and kept
everything else the same.
I haven't seen any IO page faults since.

I did spot some differences in the lspci output between Xen 4.1 and 4.2,
related to whether MSI is enabled for the IOMMU device.
I have attached the xl/xm dmesg and lspci output from booting with both versions.

lspci:

00:00.2 Generic system peripheral [0806]: ATI Technologies Inc RD990 I/O Memory Management Unit (IOMMU) [1002:5a23]
        Subsystem: ATI Technologies Inc RD990 I/O Memory Management Unit (IOMMU) [1002:5a23]
        Control: I/O- Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Interrupt: pin A routed to IRQ 10
        Capabilities: [40] Secure device <?>
4.1:    Capabilities: [54] MSI: Enable- Count=1/1 Maskable- 64bit+
4.2:    Capabilities: [54] MSI: Enable+ Count=1/1 Maskable- 64bit+
                Address: 00000000fee0100c  Data: 4128
        Capabilities: [64] HyperTransport: MSI Mapping Enable+ Fixed+

Although MSI seems to be enabled, shouldn't the IRQ number be much higher
than 10 for MSI interrupts?
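On the IRQ question: lspci's `Interrupt: pin A routed to IRQ 10` line reports the legacy INTx pin and the Interrupt Line register, not the MSI. An MSI carries its vector in the message data and its destination in the message address. The sketch below decodes the values above using the x86 MSI layout described in the Intel SDM. `decode_msi` is a hypothetical helper.

```python
def decode_msi(address, data):
    """Decode x86 MSI address/data fields (layout per the Intel SDM)."""
    delivery_modes = {0: "fixed", 1: "lowest priority", 2: "SMI",
                      4: "NMI", 5: "INIT", 7: "ExtINT"}
    return {
        "dest_apic_id": (address >> 12) & 0xFF,        # address bits 19:12
        "redirection_hint": bool(address & (1 << 3)),  # address bit 3
        "logical_dest": bool(address & (1 << 2)),      # address bit 2
        "vector": data & 0xFF,                         # data bits 7:0
        "delivery_mode": delivery_modes.get((data >> 8) & 0x7, "reserved"),
        "level_assert": bool(data & (1 << 14)),
        "level_triggered": bool(data & (1 << 15)),
    }


# Address/Data as printed by lspci (hex) for the IOMMU under Xen 4.2.
print(decode_msi(0x00000000FEE0100C, 0x4128))
```

For Address fee0100c and Data 4128, this decoding gives vector 0x28, lowest-priority delivery, destination APIC ID 1 in logical mode, and edge triggering. Under Xen, the vector is presumably one the hypervisor allocated, so it need not match any IRQ number that dom0 shows.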

There is another difference in the bridge in front of the 0a:00.6 device,
which faults before the kernel has even booted.

00:03.0 PCI bridge [0604]: ATI Technologies Inc RD890 PCI to PCI bridge (PCI express gpp port C) [1002:5a17] (prog-if 00 [Normal decode])
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+
4.1:    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
4.2:    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort+ <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Bus: primary=00, secondary=0a, subordinate=0a, sec-latency=0
        I/O behind bridge: 0000f000-00000fff
        Memory behind bridge: f9f00000-f9ffffff
        Prefetchable memory behind bridge: 00000000fff00000-00000000000fffff
4.1:    Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
4.2:    Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort+ <TAbort- <MAbort- <SERR- <PERR-
        BridgeCtl: Parity+ SERR+ NoISA+ VGA- MAbort- >Reset- FastB2B-
                PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
        Capabilities: [50] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
                Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [58] Express (v2) Root Port (Slot+), MSI 00
                DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
                        ExtTag+ RBE+ FLReset-
                DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
                        RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
                        MaxPayload 128 bytes, MaxReadReq 128 bytes
                DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
                LnkCap: Port #1, Speed 5GT/s, Width x8, ASPM L0s L1, Latency L0 <1us, L1 <8us
                        ClockPM- Surprise- LLActRep+ BwNot+
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk-
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive+ BWMgmt+ ABWMgmt-
                SltCap: AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug- Surprise-
                        Slot #3, PowerLimit 10.000W; Interlock- NoCompl+
                SltCtl: Enable: AttnBtn- PwrFlt- MRL- PresDet- CmdCplt- HPIrq- LinkChg-
                        Control: AttnInd Unknown, PwrInd Unknown, Power- Interlock-
                SltSta: Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet+ Interlock-
                        Changed: MRL- PresDet+ LinkState+
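The 4.1/4.2 differences in this bridge are all in the abort bits. In lspci output, `>TAbort` means Signaled Target Abort and `<TAbort` means Received Target Abort (bits 11 and 12 of the Status register in the PCI specification). The sketch below diffs two such lspci flag lines. The entries in `MEANING` follow lspci's notation, and the helper names are hypothetical.

```python
MEANING = {
    ">TAbort": "signaled target abort",
    "<TAbort": "received target abort",
    "<MAbort": "received master abort",
    ">SERR": "signaled system error",
    "<SERR": "received system error",
    "<PERR": "detected parity error",
}


def parse_flags(line):
    """Turn lspci tokens such as 'Cap+' or '<TAbort-' into {name: bool}."""
    return {tok[:-1]: tok.endswith("+") for tok in line.split() if tok[-1] in "+-"}


def diff_flags(old, new):
    """Return {name: (old, new)} for every flag whose state changed."""
    a, b = parse_flags(old), parse_flags(new)
    return {name: (a.get(name), state) for name, state in b.items() if a.get(name) != state}


# Status and Secondary status lines of bridge 00:03.0, copied from above.
status_41 = "Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-"
status_42 = "Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort+ <MAbort- >SERR- <PERR- INTx-"
sec_41 = "66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-"
sec_42 = "66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort+ <TAbort- <MAbort- <SERR- <PERR-"

for label, old, new in (("Status", status_41, status_42), ("Secondary", sec_41, sec_42)):
    for name, change in diff_flags(old, new).items():
        print(label, name, MEANING.get(name, "?"), change)
```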

serveerstertje:~# lspci -t
-[0000:00]-+-00.0
           +-00.2
           +-02.0-[0b]----00.0
           +-03.0-[0a]--+-00.0
           |            +-00.1
           |            +-00.2
           |            +-00.3
           |            +-00.4
           |            +-00.5
           |            +-00.6
           |            \-00.7
           +-05.0-[09]----00.0
           +-06.0-[08]----00.0
           +-0a.0-[07]----00.0
           +-0b.0-[06]--+-00.0
           |            \-00.1
           +-0c.0-[05]----00.0
           +-0d.0-[04]--+-00.0
           |            +-00.1
           |            +-00.2
           |            +-00.3
           |            +-00.4
           |            +-00.5
           |            +-00.6
           |            \-00.7
           +-11.0
           +-12.0
           +-12.2
           +-13.0
           +-13.2
           +-14.0
           +-14.3
           +-14.4-[03]----06.0
           +-14.5
           +-15.0-[02]--
           +-16.0
           +-16.2
           +-18.0
           +-18.1
           +-18.2
           +-18.3
           \-18.4











Wei Wang

2012-Sep-07 08:54 UTC

Re: [PATCH] amd iommu: Dump flags of IO page faults

On 09/07/2012 09:32 AM, Sander Eikelenboom wrote:
>
> Thursday, September 6, 2012, 5:03:05 PM, you wrote:
>
>> On 09/06/2012 03:50 PM, Sander Eikelenboom wrote:
>>>
>>> Thursday, September 6, 2012, 3:32:51 PM, you wrote:
>>>
>>>> On 09/06/2012 12:59 AM, Sander Eikelenboom wrote:
>>>>>
>>>>> Wednesday, September 5, 2012, 4:42:42 PM, you wrote:
>>>>>
>>>>>> Hi Jan,
>>>>>> Attached patch dumps io page fault flags. The flags show the reason of
>>>>>> the fault and tell us if this is an unmapped interrupt fault or a DMA fault.
>>>>>
>>>>>> Thanks,
>>>>>> Wei
>>>>>
>>>>>> signed-off-by: Wei Wang<wei.wang2@amd.com>
>>>>>
>>>>>
>>>>> I have applied the patch and the flags seem to differ between the faults:
>>>>>
>>>>> AMD-Vi: IO_PAGE_FAULT: domain = 0, device id = 0x0a06, fault address = 0xc2c2c2c0, flags = 0x000
>>>>> (XEN) [2012-09-05 20:54:16] AMD-Vi: IO_PAGE_FAULT: domain = 0, device id = 0x0a06, fault address = 0xc2c2c2c0, flags = 0x000
>>>>> (XEN) [2012-09-05 20:54:16] AMD-Vi: IO_PAGE_FAULT: domain = 14, device id = 0x0700, fault address = 0xa8d339e0, flags = 0x020
>>>>> (XEN) [2012-09-05 20:54:16] AMD-Vi: IO_PAGE_FAULT: domain = 14, device id = 0x0700, fault address = 0xa8d33a40, flags = 0x020
>>>
>>>> OK, so they are not interrupt requests. I guess further information from
>>>> your system would be helpful to debug this issue:
>>>> 1) xl info
>>>> 2) xl list
>>>> 3) lscpi -vvv (NOTE: not in dom0 but in your guest)
>>>> 4) cat /proc/iomem (in both dom0 and your hvm guest)
>>>
>>> dom14 is not a HVM guest, it's a PV guest.
>
>> Ah, I see. PV guest is quite different than hvm, it does use p2m tables
>> as io page tables. So no-sharept option does not work in this case. PV
>> guests always use separated io page tables. There might be some
>> incorrect mappings on the page tables. I will check this on my side.
>
> I have reverted the machine to xen-4.1.4-pre (changeset 23353) and kept everything else the same.
> I haven't seen any IO PAGE FAULTS after that.
>
> I did spot some differences in the output from lspci between xen 4.1 and 4.2, related to MSI enabled or not for the IOMMU device.
> Have attached the xl/xm dmesg and lspci from booting with both versions.
>
> lspci:
>
> 00:00.2 Generic system peripheral [0806]: ATI Technologies Inc RD990 I/O Memory Management Unit (IOMMU) [1002:5a23]
>          Subsystem: ATI Technologies Inc RD990 I/O Memory Management Unit (IOMMU) [1002:5a23]
>          Control: I/O- Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
>          Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
>          Latency: 0
>          Interrupt: pin A routed to IRQ 10
>          Capabilities: [40] Secure device <?>
> 4.1:    Capabilities: [54] MSI: Enable- Count=1/1 Maskable- 64bit+
Eh... That is interesting. So which dom0 are you using? There is a changeset
in 4.2 (25492:61844569a432) that prevents recent dom0 kernels from disabling
the IOMMU interrupt; without it, the IOMMU cannot send any events, including
IO page faults. You could try reverting dom0 to an older kernel, such as 2.6
pv_ops, to see whether you really get no IO page faults on 4.1.

> 4.2:    Capabilities: [54] MSI: Enable+ Count=1/1 Maskable- 64bit+
>                  Address: 00000000fee0100c  Data: 4128
>          Capabilities: [64] HyperTransport: MSI Mapping Enable+ Fixed+
>
> Although it seems enabled, shouldn't the IRQ number used be much higher than 10 for MSI interrupts ?
The IRQ number is fine. The MSI vector is encoded in the Data field (Data: 4128).
>
> There is another difference in the bridge device that's in front of the 0a:00.6 device that faults before the kernel is even booted.
>
> 00:03.0 PCI bridge [0604]: ATI Technologies Inc RD890 PCI to PCI bridge (PCI express gpp port C) [1002:5a17] (prog-if 00 [Normal decode])
>          Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+
> 4.1:    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
> 4.2:    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort+ <MAbort- >SERR- <PERR- INTx-
>          Latency: 0, Cache Line Size: 64 bytes
>          Bus: primary=00, secondary=0a, subordinate=0a, sec-latency=0
>          I/O behind bridge: 0000f000-00000fff
>          Memory behind bridge: f9f00000-f9ffffff
>          Prefetchable memory behind bridge: 00000000fff00000-00000000000fffff
> 4.1:    Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
> 4.2:    Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort+ <TAbort- <MAbort- <SERR- <PERR-
>          BridgeCtl: Parity+ SERR+ NoISA+ VGA- MAbort- >Reset- FastB2B-
>                  PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
>          Capabilities: [50] Power Management version 3
>                  Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
>                  Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
>          Capabilities: [58] Express (v2) Root Port (Slot+), MSI 00
>                  DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
>                          ExtTag+ RBE+ FLReset-
>                  DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
>                          RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
>                          MaxPayload 128 bytes, MaxReadReq 128 bytes
>                  DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
>                  LnkCap: Port #1, Speed 5GT/s, Width x8, ASPM L0s L1, Latency L0 <1us, L1 <8us
>                          ClockPM- Surprise- LLActRep+ BwNot+
>                  LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk-
>                          ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
>                  LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive+ BWMgmt+ ABWMgmt-
>                  SltCap: AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug- Surprise-
>                          Slot #3, PowerLimit 10.000W; Interlock- NoCompl+
>                  SltCtl: Enable: AttnBtn- PwrFlt- MRL- PresDet- CmdCplt- HPIrq- LinkChg-
>                          Control: AttnInd Unknown, PwrInd Unknown, Power- Interlock-
>                  SltSta: Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet+ Interlock-
>                          Changed: MRL- PresDet+ LinkState+
That difference (TAbort+) is probably caused by the IO_PAGE_FAULT.

Thanks,
Wei
> serveerstertje:~# lspci -t
> -[0000:00]-+-00.0
>             +-00.2
>             +-02.0-[0b]----00.0
>             +-03.0-[0a]--+-00.0
>             |            +-00.1
>             |            +-00.2
>             |            +-00.3
>             |            +-00.4
>             |            +-00.5
>             |            +-00.6
>             |            \-00.7
>             +-05.0-[09]----00.0
>             +-06.0-[08]----00.0
>             +-0a.0-[07]----00.0
>             +-0b.0-[06]--+-00.0
>             |            \-00.1
>             +-0c.0-[05]----00.0
>             +-0d.0-[04]--+-00.0
>             |            +-00.1
>             |            +-00.2
>             |            +-00.3
>             |            +-00.4
>             |            +-00.5
>             |            +-00.6
>             |            \-00.7
>             +-11.0
>             +-12.0
>             +-12.2
>             +-13.0
>             +-13.2
>             +-14.0
>             +-14.3
>             +-14.4-[03]----06.0
>             +-14.5
>             +-15.0-[02]--
>             +-16.0
>             +-16.2
>             +-18.0
>             +-18.1
>             +-18.2
>             +-18.3
>             \-18.4
>
>
>
>
>
>> Thanks,
>> Wei
>
>>> I will try to make a complete package, and try with one pv domain only where the devices are being passed through just to simplify the setup.
>>>
>>>
>>>> * I would also like to know the symptoms of device 0x0700 when IO_PF
>>>> happened. Did it stop working?
>>>
>>> Yes it stops working, the video capture just freezes, but the driver doesn't bail out.
>>> For the USB controller (0x0a06) it starts to give errors for usbdev_open in the guest.
>>>
>>>> (BTW: I copied a few options from your boot cmd line and it worked with
>>>> my RD890 system
>>>
>>>> dom0_mem=1024M,max:1024M loglvl=all loglvl_guest=all console_timestamps
>>>> cpuidle cpufreq=xen noreboot debug lapic=debug apic_verbosity=debug
>>>> apic=debug iommu=on,verbose,debug,no-sharept
>>>
>>>> * so, what OEM board you have?)
>>>
>>> MSI 890FXA-GD70
>>>
>>>> Also from your log, these lines looks very strange:
>>>
>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to
>>>> read-only memory page. gfn=0xd5, mfn=0xa4a11
>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to
>>>> read-only memory page. gfn=0xd7, mfn=0xa4a0f
>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to
>>>> read-only memory page. gfn=0xd9, mfn=0xa4a0d
>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to
>>>> read-only memory page. gfn=0xdb, mfn=0xa4a0b
>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to
>>>> read-only memory page. gfn=0xdd, mfn=0xa4a09
>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to
>>>> read-only memory page. gfn=0xdf, mfn=0xa4a07
>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to
>>>> read-only memory page. gfn=0xe1, mfn=0xa4a05
>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to
>>>> read-only memory page. gfn=0xe3, mfn=0xa4a03
>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to
>>>> read-only memory page. gfn=0xe5, mfn=0xa4a01
>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to
>>>> read-only memory page. gfn=0xe7, mfn=0xa463f
>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to
>>>> read-only memory page. gfn=0xe9, mfn=0xa463d
>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to
>>>> read-only memory page. gfn=0xeb, mfn=0xa463b
>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to
>>>> read-only memory page. gfn=0xed, mfn=0xa4639
>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to
>>>> read-only memory page. gfn=0xef, mfn=0xa4637
>>>> (XEN) [2012-09-04 16:13:56] AMD-Vi: IO_PAGE_FAULT: domain = 0, device id
>>>> = 0x0a06, fault address = 0xc2c2c2c0
>>>> (XEN) [2012-09-04 16:13:56] AMD-Vi: IO_PAGE_FAULT: domain = 14, device
>>>> id = 0x0700, fault address = 0xa90f8300
>>>> (XEN) [2012-09-04 16:13:56] AMD-Vi: IO_PAGE_FAULT: domain = 14, device
>>>> id = 0x0700, fault address = 0xa90f8340
>>>> (XEN) [2012-09-04 16:13:56] AMD-Vi: IO_PAGE_FAULT: domain = 14, device
>>>> id = 0x0700, fault address = 0xa90f8380
>>>> (XEN) [2012-09-04 16:13:56] AMD-Vi: IO_PAGE_FAULT: domain = 14, device
>>>> id = 0x0700, fault address = 0xa90f83c0
>>>
>>>> * they are just followed by the IO PAGE fault. Do you know where are
>>>> they from? Your video card driver maybe?
>>>
>>> From a HVM domain with a old (3.0.3) kernel, but the faults also occur without this domain being started.
>>>
>>>
>>>> Thanks,
>>>> Wei
>>>
>>>
>>>>> Complete xl dmesg and lspci -vvvknn attached.
>>>>>
>>>>> Thx
>>>>>
>>>>> --
>>>>> Sander
>>>
>>>
>>>
>>>
>>>
>
>

Andrew Cooper

2012-Sep-07 09:17 UTC

Re: [PATCH] amd iommu: Dump flags of IO page faults (off topic - pci devices)

On 07/09/12 08:32, Sander Eikelenboom wrote:
> Thursday, September 6, 2012, 5:03:05 PM, you wrote:
>
>> On 09/06/2012 03:50 PM, Sander Eikelenboom wrote:
>>> Thursday, September 6, 2012, 3:32:51 PM, you wrote:
>>>
>>>> On 09/06/2012 12:59 AM, Sander Eikelenboom wrote:
>>>>> Wednesday, September 5, 2012, 4:42:42 PM, you wrote:
>>>>>
>>>>>> Hi Jan,
>>>>>> Attached patch dumps io page fault flags. The flags show the reason of
>>>>>> the fault and tell us if this is an unmapped interrupt fault or a DMA fault.
>>>>>> Thanks,
>>>>>> Wei
>>>>>> signed-off-by: Wei Wang<wei.wang2@amd.com>
>>>>>
>>>>> I have applied the patch and the flags seem to differ between the faults:
>>>>>
>>>>> AMD-Vi: IO_PAGE_FAULT: domain = 0, device id = 0x0a06, fault address = 0xc2c2c2c0, flags = 0x000
>>>>> (XEN) [2012-09-05 20:54:16] AMD-Vi: IO_PAGE_FAULT: domain = 0, device id = 0x0a06, fault address = 0xc2c2c2c0, flags = 0x000
>>>>> (XEN) [2012-09-05 20:54:16] AMD-Vi: IO_PAGE_FAULT: domain = 14, device id = 0x0700, fault address = 0xa8d339e0, flags = 0x020
>>>>> (XEN) [2012-09-05 20:54:16] AMD-Vi: IO_PAGE_FAULT: domain = 14, device id = 0x0700, fault address = 0xa8d33a40, flags = 0x020
>>>> OK, so they are not interrupt requests. I guess further information from
>>>> your system would be helpful to debug this issue:
>>>> 1) xl info
>>>> 2) xl list
>>>> 3) lscpi -vvv (NOTE: not in dom0 but in your guest)
>>>> 4) cat /proc/iomem (in both dom0 and your hvm guest)
>>> dom14 is not a HVM guest, it's a PV guest.
>> Ah, I see. PV guest is quite different than hvm, it does use p2m tables
>> as io page tables. So no-sharept option does not work in this case. PV
>> guests always use separated io page tables. There might be some
>> incorrect mappings on the page tables. I will check this on my side.
> I have reverted the machine to xen-4.1.4-pre (changeset 23353) and kept everything else the same.
> I haven't seen any IO PAGE FAULTS after that.
>
> I did spot some differences in the output from lspci between xen 4.1 and 4.2, related to MSI enabled or not for the IOMMU device.
> Have attached the xl/xm dmesg and lspci from booting with both versions.
>
> lspci:
>
> 00:00.2 Generic system peripheral [0806]: ATI Technologies Inc RD990 I/O Memory Management Unit (IOMMU) [1002:5a23]
>         Subsystem: ATI Technologies Inc RD990 I/O Memory Management Unit (IOMMU) [1002:5a23]
>         Control: I/O- Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
>         Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
>         Latency: 0
>         Interrupt: pin A routed to IRQ 10
>         Capabilities: [40] Secure device <?>
> 4.1:    Capabilities: [54] MSI: Enable- Count=1/1 Maskable- 64bit+
> 4.2:    Capabilities: [54] MSI: Enable+ Count=1/1 Maskable- 64bit+
>                 Address: 00000000fee0100c  Data: 4128
>         Capabilities: [64] HyperTransport: MSI Mapping Enable+ Fixed+
>
> Although it seems enabled, shouldn't the IRQ number used be much higher than 10 for MSI interrupts ?
For compatibility reasons, all real PCI devices have to be able to fall
back to legacy line-level interrupts. That is why you see IRQ 10: it is
where pin A (#INTA in perhaps more recognizable notation) is routed. The
line interrupt(s) will only be used if MSI and MSI-X interrupts are disabled.

You should find that all devices in lspci show between 1 and 4 #INTs (A
through D), with the exception of SR-IOV virtual functions, which are
specified to support only MSI/MSI-X.

~Andrew
>
> There is another difference in the bridge device that's in front of the 0a:00.6 device that faults before the kernel is even booted.
>
> 00:03.0 PCI bridge [0604]: ATI Technologies Inc RD890 PCI to PCI bridge (PCI express gpp port C) [1002:5a17] (prog-if 00 [Normal decode])
>         Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+
> 4.1:    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
> 4.2:    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort+ <MAbort- >SERR- <PERR- INTx-
>         Latency: 0, Cache Line Size: 64 bytes
>         Bus: primary=00, secondary=0a, subordinate=0a, sec-latency=0
>         I/O behind bridge: 0000f000-00000fff
>         Memory behind bridge: f9f00000-f9ffffff
>         Prefetchable memory behind bridge: 00000000fff00000-00000000000fffff
> 4.1:    Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
> 4.2:    Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort+ <TAbort- <MAbort- <SERR- <PERR-
>         BridgeCtl: Parity+ SERR+ NoISA+ VGA- MAbort- >Reset- FastB2B-
>                 PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
>         Capabilities: [50] Power Management version 3
>                 Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
>                 Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
>         Capabilities: [58] Express (v2) Root Port (Slot+), MSI 00
>                 DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
>                         ExtTag+ RBE+ FLReset-
>                 DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
>                         RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
>                         MaxPayload 128 bytes, MaxReadReq 128 bytes
>                 DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
>                 LnkCap: Port #1, Speed 5GT/s, Width x8, ASPM L0s L1, Latency L0 <1us, L1 <8us
>                         ClockPM- Surprise- LLActRep+ BwNot+
>                 LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk-
>                         ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
>                 LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive+ BWMgmt+ ABWMgmt-
>                 SltCap: AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug- Surprise-
>                         Slot #3, PowerLimit 10.000W; Interlock- NoCompl+
>                 SltCtl: Enable: AttnBtn- PwrFlt- MRL- PresDet- CmdCplt- HPIrq- LinkChg-
>                         Control: AttnInd Unknown, PwrInd Unknown, Power- Interlock-
>                 SltSta: Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet+ Interlock-
>                         Changed: MRL- PresDet+ LinkState+
>
> serveerstertje:~# lspci -t
> -[0000:00]-+-00.0
>            +-00.2
>            +-02.0-[0b]----00.0
>            +-03.0-[0a]--+-00.0
>            |            +-00.1
>            |            +-00.2
>            |            +-00.3
>            |            +-00.4
>            |            +-00.5
>            |            +-00.6
>            |            \-00.7
>            +-05.0-[09]----00.0
>            +-06.0-[08]----00.0
>            +-0a.0-[07]----00.0
>            +-0b.0-[06]--+-00.0
>            |            \-00.1
>            +-0c.0-[05]----00.0
>            +-0d.0-[04]--+-00.0
>            |            +-00.1
>            |            +-00.2
>            |            +-00.3
>            |            +-00.4
>            |            +-00.5
>            |            +-00.6
>            |            \-00.7
>            +-11.0
>            +-12.0
>            +-12.2
>            +-13.0
>            +-13.2
>            +-14.0
>            +-14.3
>            +-14.4-[03]----06.0
>            +-14.5
>            +-15.0-[02]--
>            +-16.0
>            +-16.2
>            +-18.0
>            +-18.1
>            +-18.2
>            +-18.3
>            \-18.4
>
>
>
>
>
>> Thanks,
>> Wei
>>> I will try to make a complete package, and try with one pv domain only where the devices are being passed through just to simplify the setup.
>>>
>>>
>>>> * I would also like to know the symptoms of device 0x0700 when IO_PF
>>>> happened. Did it stop working?
>>> Yes it stops working, the video capture just freezes, but the driver doesn't bail out.
>>> For the USB controller (0x0a06) it starts to give errors for usbdev_open in the guest.
>>>
>>>> (BTW: I copied a few options from your boot cmd line and it worked with
>>>> my RD890 system
>>>> dom0_mem=1024M,max:1024M loglvl=all loglvl_guest=all console_timestamps
>>>> cpuidle cpufreq=xen noreboot debug lapic=debug apic_verbosity=debug
>>>> apic=debug iommu=on,verbose,debug,no-sharept
>>>> * so, what OEM board you have?)
>>> MSI 890FXA-GD70
>>>
>>>> Also from your log, these lines looks very strange:
>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to
>>>> read-only memory page. gfn=0xd5, mfn=0xa4a11
>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to
>>>> read-only memory page. gfn=0xd7, mfn=0xa4a0f
>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to
>>>> read-only memory page. gfn=0xd9, mfn=0xa4a0d
>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to
>>>> read-only memory page. gfn=0xdb, mfn=0xa4a0b
>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to
>>>> read-only memory page. gfn=0xdd, mfn=0xa4a09
>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to
>>>> read-only memory page. gfn=0xdf, mfn=0xa4a07
>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to
>>>> read-only memory page. gfn=0xe1, mfn=0xa4a05
>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to
>>>> read-only memory page. gfn=0xe3, mfn=0xa4a03
>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to
>>>> read-only memory page. gfn=0xe5, mfn=0xa4a01
>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to
>>>> read-only memory page. gfn=0xe7, mfn=0xa463f
>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to
>>>> read-only memory page. gfn=0xe9, mfn=0xa463d
>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to
>>>> read-only memory page. gfn=0xeb, mfn=0xa463b
>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to
>>>> read-only memory page. gfn=0xed, mfn=0xa4639
>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to
>>>> read-only memory page. gfn=0xef, mfn=0xa4637
>>>> (XEN) [2012-09-04 16:13:56] AMD-Vi: IO_PAGE_FAULT: domain = 0, device id
>>>> = 0x0a06, fault address = 0xc2c2c2c0
>>>> (XEN) [2012-09-04 16:13:56] AMD-Vi: IO_PAGE_FAULT: domain = 14, device
>>>> id = 0x0700, fault address = 0xa90f8300
>>>> (XEN) [2012-09-04 16:13:56] AMD-Vi: IO_PAGE_FAULT: domain = 14, device
>>>> id = 0x0700, fault address = 0xa90f8340
>>>> (XEN) [2012-09-04 16:13:56] AMD-Vi: IO_PAGE_FAULT: domain = 14, device
>>>> id = 0x0700, fault address = 0xa90f8380
>>>> (XEN) [2012-09-04 16:13:56] AMD-Vi: IO_PAGE_FAULT: domain = 14, device
>>>> id = 0x0700, fault address = 0xa90f83c0
>>>> * they are just followed by the IO PAGE fault. Do you know where are
>>>> they from? Your video card driver maybe?
>>> From a HVM domain with a old (3.0.3) kernel, but the faults also occur without this domain being started.
>>>
>>>
>>>> Thanks,
>>>> Wei
>>>
>>>>> Complete xl dmesg and lspci -vvvknn attached.
>>>>>
>>>>> Thx
>>>>>
>>>>> --
>>>>> Sander
>>>
>>>
>>>
>>>
>
-- 
Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
T: +44 (0)1223 225 900, http://www.citrix.com

Jan Beulich

2012-Sep-07 09:53 UTC

Re: [PATCH] amd iommu: Dump flags of IO page faults

>>> On 07.09.12 at 09:32, Sander Eikelenboom <linux@eikelenboom.it> wrote:
> 4.1:    Capabilities: [54] MSI: Enable- Count=1/1 Maskable- 64bit+
> 4.2:    Capabilities: [54] MSI: Enable+ Count=1/1 Maskable- 64bit+
No surprise you're not seeing any faults on 4.1 - there's no way
they could get reported. I'm somewhat hesitant to pull the
workaround patch from 4.2 into 4.1, as it's wrong for the kernel
to touch the MSI setting of the IOMMU (which is under the
control of Xen) in the first place, but the kernel side patch I had
submitted a while ago wasn't received well. And that patch isn't
really small, and it would remain to be seen what other
dependencies it would have...

Jan

Sander Eikelenboom

2012-Sep-07 10:00 UTC

Re: [PATCH] amd iommu: Dump flags of IO page faults

Friday, September 7, 2012, 11:53:36 AM, you wrote:
>>>> On 07.09.12 at 09:32, Sander Eikelenboom <linux@eikelenboom.it> wrote:
>> 4.1:    Capabilities: [54] MSI: Enable- Count=1/1 Maskable- 64bit+
>> 4.2:    Capabilities: [54] MSI: Enable+ Count=1/1 Maskable- 64bit+
> No surprise you're not seeing any faults on 4.1 - there's no way
> they could get reported. I'm somewhat hesitant to pull the
> workaround patch from 4.2 into 4.1, as it's wrong for the kernel
> to touch the MSI setting of the IOMMU (which is under the
> control of Xen) in the first place, but the kernel side patch I had
> submitted a while ago wasn't received well. And that patch isn't
> really small, and it would remain to be seen what other
> dependencies it would have...
OK, so does that mean that in the 4.1 case the IOMMU is doing nothing?

Because if the IOMMU is working, I would say:
           a) isn't it a bit strange that everything keeps working, although it should be reporting the IO PAGE FAULT?
           b) is the IO PAGE FAULT genuine at all, since the device keeps working fine?

> Jan

Sander Eikelenboom

2012-Sep-07 10:01 UTC

Re: [PATCH] amd iommu: Dump flags of IO page faults

Friday, September 7, 2012, 10:54:40 AM, you wrote:
> On 09/07/2012 09:32 AM, Sander Eikelenboom wrote:
>>
>> Thursday, September 6, 2012, 5:03:05 PM, you wrote:
>>
>>> On 09/06/2012 03:50 PM, Sander Eikelenboom wrote:
>>>>
>>>> Thursday, September 6, 2012, 3:32:51 PM, you wrote:
>>>>
>>>>> On 09/06/2012 12:59 AM, Sander Eikelenboom wrote:
>>>>>>
>>>>>> Wednesday, September 5, 2012, 4:42:42 PM, you wrote:
>>>>>>
>>>>>>> Hi Jan,
>>>>>>> Attached patch dumps io page fault flags. The flags show the reason of
>>>>>>> the fault and tell us if this is an unmapped interrupt fault or a DMA fault.
>>>>>>
>>>>>>> Thanks,
>>>>>>> Wei
>>>>>>
>>>>>>> signed-off-by: Wei Wang<wei.wang2@amd.com>
>>>>>>
>>>>>>
>>>>>> I have applied the patch and the flags seem to differ between the faults:
>>>>>>
>>>>>> AMD-Vi: IO_PAGE_FAULT: domain = 0, device id = 0x0a06, fault address = 0xc2c2c2c0, flags = 0x000
>>>>>> (XEN) [2012-09-05 20:54:16] AMD-Vi: IO_PAGE_FAULT: domain = 0, device id = 0x0a06, fault address = 0xc2c2c2c0, flags = 0x000
>>>>>> (XEN) [2012-09-05 20:54:16] AMD-Vi: IO_PAGE_FAULT: domain = 14, device id = 0x0700, fault address = 0xa8d339e0, flags = 0x020
>>>>>> (XEN) [2012-09-05 20:54:16] AMD-Vi: IO_PAGE_FAULT: domain = 14, device id = 0x0700, fault address = 0xa8d33a40, flags = 0x020
>>>>
>>>>> OK, so they are not interrupt requests. I guess further information from
>>>>> your system would be helpful to debug this issue:
>>>>> 1) xl info
>>>>> 2) xl list
>>>>> 3) lscpi -vvv (NOTE: not in dom0 but in your guest)
>>>>> 4) cat /proc/iomem (in both dom0 and your hvm guest)
>>>>
>>>> dom14 is not a HVM guest, it's a PV guest.
>>
>>> Ah, I see. PV guest is quite different than hvm, it does use p2m tables
>>> as io page tables. So no-sharept option does not work in this case. PV
>>> guests always use separated io page tables. There might be some
>>> incorrect mappings on the page tables. I will check this on my side.
>>
>> I have reverted the machine to xen-4.1.4-pre (changeset 23353) and kept everything else the same.
>> I haven't seen any IO PAGE FAULTS after that.
>>
>> I did spot some differences in the output from lspci between xen 4.1 and 4.2, related to MSI enabled or not for the IOMMU device.
>> Have attached the xl/xm dmesg and lspci from booting with both versions.
>>
>> lspci:
>>
>> 00:00.2 Generic system peripheral [0806]: ATI Technologies Inc RD990 I/O Memory Management Unit (IOMMU) [1002:5a23]
>>          Subsystem: ATI Technologies Inc RD990 I/O Memory Management Unit (IOMMU) [1002:5a23]
>>          Control: I/O- Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
>>          Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
>>          Latency: 0
>>          Interrupt: pin A routed to IRQ 10
>>          Capabilities: [40] Secure device <?>
>> 4.1:    Capabilities: [54] MSI: Enable- Count=1/1 Maskable- 64bit+
> Eh... That is interesting. So which dom0 are you using?  There is a c/s 
> in 4.2 to prevent recent dom0 to disable iommu interrupt (changeset 
> 25492:61844569a432) Otherwise, iommu cannot send any events including IO 
> PAGE faults. You could try to revert dom0 to an old version like 2.6 
> pv_ops to see if you really have no io page faults on 4.1
OK, I will give that a try. Only dom0 will have to be a 2.6 pv_ops kernel, I assume?

>> 4.2:    Capabilities: [54] MSI: Enable+ Count=1/1 Maskable- 64bit+
>>                  Address: 00000000fee0100c  Data: 4128
>>          Capabilities: [64] HyperTransport: MSI Mapping Enable+ Fixed+
>>
>> Although it seems enabled, shouldn't the IRQ number used be much higher than 10 for MSI interrupts ?
> The IRQ number is fine. MSI vector is stored at  Data: 4128
>>
>> There is another difference in the bridge device that's in front of the 0a:00.6 device that faults before the kernel is even booted.
>>
>> 00:03.0 PCI bridge [0604]: ATI Technologies Inc RD890 PCI to PCI bridge (PCI express gpp port C) [1002:5a17] (prog-if 00 [Normal decode])
>>          Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+
>> 4.1:    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
>> 4.2:    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort+ <MAbort- >SERR- <PERR- INTx-
>>          Latency: 0, Cache Line Size: 64 bytes
>>          Bus: primary=00, secondary=0a, subordinate=0a, sec-latency=0
>>          I/O behind bridge: 0000f000-00000fff
>>          Memory behind bridge: f9f00000-f9ffffff
>>          Prefetchable memory behind bridge: 00000000fff00000-00000000000fffff
>> 4.1:    Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
>> 4.2:    Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort+ <TAbort- <MAbort- <SERR- <PERR-
>>          BridgeCtl: Parity+ SERR+ NoISA+ VGA- MAbort- >Reset- FastB2B-
>>                  PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
>>          Capabilities: [50] Power Management version 3
>>                  Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
>>                  Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
>>          Capabilities: [58] Express (v2) Root Port (Slot+), MSI 00
>>                  DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
>>                          ExtTag+ RBE+ FLReset-
>>                  DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
>>                          RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
>>                          MaxPayload 128 bytes, MaxReadReq 128 bytes
>>                  DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
>>                  LnkCap: Port #1, Speed 5GT/s, Width x8, ASPM L0s L1, Latency L0 <1us, L1 <8us
>>                          ClockPM- Surprise- LLActRep+ BwNot+
>>                  LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk-
>>                          ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
>>                  LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive+ BWMgmt+ ABWMgmt-
>>                  SltCap: AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug- Surprise-
>>                          Slot #3, PowerLimit 10.000W; Interlock- NoCompl+
>>                  SltCtl: Enable: AttnBtn- PwrFlt- MRL- PresDet- CmdCplt- HPIrq- LinkChg-
>>                          Control: AttnInd Unknown, PwrInd Unknown, Power- Interlock-
>>                  SltSta: Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet+ Interlock-
>>                          Changed: MRL- PresDet+ LinkState+
> The probably because of the IO_PAGE_FAULT.
> Thanks,
> Wei
>> serveerstertje:~# lspci -t
>> -[0000:00]-+-00.0
>>             +-00.2
>>             +-02.0-[0b]----00.0
>>             +-03.0-[0a]--+-00.0
>>             |            +-00.1
>>             |            +-00.2
>>             |            +-00.3
>>             |            +-00.4
>>             |            +-00.5
>>             |            +-00.6
>>             |            \-00.7
>>             +-05.0-[09]----00.0
>>             +-06.0-[08]----00.0
>>             +-0a.0-[07]----00.0
>>             +-0b.0-[06]--+-00.0
>>             |            \-00.1
>>             +-0c.0-[05]----00.0
>>             +-0d.0-[04]--+-00.0
>>             |            +-00.1
>>             |            +-00.2
>>             |            +-00.3
>>             |            +-00.4
>>             |            +-00.5
>>             |            +-00.6
>>             |            \-00.7
>>             +-11.0
>>             +-12.0
>>             +-12.2
>>             +-13.0
>>             +-13.2
>>             +-14.0
>>             +-14.3
>>             +-14.4-[03]----06.0
>>             +-14.5
>>             +-15.0-[02]--
>>             +-16.0
>>             +-16.2
>>             +-18.0
>>             +-18.1
>>             +-18.2
>>             +-18.3
>>             \-18.4
>>
>>
>>
>>
>>
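The "device id" printed in the AMD-Vi IO_PAGE_FAULT lines is a 16-bit PCI requester ID. A minimal C sketch, assuming the standard packing (bus in bits 15:8, device in bits 7:3, function in bits 2:0), maps the two faulting IDs onto the lspci -t tree above: 0x0a06 becomes 0a:00.6 (a function behind bridge 00:03.0) and 0x0700 becomes 07:00.0 (behind bridge 00:0a.0). The names `struct bdf`, `devid_to_bdf` and `format_bdf` are hypothetical helpers, not Xen or Linux functions.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper type for reading AMD-Vi "device id" values. */
struct bdf {
    unsigned bus; /* bits 15:8 of the requester ID */
    unsigned dev; /* bits 7:3 */
    unsigned fn;  /* bits 2:0 */
};

/* Split a 16-bit PCI requester ID into bus, device and function. */
static struct bdf devid_to_bdf(uint16_t devid)
{
    struct bdf b;

    b.bus = (devid >> 8) & 0xff;
    b.dev = (devid >> 3) & 0x1f;
    b.fn  = devid & 0x7;
    return b;
}

/* Format a requester ID in lspci's "bb:dd.f" notation. */
static void format_bdf(uint16_t devid, char *buf, size_t len)
{
    struct bdf b = devid_to_bdf(devid);

    snprintf(buf, len, "%02x:%02x.%x", b.bus, b.dev, b.fn);
}
```

For example, `format_bdf(0x0a06, buf, sizeof buf)` fills `buf` with "0a:00.6", the USB function that lspci -t lists behind bridge 03.0.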
>>> Thanks,
>>> Wei
>>
>>>> I will try to make a complete package, and try with only one PV domain (the one the devices are passed through to), just to simplify the setup.
>>>>
>>>>
>>>>> * I would also like to know the symptoms of device 0x0700 when the IO_PF
>>>>> happened. Did it stop working?
>>>>
>>>> Yes, it stops working: the video capture just freezes, but the driver doesn't bail out.
>>>> For the USB controller (0x0a06), it starts to give errors for usbdev_open in the guest.
>>>>
>>>>> (BTW: I copied a few options from your boot cmd line and they worked with
>>>>> my RD890 system
>>>>
>>>>> dom0_mem=1024M,max:1024M loglvl=all loglvl_guest=all console_timestamps
>>>>> cpuidle cpufreq=xen noreboot debug lapic=debug apic_verbosity=debug
>>>>> apic=debug iommu=on,verbose,debug,no-sharept
>>>>
>>>>> * So, what OEM board do you have?)
>>>>
>>>> MSI 890FXA-GD70
>>>>
>>>>> Also, these lines from your log look very strange:
>>>>
>>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to read-only memory page. gfn=0xd5, mfn=0xa4a11
>>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to read-only memory page. gfn=0xd7, mfn=0xa4a0f
>>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to read-only memory page. gfn=0xd9, mfn=0xa4a0d
>>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to read-only memory page. gfn=0xdb, mfn=0xa4a0b
>>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to read-only memory page. gfn=0xdd, mfn=0xa4a09
>>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to read-only memory page. gfn=0xdf, mfn=0xa4a07
>>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to read-only memory page. gfn=0xe1, mfn=0xa4a05
>>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to read-only memory page. gfn=0xe3, mfn=0xa4a03
>>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to read-only memory page. gfn=0xe5, mfn=0xa4a01
>>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to read-only memory page. gfn=0xe7, mfn=0xa463f
>>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to read-only memory page. gfn=0xe9, mfn=0xa463d
>>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to read-only memory page. gfn=0xeb, mfn=0xa463b
>>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to read-only memory page. gfn=0xed, mfn=0xa4639
>>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to read-only memory page. gfn=0xef, mfn=0xa4637
>>>>> (XEN) [2012-09-04 16:13:56] AMD-Vi: IO_PAGE_FAULT: domain = 0, device id = 0x0a06, fault address = 0xc2c2c2c0
>>>>> (XEN) [2012-09-04 16:13:56] AMD-Vi: IO_PAGE_FAULT: domain = 14, device id = 0x0700, fault address = 0xa90f8300
>>>>> (XEN) [2012-09-04 16:13:56] AMD-Vi: IO_PAGE_FAULT: domain = 14, device id = 0x0700, fault address = 0xa90f8340
>>>>> (XEN) [2012-09-04 16:13:56] AMD-Vi: IO_PAGE_FAULT: domain = 14, device id = 0x0700, fault address = 0xa90f8380
>>>>> (XEN) [2012-09-04 16:13:56] AMD-Vi: IO_PAGE_FAULT: domain = 14, device id = 0x0700, fault address = 0xa90f83c0
>>>>
>>>>> * They are just followed by the IO page fault. Do you know where
>>>>> they are from? Your video card driver, maybe?
>>>>
>>>> From an HVM domain with an old (3.0.3) kernel, but the faults also occur without this domain being started.
>>>>
>>>>
>>>>> Thanks,
>>>>> Wei
>>>>
>>>>
>>>>>> Complete xl dmesg and lspci -vvvknn attached.
>>>>>>
>>>>>> Thx
>>>>>>
>>>>>> --
>>>>>> Sander
>>>>
>>>>
>>>>
>>>>
>>>>
>>
>>

Jan Beulich

2012-Sep-07 10:06 UTC

Re: [PATCH] amd iommu: Dump flags of IO page faults

>>> On 07.09.12 at 12:00, Sander Eikelenboom <linux@eikelenboom.it> wrote:
> Friday, September 7, 2012, 11:53:36 AM, you wrote:
> 
>>>>> On 07.09.12 at 09:32, Sander Eikelenboom <linux@eikelenboom.it> wrote:
>>> 4.1:    Capabilities: [54] MSI: Enable- Count=1/1 Maskable- 64bit+
>>> 4.2:    Capabilities: [54] MSI: Enable+ Count=1/1 Maskable- 64bit+
> 
>> No surprise you're not seeing any faults on 4.1 - there's no way
>> they could get reported. I'm somewhat hesitant to pull the
>> workaround patch from 4.2 into 4.1, as it's wrong for the kernel
>> to touch the MSI setting of the IOMMU (which is under the
>> control of Xen) in the first place, but the kernel-side patch I had
>> submitted a while ago wasn't received well. And that patch isn't
>> really small, and it would remain to be seen what other
>> dependencies it would have...
> 
> OK, so that would mean that in the 4.1 case, the IOMMU is doing nothing?
No, it just can't report faults (they would need to be polled for).
Also, saying "in the 4.1 case" is wrong here - this really depends
on whether you have an affected Dom0 kernel. Things work fine
if the Dom0 kernel doesn't trample over Xen's setup.
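The remark that faults "would need to be polled for" can be illustrated with a simplified model of an event log ring: the hardware appends entries at the tail, software consumes them from the head, and without an interrupt something has to check for head != tail periodically. The C sketch below is hypothetical and is not Xen's code: the real IOMMU exposes its head and tail pointers through MMIO registers, which are modelled here as plain indices. The 16-byte entry size and the event code 0x2 for IO_PAGE_FAULT in bits 31:28 of the second dword are taken from the AMD IOMMU event log format as assumed here; `struct evt_log` and `evt_log_poll` are illustrative names.

```c
#include <stdint.h>

#define EVT_LOG_ENTRIES   8    /* hypothetical ring size */
#define EVT_CODE_SHIFT    28   /* event code: bits 31:28 of dword 1 (assumed) */
#define EVT_IO_PAGE_FAULT 0x2  /* assumed IO_PAGE_FAULT event code */

/* Hypothetical in-memory model of an IOMMU event log ring. */
struct evt_log {
    uint32_t entry[EVT_LOG_ENTRIES][4]; /* 16-byte entries as four dwords */
    unsigned head; /* next entry software will consume */
    unsigned tail; /* next entry hardware will fill */
};

/* Consume every pending entry and return how many were IO_PAGE_FAULTs.
 * With the IOMMU's MSI disabled nothing signals new entries, so a caller
 * would have to invoke this periodically, e.g. from a timer. */
static unsigned evt_log_poll(struct evt_log *ring)
{
    unsigned faults = 0;

    while (ring->head != ring->tail) {
        uint32_t code = ring->entry[ring->head][1] >> EVT_CODE_SHIFT;

        if (code == EVT_IO_PAGE_FAULT)
            faults++;
        ring->head = (ring->head + 1) % EVT_LOG_ENTRIES; /* wrap around */
    }
    return faults;
}
```

The modulo on the head index is what lets the consumer follow the producer across the end of the ring without a separate "wrapped" flag.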

Jan

Sander Eikelenboom

2012-Sep-07 10:15 UTC

Re: [PATCH] amd iommu: Dump flags of IO page faults

Friday, September 7, 2012, 12:06:31 PM, you wrote:
>>>> On 07.09.12 at 12:00, Sander Eikelenboom <linux@eikelenboom.it> wrote:
>> Friday, September 7, 2012, 11:53:36 AM, you wrote:
>> 
>>>>>> On 07.09.12 at 09:32, Sander Eikelenboom <linux@eikelenboom.it> wrote:
>>>> 4.1:    Capabilities: [54] MSI: Enable- Count=1/1 Maskable- 64bit+
>>>> 4.2:    Capabilities: [54] MSI: Enable+ Count=1/1 Maskable- 64bit+
>> 
>>> No surprise you're not seeing any faults on 4.1 - there's no way
>>> they could get reported. I'm somewhat hesitant to pull the
>>> workaround patch from 4.2 into 4.1, as it's wrong for the kernel
>>> to touch the MSI setting of the IOMMU (which is under the
>>> control of Xen) in the first place, but the kernel-side patch I had
>>> submitted a while ago wasn't received well. And that patch isn't
>>> really small, and it would remain to be seen what other
>>> dependencies it would have...
>> 
>> OK, so that would mean that in the 4.1 case, the IOMMU is doing nothing?
> No, it just can't report faults (they would need to be polled for).
> Also, saying "in the 4.1 case" is wrong here - this really depends
> on whether you have an affected Dom0 kernel. Things work fine
> if the Dom0 kernel doesn't trample over Xen's setup.
Except for the IO_PAGE_FAULT in xl dmesg, which, with xen-4.2, is reported
before the kernel even gets loaded. I don't see that one on 4.1 either, so
that wouldn't add up to it being a dom0 kernel problem only...

> Jan

Jan Beulich

2012-Sep-07 11:17 UTC

Re: [PATCH] amd iommu: Dump flags of IO page faults

>>> On 07.09.12 at 12:15, Sander Eikelenboom <linux@eikelenboom.it> wrote:
>> No, it just can't report faults (they would need to be polled for).
>> Also, saying "in the 4.1 case" is wrong here - this really depends
>> on whether you have an affected Dom0 kernel. Things work fine
>> if the Dom0 kernel doesn't trample over Xen's setup.
> 
> Except for the IO_PAGE_FAULT in xl dmesg, which, with xen-4.2, is reported
> before the kernel even gets loaded. I don't see that one on 4.1 either, so
> that wouldn't add up to it being a dom0 kernel problem only...
Oh, yes, that certainly is a hypervisor-only bug (if it is a bug
under our control at all - as said before, a babbling device,
e.g. one left active by the BIOS, is likely beyond our control).

Jan

Jan Beulich

2012-Sep-07 11:29 UTC

Re: [PATCH] amd iommu: Dump flags of IO page faults

>>> On 07.09.12 at 12:01, Sander Eikelenboom <linux@eikelenboom.it> wrote:
>>> 4.1:    Capabilities: [54] MSI: Enable- Count=1/1 Maskable- 64bit+
> 
>> Eh... That is interesting. So which dom0 are you using? There is a c/s
>> in 4.2 to prevent recent dom0 kernels from disabling the IOMMU interrupt
>> (changeset 25492:61844569a432). Otherwise, the IOMMU cannot send any events,
>> including IO page faults. You could try reverting dom0 to an old version
>> like 2.6 pv_ops to see if you really have no IO page faults on 4.1.
> 
> OK, I will give that a try; only dom0 will have to be a 2.6 pv_ops, I assume?
You could also drop the patch below into a kernel that has
the problematic change. It will require

#define PCI_CLASS_SYSTEM_IOMMU		0x0806

to be added at a suitable spot in include/linux/pci_ids.h.

Jan

--- head.orig/drivers/pci/msi.c
+++ head/drivers/pci/msi.c
@@ -20,6 +20,7 @@
 #include <linux/errno.h>
 #include <linux/io.h>
 #include <linux/slab.h>
+#include <xen/xen.h>
 
 #include "pci.h"
 #include "msi.h"
@@ -1022,7 +1023,13 @@ void pci_msi_init_pci_dev(struct pci_dev
 	/* Disable the msi hardware to avoid screaming interrupts
 	 * during boot.  This is the power on reset default so
 	 * usually this should be a noop.
+	 * But on a Xen host don't do this for IOMMUs which the hypervisor
+	 * is in control of (and hence has already enabled on purpose).
 	 */
+	if (xen_initial_domain()
+	    && (dev->class >> 8) == PCI_CLASS_SYSTEM_IOMMU
+	    && dev->vendor == PCI_VENDOR_ID_AMD)
+		return;
 	pos = pci_find_capability(dev, PCI_CAP_ID_MSI);
 	if (pos)
 		msi_set_enable(dev, pos, 0);
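The condition added by the patch relies on the `class` field of struct pci_dev holding the 24-bit class code (base class, subclass, programming interface), so `class >> 8` yields the 0x0806 that lspci shows as "Generic system peripheral [0806]". The C sketch below reproduces only that predicate, against a hypothetical `fake_pci_dev` stand-in; xen_initial_domain() is left out and a programming interface of 00 is assumed. It makes one detail visible: the lspci output quoted in this thread lists the RD990 IOMMU as [1002:5a23], i.e. vendor 0x1002 (PCI_VENDOR_ID_ATI), whereas PCI_VENDOR_ID_AMD is 0x1022. A device reporting the ATI vendor ID would therefore not satisfy the vendor test as posted, so it may be worth checking which vendor ID the IOMMU on a given board reports before relying on the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Vendor IDs as in include/linux/pci_ids.h; PCI_CLASS_SYSTEM_IOMMU is the
 * define the patch asks to add there. */
#define PCI_CLASS_SYSTEM_IOMMU 0x0806
#define PCI_VENDOR_ID_AMD      0x1022
#define PCI_VENDOR_ID_ATI      0x1002

/* Hypothetical stand-in for the two struct pci_dev fields the patch reads.
 * 'class' holds base class (23:16), subclass (15:8), prog-if (7:0). */
struct fake_pci_dev {
    uint32_t class;
    uint16_t vendor;
};

/* Mirrors the patch's test (minus xen_initial_domain()): true means
 * "leave this device's MSI enable bit alone". */
static bool skip_msi_disable(const struct fake_pci_dev *dev)
{
    return (dev->class >> 8) == PCI_CLASS_SYSTEM_IOMMU
        && dev->vendor == PCI_VENDOR_ID_AMD;
}
```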

Konrad Rzeszutek Wilk

2012-Sep-07 20:51 UTC

Re: [PATCH] amd iommu: Dump flags of IO page faults

On Fri, Sep 07, 2012 at 12:01:33PM +0200, Sander Eikelenboom wrote:
> 
> Friday, September 7, 2012, 10:54:40 AM, you wrote:
> 
> > On 09/07/2012 09:32 AM, Sander Eikelenboom wrote:
> >>
> >> Thursday, September 6, 2012, 5:03:05 PM, you wrote:
> >>
> >>> On 09/06/2012 03:50 PM, Sander Eikelenboom wrote:
> >>>>
> >>>> Thursday, September 6, 2012, 3:32:51 PM, you wrote:
> >>>>
> >>>>> On 09/06/2012 12:59 AM, Sander Eikelenboom wrote:
> >>>>>>
> >>>>>> Wednesday, September 5, 2012, 4:42:42 PM, you wrote:
> >>>>>>
> >>>>>>> Hi Jan,
> >>>>>>> Attached patch dumps io page fault flags. The flags show the reason of
> >>>>>>> the fault and tell us if this is an unmapped interrupt fault or a DMA fault.
> >>>>>>
> >>>>>>> Thanks,
> >>>>>>> Wei
> >>>>>>
> >>>>>>> signed-off-by: Wei Wang <wei.wang2@amd.com>
> >>>>>>
> >>>>>>
> >>>>>> I have applied the patch and the flags seem to differ between the faults:
> >>>>>>
> >>>>>> AMD-Vi: IO_PAGE_FAULT: domain = 0, device id = 0x0a06, fault address = 0xc2c2c2c0, flags = 0x000
> >>>>>> (XEN) [2012-09-05 20:54:16] AMD-Vi: IO_PAGE_FAULT: domain = 0, device id = 0x0a06, fault address = 0xc2c2c2c0, flags = 0x000
> >>>>>> (XEN) [2012-09-05 20:54:16] AMD-Vi: IO_PAGE_FAULT: domain = 14, device id = 0x0700, fault address = 0xa8d339e0, flags = 0x020
> >>>>>> (XEN) [2012-09-05 20:54:16] AMD-Vi: IO_PAGE_FAULT: domain = 14, device id = 0x0700, fault address = 0xa8d33a40, flags = 0x020
> >>>>
> >>>>> OK, so they are not interrupt requests. I guess further information from
> >>>>> your system would be helpful to debug this issue:
> >>>>> 1) xl info
> >>>>> 2) xl list
> >>>>> 3) lspci -vvv (NOTE: not in dom0 but in your guest)
> >>>>> 4) cat /proc/iomem (in both dom0 and your hvm guest)
> >>>>
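Why flags 0x000 and 0x020 rule out interrupt requests becomes clearer when the field is decoded bit by bit. The C sketch below assumes the IO_PAGE_FAULT flag layout from the AMD I/O Virtualization Technology (IOMMU) specification: I (interrupt request) at bit 3, PR (page present) at bit 4, RW (write) at bit 5 and PE (permission violation) at bit 6. The macro, struct and function names are hypothetical. Under that assumption, 0x000 is a DMA read and 0x020 a DMA write, both to a page that was not present, and neither has the I bit set.

```c
#include <stdbool.h>

/* Assumed IO_PAGE_FAULT flag bits (hypothetical macro names). */
#define IOPF_FLAG_I  (1u << 3) /* transaction was an interrupt request */
#define IOPF_FLAG_PR (1u << 4) /* page was present */
#define IOPF_FLAG_RW (1u << 5) /* 1 = write, 0 = read */
#define IOPF_FLAG_PE (1u << 6) /* permission violation */

struct iopf_info {
    bool is_interrupt;
    bool page_present;
    bool is_write;
    bool perm_violation;
};

/* Decode the "flags = 0x..." value printed by the patch. */
static struct iopf_info decode_iopf_flags(unsigned flags)
{
    struct iopf_info info;

    info.is_interrupt   = (flags & IOPF_FLAG_I) != 0;
    info.page_present   = (flags & IOPF_FLAG_PR) != 0;
    info.is_write       = (flags & IOPF_FLAG_RW) != 0;
    info.perm_violation = (flags & IOPF_FLAG_PE) != 0;
    return info;
}
```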
> >>>> dom14 is not an HVM guest, it's a PV guest.
> >>
> >>> Ah, I see. PV guest is quite different than hvm, it does use p2m tables
> >>> as io page tables. So no-sharept option does not work in this case. PV
> >>> guests always use separated io page tables. There might be some
> >>> incorrect mappings on the page tables. I will check this on my side.
> >>
> >> I have reverted the machine to xen-4.1.4-pre (changeset 23353) and kept everything else the same.
> >> I haven't seen any IO PAGE FAULTS after that.
> >>
> >> I did spot some differences in the output from lspci between xen 4.1 and 4.2, related to MSI enabled or not for the IOMMU device.
> >> Have attached the xl/xm dmesg and lspci from booting with both versions.
> >>
> >> lspci:
> >>
> >> 00:00.2 Generic system peripheral [0806]: ATI Technologies Inc RD990 I/O Memory Management Unit (IOMMU) [1002:5a23]
> >>          Subsystem: ATI Technologies Inc RD990 I/O Memory Management Unit (IOMMU) [1002:5a23]
> >>          Control: I/O- Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
> >>          Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
> >>          Latency: 0
> >>          Interrupt: pin A routed to IRQ 10
> >>          Capabilities: [40] Secure device <?>
> >> 4.1:    Capabilities: [54] MSI: Enable- Count=1/1 Maskable- 64bit+
> 
> > Eh... That is interesting. So which dom0 are you using? There is a c/s
> > in 4.2 to prevent recent dom0 kernels from disabling the IOMMU interrupt
> > (changeset 25492:61844569a432). Otherwise, the IOMMU cannot send any events,
> > including IO page faults. You could try reverting dom0 to an old version
> > like 2.6 pv_ops to see if you really have no IO page faults on 4.1.
> 
> OK, I will give that a try; only dom0 will have to be a 2.6 pv_ops, I assume?
> 
So the failure they are describing is due to:
http://lists.xen.org/archives/html/xen-devel/2012-06/msg00668.html

Or you can use the patch that Jan posted
http://lists.xen.org/archives/html/xen-devel/2012-06/msg01196.html

and use the existing kernel. But, more interestingly: is this device
(00:00.2) in the xen-pciback.hide arguments (if not, then don't worry)?

Sander Eikelenboom

2012-Sep-24 08:38 UTC

Re: [PATCH] amd iommu: Dump flags of IO page faults

Friday, September 7, 2012, 10:54:40 AM, you wrote:
> On 09/07/2012 09:32 AM, Sander Eikelenboom wrote:
>>
>> Thursday, September 6, 2012, 5:03:05 PM, you wrote:
>>
>>> On 09/06/2012 03:50 PM, Sander Eikelenboom wrote:
>>>>
>>>> Thursday, September 6, 2012, 3:32:51 PM, you wrote:
>>>>
>>>>> On 09/06/2012 12:59 AM, Sander Eikelenboom wrote:
>>>>>>
>>>>>> Wednesday, September 5, 2012, 4:42:42 PM, you wrote:
>>>>>>
>>>>>>> Hi Jan,
>>>>>>> Attached patch dumps io page fault flags. The flags show the reason of
>>>>>>> the fault and tell us if this is an unmapped interrupt fault or a DMA fault.
>>>>>>
>>>>>>> Thanks,
>>>>>>> Wei
>>>>>>
>>>>>>> signed-off-by: Wei Wang <wei.wang2@amd.com>
>>>>>>
>>>>>>
>>>>>> I have applied the patch and the flags seem to differ between the faults:
>>>>>>
>>>>>> AMD-Vi: IO_PAGE_FAULT: domain = 0, device id = 0x0a06, fault address = 0xc2c2c2c0, flags = 0x000
>>>>>> (XEN) [2012-09-05 20:54:16] AMD-Vi: IO_PAGE_FAULT: domain = 0, device id = 0x0a06, fault address = 0xc2c2c2c0, flags = 0x000
>>>>>> (XEN) [2012-09-05 20:54:16] AMD-Vi: IO_PAGE_FAULT: domain = 14, device id = 0x0700, fault address = 0xa8d339e0, flags = 0x020
>>>>>> (XEN) [2012-09-05 20:54:16] AMD-Vi: IO_PAGE_FAULT: domain = 14, device id = 0x0700, fault address = 0xa8d33a40, flags = 0x020
>>>>
>>>>> OK, so they are not interrupt requests. I guess further information from
>>>>> your system would be helpful to debug this issue:
>>>>> 1) xl info
>>>>> 2) xl list
>>>>> 3) lspci -vvv (NOTE: not in dom0 but in your guest)
>>>>> 4) cat /proc/iomem (in both dom0 and your hvm guest)
>>>>
>>>> dom14 is not an HVM guest, it's a PV guest.
>>
>>> Ah, I see. PV guest is quite different than hvm, it does use p2m tables
>>> as io page tables. So no-sharept option does not work in this case. PV
>>> guests always use separated io page tables. There might be some
>>> incorrect mappings on the page tables. I will check this on my side.
>>
>> I have reverted the machine to xen-4.1.4-pre (changeset 23353) and kept everything else the same.
>> I haven't seen any IO PAGE FAULTS after that.
>>
>> I did spot some differences in the output from lspci between xen 4.1 and 4.2, related to MSI enabled or not for the IOMMU device.
>> Have attached the xl/xm dmesg and lspci from booting with both versions.
>>
>> lspci:
>>
>> 00:00.2 Generic system peripheral [0806]: ATI Technologies Inc RD990 I/O Memory Management Unit (IOMMU) [1002:5a23]
>>          Subsystem: ATI Technologies Inc RD990 I/O Memory Management Unit (IOMMU) [1002:5a23]
>>          Control: I/O- Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
>>          Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
>>          Latency: 0
>>          Interrupt: pin A routed to IRQ 10
>>          Capabilities: [40] Secure device <?>
>> 4.1:    Capabilities: [54] MSI: Enable- Count=1/1 Maskable- 64bit+
> Eh... That is interesting. So which dom0 are you using? There is a c/s
> in 4.2 to prevent recent dom0 kernels from disabling the IOMMU interrupt
> (changeset 25492:61844569a432). Otherwise, the IOMMU cannot send any events,
> including IO page faults. You could try reverting dom0 to an old version
> like 2.6 pv_ops to see if you really have no IO page faults on 4.1.
OK, I finally got the time to do some more testing: I tested 4.2 around that
changeset, and made a copy of the guest using HVM instead of PV.

The results:
- On xen-4.1.* and a 3.6-rc6 kernel (dom0 and domU): the passed-through video device works fine in both an HVM and a PV guest; I don't see IO page faults getting reported.
- On xen-4.2 changeset < 25492 and a 3.6-rc6 kernel (dom0 and domU): the passed-through video device works fine in both an HVM and a PV guest; I don't see IO page faults getting reported.
- On xen-4.2 changeset > 25492 and a 3.6-rc6 kernel (dom0 and domU): the passed-through video device works fine for a short while (around 5 to 10 minutes) in a PV guest; after that, IO page faults get reported and the video freezes. I don't see any errors in the guest, though.
- On xen-unstable tip and a 3.6-rc6 kernel (dom0 and domU):
  - PV: the passed-through video device works fine for a short while (around 5 to 10 minutes); after that, IO page faults get reported and the video freezes. I don't see any errors in the guest, though.
  - HVM: the passed-through video device doesn't work from the start:
    - The device is there according to lspci.
    - The video application starts fine, but delivers a green image, so the device is not working properly. I don't see IO page faults, though.

Attached are (all with xen-unstable tip and the guest as HVM, domain 15):
- xl dmesg
- Patch which adds some more info, but all reported values seem to be zero (see xl dmesg)
- lspci dom0
- lspci HVM guest



>> 4.2:    Capabilities: [54] MSI: Enable+ Count=1/1 Maskable- 64bit+
>>                  Address: 00000000fee0100c  Data: 4128
>>          Capabilities: [64] HyperTransport: MSI Mapping Enable+ Fixed+
>>
>> Although it seems enabled, shouldn't the IRQ number used be much higher than 10 for MSI interrupts?
> The IRQ number is fine. MSI vector is stored at  Data: 4128
>>
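Wei's answer is that with MSI the interrupt is identified by the vector in the message data, not by the legacy "pin A routed to IRQ 10" line. Decoding the 4.2 values (Address 00000000fee0100c, Data 4128; lspci prints both in hex) with the usual x86 MSI address/data layout, which is assumed here rather than taken from this thread, gives vector 0x28, lowest-priority delivery, edge trigger, and destination ID 0x01 in logical destination mode. `decode_msi` and `struct msi_decoded` are hypothetical names used only for this sketch.

```c
#include <stdint.h>

/* Assumed x86 MSI layout: address bits 19:12 = destination ID, bit 3 =
 * redirection hint, bit 2 = destination mode; data bits 7:0 = vector,
 * bits 10:8 = delivery mode, bit 15 = trigger mode. */
struct msi_decoded {
    unsigned dest_id;
    unsigned redir_hint;
    unsigned dest_mode;     /* 0 = physical, 1 = logical */
    unsigned vector;
    unsigned delivery_mode; /* 0 = fixed, 1 = lowest priority */
    unsigned level_trigger; /* 0 = edge, 1 = level */
};

static struct msi_decoded decode_msi(uint64_t addr, uint32_t data)
{
    struct msi_decoded m;

    m.dest_id       = (unsigned)((addr >> 12) & 0xff);
    m.redir_hint    = (unsigned)((addr >> 3) & 1);
    m.dest_mode     = (unsigned)((addr >> 2) & 1);
    m.vector        = data & 0xff;
    m.delivery_mode = (data >> 8) & 0x7;
    m.level_trigger = (data >> 15) & 1;
    return m;
}
```

The IRQ 10 shown by lspci is the INTx routing of the pin, which is unused once MSI is enabled; the vector field is what actually selects the interrupt handler.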
>> There is another difference in the bridge device that's in front of the 0a:00.6 device that faults before the kernel is even booted.
>>
>> 00:03.0 PCI bridge [0604]: ATI Technologies Inc RD890 PCI to PCI bridge (PCI express gpp port C) [1002:5a17] (prog-if 00 [Normal decode])
>>          Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+
>> 4.1:    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
>> 4.2:    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort+ <MAbort- >SERR- <PERR- INTx-
>>          Latency: 0, Cache Line Size: 64 bytes
>>          Bus: primary=00, secondary=0a, subordinate=0a, sec-latency=0
>>          I/O behind bridge: 0000f000-00000fff
>>          Memory behind bridge: f9f00000-f9ffffff
>>          Prefetchable memory behind bridge: 00000000fff00000-00000000000fffff
>> 4.1:    Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
>> 4.2:    Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort+ <TAbort- <MAbort- <SERR- <PERR-







Wei Wang

2012-Sep-24 12:24 UTC

Re: [PATCH] amd iommu: Dump flags of IO page faults

On 09/24/2012 10:38 AM, Sander Eikelenboom wrote:
>
> Friday, September 7, 2012, 10:54:40 AM, you wrote:
>
>> On 09/07/2012 09:32 AM, Sander Eikelenboom wrote:
>>>
>>> Thursday, September 6, 2012, 5:03:05 PM, you wrote:
>>>
>>>> On 09/06/2012 03:50 PM, Sander Eikelenboom wrote:
>>>>>
>>>>> Thursday, September 6, 2012, 3:32:51 PM, you wrote:
>>>>>
>>>>>> On 09/06/2012 12:59 AM, Sander Eikelenboom wrote:
>>>>>>>
>>>>>>> Wednesday, September 5, 2012, 4:42:42 PM, you wrote:
>>>>>>>
>>>>>>>> Hi Jan,
>>>>>>>> Attached patch dumps io page fault flags. The flags show the reason of
>>>>>>>> the fault and tell us if this is an unmapped interrupt fault or a DMA fault.
>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Wei
>>>>>>>
>>>>>>>> signed-off-by: Wei Wang <wei.wang2@amd.com>
>>>>>>>
>>>>>>>
>>>>>>> I have applied the patch and the flags seem to differ between the faults:
>>>>>>>
>>>>>>> AMD-Vi: IO_PAGE_FAULT: domain = 0, device id = 0x0a06, fault address = 0xc2c2c2c0, flags = 0x000
>>>>>>> (XEN) [2012-09-05 20:54:16] AMD-Vi: IO_PAGE_FAULT: domain = 0, device id = 0x0a06, fault address = 0xc2c2c2c0, flags = 0x000
>>>>>>> (XEN) [2012-09-05 20:54:16] AMD-Vi: IO_PAGE_FAULT: domain = 14, device id = 0x0700, fault address = 0xa8d339e0, flags = 0x020
>>>>>>> (XEN) [2012-09-05 20:54:16] AMD-Vi: IO_PAGE_FAULT: domain = 14, device id = 0x0700, fault address = 0xa8d33a40, flags = 0x020
>>>>>
>>>>>> OK, so they are not interrupt requests. I guess further information from
>>>>>> your system would be helpful to debug this issue:
>>>>>> 1) xl info
>>>>>> 2) xl list
>>>>>> 3) lspci -vvv (NOTE: not in dom0 but in your guest)
>>>>>> 4) cat /proc/iomem (in both dom0 and your hvm guest)
>>>>>
>>>>> dom14 is not an HVM guest, it's a PV guest.
>>>
>>>> Ah, I see. PV guest is quite different than hvm, it does use p2m tables
>>>> as io page tables. So no-sharept option does not work in this case. PV
>>>> guests always use separated io page tables. There might be some
>>>> incorrect mappings on the page tables. I will check this on my side.
>>>
>>> I have reverted the machine to xen-4.1.4-pre (changeset 23353) and kept everything else the same.
>>> I haven't seen any IO PAGE FAULTS after that.
>>>
>>> I did spot some differences in the output from lspci between xen 4.1 and 4.2, related to MSI enabled or not for the IOMMU device.
>>> Have attached the xl/xm dmesg and lspci from booting with both versions.
>>>
>>> lspci:
>>>
>>> 00:00.2 Generic system peripheral [0806]: ATI Technologies Inc RD990 I/O Memory Management Unit (IOMMU) [1002:5a23]
>>>           Subsystem: ATI Technologies Inc RD990 I/O Memory Management Unit (IOMMU) [1002:5a23]
>>>           Control: I/O- Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
>>>           Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
>>>           Latency: 0
>>>           Interrupt: pin A routed to IRQ 10
>>>           Capabilities: [40] Secure device <?>
>>> 4.1:    Capabilities: [54] MSI: Enable- Count=1/1 Maskable- 64bit+
>
>> Eh... That is interesting. So which dom0 are you using? There is a c/s
>> in 4.2 to prevent recent dom0 kernels from disabling the IOMMU interrupt
>> (changeset 25492:61844569a432). Otherwise, the IOMMU cannot send any events,
>> including IO page faults. You could try reverting dom0 to an old version
>> like 2.6 pv_ops to see if you really have no IO page faults on 4.1.
>
> OK, I finally got the time to do some more testing: I tested 4.2 around that
> changeset, and made a copy of the guest using HVM instead of PV.
>
> The results:
> - On xen-4.1.* and a 3.6-rc6 kernel (dom0 and domU): the passed-through video device works fine in both an HVM and a PV guest; I don't see IO page faults getting reported.
> - On xen-4.2 changeset < 25492 and a 3.6-rc6 kernel (dom0 and domU): the passed-through video device works fine in both an HVM and a PV guest; I don't see IO page faults getting reported.
> - On xen-4.2 changeset > 25492 and a 3.6-rc6 kernel (dom0 and domU): the passed-through video device works fine for a short while (around 5 to 10 minutes) in a PV guest; after that, IO page faults get reported and the video freezes. I don't see any errors in the guest, though.
> - On xen-unstable tip and a 3.6-rc6 kernel (dom0 and domU):
>   - PV: the passed-through video device works fine for a short while (around 5 to 10 minutes); after that, IO page faults get reported and the video freezes. I don't see any errors in the guest, though.
>   - HVM: the passed-through video device doesn't work from the start:
>     - The device is there according to lspci.
>     - The video application starts fine, but delivers a green image, so the device is not working properly. I don't see IO page faults, though.
>
> Attached are (all with xen-unstable tip and the guest as HVM, domain 15):
> - xl dmesg
> - Patch which adds some more info, but all reported values seem to be zero (see xl dmesg)
> - lspci dom0
> - lspci HVM guest
Hi,
Thanks for the information, it is very helpful for debugging. I hope to
start looking at this right after sending my next IOMMU patch queue
upstream. Another question: did you see this issue on a system with a
single PV/HVM guest, or only on a system with about 16 running VMs?

Thanks,
Wei
>
>
>
>>> 4.2:    Capabilities: [54] MSI: Enable+ Count=1/1 Maskable- 64bit+
>>>                   Address: 00000000fee0100c  Data: 4128
>>>           Capabilities: [64] HyperTransport: MSI Mapping Enable+ Fixed+
>>>
>>> Although it seems enabled, shouldn't the IRQ number used be much higher than 10 for MSI interrupts?
>
>> The IRQ number is fine. The MSI vector is stored in the Data field (Data: 4128).
>
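To make Wei's point concrete: the "IRQ 10" line is only the legacy INTx routing, while MSI delivery is described entirely by the Address/Data pair. Below is a minimal Python sketch that splits that pair into its fields, assuming the conventional x86 MSI bit layout (destination ID in address bits 19:12, vector in data bits 7:0, and so on). `decode_msi` is a hypothetical helper, not part of Xen or lspci; lspci prints the Data value in hex.

```python
def decode_msi(address: int, data: int) -> dict:
    """Split an x86 MSI address/data pair into its fields.

    Bit positions follow the conventional x86 MSI layout as we understand it
    (an assumption; check the Intel SDM / AMD APM for your platform).
    """
    return {
        # Address register (0xFEExxxxx)
        "dest_id": (address >> 12) & 0xFF,          # bits 19:12, destination APIC ID
        "redirection_hint": (address >> 3) & 0x1,   # bit 3
        "dest_mode_logical": (address >> 2) & 0x1,  # bit 2: 1 = logical, 0 = physical
        # Data register
        "vector": data & 0xFF,                      # bits 7:0, IDT vector
        "delivery_mode": (data >> 8) & 0x7,         # bits 10:8: 0 = fixed, 1 = lowest priority
        "level_assert": (data >> 14) & 0x1,         # bit 14
        "level_triggered": (data >> 15) & 0x1,      # bit 15: 0 = edge, 1 = level
    }


# Values from the Xen 4.2 lspci output quoted above.
fields = decode_msi(0xFEE0100C, 0x4128)
print(fields)
```

With the 4.2 values this reports vector 0x28 (40), delivery mode 1 (lowest priority), edge triggering and logical destination 0x01, which is consistent with Wei's remark that the IRQ number shown by lspci says nothing about the MSI vector.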
>>>
>>> There is another difference in the bridge device that's in front of the 0a:00.6 device that faults before the kernel is even booted.
>>>
>>> 00:03.0 PCI bridge [0604]: ATI Technologies Inc RD890 PCI to PCI bridge (PCI express gpp port C) [1002:5a17] (prog-if 00 [Normal decode])
>>>           Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+
>>> 4.1:    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
>>> 4.2:    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort+ <MAbort- >SERR- <PERR- INTx-
>>>           Latency: 0, Cache Line Size: 64 bytes
>>>           Bus: primary=00, secondary=0a, subordinate=0a, sec-latency=0
>>>           I/O behind bridge: 0000f000-00000fff
>>>           Memory behind bridge: f9f00000-f9ffffff
>>>           Prefetchable memory behind bridge: 00000000fff00000-00000000000fffff
>>> 4.1:    Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
>>> 4.2:    Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort+ <TAbort- <MAbort- <SERR- <PERR-
>>>           BridgeCtl: Parity+ SERR+ NoISA+ VGA- MAbort- >Reset- FastB2B-
>>>                   PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
>>>           Capabilities: [50] Power Management version 3
>>>                   Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
>>>                   Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
>>>           Capabilities: [58] Express (v2) Root Port (Slot+), MSI 00
>>>                   DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
>>>                           ExtTag+ RBE+ FLReset-
>>>                   DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
>>>                           RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
>>>                           MaxPayload 128 bytes, MaxReadReq 128 bytes
>>>                   DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
>>>                   LnkCap: Port #1, Speed 5GT/s, Width x8, ASPM L0s L1, Latency L0 <1us, L1 <8us
>>>                           ClockPM- Surprise- LLActRep+ BwNot+
>>>                   LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk-
>>>                           ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
>>>                   LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive+ BWMgmt+ ABWMgmt-
>>>                   SltCap: AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug- Surprise-
>>>                           Slot #3, PowerLimit 10.000W; Interlock- NoCompl+
>>>                   SltCtl: Enable: AttnBtn- PwrFlt- MRL- PresDet- CmdCplt- HPIrq- LinkChg-
>>>                           Control: AttnInd Unknown, PwrInd Unknown, Power- Interlock-
>>>                   SltSta: Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet+ Interlock-
>>>                           Changed: MRL- PresDet+ LinkState+
>
>> That is probably because of the IO_PAGE_FAULT.
>
>> Thanks,
>> Wei
>
>>> serveerstertje:~# lspci -t
>>> -[0000:00]-+-00.0
>>>              +-00.2
>>>              +-02.0-[0b]----00.0
>>>              +-03.0-[0a]--+-00.0
>>>              |            +-00.1
>>>              |            +-00.2
>>>              |            +-00.3
>>>              |            +-00.4
>>>              |            +-00.5
>>>              |            +-00.6
>>>              |            \-00.7
>>>              +-05.0-[09]----00.0
>>>              +-06.0-[08]----00.0
>>>              +-0a.0-[07]----00.0
>>>              +-0b.0-[06]--+-00.0
>>>              |            \-00.1
>>>              +-0c.0-[05]----00.0
>>>              +-0d.0-[04]--+-00.0
>>>              |            +-00.1
>>>              |            +-00.2
>>>              |            +-00.3
>>>              |            +-00.4
>>>              |            +-00.5
>>>              |            +-00.6
>>>              |            \-00.7
>>>              +-11.0
>>>              +-12.0
>>>              +-12.2
>>>              +-13.0
>>>              +-13.2
>>>              +-14.0
>>>              +-14.3
>>>              +-14.4-[03]----06.0
>>>              +-14.5
>>>              +-15.0-[02]--
>>>              +-16.0
>>>              +-16.2
>>>              +-18.0
>>>              +-18.1
>>>              +-18.2
>>>              +-18.3
>>>              \-18.4
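Sander's remark about "the bridge device that's in front of the 0a:00.6 device" follows from the bus ranges: a device sits behind the bridge whose secondary..subordinate bus window contains its bus number. A small Python sketch of that lookup is below. Only the 00:03.0 range (secondary=0a, subordinate=0a) is confirmed by the lspci -vvv output above; the other entries are assumptions read off the `lspci -t` tree (one bus per bridge), and `upstream_bridge` is a hypothetical helper.

```python
from typing import Dict, Optional, Tuple

# Bridge BDF -> (secondary bus, subordinate bus). Only 00:03.0 is confirmed by
# lspci -vvv; the rest are inferred from the tree, assuming one bus per bridge.
BRIDGES: Dict[str, Tuple[int, int]] = {
    "00:02.0": (0x0B, 0x0B),
    "00:03.0": (0x0A, 0x0A),
    "00:05.0": (0x09, 0x09),
    "00:06.0": (0x08, 0x08),
    "00:0a.0": (0x07, 0x07),
    "00:0b.0": (0x06, 0x06),
    "00:0c.0": (0x05, 0x05),
    "00:0d.0": (0x04, 0x04),
    "00:14.4": (0x03, 0x03),
    "00:15.0": (0x02, 0x02),
}


def upstream_bridge(bdf: str, bridges: Dict[str, Tuple[int, int]] = BRIDGES) -> Optional[str]:
    """Return the innermost bridge whose bus window contains the device's bus."""
    bus = int(bdf.split(":")[0], 16)
    candidates = [
        (subordinate - secondary, bridge)
        for bridge, (secondary, subordinate) in bridges.items()
        if secondary <= bus <= subordinate
    ]
    # With nested bridges several windows can match; the narrowest one is the
    # bridge closest to the device. Devices on the root bus have no bridge.
    return min(candidates)[1] if candidates else None


print(upstream_bridge("0a:00.6"))
print(upstream_bridge("07:00.0"))
```

Picking the narrowest matching window keeps the lookup correct if a bridge further up covers a wider range of buses.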
>>>
>>>
>>>
>>>
>>>
>>>> Thanks,
>>>> Wei
>>>
>>>>> I will try to make a complete package, and try with only one PV domain, to which the devices are passed through, just to simplify the setup.
>>>>>
>>>>>
>>>>>> * I would also like to know the symptoms of device 0x0700 when IO_PF
>>>>>> happened. Did it stop working?
>>>>>
>>>>> Yes, it stops working: the video capture just freezes, but the driver doesn't bail out.
>>>>> For the USB controller (0x0a06), it starts to give errors for usbdev_open in the guest.
>>>>>
>>>>>> (BTW: I copied a few options from your boot cmd line and it worked with
>>>>>> my RD890 system
>>>>>
>>>>>> dom0_mem=1024M,max:1024M loglvl=all loglvl_guest=all console_timestamps
>>>>>> cpuidle cpufreq=xen noreboot debug lapic=debug apic_verbosity=debug
>>>>>> apic=debug iommu=on,verbose,debug,no-sharept
>>>>>
>>>>>> * So, which OEM board do you have?)
>>>>>
>>>>> MSI 890FXA-GD70
>>>>>
>>>>>> Also, from your log, these lines look very strange:
>>>>>
>>>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to read-only memory page. gfn=0xd5, mfn=0xa4a11
>>>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to read-only memory page. gfn=0xd7, mfn=0xa4a0f
>>>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to read-only memory page. gfn=0xd9, mfn=0xa4a0d
>>>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to read-only memory page. gfn=0xdb, mfn=0xa4a0b
>>>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to read-only memory page. gfn=0xdd, mfn=0xa4a09
>>>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to read-only memory page. gfn=0xdf, mfn=0xa4a07
>>>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to read-only memory page. gfn=0xe1, mfn=0xa4a05
>>>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to read-only memory page. gfn=0xe3, mfn=0xa4a03
>>>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to read-only memory page. gfn=0xe5, mfn=0xa4a01
>>>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to read-only memory page. gfn=0xe7, mfn=0xa463f
>>>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to read-only memory page. gfn=0xe9, mfn=0xa463d
>>>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to read-only memory page. gfn=0xeb, mfn=0xa463b
>>>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to read-only memory page. gfn=0xed, mfn=0xa4639
>>>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to read-only memory page. gfn=0xef, mfn=0xa4637
>>>>>> (XEN) [2012-09-04 16:13:56] AMD-Vi: IO_PAGE_FAULT: domain = 0, device id = 0x0a06, fault address = 0xc2c2c2c0
>>>>>> (XEN) [2012-09-04 16:13:56] AMD-Vi: IO_PAGE_FAULT: domain = 14, device id = 0x0700, fault address = 0xa90f8300
>>>>>> (XEN) [2012-09-04 16:13:56] AMD-Vi: IO_PAGE_FAULT: domain = 14, device id = 0x0700, fault address = 0xa90f8340
>>>>>> (XEN) [2012-09-04 16:13:56] AMD-Vi: IO_PAGE_FAULT: domain = 14, device id = 0x0700, fault address = 0xa90f8380
>>>>>> (XEN) [2012-09-04 16:13:56] AMD-Vi: IO_PAGE_FAULT: domain = 14, device id = 0x0700, fault address = 0xa90f83c0
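The patch under discussion adds a flags field to these IO_PAGE_FAULT messages. As a sketch of how such a line might be decoded, the Python helper below turns the AMD-Vi device ID (a 16-bit PCI requester ID) into bus:device.function and names the flag bits. The flag names and bit positions are our reading of the AMD IOMMU specification's IO_PAGE_FAULT event and should be treated as an assumption; the regular expression only matches the message format shown in this thread, with the flags field optional as in the older logs above. All names here are hypothetical.

```python
import re
from typing import Optional

# IO_PAGE_FAULT flag bits as we read the AMD IOMMU spec's event-log layout
# (an assumption; verify against the spec revision for your hardware).
FLAG_NAMES = {
    0x001: "GN",  # guest/nested translation
    0x002: "NX",  # no-execute
    0x004: "US",  # user/supervisor
    0x008: "I",   # set: interrupt request; clear: memory transaction
    0x010: "PR",  # page present
    0x020: "RW",  # set: write transaction
    0x040: "PE",  # permission error
    0x080: "RZ",  # reserved bit not zero
    0x100: "TR",  # translation request
}

FAULT_RE = re.compile(
    r"IO_PAGE_FAULT:\s*domain\s*=\s*(?P<domain>\d+),\s*"
    r"device id\s*=\s*(?P<devid>0x[0-9a-fA-F]+),\s*"
    r"fault address\s*=\s*(?P<addr>0x[0-9a-fA-F]+)"
    r"(?:,\s*flags\s*=\s*(?P<flags>0x[0-9a-fA-F]+))?"
)


def devid_to_bdf(devid: int) -> str:
    """Requester ID layout: bus[15:8], device[7:3], function[2:0]."""
    return f"{devid >> 8:02x}:{(devid >> 3) & 0x1F:02x}.{devid & 0x7}"


def decode_fault(line: str) -> dict:
    m = FAULT_RE.search(line)
    if m is None:
        raise ValueError("not an AMD-Vi IO_PAGE_FAULT line")
    flags: Optional[int] = int(m.group("flags"), 16) if m.group("flags") else None
    return {
        "domain": int(m.group("domain")),
        "bdf": devid_to_bdf(int(m.group("devid"), 16)),
        "address": int(m.group("addr"), 16),
        "flags": flags,
        "flag_names": [n for bit, n in sorted(FLAG_NAMES.items()) if flags and flags & bit],
    }


# Lines copied from the logs in this thread.
print(decode_fault("AMD-Vi: IO_PAGE_FAULT: domain = 0, device id = 0x0a06, fault address = 0xc2c2c2c0"))
print(decode_fault("AMD-Vi: IO_PAGE_FAULT: domain = 14, device id = 0x0700, "
                   "fault address = 0xa8d339e0, flags = 0x020"))
```

Under those assumptions, device ID 0x0a06 decodes to 0a:00.6 (the device Sander mentions) and 0x0700 to 07:00.0, and flags = 0x020 names only RW: a write with the I bit clear, which fits Wei's reading that these faults are not interrupt requests.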
>>>>>
>>>>>> * They are immediately followed by the IO page faults. Do you know where
>>>>>> they come from? Your video card driver, maybe?
>>>>>
>>>>> From an HVM domain with an old (3.0.3) kernel, but the faults also occur without this domain being started.
>>>>>
>>>>>
>>>>>> Thanks,
>>>>>> Wei
>>>>>
>>>>>
>>>>>>> Complete xl dmesg and lspci -vvvknn attached.
>>>>>>>
>>>>>>> Thx
>>>>>>>
>>>>>>> --
>>>>>>> Sander
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>
>>>
>
>

Sander Eikelenboom

2012-Sep-24 12:27 UTC

Re: [PATCH] amd iommu: Dump flags of IO page faults

Monday, September 24, 2012, 2:24:16 PM, you wrote:
> Hi,
> Thanks for the information, it is very helpful for debugging. I hope to
> start looking at this right after sending my next IOMMU patch queue
> upstream. Another question: did you see this issue on a system with a
> single PV/HVM guest, or only on a system with about 16 running VMs?
> Thanks,
> Wei
If you need more info, I'm more than happy to run additional debug
patches.
I haven't tested it with a single guest yet; I will try right away.

Thanks!

--
Sander

Sander Eikelenboom

2012-Sep-24 21:08 UTC

Re: [PATCH] amd iommu: Dump flags of IO page faults

Monday, September 24, 2012, 2:24:16 PM, you wrote:
> Hi,
> Thanks for the information, it is very helpful for debugging. I hope to
> start looking at this right after sending my next IOMMU patch queue
> upstream. Another question: did you see this issue on a system with a
> single PV/HVM guest, or only on a system with about 16 running VMs?
The issue of the HVM guest not giving a video image also happens when it's
the first and only guest running after a cold boot.
> Thanks,
> Wei
>>
>>
>>
>>>> 4.2:    Capabilities: [54] MSI: Enable+ Count=1/1 Maskable- 64bit+
>>>>                   Address: 00000000fee0100c  Data: 4128
>>>>           Capabilities: [64] HyperTransport: MSI Mapping Enable+ Fixed+
>>>>
>>>> Although it seems enabled, shouldn't the IRQ number used be much higher than 10 for MSI interrupts?
>>
>>> The IRQ number is fine. MSI vector is stored at Data: 4128
>>
>>>>
>>>> There is another difference in the bridge device that's in front of the 0a:00.6 device that faults before the kernel is even booted.
>>>>
>>>> 00:03.0 PCI bridge [0604]: ATI Technologies Inc RD890 PCI to PCI bridge (PCI express gpp port C) [1002:5a17] (prog-if 00 [Normal decode])
>>>>           Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+
>>>> 4.1:    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
>>>> 4.2:    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort+ <MAbort- >SERR- <PERR- INTx-
>>>>           Latency: 0, Cache Line Size: 64 bytes
>>>>           Bus: primary=00, secondary=0a, subordinate=0a, sec-latency=0
>>>>           I/O behind bridge: 0000f000-00000fff
>>>>           Memory behind bridge: f9f00000-f9ffffff
>>>>           Prefetchable memory behind bridge: 00000000fff00000-00000000000fffff
>>>> 4.1:    Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
>>>> 4.2:    Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort+ <TAbort- <MAbort- <SERR- <PERR-
>>>>           BridgeCtl: Parity+ SERR+ NoISA+ VGA- MAbort- >Reset- FastB2B-
>>>>                   PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
>>>>           Capabilities: [50] Power Management version 3
>>>>                   Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
>>>>                   Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
>>>>           Capabilities: [58] Express (v2) Root Port (Slot+), MSI 00
>>>>                   DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
>>>>                           ExtTag+ RBE+ FLReset-
>>>>                   DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
>>>>                           RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
>>>>                           MaxPayload 128 bytes, MaxReadReq 128 bytes
>>>>                   DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
>>>>                   LnkCap: Port #1, Speed 5GT/s, Width x8, ASPM L0s L1, Latency L0 <1us, L1 <8us
>>>>                           ClockPM- Surprise- LLActRep+ BwNot+
>>>>                   LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk-
>>>>                           ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
>>>>                   LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive+ BWMgmt+ ABWMgmt-
>>>>                   SltCap: AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug- Surprise-
>>>>                           Slot #3, PowerLimit 10.000W; Interlock- NoCompl+
>>>>                   SltCtl: Enable: AttnBtn- PwrFlt- MRL- PresDet- CmdCplt- HPIrq- LinkChg-
>>>>                           Control: AttnInd Unknown, PwrInd
Unknown, Power- Interlock-
>>>>                   SltSta: Status: AttnBtn- PowerFlt- MRL-
CmdCplt- PresDet+ Interlock-
>>>>                           Changed: MRL- PresDet+ LinkState+
>>
>>> That's probably because of the IO_PAGE_FAULT.
>>
>>> Thanks,
>>> Wei
>>
>>>> serveerstertje:~# lspci -t
>>>> -[0000:00]-+-00.0
>>>>              +-00.2
>>>>              +-02.0-[0b]----00.0
>>>>              +-03.0-[0a]--+-00.0
>>>>              |            +-00.1
>>>>              |            +-00.2
>>>>              |            +-00.3
>>>>              |            +-00.4
>>>>              |            +-00.5
>>>>              |            +-00.6
>>>>              |            \-00.7
>>>>              +-05.0-[09]----00.0
>>>>              +-06.0-[08]----00.0
>>>>              +-0a.0-[07]----00.0
>>>>              +-0b.0-[06]--+-00.0
>>>>              |            \-00.1
>>>>              +-0c.0-[05]----00.0
>>>>              +-0d.0-[04]--+-00.0
>>>>              |            +-00.1
>>>>              |            +-00.2
>>>>              |            +-00.3
>>>>              |            +-00.4
>>>>              |            +-00.5
>>>>              |            +-00.6
>>>>              |            \-00.7
>>>>              +-11.0
>>>>              +-12.0
>>>>              +-12.2
>>>>              +-13.0
>>>>              +-13.2
>>>>              +-14.0
>>>>              +-14.3
>>>>              +-14.4-[03]----06.0
>>>>              +-14.5
>>>>              +-15.0-[02]--
>>>>              +-16.0
>>>>              +-16.2
>>>>              +-18.0
>>>>              +-18.1
>>>>              +-18.2
>>>>              +-18.3
>>>>              \-18.4
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>> Thanks,
>>>>> Wei
>>>>
>>>>>> I will try to make a complete package, and try with one pv domain only where the devices are being passed through, just to simplify the setup.
>>>>>>
>>>>>>
>>>>>>> * I would also like to know the symptoms of device 0x0700 when IO_PF
>>>>>>> happened. Did it stop working?
>>>>>>
>>>>>> Yes it stops working, the video capture just freezes, but the driver doesn't bail out.
>>>>>> For the USB controller (0x0a06) it starts to give errors for usbdev_open in the guest.
>>>>>>
>>>>>>> (BTW: I copied a few options from your boot cmd line and it worked with
>>>>>>> my RD890 system
>>>>>>
>>>>>>> dom0_mem=1024M,max:1024M loglvl=all loglvl_guest=all console_timestamps
>>>>>>> cpuidle cpufreq=xen noreboot debug lapic=debug apic_verbosity=debug
>>>>>>> apic=debug iommu=on,verbose,debug,no-sharept
>>>>>>
>>>>>>> * so, what OEM board do you have?)
>>>>>>
>>>>>> MSI 890FXA-GD70
>>>>>>
>>>>>>> Also from your log, these lines look very strange:
>>>>>>
>>>>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to read-only memory page. gfn=0xd5, mfn=0xa4a11
>>>>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to read-only memory page. gfn=0xd7, mfn=0xa4a0f
>>>>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to read-only memory page. gfn=0xd9, mfn=0xa4a0d
>>>>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to read-only memory page. gfn=0xdb, mfn=0xa4a0b
>>>>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to read-only memory page. gfn=0xdd, mfn=0xa4a09
>>>>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to read-only memory page. gfn=0xdf, mfn=0xa4a07
>>>>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to read-only memory page. gfn=0xe1, mfn=0xa4a05
>>>>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to read-only memory page. gfn=0xe3, mfn=0xa4a03
>>>>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to read-only memory page. gfn=0xe5, mfn=0xa4a01
>>>>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to read-only memory page. gfn=0xe7, mfn=0xa463f
>>>>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to read-only memory page. gfn=0xe9, mfn=0xa463d
>>>>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to read-only memory page. gfn=0xeb, mfn=0xa463b
>>>>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to read-only memory page. gfn=0xed, mfn=0xa4639
>>>>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to read-only memory page. gfn=0xef, mfn=0xa4637
>>>>>>> (XEN) [2012-09-04 16:13:56] AMD-Vi: IO_PAGE_FAULT: domain = 0, device id = 0x0a06, fault address = 0xc2c2c2c0
>>>>>>> (XEN) [2012-09-04 16:13:56] AMD-Vi: IO_PAGE_FAULT: domain = 14, device id = 0x0700, fault address = 0xa90f8300
>>>>>>> (XEN) [2012-09-04 16:13:56] AMD-Vi: IO_PAGE_FAULT: domain = 14, device id = 0x0700, fault address = 0xa90f8340
>>>>>>> (XEN) [2012-09-04 16:13:56] AMD-Vi: IO_PAGE_FAULT: domain = 14, device id = 0x0700, fault address = 0xa90f8380
>>>>>>> (XEN) [2012-09-04 16:13:56] AMD-Vi: IO_PAGE_FAULT: domain = 14, device id = 0x0700, fault address = 0xa90f83c0
>>>>>>
>>>>>>> * they are just followed by the IO PAGE fault. Do you know where
>>>>>>> they are from? Your video card driver maybe?
>>>>>>
>>>>>> From an HVM domain with an old (3.0.3) kernel, but the faults also occur without this domain being started.
>>>>>>
>>>>>>
>>>>>>> Thanks,
>>>>>>> Wei
>>>>>>
>>>>>>
>>>>>>>> Complete xl dmesg and lspci -vvvknn attached.
>>>>>>>>
>>>>>>>> Thx
>>>>>>>>
>>>>>>>> --
>>>>>>>> Sander
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>
>>>>
>>
>>

Sander Eikelenboom

2012-Oct-01 15:02 UTC

Re: [PATCH] amd iommu: Dump flags of IO page faults

Monday, September 24, 2012, 2:24:16 PM, you wrote:
> On 09/24/2012 10:38 AM, Sander Eikelenboom wrote:
>>
>> Friday, September 7, 2012, 10:54:40 AM, you wrote:
>>
>>> On 09/07/2012 09:32 AM, Sander Eikelenboom wrote:
>>>>
>>>> Thursday, September 6, 2012, 5:03:05 PM, you wrote:
>>>>
>>>>> On 09/06/2012 03:50 PM, Sander Eikelenboom wrote:
>>>>>>
>>>>>> Thursday, September 6, 2012, 3:32:51 PM, you wrote:
>>>>>>
>>>>>>> On 09/06/2012 12:59 AM, Sander Eikelenboom wrote:
>>>>>>>>
>>>>>>>> Wednesday, September 5, 2012, 4:42:42 PM, you wrote:
>>>>>>>>
>>>>>>>>> Hi Jan,
>>>>>>>>> Attached patch dumps io page fault flags. The flags show the reason of
>>>>>>>>> the fault and tell us if this is an unmapped interrupt fault or a DMA fault.
>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Wei
>>>>>>>>
>>>>>>>>> signed-off-by: Wei Wang <wei.wang2@amd.com>
>>>>>>>>
>>>>>>>>
>>>>>>>> I have applied the patch and the flags seem to differ between the faults:
>>>>>>>>
>>>>>>>> AMD-Vi: IO_PAGE_FAULT: domain = 0, device id = 0x0a06, fault address = 0xc2c2c2c0, flags = 0x000
>>>>>>>> (XEN) [2012-09-05 20:54:16] AMD-Vi: IO_PAGE_FAULT: domain = 0, device id = 0x0a06, fault address = 0xc2c2c2c0, flags = 0x000
>>>>>>>> (XEN) [2012-09-05 20:54:16] AMD-Vi: IO_PAGE_FAULT: domain = 14, device id = 0x0700, fault address = 0xa8d339e0, flags = 0x020
>>>>>>>> (XEN) [2012-09-05 20:54:16] AMD-Vi: IO_PAGE_FAULT: domain = 14, device id = 0x0700, fault address = 0xa8d33a40, flags = 0x020
>>>>>>
>>>>>>> OK, so they are not interrupt requests. I guess further information from
>>>>>>> your system would be helpful to debug this issue:
>>>>>>> 1) xl info
>>>>>>> 2) xl list
>>>>>>> 3) lspci -vvv (NOTE: not in dom0 but in your guest)
>>>>>>> 4) cat /proc/iomem (in both dom0 and your hvm guest)
>>>>>>
>>>>>> dom14 is not an HVM guest, it's a PV guest.
>>>>
>>>>> Ah, I see. A PV guest is quite different from an hvm one, which does use p2m tables
>>>>> as io page tables. So the no-sharept option has no effect in this case: PV
>>>>> guests always use separate io page tables. There might be some
>>>>> incorrect mappings in the page tables. I will check this on my side.
>>>>
>>>> I have reverted the machine to xen-4.1.4-pre (changeset 23353) and kept everything else the same.
>>>> I haven't seen any IO PAGE FAULTS after that.
>>>>
>>>> I did spot some differences in the output from lspci between xen 4.1 and 4.2, related to whether MSI is enabled for the IOMMU device.
>>>> I have attached the xl/xm dmesg and lspci from booting with both versions.
>>>>
>>>> lspci:
>>>>
>>>> 00:00.2 Generic system peripheral [0806]: ATI Technologies Inc RD990 I/O Memory Management Unit (IOMMU) [1002:5a23]
>>>>           Subsystem: ATI Technologies Inc RD990 I/O Memory Management Unit (IOMMU) [1002:5a23]
>>>>           Control: I/O- Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
>>>>           Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
>>>>           Latency: 0
>>>>           Interrupt: pin A routed to IRQ 10
>>>>           Capabilities: [40] Secure device <?>
>>>> 4.1:    Capabilities: [54] MSI: Enable- Count=1/1 Maskable- 64bit+
>>
>>> Eh... That is interesting. So which dom0 are you using? There is a c/s
>>> in 4.2 to prevent recent dom0 kernels from disabling the iommu interrupt (changeset
>>> 25492:61844569a432). Otherwise, the iommu cannot send any events, including IO
>>> PAGE faults. You could try to revert dom0 to an old version like 2.6
>>> pv_ops to see if you really have no io page faults on 4.1.
>>
>> OK, I finally got the time to do some more testing: I tested 4.2 around that changeset, and made a copy of the guest using HVM instead of PV.
>>
>> The results:
>> - On xen-4.1.* and a 3.6-rc6 kernel (dom0 and domU): the video device passed through works fine, both in an HVM and a PV guest; I don't see IO page faults getting reported.
>> - On xen-4.2 changeset < 25492 and a 3.6-rc6 kernel (dom0 and domU): the video device passed through works fine, both in an HVM and a PV guest; I don't see IO page faults getting reported.
>> - On xen-4.2 changeset > 25492 and a 3.6-rc6 kernel (dom0 and domU): the video device passed through works fine for a short while (around 5 to 10 minutes) in a PV guest; after that IO page faults get reported and the video freezes. I don't see any errors in the guest though.
>> - On xen-unstable tip and a 3.6-rc6 kernel (dom0 and domU):
>>     PV:  the video device passed through works fine for a short while (around 5 to 10 minutes); after that IO page faults get reported and the video freezes. I don't see any errors in the guest though.
>>     HVM: the video device passed through doesn't work from the start:
>>          - The device is there according to lspci
>>          - The video application starts fine, but delivers a green image, so the device is not working properly. I don't see IO page faults though.
>>
>> Attached are (all with xen-unstable tip and the guest as HVM, domain 15):
>> - xl dmesg
>> - Patch which adds some more info, but all values reported seem to be zero (see xl dmesg)
>> - lspci dom0
>> - lspci HVM guest
> Hi,
> Thanks for the information, very helpful for debugging. I hope I
> can start to look at this right after sending my next iommu patch
> queue upstream... Another question: did you see this issue on a system
> with a single pv/hvm guest, or only on a system with about 16 running
> VMs?
I have an update on this one...
The green screen when using an HVM guest was due to the driver not being able to
communicate with the device via I2C.
This problem disappeared after updating to the latest xen-unstable and a 3.6-rc7
kernel, with the linux-next branch from Konrad's tree additionally pulled in.
At the moment the HVM guest works: it shows video and it doesn't give IO
PAGE FAULTs.
I will try and see if it's also miraculously fixed for PV.

> Thanks,
> Wei
