thr3ads.net - Xen devel - Using debug-key 'o: Dump IOMMU p2m table, locks up machine [Aug 2012]

Sander Eikelenboom

2012-Aug-31 21:45 UTC

Using debug-key 'o: Dump IOMMU p2m table, locks up machine

I was trying to use the 'o' debug key to make a bug report
about an "AMD-Vi: IO_PAGE_FAULT".

The result:
- When using "xl debug-keys o", the machine seems to be in an infinite loop;
I can hardly log in, and it eventually ends in a kernel RCU stall and a complete
lockup.
- When using the serial console, I get an infinite stream of "gfn:  mfn: "
lines; meanwhile, on the normal console, S-ATA devices start to give
errors.

So either option trashes the machine; other debug-keys work fine.

The machine has an 890FX chipset and an AMD Phenom X6 processor.

xl dmesg with bootup and output from some other debug-keys is attached.

--

Sander

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

Santosh Jodh

2012-Aug-31 22:24 UTC

Re: Using debug-key 'o: Dump IOMMU p2m table, locks up machine

Depending on how many VMs you have and the size of the IOMMU p2m table, it can
take a while. It should not be infinite though.

How many VMs do you have running?

Can you please send the serial output when you press 'o'?

Santosh

Sander Eikelenboom

2012-Aug-31 22:42 UTC

Re: Using debug-key 'o: Dump IOMMU p2m table, locks up machine

Saturday, September 1, 2012, 12:24:32 AM, you wrote:

> Depending on how many VMs you have and the size of the IOMMU p2m table, it
> can take a while. It should not be infinite though.

> How many VMs do you have running?

15

> Can you please send the serial output when you press 'o'?

Attached; towards the end you will see the S-ATA errors coming through while the dump
still runs.
This is not a complete dump, only a few minutes' worth, after which I did a hard reset.

> Santosh

-- 
Best regards,
 Sander                            mailto:linux@eikelenboom.it


Santosh Jodh

2012-Aug-31 22:57 UTC

Re: Using debug-key 'o: Dump IOMMU p2m table, locks up machine

The dump should complete - would be curious to see how long it takes on serial
console. What baudrate is the console running at?

The code does allow processing of pending softirqs quite frequently. I am not
sure why you are still seeing SATA errors.

Sander Eikelenboom

2012-Aug-31 23:16 UTC

Re: Using debug-key 'o: Dump IOMMU p2m table, locks up machine

Saturday, September 1, 2012, 12:57:45 AM, you wrote:

> The dump should complete - would be curious to see how long it takes on
> serial console. What baudrate is the console running at?

I think it would take ages; the part I captured seems to cover only a bit of the first of 3 PV
guests which have devices passed through.
38400
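
For a rough sense of scale, the time a full dump needs on a serial line can be estimated from the baud rate. The sketch below is not from the thread: it assumes 8N1 framing (10 bits on the wire per byte), about 30 bytes per "gfn: ... mfn: ..." line, one line per 4 KiB page, and the helper names `dump_seconds` and `pages_for_mib` are hypothetical.

```c
#include <stdint.h>

/* Seconds needed to push `lines` lines of `bytes_per_line` bytes each
 * through a serial port running at `baud` bits per second.
 * Assumption: 8N1 framing, i.e. 10 bits on the wire per byte. */
static double dump_seconds(uint64_t lines, unsigned bytes_per_line,
                           unsigned baud)
{
    double bytes_per_sec = baud / 10.0;

    return (double)lines * bytes_per_line / bytes_per_sec;
}

/* Number of 4 KiB pages backing a guest with `mib` MiB of memory. */
static uint64_t pages_for_mib(uint64_t mib)
{
    return mib * 1024 / 4;
}
```

Under those assumptions a single 1 GiB guest has 262144 pages, and `dump_seconds(pages_for_mib(1024), 30, 38400)` comes to about 2048 seconds, roughly 34 minutes, for that guest alone.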

And I wonder if the information is very valuable: gfn == mfn for every line,
at an increment of 1.
Perhaps a, uhmmm, more compact way of getting the interesting data would be handy?
Or is this the intended output?

> The code does allow processing of pending softirqs quite frequently. I am
> not sure why you are still seeing SATA errors.

The machine is completely unresponsive in every way.

And using it with "xl debug-keys o" is never going to work, I guess,
since the information flood is far larger than what "xl dmesg" keeps?





-- 
Best regards,
 Sander                            mailto:linux@eikelenboom.it

Santosh Jodh

2012-Aug-31 23:58 UTC

Re: Using debug-key 'o: Dump IOMMU p2m table, locks up machine

1:1 mapping is not the common case for gfn-mfn. It is hard to say how much
output would shrink by dumping contiguous ranges instead of individual pfns in
the general case.
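
The "contiguous ranges" idea can be pictured with the following sketch. It is a user-space illustration, not the Xen dump code: `struct mapping`, `dump_ranges`, and the output format are assumptions made for the example, and the input is assumed to be sorted by gfn.

```c
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* One gfn -> mfn translation (hypothetical representation). */
struct mapping {
    uint64_t gfn;
    uint64_t mfn;
};

/* Print the mappings as runs in which gfn and mfn both advance by one,
 * one output line per run. Returns the number of lines printed. */
static size_t dump_ranges(const struct mapping *m, size_t n, FILE *out)
{
    size_t lines = 0;
    size_t start = 0;

    for (size_t i = 1; i <= n; i++) {
        /* The run continues while entry i directly follows entry i-1. */
        if (i < n &&
            m[i].gfn == m[i - 1].gfn + 1 &&
            m[i].mfn == m[i - 1].mfn + 1)
            continue;

        fprintf(out, "gfn: %" PRIx64 " mfn: %" PRIx64 " count: %zu\n",
                m[start].gfn, m[start].mfn, i - start);
        lines++;
        start = i;
    }

    return lines;
}
```

For a 1:1 mapped region this collapses to a single line, but when neighbouring gfns map to scattered mfns it still prints one line per page, which is the uncertainty described above.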

Santosh Jodh

2012-Sep-01 00:42 UTC

Re: Using debug-key 'o: Dump IOMMU p2m table, locks up machine

BTW, I should add that 1:1 mapping for the VM seems very suspicious. Wei can
comment for sure.

Keir Fraser

2012-Sep-01 02:01 UTC

Re: Using debug-key 'o: Dump IOMMU p2m table, locks up machine

On 01/09/2012 00:16, "Sander Eikelenboom" <linux@eikelenboom.it> wrote:
> 
> Saturday, September 1, 2012, 12:57:45 AM, you wrote:
> 
>> The dump should complete - would be curious to see how long it takes on
>> serial console. What baudrate is the console running at?
> 
> I think for ages, this part seems only to cover a bit of the first of 3 pv
> guests which have devices passed through.
> 38400
> 
> And i wonder if the information is very valuable, gfn == mfn for every line
> ... at an increment of 1 ...
> Perhaps a uhmmm more compact way of getting the interesting data would be
> handy ?
> Or is this the intended output ?
> 
>> The code does allow processing of pending softirqs quite frequently. I am not
>> sure why you are still seeing SATA errors.
> 
> The machine is completely unresponsive in every way.

It might schedule softirqs but that won't include scheduling or running any
guest vcpus. The vcpu that happens to be running on that cpu at the time the
debug dump starts will be stuck unrunnable until the dump completes.

Well, anyway, I don't know how useful a massive dump of the entire p2m is
going to be for debugging. If investigating an IOMMU page fault, I'd
just want the info pertaining to that fault, and all the mapping information
for that IO virtual address, dumped. :)

 -- Keir

Santosh Jodh

2012-Sep-01 17:03 UTC

Re: Using debug-key 'o: Dump IOMMU p2m table, locks up machine

> -----Original Message-----
> From: Keir Fraser [mailto:keir.xen@gmail.com]
> Sent: Friday, August 31, 2012 7:01 PM
> To: Sander Eikelenboom; Santosh Jodh
> Cc: wei.wang2@amd.com; xen-devel@lists.xen.org
> Subject: Re: [Xen-devel] Using debug-key 'o: Dump IOMMU p2m table, locks
> up machine
>
> On 01/09/2012 00:16, "Sander Eikelenboom" <linux@eikelenboom.it> wrote:
>
> > Saturday, September 1, 2012, 12:57:45 AM, you wrote:
> >
> >> The dump should complete - would be curious to see how long it takes
> >> on serial console. What baudrate is the console running at?
> >
> > I think for ages, this part seems only to cover a bit of the first of
> > 3 pv guests which have devices passed through.
> > 38400
> >
> > And i wonder if the information is very valuable, gfn == mfn for every
> > line ... at an increment of 1 ...
> > Perhaps a uhmmm more compact way of getting the interesting data would
> > be handy ?
> > Or is this the intended output ?
> >
> >> The code does allow processing of pending softirqs quite frequently.
> >> I am not sure why you are still seeing SATA errors.
> >
> > The machine is completely unresponsive in every way.
>
> It might schedule softirqs but that won't include scheduling or running any
> guest vcpus. The vcpu that happens to be running on that cpu at the time the
> debug dump starts, will be stuck unrunnable until the dump completes.

Why doesn't that vCPU get scheduled on some other pCPU? Is there a way
to yield the CPU from the key handler?

> Well, anyway, I don't know how useful a massive dump of the entire p2m is
> going to be for debugging anyway. If investigating an IOMMU page fault, I'd
> just want the info pertaining to that fault, and all the mapping information for
> that IO virtual address, dumped. :)

It is not a generically useful command; its usefulness is in the same category
as dumping the MMU table. Unfortunately, there is no way to pass arguments to
the key handler, say to provide the VM and/or a starting gfn and length for a
more selective output.

Keir Fraser

2012-Sep-01 19:13 UTC

Re: Using debug-key 'o: Dump IOMMU p2m table, locks up machine

On 01/09/2012 18:03, "Santosh Jodh" <Santosh.Jodh@citrix.com> wrote:

>> It might schedule softirqs but that won't include scheduling or running any
>> guest vcpus. The vcpu that happens to be running on that cpu at the time the
>> debug dump starts, will be stuck unrunnable until the dump completes.
>
> Why doesn't that vCPU get scheduled on some other pCPU? Is there a way to
> yield the CPU from the key handler?

It can't be descheduled from this pCPU without running through the
scheduler. You could try running the handler in a tasklet -- a tasklet
causes other vCPUs to be descheduled from that pCPU before it starts
running.

So you'd register a keyhandler which does a tasklet_schedule(), and do your
logging work in the tasklet handler.

Worth a shot maybe?

>> Well, anyway, I don't know how useful a massive dump of the entire p2m is
>> going to be for debugging anyway. If investigating an IOMMU page fault, I'd
>> just want the info pertaining to that fault, and all the mapping information
>> for that IO virtual address, dumped. :)
>
> It is not a generically useful command - its usefulness is in the same
> category as dumping the MMU table. Unfortunately, there is no way to pass
> arguments to the key handler - to say provide the VM and or starting gfn and
> length for a more selective output.

Quite simply, there likely needs to be more tracing on the IOMMU fault path.
That's a separate concern from your keyhandler of course, but just saying
I'd be looking for the former rather than the latter, for diagnosing
Sander's bug.

 -- Keir

Santosh Jodh

2012-Sep-02 02:08 UTC

Re: Using debug-key 'o: Dump IOMMU p2m table, locks up machine

> -----Original Message-----
> From: Keir Fraser [mailto:keir.xen@gmail.com]
> Sent: Saturday, September 01, 2012 12:13 PM
> To: Santosh Jodh; Sander Eikelenboom
> Cc: wei.wang2@amd.com; xen-devel@lists.xen.org
> Subject: Re: [Xen-devel] Using debug-key 'o: Dump IOMMU p2m table, locks
> up machine
>
> On 01/09/2012 18:03, "Santosh Jodh" <Santosh.Jodh@citrix.com> wrote:
>
> >> It might schedule softirqs but that won't include scheduling or
> >> running any guest vcpus. The vcpu that happens to be running on that
> >> cpu at the time the debug dump starts, will be stuck unrunnable until the
> >> dump completes.
> >
> > Why doesn't that vCPU get scheduled on some other pCPU? Is there a
> > way to yield the CPU from the key handler?
>
> It can't be descheduled from this pCPU without running through the
> scheduler. You could try running the handler in a tasklet -- a tasklet causes
> other vCPUs to be descheduled from that pCPU, before it starts running.
>
> So you'd register a keyhandler which does a tasklet_schedule(), and do your
> logging work in the tasklet handler.
>
> Worth a shot maybe?

Yes - certainly. Is there a reason why all key handlers should not be tasklets?

Keir Fraser

2012-Sep-02 07:13 UTC

Re: Using debug-key 'o: Dump IOMMU p2m table, locks up machine

On 02/09/2012 03:08, "Santosh Jodh" <Santosh.Jodh@citrix.com> wrote:

>>>> It might schedule softirqs but that won't include scheduling or
>>>> running any guest vcpus. The vcpu that happens to be running on that
>>>> cpu at the time the debug dump starts, will be stuck unrunnable until the
>>>> dump completes.
>>>
>>> Why doesn't that vCPU get scheduled on some other pCPU? Is there a
>>> way to yield the CPU from the key handler?
>>
>> It can't be descheduled from this pCPU without running through the
>> scheduler. You could try running the handler in a tasklet -- a tasklet causes
>> other vCPUs to be descheduled from that pCPU, before it starts running.
>>
>> So you'd register a keyhandler which does a tasklet_schedule(), and do your
>> logging work in the tasklet handler.
>>
>> Worth a shot maybe?
>
> Yes - certainly. Is there a reason why all key handlers should not be
> tasklets?

Some keys you want to print immediately (stack trace), or you are using them
when the system is in a bad way, and deferring the tracing may cause you to
get no tracing at all. There may be a few informational keys, for irqs and
the like, that could be moved to tasklet context though, yes. It's just that the
tasklet-in-hypervisor-thread mechanism is newer than the key handlers. ;-)

 -- Keir

Keir Fraser

2012-Sep-02 07:19 UTC

Re: Using debug-key 'o: Dump IOMMU p2m table, locks up machine

On 02/09/2012 08:13, "Keir Fraser" <keir.xen@gmail.com> wrote:

> On 02/09/2012 03:08, "Santosh Jodh" <Santosh.Jodh@citrix.com> wrote:
>
>>>>> It might schedule softirqs but that won't include scheduling or
>>>>> running any guest vcpus. The vcpu that happens to be running on that
>>>>> cpu at the time the debug dump starts, will be stuck unrunnable until the
>>>>> dump completes.
>>>>
>>>> Why doesn't that vCPU get scheduled on some other pCPU? Is there a
>>>> way to yield the CPU from the key handler?
>>>
>>> It can't be descheduled from this pCPU without running through the
>>> scheduler. You could try running the handler in a tasklet -- a tasklet
>>> causes other vCPUs to be descheduled from that pCPU, before it starts
>>> running.
>>>
>>> So you'd register a keyhandler which does a tasklet_schedule(), and do your
>>> logging work in the tasklet handler.
>>>
>>> Worth a shot maybe?
>>
>> Yes - certainly. Is there a reason why all key handlers should not be
>> tasklets?
>
> Some keys you want to print immediately (stack trace), or you are using them
> when the system is in a bad way, and deferring the tracing may cause you to
> get no tracing at all. There may be a few informational keys, for irqs and
> the like, that could be moved to tasklet context though, yes. It's just the
> tasklet-in-hypervisor-thread mechanism is newer than the key handlers. ;-)

Actually, ignore me, most keyhandlers are getting deferred to a tasklet
context already. At least when triggered from a serial irq. See
common/keyhandler.c:handle_keypress().

So your handler, when triggered by 'o' over the serial line, will be running
in tasklet context already. So vCPU execution is getting stalled just
because, I'm not sure, not running through the scheduler softirq for ages on
that pCPU is maybe confusing the scheduler? :(

 -- Keir

Keir Fraser

2012-Sep-02 07:42 UTC

Re: Using debug-key 'o: Dump IOMMU p2m table, locks up machine

On 31/08/2012 22:45, "Sander Eikelenboom" <linux@eikelenboom.it> wrote:

> I was trying to use the 'o' debug key to make a bug report about an "AMD-Vi:
> IO_PAGE_FAULT".
>
> The result:
> - When using "xl debug-keys o", the machine seems in an infinite loop, can
> hardly login, eventually resulting in a kernel RCU stall and complete lockup.

We don't defer the key handler to tasklet context in this case (because of
the 'if !in_irq()' check in keyhandler.c:handle_keypress()). Hence the dom0 vCPU
gets 'stuck' while the handler executes. Possibly we should always defer
non-irq keyhandlers to tasklet context, even when executed via sysctl.

> - When using serial console: I get an infinite stream of "gfn:  mfn: " lines,
> meanwhile on the normal console, S-ATA devices are starting to give errors.

In this case the handler must be running in tasklet context. Not sure why
SATA interrupts would be affected as hardirq and softirq work will still be
carried out during execution of the keyhandler (the handler voluntarily
preempts itself for softirq work).

Would need more investigation. :)
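
The two code paths described above can be modelled in user space as follows. This is only an illustration of the dispatch decision, not Xen code: the queue, `dump_handler`, `dispatch_as_described`, `dispatch_always_deferred`, and `run_pending` are hypothetical names, and the array stands in for whatever the real tasklet machinery does.

```c
#include <stdbool.h>
#include <stddef.h>

typedef void (*key_fn)(void);

#define MAX_PENDING 8
static key_fn pending[MAX_PENDING];   /* stand-in for a tasklet queue */
static size_t n_pending;

static unsigned dump_runs;            /* how often the sample handler ran */

/* Sample handler standing in for a long-running dump. */
static void dump_handler(void)
{
    dump_runs++;
}

/* Queue fn to run later; returns false if the queue is full. */
static bool defer(key_fn fn)
{
    if (n_pending == MAX_PENDING)
        return false;
    pending[n_pending++] = fn;
    return true;
}

/* Behaviour as described in the thread: defer only in IRQ context. */
static bool dispatch_as_described(key_fn fn, bool in_irq)
{
    if (!in_irq) {
        fn();          /* runs inline: the caller waits until it is done */
        return true;
    }
    return defer(fn);
}

/* The alternative floated in the thread: defer regardless of the caller. */
static bool dispatch_always_deferred(key_fn fn)
{
    return defer(fn);
}

/* Run everything queued so far; returns the number of handlers run. */
static size_t run_pending(void)
{
    size_t ran = 0;

    for (size_t i = 0; i < n_pending; i++) {
        pending[i]();
        ran++;
    }
    n_pending = 0;
    return ran;
}
```

In this model the sysctl path corresponds to `dispatch_as_described(dump_handler, false)`, which does all the work before returning to the caller, while the serial path queues the work and returns at once.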

 -- Keir

Sander Eikelenboom

2012-Sep-02 08:43 UTC

Re: Using debug-key 'o: Dump IOMMU p2m table, locks up machine

Saturday, September 1, 2012, 9:13:17 PM, you wrote:

> On 01/09/2012 18:03, "Santosh Jodh" <Santosh.Jodh@citrix.com> wrote:
>
>>> It might schedule softirqs but that won't include scheduling or running any
>>> guest vcpus. The vcpu that happens to be running on that cpu at the time the
>>> debug dump starts, will be stuck unrunnable until the dump completes.
>>
>> Why doesn't that vCPU get scheduled on some other pCPU? Is there a way to
>> yield the CPU from the key handler?
>
> It can't be descheduled from this pCPU without running through the
> scheduler. You could try running the handler in a tasklet -- a tasklet
> causes other vCPUs to be descheduled from that pCPU, before it starts
> running.
>
> So you'd register a keyhandler which does a tasklet_schedule(), and do your
> logging work in the tasklet handler.
>
> Worth a shot maybe?
>
>>> Well, anyway, I don't know how useful a massive dump of the entire p2m is
>>> going to be for debugging anyway. If investigating an IOMMU page fault, I'd
>>> just want the info pertaining to that fault, and all the mapping information
>>> for that IO virtual address, dumped. :)
>>
>> It is not a generically useful command - its usefulness is in the same
>> category as dumping the MMU table. Unfortunately, there is no way to pass
>> arguments to the key handler - to say provide the VM and or starting gfn and
>> length for a more selective output.
>
> Quite simply, there likely needs to be more tracing on the IOMMU fault path.
> That's a separate concern from your keyhandler of course, but just saying
> I'd be looking for the former rather than the latter, for diagnosing
> Sander's bug.

Are there any printk's I could add to get more relevant info about the
AMD-Vi: IO_PAGE_FAULT?

I have attached new output from xl dmesg, this time with iommu=debug on (the
option changed from 4.1 to 4.2).


>  -- Keir



Keir Fraser

2012-Sep-02 14:58 UTC

Re: Using debug-key 'o: Dump IOMMU p2m table, locks up machine

On 02/09/2012 09:43, "Sander Eikelenboom" <linux@eikelenboom.it> wrote:
>> Quite simply, there likely needs to be more tracing on the IOMMU fault path.
>> That's a separate concern from your keyhandler of course, but just saying
>> I'd be looking for the former rather than the latter, for diagnosing
>> Sander's bug.
>
> Are there any printk's I could add to get more relevant info about the AMD-Vi:
> IO_PAGE_FAULT ?
No really straightforward one. I think we need a per-IOMMU-type handler to
walk the IOMMU page table for a given virtual address, and dump every
page-table-entry on the path. Like an IOMMU version of show_page_walk().
Personally I would suspect this is more useful than the dump-everything
handlers: just give a *full* *detailed* walk for the actually interesting
virtual address (the one faulted on).
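
To make the idea of a targeted walk concrete, here is a minimal Python sketch (an illustration only, not Xen code) that splits an IO virtual address into the per-level table indices such a handler would have to follow. It assumes 4 KiB pages, 9 index bits per level and a 4-level table; the real number of levels depends on how the domain's IOMMU page table is configured, and the name `iommu_walk_indices` is hypothetical.

```python
PAGE_SHIFT = 12      # assumption: 4 KiB pages
BITS_PER_LEVEL = 9   # assumption: 512 entries per table


def iommu_walk_indices(addr, levels=4):
    """Return (indices, page_offset) for `addr`, top-level index first.

    Hypothetical helper: it only computes which entry a walker would read
    at each level; it does not touch any real page table.
    """
    mask = (1 << BITS_PER_LEVEL) - 1
    frame = addr >> PAGE_SHIFT
    indices = [
        (frame >> (BITS_PER_LEVEL * level)) & mask
        for level in range(levels - 1, -1, -1)
    ]
    return indices, addr & ((1 << PAGE_SHIFT) - 1)


if __name__ == "__main__":
    # One of the fault addresses reported in this thread.
    indices, offset = iommu_walk_indices(0xA8EE82C0)
    for depth, index in enumerate(indices):
        print(f"level {len(indices) - depth}: index {index}")
    print(f"page offset: {offset:#x}")
```

A real handler would read the entry at each of these indices and print it, stopping at the first non-present entry.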
> I have attached new output from xl dmesg, this time with iommu=debug on (the
> option changed from 4.1 to 4.2).
Not easy to glean any more from that, without extra tracing such as
described above, and/or digging into the guest to find what driver-side
actions are causing the faults.

 -- Keir
> 
> 
>>  -- Keir
>

Sander Eikelenboom

2012-Sep-02 15:14 UTC

Re: Using debug-key 'o: Dump IOMMU p2m table, locks up machine

Sunday, September 2, 2012, 4:58:58 PM, you wrote:
> On 02/09/2012 09:43, "Sander Eikelenboom" <linux@eikelenboom.it> wrote:
>>> Quite simply, there likely needs to be more tracing on the IOMMU fault path.
>>> That's a separate concern from your keyhandler of course, but just saying
>>> I'd be looking for the former rather than the latter, for diagnosing
>>> Sander's bug.
>>
>> Are there any printk's I could add to get more relevant info about the AMD-Vi:
>> IO_PAGE_FAULT ?
> No really straightforward one. I think we need a per-IOMMU-type handler to
> walk the IOMMU page table for a given virtual address, and dump every
> page-table-entry on the path. Like an IOMMU version of show_page_walk().
> Personally I would suspect this is more useful than the dump-everything
> handlers: just give a *full* *detailed* walk for the actually interesting
> virtual address (the one faulted on).
>> I have attached new output from xl dmesg, this time with iommu=debug on (the
>> option changed from 4.1 to 4.2).
> Not easy to glean any more from that, without extra tracing such as
> described above, and/or digging into the guest to find what driver-side
> actions are causing the faults.
OK, too bad!
With Xen 4.1 I haven't experienced those page faults, but a diff of
/xen/drivers/passthrough/amd between both trees shows quite a few changes :(
>  -- Keir
>> 
>> 
>>>  -- Keir
>> 





-- 
Best regards,
 Sander                            mailto:linux@eikelenboom.it

Jan Beulich

2012-Sep-03 08:14 UTC

Re: Using debug-key 'o: Dump IOMMU p2m table, locks up machine

>>> On 01.09.12 at 02:42, Santosh Jodh <Santosh.Jodh@citrix.com> wrote:
> BTW, I should add that 1:1 mapping for the VM seems very suspicious. Wei can
> comment for sure.
For PV guests, that's very much expected, I would say.

Jan
>> -----Original Message-----
>> From: Santosh Jodh
>> Sent: Friday, August 31, 2012 4:58 PM
>> To: 'Sander Eikelenboom'
>> Cc: wei.wang2@amd.com; xen-devel@lists.xen.org
>> Subject: RE: Using debug-key 'o: Dump IOMMU p2m table, locks up machine
>>
>> 1:1 mapping is not the common case for gfn-mfn. It is hard to say how much
>> output would shrink by dumping contiguous ranges instead of individual pfns
>> in the general case.
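
The "contiguous ranges" idea in the quoted text can be sketched as follows. This is a minimal Python illustration, not the hypervisor's dump code: the input is assumed to be a list of (gfn, mfn) pairs and `coalesce` is a hypothetical name. A pair extends a run only when both the gfn and the mfn continue it, so a 1:1 mapped PV guest collapses to very few lines while a scattered mapping hardly shrinks at all.

```python
def coalesce(mappings):
    """Merge (gfn, mfn) pairs into (start_gfn, start_mfn, count) runs.

    A pair extends the current run only if both its gfn and its mfn
    follow directly on from the last page of that run.
    """
    runs = []
    for gfn, mfn in sorted(mappings):
        if runs:
            start_gfn, start_mfn, count = runs[-1]
            if gfn == start_gfn + count and mfn == start_mfn + count:
                runs[-1] = (start_gfn, start_mfn, count + 1)
                continue
        runs.append((gfn, mfn, 1))
    return runs


if __name__ == "__main__":
    identity = [(g, g) for g in range(1024)]       # 1:1 mapping
    scattered = [(0, 100), (1, 101), (2, 102), (5, 200), (6, 201), (7, 300)]
    print(len(coalesce(identity)), "run(s) for the 1:1 table")
    for run in coalesce(scattered):
        print("gfn %#x -> mfn %#x, %d page(s)" % run)
```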

Jan Beulich

2012-Sep-03 08:21 UTC

Re: Using debug-key 'o: Dump IOMMU p2m table, locks up machine

>>> On 02.09.12 at 10:43, Sander Eikelenboom <linux@eikelenboom.it> wrote:
> I have attached new output from xl dmesg, this time with iommu=debug on (the
> option changed from 4.1 to 4.2).
This one
>(XEN) [2012-09-02 00:55:02] traps.c:3156: GPF (0060): ffff82c48015c9ee -> ffff82c480224b13
also worries me. While Xen gracefully recovers from it, these
messages still generally indicate a problem somewhere. Could
you resolve the addresses to file:line tuples? And, assuming
this happens in the context of doing something on behalf of
the guest in the context of a guest vCPU, could you also
check what guest side action triggers this (e.g. by adding a
call to show_execution_state() alongside the printing of the
message)?

Jan

Sander Eikelenboom

2012-Sep-03 08:33 UTC

Re: Using debug-key 'o: Dump IOMMU p2m table, locks up machine

Hello Jan,

Monday, September 3, 2012, 10:21:04 AM, you wrote:
>>>> On 02.09.12 at 10:43, Sander Eikelenboom <linux@eikelenboom.it> wrote:
>> I have attached new output from xl dmesg, this time with iommu=debug on (the
>> option changed from 4.1 to 4.2).
> This one
>>(XEN) [2012-09-02 00:55:02] traps.c:3156: GPF (0060): ffff82c48015c9ee -> ffff82c480224b13
> also worries me. While Xen gracefully recovers from it, these
> messages still generally indicate a problem somewhere. Could
> you resolve the addresses to file:line tuples? And, assuming
> this happens in the context of doing something on behalf of
> the guest in the context of a guest vCPU, could you also
> check what guest side action triggers this (e.g. by adding a
> call to show_execution_state() alongside the printing of the
> message)?
If you could elaborate a bit about HOW :-)

From what I recall, I have also been seeing these for some time on Xen 4.1.
Probably a kernel thing, from about 3.5 or 3.6 by my estimate. I will see if
I can find out more today or tomorrow.

> Jan

Jan Beulich

2012-Sep-03 09:05 UTC

Re: Using debug-key 'o: Dump IOMMU p2m table, locks up machine

>>> On 03.09.12 at 10:33, Sander Eikelenboom <linux@eikelenboom.it> wrote:
> Monday, September 3, 2012, 10:21:04 AM, you wrote:
>> This one
>
>>>(XEN) [2012-09-02 00:55:02] traps.c:3156: GPF (0060): ffff82c48015c9ee -> ffff82c480224b13
>
>> also worries me. While Xen gracefully recovers from it, these
>> messages still generally indicate a problem somewhere. Could
>> you resolve the addresses to file:line tuples? And, assuming
>> this happens in the context of doing something on behalf of
>> the guest in the context of a guest vCPU, could you also
>> check what guest side action triggers this (e.g. by adding a
>> call to show_execution_state() alongside the printing of the
>> message)?
>
> If you could elaborate a bit about HOW :-)
I assume this refers to the first question above only (as I
assume adding the indicated function call at the right place
wouldn't be a big deal for you)?

I think people generally use addr2line for this; I normally simply
disassemble xen-syms, and then do a manual lookup (so if you
have the very xen-syms still around, just making that one
accessible would also do), as in many cases (having full
register/stack dumps at hand) understanding which variables/
expressions register values correspond to would make this
necessary anyway.
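
As a rough sketch of the addr2line route, the following Python snippet pulls the two addresses out of a GPF log line and builds the corresponding `addr2line` invocation. The helper names and the `xen-syms` path are assumptions for illustration. The options used (`-e` for the image, `-f` for function names, `-i` for inlined frames) are standard binutils `addr2line` options, and the result is only meaningful when run against the exact xen-syms of the running build.

```python
import re

# Matches e.g. "GPF (0060): ffff82c48015c9ee -> ffff82c480224b13",
# also when the archive wrapped the line after "->".
GPF_RE = re.compile(
    r"GPF \((?P<code>[0-9a-f]{4})\):\s*"
    r"(?P<rip>[0-9a-f]{16})\s*->\s*(?P<fixup>[0-9a-f]{16})"
)


def parse_gpf(line):
    """Return (error_code, faulting_rip, fixup_address) or None."""
    match = GPF_RE.search(line)
    if match is None:
        return None
    return tuple(int(match.group(name), 16) for name in ("code", "rip", "fixup"))


def addr2line_command(xen_syms, addresses):
    """Build an addr2line argument list; xen_syms is a path you supply."""
    return ["addr2line", "-f", "-i", "-e", xen_syms] + [f"{a:#x}" for a in addresses]


if __name__ == "__main__":
    line = ("(XEN) [2012-09-02 00:55:02] traps.c:3156: GPF (0060): "
            "ffff82c48015c9ee -> ffff82c480224b13")
    code, rip, fixup = parse_gpf(line)
    print(" ".join(addr2line_command("xen-syms", [rip, fixup])))
```

The printed command can be pasted into a shell, or passed to `subprocess.run(cmd, capture_output=True, text=True)` when the tool and the symbol file are available.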
> From what i recall i also had these since some time on xen 4.1 too. Probably
> a kernel thing, in about 3.5 or 3.6 is my estimation, will see if i can find
> out some more today or tomorrow.
Thanks, Jan

Wei Wang

2012-Sep-03 15:20 UTC

Re: Using debug-key 'o: Dump IOMMU p2m table, locks up machine

On 09/02/2012 05:14 PM, Sander Eikelenboom wrote:
> Sunday, September 2, 2012, 4:58:58 PM, you wrote:
>
>> On 02/09/2012 09:43, "Sander Eikelenboom" <linux@eikelenboom.it> wrote:
>
>>>> Quite simply, there likely needs to be more tracing on the IOMMU fault path.
>>>> That's a separate concern from your keyhandler of course, but just saying
>>>> I'd be looking for the former rather than the latter, for diagnosing
>>>> Sander's bug.
>>>
>>> Are there any printk's I could add to get more relevant info about the AMD-Vi:
>>> IO_PAGE_FAULT ?
>
>> No really straightforward one. I think we need a per-IOMMU-type handler to
>> walk the IOMMU page table for a given virtual address, and dump every
>> page-table-entry on the path. Like an IOMMU version of show_page_walk().
>> Personally I would suspect this is more useful than the dump-everything
>> handlers: just give a *full* *detailed* walk for the actually interesting
>> virtual address (the one faulted on).
>
>>> I have attached new output from xl dmesg, this time with iommu=debug on (the
>>> option changed from 4.1 to 4.2).
>
>> Not easy to glean any more from that, without extra tracing such as
>> described above, and/or digging into the guest to find what driver-side
>> actions are causing the faults.
>
> OK, too bad!
> With xen 4.1 i haven't experienced those page faults, but a diff
> between /xen/drivers/passthrough/amd in both trees show quite some changes :(

Did you also update the Xen tools accordingly? I have sometimes seen a few
IO_PAGE_FAULTs coming from the NIC when my tools version and hypervisor
version did not match. With a recent 4.2 and the corresponding xl, my tests
went well.
BTW: You could also try iommu=no-sharept to see if it helps.

Thanks,
Wei
>>   -- Keir
>
>>>
>>>
>>>>   -- Keir
>>>
>
>
>
>
>
>

Jan Beulich

2012-Sep-04 06:35 UTC

Re: Using debug-key 'o: Dump IOMMU p2m table, locks up machine

>>> On 02.09.12 at 09:42, Keir Fraser <keir.xen@gmail.com> wrote:
> On 31/08/2012 22:45, "Sander Eikelenboom" <linux@eikelenboom.it> wrote:
>> - When using serial console: I get a infinite stream of "gfn:  mfn: " lines,
>> mean while on the normal console, S-ATA devices are starting to give errors.
>
> In this case the handler must be running in tasklet context. Not sure why
> SATA interrupts would be affected as hardirq and softirq work will still be
> carried out during execution of the keyhandler (the handler voluntarily
> preempts itself for softirq work).
>
> Would need more investigation. :)
Isn't that because tasklets (i.e. idle vCPU-s with tasklets active)
get preferred in the schedulers? Some compensation might be
needed for the penalized vCPU, at least if that one is pinned
(not sure whether load balancing would be able to steal the
head of the run queue from a remote CPU). Sander - are you
by chance pinning Dom0 vCPU-s? And how many of them does
your Dom0 have?

Jan

Sander Eikelenboom

2012-Sep-04 06:52 UTC

Re: Using debug-key 'o: Dump IOMMU p2m table, locks up machine

Hello Jan,

Tuesday, September 4, 2012, 8:35:40 AM, you wrote:
>>>> On 02.09.12 at 09:42, Keir Fraser <keir.xen@gmail.com> wrote:
>> On 31/08/2012 22:45, "Sander Eikelenboom" <linux@eikelenboom.it> wrote:
>>> - When using serial console: I get a infinite stream of "gfn:  mfn: " lines,
>>> mean while on the normal console, S-ATA devices are starting to give errors.
>>
>> In this case the handler must be running in tasklet context. Not sure why
>> SATA interrupts would be affected as hardirq and softirq work will still be
>> carried out during execution of the keyhandler (the handler voluntarily
>> preempts itself for softirq work).
>>
>> Would need more investigation. :)
> Isn't that because tasklets (i.e. idle vCPU-s with tasklets active)
> get preferred in the schedulers? Some compensation might be
> needed for the penalized vCPU, at least if that one is pinned
> (not sure whether load balancing would be able to steal the
> head of the run queue from a remote CPU). Sander - are you
> by chance pinning Dom0 vCPU-s? And how many of them does
> your Dom0 have?

Yes, I do. Could it perhaps be the pinning of CPU0?

serveerstertje:~# xl vcpu-list
Name                                ID  VCPU   CPU State   Time(s) CPU Affinity
Domain-0                             0     0    0   r--    7711.7  0
Domain-0                             0     1    2   -b-    4048.1  2-5
Domain-0                             0     2    3   -b-    1129.1  2-5
Domain-0                             0     3    4   r--    1229.9  2-5
Domain-0                             0     4    5   -b-     885.5  2-5
Domain-0                             0     5    5   -b-    1192.5  2-5
database                             1     0    2   -b-    1063.6  2-5
database                             1     1    2   -b-     496.2  2-5
mail                                 2     0    5   -b-      42.3  2-5
samba                                3     0    4   -b-     178.4  2-5
webproxy                             4     0    2   -b-      49.9  2-5
www                                  5     0    4   -b-     104.1  2-5
davical                              6     0    2   -b-     119.6  2-5
backup                               7     0    5   -b-    1052.8  2-5
git                                  8     0    4   -b-      55.8  2-5
zabbix                               9     0    4   -b-     426.0  2-5
gallery3                            10     0    4   -b-      47.6  2-5
media                               11     0    2   -b-      40.9  2-5
torrentflux                         12     0    4   -b-      56.0  2-5
vpn                                 13     0    2   -b-     170.8  2-5
security                            14     0    2   -b-    2246.0  1-5
security                            14     1    4   -b-    1778.4  1-5
security                            14     2    1   -b-    1659.7  1-5
security                            14     3    5   -b-    1841.3  1-5
security                            14     4    3   -b-     981.1  1-5
creabox_hvm                         15     0    5   -b-     196.5  2-5
creabox_hvm                         15     1    4   -b-     255.8  2-5
creaexp                             16     0    2   -b-     297.8  2-5

serveerstertje:~# xl sched-credit
libxl: error: libxl.c:596:cpupool_info: failed to get info for cpupool1
: No such file or directory
Cpupool Pool-0: tslice=30ms ratelimit=1000us
Name                                ID Weight  Cap
Domain-0                             0   1024    0
database                             1    256    0
mail                                 2    256    0
samba                                3    256    0
webproxy                             4    256    0
www                                  5    256    0
davical                              6    256    0
backup                               7    256    0
git                                  8    256    0
zabbix                               9    256    0
gallery3                            10    256    0
media                               11    256    0
torrentflux                         12    256    0
vpn                                 13    256    0
security                            14    768    0
creabox_hvm                         15    256    0
creaexp                             16    256    0
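
For anyone wanting to answer the pinning question from such a listing without reading it by eye, here is a small Python sketch that parses `xl vcpu-list` output and reports vCPUs whose affinity is a single pCPU. The function names are hypothetical and the column layout is assumed to match the listing above; domain names containing whitespace would not parse correctly.

```python
def parse_vcpu_list(text):
    """Parse `xl vcpu-list` output into a list of dicts, skipping the header."""
    rows = []
    for line in text.splitlines():
        fields = line.split(None, 6)   # the affinity column may contain spaces
        if len(fields) < 7 or not fields[1].isdigit():
            continue                   # header line or something unexpected
        name, domid, vcpu, cpu, state, _time, affinity = fields
        rows.append({
            "name": name,
            "domid": int(domid),
            "vcpu": int(vcpu),
            "cpu": cpu,                # kept as text; may be "-" when not running
            "state": state,
            "affinity": affinity.strip(),
        })
    return rows


def single_cpu_pins(rows):
    """Return (name, vcpu, pcpu) for vCPUs pinned to exactly one pCPU."""
    return [(r["name"], r["vcpu"], r["affinity"])
            for r in rows if r["affinity"].isdigit()]


SAMPLE = """\
Name                                ID  VCPU   CPU State   Time(s) CPU Affinity
Domain-0                             0     0    0   r--    7711.7  0
Domain-0                             0     1    2   -b-    4048.1  2-5
security                            14     2    1   -b-    1659.7  1-5
"""

if __name__ == "__main__":
    for name, vcpu, pcpu in single_cpu_pins(parse_vcpu_list(SAMPLE)):
        print(f"{name} vCPU {vcpu} is pinned to pCPU {pcpu}")
```

On the listing above this singles out Domain-0 vCPU 0, which is pinned to pCPU 0.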

> Jan




-- 
Best regards,
 Sander                            mailto:linux@eikelenboom.it

Keir Fraser

2012-Sep-04 06:59 UTC

Re: Using debug-key 'o: Dump IOMMU p2m table, locks up machine

On 04/09/2012 07:35, "Jan Beulich" <JBeulich@suse.com> wrote:
>>>> On 02.09.12 at 09:42, Keir Fraser <keir.xen@gmail.com> wrote:
>> On 31/08/2012 22:45, "Sander Eikelenboom" <linux@eikelenboom.it> wrote:
>>> - When using serial console: I get a infinite stream of "gfn:  mfn: " lines,
>>> mean while on the normal console, S-ATA devices are starting to give errors.
>>
>> In this case the handler must be running in tasklet context. Not sure why
>> SATA interrupts would be affected as hardirq and softirq work will still be
>> carried out during execution of the keyhandler (the handler voluntarily
>> preempts itself for softirq work).
>>
>> Would need more investigation. :)
>
> Isn't that because tasklets (i.e. idle vCPU-s with tasklets active)
> get preferred in the schedulers? Some compensation might be
> needed for the penalized vCPU, at least if that one is pinned
> (not sure whether load balancing would be able to steal the
> head of the run queue from a remote CPU). Sander - are you
> by chance pinning Dom0 vCPU-s? And how many of them does
> your Dom0 have?
Jan, Yes you could be right, if Sander is pinning CPUs. Anyway, I wasn't
going to expend too much brain power on this situation. The case of spending
a few minutes in one key handler is not one I think is particularly sane.

 -- Keir
> Jan
>

Keir Fraser

2012-Sep-04 07:01 UTC

Re: Using debug-key 'o: Dump IOMMU p2m table, locks up machine

On 04/09/2012 07:52, "Sander Eikelenboom" <linux@eikelenboom.it> wrote:
>> Isn't that because tasklets (i.e. idle vCPU-s with tasklets active)
>> get preferred in the schedulers? Some compensation might be
>> needed for the penalized vCPU, at least if that one is pinned
>> (not sure whether load balancing would be able to steal the
>> head of the run queue from a remote CPU). Sander - are you
>> by chance pinning Dom0 vCPU-s? And how many of them does
>> your Dom0 have?
>
> Yes i do, could perhaps be the pinning of CPU0 ?
Yeah. :) Case closed?

Sander Eikelenboom

2012-Sep-04 07:08 UTC

Re: Using debug-key 'o: Dump IOMMU p2m table, locks up machine

Hello Jan,

Monday, September 3, 2012, 10:21:04 AM, you wrote:
>>>> On 02.09.12 at 10:43, Sander Eikelenboom <linux@eikelenboom.it> wrote:
>> I have attached new output from xl dmesg, this time with iommu=debug on (the
>> option changed from 4.1 to 4.2).
> This one
>>(XEN) [2012-09-02 00:55:02] traps.c:3156: GPF (0060): ffff82c48015c9ee -> ffff82c480224b13
> also worries me. While Xen gracefully recovers from it, these
> messages still generally indicate a problem somewhere. Could
> you resolve the addresses to file:line tuples? And, assuming
> this happens in the context of doing something on behalf of
> the guest in the context of a guest vCPU, could you also
> check what guest side action triggers this (e.g. by adding a
> call to show_execution_state() alongside the printing of the
> message)?
I hope I have done it right:

diff -r a0b5f8102a00 xen/arch/x86/traps.c
--- a/xen/arch/x86/traps.c      Tue Aug 28 22:40:45 2012 +0100
+++ b/xen/arch/x86/traps.c      Tue Sep 04 08:53:54 2012 +0200
@@ -3154,6 +3154,11 @@
     {
         dprintk(XENLOG_INFO, "GPF (%04x): %p -> %p\n",
                 regs->error_code, _p(regs->eip), _p(fixup));
+        dprintk(XENLOG_INFO, " show_execution_state(regs): \n");
+        show_execution_state(regs);
+        dprintk(XENLOG_INFO, " show_execution_state(guest_cpu_user_regs()): \n");
+        show_execution_state(guest_cpu_user_regs());
+
         regs->eip = fixup;
         return;
     }


Gives (complete dmesg attached):

(XEN) [2012-09-03 21:20:49] AMD-Vi: IO_PAGE_FAULT: domain = 14, device id = 0x0700, fault address = 0xa8ee82c0
(XEN) [2012-09-03 21:20:49] AMD-Vi: IO_PAGE_FAULT: domain = 14, device id = 0x0700, fault address = 0xa8ee8320
(XEN) [2012-09-04 03:00:34] traps.c:3156: GPF (0060): ffff82c48015c9ee -> ffff82c480224b73
(XEN) [2012-09-04 03:00:34] traps.c:3157:  show_execution_state(regs):
(XEN) [2012-09-04 03:00:34] ----[ Xen-4.2.0-rc4-pre  x86_64  debug=y  Not tainted ]----
(XEN) [2012-09-04 03:00:34] CPU:    3
(XEN) [2012-09-04 03:00:34] RIP:    e008:[<ffff82c48015c9ee>] context_switch+0x394/0xeeb
(XEN) [2012-09-04 03:00:34] RFLAGS: 0000000000010246   CONTEXT: hypervisor
(XEN) [2012-09-04 03:00:34] rax: 0000000000000001   rbx: ffff8300a52da000   rcx: 0000000000000001
(XEN) [2012-09-04 03:00:34] rdx: 0000000000000063   rsi: 0000000000000001   rdi: 000000000000037e
(XEN) [2012-09-04 03:00:34] rbp: ffff83024d8a7e28   rsp: ffff83024d8a7d88   r8:  0000000000000006
(XEN) [2012-09-04 03:00:34] r9:  ffff83024d95ebb8   r10: 00000000deadbeef   r11: 0000000000000246
(XEN) [2012-09-04 03:00:34] r12: ffff8300afd11000   r13: 0000000000000003   r14: 0000000000000003
(XEN) [2012-09-04 03:00:34] r15: ffff83024d8aa048   cr0: 000000008005003b   cr4: 00000000000006f0
(XEN) [2012-09-04 03:00:34] cr3: 0000000068506000   cr2: ffffffffff600400
(XEN) [2012-09-04 03:00:34] ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: e010   cs: e008
(XEN) [2012-09-04 03:00:34] Xen stack trace from rsp=ffff83024d8a7d88:
(XEN) [2012-09-04 03:00:34]    0000000000000029 0000000000000000 0000001e00000000 0000000000000000
(XEN) [2012-09-04 03:00:34]    ffff83024d8a7db8 ffff83024d8aa060 ffff83024d8a7e18 ffff82c4801805ae
(XEN) [2012-09-04 03:00:34]    0000000000012f22 00003fd9ab6d6ca6 0000000000000000 0000000000000000
(XEN) [2012-09-04 03:00:34]    0000000000000000 0000000000000000 ffff83024d8a7e28 ffff8300afd11000
(XEN) [2012-09-04 03:00:34]    ffff8300a52da000 000013ebba37e10c 0000000000000002 ffff83024d8aa048
(XEN) [2012-09-04 03:00:34]    ffff83024d8a7eb8 ffff82c480124a70 0000000000000000 ffff83024d8aa040
(XEN) [2012-09-04 03:00:34]    000000034d8a7e68 000013ebba37e10c ffff83024d8a7e88 ffff82c480189483
(XEN) [2012-09-04 03:00:34]    ffff8300a52da000 0000000001c9c380 ffff83024d8a7e00 ffff82c4801226ce
(XEN) [2012-09-04 03:00:34]    ffff83024d8a7ef8 ffff82c4802d8180 00000000ffffffff ffff82c4802d8000
(XEN) [2012-09-04 03:00:34]    ffff83024d8a7f18 ffffffffffffffff ffff83024d8a7ef8 ffff82c480125e31
(XEN) [2012-09-04 03:00:34]    0000000000000246 ffff8300afd11000 ffffffff81ece5d8 ffffffff81f420c0
(XEN) [2012-09-04 03:00:34]    0000000000000000 0000000000000000 ffff83024d8a7f08 ffff82c480125e68
(XEN) [2012-09-04 03:00:34]    00007cfdb27580c7 ffff82c480222ef6 0000000000000000 ffff8800030e14a0
(XEN) [2012-09-04 03:00:34]    0000000000000000 ffff88001a0800d8 ffff88001cd17bf0 ffff88001fc0b100
(XEN) [2012-09-04 03:00:34]    0000000000000202 0000000000000000 0000000000000001 0000000000000000
(XEN) [2012-09-04 03:00:34]    0000000000000000 ffffffff810011aa ffff88001e99e180 00000000deadbeef
(XEN) [2012-09-04 03:00:34]    00000000deadbeef 0000010000000000 ffffffff810011aa 000000000000e033
(XEN) [2012-09-04 03:00:34]    0000000000000202 ffff88001cd17bb8 000000000000e02b 000053fd0000beef
(XEN) [2012-09-04 03:00:34]    800000000000beef 740000000000beef 000000000018beef 000053fe00000003
(XEN) [2012-09-04 03:00:34]    ffff8300a52da000 0000003dcd5a8680 000000000018e0c9
(XEN) [2012-09-04 03:00:34] Xen call trace:
(XEN) [2012-09-04 03:00:34]    [<ffff82c48015c9ee>] context_switch+0x394/0xeeb
(XEN) [2012-09-04 03:00:34]    [<ffff82c480124a70>] schedule+0x666/0x675
(XEN) [2012-09-04 03:00:34]    [<ffff82c480125e31>] __do_softirq+0xa4/0xb5
(XEN) [2012-09-04 03:00:34]    [<ffff82c480125e68>] do_softirq+0x26/0x28
(XEN) [2012-09-04 03:00:34]
(XEN) [2012-09-04 03:00:34] traps.c:3159:  show_execution_state(guest_cpu_user_regs()):
(XEN) [2012-09-04 03:00:34] ----[ Xen-4.2.0-rc4-pre  x86_64  debug=y  Not tainted ]----
(XEN) [2012-09-04 03:00:34] CPU:    3
(XEN) [2012-09-04 03:00:34] RIP:    e033:[<ffffffff810011aa>]
(XEN) [2012-09-04 03:00:34] RFLAGS: 0000000000000202   EM: 1   CONTEXT: pv guest
(XEN) [2012-09-04 03:00:34] rax: 0000000000000000   rbx: ffff88001fc0b100   rcx: ffffffff810011aa
(XEN) [2012-09-04 03:00:34] rdx: ffff88001e99e180   rsi: 00000000deadbeef   rdi: 00000000deadbeef
(XEN) [2012-09-04 03:00:34] rbp: ffff88001cd17bf0   rsp: ffff88001cd17bb8   r8:  0000000000000000
(XEN) [2012-09-04 03:00:34] r9:  0000000000000001   r10: 0000000000000000   r11: 0000000000000202
(XEN) [2012-09-04 03:00:34] r12: ffff88001a0800d8   r13: 0000000000000000   r14: ffff8800030e14a0
(XEN) [2012-09-04 03:00:34] r15: 0000000000000000   cr0: 000000008005003b   cr4: 00000000000006f0
(XEN) [2012-09-04 03:00:34] cr3: 0000000068506000   cr2: 00000000f76e4000
(XEN) [2012-09-04 03:00:34] ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: e02b   cs: e033
(XEN) [2012-09-04 03:00:34] Guest stack trace from rsp=ffff88001cd17bb8:
(XEN) [2012-09-04 03:00:34]    0000000000000000 0000000000000001 ffffffff81004942 ffff8800030e1040
(XEN) [2012-09-04 03:00:34]    ffff88001e99e180 0000000000000000 ffff8800030e14a0 ffff88001cd17c10
(XEN) [2012-09-04 03:00:34]    ffffffff81003941 ffff88001e99e180 ffff8800030e1040 ffff88001cd17c70
(XEN) [2012-09-04 03:00:34]    ffffffff8100b850 ffff8800030e1040 ffff88001d280080 0000000000000063
(XEN) [2012-09-04 03:00:34]    ffff88001fc10a80 ffff88001cd17c80 ffff88001fc12e80 0000000000000000
(XEN) [2012-09-04 03:00:34]    ffff88001d285b00 0000000000000000 0000000000000000 ffff8800030e1040
(XEN) [2012-09-04 03:00:34]    ffffffff817fa2f5 ffff88001cd17dd0 0000000000000216 ffffffff810700fe
(XEN) [2012-09-04 03:00:34]    ffff88001fc0e018 ffff8800030e1040 0000000000012e80 ffff88001cd17fd8
(XEN) [2012-09-04 03:00:34]    ffff88001cd16010 0000000000012e80 0000000000012e80 ffff88001cd17fd8
(XEN) [2012-09-04 03:00:34]    0000000000012e80 ffff88001e999040 ffff8800030e1040 ffff880000000000
(XEN) [2012-09-04 03:00:34]    ffff880000000000 ffff88001fc0e000 ffff88001cdb3300 ffff88001fc16e00
(XEN) [2012-09-04 03:00:34]    ffff88001fc0e000 ffff88001cd17d50 ffffffff817fb614 ffff88001d08c140
(XEN) [2012-09-04 03:00:34]    ffff88001cdb3300 ffff88001fc16e00 ffff88001fc0e000 ffff88001cd17de0
(XEN) [2012-09-04 03:00:34]    ffffffff8107f059 ffff8800030e1040 ffff8800030e1040 ffffffff817fbe7b
(XEN) [2012-09-04 03:00:34]    ffff88001fc0e448 ffff8800030e1040 ffff88001cdb3320 ffff88001cd17db0
(XEN) [2012-09-04 03:00:34]    ffffffff810acb78 ffff88001fc0e000 ffff88001cdb3300 ffff88001fc0e438
(XEN) [2012-09-04 03:00:34]    ffff88001fc0e448 ffff8800030e1040 ffff88001cdb3320 ffff88001cd17de0
(XEN) [2012-09-04 03:00:34]    ffffffff817fa814 ffff88001cd17eb0 ffffffff8107f6f9 0000000000000000
(XEN) [2012-09-04 03:00:34]    ffff88001cd17e50 ffff8800030e1040 ffff88001cd16010 ffff8800030e0240
(XEN) [2012-09-04 03:00:34]    ffff88001cd17e68 ffff8800030e1040 ffff8800030e1040 ffff8800030e1040
(XEN) [2012-09-04 03:15:12] grant_table.c:254:d0 Increased maptrack size to 2 frames

> Jan


Jan Beulich

2012-Sep-04 07:46 UTC

Re: Using debug-key 'o: Dump IOMMU p2m table, locks up machine

>>> On 04.09.12 at 09:08, Sander Eikelenboom <linux@eikelenboom.it> wrote:
>> This one
>
>>>(XEN) [2012-09-02 00:55:02] traps.c:3156: GPF (0060): ffff82c48015c9ee -> ffff82c480224b13
>
>> also worries me. While Xen gracefully recovers from it, these
>> messages still generally indicate a problem somewhere. Could
>> you resolve the addresses to file:line tuples? And, assuming
>> this happens in the context of doing something on behalf of
>> the guest in the context of a guest vCPU, could you also
>> check what guest side action triggers this (e.g. by adding a
>> call to show_execution_state() alongside the printing of the
>> message)?
> 
> Hope i have done it right:
Yes.
> Gives (complete dmesg attached):
>
> (XEN) [2012-09-03 21:20:49] AMD-Vi: IO_PAGE_FAULT: domain = 14, device id = 0x0700, fault address = 0xa8ee82c0
> (XEN) [2012-09-03 21:20:49] AMD-Vi: IO_PAGE_FAULT: domain = 14, device id = 0x0700, fault address = 0xa8ee8320
> (XEN) [2012-09-04 03:00:34] traps.c:3156: GPF (0060): ffff82c48015c9ee -> ffff82c480224b73
> (XEN) [2012-09-04 03:00:34] traps.c:3157:  show_execution_state(regs):
> (XEN) [2012-09-04 03:00:34] ----[ Xen-4.2.0-rc4-pre  x86_64  debug=y  Not tainted ]----
> (XEN) [2012-09-04 03:00:34] CPU:    3
> (XEN) [2012-09-04 03:00:34] RIP:    e008:[<ffff82c48015c9ee>] context_switch+0x394/0xeeb
Now that - in the middle of context switch code - almost certainly
wants to be fixed, but we first need to understand what it is (and
how it gets triggered by the guest). I.e. once again this requires
resolving to file/line - care to do the conversion yourself, or send
(or make available somewhere) the very xen-syms?
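
As a side note on reading the log line itself: the value in parentheses after "GPF" is the hardware error code. On x86, a non-zero general-protection error code uses the selector error-code layout (bit 0 EXT, bit 1 IDT, bit 2 TI, bits 3 to 15 the selector index). The Python sketch below decodes that layout; the function name is hypothetical, and the decoding says nothing about why the fault happened, which still needs the file/line resolution asked for above.

```python
def decode_selector_error_code(code):
    """Decode an x86 selector-format exception error code."""
    if code & 0x2:
        table = "IDT"
    elif code & 0x4:
        table = "LDT"
    else:
        table = "GDT"
    return {
        "external": bool(code & 0x1),   # EXT: event external to the program
        "table": table,                 # descriptor table the index refers to
        "index": (code >> 3) & 0x1FFF,  # selector index within that table
    }


if __name__ == "__main__":
    # Error code from the "GPF (0060)" lines in this thread.
    print(decode_selector_error_code(0x0060))
```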

Jan

Jan Beulich

2012-Sep-04 07:55 UTC

Re: Using debug-key 'o: Dump IOMMU p2m table, locks up machine

>>> On 04.09.12 at 08:59, Keir Fraser <keir.xen@gmail.com> wrote:
> On 04/09/2012 07:35, "Jan Beulich" <JBeulich@suse.com> wrote:
>
>>>>> On 02.09.12 at 09:42, Keir Fraser <keir.xen@gmail.com> wrote:
>>> On 31/08/2012 22:45, "Sander Eikelenboom" <linux@eikelenboom.it> wrote:
>>>> - When using serial console: I get a infinite stream of "gfn:  mfn: " lines,
>>>> mean while on the normal console, S-ATA devices are starting to give errors.
>>>
>>> In this case the handler must be running in tasklet context. Not sure why
>>> SATA interrupts would be affected as hardirq and softirq work will still be
>>> carried out during execution of the keyhandler (the handler voluntarily
>>> preempts itself for softirq work).
>>>
>>> Would need more investigation. :)
>>
>> Isn't that because tasklets (i.e. idle vCPU-s with tasklets active)
>> get preferred in the schedulers? Some compensation might be
>> needed for the penalized vCPU, at least if that one is pinned
>> (not sure whether load balancing would be able to steal the
>> head of the run queue from a remote CPU). Sander - are you
>> by chance pinning Dom0 vCPU-s? And how many of them does
>> your Dom0 have?
>
> Jan, Yes you could be right, if Sander is pinning CPUs. Anyway, I wasn't
> going to expend too much brain power on this situation. The case of spending
> a few minutes in one key handler is not one I think is particularly sane.
Which imo would call for reverting the patch. But then again, other
key handlers can easily take pretty long too (particularly on large
systems, albeit it is clear that the one here is particularly bad), and
declaring all of them pretty much useless probably isn't the best
choice (as then we could as well rip them all out).

Bottom line - _I_ think we should try to do something about this.
An apparent option would be to have low priority tasklets (for
just this purpose, as all others we certainly want to take priority),
if that can reasonably be integrated with the schedulers.

Jan

Keir Fraser

2012-Sep-04 08:04 UTC

Re: Using debug-key 'o: Dump IOMMU p2m table, locks up machine

On 04/09/2012 08:55, "Jan Beulich" <JBeulich@suse.com> wrote:
>> Jan, Yes you could be right, if Sander is pinning CPUs. Anyway, I
wasn''t
>> going to expend too much brain power on this situation. The case of
spending
>> a few minutes in one key handler is not one I think is particularly
sane.
> 
> Which imo would call for reverting the patch. But then again, other
> key handlers can easily take pretty long too (particularly on large
> systems, albeit it is clear that the one here is particularly bad), and
> declaring all of them pretty much useless probably isn''t the best
> choice (as then we could as well rip them all out).
> 
> Bottom line - _I_ think we should try to do something about this.
> An apparent option would be to have low priority tasklets (for
> just this purpose, as all others we certainly want to take priority),
> if that can reasonably be integrated with the schedulers.

Do you expect to be able to use the long-running key handlers and still need
a running system afterwards (rather than using them as a final
dump-everything when the system has already gone bad)? Then I suppose you
would need something like this, with voluntary preemption in the key
handlers. You then need to be able to recommence the keyhandlers where they
left off, retaking locks, finding their place in lists, trees, etc, even
when state of the system has significantly changed between preemption and
resumption. Well, I'm sure it can be done, but can anyone be bothered.
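
To make the "recommence where they left off" problem concrete, here is a minimal Python sketch (not Xen code) of a dump that runs in bounded chunks. Everything in it is an assumption for illustration: the table is a plain dict of gfn to mfn, and `dump_chunk`, `cursor` and `budget` are hypothetical names. The cursor is the last gfn printed rather than a position inside the data structure, so the dump can resume even if entries were added or removed while it was preempted.

```python
from typing import Dict, List, Optional, Tuple


def dump_chunk(table: Dict[int, int], cursor: Optional[int],
               budget: int) -> Tuple[List[str], Optional[int], bool]:
    """Print at most `budget` entries whose gfn is greater than `cursor`.

    Returns (lines, next_cursor, done). The cursor is a key value, not an
    index or pointer, so it stays meaningful if the table changes between
    two calls (hypothetical design, chosen only for this illustration).
    """
    # Re-derive the remaining work from the current table on every call.
    pending = sorted(g for g in table if cursor is None or g > cursor)
    batch = pending[:budget]
    lines = [f"gfn: {g:#x} mfn: {table[g]:#x}" for g in batch]
    next_cursor = batch[-1] if batch else cursor
    done = len(pending) <= budget
    return lines, next_cursor, done


if __name__ == "__main__":
    p2m = {1: 0x10, 2: 0x20, 3: 0x30, 5: 0x50}
    lines, cur, done = dump_chunk(p2m, None, budget=2)
    print(lines, cur, done)
    # The table changes while the dump is "preempted".
    del p2m[3]
    p2m[4] = 0x40
    lines, cur, done = dump_chunk(p2m, cur, budget=2)
    print(lines, cur, done)
```

The sketch sidesteps the hard parts named above (retaking locks, consistency of what gets printed); it only shows why a resumption point has to be expressed in terms that survive changes to the underlying structure.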

 -- Keir

Keir Fraser

2012-Sep-04 08:11 UTC

Re: Using debug-key 'o: Dump IOMMU p2m table, locks up machine

On 04/09/2012 09:04, "Keir Fraser" <keir.xen@gmail.com> wrote:
> On 04/09/2012 08:55, "Jan Beulich" <JBeulich@suse.com> wrote:
> 
>>> Jan, Yes you could be right, if Sander is pinning CPUs. Anyway, I wasn't
>>> going to expend too much brain power on this situation. The case of spending
>>> a few minutes in one key handler is not one I think is particularly sane.
>> 
>> Which imo would call for reverting the patch. But then again, other
>> key handlers can easily take pretty long too (particularly on large
>> systems, albeit it is clear that the one here is particularly bad), and
>> declaring all of them pretty much useless probably isn't the best
>> choice (as then we could as well rip them all out).
>> 
>> Bottom line - _I_ think we should try to do something about this.
>> An apparent option would be to have low priority tasklets (for
>> just this purpose, as all others we certainly want to take priority),
>> if that can reasonably be integrated with the schedulers.
> 
> Do you expect to be able to use the long-running key handlers and still need
> a running system afterwards (rather than using them as a final
> dump-everything when the system has already gone bad)? Then I suppose you
> would need something like this, with voluntary preemption in the key
> handlers. You then need to be able to recommence the keyhandlers where they
> left off, retaking locks, finding their place in lists, trees, etc, even
> when state of the system has significantly changed between preemption and
> resumption. Well, I'm sure it can be done, but can anyone be bothered.

My pragmatic take would be that: (a) Really long-running handlers that want
to dump every page mapping of every domain are pretty bloody stupid, and yes
we should consider if they are worthwhile at all; (b) moderately
long-running but useful handlers which nonetheless take a long time to dump
to Xen's console, I would consider a sysctl to allow dom0 to request dump
into a supplied memory buffer.

>  -- Keir
> 
>

Sander Eikelenboom

2012-Sep-04 08:13 UTC

Re: Using debug-key 'o: Dump IOMMU p2m table, locks up machine

Hello Jan,

Tuesday, September 4, 2012, 9:46:27 AM, you wrote:
>>>> On 04.09.12 at 09:08, Sander Eikelenboom <linux@eikelenboom.it> wrote:
>>> This one
>> 
>>>>(XEN) [2012-09-02 00:55:02] traps.c:3156: GPF (0060): ffff82c48015c9ee -> ffff82c480224b13
>> 
>>> also worries me. While Xen gracefully recovers from it, these
>>> messages still generally indicate a problem somewhere. Could
>>> you resolve the addresses to file:line tuples? And, assuming
>>> this happens in the context of doing something on behalf of
>>> the guest in the context of a guest vCPU, could you also
>>> check what guest side action triggers this (e.g. by adding a
>>> call to show_execution_state() alongside the printing of the
>>> message)?
>> 
>> Hope i have done it right:
> Yes.
>> Gives (complete dmesg attached):
>> 
>> (XEN) [2012-09-03 21:20:49] AMD-Vi: IO_PAGE_FAULT: domain = 14, device id = 0x0700, fault address = 0xa8ee82c0
>> (XEN) [2012-09-03 21:20:49] AMD-Vi: IO_PAGE_FAULT: domain = 14, device id = 0x0700, fault address = 0xa8ee8320
>> (XEN) [2012-09-04 03:00:34] traps.c:3156: GPF (0060): ffff82c48015c9ee -> ffff82c480224b73
>> (XEN) [2012-09-04 03:00:34] traps.c:3157:  show_execution_state(regs):
>> (XEN) [2012-09-04 03:00:34] ----[ Xen-4.2.0-rc4-pre  x86_64  debug=y  Not tainted ]----
>> (XEN) [2012-09-04 03:00:34] CPU:    3
>> (XEN) [2012-09-04 03:00:34] RIP:    e008:[<ffff82c48015c9ee>] context_switch+0x394/0xeeb
> Now that - in the middle of context switch code - almost certainly
> wants to be fixed, but we first need to understand what it is (and
> how it gets triggered by the guest). I.e. once again this requires
> resolving to file/line - care to do the conversion yourself, or send
> (or make available somewhere) the very xen-syms?
> Jan

Hmm don't know how to get the file/line, only thing i have found is:

serveerstertje:/boot# gdb xen-syms-4.2.0-rc4-pre
GNU gdb (GDB) 7.0.1-debian
Copyright (C) 2009 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /boot/xen-syms-4.2.0-rc4-pre...done.
(gdb) x/i 0xffff82c48015c9ee
0xffff82c48015c9ee <context_switch+916>:        mov    %edx,%gs
(gdb)


How to resolve the RIP could be a nice addition to
http://wiki.xen.org/wiki/Debugging_Xen, so one could easily refer to that on how
to do it :-)

--
Sander

Sander Eikelenboom

2012-Sep-04 08:20 UTC

Re: Using debug-key 'o: Dump IOMMU p2m table, locks up machine

Hello Keir,

Tuesday, September 4, 2012, 10:11:42 AM, you wrote:
> On 04/09/2012 09:04, "Keir Fraser" <keir.xen@gmail.com> wrote:
>> On 04/09/2012 08:55, "Jan Beulich" <JBeulich@suse.com> wrote:
>> 
>>>> Jan, Yes you could be right, if Sander is pinning CPUs. Anyway, I wasn't
>>>> going to expend too much brain power on this situation. The case of spending
>>>> a few minutes in one key handler is not one I think is particularly sane.
>>> 
>>> Which imo would call for reverting the patch. But then again, other
>>> key handlers can easily take pretty long too (particularly on large
>>> systems, albeit it is clear that the one here is particularly bad), and
>>> declaring all of them pretty much useless probably isn't the best
>>> choice (as then we could as well rip them all out).
>>> 
>>> Bottom line - _I_ think we should try to do something about this.
>>> An apparent option would be to have low priority tasklets (for
>>> just this purpose, as all others we certainly want to take priority),
>>> if that can reasonably be integrated with the schedulers.
>> 
>> Do you expect to be able to use the long-running key handlers and still need
>> a running system afterwards (rather than using them as a final
>> dump-everything when the system has already gone bad)? Then I suppose you
>> would need something like this, with voluntary preemption in the key
>> handlers. You then need to be able to recommence the keyhandlers where they
>> left off, retaking locks, finding their place in lists, trees, etc, even
>> when state of the system has significantly changed between preemption and
>> resumption. Well, I'm sure it can be done, but can anyone be bothered.
> My pragmatic take would be that: (a) Really long-running handlers that want
> to dump every page mapping of every domain are pretty bloody stupid, and yes
> we should consider if they are worthwhile at all; (b) moderately
> long-running but useful handlers which nonetheless take a long time to dump
> to Xen's console, I would consider a sysctl to allow dom0 to request dump
> into a supplied memory buffer.

Is it necessary for this case to let it be a key-handler for which one
can't specify parameters to limit the output ?

In this case both hypervisor and kernel are running fine, so an interface via say
"xl debug" should be perfectly fine and providing parameters should be
possible ?
>>  -- Keir
>> 
>>

Sander Eikelenboom

2012-Sep-04 08:21 UTC

Re: Using debug-key 'o: Dump IOMMU p2m table, locks up machine

Hello Wei,

Monday, September 3, 2012, 5:20:55 PM, you wrote:
> On 09/02/2012 05:14 PM, Sander Eikelenboom wrote:
>> Sunday, September 2, 2012, 4:58:58 PM, you wrote:
>>
>>> On 02/09/2012 09:43, "Sander Eikelenboom"<linux@eikelenboom.it>  wrote:
>>
>>>>> Quite simply, there likely needs to be more tracing on the IOMMU fault path.
>>>>> That's a separate concern from your keyhandler of course, but just saying
>>>>> I'd be looking for the former rather than the latter, for diagnosing
>>>>> Sander's bug.
>>>>
>>>> Are there any printk's I could add to get more relevant info about the AMD-Vi:
>>>> IO_PAGE_FAULT ?
>>
>>> No really straightforward one. I think we need a per-IOMMU-type handler to
>>> walk the IOMMU page table for a given virtual address, and dump every
>>> page-table-entry on the path. Like an IOMMU version of show_page_walk().
>>> Personally I would suspect this is more useful than the dump-everything
>>> handlers: just give a *full* *detailed* walk for the actually interesting
>>> virtual address (the one faulted on).
>>
>>>> I have attached new output from xl dmesg, this time with iommu=debug on (the
>>>> option changed from 4.1 to 4.2).
>>
>>> Not easy to glean any more from that, without extra tracing such as
>>> described above, and/or digging into the guest to find what driver-side
>>> actions are causing the faults.
>>
>> OK, too bad!
>> With xen 4.1 i haven't experienced those page faults, but a diff between
>> /xen/drivers/passthrough/amd in both trees show quite some changes :(
> Did you also update xen tools accordingly? Sometime I also saw a few 
> IO_PAGE_FAULTs came from nic if my tools version and HV version did not 
> match. But using recent 4.2 and corresponding xl, my tests went well.
> BTW: You could also try iommu=no-sharept to see if it helps.

I have done a make world && make install, after that checked the date on
(most of) the binaries and libs.
All should be 4.2, will try the iommu=no-sharept, but as said, this
wasn't necessary with 4.1.3.

> Thanks,
> Wei
>>>   -- Keir
>>
>>>>
>>>>
>>>>>   -- Keir
>>>>

-- 
Best regards,
 Sander                            mailto:linux@eikelenboom.it

Jan Beulich

2012-Sep-04 08:38 UTC

Re: Using debug-key 'o: Dump IOMMU p2m table, locks up machine

>>> On 04.09.12 at 10:11, Keir Fraser <keir.xen@gmail.com> wrote:
> On 04/09/2012 09:04, "Keir Fraser" <keir.xen@gmail.com> wrote:
> 
>> On 04/09/2012 08:55, "Jan Beulich" <JBeulich@suse.com> wrote:
>> 
>>>> Jan, Yes you could be right, if Sander is pinning CPUs. Anyway, I wasn't
>>>> going to expend too much brain power on this situation. The case of spending
>>>> a few minutes in one key handler is not one I think is particularly sane.
>>> 
>>> Which imo would call for reverting the patch. But then again, other
>>> key handlers can easily take pretty long too (particularly on large
>>> systems, albeit it is clear that the one here is particularly bad), and
>>> declaring all of them pretty much useless probably isn't the best
>>> choice (as then we could as well rip them all out).
>>> 
>>> Bottom line - _I_ think we should try to do something about this.
>>> An apparent option would be to have low priority tasklets (for
>>> just this purpose, as all others we certainly want to take priority),
>>> if that can reasonably be integrated with the schedulers.
>> 
>> Do you expect to be able to use the long-running key handlers and still need
>> a running system afterwards (rather than using them as a final
>> dump-everything when the system has already gone bad)? Then I suppose you
>> would need something like this, with voluntary preemption in the key
>> handlers. You then need to be able to recommence the keyhandlers where they
>> left off, retaking locks, finding their place in lists, trees, etc, even
>> when state of the system has significantly changed between preemption and
>> resumption. Well, I'm sure it can be done, but can anyone be bothered.

It may not be that difficult for e.g. the 'd' and '0' handlers.

> My pragmatic take would be that: (a) Really long-running handlers that want
> to dump every page mapping of every domain are pretty bloody stupid, and yes
> we should consider if they are worthwhile at all; (b) moderately
> long-running but useful handlers which nonetheless take a long time to dump
> to Xen's console, I would consider a sysctl to allow dom0 to request dump
> into a supplied memory buffer.

At a first glance that sounds like a viable option, assuming that
the bulk of the time otherwise is being spent actually getting the
data out through the serial line. But if the long-running-ness is
in the nature of the keyhandler itself, then this wouldn't help
much though. And I'd be afraid that ought to be the common
case when not running with sync_console, since actual serial
output happens asynchronously and hence shouldn't affect the
latency of the keyhandler's completion too much.

Jan

Keir Fraser

2012-Sep-04 08:54 UTC

Re: Using debug-key 'o: Dump IOMMU p2m table, locks up machine

On 04/09/2012 09:38, "Jan Beulich" <JBeulich@suse.com> wrote:
>> My pragmatic take would be that: (a) Really long-running handlers that want
>> to dump every page mapping of every domain are pretty bloody stupid, and yes
>> we should consider if they are worthwhile at all; (b) moderately
>> long-running but useful handlers which nonetheless take a long time to dump
>> to Xen's console, I would consider a sysctl to allow dom0 to request dump
>> into a supplied memory buffer.
> 
> At a first glance that sounds like a viable option, assuming that
> the bulk of the time otherwise is being spent actually getting the
> data out through the serial line. But if the long-running-ness is
> in the nature of the keyhandler itself, then this wouldn't help
> much though. And I'd be afraid that ought to be the common
> case when not running with sync_console, since actual serial
> output happens asynchronously and hence shouldn't affect the
> latency of the keyhandler's completion too much.

Well then, have we actually seen problems with async serial output, a
decent-sized serial output buffer, and the 'd'/'0' handlers? Because if not,
we don't have a problem to be solved. :)

Jan Beulich

2012-Sep-04 09:26 UTC

Re: Using debug-key 'o: Dump IOMMU p2m table, locks up machine

>>> On 04.09.12 at 10:13, Sander Eikelenboom <linux@eikelenboom.it> wrote:
> Hmm don't know how to get the file/line, only thing i have found is:
> 
> serveerstertje:/boot# gdb xen-syms-4.2.0-rc4-pre
> GNU gdb (GDB) 7.0.1-debian
> Copyright (C) 2009 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
> This is free software: you are free to change and redistribute it.
> There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
> and "show warranty" for details.
> This GDB was configured as "x86_64-linux-gnu".
> For bug reporting instructions, please see:
> <http://www.gnu.org/software/gdb/bugs/>...
> Reading symbols from /boot/xen-syms-4.2.0-rc4-pre...done.
> (gdb) x/i 0xffff82c48015c9ee
> 0xffff82c48015c9ee <context_switch+916>:        mov    %edx,%gs
> (gdb)

I'm not really a gdb expert, so I don't know off the top of my
head either. I thought I said in a previous reply that people
generally appear to use the addr2line utility for that purpose.
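
As a rough illustration of the address arithmetic involved, the Python sketch below takes the RIP line from the trace (context_switch+0x394/0xeeb at ffff82c48015c9ee), checks that it agrees with the gdb output (context_switch+916), and assembles an addr2line command line without running it. The helper names are made up for this example; the -e, -f and -i options are standard binutils addr2line flags, and the image name is the one from the gdb session above.

```python
from typing import List, Tuple


def parse_symbol_offset(text: str) -> Tuple[str, int, int]:
    """Split a Xen-style 'symbol+0xOFFSET/0xSIZE' string into its parts."""
    symbol, rest = text.split("+", 1)
    offset, size = rest.split("/", 1)
    return symbol, int(offset, 16), int(size, 16)


def function_start(rip: int, offset: int) -> int:
    """Address of the first instruction of the function containing rip."""
    return rip - offset


def addr2line_argv(image: str, rip: int) -> List[str]:
    """Command line asking addr2line for function name and file:line."""
    # -e: image with debug info, -f: print function names, -i: show inlines
    return ["addr2line", "-e", image, "-f", "-i", hex(rip)]


if __name__ == "__main__":
    rip = 0xFFFF82C48015C9EE
    symbol, offset, size = parse_symbol_offset("context_switch+0x394/0xeeb")
    print(symbol, offset, size)             # offset 0x394 is 916 decimal
    print(hex(function_start(rip, offset)))
    print(" ".join(addr2line_argv("xen-syms-4.2.0-rc4-pre", rip)))
```

Whether the resulting file:line is useful depends on the xen-syms image having been built with debug information, which is an assumption here.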

But the disassembly already tells us where precisely the
problem is: The selector value (0x0063) attempted to be put
into %gs is apparently wrong in the context of the current
GDT. Now, that's GDT_ENTRY_TLS_MIN on the Linux side,
and ought to be valid. I'm surprised the guest (and the current
process in it) survives this (as the failure here results in a failsafe
callback into the guest).
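
For reference, an x86 segment selector packs a table index, a table indicator bit and a requested privilege level into 16 bits. The small Python sketch below (the function name is made up for this example) decodes 0x0063 into those fields. It assumes GDT_ENTRY_TLS_MIN is 12 on 64-bit Linux, which matches the decoded index; it also notes that the error code in the "GPF (0060)" message, read under the usual selector error-code layout, points at the same GDT entry.

```python
from typing import Dict, Union


def decode_selector(selector: int) -> Dict[str, Union[int, str]]:
    """Split an x86 segment selector into index, table and RPL."""
    return {
        "index": selector >> 3,                      # descriptor table index
        "table": "LDT" if selector & 0x4 else "GDT", # table indicator (bit 2)
        "rpl": selector & 0x3,                       # requested privilege level
    }


if __name__ == "__main__":
    # Value being loaded into %gs according to the analysis above.
    print(decode_selector(0x0063))
    # The GPF error code 0x0060 carries the same index in bits 3 and up
    # (assumption: standard selector error-code format).
    print(0x0060 >> 3)
```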

Looking at the Linux side of things, this has been that way
forever, and I think has always been broken: On x86-64, it
should also clear %gs here (since 32-bit processes use it for
their TLS, and there's nothing wrong for a 64-bit process to put
something in there either), albeit not via loadsegment(), but
through xen_load_gs_index(). And I neither see why on 32-bit
it only clears %gs - %fs can as much hold a selector that might
get invalidated with the TLS descriptor updates. Eduardo,
Jeremy, Konrad?

Jan

Jan Beulich

2012-Sep-04 09:40 UTC

Re: Using debug-key 'o: Dump IOMMU p2m table, locks up machine

>>> On 04.09.12 at 10:54, Keir Fraser <keir@xen.org> wrote:
> On 04/09/2012 09:38, "Jan Beulich" <JBeulich@suse.com> wrote:
> 
>>> My pragmatic take would be that: (a) Really long-running handlers that want
>>> to dump every page mapping of every domain are pretty bloody stupid, and yes
>>> we should consider if they are worthwhile at all; (b) moderately
>>> long-running but useful handlers which nonetheless take a long time to dump
>>> to Xen's console, I would consider a sysctl to allow dom0 to request dump
>>> into a supplied memory buffer.
>> 
>> At a first glance that sounds like a viable option, assuming that
>> the bulk of the time otherwise is being spent actually getting the
>> data out through the serial line. But if the long-running-ness is
>> in the nature of the keyhandler itself, then this wouldn't help
>> much though. And I'd be afraid that ought to be the common
>> case when not running with sync_console, since actual serial
>> output happens asynchronously and hence shouldn't affect the
>> latency of the keyhandler's completion too much.
> 
> Well then, have we actually seen problems with async serial output, a
> decent-sized serial output buffer, and the 'd'/'0' handlers? Because if not,
> we don't have a problem to be solved. :)
To a degree - we have seen (large) systems becoming unstable
after making use of these keys, but obviously people were
instructed to use the keys because the system already had some
sort of problem (e.g. were dead locked on some spin lock, and
after use of the debug keys additionally got their time screwed
up).

The 'o' key here just gets this to the extreme, which is why I'm
wondering whether it was a good decision to add it in the first
place. And the same would apply to EPT's 'D' key. The more that
these keys would presumably be used when a guest had a
problem, yet their use could render the whole system dead
(whereas 'd' and '0' generally get used when the host is in a
bad state already). Perhaps a minimal step would be to build/
enable these only in debug=y builds? But really this functionality
should be exposed _only_ through the tools (similar to xenctx
and lsevtchn) imo (and along those lines I think 'e' should only
dump Dom0's event channels).

Jan

Andrew Cooper

2012-Sep-04 13:29 UTC

Re: Using debug-key 'o: Dump IOMMU p2m table, locks up machine

On 04/09/12 10:40, Jan Beulich wrote:
>>>> On 04.09.12 at 10:54, Keir Fraser <keir@xen.org> wrote:
>> On 04/09/2012 09:38, "Jan Beulich" <JBeulich@suse.com> wrote:
>>
>>>> My pragmatic take would be that: (a) Really long-running handlers that want
>>>> to dump every page mapping of every domain are pretty bloody stupid, and yes
>>>> we should consider if they are worthwhile at all; (b) moderately
>>>> long-running but useful handlers which nonetheless take a long time to dump
>>>> to Xen's console, I would consider a sysctl to allow dom0 to request dump
>>>> into a supplied memory buffer.
>>> At a first glance that sounds like a viable option, assuming that
>>> the bulk of the time otherwise is being spent actually getting the
>>> data out through the serial line. But if the long-running-ness is
>>> in the nature of the keyhandler itself, then this wouldn't help
>>> much though. And I'd be afraid that ought to be the common
>>> case when not running with sync_console, since actual serial
>>> output happens asynchronously and hence shouldn't affect the
>>> latency of the keyhandler's completion too much.
>> Well then, have we actually seen problems with async serial output, a
>> decent-sized serial output buffer, and the 'd'/'0' handlers? Because if not,
>> we don't have a problem to be solved. :)
> To a degree - we have seen (large) systems becoming unstable
> after making use of these keys, but obviously people were
> instructed to use the keys because the system already had some
> sort of problem (e.g. were dead locked on some spin lock, and
> after use of the debug keys additionally got their time screwed
> up).
>
> The 'o' key here just gets this to the extreme, which is why I'm
> wondering whether it was a good decision to add it in the first
> place. And the same would apply to EPT's 'D' key. The more that
> these keys would presumably be used when a guest had a
> problem, yet their use could render the whole system dead
> (whereas 'd' and '0' generally get used when the host is in a
> bad state already). Perhaps a minimal step would be to build/
> enable these only in debug=y builds? But really this functionality
> should be exposed _only_ through the tools (similar to xenctx
> and lsevtchn) imo (and along those lines I think 'e' should only
> dump Dom0's event channels).
>
> Jan

I would disagree with that final part.  I have lost count of the number
of times that I have used the 'e' debug key to diagnose a problem with a
locked up system where dom0 was not necessarily accessible.  Getting all
domains information is very useful, even if it can be long at times.
The times when the length is a problem are also the times when the
server is broken to a point that it is not a problem people are
concerned with.

~Andrew
-- 
Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
T: +44 (0)1223 225 900, http://www.citrix.com

Sander Eikelenboom

2012-Sep-04 16:43 UTC

Re: Using debug-key 'o: Dump IOMMU p2m table, locks up machine

Hello Wei,

Monday, September 3, 2012, 5:20:55 PM, you wrote:
> On 09/02/2012 05:14 PM, Sander Eikelenboom wrote:
>> Sunday, September 2, 2012, 4:58:58 PM, you wrote:
>>
>>> On 02/09/2012 09:43, "Sander Eikelenboom"<linux@eikelenboom.it>  wrote:
>>
>>>>> Quite simply, there likely needs to be more tracing on the IOMMU fault path.
>>>>> That's a separate concern from your keyhandler of course, but just saying
>>>>> I'd be looking for the former rather than the latter, for diagnosing
>>>>> Sander's bug.
>>>>
>>>> Are there any printk's I could add to get more relevant info about the AMD-Vi:
>>>> IO_PAGE_FAULT ?
>>
>>> No really straightforward one. I think we need a per-IOMMU-type handler to
>>> walk the IOMMU page table for a given virtual address, and dump every
>>> page-table-entry on the path. Like an IOMMU version of show_page_walk().
>>> Personally I would suspect this is more useful than the dump-everything
>>> handlers: just give a *full* *detailed* walk for the actually interesting
>>> virtual address (the one faulted on).
>>
>>>> I have attached new output from xl dmesg, this time with iommu=debug on (the
>>>> option changed from 4.1 to 4.2).
>>
>>> Not easy to glean any more from that, without extra tracing such as
>>> described above, and/or digging into the guest to find what driver-side
>>> actions are causing the faults.
>>
>> OK, too bad!
>> With xen 4.1 i haven't experienced those page faults, but a diff between
>> /xen/drivers/passthrough/amd in both trees show quite some changes :(
> Did you also update xen tools accordingly? Sometime I also saw a few 
> IO_PAGE_FAULTs came from nic if my tools version and HV version did not 
> match. But using recent 4.2 and corresponding xl, my tests went well.
> BTW: You could also try iommu=no-sharept to see if it helps.

Tried it and it doesn't help.
I now even got an "xl dmesg" which shows an IO_PAGE_FAULT occurring very
early, before any toolstack or guest can be involved:

(XEN) [2012-09-04 15:51:17] AMD-Vi: Setup I/O page table: device id = 0x0a05, root table = 0x24d84b000, domain = 0, paging mode = 3
(XEN) [2012-09-04 15:51:17] AMD-Vi: Setup I/O page table: device id = 0x0a06, root table = 0x24d84b000, domain = 0, paging mode = 3
(XEN) [2012-09-04 15:51:17] AMD-Vi: Setup I/O page table: device id = 0x0a07, root table = 0x24d84b000, domain = 0, paging mode = 3
(XEN) [2012-09-04 15:51:17] AMD-Vi: Setup I/O page table: device id = 0x0b00, root table = 0x24d84b000, domain = 0, paging mode = 3
(XEN) [2012-09-04 15:51:17] Scrubbing Free RAM: ...........................<0>AMD-Vi: IO_PAGE_FAULT: domain = 0, device id = 0x0a06, fault address = 0xc2c2c2c0
(XEN) [2012-09-04 15:51:18] ............................................done.
(XEN) [2012-09-04 15:51:19] Initial low memory virq threshold set at 0x4000 pages.
(XEN) [2012-09-04 15:51:19] Std. Loglevel: All
(XEN) [2012-09-04 15:51:19] Guest Loglevel: All
(XEN) [2012-09-04 15:51:19] Xen is relinquishing VGA console.


Complete dmesg attached.
> Thanks,
> Wei
>>>   -- Keir
>>
>>>>
>>>>
>>>>>   -- Keir
>>>>

-- 
Best regards,
 Sander                            mailto:linux@eikelenboom.it

Jan Beulich

2012-Sep-05 10:14 UTC

Re: Using debug-key 'o: Dump IOMMU p2m table, locks up machine

>>> On 04.09.12 at 18:43, Sander Eikelenboom <linux@eikelenboom.it> wrote:
>> Did you also update xen tools accordingly? Sometime I also saw a few
>> IO_PAGE_FAULTs came from nic if my tools version and HV version did not
>> match. But using recent 4.2 and corresponding xl, my tests went well.
>> BTW: You could also try iommu=no-sharept to see if it helps.
> 
> Tried it and it doesn't help.
> I now even got an "xl dmesg" which shows an IO_PAGE_FAULT occurring very early,
> before any toolstack or guest can be involved:
> 
> (XEN) [2012-09-04 15:51:17] AMD-Vi: Setup I/O page table: device id = 0x0a05, root table = 0x24d84b000, domain = 0, paging mode = 3
> (XEN) [2012-09-04 15:51:17] AMD-Vi: Setup I/O page table: device id = 0x0a06, root table = 0x24d84b000, domain = 0, paging mode = 3
> (XEN) [2012-09-04 15:51:17] AMD-Vi: Setup I/O page table: device id = 0x0a07, root table = 0x24d84b000, domain = 0, paging mode = 3
> (XEN) [2012-09-04 15:51:17] AMD-Vi: Setup I/O page table: device id = 0x0b00, root table = 0x24d84b000, domain = 0, paging mode = 3
> (XEN) [2012-09-04 15:51:17] Scrubbing Free RAM: ...........................<0>AMD-Vi: IO_PAGE_FAULT: domain = 0, device id = 0x0a06, fault address = 0xc2c2c2c0

Looks like use of uninitialized memory (assuming you're using a
debug hypervisor, that's the pattern scrub_one_page() puts
there). But it's unclear to me what device should be doing any
I/O at that point (and even if one does, how it would get the
bad address loaded). What is 0a:00.6?
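
The step from "device id = 0x0a06" in the log to "0a:00.6" is the usual PCI bus/device/function split of a 16-bit ID: bus in the upper 8 bits, device in the next 5, function in the lowest 3. A short Python sketch of that decoding follows; the function name is made up for this example, and it assumes the AMD-Vi messages print the device ID in this standard layout, which the IDs in this thread are consistent with.

```python
def decode_device_id(device_id: int) -> str:
    """Format a 16-bit PCI device ID as bus:device.function (lspci style)."""
    bus = (device_id >> 8) & 0xFF
    device = (device_id >> 3) & 0x1F
    function = device_id & 0x7
    return f"{bus:02x}:{device:02x}.{function:x}"


if __name__ == "__main__":
    # IDs taken from the AMD-Vi messages quoted in this thread.
    for dev_id in (0x0A05, 0x0A06, 0x0A07, 0x0B00, 0x0700):
        print(f"{dev_id:#06x} -> {decode_device_id(dev_id)}")
```

The resulting string can then be looked up directly in lspci output.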

Jan

Sander Eikelenboom

2012-Sep-05 10:25 UTC

Re: Using debug-key 'o: Dump IOMMU p2m table, locks up machine

Wednesday, September 5, 2012, 12:14:02 PM, you wrote:
>>>> On 04.09.12 at 18:43, Sander Eikelenboom <linux@eikelenboom.it> wrote:
>>> Did you also update xen tools accordingly? Sometime I also saw a few
>>> IO_PAGE_FAULTs came from nic if my tools version and HV version did not
>>> match. But using recent 4.2 and corresponding xl, my tests went well.
>>> BTW: You could also try iommu=no-sharept to see if it helps.
>> 
>> Tried it and it doesn't help.
>> I now even got an "xl dmesg" which shows an IO_PAGE_FAULT occurring very early,
>> before any toolstack or guest can be involved:
>> 
>> (XEN) [2012-09-04 15:51:17] AMD-Vi: Setup I/O page table: device id = 0x0a05, root table = 0x24d84b000, domain = 0, paging mode = 3
>> (XEN) [2012-09-04 15:51:17] AMD-Vi: Setup I/O page table: device id = 0x0a06, root table = 0x24d84b000, domain = 0, paging mode = 3
>> (XEN) [2012-09-04 15:51:17] AMD-Vi: Setup I/O page table: device id = 0x0a07, root table = 0x24d84b000, domain = 0, paging mode = 3
>> (XEN) [2012-09-04 15:51:17] AMD-Vi: Setup I/O page table: device id = 0x0b00, root table = 0x24d84b000, domain = 0, paging mode = 3
>> (XEN) [2012-09-04 15:51:17] Scrubbing Free RAM: ...........................<0>AMD-Vi: IO_PAGE_FAULT: domain = 0, device id = 0x0a06, fault address = 0xc2c2c2c0
> Looks like use of uninitialized memory (assuming you're using a
> debug hypervisor, that's the pattern scrub_one_page() puts
> there). But it's unclear to me what device should be doing any
> I/O at that point (and even if one does, how it would get the
> bad address loaded). What is 0a:00.6?

since 4.2-rc4 is still unstable it has debug=y for what i know, so yes.
This particular IO_PAGE_FAULT happened before the kernel loads, so the kernel
and pciback shouldn't be causing the issue one would say.
With pciback i'm hiding 03:06.0, 04:00.*, 05:00.0, 0a:00.* and 07:00.0
at boot.

Is there any code i could add to get more info where it comes from ?

00:00.0 Host bridge: ATI Technologies Inc RD890 Northbridge only single slot
PCI-e GFX Hydra part (rev 02)
00:00.2 Generic system peripheral [0806]: ATI Technologies Inc RD990 I/O Memory
Management Unit (IOMMU)
00:02.0 PCI bridge: ATI Technologies Inc RD890 PCI to PCI bridge (PCI express
gpp port B)
00:03.0 PCI bridge: ATI Technologies Inc RD890 PCI to PCI bridge (PCI express
gpp port C)
00:05.0 PCI bridge: ATI Technologies Inc RD890 PCI to PCI bridge (PCI express
gpp port E)
00:06.0 PCI bridge: ATI Technologies Inc RD890 PCI to PCI bridge (PCI express
gpp port F)
00:0a.0 PCI bridge: ATI Technologies Inc RD890 PCI to PCI bridge (external gfx1
port A)
00:0b.0 PCI bridge: ATI Technologies Inc RD890 PCI to PCI bridge (NB-SB link)
00:0c.0 PCI bridge: ATI Technologies Inc Device 5a20
00:0d.0 PCI bridge: ATI Technologies Inc RD890 PCI to PCI bridge (external gfx1
port B)
00:11.0 SATA controller: ATI Technologies Inc SB7x0/SB8x0/SB9x0 SATA Controller
[AHCI mode] (rev 40)
00:12.0 USB controller: ATI Technologies Inc SB7x0/SB8x0/SB9x0 USB OHCI0
Controller
00:12.2 USB controller: ATI Technologies Inc SB7x0/SB8x0/SB9x0 USB EHCI
Controller
00:13.0 USB controller: ATI Technologies Inc SB7x0/SB8x0/SB9x0 USB OHCI0
Controller
00:13.2 USB controller: ATI Technologies Inc SB7x0/SB8x0/SB9x0 USB EHCI
Controller
00:14.0 SMBus: ATI Technologies Inc SBx00 SMBus Controller (rev 41)
00:14.3 ISA bridge: ATI Technologies Inc SB7x0/SB8x0/SB9x0 LPC host controller
(rev 40)
00:14.4 PCI bridge: ATI Technologies Inc SBx00 PCI to PCI Bridge (rev 40)
00:14.5 USB controller: ATI Technologies Inc SB7x0/SB8x0/SB9x0 USB OHCI2
Controller
00:15.0 PCI bridge: ATI Technologies Inc SB700/SB800/SB900 PCI to PCI bridge
(PCIE port 0)
00:16.0 USB controller: ATI Technologies Inc SB7x0/SB8x0/SB9x0 USB OHCI0
Controller
00:16.2 USB controller: ATI Technologies Inc SB7x0/SB8x0/SB9x0 USB EHCI
Controller
00:18.0 Host bridge: Advanced Micro Devices [AMD] Family 10h Processor
HyperTransport Configuration
00:18.1 Host bridge: Advanced Micro Devices [AMD] Family 10h Processor Address
Map
00:18.2 Host bridge: Advanced Micro Devices [AMD] Family 10h Processor DRAM
Controller
00:18.3 Host bridge: Advanced Micro Devices [AMD] Family 10h Processor
Miscellaneous Control
00:18.4 Host bridge: Advanced Micro Devices [AMD] Family 10h Processor Link
Control
03:06.0 Multimedia audio controller: C-Media Electronics Inc CM8738 (rev 10)
04:00.0 USB controller: NetMos Technology MCS9990 PCIe to 4âPort USB 2.0 Host
Controller
04:00.1 USB controller: NetMos Technology MCS9990 PCIe to 4âPort USB 2.0 Host
Controller
04:00.2 USB controller: NetMos Technology MCS9990 PCIe to 4âPort USB 2.0 Host
Controller
04:00.3 USB controller: NetMos Technology MCS9990 PCIe to 4âPort USB 2.0 Host
Controller
04:00.4 USB controller: NetMos Technology MCS9990 PCIe to 4âPort USB 2.0 Host
Controller
04:00.5 USB controller: NetMos Technology MCS9990 PCIe to 4âPort USB 2.0 Host
Controller
04:00.6 USB controller: NetMos Technology MCS9990 PCIe to 4âPort USB 2.0 Host
Controller
04:00.7 USB controller: NetMos Technology MCS9990 PCIe to 4âPort USB 2.0 Host
Controller
05:00.0 Multimedia video controller: Conexant Systems, Inc. CX25850
06:00.0 VGA compatible controller: ATI Technologies Inc RV620 LE [Radeon HD
3450]
06:00.1 Audio device: ATI Technologies Inc RV620 Audio device [Radeon HD 34xx
Series]
07:00.0 Multimedia video controller: Conexant Systems, Inc. Device 8210
08:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI
Express Gigabit Ethernet controller (rev 03)
09:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI
Express Gigabit Ethernet controller (rev 03)
0a:00.0 USB controller: NetMos Technology MCS9990 PCIe to 4-Port USB 2.0 Host
Controller
0a:00.1 USB controller: NetMos Technology MCS9990 PCIe to 4-Port USB 2.0 Host
Controller
0a:00.2 USB controller: NetMos Technology MCS9990 PCIe to 4-Port USB 2.0 Host
Controller
0a:00.3 USB controller: NetMos Technology MCS9990 PCIe to 4-Port USB 2.0 Host
Controller
0a:00.4 USB controller: NetMos Technology MCS9990 PCIe to 4-Port USB 2.0 Host
Controller
0a:00.5 USB controller: NetMos Technology MCS9990 PCIe to 4-Port USB 2.0 Host
Controller
0a:00.6 USB controller: NetMos Technology MCS9990 PCIe to 4-Port USB 2.0 Host
Controller
0a:00.7 USB controller: NetMos Technology MCS9990 PCIe to 4-Port USB 2.0 Host
Controller
0b:00.0 VGA compatible controller: nVidia Corporation G98 [GeForce 8400 GS] (rev
a1)

> Jan

Jan Beulich

2012-Sep-05 10:40 UTC

Re: Using debug-key 'o: Dump IOMMU p2m table, locks up machine

>>> On 05.09.12 at 12:25, Sander Eikelenboom <linux@eikelenboom.it> wrote:
> Wednesday, September 5, 2012, 12:14:02 PM, you wrote:
>>>>> On 04.09.12 at 18:43, Sander Eikelenboom <linux@eikelenboom.it> wrote:
>>> ...........................<0>AMD-Vi: IO_PAGE_FAULT: domain = 0, device id =
>>> 0x0a06, fault address = 0xc2c2c2c0
>
>> Looks like use of uninitialized memory (assuming you're using a
>> debug hypervisor, that's the pattern scrub_one_page() puts
>> there). But it's unclear to me what device should be doing any
>> I/O at that point (and even if one does, how it would get the
>> bad address loaded). What is 0a:00.6?
>
> since 4.2-rc4 is still unstable it has debug=y for what i know, so yes.
> This particular IO_PAGE_FAULT happened before the kernel loads, so the
> kernel and pciback shouldn't be causing the issue one would say.
> With pciback i'm hiding 03:06.0, 04:00.*, 05:00.0, 0a:00.* and 07:00.0 at
> boot.
>
> Is there any code i could add to get more info where it comes from ?

Hardly, since those accesses are asynchronous to what the CPUs
do. But ...

> 0a:00.6 USB controller: NetMos Technology MCS9990 PCIe to 4-Port USB 2.0
> Host Controller

... are your keyboard/mouse perhaps connected to this one? In
which case I'd suppose the 1:1 tables set up for Dom0 might not
be complete. Wei?

Jan
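
For anyone matching the fault line against the lspci listing: the "device id" printed by AMD-Vi is the 16-bit PCI requester ID, which packs bus, device and function together. The sketch below is only an illustration (the helper name decode_device_id is made up here, it is not a Xen function) of why 0x0a06 is the 0a:00.6 entry asked about above.

```python
def decode_device_id(device_id: int) -> str:
    """Split a 16-bit PCI requester ID into bus:device.function.

    Layout: bits 15..8 = bus, bits 7..3 = device, bits 2..0 = function.
    """
    bus = (device_id >> 8) & 0xFF
    dev = (device_id >> 3) & 0x1F
    func = device_id & 0x7
    return f"{bus:02x}:{dev:02x}.{func:x}"


# Device id taken from the IO_PAGE_FAULT line quoted in this thread.
print(decode_device_id(0x0A06))
```

The same helper applied to the ids in the "Setup I/O page table" lines (0x0a05, 0x0a07, 0x0b00) gives the neighbouring functions and the 0b:00.0 VGA device.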

Sander Eikelenboom

2012-Sep-05 10:48 UTC

Re: Using debug-key 'o: Dump IOMMU p2m table, locks up machine

Wednesday, September 5, 2012, 12:40:31 PM, you wrote:

>>>> On 05.09.12 at 12:25, Sander Eikelenboom <linux@eikelenboom.it> wrote:
>> Wednesday, September 5, 2012, 12:14:02 PM, you wrote:
>>>>>> On 04.09.12 at 18:43, Sander Eikelenboom <linux@eikelenboom.it> wrote:
>>>> ...........................<0>AMD-Vi: IO_PAGE_FAULT: domain = 0, device id =
>>>> 0x0a06, fault address = 0xc2c2c2c0
>>
>>> Looks like use of uninitialized memory (assuming you're using a
>>> debug hypervisor, that's the pattern scrub_one_page() puts
>>> there). But it's unclear to me what device should be doing any
>>> I/O at that point (and even if one does, how it would get the
>>> bad address loaded). What is 0a:00.6?
>>
>> since 4.2-rc4 is still unstable it has debug=y for what i know, so yes.
>> This particular IO_PAGE_FAULT happened before the kernel loads, so the
>> kernel and pciback shouldn't be causing the issue one would say.
>> With pciback i'm hiding 03:06.0, 04:00.*, 05:00.0, 0a:00.* and 07:00.0 at
>> boot.
>>
>> Is there any code i could add to get more info where it comes from ?

> Hardly, since those accesses are asynchronous to what the CPUs
> do. But ...

>> 0a:00.6 USB controller: NetMos Technology MCS9990 PCIe to 4-Port USB 2.0
>> Host Controller

> ... are your keyboard/mouse perhaps connected to this one? In
> which case I'd suppose the 1:1 tables set up for Dom0 might not
> be complete. Wei?

Nope this machine is running without any keyboard/mouse, the USB controller at
present has only one device connected to it:
in the pv guest lsusb:
Bus 007 Device 002: ID 10cf:5500 Velleman Components, Inc. 8055 Experiment
Interface Board (address=0)

And as i said, the hardware didn't change between my switch from xen-4.1.3
to xen-4.2.

But i will revert to 4.1 and see if i can spot any difference in xl dmesg
between the two.

> Jan


Jan Beulich

2012-Sep-05 11:41 UTC

Re: Using debug-key 'o: Dump IOMMU p2m table, locks up machine

>>> On 05.09.12 at 12:48, Sander Eikelenboom <linux@eikelenboom.it> wrote:
> Wednesday, September 5, 2012, 12:40:31 PM, you wrote:
>
>>>>> On 05.09.12 at 12:25, Sander Eikelenboom <linux@eikelenboom.it> wrote:
>>> Wednesday, September 5, 2012, 12:14:02 PM, you wrote:
>>>>>>> On 04.09.12 at 18:43, Sander Eikelenboom <linux@eikelenboom.it> wrote:
>>>>> ...........................<0>AMD-Vi: IO_PAGE_FAULT: domain = 0, device id =
>>>>> 0x0a06, fault address = 0xc2c2c2c0
>>>
>>>> Looks like use of uninitialized memory (assuming you're using a
>>>> debug hypervisor, that's the pattern scrub_one_page() puts
>>>> there). But it's unclear to me what device should be doing any
>>>> I/O at that point (and even if one does, how it would get the
>>>> bad address loaded). What is 0a:00.6?
>>>
>>> since 4.2-rc4 is still unstable it has debug=y for what i know, so yes.
>>> This particular IO_PAGE_FAULT happened before the kernel loads, so the
>>> kernel and pciback shouldn't be causing the issue one would say.
>>> With pciback i'm hiding 03:06.0, 04:00.*, 05:00.0, 0a:00.* and 07:00.0 at
>>> boot.
>>>
>>> Is there any code i could add to get more info where it comes from ?
>
>> Hardly, since those accesses are asynchronous to what the CPUs
>> do. But ...
>
>>> 0a:00.6 USB controller: NetMos Technology MCS9990 PCIe to 4-Port USB 2.0
>>> Host Controller
>
>> ... are your keyboard/mouse perhaps connected to this one? In
>> which case I'd suppose the 1:1 tables set up for Dom0 might not
>> be complete. Wei?
>
> Nope this machine is running without any keyboard/mouse, the USB controller
> at present has only one device connected to it:
> in the pv guest lsusb:
> Bus 007 Device 002: ID 10cf:5500 Velleman Components, Inc. 8055 Experiment
> Interface Board (address=0)

And this is not by chance hanging off the controller that the fault
was reported for?

> And as i said, the hardware didn't change between my switch from xen-4.1.3
> to xen-4.2.

I understand that, but the problem here showed up only after
toggling the page table sharing option iirc.

Jan

Sander Eikelenboom

2012-Sep-05 12:11 UTC

Re: Using debug-key 'o: Dump IOMMU p2m table, locks up machine

Wednesday, September 5, 2012, 1:41:25 PM, you wrote:

>>>> On 05.09.12 at 12:48, Sander Eikelenboom <linux@eikelenboom.it> wrote:
>> Wednesday, September 5, 2012, 12:40:31 PM, you wrote:
>>
>>>>>> On 05.09.12 at 12:25, Sander Eikelenboom <linux@eikelenboom.it> wrote:
>>>> Wednesday, September 5, 2012, 12:14:02 PM, you wrote:
>>>>>>>> On 04.09.12 at 18:43, Sander Eikelenboom <linux@eikelenboom.it> wrote:
>>>>>> ...........................<0>AMD-Vi: IO_PAGE_FAULT: domain = 0, device id =
>>>>>> 0x0a06, fault address = 0xc2c2c2c0
>>>>
>>>>> Looks like use of uninitialized memory (assuming you're using a
>>>>> debug hypervisor, that's the pattern scrub_one_page() puts
>>>>> there). But it's unclear to me what device should be doing any
>>>>> I/O at that point (and even if one does, how it would get the
>>>>> bad address loaded). What is 0a:00.6?
>>>>
>>>> since 4.2-rc4 is still unstable it has debug=y for what i know, so yes.
>>>> This particular IO_PAGE_FAULT happened before the kernel loads, so the
>>>> kernel and pciback shouldn't be causing the issue one would say.
>>>> With pciback i'm hiding 03:06.0, 04:00.*, 05:00.0, 0a:00.* and 07:00.0 at
>>>> boot.
>>>>
>>>> Is there any code i could add to get more info where it comes from ?
>>
>>> Hardly, since those accesses are asynchronous to what the CPUs
>>> do. But ...
>>
>>>> 0a:00.6 USB controller: NetMos Technology MCS9990 PCIe to 4-Port USB 2.0
>>>> Host Controller
>>
>>> ... are your keyboard/mouse perhaps connected to this one? In
>>> which case I'd suppose the 1:1 tables set up for Dom0 might not
>>> be complete. Wei?
>>
>> Nope this machine is running without any keyboard/mouse, the USB controller
>> at present has only one device connected to it:
>> in the pv guest lsusb:
>> Bus 007 Device 002: ID 10cf:5500 Velleman Components, Inc. 8055 Experiment
>> Interface Board (address=0)

> And this is not by chance hanging off the controller that the fault
> was reported for?

Yes but i also get faults for the 07:00.0 later on booting.

>> And as i said, the hardware didn't change between my switch from xen-4.1.3
>> to xen-4.2.

> I understand that, but the problem here showed up only after
> toggling the page table sharing option iirc.

You mean that the fault occurring this early during boot, only happened after
enabling the "iommu=no-sharept" ?
That's correct although it's not clear if that is coincidence or not.

> Jan


Wei Wang

2012-Sep-05 12:30 UTC

Re: Using debug-key 'o: Dump IOMMU p2m table, locks up machine

On 09/04/2012 06:43 PM, Sander Eikelenboom wrote:
> Hello Wei,
>
> Tried it and it doesn't help.
> I now even got a "xl dmesg" which shows an IO_PAGE_FAULT occurring very
> early, before any toolstack or guest can be involved:
>
> (XEN) [2012-09-04 15:51:17] AMD-Vi: Setup I/O page table: device id = 0x0a05, root table = 0x24d84b000, domain = 0, paging mode = 3
> (XEN) [2012-09-04 15:51:17] AMD-Vi: Setup I/O page table: device id = 0x0a06, root table = 0x24d84b000, domain = 0, paging mode = 3
> (XEN) [2012-09-04 15:51:17] AMD-Vi: Setup I/O page table: device id = 0x0a07, root table = 0x24d84b000, domain = 0, paging mode = 3
> (XEN) [2012-09-04 15:51:17] AMD-Vi: Setup I/O page table: device id = 0x0b00, root table = 0x24d84b000, domain = 0, paging mode = 3
> (XEN) [2012-09-04 15:51:17] Scrubbing Free RAM: ...........................<0>AMD-Vi: IO_PAGE_FAULT: domain = 0, device id = 0x0a06, fault address = 0xc2c2c2c0

You have dom0_mem=1024M, so you could try removing that option to see
whether the fault disappears. It would also be helpful if you could provide
lspci output from your system.

Thanks,
Wei
> (XEN) [2012-09-04 15:51:18] ............................................done.
> (XEN) [2012-09-04 15:51:19] Initial low memory virq threshold set at 0x4000 pages.
> (XEN) [2012-09-04 15:51:19] Std. Loglevel: All
> (XEN) [2012-09-04 15:51:19] Guest Loglevel: All
> (XEN) [2012-09-04 15:51:19] Xen is relinquishing VGA console.
>
> Complete dmesg attached.
>
>> Thanks,
>> Wei
>
>>>>    -- Keir
>>>
>>>>>>    -- Keir
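
When comparing xl dmesg output between Xen versions, it can help to pull the fault records out mechanically. The following is a rough sketch, assuming the message format is exactly the one quoted in this thread (other Xen versions may print different fields); the function name parse_io_page_faults is hypothetical.

```python
import re

# Matches the AMD-Vi fault line as it appears in the log quoted in this thread.
FAULT_RE = re.compile(
    r"AMD-Vi: IO_PAGE_FAULT: domain = (\d+), "
    r"device id = (0x[0-9a-fA-F]+), "
    r"fault address = (0x[0-9a-fA-F]+)"
)


def parse_io_page_faults(log_text: str) -> list:
    """Return one dict per IO_PAGE_FAULT record found in log_text."""
    faults = []
    for match in FAULT_RE.finditer(log_text):
        faults.append(
            {
                "domain": int(match.group(1)),
                "device_id": int(match.group(2), 16),
                "fault_address": int(match.group(3), 16),
            }
        )
    return faults


sample = (
    "(XEN) [2012-09-04 15:51:17] Scrubbing Free RAM: "
    "...........................<0>AMD-Vi: IO_PAGE_FAULT: domain = 0, "
    "device id = 0x0a06, fault address = 0xc2c2c2c0"
)
for fault in parse_io_page_faults(sample):
    print(fault)
```

Feeding it the saved output of "xl dmesg" from a 4.1 boot and a 4.2 boot would give two lists that are easy to diff by device id.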

Wei Wang

2012-Sep-05 12:48 UTC

Re: Using debug-key 'o: Dump IOMMU p2m table, locks up machine

OK, I got your lspci info, please find my answer inline.

On 09/05/2012 12:25 PM, Sander Eikelenboom wrote:
>
> 00:00.0 Host bridge: ATI Technologies Inc RD890 Northbridge only single slot PCI-e GFX Hydra part (rev 02)
> 00:00.2 Generic system peripheral [0806]: ATI Technologies Inc RD990 I/O Memory Management Unit (IOMMU)
> 00:02.0 PCI bridge: ATI Technologies Inc RD890 PCI to PCI bridge (PCI express gpp port B)
> 00:03.0 PCI bridge: ATI Technologies Inc RD890 PCI to PCI bridge (PCI express gpp port C)
> 00:05.0 PCI bridge: ATI Technologies Inc RD890 PCI to PCI bridge (PCI express gpp port E)
> 00:06.0 PCI bridge: ATI Technologies Inc RD890 PCI to PCI bridge (PCI express gpp port F)
> 00:0a.0 PCI bridge: ATI Technologies Inc RD890 PCI to PCI bridge (external gfx1 port A)
> 00:0b.0 PCI bridge: ATI Technologies Inc RD890 PCI to PCI bridge (NB-SB link)
> 00:0c.0 PCI bridge: ATI Technologies Inc Device 5a20
> 00:0d.0 PCI bridge: ATI Technologies Inc RD890 PCI to PCI bridge (external gfx1 port B)
> 00:11.0 SATA controller: ATI Technologies Inc SB7x0/SB8x0/SB9x0 SATA Controller [AHCI mode] (rev 40)
> 00:12.0 USB controller: ATI Technologies Inc SB7x0/SB8x0/SB9x0 USB OHCI0 Controller
> 00:12.2 USB controller: ATI Technologies Inc SB7x0/SB8x0/SB9x0 USB EHCI Controller
> 00:13.0 USB controller: ATI Technologies Inc SB7x0/SB8x0/SB9x0 USB OHCI0 Controller
> 00:13.2 USB controller: ATI Technologies Inc SB7x0/SB8x0/SB9x0 USB EHCI Controller
> 00:14.0 SMBus: ATI Technologies Inc SBx00 SMBus Controller (rev 41)
> 00:14.3 ISA bridge: ATI Technologies Inc SB7x0/SB8x0/SB9x0 LPC host controller (rev 40)
> 00:14.4 PCI bridge: ATI Technologies Inc SBx00 PCI to PCI Bridge (rev 40)
> 00:14.5 USB controller: ATI Technologies Inc SB7x0/SB8x0/SB9x0 USB OHCI2 Controller
> 00:15.0 PCI bridge: ATI Technologies Inc SB700/SB800/SB900 PCI to PCI bridge (PCIE port 0)
> 00:16.0 USB controller: ATI Technologies Inc SB7x0/SB8x0/SB9x0 USB OHCI0 Controller
> 00:16.2 USB controller: ATI Technologies Inc SB7x0/SB8x0/SB9x0 USB EHCI Controller
> 00:18.0 Host bridge: Advanced Micro Devices [AMD] Family 10h Processor HyperTransport Configuration
> 00:18.1 Host bridge: Advanced Micro Devices [AMD] Family 10h Processor Address Map
> 00:18.2 Host bridge: Advanced Micro Devices [AMD] Family 10h Processor DRAM Controller
> 00:18.3 Host bridge: Advanced Micro Devices [AMD] Family 10h Processor Miscellaneous Control
> 00:18.4 Host bridge: Advanced Micro Devices [AMD] Family 10h Processor Link Control
> 03:06.0 Multimedia audio controller: C-Media Electronics Inc CM8738 (rev 10)
> 04:00.0 USB controller: NetMos Technology MCS9990 PCIe to 4-Port USB 2.0 Host Controller
> 04:00.1 USB controller: NetMos Technology MCS9990 PCIe to 4-Port USB 2.0 Host Controller
> 04:00.2 USB controller: NetMos Technology MCS9990 PCIe to 4-Port USB 2.0 Host Controller
> 04:00.3 USB controller: NetMos Technology MCS9990 PCIe to 4-Port USB 2.0 Host Controller
> 04:00.4 USB controller: NetMos Technology MCS9990 PCIe to 4-Port USB 2.0 Host Controller
> 04:00.5 USB controller: NetMos Technology MCS9990 PCIe to 4-Port USB 2.0 Host Controller
> 04:00.6 USB controller: NetMos Technology MCS9990 PCIe to 4-Port USB 2.0 Host Controller
> 04:00.7 USB controller: NetMos Technology MCS9990 PCIe to 4-Port USB 2.0 Host Controller
> 05:00.0 Multimedia video controller: Conexant Systems, Inc. CX25850
> 06:00.0 VGA compatible controller: ATI Technologies Inc RV620 LE [Radeon HD 3450]
> 06:00.1 Audio device: ATI Technologies Inc RV620 Audio device [Radeon HD 34xx Series]
> 07:00.0 Multimedia video controller: Conexant Systems, Inc. Device 8210

What kind of device is this? Is it a graphics card? Is there any firmware
running on it? Some firmware, like a VBIOS, might make bad assumptions
about the address space. And how much memory does the guest have? Could
you attach lspci -vvv output from your guest?

Thanks,
Wei

> 08:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 03)
> 09:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 03)
> 0a:00.0 USB controller: NetMos Technology MCS9990 PCIe to 4-Port USB 2.0 Host Controller
> 0a:00.1 USB controller: NetMos Technology MCS9990 PCIe to 4-Port USB 2.0 Host Controller
> 0a:00.2 USB controller: NetMos Technology MCS9990 PCIe to 4-Port USB 2.0 Host Controller
> 0a:00.3 USB controller: NetMos Technology MCS9990 PCIe to 4-Port USB 2.0 Host Controller
> 0a:00.4 USB controller: NetMos Technology MCS9990 PCIe to 4-Port USB 2.0 Host Controller
> 0a:00.5 USB controller: NetMos Technology MCS9990 PCIe to 4-Port USB 2.0 Host Controller
> 0a:00.6 USB controller: NetMos Technology MCS9990 PCIe to 4-Port USB 2.0 Host Controller
> 0a:00.7 USB controller: NetMos Technology MCS9990 PCIe to 4-Port USB 2.0 Host Controller
> 0b:00.0 VGA compatible controller: nVidia Corporation G98 [GeForce 8400 GS] (rev a1)
>
>> Jan

Wei Wang

2012-Sep-05 14:15 UTC

Re: Using debug-key 'o: Dump IOMMU p2m table, locks up machine

On 09/05/2012 12:40 PM, Jan Beulich wrote:
>>>> On 05.09.12 at 12:25, Sander Eikelenboom <linux@eikelenboom.it> wrote:
>> Wednesday, September 5, 2012, 12:14:02 PM, you wrote:
>>>>>> On 04.09.12 at 18:43, Sander Eikelenboom <linux@eikelenboom.it> wrote:
>>>> ...........................<0>AMD-Vi: IO_PAGE_FAULT: domain = 0, device id =
>>>> 0x0a06, fault address = 0xc2c2c2c0
>>
>>> Looks like use of uninitialized memory (assuming you're using a
>>> debug hypervisor, that's the pattern scrub_one_page() puts
>>> there). But it's unclear to me what device should be doing any
>>> I/O at that point (and even if one does, how it would get the
>>> bad address loaded). What is 0a:00.6?
>>
>> since 4.2-rc4 is still unstable it has debug=y for what i know, so yes.
>> This particular IO_PAGE_FAULT happened before the kernel loads, so the
>> kernel and pciback shouldn't be causing the issue one would say.
>> With pciback i'm hiding 03:06.0, 04:00.*, 05:00.0, 0a:00.* and 07:00.0 at
>> boot.
>>
>> Is there any code i could add to get more info where it comes from ?
>
> Hardly, since those accesses are asynchronous to what the CPUs
> do. But ...
>
>> 0a:00.6 USB controller: NetMos Technology MCS9990 PCIe to 4-Port USB 2.0
>> Host Controller
>
> ... are your keyboard/mouse perhaps connected to this one? In
> which case I'd suppose the 1:1 tables set up for Dom0 might not
> be complete. Wei?

I checked this on my machine using 'o' key and it has been mapped as a
2MB frame see: gfn: 000c2c00  mfn: 000c2c00. But maybe I have no device
access to this address... Another possibility is interrupt message. 4.1
does not show IO_PAGE_FAULT for interrupts, but 4.2 does (changeset
23199:dbd98ab2f87f). I will send a patch to dump flags from IO_PAGE_FAULT.

Thanks,
Wei

> Jan
>
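
The relation between the fault address and the "gfn: 000c2c00" entry mentioned above can be checked with a little arithmetic. This is only a sketch, assuming 4 KiB base pages and a 2 MiB superpage made of 512 such pages; the helper names are invented for the example.

```python
PAGE_SHIFT = 12        # 4 KiB pages
SUPERPAGE_ORDER = 9    # 2 MiB superpage = 2**9 pages of 4 KiB


def gfn_of(address: int) -> int:
    """Frame number containing the given (guest) physical address."""
    return address >> PAGE_SHIFT


def superpage_base(gfn: int) -> int:
    """First frame of the 2 MiB-aligned superpage that contains gfn."""
    return gfn & ~((1 << SUPERPAGE_ORDER) - 1)


fault_address = 0xC2C2C2C0  # from the IO_PAGE_FAULT line in this thread
gfn = gfn_of(fault_address)
base = superpage_base(gfn)
print(f"gfn: {gfn:08x}  2MB frame base: {base:08x}")
```

The fault address lies in frame 0xc2c2c, and the 2 MiB frame covering it starts at 0xc2c00, which is the entry seen in the 'o' dump on the other machine.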


Jan Beulich

2012-Sep-05 15:05 UTC

Re: Using debug-key 'o: Dump IOMMU p2m table, locks up machine

>>> On 05.09.12 at 16:15, Wei Wang <wei.wang2@amd.com> wrote:
> I checked this on my machine using 'o' key and it has been mapped as a
> 2MB frame see: gfn: 000c2c00  mfn: 000c2c00. But maybe I have no device

Obviously he won't have gfn c2c00 and alike since he's using
dom0_mem=1G - you should probably try that too. But even then
I wouldn't expect you to see anything, as you also need a
babbling device.

> access to this address... Another possibility is interrupt message. 4.1
> does not show IO_PAGE_FAULT for interrupts, but 4.2 does (changeset
> 23199:dbd98ab2f87f). I will send a patch to dump flags from IO_PAGE_FAULT.

Yes, that might help a little further.

Jan
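
The point about dom0_mem=1G can be made concrete with a back-of-the-envelope check. The sketch below makes a deliberately simplified assumption, namely that a Dom0 limited to 1024 MiB only has frames numbered from 0 up to its page count; real layouts have holes, so treat this as an illustration of the size argument only.

```python
PAGE_SIZE = 4096  # bytes per 4 KiB page


def pages_for(mem_mib: int) -> int:
    """Number of 4 KiB pages in mem_mib mebibytes."""
    return mem_mib * 1024 * 1024 // PAGE_SIZE


dom0_pages = pages_for(1024)            # dom0_mem=1024M as stated in the thread
fault_gfn = 0xC2C2C2C0 // PAGE_SIZE     # frame of the reported fault address

print(hex(dom0_pages), hex(fault_gfn), fault_gfn < dom0_pages)
```

Under that assumption the faulting frame is roughly three times beyond the end of Dom0's memory, so no mapping for it would be expected in Dom0's tables.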

Jan Beulich

2012-Sep-20 08:08 UTC

Re: Using debug-key 'o: Dump IOMMU p2m table, locks up machine

Ping?

>>> On 04.09.12 at 11:26, Jan Beulich wrote:
>>>> On 04.09.12 at 10:13, Sander Eikelenboom <linux@eikelenboom.it> wrote:
> > Hmm don't know how to get the file/line, only thing i have found is:
> >
> > serveerstertje:/boot# gdb xen-syms-4.2.0-rc4-pre
> > GNU gdb (GDB) 7.0.1-debian
> > Copyright (C) 2009 Free Software Foundation, Inc.
> > License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
> > This is free software: you are free to change and redistribute it.
> > There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
> > and "show warranty" for details.
> > This GDB was configured as "x86_64-linux-gnu".
> > For bug reporting instructions, please see:
> > <http://www.gnu.org/software/gdb/bugs/>...
> > Reading symbols from /boot/xen-syms-4.2.0-rc4-pre...done.
> > (gdb) x/i 0xffff82c48015c9ee
> > 0xffff82c48015c9ee <context_switch+916>:        mov    %edx,%gs
> > (gdb)
>
> I'm not really a gdb expert, so I don't know off the top of my
> head either. I thought I said in a previous reply that people
> generally appear to use the addr2line utility for that purpose.
>
> But the disassembly already tells us where precisely the
> problem is: The selector value (0x0063) attempted to be put
> into %gs is apparently wrong in the context of the current
> GDT. Now, that's GDT_ENTRY_TLS_MIN on the Linux side,
> and ought to be valid. I'm surprised the guest (and the current
> process in it) survives this (as the failure here results in a failsafe
> callback into the guest).
>
> Looking at the Linux side of things, this has been that way
> forever, and I think has always been broken: On x86-64, it
> should also clear %gs here (since 32-bit processes use it for
> their TLS, and there's nothing wrong for a 64-bit process to put
> something in there either), albeit not via loadsegment(), but
> through xen_load_gs_index(). And I neither see why on 32-bit
> it only clears %gs - %fs can as much hold a selector that might
> get invalidated with the TLS descriptor updates. Eduardo,
> Jeremy, Konrad?
>
> Jan
>
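
To see why selector 0x0063 is read as the first TLS slot, it helps to split it into its architectural fields: descriptor index in bits 15..3, table indicator in bit 2, requested privilege level in bits 1..0. The sketch below does just that; decode_selector is an illustrative name, and the remark that GDT_ENTRY_TLS_MIN is 12 on 64-bit Linux is stated from memory and should be checked against the kernel's segment.h.

```python
def decode_selector(selector: int) -> dict:
    """Split an x86 segment selector into index, table and RPL."""
    return {
        "index": selector >> 3,                       # descriptor slot
        "table": "LDT" if selector & 0x4 else "GDT",  # TI bit
        "rpl": selector & 0x3,                        # requested privilege level
    }


# Selector value from the context_switch disassembly discussed above.
print(decode_selector(0x0063))
```

Index 12 in the GDT with RPL 3 is a user-mode selector for descriptor slot 12, which matches the TLS slot named in the message if that constant is indeed 12.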

Sander Eikelenboom

2012-Sep-28 14:08 UTC

Re: Using debug-key 'o: Dump IOMMU p2m table, locks up machine

Thursday, September 20, 2012, 10:08:45 AM, you wrote:

> Ping?

Perhaps changing the subject or starting a new thread altogether for this would
provoke more reaction ?

>>>> On 04.09.12 at 11:26, Jan Beulich wrote:
>>>>> On 04.09.12 at 10:13, Sander Eikelenboom <linux@eikelenboom.it> wrote:
>> > Hmm don't know how to get the file/line, only thing i have found is:
>> >
>> > serveerstertje:/boot# gdb xen-syms-4.2.0-rc4-pre
>> > GNU gdb (GDB) 7.0.1-debian
>> > Copyright (C) 2009 Free Software Foundation, Inc.
>> > License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
>> > This is free software: you are free to change and redistribute it.
>> > There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
>> > and "show warranty" for details.
>> > This GDB was configured as "x86_64-linux-gnu".
>> > For bug reporting instructions, please see:
>> > <http://www.gnu.org/software/gdb/bugs/>...
>> > Reading symbols from /boot/xen-syms-4.2.0-rc4-pre...done.
>> > (gdb) x/i 0xffff82c48015c9ee
>> > 0xffff82c48015c9ee <context_switch+916>:        mov    %edx,%gs
>> > (gdb)
>>
>> I'm not really a gdb expert, so I don't know off the top of my
>> head either. I thought I said in a previous reply that people
>> generally appear to use the addr2line utility for that purpose.
>>
>> But the disassembly already tells us where precisely the
>> problem is: The selector value (0x0063) attempted to be put
>> into %gs is apparently wrong in the context of the current
>> GDT. Now, that's GDT_ENTRY_TLS_MIN on the Linux side,
>> and ought to be valid. I'm surprised the guest (and the current
>> process in it) survives this (as the failure here results in a failsafe
>> callback into the guest).
>>
>> Looking at the Linux side of things, this has been that way
>> forever, and I think has always been broken: On x86-64, it
>> should also clear %gs here (since 32-bit processes use it for
>> their TLS, and there's nothing wrong for a 64-bit process to put
>> something in there either), albeit not via loadsegment(), but
>> through xen_load_gs_index(). And I neither see why on 32-bit
>> it only clears %gs - %fs can as much hold a selector that might
>> get invalidated with the TLS descriptor updates. Eduardo,
>> Jeremy, Konrad?
>>
>> Jan
>>

Jeremy Fitzhardinge

2012-Sep-28 21:26 UTC

Re: Using debug-key 'o: Dump IOMMU p2m table, locks up machine

On 09/20/2012 01:08 AM, Jan Beulich wrote:
> Ping?

I've been meaning to work up a reply, but I haven't had time to swap in
all the context again.

    J

>
>>>> On 04.09.12 at 11:26, Jan Beulich wrote:
>>>>> On 04.09.12 at 10:13, Sander Eikelenboom <linux@eikelenboom.it> wrote:
>>> Hmm don't know how to get the file/line, only thing i have found is:
>>>
>>> serveerstertje:/boot# gdb xen-syms-4.2.0-rc4-pre
>>> GNU gdb (GDB) 7.0.1-debian
>>> Copyright (C) 2009 Free Software Foundation, Inc.
>>> License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
>>> This is free software: you are free to change and redistribute it.
>>> There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
>>> and "show warranty" for details.
>>> This GDB was configured as "x86_64-linux-gnu".
>>> For bug reporting instructions, please see:
>>> <http://www.gnu.org/software/gdb/bugs/>...
>>> Reading symbols from /boot/xen-syms-4.2.0-rc4-pre...done.
>>> (gdb) x/i 0xffff82c48015c9ee
>>> 0xffff82c48015c9ee <context_switch+916>:        mov    %edx,%gs
>>> (gdb)
>> I'm not really a gdb expert, so I don't know off the top of my
>> head either. I thought I said in a previous reply that people
>> generally appear to use the addr2line utility for that purpose.
>>
>> But the disassembly already tells us where precisely the
>> problem is: The selector value (0x0063) attempted to be put
>> into %gs is apparently wrong in the context of the current
>> GDT. Now, that's GDT_ENTRY_TLS_MIN on the Linux side,
>> and ought to be valid. I'm surprised the guest (and the current
>> process in it) survives this (as the failure here results in a failsafe
>> callback into the guest).
>>
>> Looking at the Linux side of things, this has been that way
>> forever, and I think has always been broken: On x86-64, it
>> should also clear %gs here (since 32-bit processes use it for
>> their TLS, and there's nothing wrong for a 64-bit process to put
>> something in there either), albeit not via loadsegment(), but
>> through xen_load_gs_index(). And I neither see why on 32-bit
>> it only clears %gs - %fs can as much hold a selector that might
>> get invalidated with the TLS descriptor updates. Eduardo,
>> Jeremy, Konrad?
>>
>> Jan
>>

Konrad Rzeszutek Wilk

2012-Oct-02 20:08 UTC

Re: Using debug-key 'o: Dump IOMMU p2m table, locks up machine

On Fri, Sep 28, 2012 at 02:26:58PM -0700, Jeremy Fitzhardinge wrote:
> On 09/20/2012 01:08 AM, Jan Beulich wrote:
> > Ping?
>
> I've been meaning to work up a reply, but I haven't had time to swap in
> all the context again.

You would remember most of it. Perhaps that was what was saved at that
point of time and we did not need to restore/save other registers?

>
>     J
>
> >
> >>>> On 04.09.12 at 11:26, Jan Beulich wrote:
> >>>>> On 04.09.12 at 10:13, Sander Eikelenboom <linux@eikelenboom.it> wrote:
> >>> Hmm don't know how to get the file/line, only thing i have found is:
> >>>
> >>> serveerstertje:/boot# gdb xen-syms-4.2.0-rc4-pre
> >>> GNU gdb (GDB) 7.0.1-debian
> >>> Copyright (C) 2009 Free Software Foundation, Inc.
> >>> License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
> >>> This is free software: you are free to change and redistribute it.
> >>> There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
> >>> and "show warranty" for details.
> >>> This GDB was configured as "x86_64-linux-gnu".
> >>> For bug reporting instructions, please see:
> >>> <http://www.gnu.org/software/gdb/bugs/>...
> >>> Reading symbols from /boot/xen-syms-4.2.0-rc4-pre...done.
> >>> (gdb) x/i 0xffff82c48015c9ee
> >>> 0xffff82c48015c9ee <context_switch+916>:        mov    %edx,%gs
> >>> (gdb)
> >> I'm not really a gdb expert, so I don't know off the top of my
> >> head either. I thought I said in a previous reply that people
> >> generally appear to use the addr2line utility for that purpose.
> >>
> >> But the disassembly already tells us where precisely the
> >> problem is: The selector value (0x0063) attempted to be put
> >> into %gs is apparently wrong in the context of the current
> >> GDT. Now, that's GDT_ENTRY_TLS_MIN on the Linux side,
> >> and ought to be valid. I'm surprised the guest (and the current
> >> process in it) survives this (as the failure here results in a failsafe
> >> callback into the guest).
> >>
> >> Looking at the Linux side of things, this has been that way
> >> forever, and I think has always been broken: On x86-64, it
> >> should also clear %gs here (since 32-bit processes use it for
> >> their TLS, and there's nothing wrong for a 64-bit process to put
> >> something in there either), albeit not via loadsegment(), but
> >> through xen_load_gs_index(). And I neither see why on 32-bit
> >> it only clears %gs - %fs can as much hold a selector that might
> >> get invalidated with the TLS descriptor updates. Eduardo,
> >> Jeremy, Konrad?
> >>
> >> Jan
> >>

Konrad Rzeszutek Wilk

2012-Oct-02 20:09 UTC

Re: Using debug-key 'o: Dump IOMMU p2m table, locks up machine

On Tue, Sep 04, 2012 at 10:26:18AM +0100, Jan Beulich wrote:
> >>> On 04.09.12 at 10:13, Sander Eikelenboom <linux@eikelenboom.it> wrote:
> > Hmm don't know how to get the file/line, only thing i have found is:
> >
> > serveerstertje:/boot# gdb xen-syms-4.2.0-rc4-pre
> > GNU gdb (GDB) 7.0.1-debian
> > Copyright (C) 2009 Free Software Foundation, Inc.
> > License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
> > This is free software: you are free to change and redistribute it.
> > There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
> > and "show warranty" for details.
> > This GDB was configured as "x86_64-linux-gnu".
> > For bug reporting instructions, please see:
> > <http://www.gnu.org/software/gdb/bugs/>...
> > Reading symbols from /boot/xen-syms-4.2.0-rc4-pre...done.
> > (gdb) x/i 0xffff82c48015c9ee
> > 0xffff82c48015c9ee <context_switch+916>:        mov    %edx,%gs
> > (gdb)
>
> I'm not really a gdb expert, so I don't know off the top of my
> head either. I thought I said in a previous reply that people
> generally appear to use the addr2line utility for that purpose.
>
> But the disassembly already tells us where precisely the
> problem is: The selector value (0x0063) attempted to be put
> into %gs is apparently wrong in the context of the current
> GDT. Now, that's GDT_ENTRY_TLS_MIN on the Linux side,
> and ought to be valid. I'm surprised the guest (and the current
> process in it) survives this (as the failure here results in a failsafe
> callback into the guest).
>
> Looking at the Linux side of things, this has been that way
> forever, and I think has always been broken: On x86-64, it
> should also clear %gs here (since 32-bit processes use it for
> their TLS, and there's nothing wrong for a 64-bit process to put
> something in there either), albeit not via loadsegment(), but
> through xen_load_gs_index(). And I neither see why on 32-bit
> it only clears %gs - %fs can as much hold a selector that might
> get invalidated with the TLS descriptor updates. Eduardo,
> Jeremy, Konrad?

How is it on the SLES side? Do you set/restore all of the segment
registers?

>
> Jan

Matt Wilson

2012-Oct-02 20:54 UTC

Re: Using debug-key 'o: Dump IOMMU p2m table, locks up machine

On Tue, Sep 04, 2012 at 10:26:18AM +0100, Jan Beulich
wrote:> >>> On 04.09.12 at 10:13, Sander Eikelenboom
<linux@eikelenboom.it> wrote:
> > Hmm don''t know how to get the file/line, only thing i have
found is:
> > 
> > serveerstertje:/boot# gdb xen-syms-4.2.0-rc4-pre
> > GNU gdb (GDB) 7.0.1-debian
> > Copyright (C) 2009 Free Software Foundation, Inc.
> > License GPLv3+: GNU GPL version 3 or later
<http://gnu.org/licenses/gpl.html>
> > This is free software: you are free to change and redistribute it.
> > There is NO WARRANTY, to the extent permitted by law.  Type "show
copying"
> > and "show warranty" for details.
> > This GDB was configured as "x86_64-linux-gnu".
> > For bug reporting instructions, please see:
> > <http://www.gnu.org/software/gdb/bugs/>...
> > Reading symbols from /boot/xen-syms-4.2.0-rc4-pre...done.
> > (gdb) x/i 0xffff82c48015c9ee
> > 0xffff82c48015c9ee <context_switch+916>:        mov    %edx,%gs
> > (gdb)
> 
> I'm not really a gdb expert, so I don't know off the top of my
> head either. I thought I said in a previous reply that people
> generally appear to use the addr2line utility for that purpose.

addr2line works, but "l *0xffff82c48015c9ee" in gdb should as well.

Matt

Jan Beulich

2012-Oct-03 13:12 UTC

Re: Using debug-key 'o: Dump IOMMU p2m table, locks up machine

>>> Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> 10/02/12 10:21 PM >>>
>On Tue, Sep 04, 2012 at 10:26:18AM +0100, Jan Beulich wrote:
>> But the disassembly already tells us where precisely the
>> problem is: The selector value (0x0063) attempted to be put
>> into %gs is apparently wrong in the context of the current
>> GDT. Now, that's GDT_ENTRY_TLS_MIN on the Linux side,
>> and ought to be valid. I'm surprised the guest (and the current
>> process in it) survives this (as the failure here results in a failsafe
>> callback into the guest).
>> 
>> Looking at the Linux side of things, this has been that way
>> forever, and I think has always been broken: On x86-64, it
>> should also clear %gs here (since 32-bit processes use it for
>> their TLS, and there's nothing wrong for a 64-bit process to put
>> something in there either), albeit not via loadsegment(), but
>> through xen_load_gs_index(). And I neither see why on 32-bit
>> it only clears %gs - %fs can as much hold a selector that might
>> get invalidated with the TLS descriptor updates. Eduardo,
>> Jeremy, Konrad?
>
>How is it on the SLES side? Do you set/restore all of the segment
>registers?

I think so (don't have the sources around at home to check), but that's
not the point. What absolutely has to happen is the _clearing_ of the
selector registers before switching descriptor tables (even if done via
multicall, as the hypervisor may restore guest state between any two
pieces of a multicall set).

Jan
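
For anyone following the selector discussion above: an x86 segment
selector holds a descriptor-table index in bits 3-15, a table indicator
(0 = GDT, 1 = LDT) in bit 2, and the requested privilege level in bits
0-1. The sketch below is only an illustration, not code from Xen or
Linux. The names decode_selector and Selector are made up here, and the
value 12 for GDT_ENTRY_TLS_MIN is an assumption about 64-bit Linux
kernels that should be checked against the kernel sources in use. It
decodes the 0x0063 value mentioned in the thread:

```python
from typing import NamedTuple


class Selector(NamedTuple):
    """Decoded fields of a 16-bit x86 segment selector (illustrative type)."""
    index: int   # descriptor-table index, bits 3-15
    table: str   # "GDT" if the TI bit (bit 2) is clear, else "LDT"
    rpl: int     # requested privilege level, bits 0-1


def decode_selector(sel: int) -> Selector:
    """Split a selector into index, table indicator and RPL."""
    if not 0 <= sel <= 0xFFFF:
        raise ValueError("a segment selector is a 16-bit value")
    return Selector(
        index=sel >> 3,
        table="LDT" if sel & 0x4 else "GDT",
        rpl=sel & 0x3,
    )


# Assumption: first TLS slot in the GDT on 64-bit Linux; verify against
# the kernel headers before relying on it.
GDT_ENTRY_TLS_MIN_X86_64 = 12

if __name__ == "__main__":
    decoded = decode_selector(0x0063)
    print(decoded)
    print("matches assumed TLS slot:",
          decoded.index == GDT_ENTRY_TLS_MIN_X86_64)
```

Under that assumption, 0x0063 refers to the first TLS descriptor in the
GDT with RPL 3, which is consistent with the statement that the value
ought to be valid as long as the matching descriptor is in place.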
