Marek Marczykowski
2013-Mar-13 20:50 UTC
High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
Hi,

I still have problems with ACPI(?) on Xen. After some system startups or
resumes the CPU temperature goes high although all domUs (and dom0) are idle.
On a "good" system startup it is about 50-55C, on a "bad" one above 67C (most
of the time above 70C). I've noticed a difference in the C-states reported by
Xen (attached files). On "bad" startups suspend additionally doesn't work -
the system restarts during suspend (I still haven't managed to get console
messages - I don't have a serial port on this system). Note that sometimes
the system boots fine ("good" state), but the problem occurs after some
suspend/resume cycles. Some time ago I got other symptoms: only CPU0 was
used - for all VCPUs (according to xl vcpu-list). Maybe it is related?

Hardware: Dell Latitude E6420
CPU: Intel i5-2520M

Software:
xen stable-4.1 as of 15.02 (last commit: "xen: sched_creadit: improve picking
up the idle CPU for a VCPU"), with the commit "Introduce system_state
variable." reverted. But the same problem occurs on vanilla xen 4.1.2.

Linux 3.7.6 - happens almost every boot. On Linux 3.7.4 it happens much more
rarely (but still occurs).
Kernel config:
http://git.qubes-os.org/gitweb/?p=marmarek/kernel.git;a=blob;f=config-pvops;h=a6e953f71cdc84556571b592b8af87a5a4f9a8d0;hb=HEAD
I've tried to bisect from 3.7.4 to 3.7.6, but without success because the
problem isn't 100% reproducible.

Any ideas?

-- 
Best Regards / Pozdrawiam,
Marek Marczykowski
Invisible Things Lab

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
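The "good" vs "bad" temperatures mentioned here can be watched from dom0 via
sysfs, which reports millidegrees Celsius. A minimal conversion sketch; the
thermal_zone path is an assumption (hwmon layout varies by platform), and a
stand-in file is used so the snippet runs anywhere:

```shell
# On a real system the reading would come from e.g.
# /sys/class/thermal/thermal_zone0/temp, in millidegrees Celsius.
# A stand-in file is used here so the snippet is runnable as-is.
tz=$(mktemp -d)
printf '67000\n' > "$tz/temp"                 # a "bad"-boot reading from this thread
awk '{ printf "%.1f C\n", $1 / 1000 }' "$tz/temp"   # -> 67.0 C
```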
Dario Faggioli
2013-Mar-15 03:00 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
On mer, 2013-03-13 at 21:50 +0100, Marek Marczykowski wrote:
> Hi,
>
> I've still have problems with ACPI(?) on Xen. After some system startup or
> resume CPU temperature goes high although all domUs (and dom0) are idle.
>

Resume? Sorry for going a bit off-topic (or, if you want, for not being able
to help with the issue you're seeing), but does that mean suspend/resume
works for you under Xen?

That would be really nice, as I've never seen it working properly... Is it
me that is missing something? :-O

Actually, now that I think of it, there was a guy at FOSDEM with QubesOS
installed on his laptop telling us suspend was working for him, but I've
never had the chance to try it yet.

Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
Marek Marczykowski
2013-Mar-15 03:22 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
On 15.03.2013 04:00, Dario Faggioli wrote:
> Resume? Sorry for going a bit off-topic (or, if you want, for not being
> able to help with the issue you're seeing), but that means
> suspend/resume works for you under Xen?

Yes, with patches from Konrad's devel/acpi-s3.v10 branch. Actually one of
those patches looks to be already in upstream Linux, but the two remaining
ones still need to be applied.

-- 
Best Regards / Pozdrawiam,
Marek Marczykowski
Invisible Things Lab
Konrad Rzeszutek Wilk
2013-Mar-15 13:02 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
On Wed, Mar 13, 2013 at 09:50:39PM +0100, Marek Marczykowski wrote:
> Hi,
>
> I've still have problems with ACPI(?) on Xen. After some system startup or
> resume CPU temperature goes high although all domUs (and dom0) are idle. On
> "good" system startup it is about 50-55C, on "bad" - above 67C (most time
> above 70C). I've noticed difference in C-states reported by Xen (attached
> files).
.. snip ..
> Any ideas?

That C-states difference is important. The SYSIO part on your box means that
the CPU ends up doing an MWAIT. A HALT, on the other hand, is not so
power-saving friendly.

Looking at this:
> (XEN) no cpu_id for acpi_id 5
> (XEN) no cpu_id for acpi_id 6
> (XEN) no cpu_id for acpi_id 7
> (XEN) no cpu_id for acpi_id 8

.. it means that xen-acpi-processor was trying to probe for the ACPI IDs of
the other CPUs that the machine theoretically can support. That means it got
the ACPI information for the first four CPUs (which is good).

As a first step in trying to figure this out, you can add #define DEBUG 1 in
xen-acpi-processor.c right before any of the #includes, and also boot Xen
with 'cpufreq=verbose'. That should tell you what kind of C-states
xen-acpi-processor uploaded (and whether it did so for all of the vCPUs).

If both bootups show that we do upload the C-states for all the CPUs, but
they vary, that means digging a bit deeper into the ACPI code - specifically
into acpi_processor_get_power_info_cst, to see whether it hits any of the
'continue' statements.

Then I would say also take the DSDT for both bootups and compare them. It
might be that the BIOS is using a scratch register at reboot to construct the
C-states and somehow it ends up being corrupted, which means that on the next
warm reboot the C-states have bogus data. This does show up in the field :-(
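The two steps above can be sketched in shell. This is a hedged illustration
only: the sed runs against a scratch copy rather than a real kernel tree
(where the file lives under drivers/xen/), and the boot-loader line is shown
as a comment:

```shell
# Step 1: prepend '#define DEBUG 1' before any #include in
# xen-acpi-processor.c. A stand-in copy is created so this runs anywhere.
work=$(mktemp -d)
src="$work/xen-acpi-processor.c"
printf '#include <linux/kernel.h>\n' > "$src"     # stand-in file contents
sed -i '1i #define DEBUG 1' "$src"                # GNU sed syntax
head -n 1 "$src"                                  # -> #define DEBUG 1

# Step 2: boot Xen with verbose cpufreq logging, e.g. on the hypervisor
# command line in the boot loader entry:
#   /boot/xen.gz cpufreq=verbose ...
```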
Marek Marczykowski
2013-Mar-22 15:34 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
On 15.03.2013 14:02, Konrad Rzeszutek Wilk wrote:
> On Wed, Mar 13, 2013 at 09:50:39PM +0100, Marek Marczykowski wrote:
.. snip ..
> You can as the first step in trying to figure this out, add #define DEBUG 1
> in xen-acpi-processor.c right before any of the #includes. And also boot
> Xen with 'cpufreq=verbose'. That should tell you what kind of C-states the
> xen-acpi-processor uploaded (And if it did it for all of the vCPUS).
>
> If both bootups show that we do upload the C-states for all the CPUs but they
> vary that means digging a bit deeper in the ACPI code. Specifically in
> acpi_processor_get_power_info_cst and seeing if it hits any of the 'continue'.
>
> Then I would say take also the DSDT for both bootups and compare them. It might
> be that the BIOS is using a scratch register at reboot to construct the C-states
> and somehow it ends up being corrupted. Which means that on the next warm reboot
> the C-states has bogus data. This does show up in the field :-(

Finally I've found some time for further debugging of this. And it looks like
a deeper ACPI code problem...

I've switched to 3.8.4, on which the problem is much easier to reproduce
(almost every startup).

On a bad bootup, xen-acpi-processor didn't find any C-state: for each CPU,
_pr->flags.power and _pr->power.count were 0 (but flags.power_setup_done=1).
In this case suspend (or shutdown) always ends up with a reset.

On a good one, xen-acpi-processor got C1-C3 states for each CPU, and suspend
succeeded, but after resume CPU0 had C1-C3 while the others had only C1.
Reloading xen-acpi-processor (rmmod -f ...) fixes this (according to
xl debug-key c), but the temperature still stays high. Regardless of
xen-acpi-processor reloading, the next suspend always fails.

Not sure how C-states can be related to S3 suspend, but perhaps something
more general with ACPI is wrong?

Each time the DSDT (taken from /sys/firmware/acpi/tables) is exactly the same.

-- 
Best Regards / Pozdrawiam,
Marek Marczykowski
Invisible Things Lab
Konrad Rzeszutek Wilk
2013-Mar-22 16:56 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
On Fri, Mar 22, 2013 at 04:34:11PM +0100, Marek Marczykowski wrote:
> Finally I've found some time for further debugging this. And it looks like
> some deeper ACPI code problem...
>
> I've switched to 3.8.4, on which problem is much easier to reproduce (almost
> every startup).
>
> On bad bootup, xen-acpi-processor didn't found any C-state: for each CPU
> _pr->flags.power and _pr->power.count was 0 (but flags.power_setup_done=1). In
> this case suspend (or shutdown) always ends up with reset.

Is this you booting the machine from a cold state or a warm one?

There are some BIOSes out there that I know use the scratchpad registers in
the IOH (so depending on the platform that can be 0:0e.1, Reg 0x84). If Xen
or Linux touch it, then the P-states and C-states that the BIOS generates
are buggy.

But that is not the case here - you are saying that after disassembling the
DSDT (so cat /sys/firmware/acpi/tables/DSDT, or SSDT*, and then iasl -d on
them), the _PSD, _PSS, and _PCT look the same?

You could also look at the FACP table and see if they are different.

> On good one xen-acpi-processor got C1-C3 states for each CPU, then suspend
> succeeded, but after resume CPU0 had C1-C3, but others only C1. Reloading
> xen-acpi-processor (rmmod -f...) fixes this (according to xl debug-key c), but
> still temperature keep high. Regardless of xen-acpi-processor reloading, next
> suspend always fails.

If you reload and look at the runqueues, are all of them using the ACPI
idler or the default one?

> Not sure how C-states can be related to S3 suspend, but perhaps something more
> general with ACPI is wrong?

This reminds me of something. I recall seeing something like this a long,
long time ago... Completely forgot about it until now. The difference was
whether Xen's cpu_idle was running a) the acpi_idle (so using the different
C-states), or b) the default one (so just using HLT).

With b), during resume it would get half-way through
(http://darnok.org/xen/devel.acpi-s3.v1.serial.log) while with a) it would
actually continue on - http://darnok.org/xen/devel.acpi-s3.v0.serial.log

This was on some MSI MS-7680/H61M-P23 (MS-7680) motherboard.

Oh look: http://lists.xen.org/archives/html/xen-devel/2011-06/msg02059.html

And it looks like Kevin's recommendation was to use the a) case with
max_cstates=1 to narrow it down.
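The table comparison described above can be done mechanically. A sketch; the
good/ and bad/ directories and their contents are stand-ins so the snippet is
runnable as-is (on the real machine you would copy /sys/firmware/acpi/tables/*
on each boot, and iasl comes from the ACPICA tools):

```shell
# Compare ACPI tables saved from a "good" boot and a "bad" boot.
work=$(mktemp -d)
mkdir -p "$work/good" "$work/bad"
printf 'DSDT-bytes' > "$work/good/DSDT"    # stand-ins for:
printf 'DSDT-bytes' > "$work/bad/DSDT"     #   cp /sys/firmware/acpi/tables/DSDT good/
if cmp -s "$work/good/DSDT" "$work/bad/DSDT"; then
    echo "DSDT identical"
else
    # Only then is disassembly interesting:
    #   iasl -d DSDT && diff good/DSDT.dsl bad/DSDT.dsl   (check _PSD/_PSS/_PCT)
    echo "DSDT differs"
fi
```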
Marek Marczykowski
2013-Mar-25 11:36 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
On 22.03.2013 17:56, Konrad Rzeszutek Wilk wrote:
> On Fri, Mar 22, 2013 at 04:34:11PM +0100, Marek Marczykowski wrote:
>> On bad bootup, xen-acpi-processor didn't found any C-state: for each CPU
>> _pr->flags.power and _pr->power.count was 0 (but flags.power_setup_done=1). In
>> this case suspend (or shutdown) always ends up with reset.
>
> This is you booting the machine from a cold-state or a warm one?

Doesn't matter - in both cases the same result.

> There are some BIOSes out there that I know that use the scratchpad registers in
> IOH (so depending on the platform that can be 0:0e.1, Reg 0x84). If Xen or Linux
> touch it then the P-states and C-states that the BIOS generates are buggy.
>
> But that is not the case here - you are saying that the DSDT after disassembling
> (so cat /sys/firmware/acpi/tables/DSDT, or SSDT* and the iasl -d on them), the
> _PSD, _PSS, and _PCT look the same?

The binary versions are the same, so I assume the disassembled ones are too.
I've copied the full /sys/firmware/acpi/tables at several startups and in all
cases (both cold and warm startups) they were all the same. If I ever notice
a difference, I will check the disassembled versions.

> You could also look at the FACP table and see if they are different.
>
> If you reload, and look at the runqueues, are all of them using the ACPI
> idler or the default one?

The ACPI one (both before the reload and after).

> This reminds me of something. I recall a long long time ago seeing something like this....
.. snip ..
> And it looks Kevin's recommendation was use the a) case with max_cstates=1
> to narrow it down.

When default_idle is used, resume doesn't work at all (even the first one).
Details:

(1) With max_cstates=1, without the xen-acpi-processor module: default_idle
used. Suspend succeeds, but resume always hangs.

(2) With max_cstate=1, with the xen-acpi-processor module loaded: acpi_idle
used. Suspend succeeds, resume too, but after resume the problem described
above exists (high temperature, C2-C3 states only present on CPU0, and
subsequent suspends always end up with a reboot).

(3) Without max_cstate=1, with the xen-acpi-processor module loaded: same
as (2).

(4) Without max_cstate=1, without the xen-acpi-processor module loaded: same
as (1).

One more observation: when Xen is compiled with debug=y, cases (2) and (4)
behave the same as (1).

Hopefully I will have a real serial console sometime this week and will be
able to get more details from the hang and reboot cases.

BTW, any chance of the Xen ACPI S3 patches landing in the upstream kernel?

-- 
Best Regards / Pozdrawiam,
Marek Marczykowski
Invisible Things Lab
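The option combinations enumerated above are toggled on the hypervisor line
of the boot loader entry. An illustrative GRUB legacy style fragment; the
paths are assumptions, and note the thread itself mixes the max_cstate and
max_cstates spellings, so check the command-line documentation for your Xen
version:

```
title Xen (C-state debugging)
    kernel /boot/xen.gz max_cstates=1 cpufreq=verbose
    module /boot/vmlinuz-3.8.4 console=hvc0
    module /boot/initrd-3.8.4.img
```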
Konrad Rzeszutek Wilk
2013-Mar-25 14:17 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
On Mon, Mar 25, 2013 at 12:36:31PM +0100, Marek Marczykowski wrote:
> On 22.03.2013 17:56, Konrad Rzeszutek Wilk wrote:
>> This is you booting the machine from a cold-state or a warm one?
>
> Doesn't matter - in both cases the same result.
>
>> But that is not the case here - you are saying that the DSDT after disassembling
>> (so cat /sys/firmware/acpi/tables/DSDT, or SSDT* and the iasl -d on them), the
>> _PSD, _PSS, and _PCT look the same?
>
> Binary versions are the same so assume disassembled also. I've copied full
> /sys/firmware/acpi/tables at some startups and in all cases (both cold and
> warm startups) all were the same.

<sigh> I was hoping it was something as simple as that :-)

.. snip ..

> When default_idle used, resume doesn't work at all (even the first one). Details:
> (1) With max_cstates=1, without xen-acpi-processor module: default_idle used.
> Suspend succeed, but always hang at resume.

AHA! So the bug persists.

> (2) With max_cstate=1, with xen-acpi-processor module loaded: acpi_idle used.
> Suspend succeed, resume also, but after resume above problem exists (high
> temperature, C2-C3 states only present on CPU0, subsequent suspends always
> ends up with reboot).
>
> (3) Without max_cstate=1, with xen-acpi-processor module loaded: same as (2).
>
> (4) Without max_cstate=1, without xen-acpi-processor module loaded: same as (1).
>
> One more observation: when xen compiled with debug=y, (2) and (4) cases
> behaves the same as (1).

Oh, that is something new.

> Hopefully I will have real serial console somehow in this week and will be
> able to get more details from hang and reboot cases.
>
> BTW Any chances for Xen ACPI S3 patches in upstream kernel?

<sigh> Now that the regression storm of v3.9 has subsided I should have some
breathing room to address that.
Marek Marczykowski
2013-Mar-25 14:56 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
On 25.03.2013 15:17, Konrad Rzeszutek Wilk wrote:
> On Mon, Mar 25, 2013 at 12:36:31PM +0100, Marek Marczykowski wrote:
>> When default_idle used, resume doesn't work at all (even the first one). Details:
>> (1) With max_cstates=1, without xen-acpi-processor module: default_idle used.
>> Suspend succeed, but always hang at resume.
>
> AHA! So the bug persist.
>
>> One more observation: when xen compiled with debug=y, (2) and (4) cases
>> behaves the same as (1).
>
> Oh, that is something new.

I've also tried some (automated :)) bisection on Xen from 4.1.2 to 4.1.4,
but unfortunately the results weren't deterministic... My script doesn't
distinguish between the different symptoms (reboot at suspend, hang at
resume, incomplete C-states after resume, etc.), which may be the reason for
the non-deterministic results... One time I got this commit as the first bad
one:

commit 329d4280255ff44300913f24119f52d3459c1ed0
Author: Jan Beulich <jbeulich@suse.com>
Date:   Tue Apr 17 08:33:33 2012 +0100

    XENPF_set_processor_pminfo XEN_PM_CX overflows states array

Maybe related?

>> Hopefully I will have real serial console somehow in this week and will be
>> able to get more details from hang and reboot cases.
>>
>> BTW Any chances for Xen ACPI S3 patches in upstream kernel?
>
> <sigh> Now that the regression storm of v3.9 has subsided I should have
> some breathing room to address that.

I keep my fingers crossed.

-- 
Best Regards / Pozdrawiam,
Marek Marczykowski
Invisible Things Lab
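Scripted bisection of this kind can be driven with git bisect run, which is
only as reliable as the good/bad test itself. A self-contained toy demo; the
repository, commits, and test predicate below are all synthetic stand-ins for
a real build-boot-and-check-suspend script (whose flaky verdicts are exactly
what makes bisect converge on the wrong commit):

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.email bisect@example.com     # hypothetical identity for the demo
git config user.name  bisect-demo
for i in 1 2 3 4 5; do
    echo "$i" > f
    git add f
    git commit -qm "commit $i"
done
# HEAD ("commit 5") is bad, HEAD~4 ("commit 1") is good.
git bisect start HEAD HEAD~4 > /dev/null
# The "test" script: exit 0 (good) while f < 3, non-zero (bad) from "commit 3"
# on. A real script would build xen, boot it, try a suspend, and report.
first_bad=$(git bisect run sh -c 'test "$(cat f)" -lt 3' \
            | sed -n 's/ is the first bad commit$//p')
echo "first bad commit: $first_bad"
```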
Marek Marczykowski
2013-Mar-26 12:17 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
On 25.03.2013 15:17, Konrad Rzeszutek Wilk wrote:
> On Mon, Mar 25, 2013 at 12:36:31PM +0100, Marek Marczykowski wrote:
>> One more observation: when xen compiled with debug=y, (2) and (4) cases
>> behaves the same as (1).
>
> Oh, that is something new.

Finally got a serial console :)
The debug=y problem is (actually at resume):

(XEN) Assertion 'test_bit(vector, cfg->used_vectors)' failed at io_apic.c:542
(XEN) ----[ Xen-4.1.5-rc1  x86_64  debug=y  Tainted:    C ]----
(XEN) CPU:    0
(XEN) RIP:    e008:[<ffff82c48015e288>] smp_irq_move_cleanup_interrupt+0x1c3/0x23d
(XEN) RFLAGS: 0000000000010046   CONTEXT: hypervisor
(XEN) rax: 0000000000000000   rbx: 00000000000000e9   rcx: ffff82c48029ff18
(XEN) rdx: 00000000000000e9   rsi: 000000000000002a   rdi: ffff830421060538
(XEN) rbp: ffff82c48029ff08   rsp: ffff82c48029feb8   r8:  ffff88041820eb60
(XEN) r9:  0000000000000000   r10: 0000000000007ff0   r11: 0000000000000000
(XEN) r12: ffff830421080250   r13: ffff830421060534   r14: ffff82c48029ff18
(XEN) r15: ffff82c4802dd9e0   cr0: 000000008005003b   cr4: 00000000000026f0
(XEN) cr3: 0000000300b81000   cr2: ffff880402070198
(XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: 0000   cs: e008
(XEN) Xen stack trace from rsp=ffff82c48029feb8:
(XEN)    0000000000000000 000000000000e030 ffff82c48029ff18 ffff82c4802dd9e0
(XEN)    ffff8802cac3c7c0 00000000ffff3729 00000000ffff3729 000000013fff3728
(XEN)    ffffffff81b907c0 00000000ffff3729 00007d3b7fd600c7 ffff82c48014de60
(XEN)    00000000ffff3729 ffffffff81b907c0 000000013fff3728 00000000ffff3729
(XEN)    ffffffff81a01e18 00000000ffff3729 0000000000000000 0000000000007ff0
(XEN)    0000000000000000 ffff88041820eb60 ffff8803fd1820a8 ffffffff81b90a88
(XEN)    000000000000002a 000000000000002a 00000000ffff372a 0000002000000000
(XEN)    ffffffff8105dd5a 000000000000e033 0000000000000246 ffffffff81a01db8
(XEN)    000000000000e02b 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 ffff8300ca9a0000 0000000000000000
(XEN)    0000000000000000
(XEN) Xen call trace:
(XEN)    [<ffff82c48015e288>] smp_irq_move_cleanup_interrupt+0x1c3/0x23d
(XEN)
(XEN)
(XEN) ****************************************
(XEN) Panic on CPU 0:
(XEN) Assertion 'test_bit(vector, cfg->used_vectors)' failed at io_apic.c:542
(XEN) ****************************************

-- 
Best Regards / Pozdrawiam,
Marek Marczykowski
Invisible Things Lab
Jan Beulich
2013-Mar-26 13:11 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
>>> On 26.03.13 at 13:17, Marek Marczykowski <marmarek@invisiblethingslab.com> wrote:
> Finally got serial console :)
> The debug=y problem is (actually at resume):
> (XEN) Assertion 'test_bit(vector, cfg->used_vectors)' failed at io_apic.c:542
> (XEN) ----[ Xen-4.1.5-rc1  x86_64  debug=y  Tainted:    C ]----
> (XEN) CPU:    0
> (XEN) RIP:    e008:[<ffff82c48015e288>] smp_irq_move_cleanup_interrupt+0x1c3/0x23d
> (XEN) rax: 0000000000000000   rbx: 00000000000000e9   rcx: ffff82c48029ff18
> (XEN) rdx: 00000000000000e9   rsi: 000000000000002a   rdi: ffff830421060538
.. snip ..
> (XEN) Panic on CPU 0:
> (XEN) Assertion 'test_bit(vector, cfg->used_vectors)' failed at io_apic.c:542
> (XEN) ****************************************

To make sense of this, we need to know the register (and maybe stack)
allocation at this point, to know which vector it was that triggered the
assertion. You can either do this analysis for us, or point us at the
xen-syms binary matching the xen.gz you used.

From the register values, the most likely candidates are vectors 0xe9 and
0x2a. The former, having two registers set to this value, seems more likely
from that angle, but vectors in the 0xe? range should never end up in
smp_irq_move_cleanup_interrupt(). And if it's the 0x2a one, then we'd need to
know which IRQ it was last used for. That can't be reconstructed from the
data above, so it would require you being able to reproduce this and adding
some instrumentation to the code.

Jan
Marek Marczykowski
2013-Mar-26 13:50 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
On 26.03.2013 14:11, Jan Beulich wrote:
>>>> On 26.03.13 at 13:17, Marek Marczykowski <marmarek@invisiblethingslab.com> wrote:
>> Finally got serial console :)
>> The debug=y problem is (actually at resume):
>> (XEN) Assertion 'test_bit(vector, cfg->used_vectors)' failed at io_apic.c:542
>> [register and stack dump snipped -- see the first message]
>
> To make sense of this, we need to know the register (and maybe
> stack) allocation at this point, to know which vector it was that
> triggered the assertion. You can either do this analysis for us, or
> point us at the xen-syms binary matching the xen.gz you used.

"info scope smp_irq_move_cleanup_interrupt" said vector is in %rbx, so 0xe9.

> From the register values, the most likely candidates are vector 0xe9
> and 0x2a. The former having two registers set to this value seems
> more likely from that angle, but vectors in the 0xe? range should
> never end up in smp_irq_move_cleanup_interrupt().
>
> And if it's the 0x2a one, then we'd need to know what IRQ it was
> last used for. That can't be reconstructed from the data above, so
> would require you being able to reproduce this and adding some
> instrumentation to the code.
>
> Jan

-- 
Best Regards / Pozdrawiam,
Marek Marczykowski
Invisible Things Lab
Andrew Cooper
2013-Mar-26 15:47 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
On 26/03/2013 13:50, Marek Marczykowski wrote:
> On 26.03.2013 14:11, Jan Beulich wrote:
>>>>> On 26.03.13 at 13:17, Marek Marczykowski <marmarek@invisiblethingslab.com> wrote:
>>> Finally got serial console :)
>>> The debug=y problem is (actually at resume):
>>> (XEN) Assertion 'test_bit(vector, cfg->used_vectors)' failed at io_apic.c:542
>>> [register and stack dump snipped -- see the first message]
>> To make sense of this, we need to know the register (and maybe
>> stack) allocation at this point, to know which vector it was that
>> triggered the assertion. You can either do this analysis for us, or
>> point us at the xen-syms binary matching the xen.gz you used.
> "info scope smp_irq_move_cleanup_interrupt" said vector is in %rbx, so 0xe9.

Could it be something to do with switching virtual wire mode, and having
PIC compatibility stuff left in the IO-APIC after leaving the BIOS but
before starting back up again?

Looking at the stack dump, there is an extra exception frame under what
is printed by the assertion failure.

0000002000000000 TRAP_syscall
ffffffff81a01db8 guest kernel addr
0000000000000246 FLAGS
000000000000e033 FLAT_RING3_CS64
ffffffff8105dd5a guest kernel addr
000000000000e02b FLAT_RING3_SS{64,32}

So it appears that we are already executing a guest (presumably dom0) by
the time this assertion occurs. From the serial, is there any indication
that dom0 has started up again?

I would have thought that we should have successfully reset the IO-APIC
back up properly before we would ever get back around to executing dom0.

~Andrew
Jan Beulich
2013-Mar-26 16:03 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
>>> On 26.03.13 at 14:50, Marek Marczykowski <marmarek@invisiblethingslab.com> wrote:
> On 26.03.2013 14:11, Jan Beulich wrote:
>>>>> On 26.03.13 at 13:17, Marek Marczykowski <marmarek@invisiblethingslab.com> wrote:
>>> Finally got serial console :)
>>> The debug=y problem is (actually at resume):
>>> (XEN) Assertion 'test_bit(vector, cfg->used_vectors)' failed at io_apic.c:542
>>> [register and stack dump snipped -- see the first message]
>>
>> To make sense of this, we need to know the register (and maybe
>> stack) allocation at this point, to know which vector it was that
>> triggered the assertion. You can either do this analysis for us, or
>> point us at the xen-syms binary matching the xen.gz you used.
>
> "info scope smp_irq_move_cleanup_interrupt" said vector is in %rbx, so 0xe9.

And that system isn't using a strange mixed mode IO-APIC/legacy
PIC model, where particularly IRQ 9 (usually ACPI SCI) gets
channeled through the legacy PIC?

Could you attach the complete log, ideally with 'i' output logged
right before suspending?

Is this reproducible with 4.2.x or 4.3-unstable? If not, but if readily
reproducible with 4.1.5-rc1, could you try changing the containing
loop's upper bound from "< NR_VECTORS" to "<= LAST_DYNAMIC_VECTOR"?

Jan
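For reference, the change being suggested is confined to the vector-scan loop in smp_irq_move_cleanup_interrupt(). A sketch of the patch, reconstructed from memory of the 4.1 code (the loop's lower bound and surrounding context are assumptions; only the upper-bound change is what is being asked for):

```diff
-    for ( vector = FIRST_DYNAMIC_VECTOR; vector < NR_VECTORS; vector++ )
+    for ( vector = FIRST_DYNAMIC_VECTOR; vector <= LAST_DYNAMIC_VECTOR; vector++ )
```

The effect would be that the cleanup IPI handler never examines vectors above the dynamically allocatable range, so a stray high vector such as 0xe9 could no longer trip the ASSERT() there.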
Andrew Cooper
2013-Mar-26 16:12 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
On 26/03/2013 15:47, Andrew Cooper wrote:
> On 26/03/2013 13:50, Marek Marczykowski wrote:
>> On 26.03.2013 14:11, Jan Beulich wrote:
>>>>>> On 26.03.13 at 13:17, Marek Marczykowski <marmarek@invisiblethingslab.com> wrote:
>>>> Finally got serial console :)
>>>> The debug=y problem is (actually at resume):
>>>> (XEN) Assertion 'test_bit(vector, cfg->used_vectors)' failed at io_apic.c:542
>>>> [register and stack dump snipped -- see the first message]
>> Could it be something to do with switching virtual wire mode, and having
>> PIC compatibility stuff left in the IO-APIC after leaving the BIOS but
>> before starting back up again?
>>
>> Looking at the stack dump, there is an extra exception frame under what
>> is printed by the assertion failure.
>>
>> 0000002000000000 TRAP_syscall

Apologies - this is a vector 0x20 interrupt, not TRAP_syscall, which
makes sense as 0x20 is FIRST_DYNAMIC_IRQ which is also the cleanup IPI
vector.

The other comments still stand, especially as we appear to be
interrupting dom0 which is already running.

~Andrew

>> ffffffff81a01db8 guest kernel addr
>> 0000000000000246 FLAGS
>> 000000000000e033 FLAT_RING3_CS64
>> ffffffff8105dd5a guest kernel addr
>> 000000000000e02b FLAT_RING3_SS{64,32}
>>
>> So it appears that we are already executing a guest (presumably dom0) by
>> the time this assertion occurs. From the serial, is there any indication
>> that dom0 has started up again?
>>
>> I would have thought that we should have successfully reset the IO-APIC
>> back up properly before we would ever get back around to executing dom0.
>>
>> ~Andrew
Marek Marczykowski
2013-Mar-26 16:45 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
On 26.03.2013 17:03, Jan Beulich wrote:
>>>> On 26.03.13 at 14:50, Marek Marczykowski <marmarek@invisiblethingslab.com> wrote:
>> [earlier quoting, including the register and stack dump, snipped]
>>
>> "info scope smp_irq_move_cleanup_interrupt" said vector is in %rbx, so 0xe9.
>
> And that system isn't using a strange mixed mode IO-APIC/legacy
> PIC model, where particularly IRQ 9 (usually ACPI SCI) gets
> channeled through the legacy PIC?

I don't know...

> Could you attach the complete log, ideally with 'i' output logged
> right before suspending?

Sure, attached.

> Is this reproducible with 4.2.x or 4.3-unstable? If not, but if readily
> reproducible with 4.1.5-rc1, could you try changing the containing
> loop's upper bound from "< NR_VECTORS" to "<= LAST_DYNAMIC_VECTOR"?

I've tried 4.2.x some time ago and the bug also exists there (but I had no
console, so I am not sure it is exactly the same). 4.3 seems to be not affected.

-- 
Best Regards / Pozdrawiam,
Marek Marczykowski
Invisible Things Lab
Marek Marczykowski
2013-Mar-26 16:47 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
On 26.03.2013 17:12, Andrew Cooper wrote:
> On 26/03/2013 15:47, Andrew Cooper wrote:
>> [earlier quoting, including the register and stack dump, snipped]
>>
>> Looking at the stack dump, there is an extra exception frame under what
>> is printed by the assertion failure.
>>
>> 0000002000000000 TRAP_syscall
>
> Apologies - this is a vector 0x20 interrupt, not TRAP_syscall, which
> makes sense as 0x20 is FIRST_DYNAMIC_IRQ which is also the cleanup IPI
> vector.
>
> The other comments still stand, especially as we appear to be
> interrupting dom0 which is already running.

Indeed, dom0 is running at this stage (see log in my second email).

> ~Andrew
>
>> So it appears that we are already executing a guest (presumably dom0) by
>> the time this assertion occurs. From the serial, is there any indication
>> that dom0 has started up again?
>>
>> I would have thought that we should have successfully reset the IO-APIC
>> back up properly before we would ever get back around to executing dom0.

-- 
Best Regards / Pozdrawiam,
Marek Marczykowski
Invisible Things Lab
Andrew Cooper
2013-Mar-26 17:02 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
On 26/03/2013 16:45, Marek Marczykowski wrote:
> On 26.03.2013 17:03, Jan Beulich wrote:
>> [earlier quoting, including the register and stack dump, snipped]
>>
>> And that system isn't using a strange mixed mode IO-APIC/legacy
>> PIC model, where particularly IRQ 9 (usually ACPI SCI) gets
>> channeled through the legacy PIC?
> I don't know...
>
>> Could you attach the complete log, ideally with 'i' output logged
>> right before suspending?
> Sure, attached.
>
>> Is this reproducible with 4.2.x or 4.3-unstable? If not, but if readily
>> reproducible with 4.1.5-rc1, could you try changing the containing
>> loop's upper bound from "< NR_VECTORS" to "<= LAST_DYNAMIC_VECTOR"?
> I've tried 4.2.x some time ago and the bug also exists there (but I had no
> console, so not sure if exactly the same). 4.3 seems to be not affected.

Can you replace the ASSERT() with code similar to that in

http://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=xen/arch/x86/irq.c;h=5e0f463c381750090373dabd8967635bc297d457;hb=refs/heads/staging#l668

which should call dump_irqs() before dying because of the ASSERT.

You might need to also take the latest version of dump_irqs() from
unstable, as I seem to remember there was another assertion failure due
to xfree()'ing in IRQ context.

~Andrew
Marek Marczykowski
2013-Mar-26 17:42 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
On 26.03.2013 18:02, Andrew Cooper wrote:> On 26/03/2013 16:45, Marek Marczykowski wrote: >> On 26.03.2013 17:03, Jan Beulich wrote: >>>>>> On 26.03.13 at 14:50, Marek Marczykowski <marmarek@invisiblethingslab.com> >>> wrote: >>>> On 26.03.2013 14:11, Jan Beulich wrote: >>>>>>>> On 26.03.13 at 13:17, Marek Marczykowski <marmarek@invisiblethingslab.com> >>>> wrote: >>>>>> Finally got serial console :) >>>>>> The debug=y problem is (actually at resume): >>>>>> (XEN) Assertion ''test_bit(vector, cfg->used_vectors)'' failed at io_apic.c:542 >>>>>> (XEN) ----[ Xen-4.1.5-rc1 x86_64 debug=y Tainted: C ]---- >>>>>> (XEN) CPU: 0 >>>>>> (XEN) RIP: e008:[<ffff82c48015e288>] >>>>>> smp_irq_move_cleanup_interrupt+0x1c3/0x23d >>>>>> (XEN) RFLAGS: 0000000000010046 CONTEXT: hypervisor >>>>>> (XEN) rax: 0000000000000000 rbx: 00000000000000e9 rcx: ffff82c48029ff18 >>>>>> (XEN) rdx: 00000000000000e9 rsi: 000000000000002a rdi: ffff830421060538 >>>>>> (XEN) rbp: ffff82c48029ff08 rsp: ffff82c48029feb8 r8: ffff88041820eb60 >>>>>> (XEN) r9: 0000000000000000 r10: 0000000000007ff0 r11: 0000000000000000 >>>>>> (XEN) r12: ffff830421080250 r13: ffff830421060534 r14: ffff82c48029ff18 >>>>>> (XEN) r15: ffff82c4802dd9e0 cr0: 000000008005003b cr4: 00000000000026f0 >>>>>> (XEN) cr3: 0000000300b81000 cr2: ffff880402070198 >>>>>> (XEN) ds: 0000 es: 0000 fs: 0000 gs: 0000 ss: 0000 cs: e008 >>>>>> (XEN) Xen stack trace from rsp=ffff82c48029feb8: >>>>>> (XEN) 0000000000000000 000000000000e030 ffff82c48029ff18 ffff82c4802dd9e0 >>>>>> (XEN) ffff8802cac3c7c0 00000000ffff3729 00000000ffff3729 000000013fff3728 >>>>>> (XEN) ffffffff81b907c0 00000000ffff3729 00007d3b7fd600c7 ffff82c48014de60 >>>>>> (XEN) 00000000ffff3729 ffffffff81b907c0 000000013fff3728 00000000ffff3729 >>>>>> (XEN) ffffffff81a01e18 00000000ffff3729 0000000000000000 0000000000007ff0 >>>>>> (XEN) 0000000000000000 ffff88041820eb60 ffff8803fd1820a8 ffffffff81b90a88 >>>>>> (XEN) 000000000000002a 000000000000002a 00000000ffff372a 0000002000000000 
>>>>>> (XEN) ffffffff8105dd5a 000000000000e033 0000000000000246 ffffffff81a01db8 >>>>>> (XEN) 000000000000e02b 0000000000000000 0000000000000000 0000000000000000 >>>>>> (XEN) 0000000000000000 0000000000000000 ffff8300ca9a0000 0000000000000000 >>>>>> (XEN) 0000000000000000 >>>>>> (XEN) Xen call trace: >>>>>> (XEN) [<ffff82c48015e288>] smp_irq_move_cleanup_interrupt+0x1c3/0x23d >>>>>> (XEN) >>>>>> (XEN) >>>>>> (XEN) **************************************** >>>>>> (XEN) Panic on CPU 0: >>>>>> (XEN) Assertion ''test_bit(vector, cfg->used_vectors)'' failed at io_apic.c:542 >>>>>> (XEN) **************************************** >>>>> To make sense of this, we need to know the register (and maybe >>>>> stack) allocation at this point, to know which vector it was that >>>>> triggered the assertion. You can either do this analysis for us, or >>>>> point us at the xen-syms binary matching the xen.gz you used. >>>> "info scope smp_irq_move_cleanup_interrupt" said vector is in %rbx, so 0xe9. >>> And that system isn''t using a strange mixed mode IO-APIC/legacy >>> PIC model, where particularly IRQ 9 (usually ACPI SCI) gets >>> channeled through the legacy PIC? >> I don''t know... >> >>> Could you attach the complete log, ideally with ''i'' output logged >>> right before suspending? >> Sure, attached. >> >>> Is this reproducible with 4.2.x or 4.3-unstable? If not, but if readily >>> reproducible with 4.1.5-rc1, could you try changing the containing >>> loop''s upper bound from "< NR_VECTORS" to >>> "<= LAST_DYNAMIC_VECTOR"? >> I''ve tried 4.2.x some time ago and bug also exists there (but I had not >> console, so not sure if exactly the same). 4.3 seems to be not affected.Checked 4.2 and indeed also assert() in similar place. 
If anyone interested, log here:
http://duch.mimuw.edu.pl/~marmarek/qubes/console-4.2-failed-resume.log

> Can you replace the ASSERT() with code similar to that in
>
> http://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=xen/arch/x86/irq.c;h=5e0f463c381750090373dabd8967635bc297d457;hb=refs/heads/staging#l668
>
> Which should call dump_irqs() in before dying because of the ASSERT.
> You might need to also take the latest version of dump_irqs() from
> unstable, as I seem to remember there was another assertion failure due
> to xfree()'ing in IRQ context.

Full log here:
http://duch.mimuw.edu.pl/~marmarek/qubes/xen-4.1-failed-resume-dump-irqs.log

Interesting part:
(XEN) *** IRQ BUG found ***
(XEN) CPU0 -Testing vector 233 from bitmap 39,47,63-65,72,80,88,96,98,112,120,125,144,152,160,168,174,182-183,190,192,198,200,208,214,222
(XEN) Guest interrupt information:
(XEN)    IRQ:  0 affinity:00000000,00000000,00000000,00000001 vec:f0 type=IO-APIC-edge    status=00000000 mapped, unbound
(XEN)    IRQ:  1 affinity:00000000,00000000,00000000,00000002 vec:c6 type=IO-APIC-edge    status=00000050 in-flight=0 domain-list=0: 1(-S--),
(XEN)    IRQ:  2 affinity:ffffffff,ffffffff,ffffffff,ffffffff vec:e2 type=XT-PIC          status=00000000 mapped, unbound
(XEN)    IRQ:  3 affinity:00000000,00000000,00000000,00000001 vec:40 type=IO-APIC-edge    status=00000002 mapped, unbound
(XEN)    IRQ:  4 affinity:00000000,00000000,00000000,00000001 vec:f1 type=IO-APIC-edge    status=00000000 mapped, unbound
(XEN)    IRQ:  5 affinity:00000000,00000000,00000000,00000001 vec:48 type=IO-APIC-edge    status=00000002 mapped, unbound
(XEN)    IRQ:  6 affinity:00000000,00000000,00000000,00000001 vec:50 type=IO-APIC-edge    status=00000002 mapped, unbound
(XEN)    IRQ:  7 affinity:00000000,00000000,00000000,00000001 vec:58 type=IO-APIC-edge    status=00000050 in-flight=0 domain-list=0: 7(-S--),
(XEN)    IRQ:  8 affinity:00000000,00000000,00000000,00000001 vec:60 type=IO-APIC-edge    status=00000050 in-flight=0 domain-list=0: 8(-S--),
(XEN)    IRQ:  9 affinity:00000000,00000000,00000000,00000001 vec:de type=IO-APIC-level   status=00000010 in-flight=0 domain-list=0: 9(-S--),
(XEN)    IRQ: 10 affinity:00000000,00000000,00000000,00000001 vec:70 type=IO-APIC-edge    status=00000002 mapped, unbound
(XEN)    IRQ: 11 affinity:00000000,00000000,00000000,00000001 vec:78 type=IO-APIC-edge    status=00000002 mapped, unbound
(XEN)    IRQ: 12 affinity:00000000,00000000,00000000,00000001 vec:27 type=IO-APIC-edge    status=00000050 in-flight=0 domain-list=0: 12(-S--),
(XEN)    IRQ: 13 affinity:00000000,00000000,00000000,0000000f vec:90 type=IO-APIC-edge    status=00000002 mapped, unbound
(XEN)    IRQ: 14 affinity:00000000,00000000,00000000,00000001 vec:98 type=IO-APIC-edge    status=00000002 mapped, unbound
(XEN)    IRQ: 15 affinity:00000000,00000000,00000000,00000001 vec:a0 type=IO-APIC-edge    status=00000002 mapped, unbound
(XEN)    IRQ: 16 affinity:00000000,00000000,00000000,00000001 vec:2f type=IO-APIC-level   status=00000050 in-flight=0 domain-list=0: 16(-S--),
(XEN)    IRQ: 17 affinity:00000000,00000000,00000000,00000001 vec:3f type=IO-APIC-level   status=00000050 in-flight=0 domain-list=0: 17(-S--),
(XEN)    IRQ: 18 affinity:00000000,00000000,00000000,00000008 vec:41 type=IO-APIC-level   status=00000002 mapped, unbound
(XEN)    IRQ: 19 affinity:00000000,00000000,00000000,0000000f vec:c8 type=IO-APIC-level   status=00000002 mapped, unbound
(XEN)    IRQ: 20 affinity:00000000,00000000,00000000,00000002 vec:b7 type=IO-APIC-level   status=00000050 in-flight=0 domain-list=0: 20(-S--),
(XEN)    IRQ: 22 affinity:00000000,00000000,00000000,0000000f vec:62 type=IO-APIC-level   status=00000002 mapped, unbound
(XEN)    IRQ: 23 affinity:00000000,00000000,00000000,0000000f vec:a8 type=IO-APIC-edge    status=00000002 mapped, unbound
(XEN)    IRQ: 24 affinity:00000000,00000000,00000000,00000001 vec:28 type=DMA_MSI         status=00000000 mapped, unbound
(XEN)    IRQ: 25 affinity:00000000,00000000,00000000,00000001 vec:30 type=DMA_MSI         status=00000000 mapped, unbound
(XEN)    IRQ: 26 affinity:ffffffff,ffffffff,ffffffff,ffffffff vec:6f type=PCI-MSI         status=00000042 mapped, unbound
(XEN)    IRQ: 27 affinity:ffffffff,ffffffff,ffffffff,ffffffff vec:77 type=PCI-MSI         status=00000042 mapped, unbound
(XEN)    IRQ: 28 affinity:ffffffff,ffffffff,ffffffff,ffffffff vec:7f type=PCI-MSI         status=00000042 mapped, unbound
(XEN)    IRQ: 29 affinity:ffffffff,ffffffff,ffffffff,ffffffff vec:87 type=PCI-MSI         status=00000042 mapped, unbound
(XEN)    IRQ: 31 affinity:00000000,00000000,00000000,00000002 vec:a6 type=PCI-MSI         status=00000002 mapped, unbound
(XEN)    IRQ: 32 affinity:00000000,00000000,00000000,00000001 vec:47 type=PCI-MSI         status=00000050 in-flight=0 domain-list=0:273(-S--),
(XEN)    IRQ: 33 affinity:00000000,00000000,00000000,00000002 vec:5f type=PCI-MSI         status=00000010 in-flight=0 domain-list=0:272(PS--),
(XEN)    IRQ: 34 affinity:00000000,00000000,00000000,00000001 vec:67 type=PCI-MSI         status=00000010 in-flight=0 domain-list=0:271(-S--),
(XEN)    IRQ: 35 affinity:00000000,00000000,00000000,00000001 vec:4f type=PCI-MSI         status=00000050 in-flight=0 domain-list=1: 55(-S--),
(XEN) IO-APIC interrupt information:
(XEN)   IRQ  0 Vec240:
(XEN)     Apic 0x00, Pin  2: vec=f0 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=0 dest_id:0
(XEN)   IRQ  1 Vec198:
(XEN)     Apic 0x00, Pin  1: vec=c6 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=0 dest_id:0
(XEN)   IRQ  3 Vec 64:
(XEN)     Apic 0x00, Pin  3: vec=40 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=0 dest_id:0
(XEN)   IRQ  4 Vec241:
(XEN)     Apic 0x00, Pin  4: vec=f1 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=0 dest_id:0
(XEN)   IRQ  5 Vec 72:
(XEN)     Apic 0x00, Pin  5: vec=48 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=0 dest_id:0
(XEN)   IRQ  6 Vec 80:
(XEN)     Apic 0x00, Pin  6: vec=50 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=0 dest_id:0
(XEN)   IRQ  7 Vec 88:
(XEN)     Apic 0x00, Pin  7: vec=58 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=0 dest_id:0
(XEN)   IRQ  8 Vec 96:
(XEN)     Apic 0x00, Pin  8: vec=60 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=0 dest_id:0
(XEN)   IRQ  9 Vec222:
(XEN)     Apic 0x00, Pin  9: vec=de delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=L mask=0 dest_id:0
(XEN)   IRQ 10 Vec112:
(XEN)     Apic 0x00, Pin 10: vec=70 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=0 dest_id:0
(XEN)   IRQ 11 Vec120:
(XEN)     Apic 0x00, Pin 11: vec=78 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=0 dest_id:0
(XEN)   IRQ 12 Vec 39:
(XEN)     Apic 0x00, Pin 12: vec=27 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=0 dest_id:0
(XEN)   IRQ 13 Vec144:
(XEN)     Apic 0x00, Pin 13: vec=90 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=1 dest_id:0
(XEN)   IRQ 14 Vec152:
(XEN)     Apic 0x00, Pin 14: vec=98 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=0 dest_id:0
(XEN)   IRQ 15 Vec160:
(XEN)     Apic 0x00, Pin 15: vec=a0 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=0 dest_id:0
(XEN)   IRQ 16 Vec 47:
(XEN)     Apic 0x00, Pin 16: vec=2f delivery=LoPri dest=L status=0 polarity=1 irr=0 trig=L mask=0 dest_id:0
(XEN)   IRQ 17 Vec 63:
(XEN)     Apic 0x00, Pin 17: vec=3f delivery=LoPri dest=L status=0 polarity=1 irr=0 trig=L mask=0 dest_id:0
(XEN)   IRQ 18 Vec 65:
(XEN)     Apic 0x00, Pin 18: vec=41 delivery=LoPri dest=L status=0 polarity=1 irr=0 trig=L mask=1 dest_id:0
(XEN)   IRQ 19 Vec200:
(XEN)     Apic 0x00, Pin 19: vec=c8 delivery=LoPri dest=L status=0 polarity=1 irr=0 trig=L mask=1 dest_id:0
(XEN)   IRQ 20 Vec183:
(XEN)     Apic 0x00, Pin 20: vec=b7 delivery=LoPri dest=L status=0 polarity=1 irr=0 trig=L mask=0 dest_id:0
(XEN)   IRQ 22 Vec 98:
(XEN)     Apic 0x00, Pin 22: vec=62 delivery=LoPri dest=L status=0 polarity=1 irr=0 trig=L mask=1 dest_id:0
(XEN)   IRQ 23 Vec168:
(XEN)     Apic 0x00, Pin 23: vec=a8 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=1 dest_id:0
(XEN) Xen BUG at io_apic.c:554
(XEN) ----[ Xen-4.1.5-rc1  x86_64  debug=y  Tainted:    C ]----
(XEN) CPU:    0
(XEN) RIP:    e008:[<ffff82c48015e2d6>] smp_irq_move_cleanup_interrupt+0x211/0x289
(XEN) RFLAGS: 0000000000010092   CONTEXT: hypervisor
(XEN) rax: 0000000000000000   rbx: 00000000000000e9   rcx: 0000000000000000
(XEN) rdx: 0000000000000016   rsi: 000000000000000a   rdi: ffff82c4802592e0
(XEN) rbp: ffff82c48029fd08   rsp: ffff82c48029fcb8   r8:  0000000000000018
(XEN) r9:  0000000000000000   r10: 0000000000000000   r11: 0000000000000001
(XEN) r12: ffff830421080250   r13: ffff830421060534   r14: ffff82c48029ff18
(XEN) r15: ffff82c4802dd9e0   cr0: 000000008005003b   cr4: 00000000000026f0
(XEN) cr3: 0000000119a96000   cr2: ffff880402070198
(XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: e010   cs: e008
(XEN) Xen stack trace from rsp=ffff82c48029fcb8:
(XEN)    0000000000000000 ffff82c48029ff18 ffff82c4802dd9e0 000000e900000000
(XEN)    ffff83042109ba04 ffff830421008000 0000000000000114 000000000000001d
(XEN)    0000000000000114 0000000000000000 00007d3b7fd602c7 ffff82c48014de60
(XEN)    0000000000000000 0000000000000114 000000000000001d 0000000000000114
(XEN)    ffff82c48029fdc8 ffff830421008000 0000000000000246 ffff82c48025c1f0
(XEN)    0000000000000003 0000001944602466 0000000000000000 0000000000000001
(XEN)    0000000000000000 0000000000000286 ffff830421060f34 0000002000000000
(XEN)    ffff82c4801226c0 000000000000e008 0000000000000286 ffff82c48029fdc8
(XEN)    000000000000e010 0000000000000286 ffff82c48029fe48 ffff82c480164446
(XEN)    ffff82c4802dd9e0 0000000000000286 ffff830421060f00 ffff830421060f34
(XEN)    ffff830421050ac0 000000000000001d 0000000000000246 ffff8301108fd140
(XEN)    ffff82c4801226d3 ffff82c48029fe78 000000000000001d ffff8803fa889af0
(XEN)    0000000000000114 ffff8804023be000 ffff82c48029fef8 ffff82c48017655b
(XEN)    ffff830114c7f300 ffffffff81381646 ffff82f600000008 ffff830421008000
(XEN)    0000000000000003 000000030000001d 00000000e2200000 0000000100a0fb00
(XEN)    0000000000007ff0 ffffffffffffffff 0000000000000003 0000000000000003
(XEN)    00000000e2200000 c390ed90d1ffffff 0000000000000202 ffff8300ca666000
(XEN)    ffff8803fc880240 0000000000000011 ffff8804023be858 ffff8804023be000
(XEN)    00007d3b7fd600c7 ffff82c480209f38 ffffffff8100142a 0000000000000021
(XEN)    ffff8804023be000 ffff8804023be858 0000000000000011 ffff8803fc880240
(XEN) Xen call trace:
(XEN)    [<ffff82c48015e2d6>] smp_irq_move_cleanup_interrupt+0x211/0x289
(XEN)    [<ffff82c48014de60>] irq_move_cleanup_interrupt+0x30/0x40
(XEN)    [<ffff82c4801226c0>] _spin_unlock_irqrestore+0x22/0x24
(XEN)    [<ffff82c480164446>] map_domain_pirq+0x37a/0x3df
(XEN)    [<ffff82c48017655b>] do_physdev_op+0xa2b/0x1508
(XEN)    [<ffff82c480209f38>] syscall_enter+0xc8/0x122

> ~Andrew
>

-- 
Best Regards / Pozdrawiam,
Marek Marczykowski
Invisible Things Lab

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
Andrew Cooper
2013-Mar-26 17:54 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
>> Can you replace the ASSERT() with code similar to that in
>>
>> http://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=xen/arch/x86/irq.c;h=5e0f463c381750090373dabd8967635bc297d457;hb=refs/heads/staging#l668
>>
>> Which should call dump_irqs() in before dying because of the ASSERT.
>> You might need to also take the latest version of dump_irqs() from
>> unstable, as I seem to remember there was another assertion failure due
>> to xfree()'ing in IRQ context.
> Full log here:
> http://duch.mimuw.edu.pl/~marmarek/qubes/xen-4.1-failed-resume-dump-irqs.log
> Interesting part:
> (...)

Even more curious.  vector e9 does not appear to be programmed in.  Can
you extend the debugging to also call __print_IO_APIC().

The i debug key and z debug key list IO-APIC entries from different
sources of information.

~Andrew
Marek Marczykowski
2013-Mar-26 18:21 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
On 26.03.2013 18:54, Andrew Cooper wrote:
>
>>> Can you replace the ASSERT() with code similar to that in
>>>
>>> http://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=xen/arch/x86/irq.c;h=5e0f463c381750090373dabd8967635bc297d457;hb=refs/heads/staging#l668
>>>
>>> Which should call dump_irqs() in before dying because of the ASSERT.
>>> You might need to also take the latest version of dump_irqs() from
>>> unstable, as I seem to remember there was another assertion failure due
>>> to xfree()'ing in IRQ context.
>> Full log here:
>> http://duch.mimuw.edu.pl/~marmarek/qubes/xen-4.1-failed-resume-dump-irqs.log
>> Interesting part:
(...)
> Even more curious.  vector e9 does not appear to be programmed in.  Can
> you extend the debugging to also call __print_IO_APIC().
>
> The i debug key and z debug key list IO-APIC entries from different
> sources of information.

As you wish, full log:
http://duch.mimuw.edu.pl/~marmarek/qubes/xen-4.1-failed-resume-dump-irqs2.log

Final part:
(XEN) *** IRQ BUG found ***
(XEN) CPU0 -Testing vector 233 from bitmap 43,49,64,72,80,87-88,95-96,103,112,119-121,127,135,143-144,151-152,159-160,168,192,197,200,211,216,218
(XEN) Guest interrupt information:
(XEN)    IRQ:  0 affinity:00000000,00000000,00000000,00000001 vec:f0 type=IO-APIC-edge    status=00000000 mapped, unbound
(XEN)    IRQ:  1 affinity:00000000,00000000,00000000,00000001 vec:7f type=IO-APIC-edge    status=00000050 in-flight=0 domain-list=0: 1(-S--),
(XEN)    IRQ:  2 affinity:ffffffff,ffffffff,ffffffff,ffffffff vec:e2 type=XT-PIC          status=00000000 mapped, unbound
(XEN)    IRQ:  3 affinity:00000000,00000000,00000000,00000001 vec:40 type=IO-APIC-edge    status=00000002 mapped, unbound
(XEN)    IRQ:  4 affinity:00000000,00000000,00000000,00000001 vec:f1 type=IO-APIC-edge    status=00000000 mapped, unbound
(XEN)    IRQ:  5 affinity:00000000,00000000,00000000,00000001 vec:48 type=IO-APIC-edge    status=00000002 mapped, unbound
(XEN)    IRQ:  6 affinity:00000000,00000000,00000000,00000001 vec:50 type=IO-APIC-edge    status=00000002 mapped, unbound
(XEN)    IRQ:  7 affinity:00000000,00000000,00000000,00000008 vec:da type=IO-APIC-edge    status=00000050 in-flight=0 domain-list=0: 7(-S--),
(XEN)    IRQ:  8 affinity:00000000,00000000,00000000,00000004 vec:d8 type=IO-APIC-edge    status=00000050 in-flight=0 domain-list=0: 8(-S--),
(XEN)    IRQ:  9 affinity:00000000,00000000,00000000,00000001 vec:87 type=IO-APIC-level   status=00000010 in-flight=0 domain-list=0: 9(-S--),
(XEN)    IRQ: 10 affinity:00000000,00000000,00000000,00000001 vec:70 type=IO-APIC-edge    status=00000002 mapped, unbound
(XEN)    IRQ: 11 affinity:00000000,00000000,00000000,00000001 vec:78 type=IO-APIC-edge    status=00000002 mapped, unbound
(XEN)    IRQ: 12 affinity:00000000,00000000,00000000,00000001 vec:8f type=IO-APIC-edge    status=00000050 in-flight=0 domain-list=0: 12(-S--),
(XEN)    IRQ: 13 affinity:00000000,00000000,00000000,0000000f vec:90 type=IO-APIC-edge    status=00000002 mapped, unbound
(XEN)    IRQ: 14 affinity:00000000,00000000,00000000,00000001 vec:98 type=IO-APIC-edge    status=00000002 mapped, unbound
(XEN)    IRQ: 15 affinity:00000000,00000000,00000000,00000001 vec:a0 type=IO-APIC-edge    status=00000002 mapped, unbound
(XEN)    IRQ: 16 affinity:00000000,00000000,00000000,00000001 vec:97 type=IO-APIC-level   status=00000050 in-flight=0 domain-list=0: 16(-S--),
(XEN)    IRQ: 17 affinity:00000000,00000000,00000000,00000001 vec:9f type=IO-APIC-level   status=00000050 in-flight=0 domain-list=0: 17(-S--),
(XEN)    IRQ: 18 affinity:00000000,00000000,00000000,00000004 vec:79 type=IO-APIC-level   status=00000002 mapped, unbound
(XEN)    IRQ: 19 affinity:00000000,00000000,00000000,0000000f vec:c8 type=IO-APIC-level   status=00000002 mapped, unbound
(XEN)    IRQ: 20 affinity:00000000,00000000,00000000,00000002 vec:d3 type=IO-APIC-level   status=00000050 in-flight=0 domain-list=0: 20(-S--),
(XEN)    IRQ: 22 affinity:00000000,00000000,00000000,0000000f vec:2b type=IO-APIC-level   status=00000002 mapped, unbound
(XEN)    IRQ: 23 affinity:00000000,00000000,00000000,0000000f vec:a8 type=IO-APIC-edge    status=00000002 mapped, unbound
(XEN)    IRQ: 24 affinity:00000000,00000000,00000000,00000001 vec:28 type=DMA_MSI         status=00000000 mapped, unbound
(XEN)    IRQ: 25 affinity:00000000,00000000,00000000,00000001 vec:30 type=DMA_MSI         status=00000000 mapped, unbound
(XEN)    IRQ: 26 affinity:00000000,00000000,00000000,00000001 vec:c7 type=PCI-MSI         status=00000010 in-flight=0 domain-list=0:279(-S--),
(XEN)    IRQ: 27 affinity:00000000,00000000,00000000,00000001 vec:cf type=PCI-MSI         status=00000050 in-flight=0 domain-list=0:278(-S--),
(XEN)    IRQ: 28 affinity:00000000,00000000,00000000,00000001 vec:d7 type=PCI-MSI         status=00000050 in-flight=0 domain-list=0:277(-S--),
(XEN)    IRQ: 29 affinity:00000000,00000000,00000000,00000001 vec:df type=PCI-MSI         status=00000050 in-flight=0 domain-list=0:276(-S--),
(XEN)    IRQ: 30 affinity:00000000,00000000,00000000,00000001 vec:38 type=PCI-MSI         status=00000050 in-flight=0 domain-list=0:275(-S--),
(XEN)    IRQ: 31 affinity:00000000,00000000,00000000,00000004 vec:47 type=PCI-MSI         status=00000002 mapped, unbound
(XEN)    IRQ: 32 affinity:00000000,00000000,00000000,00000001 vec:a7 type=PCI-MSI         status=00000050 in-flight=0 domain-list=0:273(-S--),
(XEN)    IRQ: 33 affinity:00000000,00000000,00000000,00000001 vec:b7 type=PCI-MSI         status=00000010 in-flight=0 domain-list=0:272(-S--),
(XEN)    IRQ: 34 affinity:00000000,00000000,00000000,00000004 vec:40 type=PCI-MSI         status=00000010 in-flight=0 domain-list=0:271(PS--),
(XEN)    IRQ: 35 affinity:00000000,00000000,00000000,00000001 vec:af type=PCI-MSI         status=00000050 in-flight=0 domain-list=1: 55(-S--),
(XEN) IO-APIC interrupt information:
(XEN)   IRQ  0 Vec240:
(XEN)     Apic 0x00, Pin  2: vec=f0 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=0 dest_id:0
(XEN)   IRQ  1 Vec127:
(XEN)     Apic 0x00, Pin  1: vec=7f delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=0 dest_id:0
(XEN)   IRQ  3 Vec 64:
(XEN)     Apic 0x00, Pin  3: vec=40 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=0 dest_id:0
(XEN)   IRQ  4 Vec241:
(XEN)     Apic 0x00, Pin  4: vec=f1 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=0 dest_id:0
(XEN)   IRQ  5 Vec 72:
(XEN)     Apic 0x00, Pin  5: vec=48 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=0 dest_id:0
(XEN)   IRQ  6 Vec 80:
(XEN)     Apic 0x00, Pin  6: vec=50 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=0 dest_id:0
(XEN)   IRQ  7 Vec218:
(XEN)     Apic 0x00, Pin  7: vec=da delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=0 dest_id:0
(XEN)   IRQ  8 Vec216:
(XEN)     Apic 0x00, Pin  8: vec=d8 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=0 dest_id:0
(XEN)   IRQ  9 Vec135:
(XEN)     Apic 0x00, Pin  9: vec=87 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=L mask=0 dest_id:0
(XEN)   IRQ 10 Vec112:
(XEN)     Apic 0x00, Pin 10: vec=70 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=0 dest_id:0
(XEN)   IRQ 11 Vec120:
(XEN)     Apic 0x00, Pin 11: vec=78 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=0 dest_id:0
(XEN)   IRQ 12 Vec143:
(XEN)     Apic 0x00, Pin 12: vec=8f delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=0 dest_id:0
(XEN)   IRQ 13 Vec144:
(XEN)     Apic 0x00, Pin 13: vec=90 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=1 dest_id:0
(XEN)   IRQ 14 Vec152:
(XEN)     Apic 0x00, Pin 14: vec=98 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=0 dest_id:0
(XEN)   IRQ 15 Vec160:
(XEN)     Apic 0x00, Pin 15: vec=a0 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=0 dest_id:0
(XEN)   IRQ 16 Vec151:
(XEN)     Apic 0x00, Pin 16: vec=97 delivery=LoPri dest=L status=0 polarity=1 irr=0 trig=L mask=0 dest_id:0
(XEN)   IRQ 17 Vec159:
(XEN)     Apic 0x00, Pin 17: vec=9f delivery=LoPri dest=L status=0 polarity=1 irr=0 trig=L mask=0 dest_id:0
(XEN)   IRQ 18 Vec121:
(XEN)     Apic 0x00, Pin 18: vec=79 delivery=LoPri dest=L status=0 polarity=1 irr=0 trig=L mask=1 dest_id:0
(XEN)   IRQ 19 Vec200:
(XEN)     Apic 0x00, Pin 19: vec=c8 delivery=LoPri dest=L status=0 polarity=1 irr=0 trig=L mask=1 dest_id:0
(XEN)   IRQ 20 Vec211:
(XEN)     Apic 0x00, Pin 20: vec=d3 delivery=LoPri dest=L status=0 polarity=1 irr=0 trig=L mask=0 dest_id:0
(XEN)   IRQ 22 Vec 43:
(XEN)     Apic 0x00, Pin 22: vec=2b delivery=LoPri dest=L status=0 polarity=1 irr=0 trig=L mask=1 dest_id:0
(XEN)   IRQ 23 Vec168:
(XEN)     Apic 0x00, Pin 23: vec=a8 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=1 dest_id:0
(XEN) number of MP IRQ sources: 15.
(XEN) number of IO-APIC #2 registers: 24.
(XEN) testing the IO APIC.......................
(XEN) IO APIC #2......
(XEN) .... register #00: 02000000
(XEN) .......    : physical APIC id: 02
(XEN) .......    : Delivery Type: 0
(XEN) .......    : LTS          : 0
(XEN) .... register #01: 00170020
(XEN) .......     : max redirection entries: 0017
(XEN) .......     : PRQ implemented: 0
(XEN) .......     : IO APIC version: 0020
(XEN) .... IRQ redirection table:
(XEN)  NR Log Phy Mask Trig IRR Pol Stat Dest Deli Vect:
(XEN)  00 0DC 0C  1    0    0   0   0    1    2    87
(XEN)  01 000 00  0    0    0   0   0    1    1    7F
(XEN)  02 000 00  0    0    0   0   0    1    1    F0
(XEN)  03 000 00  0    0    0   0   0    1    1    40
(XEN)  04 000 00  0    0    0   0   0    1    1    F1
(XEN)  05 000 00  0    0    0   0   0    1    1    48
(XEN)  06 000 00  0    0    0   0   0    1    1    50
(XEN)  07 000 00  0    0    0   0   0    1    1    DA
(XEN)  08 000 00  0    0    0   0   0    1    1    D8
(XEN)  09 000 00  0    1    0   0   0    1    1    87
(XEN)  0a 000 00  0    0    0   0   0    1    1    70
(XEN)  0b 000 00  0    0    0   0   0    1    1    78
(XEN)  0c 000 00  0    0    0   0   0    1    1    8F
(XEN)  0d 000 00  1    0    0   0   0    1    1    90
(XEN)  0e 000 00  0    0    0   0   0    1    1    98
(XEN)  0f 000 00  0    0    0   0   0    1    1    A0
(XEN)  10 000 00  0    1    0   1   0    1    1    97
(XEN)  11 000 00  0    1    0   1   0    1    1    9F
(XEN)  12 000 00  1    1    0   1   0    1    1    79
(XEN)  13 000 00  1    1    0   1   0    1    1    C8
(XEN)  14 000 00  0    1    0   1   0    1    1    D3
(XEN)  15 000 00  1    0    0   0   0    0    0    00
(XEN)  16 000 00  1    1    0   1   0    1    1    2B
(XEN)  17 000 00  1    0    0   0   0    1    1    A8
(XEN) Using vector-based indexing
(XEN) IRQ to pin mappings:
(XEN) IRQ240 -> 0:2
(XEN) IRQ127 -> 0:1
(XEN) IRQ64 -> 0:3
(XEN) IRQ241 -> 0:4
(XEN) IRQ72 -> 0:5
(XEN) IRQ80 -> 0:6
(XEN) IRQ218 -> 0:7
(XEN) IRQ216 -> 0:8
(XEN) IRQ135 -> 0:9
(XEN) IRQ112 -> 0:10
(XEN) IRQ120 -> 0:11
(XEN) IRQ143 -> 0:12
(XEN) IRQ144 -> 0:13
(XEN) IRQ152 -> 0:14
(XEN) IRQ160 -> 0:15
(XEN) IRQ151 -> 0:16
(XEN) IRQ159 -> 0:17
(XEN) IRQ121 -> 0:18
(XEN) IRQ200 -> 0:19
(XEN) IRQ211 -> 0:20
(XEN) IRQ43 -> 0:22
(XEN) IRQ168 -> 0:23
(XEN) .................................... done.
(XEN) Xen BUG at io_apic.c:556
(XEN) ----[ Xen-4.1.5-rc1  x86_64  debug=y  Tainted:    C ]----
(XEN) CPU:    0
(XEN) RIP:    e008:[<ffff82c48015e2db>] smp_irq_move_cleanup_interrupt+0x216/0x28e
(XEN) RFLAGS: 0000000000010092   CONTEXT: hypervisor
(XEN) rax: 0000000000000000   rbx: 00000000000000e9   rcx: 0000000000000000
(XEN) rdx: 0000000000000000   rsi: 000000000000000a   rdi: ffff82c4802592e0
(XEN) rbp: ffff82c48029ff08   rsp: ffff82c48029feb8   r8:  0000000000000004
(XEN) r9:  0000000000000004   r10: 0000000000000004   r11: 0000000000000002
(XEN) r12: ffff830421080250   r13: ffff830421060534   r14: ffff82c48029ff18
(XEN) r15: ffff82c4802dd9e0   cr0: 000000008005003b   cr4: 00000000000026f0
(XEN) cr3: 000000026582c000   cr2: ffff8804020701d8
(XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: 0000   cs: e008
(XEN) Xen stack trace from rsp=ffff82c48029feb8:
(XEN)    0000000000000000 ffff82c48029ff18 ffff82c4802dd9e0 000000e900000000
(XEN)    000000000000e02b 0000000000000000 000000004bf51982 00000000000060a9
(XEN)    0000000000000000 0000000000000000 00007d3b7fd600c7 ffff82c48014de60
(XEN)    0000000000000000 0000000000000000 00000000000060a9 000000004bf51982
(XEN)    ffff8802d2665b28 0000000000000000 0000000000000000 0000000000007ff0
(XEN)    0000000000000022 0000000000000000 000000024bf57322 0000000001307da0
(XEN)    00000000000059a0 0000000000000000 00000000000060a9 0000002000000000
(XEN)    ffffffff8123c51a 000000000000e033 0000000000000293 ffff8802d2665b08
(XEN)    000000000000e02b 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 ffff8300ca9a0000 0000000000000000
(XEN)    0000000000000000
(XEN) Xen call trace:
(XEN)    [<ffff82c48015e2db>] smp_irq_move_cleanup_interrupt+0x216/0x28e

-- 
Best Regards / Pozdrawiam,
Marek Marczykowski
Invisible Things Lab

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
Andrew Cooper
2013-Mar-26 18:50 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
On 26/03/2013 18:21, Marek Marczykowski wrote:> On 26.03.2013 18:54, Andrew Cooper wrote: >>>> Can you replace the ASSERT() with code similar to that in >>>> >>>> http://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=xen/arch/x86/irq.c;h=5e0f463c381750090373dabd8967635bc297d457;hb=refs/heads/staging#l668 >>>> >>>> Which should call dump_irqs() in before dying because of the ASSERT. >>>> You might need to also take the latest version of dump_irqs() from >>>> unstable, as I seem to remember there was another assertion failure due >>>> to xfree()''ing in IRQ context. >>> Full log here: >>> http://duch.mimuw.edu.pl/~marmarek/qubes/xen-4.1-failed-resume-dump-irqs.log >>> Interesting part: > (...) >> Even more curious. vector e9 does not appear to be programmed in. Can >> you extend the debugging to also call __print_IO_APIC(). >> >> The i debug key and z debug key list IO-APIC entries from different >> sources of information. > As you wish, full log: > http://duch.mimuw.edu.pl/~marmarek/qubes/xen-4.1-failed-resume-dump-irqs2.log > > Final part: > (XEN) *** IRQ BUG found *** > (XEN) CPU0 -Testing vector 233 from bitmap > 43,49,64,72,80,87-88,95-96,103,112,119-121,127,135,143-144,151-152,159-160,168,192,197,200,211,216,218 > (XEN) Guest interrupt information: > (XEN) IRQ: 0 affinity:00000000,00000000,00000000,00000001 vec:f0 > type=IO-APIC-edge status=00000000 mapped, unbound > (XEN) IRQ: 1 affinity:00000000,00000000,00000000,00000001 vec:7f > type=IO-APIC-edge status=00000050 in-flight=0 domain-list=0: 1(-S--), > (XEN) IRQ: 2 affinity:ffffffff,ffffffff,ffffffff,ffffffff vec:e2 > type=XT-PIC status=00000000 mapped, unbound > (XEN) IRQ: 3 affinity:00000000,00000000,00000000,00000001 vec:40 > type=IO-APIC-edge status=00000002 mapped, unbound > (XEN) IRQ: 4 affinity:00000000,00000000,00000000,00000001 vec:f1 > type=IO-APIC-edge status=00000000 mapped, unbound > (XEN) IRQ: 5 affinity:00000000,00000000,00000000,00000001 vec:48 > type=IO-APIC-edge status=00000002 mapped, unbound > 
(XEN) IRQ: 6 affinity:00000000,00000000,00000000,00000001 vec:50 > type=IO-APIC-edge status=00000002 mapped, unbound > (XEN) IRQ: 7 affinity:00000000,00000000,00000000,00000008 vec:da > type=IO-APIC-edge status=00000050 in-flight=0 domain-list=0: 7(-S--), > (XEN) IRQ: 8 affinity:00000000,00000000,00000000,00000004 vec:d8 > type=IO-APIC-edge status=00000050 in-flight=0 domain-list=0: 8(-S--), > (XEN) IRQ: 9 affinity:00000000,00000000,00000000,00000001 vec:87 > type=IO-APIC-level status=00000010 in-flight=0 domain-list=0: 9(-S--), > (XEN) IRQ: 10 affinity:00000000,00000000,00000000,00000001 vec:70 > type=IO-APIC-edge status=00000002 mapped, unbound > (XEN) IRQ: 11 affinity:00000000,00000000,00000000,00000001 vec:78 > type=IO-APIC-edge status=00000002 mapped, unbound > (XEN) IRQ: 12 affinity:00000000,00000000,00000000,00000001 vec:8f > type=IO-APIC-edge status=00000050 in-flight=0 domain-list=0: 12(-S--), > (XEN) IRQ: 13 affinity:00000000,00000000,00000000,0000000f vec:90 > type=IO-APIC-edge status=00000002 mapped, unbound > (XEN) IRQ: 14 affinity:00000000,00000000,00000000,00000001 vec:98 > type=IO-APIC-edge status=00000002 mapped, unbound > (XEN) IRQ: 15 affinity:00000000,00000000,00000000,00000001 vec:a0 > type=IO-APIC-edge status=00000002 mapped, unbound > (XEN) IRQ: 16 affinity:00000000,00000000,00000000,00000001 vec:97 > type=IO-APIC-level status=00000050 in-flight=0 domain-list=0: 16(-S--), > (XEN) IRQ: 17 affinity:00000000,00000000,00000000,00000001 vec:9f > type=IO-APIC-level status=00000050 in-flight=0 domain-list=0: 17(-S--), > (XEN) IRQ: 18 affinity:00000000,00000000,00000000,00000004 vec:79 > type=IO-APIC-level status=00000002 mapped, unbound > (XEN) IRQ: 19 affinity:00000000,00000000,00000000,0000000f vec:c8 > type=IO-APIC-level status=00000002 mapped, unbound > (XEN) IRQ: 20 affinity:00000000,00000000,00000000,00000002 vec:d3 > type=IO-APIC-level status=00000050 in-flight=0 domain-list=0: 20(-S--), > (XEN) IRQ: 22 
affinity:00000000,00000000,00000000,0000000f vec:2b > type=IO-APIC-level status=00000002 mapped, unbound > (XEN) IRQ: 23 affinity:00000000,00000000,00000000,0000000f vec:a8 > type=IO-APIC-edge status=00000002 mapped, unbound > (XEN) IRQ: 24 affinity:00000000,00000000,00000000,00000001 vec:28 > type=DMA_MSI status=00000000 mapped, unbound > (XEN) IRQ: 25 affinity:00000000,00000000,00000000,00000001 vec:30 > type=DMA_MSI status=00000000 mapped, unbound > (XEN) IRQ: 26 affinity:00000000,00000000,00000000,00000001 vec:c7 > type=PCI-MSI status=00000010 in-flight=0 domain-list=0:279(-S--), > (XEN) IRQ: 27 affinity:00000000,00000000,00000000,00000001 vec:cf > type=PCI-MSI status=00000050 in-flight=0 domain-list=0:278(-S--), > (XEN) IRQ: 28 affinity:00000000,00000000,00000000,00000001 vec:d7 > type=PCI-MSI status=00000050 in-flight=0 domain-list=0:277(-S--), > (XEN) IRQ: 29 affinity:00000000,00000000,00000000,00000001 vec:df > type=PCI-MSI status=00000050 in-flight=0 domain-list=0:276(-S--), > (XEN) IRQ: 30 affinity:00000000,00000000,00000000,00000001 vec:38 > type=PCI-MSI status=00000050 in-flight=0 domain-list=0:275(-S--), > (XEN) IRQ: 31 affinity:00000000,00000000,00000000,00000004 vec:47 > type=PCI-MSI status=00000002 mapped, unbound > (XEN) IRQ: 32 affinity:00000000,00000000,00000000,00000001 vec:a7 > type=PCI-MSI status=00000050 in-flight=0 domain-list=0:273(-S--), > (XEN) IRQ: 33 affinity:00000000,00000000,00000000,00000001 vec:b7 > type=PCI-MSI status=00000010 in-flight=0 domain-list=0:272(-S--), > (XEN) IRQ: 34 affinity:00000000,00000000,00000000,00000004 vec:40 > type=PCI-MSI status=00000010 in-flight=0 domain-list=0:271(PS--), > (XEN) IRQ: 35 affinity:00000000,00000000,00000000,00000001 vec:af > type=PCI-MSI status=00000050 in-flight=0 domain-list=1: 55(-S--), > (XEN) IO-APIC interrupt information: > (XEN) IRQ 0 Vec240: > (XEN) Apic 0x00, Pin 2: vec=f0 delivery=LoPri dest=L status=0 > polarity=0 irr=0 trig=E mask=0 dest_id:0 > (XEN) IRQ 1 Vec127: > (XEN) Apic 
0x00, Pin 1: vec=7f delivery=LoPri dest=L status=0 > polarity=0 irr=0 trig=E mask=0 dest_id:0 > (XEN) IRQ 3 Vec 64: > (XEN) Apic 0x00, Pin 3: vec=40 delivery=LoPri dest=L status=0 > polarity=0 irr=0 trig=E mask=0 dest_id:0 > (XEN) IRQ 4 Vec241: > (XEN) Apic 0x00, Pin 4: vec=f1 delivery=LoPri dest=L status=0 > polarity=0 irr=0 trig=E mask=0 dest_id:0 > (XEN) IRQ 5 Vec 72: > (XEN) Apic 0x00, Pin 5: vec=48 delivery=LoPri dest=L status=0 > polarity=0 irr=0 trig=E mask=0 dest_id:0 > (XEN) IRQ 6 Vec 80: > (XEN) Apic 0x00, Pin 6: vec=50 delivery=LoPri dest=L status=0 > polarity=0 irr=0 trig=E mask=0 dest_id:0 > (XEN) IRQ 7 Vec218: > (XEN) Apic 0x00, Pin 7: vec=da delivery=LoPri dest=L status=0 > polarity=0 irr=0 trig=E mask=0 dest_id:0 > (XEN) IRQ 8 Vec216: > (XEN) Apic 0x00, Pin 8: vec=d8 delivery=LoPri dest=L status=0 > polarity=0 irr=0 trig=E mask=0 dest_id:0 > (XEN) IRQ 9 Vec135: > (XEN) Apic 0x00, Pin 9: vec=87 delivery=LoPri dest=L status=0 > polarity=0 irr=0 trig=L mask=0 dest_id:0 > (XEN) IRQ 10 Vec112: > (XEN) Apic 0x00, Pin 10: vec=70 delivery=LoPri dest=L status=0 > polarity=0 irr=0 trig=E mask=0 dest_id:0 > (XEN) IRQ 11 Vec120: > (XEN) Apic 0x00, Pin 11: vec=78 delivery=LoPri dest=L status=0 > polarity=0 irr=0 trig=E mask=0 dest_id:0 > (XEN) IRQ 12 Vec143: > (XEN) Apic 0x00, Pin 12: vec=8f delivery=LoPri dest=L status=0 > polarity=0 irr=0 trig=E mask=0 dest_id:0 > (XEN) IRQ 13 Vec144: > (XEN) Apic 0x00, Pin 13: vec=90 delivery=LoPri dest=L status=0 > polarity=0 irr=0 trig=E mask=1 dest_id:0 > (XEN) IRQ 14 Vec152: > (XEN) Apic 0x00, Pin 14: vec=98 delivery=LoPri dest=L status=0 > polarity=0 irr=0 trig=E mask=0 dest_id:0 > (XEN) IRQ 15 Vec160: > (XEN) Apic 0x00, Pin 15: vec=a0 delivery=LoPri dest=L status=0 > polarity=0 irr=0 trig=E mask=0 dest_id:0 > (XEN) IRQ 16 Vec151: > (XEN) Apic 0x00, Pin 16: vec=97 delivery=LoPri dest=L status=0 > polarity=1 irr=0 trig=L mask=0 dest_id:0 > (XEN) IRQ 17 Vec159: > (XEN) Apic 0x00, Pin 17: vec=9f delivery=LoPri dest=L 
status=0 > polarity=1 irr=0 trig=L mask=0 dest_id:0 > (XEN) IRQ 18 Vec121: > (XEN) Apic 0x00, Pin 18: vec=79 delivery=LoPri dest=L status=0 > polarity=1 irr=0 trig=L mask=1 dest_id:0 > (XEN) IRQ 19 Vec200: > (XEN) Apic 0x00, Pin 19: vec=c8 delivery=LoPri dest=L status=0 > polarity=1 irr=0 trig=L mask=1 dest_id:0 > (XEN) IRQ 20 Vec211: > (XEN) Apic 0x00, Pin 20: vec=d3 delivery=LoPri dest=L status=0 > polarity=1 irr=0 trig=L mask=0 dest_id:0 > (XEN) IRQ 22 Vec 43: > (XEN) Apic 0x00, Pin 22: vec=2b delivery=LoPri dest=L status=0 > polarity=1 irr=0 trig=L mask=1 dest_id:0 > (XEN) IRQ 23 Vec168: > (XEN) Apic 0x00, Pin 23: vec=a8 delivery=LoPri dest=L status=0 > polarity=0 irr=0 trig=E mask=1 dest_id:0 > (XEN) number of MP IRQ sources: 15. > (XEN) number of IO-APIC #2 registers: 24. > (XEN) testing the IO APIC....................... > (XEN) IO APIC #2...... > (XEN) .... register #00: 02000000 > (XEN) ....... : physical APIC id: 02 > (XEN) ....... : Delivery Type: 0 > (XEN) ....... : LTS : 0 > (XEN) .... register #01: 00170020 > (XEN) ....... : max redirection entries: 0017 > (XEN) ....... : PRQ implemented: 0 > (XEN) ....... : IO APIC version: 0020 > (XEN) .... 
IRQ redirection table: > (XEN) NR Log Phy Mask Trig IRR Pol Stat Dest Deli Vect: > (XEN) 00 0DC 0C 1 0 0 0 0 1 2 87 > (XEN) 01 000 00 0 0 0 0 0 1 1 7F > (XEN) 02 000 00 0 0 0 0 0 1 1 F0 > (XEN) 03 000 00 0 0 0 0 0 1 1 40 > (XEN) 04 000 00 0 0 0 0 0 1 1 F1 > (XEN) 05 000 00 0 0 0 0 0 1 1 48 > (XEN) 06 000 00 0 0 0 0 0 1 1 50 > (XEN) 07 000 00 0 0 0 0 0 1 1 DA > (XEN) 08 000 00 0 0 0 0 0 1 1 D8 > (XEN) 09 000 00 0 1 0 0 0 1 1 87 > (XEN) 0a 000 00 0 0 0 0 0 1 1 70 > (XEN) 0b 000 00 0 0 0 0 0 1 1 78 > (XEN) 0c 000 00 0 0 0 0 0 1 1 8F > (XEN) 0d 000 00 1 0 0 0 0 1 1 90 > (XEN) 0e 000 00 0 0 0 0 0 1 1 98 > (XEN) 0f 000 00 0 0 0 0 0 1 1 A0 > (XEN) 10 000 00 0 1 0 1 0 1 1 97 > (XEN) 11 000 00 0 1 0 1 0 1 1 9F > (XEN) 12 000 00 1 1 0 1 0 1 1 79 > (XEN) 13 000 00 1 1 0 1 0 1 1 C8 > (XEN) 14 000 00 0 1 0 1 0 1 1 D3 > (XEN) 15 000 00 1 0 0 0 0 0 0 00 > (XEN) 16 000 00 1 1 0 1 0 1 1 2B > (XEN) 17 000 00 1 0 0 0 0 1 1 A8 > (XEN) Using vector-based indexing > (XEN) IRQ to pin mappings: > (XEN) IRQ240 -> 0:2 > (XEN) IRQ127 -> 0:1 > (XEN) IRQ64 -> 0:3 > (XEN) IRQ241 -> 0:4 > (XEN) IRQ72 -> 0:5 > (XEN) IRQ80 -> 0:6 > (XEN) IRQ218 -> 0:7 > (XEN) IRQ216 -> 0:8 > (XEN) IRQ135 -> 0:9 > (XEN) IRQ112 -> 0:10 > (XEN) IRQ120 -> 0:11 > (XEN) IRQ143 -> 0:12 > (XEN) IRQ144 -> 0:13 > (XEN) IRQ152 -> 0:14 > (XEN) IRQ160 -> 0:15 > (XEN) IRQ151 -> 0:16 > (XEN) IRQ159 -> 0:17 > (XEN) IRQ121 -> 0:18 > (XEN) IRQ200 -> 0:19 > (XEN) IRQ211 -> 0:20 > (XEN) IRQ43 -> 0:22 > (XEN) IRQ168 -> 0:23 > (XEN) .................................... done. 
> (XEN) Xen BUG at io_apic.c:556 > (XEN) ----[ Xen-4.1.5-rc1 x86_64 debug=y Tainted: C ]---- > (XEN) CPU: 0 > (XEN) RIP: e008:[<ffff82c48015e2db>] smp_irq_move_cleanup_interrupt+0x216/0x28e > (XEN) RFLAGS: 0000000000010092 CONTEXT: hypervisor > (XEN) rax: 0000000000000000 rbx: 00000000000000e9 rcx: 0000000000000000 > (XEN) rdx: 0000000000000000 rsi: 000000000000000a rdi: ffff82c4802592e0 > (XEN) rbp: ffff82c48029ff08 rsp: ffff82c48029feb8 r8: 0000000000000004 > (XEN) r9: 0000000000000004 r10: 0000000000000004 r11: 0000000000000002 > (XEN) r12: ffff830421080250 r13: ffff830421060534 r14: ffff82c48029ff18 > (XEN) r15: ffff82c4802dd9e0 cr0: 000000008005003b cr4: 00000000000026f0 > (XEN) cr3: 000000026582c000 cr2: ffff8804020701d8 > (XEN) ds: 0000 es: 0000 fs: 0000 gs: 0000 ss: 0000 cs: e008 > (XEN) Xen stack trace from rsp=ffff82c48029feb8: > (XEN) 0000000000000000 ffff82c48029ff18 ffff82c4802dd9e0 000000e900000000 > (XEN) 000000000000e02b 0000000000000000 000000004bf51982 00000000000060a9 > (XEN) 0000000000000000 0000000000000000 00007d3b7fd600c7 ffff82c48014de60 > (XEN) 0000000000000000 0000000000000000 00000000000060a9 000000004bf51982 > (XEN) ffff8802d2665b28 0000000000000000 0000000000000000 0000000000007ff0 > (XEN) 0000000000000022 0000000000000000 000000024bf57322 0000000001307da0 > (XEN) 00000000000059a0 0000000000000000 00000000000060a9 0000002000000000 > (XEN) ffffffff8123c51a 000000000000e033 0000000000000293 ffff8802d2665b08 > (XEN) 000000000000e02b 0000000000000000 0000000000000000 0000000000000000 > (XEN) 0000000000000000 0000000000000000 ffff8300ca9a0000 0000000000000000 > (XEN) 0000000000000000 > (XEN) Xen call trace: > (XEN) [<ffff82c48015e2db>] smp_irq_move_cleanup_interrupt+0x216/0x28e > > >So vector e9 doesn''t appear to be programmed in anywhere. 
I am starting to get more into the realm of guessing here, but can you use apic_verbosity=debug on the command line and copy this extra debugging logic into send_cleanup_vector()?

You should be able to trigger it conditionally on "desc->arch.vector == 0xe9". You will probably also want to change the BUG() to a WARN(), so we get the interrupt and IO-APIC information on both sides of the cleanup vector, as well as the stack trace of the code path through Xen taken as a result of vector 0xe9.

~Andrew
Marek Marczykowski
2013-Mar-27 08:50 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
On 26.03.2013 19:50, Andrew Cooper wrote:
> On 26/03/2013 18:21, Marek Marczykowski wrote:
>> [full dump_irqs()/__print_IO_APIC() output quoted in the previous
>> message - snipped]
>
> So vector e9 doesn't appear to be programmed in anywhere.
>
> I am starting to get more into the realm of guessing here, but can you
> use apic_verbosity=debug on the command line and copy this extra
> debugging logic into send_cleanup_vector()?
>
> You should be able to trigger it conditionally on "desc->arch.vector ==
> 0xe9". You will probably also want to change the BUG() to a WARN(), so
> we get the interrupt and IO-APIC information on both sides of the
> cleanup vector, as well as the stack trace of the code path through Xen
> taken as a result of vector 0xe9.

send_cleanup_vector() doesn't seem to be called with cfg->vector == 0xe9...
Can dom0 mess something here around?

--
Best Regards / Pozdrawiam,
Marek Marczykowski
Invisible Things Lab

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
Jan Beulich
2013-Mar-27 08:52 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
>>> On 26.03.13 at 19:50, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> So vector e9 doesn't appear to be programmed in anywhere.

Quite obviously, as it's the 8259A vector for IRQ 9. The question really is why an IRQ appears on that vector in the first place. The 8259A resume code _should_ leave all IRQs masked on a fully IO-APIC system (see my question raised yesterday). And that's also why I suggested, for an experiment, to fiddle with the loop exit condition to exclude legacy vectors (which wouldn't be a final solution, but would at least tell us whether the direction is the right one).

In the end, besides understanding why an interrupt on vector E9 gets raised at all, we may also need to tweak the IRQ migration logic to not do anything on legacy IRQs, but that would need to happen earlier than in smp_irq_move_cleanup_interrupt().

Considering that 4.3 apparently doesn't have this problem, we may need to go hunt for a change that isn't directly connected to this, yet deals with the problem as a side effect (at least I don't recall any particular fix since 4.2). One aspect here is the double mapping of legacy IRQs (once to their IO-APIC vector, and once to their legacy vector, i.e. vector_irq[] having two entries pointing to the same IRQ).

Jan
Jan Beulich
2013-Mar-27 08:58 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
>>> On 27.03.13 at 09:50, Marek Marczykowski <marmarek@invisiblethingslab.com> wrote:
> send_cleanup_vector() doesn't seem to be called with cfg->vector == 0xe9...
> Can dom0 mess something here around?

Of course not - I suppose it is being called for IRQ9 (with whatever vector the IO-APIC has set for that IRQ at that point in time).

Jan
Jan Beulich
2013-Mar-27 09:03 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
>>> On 27.03.13 at 09:52, "Jan Beulich" <JBeulich@suse.com> wrote:
>>>> On 26.03.13 at 19:50, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>> So vector e9 doesn't appear to be programmed in anywhere.
>
> Quite obviously, as it's the 8259A vector for IRQ 9. The question
> really is why an IRQ appears on that vector in the first place. The
> 8259A resume code _should_ leave all IRQs masked on a fully
> IO-APIC system (see my question raised yesterday).

So to put this in consumable form: Please log what i8259A_resume() writes to ports 21 and A1 (i.e. cached_21 and cached_A1), and also dump those ports' contents at the crash point (i.e. alongside the dump_irqs()).

Jan
Marek Marczykowski
2013-Mar-27 14:01 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
On 27.03.2013 10:03, Jan Beulich wrote:
>>>> On 27.03.13 at 09:52, "Jan Beulich" <JBeulich@suse.com> wrote:
>>>>> On 26.03.13 at 19:50, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>>> So vector e9 doesn't appear to be programmed in anywhere.
>>
>> Quite obviously, as it's the 8259A vector for IRQ 9. The question
>> really is why an IRQ appears on that vector in the first place. The
>> 8259A resume code _should_ leave all IRQs masked on a fully
>> IO-APIC system (see my question raised yesterday).
>
> So to put this in consumable form: Please log what i8259A_resume()
> writes to ports 21 and A1 (i.e. cached_21 and cached_A1), and also
> dump those ports' contents at the crash point (i.e. alongside the
> dump_irqs()).

I've noticed that not all messages are available on the serial console - in particular, nothing from inside i8259A_resume(). So I changed BUG to WARN and got some additional lines.

Ports: 21:0xfb, A1:0xff (the same in i8259A_resume() as at the crash point).

Part of http://duch.mimuw.edu.pl/~marmarek/qubes/xen-4.1-failed-resume-dump-irqs3.log:

(XEN) Preparing system for ACPI S3 state.
(XEN) Disabling non-boot CPUs ...
(XEN) Broke affinity for irq 1
(XEN) Broke affinity for irq 12
(XEN) Broke affinity for irq 17
(XEN) [VT-D]intremap.c:552: remap_entry_to_msi_msg: index (65535) get an empty entry!
(XEN) Broke affinity for irq 27
(XEN) Broke affinity for irq 1
(XEN) Broke affinity for irq 7
(XEN) Broke affinity for irq 9
(XEN) Broke affinity for irq 16
(XEN) Broke affinity for irq 20
(XEN) [VT-D]intremap.c:552: remap_entry_to_msi_msg: index (65535) get an empty entry!
(XEN) Broke affinity for irq 32
(XEN) Broke affinity for irq 36
(XEN) Broke affinity for irq 1
(XEN) Broke affinity for irq 7
(XEN) Broke affinity for irq 20
(XEN) [VT-D]intremap.c:552: remap_entry_to_msi_msg: index (65535) get an empty entry!
(XEN) Broke affinity for irq 28
(XEN) Broke affinity for irq 29
(XEN) Broke affinity for irq 30
(XEN) Broke affinity for irq 31
(XEN) Entering ACPI S3 state.
(XEN) i8259A_suspend: cached_21: 0xfb, cached_A1: 0xff (XEN) i8259A_resume: cached_21: 0xfb, cached_A1: 0xff (XEN) mce_intel.c:1162: MCA Capability: BCAST 1 SER 0 CMCI 1 firstbank 0 extended MCE MSR 0 (XEN) CPU0 CMCI LVT vector (0xf7) already installed (XEN) CPU0: Thermal LVT vector (0xfa) already installed (XEN) Finishing wakeup from ACPI S3 state. (XEN) Enabling non-boot CPUs ... (XEN) Suppress EOI broadcast on CPU#1 (XEN) masked ExtINT on CPU#1 (XEN) Suppress EOI broadcast on CPU#2 (XEN) masked ExtINT on CPU#2 (XEN) Suppress EOI broadcast on CPU#3 (XEN) masked ExtINT on CPU#3 (XEN) *** IRQ BUG found *** (XEN) CPU0 -Testing vector 233 from bitmap 44,49,57,64,68,72,76,80,84,88,96,100,108,112,120,122,144,152,154,160,168,192,194,200,208,211,218-219 (XEN) Guest interrupt information: (XEN) IRQ: 0 affinity:00000000,00000000,00000000,00000001 vec:f0 type=IO-APIC-edge status=00000000 mapped, unbound (XEN) IRQ: 1 affinity:00000000,00000000,00000000,00000002 vec:db type=IO-APIC-edge status=00000050 in-flight=0 domain-list=0: 1(-S--), (XEN) IRQ: 2 affinity:ffffffff,ffffffff,ffffffff,ffffffff vec:e2 type=XT-PIC status=00000000 mapped, unbound (XEN) IRQ: 3 affinity:00000000,00000000,00000000,00000001 vec:40 type=IO-APIC-edge status=00000002 mapped, unbound (XEN) IRQ: 4 affinity:00000000,00000000,00000000,00000001 vec:f1 type=IO-APIC-edge status=00000000 mapped, unbound (XEN) IRQ: 5 affinity:00000000,00000000,00000000,00000001 vec:48 type=IO-APIC-edge status=00000002 mapped, unbound (XEN) IRQ: 6 affinity:00000000,00000000,00000000,00000001 vec:50 type=IO-APIC-edge status=00000002 mapped, unbound (XEN) IRQ: 7 affinity:00000000,00000000,00000000,00000004 vec:7a type=IO-APIC-edge status=00000050 in-flight=0 domain-list=0: 7(-S--), (XEN) IRQ: 8 affinity:00000000,00000000,00000000,00000001 vec:60 type=IO-APIC-edge status=00000050 in-flight=0 domain-list=0: 8(-S--), (XEN) IRQ: 9 affinity:00000000,00000000,00000000,00000001 vec:64 type=IO-APIC-level status=00000010 in-flight=0 
domain-list=0: 9(-S--), (XEN) IRQ: 10 affinity:00000000,00000000,00000000,00000001 vec:70 type=IO-APIC-edge status=00000002 mapped, unbound (XEN) IRQ: 11 affinity:00000000,00000000,00000000,00000001 vec:78 type=IO-APIC-edge status=00000002 mapped, unbound (XEN) IRQ: 12 affinity:00000000,00000000,00000000,00000001 vec:4c type=IO-APIC-edge status=00000050 in-flight=0 domain-list=0: 12(-S--), (XEN) IRQ: 13 affinity:00000000,00000000,00000000,0000000f vec:90 type=IO-APIC-edge status=00000002 mapped, unbound (XEN) IRQ: 14 affinity:00000000,00000000,00000000,00000001 vec:98 type=IO-APIC-edge status=00000002 mapped, unbound (XEN) IRQ: 15 affinity:00000000,00000000,00000000,00000001 vec:a0 type=IO-APIC-edge status=00000002 mapped, unbound (XEN) IRQ: 16 affinity:00000000,00000000,00000000,00000001 vec:6c type=IO-APIC-level status=00000050 in-flight=0 domain-list=0: 16(-S--), (XEN) IRQ: 17 affinity:00000000,00000000,00000000,00000001 vec:54 type=IO-APIC-level status=00000050 in-flight=0 domain-list=0: 17(-S--), (XEN) IRQ: 18 affinity:00000000,00000000,00000000,00000008 vec:39 type=IO-APIC-level status=00000002 mapped, unbound (XEN) IRQ: 19 affinity:00000000,00000000,00000000,0000000f vec:c8 type=IO-APIC-level status=00000002 mapped, unbound (XEN) IRQ: 20 affinity:00000000,00000000,00000000,00000004 vec:da type=IO-APIC-level status=00000050 in-flight=0 domain-list=0: 20(-S--), (XEN) IRQ: 22 affinity:00000000,00000000,00000000,0000000f vec:9a type=IO-APIC-level status=00000002 mapped, unbound (XEN) IRQ: 23 affinity:00000000,00000000,00000000,0000000f vec:a8 type=IO-APIC-edge status=00000002 mapped, unbound (XEN) IRQ: 24 affinity:00000000,00000000,00000000,00000001 vec:28 type=DMA_MSI status=00000000 mapped, unbound (XEN) IRQ: 25 affinity:00000000,00000000,00000000,00000001 vec:30 type=DMA_MSI status=00000000 mapped, unbound (XEN) IRQ: 26 affinity:00000000,00000000,00000000,00000004 vec:3c type=PCI-MSI status=00000002 mapped, unbound (XEN) IRQ: 27 
affinity:ffffffff,ffffffff,ffffffff,ffffffff vec:9c type=PCI-MSI status=00000042 mapped, unbound (XEN) IRQ: 28 affinity:ffffffff,ffffffff,ffffffff,ffffffff vec:a4 type=PCI-MSI status=00000042 mapped, unbound (XEN) IRQ: 29 affinity:ffffffff,ffffffff,ffffffff,ffffffff vec:ac type=PCI-MSI status=00000042 mapped, unbound (XEN) IRQ: 32 affinity:00000000,00000000,00000000,00000001 vec:74 type=PCI-MSI status=00000050 in-flight=0 domain-list=0:273(-S--), (XEN) IRQ: 33 affinity:00000000,00000000,00000000,00000004 vec:8c type=PCI-MSI status=00000010 in-flight=0 domain-list=0:272(PS--), (XEN) IRQ: 34 affinity:00000000,00000000,00000000,00000001 vec:94 type=PCI-MSI status=00000010 in-flight=0 domain-list=0:271(-S--), (XEN) IRQ: 35 affinity:00000000,00000000,00000000,00000004 vec:d9 type=PCI-MSI status=00000042 mapped, unbound (XEN) IRQ: 36 affinity:00000000,00000000,00000000,00000001 vec:7c type=PCI-MSI status=00000050 in-flight=0 domain-list=1: 54(-S--), (XEN) IO-APIC interrupt information: (XEN) IRQ 0 Vec240: (XEN) Apic 0x00, Pin 2: vec=f0 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=0 dest_id:0 (XEN) IRQ 1 Vec219: (XEN) Apic 0x00, Pin 1: vec=db delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=0 dest_id:0 (XEN) IRQ 3 Vec 64: (XEN) Apic 0x00, Pin 3: vec=40 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=0 dest_id:0 (XEN) IRQ 4 Vec241: (XEN) Apic 0x00, Pin 4: vec=f1 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=0 dest_id:0 (XEN) IRQ 5 Vec 72: (XEN) Apic 0x00, Pin 5: vec=48 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=0 dest_id:0 (XEN) IRQ 6 Vec 80: (XEN) Apic 0x00, Pin 6: vec=50 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=0 dest_id:0 (XEN) IRQ 7 Vec122: (XEN) Apic 0x00, Pin 7: vec=7a delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=0 dest_id:0 (XEN) IRQ 8 Vec 96: (XEN) Apic 0x00, Pin 8: vec=60 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=0 dest_id:0 (XEN) IRQ 9 Vec100: 
(XEN) Apic 0x00, Pin 9: vec=64 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=L mask=0 dest_id:0 (XEN) IRQ 10 Vec112: (XEN) Apic 0x00, Pin 10: vec=70 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=0 dest_id:0 (XEN) IRQ 11 Vec120: (XEN) Apic 0x00, Pin 11: vec=78 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=0 dest_id:0 (XEN) IRQ 12 Vec 76: (XEN) Apic 0x00, Pin 12: vec=4c delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=0 dest_id:0 (XEN) IRQ 13 Vec144: (XEN) Apic 0x00, Pin 13: vec=90 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=1 dest_id:0 (XEN) IRQ 14 Vec152: (XEN) Apic 0x00, Pin 14: vec=98 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=0 dest_id:0 (XEN) IRQ 15 Vec160: (XEN) Apic 0x00, Pin 15: vec=a0 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=0 dest_id:0 (XEN) IRQ 16 Vec108: (XEN) Apic 0x00, Pin 16: vec=6c delivery=LoPri dest=L status=0 polarity=1 irr=0 trig=L mask=0 dest_id:0 (XEN) IRQ 17 Vec 84: (XEN) Apic 0x00, Pin 17: vec=54 delivery=LoPri dest=L status=0 polarity=1 irr=0 trig=L mask=0 dest_id:0 (XEN) IRQ 18 Vec 57: (XEN) Apic 0x00, Pin 18: vec=39 delivery=LoPri dest=L status=0 polarity=1 irr=0 trig=L mask=1 dest_id:0 (XEN) IRQ 19 Vec200: (XEN) Apic 0x00, Pin 19: vec=c8 delivery=LoPri dest=L status=0 polarity=1 irr=0 trig=L mask=1 dest_id:0 (XEN) IRQ 20 Vec218: (XEN) Apic 0x00, Pin 20: vec=da delivery=LoPri dest=L status=0 polarity=1 irr=0 trig=L mask=0 dest_id:0 (XEN) IRQ 22 Vec154: (XEN) Apic 0x00, Pin 22: vec=9a delivery=LoPri dest=L status=0 polarity=1 irr=0 trig=L mask=1 dest_id:0 (XEN) IRQ 23 Vec168: (XEN) Apic 0x00, Pin 23: vec=a8 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=1 dest_id:0 (XEN) number of MP IRQ sources: 15. (XEN) number of IO-APIC #2 registers: 24. (XEN) testing the IO APIC....................... (XEN) IO APIC #2...... (XEN) .... register #00: 02000000 (XEN) ....... : physical APIC id: 02 (XEN) ....... : Delivery Type: 0 (XEN) ....... 
: LTS : 0 (XEN) .... register #01: 00170020 (XEN) ....... : max redirection entries: 0017 (XEN) ....... : PRQ implemented: 0 (XEN) ....... : IO APIC version: 0020 (XEN) .... IRQ redirection table: (XEN) NR Log Phy Mask Trig IRR Pol Stat Dest Deli Vect: (XEN) 00 000 00 1 0 0 0 0 0 0 00 (XEN) 01 000 00 0 0 0 0 0 1 1 DB (XEN) 02 000 00 0 0 0 0 0 1 1 F0 (XEN) 03 000 00 0 0 0 0 0 1 1 40 (XEN) 04 000 00 0 0 0 0 0 1 1 F1 (XEN) 05 000 00 0 0 0 0 0 1 1 48 (XEN) 06 000 00 0 0 0 0 0 1 1 50 (XEN) 07 000 00 0 0 0 0 0 1 1 7A (XEN) 08 000 00 0 0 0 0 0 1 1 60 (XEN) 09 000 00 0 1 0 0 0 1 1 64 (XEN) 0a 000 00 0 0 0 0 0 1 1 70 (XEN) 0b 000 00 0 0 0 0 0 1 1 78 (XEN) 0c 000 00 0 0 0 0 0 1 1 4C (XEN) 0d 000 00 1 0 0 0 0 1 1 90 (XEN) 0e 000 00 0 0 0 0 0 1 1 98 (XEN) 0f 000 00 0 0 0 0 0 1 1 A0 (XEN) 10 000 00 0 1 0 1 0 1 1 6C (XEN) 11 000 00 0 1 0 1 0 1 1 54 (XEN) 12 000 00 1 1 0 1 0 1 1 39 (XEN) 13 000 00 1 1 0 1 0 1 1 C8 (XEN) 14 000 00 0 1 0 1 0 1 1 DA (XEN) 15 000 00 1 0 0 0 0 0 0 00 (XEN) 16 000 00 1 1 0 1 0 1 1 9A (XEN) 17 000 00 1 0 0 0 0 1 1 A8 (XEN) Using vector-based indexing (XEN) IRQ to pin mappings: (XEN) IRQ240 -> 0:2 (XEN) IRQ219 -> 0:1 (XEN) IRQ64 -> 0:3 (XEN) IRQ241 -> 0:4 (XEN) IRQ72 -> 0:5 (XEN) IRQ80 -> 0:6 (XEN) IRQ122 -> 0:7 (XEN) IRQ96 -> 0:8 (XEN) IRQ100 -> 0:9 (XEN) IRQ112 -> 0:10 (XEN) IRQ120 -> 0:11 (XEN) IRQ76 -> 0:12 (XEN) IRQ144 -> 0:13 (XEN) IRQ152 -> 0:14 (XEN) IRQ160 -> 0:15 (XEN) IRQ108 -> 0:16 (XEN) IRQ84 -> 0:17 (XEN) IRQ57 -> 0:18 (XEN) IRQ200 -> 0:19 (XEN) IRQ218 -> 0:20 (XEN) IRQ154 -> 0:22 (XEN) IRQ168 -> 0:23 (XEN) .................................... done. 
(XEN) i8259: 21: 0xfb, A1: 0xff
(XEN) Xen WARN at io_apic.c:558
(XEN) ----[ Xen-4.1.5-rc1 x86_64 debug=y Tainted: C ]----
(XEN) CPU: 0
(XEN) RIP: e008:[<ffff82c48015e341>] smp_irq_move_cleanup_interrupt+0x23c/0x2bc
(XEN) RFLAGS: 0000000000010086 CONTEXT: hypervisor
(XEN) rax: 0000000000000000 rbx: 00000000000000e9 rcx: 0000000000000000
(XEN) rdx: 0000000000000000 rsi: 000000000000000a rdi: ffff82c4802592e0
(XEN) rbp: ffff82c48029fb58 rsp: ffff82c48029fb08 r8: 0000000000000004
(XEN) r9: 0000000000000001 r10: 00000000000000ff r11: 0000000000000002
(XEN) r12: ffff830421080250 r13: ffff830421060534 r14: ffff82c48029ff18
(XEN) r15: ffff82c4802dd9e0 cr0: 000000008005003b cr4: 00000000000026f0
(XEN) cr3: 000000037e7a8000 cr2: ffff880402070318
(XEN) ds: 0000 es: 0000 fs: 0000 gs: 0000 ss: e010 cs: e008
(XEN) Xen stack trace from rsp=ffff82c48029fb08:
(XEN) 0000000000000000 0000000000000008 ffff82c48029ff18 ffff82c4802dd9e0
(XEN) ffff82c48029fb58 0000000000000004 0000000000000000 0000000080030014
(XEN) 0000000000000000 0000000000000000 00007d3b7fd60477 ffff82c48014de60
(XEN) 0000000000000000 0000000000000000 0000000080030014 0000000000000000
(XEN) ffff82c48029fc18 0000000000000004 0000000000000246 0000000000000000
(XEN) 00000000ffffffff 00000000ffffffff 0000000000000000 0000000000000001
(XEN) 0000000000000cfc 0000000000000282 ffff82c48025a9c0 0000002000000000
(XEN) ffff82c4801226c0 000000000000e008 0000000000000282 ffff82c48029fc18
(XEN) 000000000000e010 0000000000000282 ffff82c48029fc48 ffff82c480175950
(XEN) 0000000000000202 0000000000000006 0000000000000010 00000000e2200004
(XEN) ffff82c48029fc68 ffff82c4802105dc ffff82c48029fc78 ffff82c480122614
(XEN) ffff82c48029fcc8 ffff82c480160183 ffff82c48029fca8 ffff82c480175950
(XEN) 000082c4ffffffff 0000000000000003 ffff8301108fd1c0 ffff830421050ac0
(XEN) ffff8301108fd1c0 0000000000000000 0000000000000000 0000000000000003
(XEN) ffff82c48029fd58 ffff82c48016033a 000000000000002f 0000000000000082
(XEN) 000782c48029fd08 ffff82c48029fe10 0000006a00000008 ffff82c48029fe78
(XEN) 0000000300000068 0000000000000000 0000000000002000 ffff82c4ffffffff
(XEN) ffff82c48029fe10 ffff82c48029fe78 ffff82c48029fe10 ffff830421050ac0
(XEN) 0000000000000000 000000000000001e ffff82c48029fdc8 ffff82c4801610ef
(XEN) ffff82c48029fdb8 ffff82c480115ec5 0000000000000293 ffff83042100a1f8
(XEN) Xen call trace:
(XEN) [<ffff82c48015e341>] smp_irq_move_cleanup_interrupt+0x23c/0x2bc
(XEN) [<ffff82c48014de60>] irq_move_cleanup_interrupt+0x30/0x40
(XEN) [<ffff82c4801226c0>] _spin_unlock_irqrestore+0x22/0x24
(XEN) [<ffff82c480175950>] pci_conf_read+0xb0/0xc1
(XEN) [<ffff82c4802105dc>] pci_conf_read32+0x7c/0x7e
(XEN) [<ffff82c480160183>] read_pci_mem_bar+0x2b0/0x303
(XEN) [<ffff82c48016033a>] msix_capability_init+0x164/0x5fa
(XEN) [<ffff82c4801610ef>] pci_enable_msi+0x19b/0x49b
(XEN) [<ffff82c4801643bd>] map_domain_pirq+0x281/0x3df
(XEN) [<ffff82c4801765cb>] do_physdev_op+0xa2b/0x1508
(XEN) [<ffff82c480209fa8>] syscall_enter+0xc8/0x122
(XEN)

--
Best Regards / Pozdrawiam,
Marek Marczykowski
Invisible Things Lab
Marek Marczykowski
2013-Mar-27 14:31 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
On 27.03.2013 09:52, Jan Beulich wrote:
>>>> On 26.03.13 at 19:50, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>> So vector e9 doesn't appear to be programmed in anywhere.
>
> Quite obviously, as it's the 8259A vector for IRQ 9. The question
> really is why an IRQ appears on that vector in the first place. The
> 8259A resume code _should_ leave all IRQs masked on a fully
> IO-APIC system (see my question raised yesterday).
>
> And that's also why I suggested, for an experiment, to fiddle with
> the loop exit condition to exclude legacy vectors (which wouldn't
> be a final solution, but would at least tell us whether the direction
> is the right one). In the end, besides understanding why an
> interrupt on vector E9 gets raised at all, we may also need to
> tweak the IRQ migration logic to not do anything on legacy IRQs,
> but that would need to happen earlier than in
> smp_irq_move_cleanup_interrupt(). Considering that 4.3
> apparently doesn't have this problem, we may need to go hunt for
> a change that isn't directly connected to this, yet deals with the
> problem as a side effect (at least I don't recall any particular fix
> since 4.2). One aspect here is the double mapping of legacy IRQs
> (once to their IO-APIC vector, and once to their legacy vector,
> i.e. vector_irq[] having two entries pointing to the same IRQ).

So I tried changing the loop exit condition to LAST_DYNAMIC_VECTOR, and it no longer hits that BUG/ASSERT. But resume still doesn't work: only CPU0 is used by the scheduler, there are some errors from the dom0 kernel, and errors about the PCI devices assigned to domU(1).
Messages from resume (different tries):
http://duch.mimuw.edu.pl/~marmarek/qubes/xen-4.1-last-dynamic-vector.log
http://duch.mimuw.edu.pl/~marmarek/qubes/xen-4.1-last-dynamic-vector2.log

Also, one time I got a fatal page fault earlier in resume (it isn't deterministic):
http://duch.mimuw.edu.pl/~marmarek/qubes/xen-4.1-resume-page-fault.log

--
Best Regards / Pozdrawiam,
Marek Marczykowski
Invisible Things Lab
Andrew Cooper
2013-Mar-27 14:46 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
On 27/03/2013 14:31, Marek Marczykowski wrote:> On 27.03.2013 09:52, Jan Beulich wrote: >>>>> On 26.03.13 at 19:50, Andrew Cooper <andrew.cooper3@citrix.com> wrote: >>> So vector e9 doesn''t appear to be programmed in anywhere. >> Quite obviously, as it''s the 8259A vector for IRQ 9. The question >> really is why an IRQ appears on that vector in the first place. The >> 8259A resume code _should_ leave all IRQs masked on a fully >> IO-APIC system (see my question raised yesterday). >> >> And that''s also why I suggested, for an experiment, to fiddle with >> the loop exit condition to exclude legacy vectors (which wouldn''t >> be a final solution, but would at least tell us whether the direction >> is the right one). In the end, besides understanding why an >> interrupt on vector E9 gets raised at all, we may also need to >> tweak the IRQ migration logic to not do anything on legacy IRQs, >> but that would need to happen earlier than in >> smp_irq_move_cleanup_interrupt(). Considering that 4.3 >> apparently doesn''t have this problem, we may need to go hunt for >> a change that isn''t directly connected to this, yet deals with the >> problem as a side effect (at least I don''t recall any particular fix >> since 4.2). One aspect here is the double mapping of legacy IRQs >> (once to their IO-APIC vector, and once to their legacy vector, >> i.e. vector_irq[] having two entries pointing to the same IRQ). > So tried change loop condition to LAST_DYNAMIC_VECTOR and it doesn''t hit that > BUG/ASSERT. But still it doesn''t work - only CPU0 used by scheduler, also some > errors from dom0 kernel, and errors about PCI devices used by domU(1). 
>
> Messages from resume (different tries):
> http://duch.mimuw.edu.pl/~marmarek/qubes/xen-4.1-last-dynamic-vector.log
> http://duch.mimuw.edu.pl/~marmarek/qubes/xen-4.1-last-dynamic-vector2.log
>
> Also one time I've got fatal page fault error, earlier in resume (it isn't
> deterministic):
> http://duch.mimuw.edu.pl/~marmarek/qubes/xen-4.1-resume-page-fault.log
>

This page fault is a NULL structure-pointer dereference, likely of the scheduling data. At first glance it looks related to the assertion failures I have been seeing sporadically in testing but have been unable to reproduce reliably. There seems to be something quite dodgy in the interaction of vcpu_wake and the scheduling loops.

The other logs indicate that dom0 appears to have a domain id of 1, which is sure to cause problems.

As for locating the cause of the legacy vectors, it might be a good idea to stick a printk at the top of do_IRQ() which reports any interrupt with a vector between 0xe0 and 0xef. That would at least indicate whether legacy vectors are genuinely being delivered, or whether some memory corruption is causing these effects.

~Andrew
Marek Marczykowski
2013-Mar-27 14:49 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
On 27.03.2013 15:46, Andrew Cooper wrote:> On 27/03/2013 14:31, Marek Marczykowski wrote: >> On 27.03.2013 09:52, Jan Beulich wrote: >>>>>> On 26.03.13 at 19:50, Andrew Cooper <andrew.cooper3@citrix.com> wrote: >>>> So vector e9 doesn''t appear to be programmed in anywhere. >>> Quite obviously, as it''s the 8259A vector for IRQ 9. The question >>> really is why an IRQ appears on that vector in the first place. The >>> 8259A resume code _should_ leave all IRQs masked on a fully >>> IO-APIC system (see my question raised yesterday). >>> >>> And that''s also why I suggested, for an experiment, to fiddle with >>> the loop exit condition to exclude legacy vectors (which wouldn''t >>> be a final solution, but would at least tell us whether the direction >>> is the right one). In the end, besides understanding why an >>> interrupt on vector E9 gets raised at all, we may also need to >>> tweak the IRQ migration logic to not do anything on legacy IRQs, >>> but that would need to happen earlier than in >>> smp_irq_move_cleanup_interrupt(). Considering that 4.3 >>> apparently doesn''t have this problem, we may need to go hunt for >>> a change that isn''t directly connected to this, yet deals with the >>> problem as a side effect (at least I don''t recall any particular fix >>> since 4.2). One aspect here is the double mapping of legacy IRQs >>> (once to their IO-APIC vector, and once to their legacy vector, >>> i.e. vector_irq[] having two entries pointing to the same IRQ). >> So tried change loop condition to LAST_DYNAMIC_VECTOR and it doesn''t hit that >> BUG/ASSERT. But still it doesn''t work - only CPU0 used by scheduler, also some >> errors from dom0 kernel, and errors about PCI devices used by domU(1). 
>>
>> Messages from resume (different tries):
>> http://duch.mimuw.edu.pl/~marmarek/qubes/xen-4.1-last-dynamic-vector.log
>> http://duch.mimuw.edu.pl/~marmarek/qubes/xen-4.1-last-dynamic-vector2.log
>>
>> Also one time I've got fatal page fault error, earlier in resume (it isn't
>> deterministic):
>> http://duch.mimuw.edu.pl/~marmarek/qubes/xen-4.1-resume-page-fault.log
>>
>
> This pagefault is a Null structure pointer dereference, likely the
> scheduling data. At a first glance, it looks related to the assertion
> failures I have been seeing sporadically in testing, but unable to
> reproduce reliably. There seems to be something quite dodgy with
> interaction of vcpu_wake and scheduling loops.
>
> The other logs indicate that dom0 appears to have a domain id of 1,
> which is sure to cause problems.

Perhaps not - domain 1 exists and has some PCI devices assigned (namely two network adapters).

> As for locating the cause of the legacy vectors, it might be a good idea
> to stick a printk at the top of do_IRQ() which indicates an interrupt
> with vector between 0xe0 and 0xef. This might at least indicate whether
> legacy vectors are genuinely being delivered, or whether we have some
> memory corruption causing these effects.

Ok, will try something like this.

--
Best Regards / Pozdrawiam,
Marek Marczykowski
Invisible Things Lab
Andrew Cooper
2013-Mar-27 14:52 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
On 27/03/2013 14:46, Andrew Cooper wrote:> On 27/03/2013 14:31, Marek Marczykowski wrote: >> On 27.03.2013 09:52, Jan Beulich wrote: >>>>>> On 26.03.13 at 19:50, Andrew Cooper <andrew.cooper3@citrix.com> wrote: >>>> So vector e9 doesn''t appear to be programmed in anywhere. >>> Quite obviously, as it''s the 8259A vector for IRQ 9. The question >>> really is why an IRQ appears on that vector in the first place. The >>> 8259A resume code _should_ leave all IRQs masked on a fully >>> IO-APIC system (see my question raised yesterday). >>> >>> And that''s also why I suggested, for an experiment, to fiddle with >>> the loop exit condition to exclude legacy vectors (which wouldn''t >>> be a final solution, but would at least tell us whether the direction >>> is the right one). In the end, besides understanding why an >>> interrupt on vector E9 gets raised at all, we may also need to >>> tweak the IRQ migration logic to not do anything on legacy IRQs, >>> but that would need to happen earlier than in >>> smp_irq_move_cleanup_interrupt(). Considering that 4.3 >>> apparently doesn''t have this problem, we may need to go hunt for >>> a change that isn''t directly connected to this, yet deals with the >>> problem as a side effect (at least I don''t recall any particular fix >>> since 4.2). One aspect here is the double mapping of legacy IRQs >>> (once to their IO-APIC vector, and once to their legacy vector, >>> i.e. vector_irq[] having two entries pointing to the same IRQ). >> So tried change loop condition to LAST_DYNAMIC_VECTOR and it doesn''t hit that >> BUG/ASSERT. But still it doesn''t work - only CPU0 used by scheduler, also some >> errors from dom0 kernel, and errors about PCI devices used by domU(1). 
>>
>> Messages from resume (different tries):
>> http://duch.mimuw.edu.pl/~marmarek/qubes/xen-4.1-last-dynamic-vector.log
>> http://duch.mimuw.edu.pl/~marmarek/qubes/xen-4.1-last-dynamic-vector2.log
>>
>> Also one time I've got fatal page fault error, earlier in resume (it isn't
>> deterministic):
>> http://duch.mimuw.edu.pl/~marmarek/qubes/xen-4.1-resume-page-fault.log
>>
> This pagefault is a Null structure pointer dereference, likely the
> scheduling data. At a first glance, it looks related to the assertion
> failures I have been seeing sporadically in testing, but unable to
> reproduce reliably. There seems to be something quite dodgy with
> interaction of vcpu_wake and scheduling loops.
>
> The other logs indicate that dom0 appears to have a domain id of 1,
> which is sure to cause problems.

Actually - ignore this. From the log:

(XEN) physdev.c:153: dom0: can't create irq for msi!
[ 113.637037] xhci_hcd 0000:03:00.0: xen map irq failed -22 for 32752 domain
(XEN) physdev.c:153: dom0: can't create irq for msi!
[ 113.657911] xhci_hcd 0000:03:00.0: xen map irq failed -22 for 32752 domain

and later:

(XEN) physdev.c:153: dom1: can't create irq for msi!
[ 121.909814] pciback 0000:00:19.0: xen map irq failed -22 for 1 domain
[ 121.954080] error enable msi for guest 1 status ffffffea
(XEN) physdev.c:153: dom1: can't create irq for msi!
[ 122.035355] pciback 0000:00:19.0: xen map irq failed -22 for 1 domain
[ 122.044421] error enable msi for guest 1 status ffffffea

I think there is a separate bug where mapped irqs are not unmapped on the suspend path.

> As for locating the cause of the legacy vectors, it might be a good idea
> to stick a printk at the top of do_IRQ() which indicates an interrupt
> with vector between 0xe0 and 0xef. This might at least indicate whether
> legacy vectors are genuinely being delivered, or whether we have some
> memory corruption causing these effects.
>
> ~Andrew
Konrad Rzeszutek Wilk
2013-Mar-27 15:47 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
On Wed, Mar 27, 2013 at 02:52:14PM +0000, Andrew Cooper wrote:> On 27/03/2013 14:46, Andrew Cooper wrote: > > On 27/03/2013 14:31, Marek Marczykowski wrote: > >> On 27.03.2013 09:52, Jan Beulich wrote: > >>>>>> On 26.03.13 at 19:50, Andrew Cooper <andrew.cooper3@citrix.com> wrote: > >>>> So vector e9 doesn''t appear to be programmed in anywhere. > >>> Quite obviously, as it''s the 8259A vector for IRQ 9. The question > >>> really is why an IRQ appears on that vector in the first place. The > >>> 8259A resume code _should_ leave all IRQs masked on a fully > >>> IO-APIC system (see my question raised yesterday). > >>> > >>> And that''s also why I suggested, for an experiment, to fiddle with > >>> the loop exit condition to exclude legacy vectors (which wouldn''t > >>> be a final solution, but would at least tell us whether the direction > >>> is the right one). In the end, besides understanding why an > >>> interrupt on vector E9 gets raised at all, we may also need to > >>> tweak the IRQ migration logic to not do anything on legacy IRQs, > >>> but that would need to happen earlier than in > >>> smp_irq_move_cleanup_interrupt(). Considering that 4.3 > >>> apparently doesn''t have this problem, we may need to go hunt for > >>> a change that isn''t directly connected to this, yet deals with the > >>> problem as a side effect (at least I don''t recall any particular fix > >>> since 4.2). One aspect here is the double mapping of legacy IRQs > >>> (once to their IO-APIC vector, and once to their legacy vector, > >>> i.e. vector_irq[] having two entries pointing to the same IRQ). > >> So tried change loop condition to LAST_DYNAMIC_VECTOR and it doesn''t hit that > >> BUG/ASSERT. But still it doesn''t work - only CPU0 used by scheduler, also some > >> errors from dom0 kernel, and errors about PCI devices used by domU(1). 
> >> > >> Messages from resume (different tries): > >> http://duch.mimuw.edu.pl/~marmarek/qubes/xen-4.1-last-dynamic-vector.log > >> http://duch.mimuw.edu.pl/~marmarek/qubes/xen-4.1-last-dynamic-vector2.log > >> > >> Also one time I''ve got fatal page fault error, earlier in resume (it isn''t > >> deterministic): > >> http://duch.mimuw.edu.pl/~marmarek/qubes/xen-4.1-resume-page-fault.log > >> > > This pagefault is a Null structure pointer dereference, likely the > > scheduling data. At a first glance, it looks related to the assertion > > failures I have been seeing sporadically in testing, but unable to > > reproduce reliably. There seems to be something quite dodgy with > > interaction of vcpu_wake and scheduling loops. > > > > The other logs indicate that dom0 appears to have a domain id of 1, > > which is sure to cause problems. > > Actually - ignore this > > >From the log, > > (XEN) physdev.c:153: dom0: can''t create irq for msi! > [ 113.637037] xhci_hcd 0000:03:00.0: xen map irq failed -22 for 32752 > domain > (XEN) physdev.c:153: dom0: can''t create irq for msi! > [ 113.657911] xhci_hcd 0000:03:00.0: xen map irq failed -22 for 32752 > domain > > and later > > (XEN) physdev.c:153: dom1: can''t create irq for msi! > [ 121.909814] pciback 0000:00:19.0: xen map irq failed -22 for 1 domain > [ 121.954080] error enable msi for guest 1 status ffffffea > (XEN) physdev.c:153: dom1: can''t create irq for msi! > [ 122.035355] pciback 0000:00:19.0: xen map irq failed -22 for 1 domain > [ 122.044421] error enable msi for guest 1 status ffffffea > > I think that there is a separate bug where mapped irqs are not unmapped > on the suspend path.You thinking this is a Linux (xen irq machinery) issue? Meaning it should end up calling PHYSDEV_unmap_pirq as part of the suspend process?
Marek Marczykowski
2013-Mar-27 15:51 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
On 27.03.2013 15:49, Marek Marczykowski wrote:
> On 27.03.2013 15:46, Andrew Cooper wrote:
>> As for locating the cause of the legacy vectors, it might be a good idea
>> to stick a printk at the top of do_IRQ() which indicates an interrupt
>> with vector between 0xe0 and 0xef. This might at least indicate whether
>> legacy vectors are genuinely being delivered, or whether we have some
>> memory corruption causing these effects.
>
> Ok, will try something like this.

Nothing interesting here... only vector 0xf1 for irq 4 and 0xf0 for irq 0 (which matches the irq dump information).

--
Best Regards / Pozdrawiam,
Marek Marczykowski
Invisible Things Lab
Andrew Cooper
2013-Mar-27 16:27 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
On 27/03/2013 15:51, Marek Marczykowski wrote:
> On 27.03.2013 15:49, Marek Marczykowski wrote:
>> On 27.03.2013 15:46, Andrew Cooper wrote:
>>> As for locating the cause of the legacy vectors, it might be a good idea
>>> to stick a printk at the top of do_IRQ() which indicates an interrupt
>>> with vector between 0xe0 and 0xef. This might at least indicate whether
>>> legacy vectors are genuinely being delivered, or whether we have some
>>> memory corruption causing these effects.
>> Ok, will try something like this.
> Nothing interesting here...
> Only vector 0xf1 for irq 4 and 0xf0 for irq 0 (which match irq dump information).
>

Even in the case where we hit the original assertion? If so, then all I can think is that the move_pending flag for that specific GSI has been corrupted in memory somehow. I wonder whether hexdumping irq_desc[9] after setup, before sleep, on resume, and at the point of the assertion failure might give some hints.

~Andrew
Andrew Cooper
2013-Mar-27 16:56 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
On 27/03/2013 15:47, Konrad Rzeszutek Wilk wrote:> On Wed, Mar 27, 2013 at 02:52:14PM +0000, Andrew Cooper wrote: >> On 27/03/2013 14:46, Andrew Cooper wrote: >>> On 27/03/2013 14:31, Marek Marczykowski wrote: >>>> On 27.03.2013 09:52, Jan Beulich wrote: >>>>>>>> On 26.03.13 at 19:50, Andrew Cooper <andrew.cooper3@citrix.com> wrote: >>>>>> So vector e9 doesn''t appear to be programmed in anywhere. >>>>> Quite obviously, as it''s the 8259A vector for IRQ 9. The question >>>>> really is why an IRQ appears on that vector in the first place. The >>>>> 8259A resume code _should_ leave all IRQs masked on a fully >>>>> IO-APIC system (see my question raised yesterday). >>>>> >>>>> And that''s also why I suggested, for an experiment, to fiddle with >>>>> the loop exit condition to exclude legacy vectors (which wouldn''t >>>>> be a final solution, but would at least tell us whether the direction >>>>> is the right one). In the end, besides understanding why an >>>>> interrupt on vector E9 gets raised at all, we may also need to >>>>> tweak the IRQ migration logic to not do anything on legacy IRQs, >>>>> but that would need to happen earlier than in >>>>> smp_irq_move_cleanup_interrupt(). Considering that 4.3 >>>>> apparently doesn''t have this problem, we may need to go hunt for >>>>> a change that isn''t directly connected to this, yet deals with the >>>>> problem as a side effect (at least I don''t recall any particular fix >>>>> since 4.2). One aspect here is the double mapping of legacy IRQs >>>>> (once to their IO-APIC vector, and once to their legacy vector, >>>>> i.e. vector_irq[] having two entries pointing to the same IRQ). >>>> So tried change loop condition to LAST_DYNAMIC_VECTOR and it doesn''t hit that >>>> BUG/ASSERT. But still it doesn''t work - only CPU0 used by scheduler, also some >>>> errors from dom0 kernel, and errors about PCI devices used by domU(1). 
>>>> >>>> Messages from resume (different tries): >>>> http://duch.mimuw.edu.pl/~marmarek/qubes/xen-4.1-last-dynamic-vector.log >>>> http://duch.mimuw.edu.pl/~marmarek/qubes/xen-4.1-last-dynamic-vector2.log >>>> >>>> Also one time I''ve got fatal page fault error, earlier in resume (it isn''t >>>> deterministic): >>>> http://duch.mimuw.edu.pl/~marmarek/qubes/xen-4.1-resume-page-fault.log >>>> >>> This pagefault is a Null structure pointer dereference, likely the >>> scheduling data. At a first glance, it looks related to the assertion >>> failures I have been seeing sporadically in testing, but unable to >>> reproduce reliably. There seems to be something quite dodgy with >>> interaction of vcpu_wake and scheduling loops. >>> >>> The other logs indicate that dom0 appears to have a domain id of 1, >>> which is sure to cause problems. >> Actually - ignore this >> >> >From the log, >> >> (XEN) physdev.c:153: dom0: can''t create irq for msi! >> [ 113.637037] xhci_hcd 0000:03:00.0: xen map irq failed -22 for 32752 >> domain >> (XEN) physdev.c:153: dom0: can''t create irq for msi! >> [ 113.657911] xhci_hcd 0000:03:00.0: xen map irq failed -22 for 32752 >> domain >> >> and later >> >> (XEN) physdev.c:153: dom1: can''t create irq for msi! >> [ 121.909814] pciback 0000:00:19.0: xen map irq failed -22 for 1 domain >> [ 121.954080] error enable msi for guest 1 status ffffffea >> (XEN) physdev.c:153: dom1: can''t create irq for msi! >> [ 122.035355] pciback 0000:00:19.0: xen map irq failed -22 for 1 domain >> [ 122.044421] error enable msi for guest 1 status ffffffea >> >> I think that there is a separate bug where mapped irqs are not unmapped >> on the suspend path. > You thinking this is a Linux (xen irq machinery) issue? Meaning it should > end up calling PHYSDEV_unmap_pirq as part of the suspend process?I am not sure. Without looking at the code, I am only speculating. Beyond that, the main question is about the expected behaviour. 
Do we expect dom0/U to unmap its irqs and remap them after resume? What do we expect from domains which are unaware of the host sleep action? ~Andrew
Marek Marczykowski
2013-Mar-27 17:15 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
On 27.03.2013 17:56, Andrew Cooper wrote:> On 27/03/2013 15:47, Konrad Rzeszutek Wilk wrote: >> On Wed, Mar 27, 2013 at 02:52:14PM +0000, Andrew Cooper wrote: >>> On 27/03/2013 14:46, Andrew Cooper wrote: >>>> On 27/03/2013 14:31, Marek Marczykowski wrote: >>>>> On 27.03.2013 09:52, Jan Beulich wrote: >>>>>>>>> On 26.03.13 at 19:50, Andrew Cooper <andrew.cooper3@citrix.com> wrote: >>>>>>> So vector e9 doesn''t appear to be programmed in anywhere. >>>>>> Quite obviously, as it''s the 8259A vector for IRQ 9. The question >>>>>> really is why an IRQ appears on that vector in the first place. The >>>>>> 8259A resume code _should_ leave all IRQs masked on a fully >>>>>> IO-APIC system (see my question raised yesterday). >>>>>> >>>>>> And that''s also why I suggested, for an experiment, to fiddle with >>>>>> the loop exit condition to exclude legacy vectors (which wouldn''t >>>>>> be a final solution, but would at least tell us whether the direction >>>>>> is the right one). In the end, besides understanding why an >>>>>> interrupt on vector E9 gets raised at all, we may also need to >>>>>> tweak the IRQ migration logic to not do anything on legacy IRQs, >>>>>> but that would need to happen earlier than in >>>>>> smp_irq_move_cleanup_interrupt(). Considering that 4.3 >>>>>> apparently doesn''t have this problem, we may need to go hunt for >>>>>> a change that isn''t directly connected to this, yet deals with the >>>>>> problem as a side effect (at least I don''t recall any particular fix >>>>>> since 4.2). One aspect here is the double mapping of legacy IRQs >>>>>> (once to their IO-APIC vector, and once to their legacy vector, >>>>>> i.e. vector_irq[] having two entries pointing to the same IRQ). >>>>> So tried change loop condition to LAST_DYNAMIC_VECTOR and it doesn''t hit that >>>>> BUG/ASSERT. But still it doesn''t work - only CPU0 used by scheduler, also some >>>>> errors from dom0 kernel, and errors about PCI devices used by domU(1). 
>>>>> >>>>> Messages from resume (different tries): >>>>> http://duch.mimuw.edu.pl/~marmarek/qubes/xen-4.1-last-dynamic-vector.log >>>>> http://duch.mimuw.edu.pl/~marmarek/qubes/xen-4.1-last-dynamic-vector2.log >>>>> >>>>> Also one time I''ve got fatal page fault error, earlier in resume (it isn''t >>>>> deterministic): >>>>> http://duch.mimuw.edu.pl/~marmarek/qubes/xen-4.1-resume-page-fault.log >>>>> >>>> This pagefault is a Null structure pointer dereference, likely the >>>> scheduling data. At a first glance, it looks related to the assertion >>>> failures I have been seeing sporadically in testing, but unable to >>>> reproduce reliably. There seems to be something quite dodgy with >>>> interaction of vcpu_wake and scheduling loops. >>>> >>>> The other logs indicate that dom0 appears to have a domain id of 1, >>>> which is sure to cause problems. >>> Actually - ignore this >>> >>> >From the log, >>> >>> (XEN) physdev.c:153: dom0: can''t create irq for msi! >>> [ 113.637037] xhci_hcd 0000:03:00.0: xen map irq failed -22 for 32752 >>> domain >>> (XEN) physdev.c:153: dom0: can''t create irq for msi! >>> [ 113.657911] xhci_hcd 0000:03:00.0: xen map irq failed -22 for 32752 >>> domain >>> >>> and later >>> >>> (XEN) physdev.c:153: dom1: can''t create irq for msi! >>> [ 121.909814] pciback 0000:00:19.0: xen map irq failed -22 for 1 domain >>> [ 121.954080] error enable msi for guest 1 status ffffffea >>> (XEN) physdev.c:153: dom1: can''t create irq for msi! >>> [ 122.035355] pciback 0000:00:19.0: xen map irq failed -22 for 1 domain >>> [ 122.044421] error enable msi for guest 1 status ffffffea >>> >>> I think that there is a separate bug where mapped irqs are not unmapped >>> on the suspend path. >> You thinking this is a Linux (xen irq machinery) issue? Meaning it should >> end up calling PHYSDEV_unmap_pirq as part of the suspend process? > > I am not sure. Without looking at the code, I am only speculating. 
> > Beyond that, the main question is about the expected behaviour. Do we > expect dom0/U to unmap its irqs and remap them after resume? What do we > expect from domains which are unaware of the host sleep action?BTW this is exactly the case here: domain 1 isn't fully aware of sleep. It has some PCI devices assigned. The only action taken there before suspend is shutting down its network interfaces (without this the system hung during suspend). -- Best Regards / Pozdrawiam, Marek Marczykowski Invisible Things Lab _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Marek Marczykowski
2013-Mar-27 18:16 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
On 27.03.2013 17:27, Andrew Cooper wrote:> On 27/03/2013 15:51, Marek Marczykowski wrote: >> On 27.03.2013 15:49, Marek Marczykowski wrote: >>> On 27.03.2013 15:46, Andrew Cooper wrote: >>>> As for locating the cause of the legacy vectors, it might be a good idea >>>> to stick a printk at the top of do_IRQ() which indicates an interrupt >>>> with vector between 0xe0 and 0xef. This might at least indicate whether >>>> legacy vectors are genuinely being delivered, or whether we have some >>>> memory corruption causing these effects. >>> Ok, will try something like this. >> Nothing interesting here... >> Only vector 0xf1 for irq 4 and 0xf0 for irq 0 (which match irq dump information). >> > > Even in the case where we hit the original assertion?Yes, even then.> If so, then all I can thing is that the move_pending flag for that > specific GSI has been corrupted in memory somehow.I guess this isn't the case, see below.> I wonder if hexdumping irq_desc[9] after setup, before sleep, on resume > and in the case of the assertion failure might give some hints.I've tried something like this. Detailed log here: http://duch.mimuw.edu.pl/~marmarek/qubes/xen-4.1-suspend-irq9-dump.log Some interesting parts. After system startup: (XEN) irq_cfg of IRQ 9: (XEN) vector: 138 (XEN) move_cleanup_count: 0x0 (XEN) move_in_progress: 0x0 (XEN) irq_desc of IRQ 9: (XEN) status: 80 (IRQ_GUEST | IRQ_PENDING) Isn't this wrong (status vs move_in_progress)? Then I ran pm-suspend, intentionally making it fail at the end to prevent the actual suspend, but letting it run all its hooks. After that: (XEN) irq_cfg of IRQ 9: (XEN) vector: 181 (XEN) move_cleanup_count: 0x0 (XEN) move_in_progress: 0x1 (XEN) irq_desc of IRQ 9: (XEN) status: 80 So now move_in_progress is consistent with status. I waited a few seconds and move_in_progress was still 0x1. Isn't it supposed to be only a temporary state? Then I suspended, and at resume hit that bug. 
There was: (XEN) irq_cfg of IRQ 9: (XEN) vector: 60 (XEN) move_cleanup_count: 0x0 (XEN) move_in_progress: 0x0 (XEN) irq_desc of IRQ 9: (XEN) status: 16 move_in_progress==0, ok. But move_cleanup_count==0, even though move_in_progress was 1 at least once. Isn't that wrong? -- Best Regards / Pozdrawiam, Marek Marczykowski Invisible Things Lab
Andrew Cooper
2013-Mar-27 18:56 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
On 27/03/2013 18:16, Marek Marczykowski wrote:> On 27.03.2013 17:27, Andrew Cooper wrote: >> On 27/03/2013 15:51, Marek Marczykowski wrote: >>> On 27.03.2013 15:49, Marek Marczykowski wrote: >>>> On 27.03.2013 15:46, Andrew Cooper wrote: >>>>> As for locating the cause of the legacy vectors, it might be a good idea >>>>> to stick a printk at the top of do_IRQ() which indicates an interrupt >>>>> with vector between 0xe0 and 0xef. This might at least indicate whether >>>>> legacy vectors are genuinely being delivered, or whether we have some >>>>> memory corruption causing these effects. >>>> Ok, will try something like this. >>> Nothing interesting here... >>> Only vector 0xf1 for irq 4 and 0xf0 for irq 0 (which match irq dump information). >>> >> Even in the case where we hit the original assertion? > Yes, even then. > >> If so, then all I can thing is that the move_pending flag for that >> specific GSI has been corrupted in memory somehow. > I guest this isn''t the case, see below. > >> I wonder if hexdumping irq_desc[9] after setup, before sleep, on resume >> and in the case of the assertion failure might give some hints. > I''ve tried something like this. Detailed log here: > http://duch.mimuw.edu.pl/~marmarek/qubes/xen-4.1-suspend-irq9-dump.logThis is concerning, unless I am getting utterly confused. Jan: Do you mind double checking my reasoning? irq 0 through 15 should be the PIC irqs, set up in init_IRQ() in arch/x86/i8259.c irq9 should be the irq for the PIC vector which is set up as 0xe9, and its vector should never change. Could you put in extra checks for the sanity of per_cpu(vector_irq, cpu)[0xe0 thru 0xef] ?> > Some interesing parts: > after system startup: > (XEN) irq_cfg of IRQ 9: > (XEN) vector: 138 > (XEN) move_cleanup_count: 0x0 > (XEN) move_in_progress: 0x0 > (XEN) irq_desc of IRQ 9: > (XEN) status: 80 (IRQ_GUEST | IRQ_PENDING) > > Isn''t this wrong (status vs move_in_progress)?This here looks fine. 
What do you think is wrong about it?> > Then I''ve run pm-suspend, intentionally failed at the end to prevent actual > suspend, but run all its hooks. After that: > (XEN) irq_cfg of IRQ 9: > (XEN) vector: 181 > (XEN) move_cleanup_count: 0x0 > (XEN) move_in_progress: 0x1 > (XEN) irq_desc of IRQ 9: > (XEN) status: 80 > > So now move_in_progress consistent with status. > Wait few second, and still move_in_progress was 0x1. Isn''t it supposed to be > only temporary state?move_in_progress gets set by __assign_irq_vector() when the scheduler decides to move the IRQ. It can stay set for a long time. On the next interrupt from this source, the move_in_progress bit being set causes the IRQ source to be reprogrammed to the new destination.> > Then suspended, at resume hit that bug. There was: > (XEN) irq_cfg of IRQ 9: > (XEN) vector: 60 > (XEN) move_cleanup_count: 0x0 > (XEN) move_in_progress: 0x0 > (XEN) irq_desc of IRQ 9: > (XEN) status: 16 > > move_in_progress==0, ok. But move_cleanup_count==0, while at least once was > move_in_progress==1. Isn''t that wrong? >move_cleanup_count is only set in send_cleanup_vector, for the specific vector which is being cleaned up. However, as the IPI handler cleans up all vectors which are outstanding, the move_cleanup_count can be 0 for most vectors which are actually cleaned up. This is in an attempt to reduce the number of IPIs required to clean up all moving irqs. As the scheduler currently has a habit of moving vcpus at every scheduling opportunity, this means that irqs are constantly moving. ~Andrew
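Andrew's description of the two flags can be condensed into a small state model. This is a deliberately simplified, hypothetical sketch for readers following along — the real logic lives in xen/arch/x86/irq.c and involves per-CPU vector tables and IPIs; here the set of CPUs still holding the old vector is reduced to a plain count:

```c
#include <stdbool.h>

/* Reduced model of Xen's per-IRQ migration state (illustration only;
 * field names mirror Xen's struct irq_cfg, behaviour is simplified). */
struct irq_cfg {
    int vector;                /* currently assigned vector */
    bool move_in_progress;     /* a new vector was assigned, move pending */
    int move_cleanup_count;    /* CPUs that still must release the old vector */
};

/* The scheduler decided to move the IRQ: a new vector is assigned and the
 * move marked pending.  The flag can stay set for a long time - nothing
 * further happens until an interrupt actually arrives from this source. */
void assign_irq_vector(struct irq_cfg *cfg, int new_vector)
{
    cfg->vector = new_vector;
    cfg->move_in_progress = true;
}

/* First interrupt after the move: the source is reprogrammed to the new
 * destination, and only now does move_cleanup_count become non-zero
 * (modelling send_cleanup_vector()). */
void irq_complete_move(struct irq_cfg *cfg, int cpus_on_old_vector)
{
    if (!cfg->move_in_progress)
        return;
    cfg->move_in_progress = false;
    cfg->move_cleanup_count = cpus_on_old_vector;
}

/* Cleanup IPI handler on one CPU releases the old vector.  Since the
 * handler also cleans up vectors whose count is already zero, an IRQ
 * that was moved can legitimately show move_cleanup_count == 0. */
void irq_move_cleanup(struct irq_cfg *cfg)
{
    if (cfg->move_cleanup_count > 0)
        cfg->move_cleanup_count--;
}
```

This matches the observations in the dump: move_in_progress stays 0x1 until the next interrupt, and move_cleanup_count can be 0 afterwards even though a move happened.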
Jan Beulich
2013-Mar-28 10:50 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
>>> On 27.03.13 at 17:27, Andrew Cooper <andrew.cooper3@citrix.com> wrote: > On 27/03/2013 15:51, Marek Marczykowski wrote: >> On 27.03.2013 15:49, Marek Marczykowski wrote: >>> On 27.03.2013 15:46, Andrew Cooper wrote: >>>> As for locating the cause of the legacy vectors, it might be a good idea >>>> to stick a printk at the top of do_IRQ() which indicates an interrupt >>>> with vector between 0xe0 and 0xef. This might at least indicate whether >>>> legacy vectors are genuinely being delivered, or whether we have some >>>> memory corruption causing these effects. >>> Ok, will try something like this. >> Nothing interesting here... >> Only vector 0xf1 for irq 4 and 0xf0 for irq 0 (which match irq dump > information). >> > > Even in the case where we hit the original assertion? > > If so, then all I can thing is that the move_pending flag for that > specific GSI has been corrupted in memory somehow.No, I think the flag is legitimately set after resume, and gets looked at after the first SCI gets signaled (which would trigger the pending affinity change, initiated in the suspend path, to be carried out). The problem is a more fundamental one: irq_move_cleanup_interrupt() (in unstable terms) includes the legacy vectors, so if, upon encountering the move_cleanup_count for IRQ 9 (or any legacy IRQ), execution doesn't make it all the way through to carrying out the cleanup, the loop, once in the legacy vector range, will re-encounter the same IRQ, find move_cleanup_count non-zero again, and thus try to do something here. Hence I think skipping the legacy vector range here is indeed necessary, even outside the suspend/resume scenario (see below). Another alternative would be to invalidate the vector_irq[] entries for legacy vectors handled through the IO-APIC. 
Jan x86: irq_move_cleanup_interrupt() must ignore legacy vectors Since the main loop in the function includes legacy vectors, and since vector_irq[] gets set up for legacy vectors regardless of whether those get handled through the IO-APIC, it must not do anything on this vector range. In fact, we should never get here for IRQs not handled through the IO-APIC, so add a respective warning at once (could probably as well be an ASSERT()). Signed-off-by: Jan Beulich <jbeulich@suse.com> --- a/xen/arch/x86/irq.c +++ b/xen/arch/x86/irq.c @@ -625,6 +625,12 @@ void irq_move_cleanup_interrupt(struct c if ((int)irq < 0) continue; + if ( vector >= FIRST_LEGACY_VECTOR && vector <= LAST_LEGACY_VECTOR ) + { + WARN_ON(!IO_APIC_IRQ(irq)); + continue; + } + desc = irq_to_desc(irq); if (!desc) continue; _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Andrew Cooper
2013-Mar-28 11:53 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
On 28/03/2013 10:50, Jan Beulich wrote:>>>> On 27.03.13 at 17:27, Andrew Cooper <andrew.cooper3@citrix.com> wrote: >> On 27/03/2013 15:51, Marek Marczykowski wrote: >>> On 27.03.2013 15:49, Marek Marczykowski wrote: >>>> On 27.03.2013 15:46, Andrew Cooper wrote: >>>>> As for locating the cause of the legacy vectors, it might be a good idea >>>>> to stick a printk at the top of do_IRQ() which indicates an interrupt >>>>> with vector between 0xe0 and 0xef. This might at least indicate whether >>>>> legacy vectors are genuinely being delivered, or whether we have some >>>>> memory corruption causing these effects. >>>> Ok, will try something like this. >>> Nothing interesting here... >>> Only vector 0xf1 for irq 4 and 0xf0 for irq 0 (which match irq dump >> information). >> Even in the case where we hit the original assertion? >> >> If so, then all I can thing is that the move_pending flag for that >> specific GSI has been corrupted in memory somehow. > No, I think the flag is legitimately set after resume, and gets > looked at the after the first SCI got signaled (which would > trigger the pending affinity change to be carried out that was > initiated in the suspend path). The problem is a more > fundamental one: irq_move_cleanup_interrupt() (in unstable > terms) includes the legacy vectors, so if, upon encountering the > move_cleanup_count for IRQ 9 (or any legacy IRQ) execution > doesn''t make it all the way through to carrying out the cleanup, > the loop, once in the legacy vector range, will re-encounter the > same IRQ, find move_cleanup_count non-zero again, and thus > tries to do something here. > > Hence I think skipping the legacy vector range here is indeed > necessary, even outside the suspend/resume scenario (see > below). Another alternative would be to invalidate the > vector_irq[] entries for legacy vectors handled through the > IO-APIC. 
> > Jan > > x86: irq_move_cleanup_interrupt() must ignore legacy vectors > > Since the main loop in the function includes legacy vectors, and since > vector_irq[] gets set up for legacy vectors regardless of whether those > get handled through the IO-APIC, it must not do anything on this vector > range. In fact, we should never get here for IRQs not handled through > the IO-APIC, so add a respective warning at once (could probably as > well be an ASSERT()). > > Signed-off-by: Jan Beulich <jbeulich@suse.com>Under what circumstances would we have any vectors 0xe0-0xef programmed into the IOAPIC? I can't think of any offhand. As far as I am aware, it is not valid for any PIC interrupts to ever be up for moving, as they should only be delivered to the BSP. In addition to the check you have, the scope of the loop should probably be reduced. We should never be considering moving any vector larger than LAST_HIPRIORITY_VECTOR, which I believe are all LAPIC interrupts, making 8 useless iterations of the loop. I would also suggest that it be an ASSERT rather than a WARN, but that leaves us not fixing the bug at hand, as we have already verified that vector 0xe9 is not programmed into the IOAPIC. ~Andrew> > --- a/xen/arch/x86/irq.c > +++ b/xen/arch/x86/irq.c > @@ -625,6 +625,12 @@ void irq_move_cleanup_interrupt(struct c > if ((int)irq < 0) > continue; > > + if ( vector >= FIRST_LEGACY_VECTOR && vector <= LAST_LEGACY_VECTOR ) > + { > + WARN_ON(!IO_APIC_IRQ(irq)); > + continue; > + } > + > desc = irq_to_desc(irq); > if (!desc) > continue; > >
Jan Beulich
2013-Mar-28 12:54 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
>>> On 28.03.13 at 12:53, Andrew Cooper <andrew.cooper3@citrix.com> wrote: > On 28/03/2013 10:50, Jan Beulich wrote: >> x86: irq_move_cleanup_interrupt() must ignore legacy vectors >> >> Since the main loop in the function includes legacy vectors, and since >> vector_irq[] gets set up for legacy vectors regardless of whether those >> get handled through the IO-APIC, it must not do anything on this vector >> range. In fact, we should never get here for IRQs not handled through >> the IO-APIC, so add a respective warning at once (could probably as >> well be an ASSERT()). >> >> Signed-off-by: Jan Beulich <jbeulich@suse.com> > > Under what circumstances would we have any vectors 0xe0-0xef programmed > into the IOAPIC? I cant think of any offhand.Never. And I didn''t say it would.> As far as I am aware, it is not valid for any PIC interrupts to ever be > up for moving, as they should only be delivered to the BSP.Hence the WARN_ON() (or ASSERT()).> In addition to the check you have, the scope of the loop should probably > be reduced. We should never be considering to move any vector larger > than LAST_HIPRIORITY_VECTOR, which I believe are all LAPIC interrupts, > making 8 useless iterations of the loop.Agreed. Will update the patch to also do that.> I would also suggest that it > is an ASSERT rather than a WARN, but that leaves us not fixing the bug > at hand, as we have already verified that vector 0xe9 is not programmed > into the IOAPIC.So with you repeating this I think I didn''t explain well enough what I think is happening. Hence I''ll try again: We possibly (on at least one CPU for sure) have two vector_irq[] entries referring to any particular legacy IRQ - one for the vector that the IO-APIC is using, and one for the corresponding legacy vector. Hence there''ll be two iterations of the loop here looking at the _same_ IRQ, the second of which (wrongly) being the one pointed to by the entry in the legacy vector range. 
It is this second instance that the change is suppressing, with the WARN_ON() being there to ascertain that we indeed never get here for an IRQ handled through the 8259A. Jan
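The double mapping Jan describes can be demonstrated with a toy vector_irq[] table. This is a hypothetical sketch, not Xen's actual cleanup loop — the constant names are borrowed from Xen's headers, the vector values are illustrative:

```c
/* Toy model of the double mapping: a legacy IRQ (IRQ 9) appears in
 * vector_irq[] twice - once at the vector the IO-APIC is using, and once
 * at its fixed 8259A legacy vector - so a cleanup loop walking all
 * vectors re-encounters the same IRQ unless the legacy range is skipped. */

#define NR_VECTORS           256
#define FIRST_LEGACY_VECTOR  0xe0   /* as in Xen; usage here illustrative */
#define LAST_LEGACY_VECTOR   0xef

/* Count how many vector_irq[] entries a cleanup loop would visit for the
 * given IRQ, optionally skipping the legacy vector range as the proposed
 * patch does. */
int cleanup_visits(const int vector_irq[NR_VECTORS], int irq, int skip_legacy)
{
    int visits = 0;
    for (int vector = 0; vector < NR_VECTORS; vector++) {
        if (vector_irq[vector] != irq)
            continue;
        if (skip_legacy &&
            vector >= FIRST_LEGACY_VECTOR && vector <= LAST_LEGACY_VECTOR)
            continue;   /* the patch: ignore the 8259A alias */
        visits++;
    }
    return visits;
}
```

With IRQ 9 mapped at, say, vector 0x38 (IO-APIC) and 0xe9 (legacy), the unpatched loop visits it twice and the second, legacy-vector visit is the one the WARN_ON()-guarded `continue` suppresses.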
Jan Beulich
2013-Mar-28 13:19 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
>>> On 28.03.13 at 13:54, "Jan Beulich" <JBeulich@suse.com> wrote: >>>> On 28.03.13 at 12:53, Andrew Cooper <andrew.cooper3@citrix.com> wrote: >> On 28/03/2013 10:50, Jan Beulich wrote: >>> x86: irq_move_cleanup_interrupt() must ignore legacy vectors >>> >>> Since the main loop in the function includes legacy vectors, and since >>> vector_irq[] gets set up for legacy vectors regardless of whether those >>> get handled through the IO-APIC, it must not do anything on this vector >>> range. In fact, we should never get here for IRQs not handled through >>> the IO-APIC, so add a respective warning at once (could probably as >>> well be an ASSERT()). >>> >>> Signed-off-by: Jan Beulich <jbeulich@suse.com> >> >> Under what circumstances would we have any vectors 0xe0-0xef programmed >> into the IOAPIC? I cant think of any offhand. > > Never. And I didn''t say it would. > >> As far as I am aware, it is not valid for any PIC interrupts to ever be >> up for moving, as they should only be delivered to the BSP. > > Hence the WARN_ON() (or ASSERT()).You know what - now that I actually tried this out, I see that this triggers. For the moment I''m puzzled, will need to look into this in more detail. Jan
Marek Marczykowski
2013-Mar-28 14:43 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
On 27.03.2013 19:56, Andrew Cooper wrote:> On 27/03/2013 18:16, Marek Marczykowski wrote: >> On 27.03.2013 17:27, Andrew Cooper wrote: >>> On 27/03/2013 15:51, Marek Marczykowski wrote: >>>> On 27.03.2013 15:49, Marek Marczykowski wrote: >>>>> On 27.03.2013 15:46, Andrew Cooper wrote: >>>>>> As for locating the cause of the legacy vectors, it might be a good idea >>>>>> to stick a printk at the top of do_IRQ() which indicates an interrupt >>>>>> with vector between 0xe0 and 0xef. This might at least indicate whether >>>>>> legacy vectors are genuinely being delivered, or whether we have some >>>>>> memory corruption causing these effects. >>>>> Ok, will try something like this. >>>> Nothing interesting here... >>>> Only vector 0xf1 for irq 4 and 0xf0 for irq 0 (which match irq dump information). >>>> >>> Even in the case where we hit the original assertion? >> Yes, even then. >> >>> If so, then all I can thing is that the move_pending flag for that >>> specific GSI has been corrupted in memory somehow. >> I guest this isn''t the case, see below. >> >>> I wonder if hexdumping irq_desc[9] after setup, before sleep, on resume >>> and in the case of the assertion failure might give some hints. >> I''ve tried something like this. Detailed log here: >> http://duch.mimuw.edu.pl/~marmarek/qubes/xen-4.1-suspend-irq9-dump.log > > This is concerning, unless I am getting utterly confused. Jan: Do you > mind double checking my reasoning? > > irq 0 through 15 should be the PIC irqs, set up in init_IRQ() in > arch/x86/i8259.c > > irq9 should be the irq for the PIC vector which is set up as 0xe9, and > its vector should never change. > > Could you put in extra checks for the sanity of per_cpu(vector_irq, > cpu)[0xe0 thru 0xef] ?Ok, got something here: http://duch.mimuw.edu.pl/~marmarek/qubes/xen-4.1-suspend-irq9-dump2.log Now bug triggered after some time after resume (about 15s). But only CPU0 by scheduler immediately after resume. 
Interesting part - note vector_irq(e1): (XEN) irq_cfg of IRQ 9: (XEN) vector: 188 (XEN) cpu_mask: 00000000,00000000,00000000,00000001 (XEN) old_cpu_mask: 00000000,00000000,00000000,00000002 (XEN) move_cleanup_count: 0x0 (XEN) used_vectors: 49,64,72,74,80-81,88,98,112,120,144,148,152,156,160,164,168,172,178,188,192,196,200,207-208 (XEN) move_in_progress: 0x0 (XEN) irq_desc of IRQ 9: (XEN) status: 16 (XEN) handler: ffff82c480252660 (XEN) msi_desc: 0000000000000000 (XEN) action: ffff83041d9f1ed0 (XEN) depth: 0 (XEN) chip_data: ffff830421080250 (XEN) irq: 9 (XEN) affinity: 00000000,00000000,00000000,00000001 (XEN) pending_mask: 00000000,00000000,00000000,00000000 (XEN) (...) (XEN) vector_irq(e0): 0 (XEN) vector_irq(e1): -1 (XEN) vector_irq(e2): 2 (XEN) vector_irq(e3): 3 (XEN) vector_irq(e4): 4 (XEN) vector_irq(e5): 5 (XEN) vector_irq(e6): 6 (XEN) vector_irq(e7): 7 (XEN) vector_irq(e8): 8 (XEN) vector_irq(e9): 9 (XEN) vector_irq(ea): 10 (XEN) vector_irq(eb): 11 (XEN) vector_irq(ec): 12 (XEN) vector_irq(ed): 13 (XEN) vector_irq(ee): 14 (XEN) vector_irq(ef): 15 (XEN) Xen WARN at io_apic.c:639 (XEN) ----[ Xen-4.1.5-rc1 x86_64 debug=y Not tainted ]---- (XEN) CPU: 0 (XEN) RIP: e008:[<ffff82c48015e5fb>] smp_irq_move_cleanup_interrupt+0x246/0x2c6 (XEN) RFLAGS: 0000000000010046 CONTEXT: hypervisor (XEN) rax: 0000000000000000 rbx: 00000000000000e1 rcx: 0000000000000000 (XEN) rdx: 0000000000000000 rsi: 000000000000000a rdi: ffff82c4802592e0 (XEN) rbp: ffff82c48029fda8 rsp: ffff82c48029fd58 r8: 0000000000000004 (XEN) r9: 0000000000000001 r10: 000000000000000f r11: 0000000000000002 (XEN) r12: ffff830421080050 r13: ffff830421060134 r14: ffff82c48029ff18 (XEN) r15: ffff82c4802dd9e0 cr0: 000000008005003b cr4: 00000000000026f0 (XEN) cr3: 0000000273d3c000 cr2: ffff88000c360318 (XEN) ds: 002b es: 002b fs: 0000 gs: 0000 ss: e010 cs: e008 (XEN) Xen stack trace from rsp=ffff82c48029fd58: (XEN) 0000000000000000 000000008029fd70 ffff82c48029ff18 ffff82c4802dd9e0 (XEN) ffff82c480153f55 
ffff830421043260 ffff830421043320 0000006f207ab134 (XEN) 0000006f207c3b14 ffff82c4802dd600 00007d3b7fd60227 ffff82c48014de60 (XEN) ffff82c4802dd600 0000006f207c3b14 0000006f207ab134 ffff830421043320 (XEN) ffff82c48029fef0 ffff830421043260 0000ffff0000ffff 0000006f416dab2e (XEN) ffff830007ef4060 0000006f1fad2570 0000000000003f40 0000000000000001 (XEN) 0000000000000000 ffff82c4802de200 0000000002048cac 0000002000000000 (XEN) ffff82c480197940 000000000000e008 0000000000000246 ffff82c48029fe68 (XEN) 000000000000e010 ffff82c48029fef0 ffff82c4801987b7 ffff880402105d30 (XEN) 00000000ca9a4000 ffffffffffffffff aaaaaaaaaaaaaa00 aaaaaaaaaaaaaaaa (XEN) 0000006f21136437 0000000000000000 0000000000000000 ffffffffffffffff (XEN) 000004c200000542 0000000000000000 ffff82c48029ff18 ffff82c48029ff18 (XEN) 00000000ffffffff 0000000000000002 ffff82c4802dd600 ffff82c48029ff10 (XEN) ffff82c4801549ce ffff8300ca9a4000 ffff8300ca666000 ffff82c48029fdc8 (XEN) 0000000000000000 0000000000000000 0000000000000000 0000000000000001 (XEN) ffff880402105f00 ffff880402105fd8 0000000000000246 0000000000000001 (XEN) 0000000000000000 0000000000000000 0000000000000000 ffffffff810013aa (XEN) ffffffff81a2a858 00000000deadbeef 00000000deadbeef 0000010000000000 (XEN) ffffffff810013aa 000000000000e033 0000000000000246 ffff880402105ee8 (XEN) 000000000000e02b 0000000000000000 0000000000000000 0000000000000000 (XEN) Xen call trace: (XEN) [<ffff82c48015e5fb>] smp_irq_move_cleanup_interrupt+0x246/0x2c6 (XEN) [<ffff82c48014de60>] irq_move_cleanup_interrupt+0x30/0x40 (XEN) [<ffff82c480197940>] lapic_timer_nop+0x0/0x6 (XEN) [<ffff82c4801549ce>] idle_loop+0x4b/0x59 Ignore rest of comments from my previous mail - I clearly don''t understand IRQ handling code. -- Best Regards / Pozdrawiam, Marek Marczykowski Invisible Things Lab _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
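The sanity check Andrew asked for amounts to verifying the identity mapping of the legacy vector range. A hedged sketch of that check — in Xen this would iterate over per_cpu(vector_irq, cpu); here a plain array stands in, and the function name is made up:

```c
#define FIRST_LEGACY_VECTOR 0xe0

/* Return the first legacy vector whose vector_irq[] entry does not point
 * back at its own IRQ (vector 0xe0+i is expected to map to IRQ i), or -1
 * if the whole 0xe0-0xef range is consistent.  In the dump above, entry
 * e1 holds -1 instead of 1, so this would return 0xe1. */
int check_legacy_vector_irq(const int *vector_irq)
{
    for (int i = 0; i < 16; i++)
        if (vector_irq[FIRST_LEGACY_VECTOR + i] != i)
            return FIRST_LEGACY_VECTOR + i;
    return -1;
}
```

Run on each CPU after setup, before sleep, and on resume, this would pinpoint exactly when the e1 entry gets clobbered.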
Jan Beulich
2013-Mar-28 16:13 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
>>> On 27.03.13 at 15:31, Marek Marczykowski <marmarek@invisiblethingslab.com> wrote: > Also one time I''ve got fatal page fault error, earlier in resume (it isn''t > deterministic): > http://duch.mimuw.edu.pl/~marmarek/qubes/xen-4.1-resume-page-fault.logThis is mostly identical to http://lists.xen.org/archives/html/xen-devel/2013-01/msg02175.html, and hence I would assume that the patch Ben posted (v4 came through yesterday) would be fixing this. Care to give this a try? Jan
Jan Beulich
2013-Mar-28 16:25 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
>>> On 27.03.13 at 15:31, Marek Marczykowski <marmarek@invisiblethingslab.com>wrote:> On 27.03.2013 09:52, Jan Beulich wrote: >>>>> On 26.03.13 at 19:50, Andrew Cooper <andrew.cooper3@citrix.com> wrote: >>> So vector e9 doesn''t appear to be programmed in anywhere. >> >> Quite obviously, as it''s the 8259A vector for IRQ 9. The question >> really is why an IRQ appears on that vector in the first place. The >> 8259A resume code _should_ leave all IRQs masked on a fully >> IO-APIC system (see my question raised yesterday). >> >> And that''s also why I suggested, for an experiment, to fiddle with >> the loop exit condition to exclude legacy vectors (which wouldn''t >> be a final solution, but would at least tell us whether the direction >> is the right one). In the end, besides understanding why an >> interrupt on vector E9 gets raised at all, we may also need to >> tweak the IRQ migration logic to not do anything on legacy IRQs, >> but that would need to happen earlier than in >> smp_irq_move_cleanup_interrupt(). Considering that 4.3 >> apparently doesn''t have this problem, we may need to go hunt for >> a change that isn''t directly connected to this, yet deals with the >> problem as a side effect (at least I don''t recall any particular fix >> since 4.2). One aspect here is the double mapping of legacy IRQs >> (once to their IO-APIC vector, and once to their legacy vector, >> i.e. vector_irq[] having two entries pointing to the same IRQ). > > So tried change loop condition to LAST_DYNAMIC_VECTOR and it doesn''t hit > that > BUG/ASSERT. But still it doesn''t work - only CPU0 used by scheduler, also > some > errors from dom0 kernel, and errors about PCI devices used by domU(1). > > Messages from resume (different tries): > http://duch.mimuw.edu.pl/~marmarek/qubes/xen-4.1-last-dynamic-vector.log > http://duch.mimuw.edu.pl/~marmarek/qubes/xen-4.1-last-dynamic-vector2.logIs that a sensible usage scenario at all? 
I would think that a prerequisite to host S3 is that all guests get suspended. If you do that, do you still have these interrupt re-setup problems? Jan
Marek Marczykowski
2013-Mar-28 16:31 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
On 28.03.2013 17:25, Jan Beulich wrote:>>>> On 27.03.13 at 15:31, Marek Marczykowski <marmarek@invisiblethingslab.com> > wrote: >> On 27.03.2013 09:52, Jan Beulich wrote: >>>>>> On 26.03.13 at 19:50, Andrew Cooper <andrew.cooper3@citrix.com> wrote: >>>> So vector e9 doesn''t appear to be programmed in anywhere. >>> >>> Quite obviously, as it''s the 8259A vector for IRQ 9. The question >>> really is why an IRQ appears on that vector in the first place. The >>> 8259A resume code _should_ leave all IRQs masked on a fully >>> IO-APIC system (see my question raised yesterday). >>> >>> And that''s also why I suggested, for an experiment, to fiddle with >>> the loop exit condition to exclude legacy vectors (which wouldn''t >>> be a final solution, but would at least tell us whether the direction >>> is the right one). In the end, besides understanding why an >>> interrupt on vector E9 gets raised at all, we may also need to >>> tweak the IRQ migration logic to not do anything on legacy IRQs, >>> but that would need to happen earlier than in >>> smp_irq_move_cleanup_interrupt(). Considering that 4.3 >>> apparently doesn''t have this problem, we may need to go hunt for >>> a change that isn''t directly connected to this, yet deals with the >>> problem as a side effect (at least I don''t recall any particular fix >>> since 4.2). One aspect here is the double mapping of legacy IRQs >>> (once to their IO-APIC vector, and once to their legacy vector, >>> i.e. vector_irq[] having two entries pointing to the same IRQ). >> >> So tried change loop condition to LAST_DYNAMIC_VECTOR and it doesn''t hit >> that >> BUG/ASSERT. But still it doesn''t work - only CPU0 used by scheduler, also >> some >> errors from dom0 kernel, and errors about PCI devices used by domU(1). 
>> >> Messages from resume (different tries): >> http://duch.mimuw.edu.pl/~marmarek/qubes/xen-4.1-last-dynamic-vector.log >> http://duch.mimuw.edu.pl/~marmarek/qubes/xen-4.1-last-dynamic-vector2.log > > Is that a sensible usage scenario at all? I would think that a > prerequisite to host S3 is that all guests get suspended.What do you mean by "suspended"? I haven't found any sane method to do that with xl (only some manual xenstore write to control/shutdown). For now I do: - shut down all network adapters in the VMs - pause all VMs> If you > do that, do you still have these interrupt re-setup problems?Yes, even when no guest is running (which was the case on 4.2)... -- Best Regards / Pozdrawiam, Marek Marczykowski Invisible Things Lab
Jan Beulich
2013-Mar-28 16:52 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
>>> On 28.03.13 at 17:31, Marek Marczykowski <marmarek@invisiblethingslab.com> wrote:
> On 28.03.2013 17:25, Jan Beulich wrote:
>> [...]
>> Is that a sensible usage scenario at all? I would think that a
>> prerequisite to host S3 is that all guests get suspended.
>
> What do you mean by "suspended"? I haven't found any sane method to do that
> with xl (only a manual xenstore write to control/shutdown). For now I do:
> - shut down all network adapters in the VMs
> - pause all VMs

Aren't there "xl save" and "xl restore"? And for HVM guests, I think
there's also a way to do virtual S3.

Jan
Marek Marczykowski
2013-Mar-28 17:09 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
On 28.03.2013 17:52, Jan Beulich wrote:
>>>> On 28.03.13 at 17:31, Marek Marczykowski <marmarek@invisiblethingslab.com> wrote:
>> [...]
>> What do you mean by "suspended"? I haven't found any sane method to do that
>> with xl (only a manual xenstore write to control/shutdown). For now I do:
>> - shut down all network adapters in the VMs
>> - pause all VMs
>
> Aren't there "xl save" and "xl restore"? And for HVM guests, I think
> there's also a way to do virtual S3.

xl save/restore takes far too much time. I tried xenstore-write "suspend" to
control/shutdown, then an xc_domain_resume call, some time ago, but I had some
problems with that (unfortunately I don't remember the details...). This is
basically what xl save and restore do, but without the actual data dump.

-- 
Best Regards / Pozdrawiam,
Marek Marczykowski
Invisible Things Lab
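The lightweight "suspend without save" sequence Marek describes can be sketched as below. The xenstore node and tool names are the standard ones, but the handshake is deliberately simplified (a hypothetical sketch, not a robust implementation: a real one must wait for the guest to acknowledge the request before the host enters S3).

```shell
# Sketch only: trigger a PV guest's suspend handler via xenstore instead of
# a full "xl save". The helper just builds the control node path.
suspend_node() {
    # xenstore node a PV guest watches for shutdown/suspend requests
    echo "/local/domain/$1/control/shutdown"
}

# On a real Xen host one would then run (not executed here):
#   xenstore-write "$(suspend_node "$domid")" suspend
#   # ... wait until the guest acknowledges and suspends itself ...
#   # then resume it without a memory dump (xc_domain_resume() via libxc)
```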
Andrew Cooper
2013-Mar-28 17:41 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
On 27/03/2013 17:15, Marek Marczykowski wrote:
> On 27.03.2013 17:56, Andrew Cooper wrote:
>> On 27/03/2013 15:47, Konrad Rzeszutek Wilk wrote:
>>> On Wed, Mar 27, 2013 at 02:52:14PM +0000, Andrew Cooper wrote:
>>>> On 27/03/2013 14:46, Andrew Cooper wrote:
>>>>> On 27/03/2013 14:31, Marek Marczykowski wrote:
>>>>>> On 27.03.2013 09:52, Jan Beulich wrote:
>>>>>>> [...]
>>>>>> So tried change loop condition to LAST_DYNAMIC_VECTOR and it doesn't hit that
>>>>>> BUG/ASSERT. But still it doesn't work - only CPU0 used by scheduler, also some
>>>>>> errors from dom0 kernel, and errors about PCI devices used by domU(1).
>>>>>>
>>>>>> Messages from resume (different tries):
>>>>>> http://duch.mimuw.edu.pl/~marmarek/qubes/xen-4.1-last-dynamic-vector.log
>>>>>> http://duch.mimuw.edu.pl/~marmarek/qubes/xen-4.1-last-dynamic-vector2.log
>>>>>>
>>>>>> Also one time I've got fatal page fault error, earlier in resume (it isn't
>>>>>> deterministic):
>>>>>> http://duch.mimuw.edu.pl/~marmarek/qubes/xen-4.1-resume-page-fault.log
>>>>>>
>>>>> This pagefault is a Null structure pointer dereference, likely the
>>>>> scheduling data. At a first glance, it looks related to the assertion
>>>>> failures I have been seeing sporadically in testing, but unable to
>>>>> reproduce reliably. There seems to be something quite dodgy with
>>>>> interaction of vcpu_wake and scheduling loops.
>>>>>
>>>>> The other logs indicate that dom0 appears to have a domain id of 1,
>>>>> which is sure to cause problems.
>>>> Actually - ignore this
>>>>
>>>> From the log,
>>>>
>>>> (XEN) physdev.c:153: dom0: can't create irq for msi!
>>>> [  113.637037] xhci_hcd 0000:03:00.0: xen map irq failed -22 for 32752 domain
>>>> (XEN) physdev.c:153: dom0: can't create irq for msi!
>>>> [  113.657911] xhci_hcd 0000:03:00.0: xen map irq failed -22 for 32752 domain
>>>>
>>>> and later
>>>>
>>>> (XEN) physdev.c:153: dom1: can't create irq for msi!
>>>> [  121.909814] pciback 0000:00:19.0: xen map irq failed -22 for 1 domain
>>>> [  121.954080] error enable msi for guest 1 status ffffffea
>>>> (XEN) physdev.c:153: dom1: can't create irq for msi!
>>>> [  122.035355] pciback 0000:00:19.0: xen map irq failed -22 for 1 domain
>>>> [  122.044421] error enable msi for guest 1 status ffffffea
>>>>
>>>> I think that there is a separate bug where mapped irqs are not unmapped
>>>> on the suspend path.
>>> You thinking this is a Linux (xen irq machinery) issue? Meaning it should
>>> end up calling PHYSDEV_unmap_pirq as part of the suspend process?
>> I am not sure. Without looking at the code, I am only speculating.
>>
>> Beyond that, the main question is about the expected behaviour. Do we
>> expect dom0/U to unmap its irqs and remap them after resume? What do we
>> expect from domains which are unaware of the host sleep action?
> BTW this is the case: domain 1 isn't fully aware of sleep. It have some PCI
> devices assigned. The only action taken there before suspend is shutdown
> network interfaces (without this system hanged during suspend).
>
What do you mean here by shutting down the network interfaces? Are the
devices being assigned back to dom0? If so, is dom0 assigning them back
to domU before the domU driver tries to set itself up?

~Andrew
Marek Marczykowski
2013-Mar-28 17:44 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
On 28.03.2013 18:41, Andrew Cooper wrote:
> On 27/03/2013 17:15, Marek Marczykowski wrote:
>> [...]
>> BTW this is the case: domain 1 isn't fully aware of sleep. It have some PCI
>> devices assigned. The only action taken there before suspend is shutdown
>> network interfaces (without this system hanged during suspend).
>>
> What do you mean here by shutting down the network interfaces? Are the
> devices being assigned back to dom0?

No, just a simple "ip link set eth0 down". It seems to be enough for suspend to
succeed, at least on most hardware...

> If so, is dom0 assigning them back
> to domU before the domU driver tries to set itself up?

-- 
Best Regards / Pozdrawiam,
Marek Marczykowski
Invisible Things Lab
Andrew Cooper
2013-Mar-28 17:50 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
On 28/03/2013 17:44, Marek Marczykowski wrote:
> On 28.03.2013 18:41, Andrew Cooper wrote:
>> [...]
>> What do you mean here by shutting down the network interfaces? Are the
>> devices being assigned back to dom0?
> No, just a simple "ip link set eth0 down". It seems to be enough for suspend to
> succeed, at least on most hardware...

In which case repeat map_pirq hypercalls will fail with -EINVAL because
the pirq is already set up. It is probably worth putting a printk in
map_pirq and unmap_pirq to see exactly what is happening across the
sleep/resume cycle.

~Andrew
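A minimal instrumentation sketch along the lines Andrew suggests, in the same patch style as used elsewhere in this thread. The hunk locations are hypothetical (shown only to illustrate where the printks would go in 4.1-era xen/arch/x86/physdev.c); the field names assume the physdev_map_pirq/physdev_unmap_pirq argument structures.

```diff
--- a/xen/arch/x86/physdev.c
+++ b/xen/arch/x86/physdev.c
@@ (inside physdev_map_pirq(), hypothetical location)
+    /* debug: trace pirq mapping across the sleep/resume cycle */
+    printk("map_pirq: dom%d type %d index %d pirq %d\n",
+           d->domain_id, map->type, map->index, map->pirq);
@@ (inside physdev_unmap_pirq(), hypothetical location)
+    /* debug: trace pirq unmapping */
+    printk("unmap_pirq: dom%d pirq %d\n", d->domain_id, unmap->pirq);
```

Comparing the map/unmap trace before suspend and after resume would show directly whether stale pirq mappings are the cause of the -EINVAL failures in the logs above.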
Marek Marczykowski
2013-Mar-28 19:03 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
On 28.03.2013 17:13, Jan Beulich wrote:
>>>> On 27.03.13 at 15:31, Marek Marczykowski <marmarek@invisiblethingslab.com> wrote:
>> Also one time I've got fatal page fault error, earlier in resume (it isn't
>> deterministic):
>> http://duch.mimuw.edu.pl/~marmarek/qubes/xen-4.1-resume-page-fault.log
>
> This is mostly identical to
> http://lists.xen.org/archives/html/xen-devel/2013-01/msg02175.html,
> and hence I would assume that the patch Ben posted (v4 came
> through yesterday) would be fixing this. Care to give this a try?

With this, together with your previous patch ("x86: irq_move_cleanup_interrupt()
must ignore legacy vectors"), I can't hit the previous IRQ setup problem (at
least for a few tries).

But it still doesn't solve the original problem - after suspend the system
temperature goes high, and apparently only CPU0 is online. If I pin some domain
vCPU to a non-0 CPU before suspend, I hit an ASSERT() on resume:

(XEN) Finishing wakeup from ACPI S3 state.
(XEN) Enabling non-boot CPUs ...
(XEN) Suppress EOI broadcast on CPU#1
(XEN) masked ExtINT on CPU#1
(XEN) Suppress EOI broadcast on CPU#2
(XEN) masked ExtINT on CPU#2
(XEN) Suppress EOI broadcast on CPU#3
(XEN) masked ExtINT on CPU#3
(XEN) Restoring affinity for d2v3
(XEN) Assertion '!cpus_empty(cpus) && cpu_isset(cpu, cpus)' failed at sched_credit.c:481

xl cpupool-list -c:
Name            CPU list
Pool-0          0

xl cpupool-cpu-add Pool-0 1 -> -EBUSY

-- 
Best Regards / Pozdrawiam,
Marek Marczykowski
Invisible Things Lab
Marek Marczykowski
2013-Mar-29 00:26 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
On 28.03.2013 18:50, Andrew Cooper wrote:
> On 28/03/2013 17:44, Marek Marczykowski wrote:
>> [...]
>> No, just a simple "ip link set eth0 down". It seems to be enough for suspend
>> to succeed, at least on most hardware...
>
> In which case repeat map_pirq hypercalls will fail with -EINVAL because
> the pirq is already set up. It is probably worth putting a printk in
> map_pirq and unmap_pirq to see exactly what is happening across the
> sleep/resume cycle.

No unmap/map is done during the sleep/resume cycle for that domain (it has two
mapped pirqs). Even for dom0 I see only one unmap/map during suspend/resume.
For most devices this doesn't break anything. A few exceptions need a module
reload after resume (e.g. sky2), but I'm not sure about the reason (no
additional logs, simply no link detected).

-- 
Best Regards / Pozdrawiam,
Marek Marczykowski
Invisible Things Lab
Ben Guthro
2013-Apr-01 13:53 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
On Thu, Mar 28, 2013 at 3:03 PM, Marek Marczykowski
<marmarek@invisiblethingslab.com> wrote:
> (XEN) Restoring affinity for d2v3
> (XEN) Assertion '!cpus_empty(cpus) && cpu_isset(cpu, cpus)' failed at
> sched_credit.c:481

I think the "fix-suspend-scheduler-*" patches posted here are applicable here:
http://markmail.org/message/llj3oyhgjzvw3t23

Specifically, I think you need this bit:

diff --git a/xen/common/cpu.c b/xen/common/cpu.c
index 630881e..e20868c 100644
--- a/xen/common/cpu.c
+++ b/xen/common/cpu.c
@@ -5,6 +5,7 @@
 #include <xen/init.h>
 #include <xen/sched.h>
 #include <xen/stop_machine.h>
+#include <xen/sched-if.h>

 unsigned int __read_mostly nr_cpu_ids = NR_CPUS;
 #ifndef nr_cpumask_bits
@@ -212,6 +213,8 @@ void enable_nonboot_cpus(void)
             BUG_ON(error == -EBUSY);
             printk("Error taking CPU%d up: %d\n", cpu, error);
         }
+        if (system_state == SYS_STATE_resume)
+            cpumask_set_cpu(cpu, cpupool0->cpu_valid);
     }

     cpumask_clear(&frozen_cpus);
Marek Marczykowski
2013-Apr-02 01:13 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
On 01.04.2013 15:53, Ben Guthro wrote:
> On Thu, Mar 28, 2013 at 3:03 PM, Marek Marczykowski
> <marmarek@invisiblethingslab.com> wrote:
>> (XEN) Restoring affinity for d2v3
>> (XEN) Assertion '!cpus_empty(cpus) && cpu_isset(cpu, cpus)' failed at
>> sched_credit.c:481
>
> I think the "fix-suspend-scheduler-*" patches posted here are applicable here:
> http://markmail.org/message/llj3oyhgjzvw3t23
>
> Specifically, I think you need this bit:
>
> [patch to xen/common/cpu.c snipped]

Indeed, this makes things better, but still not ideal.

Now after resume all CPUs are in Pool-0, which is good. But CPU0 is much more
preferred than the others (xl vcpu-list). For example, if I start 4 busy loops
in dom0, I get (even after some time):

[user@dom0 ~]$ xl vcpu-list
Name           ID  VCPU   CPU State   Time(s) CPU Affinity
dom0            0     0     0   r--      98.5  any cpu
dom0            0     1     0   ---     181.3  any cpu
dom0            0     2     2   r--     262.4  any cpu
dom0            0     3     3   r--     230.8  any cpu
netvm           1     0     0   -b-      18.4  any cpu
netvm           1     1     0   -b-       9.1  any cpu
netvm           1     2     0   -b-       7.1  any cpu
netvm           1     3     0   -b-       5.4  any cpu
firewallvm      2     0     0   -b-      10.7  any cpu
firewallvm      2     1     0   -b-       3.0  any cpu
firewallvm      2     2     0   -b-       2.5  any cpu
firewallvm      2     3     3   -b-       3.6  any cpu

If I remove some CPU from Pool-0 and re-add it, things go back to normal for
that particular CPU (so I get two equally used CPUs) - to fully restore the
system I must remove all CPUs but CPU0 from Pool-0 and add them again.

Also, still only CPU0 has all C-states (C0-C3); all the others have only C0-C1.
This could probably be fixed by your "xen: Re-upload processor PM data to
hypervisor after S3 resume" patch (a reload of the xen-acpi-processor module
helps here). But I don't think that is the right way: it isn't necessary on
other systems (with somewhat older hardware). Something must be missing on the
resume path. The question is what...

Perhaps someone needs to go through enable_nonboot_cpus() (__cpu_up?) and check
whether it restores everything disabled in disable_nonboot_cpus()
(__cpu_disable?). Unfortunately I don't know the x86 details well enough to
follow that code...

-- 
Best Regards / Pozdrawiam,
Marek Marczykowski
Invisible Things Lab
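The cpupool cycling workaround described above can be scripted. This sketch only generates the xl commands (the CPU count is a parameter, and the function name is invented here) rather than running them, since executing them needs a live Xen host:

```shell
# Emit the xl commands that cycle CPUs 1..(ncpus-1) out of and back into
# Pool-0 after resume, per the workaround described above. Pipe to sh on a
# real system: cpupool_cycle_cmds 4 | sh
cpupool_cycle_cmds() {
    local ncpus=$1 cpu
    for cpu in $(seq 1 $((ncpus - 1))); do
        echo "xl cpupool-cpu-remove Pool-0 $cpu"
        echo "xl cpupool-cpu-add Pool-0 $cpu"
    done
}
```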
Konrad Rzeszutek Wilk
2013-Apr-02 14:05 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
On Tue, Apr 02, 2013 at 03:13:56AM +0200, Marek Marczykowski wrote:
> On 01.04.2013 15:53, Ben Guthro wrote:
> > On Thu, Mar 28, 2013 at 3:03 PM, Marek Marczykowski
> > <marmarek@invisiblethingslab.com> wrote:
> >> (XEN) Restoring affinity for d2v3
> >> (XEN) Assertion '!cpus_empty(cpus) && cpu_isset(cpu, cpus)' failed at
> >> sched_credit.c:481
> >
> > I think the "fix-suspend-scheduler-*" patches posted here are applicable here:
> > http://markmail.org/message/llj3oyhgjzvw3t23
> >
> > Specifically, I think you need this bit:
> >
> > diff --git a/xen/common/cpu.c b/xen/common/cpu.c
> > index 630881e..e20868c 100644
> > --- a/xen/common/cpu.c
> > +++ b/xen/common/cpu.c
> > @@ -5,6 +5,7 @@
> >  #include <xen/init.h>
> >  #include <xen/sched.h>
> >  #include <xen/stop_machine.h>
> > +#include <xen/sched-if.h>
> >
> >  unsigned int __read_mostly nr_cpu_ids = NR_CPUS;
> >  #ifndef nr_cpumask_bits
> > @@ -212,6 +213,8 @@ void enable_nonboot_cpus(void)
> >              BUG_ON(error == -EBUSY);
> >              printk("Error taking CPU%d up: %d\n", cpu, error);
> >          }
> > +        if (system_state == SYS_STATE_resume)
> > +            cpumask_set_cpu(cpu, cpupool0->cpu_valid);
> >      }
> >
> >      cpumask_clear(&frozen_cpus);
>
> Indeed, this makes things better, but still not ideal.
> Now after resume all CPUs are in Pool-0, which is good. But CPU0 is much more
> preferred than the others (xl vcpu-list). For example, if I start 4 busy
> loops in dom0, I get (even after some time):
>
> [user@dom0 ~]$ xl vcpu-list
> Name            ID  VCPU   CPU State   Time(s) CPU Affinity
> dom0             0     0     0   r--      98.5  any cpu
> dom0             0     1     0   ---     181.3  any cpu
> dom0             0     2     2   r--     262.4  any cpu
> dom0             0     3     3   r--     230.8  any cpu
> netvm            1     0     0   -b-      18.4  any cpu
> netvm            1     1     0   -b-       9.1  any cpu
> netvm            1     2     0   -b-       7.1  any cpu
> netvm            1     3     0   -b-       5.4  any cpu
> firewallvm       2     0     0   -b-      10.7  any cpu
> firewallvm       2     1     0   -b-       3.0  any cpu
> firewallvm       2     2     0   -b-       2.5  any cpu
> firewallvm       2     3     3   -b-       3.6  any cpu
>
> If I remove some CPU from Pool-0 and re-add it, things go back to normal for
> this particular CPU (so I get two equally used CPUs) - to fully restore the
> system I must remove all CPUs but CPU0 from Pool-0 and add them again.
>
> Also, still only CPU0 has all C-states (C0-C3); all the others have only
> C0-C1. This could probably be fixed by your "xen: Re-upload processor PM
> data to hypervisor after S3 resume" patch (a reload of the
> xen-acpi-processor module helps here). But I don't think that is the right
> way: it isn't necessary on other systems (with somewhat older hardware), so
> something must be missing on the resume path. The question is what...

The xen-acpi-processor should probably also have the cpu hotplug notification
in it to deal with this - so that you don't need to do the reload.

> Perhaps someone needs to go through enable_nonboot_cpus() (__cpu_up?) and
> check whether it restores everything disabled in disable_nonboot_cpus()
> (__cpu_disable?). Unfortunately I don't know the x86 details well enough to
> follow that code...
>
> --
> Best Regards / Pozdrawiam,
> Marek Marczykowski
> Invisible Things Lab
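Konrad's suggestion - re-uploading a CPU's PM data from a hotplug "online" callback instead of requiring a module reload - can be illustrated with a small stand-alone model. This is only a sketch of the idea in plain C; the types and names (`pm_data`, `upload_pm_data`, `cpu_online_notifier`) are illustrative stand-ins, not the real kernel API:

```c
#include <stddef.h>

/* Stand-in for the per-CPU ACPI PM data the real module would upload. */
struct pm_data { int acpi_id; int uploaded; };

#define NR_CPUS 4
static struct pm_data pm[NR_CPUS];

/* Stand-in for upload_pm_data(): push one CPU's C/P-state data to Xen. */
static void upload_pm_data(struct pm_data *d) { d->uploaded = 1; }

/* Hotplug callback: on a CPU-online event (as happens for each non-boot
 * CPU during S3 resume), re-upload that CPU's PM data automatically. */
static void cpu_online_notifier(int cpu)
{
    if (cpu >= 0 && cpu < NR_CPUS)
        upload_pm_data(&pm[cpu]);
}

/* Simulate resume: non-boot CPUs come back online one by one. */
void simulate_resume(void)
{
    for (int cpu = 1; cpu < NR_CPUS; cpu++)
        cpu_online_notifier(cpu);
}
```

In the real driver the callback would be registered with the kernel's CPU hotplug notification machinery at module load, so the re-upload happens on every online transition rather than only at init.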
Marek Marczykowski
2013-Apr-15 22:09 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
On 02.04.2013 03:13, Marek Marczykowski wrote:
> On 01.04.2013 15:53, Ben Guthro wrote:
>> On Thu, Mar 28, 2013 at 3:03 PM, Marek Marczykowski
>> <marmarek@invisiblethingslab.com> wrote:
>>> (XEN) Restoring affinity for d2v3
>>> (XEN) Assertion '!cpus_empty(cpus) && cpu_isset(cpu, cpus)' failed at
>>> sched_credit.c:481
>>
>> I think the "fix-suspend-scheduler-*" patches posted here are applicable here:
>> http://markmail.org/message/llj3oyhgjzvw3t23
>>
>> Specifically, I think you need this bit:
>>
>> diff --git a/xen/common/cpu.c b/xen/common/cpu.c
>> index 630881e..e20868c 100644
>> --- a/xen/common/cpu.c
>> +++ b/xen/common/cpu.c
>> @@ -5,6 +5,7 @@
>>  #include <xen/init.h>
>>  #include <xen/sched.h>
>>  #include <xen/stop_machine.h>
>> +#include <xen/sched-if.h>
>>
>>  unsigned int __read_mostly nr_cpu_ids = NR_CPUS;
>>  #ifndef nr_cpumask_bits
>> @@ -212,6 +213,8 @@ void enable_nonboot_cpus(void)
>>              BUG_ON(error == -EBUSY);
>>              printk("Error taking CPU%d up: %d\n", cpu, error);
>>          }
>> +        if (system_state == SYS_STATE_resume)
>> +            cpumask_set_cpu(cpu, cpupool0->cpu_valid);
>>      }
>>
>>      cpumask_clear(&frozen_cpus);
>
> Indeed, this makes things better, but still not ideal.
> Now after resume all CPUs are in Pool-0, which is good. But CPU0 is much more
> preferred than the others (xl vcpu-list). For example, if I start 4 busy
> loops in dom0, I get (even after some time):
>
> [user@dom0 ~]$ xl vcpu-list
> Name            ID  VCPU   CPU State   Time(s) CPU Affinity
> dom0             0     0     0   r--      98.5  any cpu
> dom0             0     1     0   ---     181.3  any cpu
> dom0             0     2     2   r--     262.4  any cpu
> dom0             0     3     3   r--     230.8  any cpu
> netvm            1     0     0   -b-      18.4  any cpu
> netvm            1     1     0   -b-       9.1  any cpu
> netvm            1     2     0   -b-       7.1  any cpu
> netvm            1     3     0   -b-       5.4  any cpu
> firewallvm       2     0     0   -b-      10.7  any cpu
> firewallvm       2     1     0   -b-       3.0  any cpu
> firewallvm       2     2     0   -b-       2.5  any cpu
> firewallvm       2     3     3   -b-       3.6  any cpu
>
> If I remove some CPU from Pool-0 and re-add it, things go back to normal for
> this particular CPU (so I get two equally used CPUs) - to fully restore the
> system I must remove all CPUs but CPU0 from Pool-0 and add them again.
>
> Also, still only CPU0 has all C-states (C0-C3); all the others have only
> C0-C1. This could probably be fixed by your "xen: Re-upload processor PM
> data to hypervisor after S3 resume" patch (a reload of the
> xen-acpi-processor module helps here). But I don't think that is the right
> way: it isn't necessary on other systems (with somewhat older hardware), so
> something must be missing on the resume path. The question is what...
>
> Perhaps someone needs to go through enable_nonboot_cpus() (__cpu_up?) and
> check whether it restores everything disabled in disable_nonboot_cpus()
> (__cpu_disable?). Unfortunately I don't know the x86 details well enough to
> follow that code...

To summarize the ACPI S3 issues:

I. Fixed issues:

1. IRQ problem, fixed by the "x86: irq_move_cleanup_interrupt() must ignore
   legacy vectors" commit
2. Assertion failure on resume when vcpu affinity is used, fixed by the
   "x86/S3: Restore broken vcpu affinity on resume" commit

II. Not (fully) fixed issues:

1. CPU Pool-0 contains only CPU0 after resume - the patch quoted above fixes
   the issue, but it isn't applied to xen-unstable
2. After resume the scheduler chooses (almost) only CPU0 (listing quoted
   above). Removing and re-adding all CPUs to Pool-0 solves the problem.
   Perhaps some timers are not restarted after resume?
3. ACPI C-states are only present for CPU0 (after resume, of course), fixed
   by Ben's "xen: Re-upload processor PM data to hypervisor after S3" patch,
   but it isn't in upstream Linux (nor in Konrad's acpi-s3 branches).

--
Best Regards / Pozdrawiam,
Marek Marczykowski
Invisible Things Lab
Ben Guthro
2013-Apr-15 23:36 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
On Mon, Apr 15, 2013 at 11:09 PM, Marek Marczykowski
<marmarek@invisiblethingslab.com> wrote:
> On 02.04.2013 03:13, Marek Marczykowski wrote:
>> On 01.04.2013 15:53, Ben Guthro wrote:
>>> On Thu, Mar 28, 2013 at 3:03 PM, Marek Marczykowski
>>> <marmarek@invisiblethingslab.com> wrote:
>>>> (XEN) Restoring affinity for d2v3
>>>> (XEN) Assertion '!cpus_empty(cpus) && cpu_isset(cpu, cpus)' failed at
>>>> sched_credit.c:481
>>>
>>> I think the "fix-suspend-scheduler-*" patches posted here are applicable here:
>>> http://markmail.org/message/llj3oyhgjzvw3t23
>>>
>>> Specifically, I think you need this bit:
>>>
>>> diff --git a/xen/common/cpu.c b/xen/common/cpu.c
>>> index 630881e..e20868c 100644
>>> --- a/xen/common/cpu.c
>>> +++ b/xen/common/cpu.c
>>> @@ -5,6 +5,7 @@
>>>  #include <xen/init.h>
>>>  #include <xen/sched.h>
>>>  #include <xen/stop_machine.h>
>>> +#include <xen/sched-if.h>
>>>
>>>  unsigned int __read_mostly nr_cpu_ids = NR_CPUS;
>>>  #ifndef nr_cpumask_bits
>>> @@ -212,6 +213,8 @@ void enable_nonboot_cpus(void)
>>>              BUG_ON(error == -EBUSY);
>>>              printk("Error taking CPU%d up: %d\n", cpu, error);
>>>          }
>>> +        if (system_state == SYS_STATE_resume)
>>> +            cpumask_set_cpu(cpu, cpupool0->cpu_valid);
>>>      }
>>>
>>>      cpumask_clear(&frozen_cpus);
>>
>> Indeed, this makes things better, but still not ideal.
>> Now after resume all CPUs are in Pool-0, which is good. But CPU0 is much
>> more preferred than the others (xl vcpu-list). For example, if I start 4
>> busy loops in dom0, I get (even after some time):
>>
>> [user@dom0 ~]$ xl vcpu-list
>> Name            ID  VCPU   CPU State   Time(s) CPU Affinity
>> dom0             0     0     0   r--      98.5  any cpu
>> dom0             0     1     0   ---     181.3  any cpu
>> dom0             0     2     2   r--     262.4  any cpu
>> dom0             0     3     3   r--     230.8  any cpu
>> netvm            1     0     0   -b-      18.4  any cpu
>> netvm            1     1     0   -b-       9.1  any cpu
>> netvm            1     2     0   -b-       7.1  any cpu
>> netvm            1     3     0   -b-       5.4  any cpu
>> firewallvm       2     0     0   -b-      10.7  any cpu
>> firewallvm       2     1     0   -b-       3.0  any cpu
>> firewallvm       2     2     0   -b-       2.5  any cpu
>> firewallvm       2     3     3   -b-       3.6  any cpu
>>
>> If I remove some CPU from Pool-0 and re-add it, things go back to normal
>> for this particular CPU (so I get two equally used CPUs) - to fully restore
>> the system I must remove all CPUs but CPU0 from Pool-0 and add them again.
>>
>> Also, still only CPU0 has all C-states (C0-C3); all the others have only
>> C0-C1. This could probably be fixed by your "xen: Re-upload processor PM
>> data to hypervisor after S3 resume" patch (a reload of the
>> xen-acpi-processor module helps here). But I don't think that is the right
>> way: it isn't necessary on other systems (with somewhat older hardware),
>> so something must be missing on the resume path. The question is what...
>>
>> Perhaps someone needs to go through enable_nonboot_cpus() (__cpu_up?) and
>> check whether it restores everything disabled in disable_nonboot_cpus()
>> (__cpu_disable?). Unfortunately I don't know the x86 details well enough
>> to follow that code...
>
> To summarize the ACPI S3 issues:
>
> I. Fixed issues:
>
> 1. IRQ problem, fixed by the "x86: irq_move_cleanup_interrupt() must ignore
>    legacy vectors" commit
> 2. Assertion failure on resume when vcpu affinity is used, fixed by the
>    "x86/S3: Restore broken vcpu affinity on resume" commit
>
> II. Not (fully) fixed issues:
>
> 1. CPU Pool-0 contains only CPU0 after resume - the patch quoted above fixes
>    the issue, but it isn't applied to xen-unstable
> 2. After resume the scheduler chooses (almost) only CPU0 (listing quoted
>    above). Removing and re-adding all CPUs to Pool-0 solves the problem.
>    Perhaps some timers are not restarted after resume?

Marek,
Please try the patch from this thread to see if it solves your two issues
above:
http://markmail.org/thread/35ecqimv7bwq3k6d

This patch was NAK'ed due to cpupool breakage... but in my testing, it solved
both of these problems.

I don't know how to properly solve it in a cpupool-compatible way... but I
also haven't put much additional effort into doing so.

> 3. ACPI C-states are only present for CPU0 (after resume, of course), fixed
>    by Ben's "xen: Re-upload processor PM data to hypervisor after S3" patch,
>    but it isn't in upstream Linux (nor in Konrad's acpi-s3 branches).

I don't recall seeing any ACK / NAK from Konrad on this.

Original post:
https://patchwork.kernel.org/patch/2033981/

Konrad - do you have any thoughts about incorporating this into a
future merge window?

Ben
konrad wilk
2013-Apr-15 23:51 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
>> 3. ACPI C-states are only present for CPU0 (after resume, of course),
>> fixed by Ben's "xen: Re-upload processor PM data to hypervisor after S3"
>> patch, but it isn't in upstream Linux (nor in Konrad's acpi-s3 branches).
>
> I don't recall seeing any ACK / NAK from Konrad on this.
>
> Original post:
> https://patchwork.kernel.org/patch/2033981/
>
> Konrad - do you have any thoughts about incorporating this into a
> future merge window?

Hey Ben,
I seem to have missed it.
I think the patch is missing a change to pr_backup->acpi_id = i, otherwise it
would resend the C-states with the same APIC ID. Also, the upstream version
does kfree(pr_backup) at some point.

But more importantly, do you know why it is needed? Is the Xen hypervisor
"losing" this information because the CPUs go offline and then are onlined
again?
Ben Guthro
2013-Apr-16 00:19 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
On Tue, Apr 16, 2013 at 12:51 AM, konrad wilk <konrad.wilk@oracle.com> wrote:
>
>>> 3. ACPI C-states are only present for CPU0 (after resume, of course),
>>> fixed by Ben's "xen: Re-upload processor PM data to hypervisor after S3"
>>> patch, but it isn't in upstream Linux (nor in Konrad's acpi-s3 branches).
>>
>> I don't recall seeing any ACK / NAK from Konrad on this.
>>
>> Original post:
>> https://patchwork.kernel.org/patch/2033981/
>>
>> Konrad - do you have any thoughts about incorporating this into a
>> future merge window?
>
> Hey Ben,
> I seem to have missed it.
> I think the patch is missing a change to pr_backup->acpi_id = i, otherwise
> it would resend the C-states with the same APIC ID. Also, the upstream
> version does kfree(pr_backup) at some point.

Hmm. I'll look into this, and re-submit.

> But more importantly, do you know why it is needed? Is the Xen hypervisor
> "losing" this information because the CPUs go offline and then are onlined
> again?

It was a while ago... the first of a number of 4.2 S3-related performance
issues we were chasing, from reports by users / automated QA, where the end
result was "slow performance on S3 in XP".

As it turns out - this didn't fix the performance problem... but it also
didn't seem right.

I'm not sure if it is because the non-boot cpus are offlined... but it would
seem to make logical sense.

Ben
Ben Guthro
2013-Apr-16 00:46 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
On Tue, Apr 16, 2013 at 1:19 AM, Ben Guthro <ben@guthro.net> wrote:
> On Tue, Apr 16, 2013 at 12:51 AM, konrad wilk <konrad.wilk@oracle.com> wrote:
>>
>>>> 3. ACPI C-states are only present for CPU0 (after resume, of course),
>>>> fixed by Ben's "xen: Re-upload processor PM data to hypervisor after S3"
>>>> patch, but it isn't in upstream Linux (nor in Konrad's acpi-s3 branches).
>>>
>>> I don't recall seeing any ACK / NAK from Konrad on this.
>>>
>>> Original post:
>>> https://patchwork.kernel.org/patch/2033981/
>>>
>>> Konrad - do you have any thoughts about incorporating this into a
>>> future merge window?
>>
>> Hey Ben,
>> I seem to have missed it.
>> I think the patch is missing a change to pr_backup->acpi_id = i, otherwise
>> it would resend the C-states with the same APIC ID. Also, the upstream
>> version does kfree(pr_backup) at some point.
>
> Hmm. I'll look into this, and re-submit.

At the risk of seeming a bit dim, could you elaborate a bit here?

I'm looking at the function again, and perhaps I'm missing something.

Since xen_acpi_processor_resume() was a subset of what was done in
xen_acpi_processor_init(), I trimmed a number of things unused in the
functionality I was using.
This included the pr_backup-related things (both alloc & free).

I'm not seeing exactly what you are suggesting I am missing, if I don't even
have a pr_backup.

This usually means I overlooked something embarrassingly obvious. If you
would be so kind as to point this out so I can slap my forehead, I'd
appreciate it.

Thanks

Ben

>> But more importantly, do you know why it is needed? Is the Xen hypervisor
>> "losing" this information because the CPUs go offline and then are onlined
>> again?
>
> It was a while ago... the first of a number of 4.2 S3-related performance
> issues we were chasing, from reports by users / automated QA, where the end
> result was "slow performance on S3 in XP".
>
> As it turns out - this didn't fix the performance problem... but it also
> didn't seem right.
>
> I'm not sure if it is because the non-boot cpus are offlined... but it
> would seem to make logical sense.
>
> Ben
Marek Marczykowski
2013-Apr-16 01:02 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
On 16.04.2013 01:36, Ben Guthro wrote:
> On Mon, Apr 15, 2013 at 11:09 PM, Marek Marczykowski
> <marmarek@invisiblethingslab.com> wrote:
>> On 02.04.2013 03:13, Marek Marczykowski wrote:
>>> On 01.04.2013 15:53, Ben Guthro wrote:
>>>> On Thu, Mar 28, 2013 at 3:03 PM, Marek Marczykowski
>>>> <marmarek@invisiblethingslab.com> wrote:
>>>>> (XEN) Restoring affinity for d2v3
>>>>> (XEN) Assertion '!cpus_empty(cpus) && cpu_isset(cpu, cpus)' failed at
>>>>> sched_credit.c:481
>>>>
>>>> I think the "fix-suspend-scheduler-*" patches posted here are applicable here:
>>>> http://markmail.org/message/llj3oyhgjzvw3t23
>>>>
>>>> Specifically, I think you need this bit:
>>>>
>>>> diff --git a/xen/common/cpu.c b/xen/common/cpu.c
>>>> index 630881e..e20868c 100644
>>>> --- a/xen/common/cpu.c
>>>> +++ b/xen/common/cpu.c
>>>> @@ -5,6 +5,7 @@
>>>>  #include <xen/init.h>
>>>>  #include <xen/sched.h>
>>>>  #include <xen/stop_machine.h>
>>>> +#include <xen/sched-if.h>
>>>>
>>>>  unsigned int __read_mostly nr_cpu_ids = NR_CPUS;
>>>>  #ifndef nr_cpumask_bits
>>>> @@ -212,6 +213,8 @@ void enable_nonboot_cpus(void)
>>>>              BUG_ON(error == -EBUSY);
>>>>              printk("Error taking CPU%d up: %d\n", cpu, error);
>>>>          }
>>>> +        if (system_state == SYS_STATE_resume)
>>>> +            cpumask_set_cpu(cpu, cpupool0->cpu_valid);
>>>>      }
>>>>
>>>>      cpumask_clear(&frozen_cpus);
>>>
>>> Indeed, this makes things better, but still not ideal.
>>> Now after resume all CPUs are in Pool-0, which is good. But CPU0 is much
>>> more preferred than the others (xl vcpu-list). For example, if I start 4
>>> busy loops in dom0, I get (even after some time):
>>>
>>> [user@dom0 ~]$ xl vcpu-list
>>> Name            ID  VCPU   CPU State   Time(s) CPU Affinity
>>> dom0             0     0     0   r--      98.5  any cpu
>>> dom0             0     1     0   ---     181.3  any cpu
>>> dom0             0     2     2   r--     262.4  any cpu
>>> dom0             0     3     3   r--     230.8  any cpu
>>> netvm            1     0     0   -b-      18.4  any cpu
>>> netvm            1     1     0   -b-       9.1  any cpu
>>> netvm            1     2     0   -b-       7.1  any cpu
>>> netvm            1     3     0   -b-       5.4  any cpu
>>> firewallvm       2     0     0   -b-      10.7  any cpu
>>> firewallvm       2     1     0   -b-       3.0  any cpu
>>> firewallvm       2     2     0   -b-       2.5  any cpu
>>> firewallvm       2     3     3   -b-       3.6  any cpu
>>>
>>> If I remove some CPU from Pool-0 and re-add it, things go back to normal
>>> for this particular CPU (so I get two equally used CPUs) - to fully
>>> restore the system I must remove all CPUs but CPU0 from Pool-0 and add
>>> them again.
>>>
>>> Also, still only CPU0 has all C-states (C0-C3); all the others have only
>>> C0-C1. This could probably be fixed by your "xen: Re-upload processor PM
>>> data to hypervisor after S3 resume" patch (a reload of the
>>> xen-acpi-processor module helps here). But I don't think that is the
>>> right way: it isn't necessary on other systems (with somewhat older
>>> hardware), so something must be missing on the resume path. The question
>>> is what...
>>>
>>> Perhaps someone needs to go through enable_nonboot_cpus() (__cpu_up?) and
>>> check whether it restores everything disabled in disable_nonboot_cpus()
>>> (__cpu_disable?). Unfortunately I don't know the x86 details well enough
>>> to follow that code...
>>
>> To summarize the ACPI S3 issues:
>>
>> I. Fixed issues:
>>
>> 1. IRQ problem, fixed by the "x86: irq_move_cleanup_interrupt() must
>>    ignore legacy vectors" commit
>> 2. Assertion failure on resume when vcpu affinity is used, fixed by the
>>    "x86/S3: Restore broken vcpu affinity on resume" commit
>>
>> II. Not (fully) fixed issues:
>>
>> 1. CPU Pool-0 contains only CPU0 after resume - the patch quoted above
>>    fixes the issue, but it isn't applied to xen-unstable
>> 2. After resume the scheduler chooses (almost) only CPU0 (listing quoted
>>    above). Removing and re-adding all CPUs to Pool-0 solves the problem.
>>    Perhaps some timers are not restarted after resume?
>
> Marek,
> Please try the patch from this thread to see if it solves your two issues
> above:
> http://markmail.org/thread/35ecqimv7bwq3k6d
>
> This patch was NAK'ed due to cpupool breakage... but in my testing, it
> solved both of these problems.
>
> I don't know how to properly solve it in a cpupool-compatible way... but I
> also haven't put much additional effort into doing so.

Indeed, this makes the problem disappear.

--
Best Regards / Pozdrawiam,
Marek Marczykowski
Invisible Things Lab
konrad wilk
2013-Apr-16 03:20 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
On 4/15/2013 8:46 PM, Ben Guthro wrote:
> On Tue, Apr 16, 2013 at 1:19 AM, Ben Guthro <ben@guthro.net> wrote:
>> On Tue, Apr 16, 2013 at 12:51 AM, konrad wilk <konrad.wilk@oracle.com> wrote:
>>>>> 3. ACPI C-states are only present for CPU0 (after resume, of course),
>>>>> fixed by Ben's "xen: Re-upload processor PM data to hypervisor after
>>>>> S3" patch, but it isn't in upstream Linux (nor in Konrad's acpi-s3
>>>>> branches).
>>>> I don't recall seeing any ACK / NAK from Konrad on this.
>>>>
>>>> Original post:
>>>> https://patchwork.kernel.org/patch/2033981/
>>>>
>>>> Konrad - do you have any thoughts about incorporating this into a
>>>> future merge window?
>>>
>>> Hey Ben,
>>> I seem to have missed it.
>>> I think the patch is missing a change to pr_backup->acpi_id = i,
>>> otherwise it would resend the C-states with the same APIC ID. Also, the
>>> upstream version does kfree(pr_backup) at some point.
>> Hmm. I'll look into this, and re-submit.
> At the risk of seeming a bit dim, could you elaborate a bit here?

Part of what xen-acpi-processor has to deal with is the 'dom0_max_vcpus='
case. Which means that when 'acpi_processor_get_performance_info' is called
to parse ACPI C-states, it will limit itself to only the 'online' CPUs it
sees. Meaning that all the other ones (which might be physically present)
which Linux does not see are skipped. As such there is this:

545         if (!pr_backup) {
546                 pr_backup = kzalloc(sizeof(struct acpi_processor), GFP_KERNEL);
547                 if (pr_backup)
548                         memcpy(pr_backup, _pr, sizeof(struct acpi_processor));
549         }

And then later

552         rc = check_acpi_ids(pr_backup);

which walks the ACPI namespace, checking whether it has uploaded the ACPI IDs
for all the CPUs. If there are some that are missing (because
dom0_max_vcpus=X was used), then it uploads the pr_backup with the ACPI ID
altered.

What I think you ought to try is just to call check_acpi_ids() after the
for_each_online_cpu() loop with the pr_backup.

Hm, you could actually make this even easier. Just move this code:

539         for_each_possible_cpu(i) {
540                 struct acpi_processor *_pr;
541                 _pr = per_cpu(processors, i /* APIC ID */);
542                 if (!_pr)
543                         continue;
544
545                 if (!pr_backup) {
546                         pr_backup = kzalloc(sizeof(struct acpi_processor), GFP_KERNEL);
547                         if (pr_backup)
548                                 memcpy(pr_backup, _pr, sizeof(struct acpi_processor));
549                 }
550                 (void)upload_pm_data(_pr);
551         }
552         rc = check_acpi_ids(pr_backup);

into its own function. Then make both the module loading _and_ the syscore
resume call said function. Voila!

Naturally the kfree(pr_backup) and pr_backup = NULL have to be eliminated
from the module_init function... and the module_exit needs the kfree of
pr_backup moved past the syscore_unregister.
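The refactor Konrad describes - factoring the upload loop into one function called from both module init and the syscore resume hook, with pr_backup allocated once and kept alive - can be sketched with a stand-alone model. This is only an illustration in plain C with stub types and stand-in names (`processors`, `upload_pm_data`, `check_acpi_ids` mimic the driver's identifiers but are not the real kernel API):

```c
#include <stdlib.h>
#include <string.h>

/* Minimal stand-ins for the kernel structures involved. */
struct acpi_processor { int acpi_id; };

#define NR_POSSIBLE_CPUS 4
/* Stand-in for per_cpu(processors, i): NULL where Linux sees no CPU. */
static struct acpi_processor *processors[NR_POSSIBLE_CPUS];
static struct acpi_processor *pr_backup;  /* allocated once, never freed early */
static int uploads;                       /* counts uploads, for illustration */

static void upload_pm_data(struct acpi_processor *pr) { (void)pr; uploads++; }

/* Stand-in for check_acpi_ids(): covers CPUs Linux doesn't see
 * (e.g. when dom0_max_vcpus=X hides some of them). */
static int check_acpi_ids(struct acpi_processor *tmpl) { return tmpl ? 0 : -1; }

/* The factored-out loop: callable from module init AND syscore resume. */
int upload_all_pm_data(void)
{
    for (int i = 0; i < NR_POSSIBLE_CPUS; i++) {
        struct acpi_processor *_pr = processors[i];
        if (!_pr)
            continue;
        if (!pr_backup) {                 /* keep one template around */
            pr_backup = malloc(sizeof(*pr_backup));
            if (pr_backup)
                memcpy(pr_backup, _pr, sizeof(*_pr));
        }
        upload_pm_data(_pr);
    }
    return check_acpi_ids(pr_backup);
}
```

Calling `upload_all_pm_data()` a second time (the "resume" call) simply re-uploads everything, which is exactly the behavior wanted after S3; in the real module the backup would only be `kfree`'d in module_exit, after unregistering the syscore ops.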
Jan Beulich
2013-Apr-16 08:47 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
>>> On 16.04.13 at 00:09, Marek Marczykowski <marmarek@invisiblethingslab.com> wrote:
> II. Not (fully) fixed issues:
>
> 1. CPU Pool-0 contains only CPU0 after resume - the patch quoted above fixes
>    the issue, but it isn't applied to xen-unstable
> 2. After resume the scheduler chooses (almost) only CPU0 (listing quoted
>    above). Removing and re-adding all CPUs to Pool-0 solves the problem.
>    Perhaps some timers are not restarted after resume?

So I understand there is a patch dealing with this, but I'm not clear
whether that's known to break CPU pools?

> 3. ACPI C-states are only present for CPU0 (after resume, of course), fixed
>    by Ben's "xen: Re-upload processor PM data to hypervisor after S3" patch,
>    but it isn't in upstream Linux (nor in Konrad's acpi-s3 branches).

Perhaps this rather ought to be fixed in the hypervisor (to not
forget the respective information; perhaps also for P-states)?
After all, that's another case where S3 is different from soft or hard
offlining an individual CPU (in particular, we can expect the same
CPU to come back up during resume, whereas a hot-unplugged one
could get replaced by a [slightly] different one).

Jan
Ben Guthro
2013-Apr-16 11:49 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
On Tue, Apr 16, 2013 at 4:47 AM, Jan Beulich <JBeulich@suse.com> wrote:
>>>> On 16.04.13 at 00:09, Marek Marczykowski <marmarek@invisiblethingslab.com> wrote:
>> II. Not (fully) fixed issues:
>>
>> 1. CPU Pool-0 contains only CPU0 after resume - the patch quoted above
>>    fixes the issue, but it isn't applied to xen-unstable
>> 2. After resume the scheduler chooses (almost) only CPU0 (listing quoted
>>    above). Removing and re-adding all CPUs to Pool-0 solves the problem.
>>    Perhaps some timers are not restarted after resume?
>
> So I understand there is a patch dealing with this, but I'm not clear
> whether that's known to break CPU pools?

All cpus will end up in cpu pool 0 after S3.
I'm not sure that is "broken" - but it probably isn't ideal either.

IMO - it is better than the alternative state... but Juergen seems to
disagree.

>> 3. ACPI C-states are only present for CPU0 (after resume, of course),
>>    fixed by Ben's "xen: Re-upload processor PM data to hypervisor after
>>    S3" patch, but it isn't in upstream Linux (nor in Konrad's acpi-s3
>>    branches).
>
> Perhaps this rather ought to be fixed in the hypervisor (to not
> forget the respective information; perhaps also for P-states)?
> After all, that's another case where S3 is different from soft or hard
> offlining an individual CPU (in particular, we can expect the same
> CPU to come back up during resume, whereas a hot-unplugged one
> could get replaced by a [slightly] different one).
>
> Jan
Jan Beulich
2013-Apr-16 11:57 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
>>> On 16.04.13 at 13:49, Ben Guthro <ben@guthro.net> wrote:
> On Tue, Apr 16, 2013 at 4:47 AM, Jan Beulich <JBeulich@suse.com> wrote:
>>>>> On 16.04.13 at 00:09, Marek Marczykowski
>>>>> <marmarek@invisiblethingslab.com> wrote:
>>> II. Not (fully) fixed issues:
>>>
>>> 1. CPU Pool-0 contains only CPU0 after resume - the patch quoted above
>>>    fixes the issue, but it isn't applied to xen-unstable
>>> 2. After resume the scheduler chooses (almost) only CPU0 (listing quoted
>>>    above). Removing and re-adding all CPUs to Pool-0 solves the problem.
>>>    Perhaps some timers are not restarted after resume?
>>
>> So I understand there is a patch dealing with this, but I'm not clear
>> whether that's known to break CPU pools?
>
> All cpus will end up in cpu pool 0 after S3.
> I'm not sure that is "broken" - but it probably isn't ideal either.
>
> IMO - it is better than the alternative state... but Juergen seems to
> disagree.

But it can't be that difficult to save/restore pool association on top
of said patch?

Jan
Ben Guthro
2013-Apr-16 12:09 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
On Tue, Apr 16, 2013 at 7:57 AM, Jan Beulich <JBeulich@suse.com> wrote:
>>>> On 16.04.13 at 13:49, Ben Guthro <ben@guthro.net> wrote:
>> On Tue, Apr 16, 2013 at 4:47 AM, Jan Beulich <JBeulich@suse.com> wrote:
>>>>>> On 16.04.13 at 00:09, Marek Marczykowski
>>>>>> <marmarek@invisiblethingslab.com> wrote:
>>>> II. Not (fully) fixed issues:
>>>>
>>>> 1. CPU Pool-0 contains only CPU0 after resume - the patch quoted above
>>>>    fixes the issue, but it isn't applied to xen-unstable
>>>> 2. After resume the scheduler chooses (almost) only CPU0 (listing quoted
>>>>    above). Removing and re-adding all CPUs to Pool-0 solves the problem.
>>>>    Perhaps some timers are not restarted after resume?
>>>
>>> So I understand there is a patch dealing with this, but I'm not clear
>>> whether that's known to break CPU pools?
>>
>> All cpus will end up in cpu pool 0 after S3.
>> I'm not sure that is "broken" - but it probably isn't ideal either.
>>
>> IMO - it is better than the alternative state... but Juergen seems to
>> disagree.
>
> But it can't be that difficult to save/restore pool association on top
> of said patch?

I took a brief look, in the hopes of taking a similar tack as with the vcpu
affinity restoration.
However, it seems to be a slightly more difficult problem.
With the vcpu affinity, there was an existing structure to stash away the
information we needed after resume.

In a pcpu, there is no such associated metadata... the SMP processor id is
just an integer.
So - where would we store the pool information temporarily across the S3
process?

Ben
Jan Beulich
2013-Apr-16 12:51 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
>>> On 16.04.13 at 14:09, Ben Guthro <ben@guthro.net> wrote:
> On Tue, Apr 16, 2013 at 7:57 AM, Jan Beulich <JBeulich@suse.com> wrote:
>>>>> On 16.04.13 at 13:49, Ben Guthro <ben@guthro.net> wrote:
>>> On Tue, Apr 16, 2013 at 4:47 AM, Jan Beulich <JBeulich@suse.com> wrote:
>>>>>>> On 16.04.13 at 00:09, Marek Marczykowski
>>>>>>> <marmarek@invisiblethingslab.com> wrote:
>>>>> II. Not (fully) fixed issues:
>>>>>
>>>>> 1. CPU Pool-0 contains only CPU0 after resume - the patch quoted above
>>>>>    fixes the issue, but it isn't applied to xen-unstable
>>>>> 2. After resume the scheduler chooses (almost) only CPU0 (listing
>>>>>    quoted above). Removing and re-adding all CPUs to Pool-0 solves the
>>>>>    problem. Perhaps some timers are not restarted after resume?
>>>>
>>>> So I understand there is a patch dealing with this, but I'm not clear
>>>> whether that's known to break CPU pools?
>>>
>>> All cpus will end up in cpu pool 0 after S3.
>>> I'm not sure that is "broken" - but it probably isn't ideal either.
>>>
>>> IMO - it is better than the alternative state... but Juergen seems to
>>> disagree.
>>
>> But it can't be that difficult to save/restore pool association on top
>> of said patch?
>
> I took a brief look, in the hopes of taking a similar tack as with the vcpu
> affinity restoration.
> However, it seems to be a slightly more difficult problem.
> With the vcpu affinity, there was an existing structure to stash away the
> information we needed after resume.
>
> In a pcpu, there is no such associated metadata... the SMP processor id is
> just an integer.
> So - where would we store the pool information temporarily across the S3
> process?

Do it the other way around - the CPU pools have a mask of valid CPUs.
You could latch those pre-suspend for each of the pools (e.g. by again
introducing a second mask hanging off the same structure).

(Also adding Juergen to Cc in case he has other thoughts.)

Jan
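Jan's idea - latching each pool's valid-CPU mask before suspend and consulting it as CPUs come back up - can be modeled in a few lines. This is a toy model in plain C with 64-bit bitmasks standing in for Xen's cpumasks; the field name `cpu_suspended` and the function names are illustrative, not Xen's actual API:

```c
#include <stdint.h>
#include <stddef.h>

/* Toy cpupool: cpu_valid is the live mask; cpu_suspended is the second
 * mask hanging off the same structure, latched pre-suspend. */
struct cpupool {
    uint64_t cpu_valid;      /* CPUs currently in the pool */
    uint64_t cpu_suspended;  /* pre-suspend snapshot of cpu_valid */
};

/* Called once before S3: remember which CPUs belonged to which pool. */
void cpupool_presuspend(struct cpupool *pools, size_t n)
{
    for (size_t i = 0; i < n; i++)
        pools[i].cpu_suspended = pools[i].cpu_valid;
}

/* Called from the resume path as each non-boot CPU comes back online:
 * put it back into the pool it came from, not unconditionally Pool-0. */
void cpupool_resume_cpu(struct cpupool *pools, size_t n, unsigned int cpu)
{
    for (size_t i = 0; i < n; i++)
        if (pools[i].cpu_suspended & (1ull << cpu))
            pools[i].cpu_valid |= 1ull << cpu;
}
```

This is essentially the save/restore step layered on top of the `enable_nonboot_cpus()` patch quoted earlier in the thread: instead of hard-coding `cpupool0`, the resume hook would look the CPU up in the latched masks.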