Marek Marczykowski
2013-Mar-13 20:50 UTC
High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
Hi,

I still have problems with ACPI(?) on Xen. After some system startups or
resumes the CPU temperature goes high although all domUs (and dom0) are idle.
On a "good" system startup it is about 50-55C, on a "bad" one above 67C (most
of the time above 70C). I've noticed a difference in the C-states reported by
Xen (attached files). On "bad" startups suspend additionally doesn't work -
the system restarts during suspend (I still haven't managed to get console
messages - I don't have a serial port on this system). Note that sometimes
the system boots fine ("good" state), but the problem occurs after some
suspend/resume cycles. Some time ago I got other symptoms: only CPU0 was
used - for all VCPUs (according to xl vcpu-list). Maybe it is related?

Hardware: Dell Latitude E6420
CPU: Intel i5-2520M

Software:
xen stable-4.1 as of 15.02 (last commit: "xen: sched_creadit: improve picking
up the idle CPU for a VCPU"), with the commit "Introduce system_state
variable." reverted. But the same problem occurs on vanilla xen 4.1.2.

Linux 3.7.6 - happens almost every boot. On Linux 3.7.4 it happens much more
rarely (but still occurs).
Kernel config:
http://git.qubes-os.org/gitweb/?p=marmarek/kernel.git;a=blob;f=config-pvops;h=a6e953f71cdc84556571b592b8af87a5a4f9a8d0;hb=HEAD
I've tried to bisect from 3.7.4 to 3.7.6, but without success because the
problem isn't 100% reproducible.

Any ideas?

-- 
Best Regards / Pozdrawiam,
Marek Marczykowski
Invisible Things Lab

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
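The "good" vs "bad" temperatures mentioned here can be watched from dom0 via
sysfs, which reports millidegrees Celsius. A minimal conversion sketch; the
thermal_zone path is an assumption (hwmon layout varies by platform), and a
stand-in file is used so the snippet runs anywhere:

```shell
# On a real system the reading would come from e.g.
# /sys/class/thermal/thermal_zone0/temp, in millidegrees Celsius.
# A stand-in file is used here so the snippet is runnable as-is.
tz=$(mktemp -d)
printf '67000\n' > "$tz/temp"                 # a "bad"-boot reading from this thread
awk '{ printf "%.1f C\n", $1 / 1000 }' "$tz/temp"   # -> 67.0 C
```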
Dario Faggioli
2013-Mar-15 03:00 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
On mer, 2013-03-13 at 21:50 +0100, Marek Marczykowski wrote:
> Hi,
>
> I've still have problems with ACPI(?) on Xen. After some system startup or
> resume CPU temperature goes high although all domUs (and dom0) are idle.
>

Resume? Sorry for going a bit off-topic (or, if you want, for not being able
to help with the issue you're seeing), but does that mean suspend/resume
works for you under Xen?

That would be really nice, as I've never seen it working properly... Is it
me that is missing something? :-O

Actually, now that I think of it, there was a guy at FOSDEM with QubesOS
installed on his laptop telling us suspend was working for him, but I've
never had the chance to try it yet.

Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
Marek Marczykowski
2013-Mar-15 03:22 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
On 15.03.2013 04:00, Dario Faggioli wrote:
> Resume? Sorry for going a bit off-topic (or, if you want, for not being
> able to help with the issue you're seeing), but that means
> suspend/resume works for you under Xen?

Yes, with patches from Konrad's devel/acpi-s3.v10 branch. Actually one of
those patches looks to be already in upstream Linux, but the two remaining
ones still need to be applied.

-- 
Best Regards / Pozdrawiam,
Marek Marczykowski
Invisible Things Lab
Konrad Rzeszutek Wilk
2013-Mar-15 13:02 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
On Wed, Mar 13, 2013 at 09:50:39PM +0100, Marek Marczykowski wrote:
> Hi,
>
> I've still have problems with ACPI(?) on Xen. After some system startup or
> resume CPU temperature goes high although all domUs (and dom0) are idle. On
> "good" system startup it is about 50-55C, on "bad" - above 67C (most time
> above 70C). I've noticed difference in C-states reported by Xen (attached
> files).
.. snip ..
> Any ideas?

That C-states difference is important. The SYSIO part on your box means that
the CPU ends up doing an MWAIT. A HALT, on the other hand, is not so
power-saving friendly.

Looking at this:
> (XEN) no cpu_id for acpi_id 5
> (XEN) no cpu_id for acpi_id 6
> (XEN) no cpu_id for acpi_id 7
> (XEN) no cpu_id for acpi_id 8

.. it means that xen-acpi-processor was trying to probe for the ACPI IDs of
the other CPUs that the machine theoretically can support. That means it got
the ACPI information for the first four CPUs (which is good).

As a first step in trying to figure this out, you can add #define DEBUG 1 in
xen-acpi-processor.c right before any of the #includes, and also boot Xen
with 'cpufreq=verbose'. That should tell you what kind of C-states
xen-acpi-processor uploaded (and whether it did so for all of the vCPUs).

If both bootups show that we do upload the C-states for all the CPUs, but
they vary, that means digging a bit deeper into the ACPI code - specifically
into acpi_processor_get_power_info_cst, to see whether it hits any of the
'continue' statements.

Then I would say also take the DSDT for both bootups and compare them. It
might be that the BIOS is using a scratch register at reboot to construct the
C-states and somehow it ends up being corrupted, which means that on the next
warm reboot the C-states have bogus data. This does show up in the field :-(
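The two steps above can be sketched in shell. This is a hedged illustration
only: the sed runs against a scratch copy rather than a real kernel tree
(where the file lives under drivers/xen/), and the boot-loader line is shown
as a comment:

```shell
# Step 1: prepend '#define DEBUG 1' before any #include in
# xen-acpi-processor.c. A stand-in copy is created so this runs anywhere.
work=$(mktemp -d)
src="$work/xen-acpi-processor.c"
printf '#include <linux/kernel.h>\n' > "$src"     # stand-in file contents
sed -i '1i #define DEBUG 1' "$src"                # GNU sed syntax
head -n 1 "$src"                                  # -> #define DEBUG 1

# Step 2: boot Xen with verbose cpufreq logging, e.g. on the hypervisor
# command line in the boot loader entry:
#   /boot/xen.gz cpufreq=verbose ...
```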
Marek Marczykowski
2013-Mar-22 15:34 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
On 15.03.2013 14:02, Konrad Rzeszutek Wilk wrote:
> On Wed, Mar 13, 2013 at 09:50:39PM +0100, Marek Marczykowski wrote:
.. snip ..
> You can as the first step in trying to figure this out, add #define DEBUG 1
> in xen-acpi-processor.c right before any of the #includes. And also boot
> Xen with 'cpufreq=verbose'. That should tell you what kind of C-states the
> xen-acpi-processor uploaded (And if it did it for all of the vCPUS).
>
> If both bootups show that we do upload the C-states for all the CPUs but they
> vary that means digging a bit deeper in the ACPI code. Specifically in
> acpi_processor_get_power_info_cst and seeing if it hits any of the 'continue'.
>
> Then I would say take also the DSDT for both bootups and compare them. It might
> be that the BIOS is using a scratch register at reboot to construct the C-states
> and somehow it ends up being corrupted. Which means that on the next warm reboot
> the C-states has bogus data. This does show up in the field :-(

Finally I've found some time for further debugging of this. And it looks like
a deeper ACPI code problem...

I've switched to 3.8.4, on which the problem is much easier to reproduce
(almost every startup).

On a bad bootup, xen-acpi-processor didn't find any C-state: for each CPU,
_pr->flags.power and _pr->power.count were 0 (but flags.power_setup_done=1).
In this case suspend (or shutdown) always ends up with a reset.

On a good one, xen-acpi-processor got C1-C3 states for each CPU, and suspend
succeeded, but after resume CPU0 had C1-C3 while the others had only C1.
Reloading xen-acpi-processor (rmmod -f ...) fixes this (according to
xl debug-key c), but the temperature still stays high. Regardless of
xen-acpi-processor reloading, the next suspend always fails.

Not sure how C-states can be related to S3 suspend, but perhaps something
more general with ACPI is wrong?

Each time the DSDT (taken from /sys/firmware/acpi/tables) is exactly the same.

-- 
Best Regards / Pozdrawiam,
Marek Marczykowski
Invisible Things Lab
Konrad Rzeszutek Wilk
2013-Mar-22 16:56 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
On Fri, Mar 22, 2013 at 04:34:11PM +0100, Marek Marczykowski wrote:
> Finally I've found some time for further debugging this. And it looks like
> some deeper ACPI code problem...
>
> I've switched to 3.8.4, on which problem is much easier to reproduce (almost
> every startup).
>
> On bad bootup, xen-acpi-processor didn't found any C-state: for each CPU
> _pr->flags.power and _pr->power.count was 0 (but flags.power_setup_done=1). In
> this case suspend (or shutdown) always ends up with reset.

Is this you booting the machine from a cold state or a warm one?

There are some BIOSes out there that I know use the scratchpad registers in
the IOH (so depending on the platform that can be 0:0e.1, Reg 0x84). If Xen
or Linux touch it, then the P-states and C-states that the BIOS generates
are buggy.

But that is not the case here - you are saying that after disassembling the
DSDT (so cat /sys/firmware/acpi/tables/DSDT, or SSDT*, and then iasl -d on
them), the _PSD, _PSS, and _PCT look the same?

You could also look at the FACP table and see if they are different.

> On good one xen-acpi-processor got C1-C3 states for each CPU, then suspend
> succeeded, but after resume CPU0 had C1-C3, but others only C1. Reloading
> xen-acpi-processor (rmmod -f...) fixes this (according to xl debug-key c), but
> still temperature keep high. Regardless of xen-acpi-processor reloading, next
> suspend always fails.

If you reload and look at the runqueues, are all of them using the ACPI
idler or the default one?

> Not sure how C-states can be related to S3 suspend, but perhaps something more
> general with ACPI is wrong?

This reminds me of something. I recall seeing something like this a long,
long time ago... Completely forgot about it until now. The difference was
whether Xen's cpu_idle was running a) the acpi_idle (so using the different
C-states), or b) the default one (so just using HLT).

With b), during resume it would get half-way through
(http://darnok.org/xen/devel.acpi-s3.v1.serial.log) while with a) it would
actually continue on - http://darnok.org/xen/devel.acpi-s3.v0.serial.log

This was on some MSI MS-7680/H61M-P23 (MS-7680) motherboard.

Oh look: http://lists.xen.org/archives/html/xen-devel/2011-06/msg02059.html

And it looks like Kevin's recommendation was to use the a) case with
max_cstates=1 to narrow it down.
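The table comparison described above can be done mechanically. A sketch; the
good/ and bad/ directories and their contents are stand-ins so the snippet is
runnable as-is (on the real machine you would copy /sys/firmware/acpi/tables/*
on each boot, and iasl comes from the ACPICA tools):

```shell
# Compare ACPI tables saved from a "good" boot and a "bad" boot.
work=$(mktemp -d)
mkdir -p "$work/good" "$work/bad"
printf 'DSDT-bytes' > "$work/good/DSDT"    # stand-ins for:
printf 'DSDT-bytes' > "$work/bad/DSDT"     #   cp /sys/firmware/acpi/tables/DSDT good/
if cmp -s "$work/good/DSDT" "$work/bad/DSDT"; then
    echo "DSDT identical"
else
    # Only then is disassembly interesting:
    #   iasl -d DSDT && diff good/DSDT.dsl bad/DSDT.dsl   (check _PSD/_PSS/_PCT)
    echo "DSDT differs"
fi
```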
Marek Marczykowski
2013-Mar-25 11:36 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
On 22.03.2013 17:56, Konrad Rzeszutek Wilk wrote:
> On Fri, Mar 22, 2013 at 04:34:11PM +0100, Marek Marczykowski wrote:
>> On bad bootup, xen-acpi-processor didn't found any C-state: for each CPU
>> _pr->flags.power and _pr->power.count was 0 (but flags.power_setup_done=1). In
>> this case suspend (or shutdown) always ends up with reset.
>
> This is you booting the machine from a cold-state or a warm one?

Doesn't matter - in both cases the same result.

> There are some BIOSes out there that I know that use the scratchpad registers in
> IOH (so depending on the platform that can be 0:0e.1, Reg 0x84). If Xen or Linux
> touch it then the P-states and C-states that the BIOS generates are buggy.
>
> But that is not the case here - you are saying that the DSDT after disassembling
> (so cat /sys/firmware/acpi/tables/DSDT, or SSDT* and the iasl -d on them), the
> _PSD, _PSS, and _PCT look the same?

The binary versions are the same, so I assume the disassembled ones are too.
I've copied the full /sys/firmware/acpi/tables at several startups and in all
cases (both cold and warm startups) they were all the same. If I ever notice
a difference, I will check the disassembled versions.

> You could also look at the FACP table and see if they are different.
>
> If you reload, and look at the runqueues, are all of them using the ACPI
> idler or the default one?

The ACPI one (both before the reload and after).

> This reminds me of something. I recall a long long time ago seeing something like this....
.. snip ..
> And it looks Kevin's recommendation was use the a) case with max_cstates=1
> to narrow it down.

When default_idle is used, resume doesn't work at all (even the first one).
Details:

(1) With max_cstates=1, without the xen-acpi-processor module: default_idle
used. Suspend succeeds, but resume always hangs.

(2) With max_cstate=1, with the xen-acpi-processor module loaded: acpi_idle
used. Suspend succeeds, resume too, but after resume the problem described
above exists (high temperature, C2-C3 states only present on CPU0, and
subsequent suspends always end up with a reboot).

(3) Without max_cstate=1, with the xen-acpi-processor module loaded: same
as (2).

(4) Without max_cstate=1, without the xen-acpi-processor module loaded: same
as (1).

One more observation: when Xen is compiled with debug=y, cases (2) and (4)
behave the same as (1).

Hopefully I will have a real serial console sometime this week and will be
able to get more details from the hang and reboot cases.

BTW, any chance of the Xen ACPI S3 patches landing in the upstream kernel?

-- 
Best Regards / Pozdrawiam,
Marek Marczykowski
Invisible Things Lab
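The option combinations enumerated above are toggled on the hypervisor line
of the boot loader entry. An illustrative GRUB legacy style fragment; the
paths are assumptions, and note the thread itself mixes the max_cstate and
max_cstates spellings, so check the command-line documentation for your Xen
version:

```
title Xen (C-state debugging)
    kernel /boot/xen.gz max_cstates=1 cpufreq=verbose
    module /boot/vmlinuz-3.8.4 console=hvc0
    module /boot/initrd-3.8.4.img
```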
Konrad Rzeszutek Wilk
2013-Mar-25 14:17 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
On Mon, Mar 25, 2013 at 12:36:31PM +0100, Marek Marczykowski wrote:
> On 22.03.2013 17:56, Konrad Rzeszutek Wilk wrote:
>> This is you booting the machine from a cold-state or a warm one?
>
> Doesn't matter - in both cases the same result.
>
>> But that is not the case here - you are saying that the DSDT after disassembling
>> (so cat /sys/firmware/acpi/tables/DSDT, or SSDT* and the iasl -d on them), the
>> _PSD, _PSS, and _PCT look the same?
>
> Binary versions are the same so assume disassembled also. I've copied full
> /sys/firmware/acpi/tables at some startups and in all cases (both cold and
> warm startups) all were the same.

<sigh> I was hoping it was something as simple as that :-)

.. snip ..

> When default_idle used, resume doesn't work at all (even the first one). Details:
> (1) With max_cstates=1, without xen-acpi-processor module: default_idle used.
> Suspend succeed, but always hang at resume.

AHA! So the bug persists.

> (2) With max_cstate=1, with xen-acpi-processor module loaded: acpi_idle used.
> Suspend succeed, resume also, but after resume above problem exists (high
> temperature, C2-C3 states only present on CPU0, subsequent suspends always
> ends up with reboot).
>
> (3) Without max_cstate=1, with xen-acpi-processor module loaded: same as (2).
>
> (4) Without max_cstate=1, without xen-acpi-processor module loaded: same as (1).
>
> One more observation: when xen compiled with debug=y, (2) and (4) cases
> behaves the same as (1).

Oh, that is something new.

> Hopefully I will have real serial console somehow in this week and will be
> able to get more details from hang and reboot cases.
>
> BTW Any chances for Xen ACPI S3 patches in upstream kernel?

<sigh> Now that the regression storm of v3.9 has subsided I should have some
breathing room to address that.
Marek Marczykowski
2013-Mar-25 14:56 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
On 25.03.2013 15:17, Konrad Rzeszutek Wilk wrote:
> On Mon, Mar 25, 2013 at 12:36:31PM +0100, Marek Marczykowski wrote:
>> When default_idle used, resume doesn't work at all (even the first one). Details:
>> (1) With max_cstates=1, without xen-acpi-processor module: default_idle used.
>> Suspend succeed, but always hang at resume.
>
> AHA! So the bug persist.
>
>> One more observation: when xen compiled with debug=y, (2) and (4) cases
>> behaves the same as (1).
>
> Oh, that is something new.

I've also tried some (automated :)) bisection on Xen from 4.1.2 to 4.1.4,
but unfortunately the results weren't deterministic... My script doesn't
distinguish between the different symptoms (reboot at suspend, hang at
resume, incomplete C-states after resume, etc.), which may be the reason for
the non-deterministic results... One time I got this commit as the first bad
one:

commit 329d4280255ff44300913f24119f52d3459c1ed0
Author: Jan Beulich <jbeulich@suse.com>
Date:   Tue Apr 17 08:33:33 2012 +0100

    XENPF_set_processor_pminfo XEN_PM_CX overflows states array

Maybe related?

>> Hopefully I will have real serial console somehow in this week and will be
>> able to get more details from hang and reboot cases.
>>
>> BTW Any chances for Xen ACPI S3 patches in upstream kernel?
>
> <sigh> Now that the regression storm of v3.9 has subsided I should have
> some breathing room to address that.

I keep my fingers crossed.

-- 
Best Regards / Pozdrawiam,
Marek Marczykowski
Invisible Things Lab
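Scripted bisection of this kind can be driven with git bisect run, which is
only as reliable as the good/bad test itself. A self-contained toy demo; the
repository, commits, and test predicate below are all synthetic stand-ins for
a real build-boot-and-check-suspend script (whose flaky verdicts are exactly
what makes bisect converge on the wrong commit):

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.email bisect@example.com     # hypothetical identity for the demo
git config user.name  bisect-demo
for i in 1 2 3 4 5; do
    echo "$i" > f
    git add f
    git commit -qm "commit $i"
done
# HEAD ("commit 5") is bad, HEAD~4 ("commit 1") is good.
git bisect start HEAD HEAD~4 > /dev/null
# The "test" script: exit 0 (good) while f < 3, non-zero (bad) from "commit 3"
# on. A real script would build xen, boot it, try a suspend, and report.
first_bad=$(git bisect run sh -c 'test "$(cat f)" -lt 3' \
            | sed -n 's/ is the first bad commit$//p')
echo "first bad commit: $first_bad"
```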
Marek Marczykowski
2013-Mar-26 12:17 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
On 25.03.2013 15:17, Konrad Rzeszutek Wilk wrote:
> On Mon, Mar 25, 2013 at 12:36:31PM +0100, Marek Marczykowski wrote:
>> One more observation: when xen compiled with debug=y, (2) and (4) cases
>> behaves the same as (1).
>
> Oh, that is something new.

Finally got a serial console :)
The debug=y problem is (actually at resume):

(XEN) Assertion 'test_bit(vector, cfg->used_vectors)' failed at io_apic.c:542
(XEN) ----[ Xen-4.1.5-rc1  x86_64  debug=y  Tainted:    C ]----
(XEN) CPU:    0
(XEN) RIP:    e008:[<ffff82c48015e288>] smp_irq_move_cleanup_interrupt+0x1c3/0x23d
(XEN) RFLAGS: 0000000000010046   CONTEXT: hypervisor
(XEN) rax: 0000000000000000   rbx: 00000000000000e9   rcx: ffff82c48029ff18
(XEN) rdx: 00000000000000e9   rsi: 000000000000002a   rdi: ffff830421060538
(XEN) rbp: ffff82c48029ff08   rsp: ffff82c48029feb8   r8:  ffff88041820eb60
(XEN) r9:  0000000000000000   r10: 0000000000007ff0   r11: 0000000000000000
(XEN) r12: ffff830421080250   r13: ffff830421060534   r14: ffff82c48029ff18
(XEN) r15: ffff82c4802dd9e0   cr0: 000000008005003b   cr4: 00000000000026f0
(XEN) cr3: 0000000300b81000   cr2: ffff880402070198
(XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: 0000   cs: e008
(XEN) Xen stack trace from rsp=ffff82c48029feb8:
(XEN)    0000000000000000 000000000000e030 ffff82c48029ff18 ffff82c4802dd9e0
(XEN)    ffff8802cac3c7c0 00000000ffff3729 00000000ffff3729 000000013fff3728
(XEN)    ffffffff81b907c0 00000000ffff3729 00007d3b7fd600c7 ffff82c48014de60
(XEN)    00000000ffff3729 ffffffff81b907c0 000000013fff3728 00000000ffff3729
(XEN)    ffffffff81a01e18 00000000ffff3729 0000000000000000 0000000000007ff0
(XEN)    0000000000000000 ffff88041820eb60 ffff8803fd1820a8 ffffffff81b90a88
(XEN)    000000000000002a 000000000000002a 00000000ffff372a 0000002000000000
(XEN)    ffffffff8105dd5a 000000000000e033 0000000000000246 ffffffff81a01db8
(XEN)    000000000000e02b 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 ffff8300ca9a0000 0000000000000000
(XEN)    0000000000000000
(XEN) Xen call trace:
(XEN)    [<ffff82c48015e288>] smp_irq_move_cleanup_interrupt+0x1c3/0x23d
(XEN)
(XEN)
(XEN) ****************************************
(XEN) Panic on CPU 0:
(XEN) Assertion 'test_bit(vector, cfg->used_vectors)' failed at io_apic.c:542
(XEN) ****************************************

-- 
Best Regards / Pozdrawiam,
Marek Marczykowski
Invisible Things Lab
Jan Beulich
2013-Mar-26 13:11 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
>>> On 26.03.13 at 13:17, Marek Marczykowski <marmarek@invisiblethingslab.com> wrote:
> Finally got serial console :)
> The debug=y problem is (actually at resume):
> (XEN) Assertion 'test_bit(vector, cfg->used_vectors)' failed at io_apic.c:542
> (XEN) ----[ Xen-4.1.5-rc1  x86_64  debug=y  Tainted:    C ]----
> (XEN) CPU:    0
> (XEN) RIP:    e008:[<ffff82c48015e288>] smp_irq_move_cleanup_interrupt+0x1c3/0x23d
> (XEN) rax: 0000000000000000   rbx: 00000000000000e9   rcx: ffff82c48029ff18
> (XEN) rdx: 00000000000000e9   rsi: 000000000000002a   rdi: ffff830421060538
.. snip ..
> (XEN) Panic on CPU 0:
> (XEN) Assertion 'test_bit(vector, cfg->used_vectors)' failed at io_apic.c:542
> (XEN) ****************************************

To make sense of this, we need to know the register (and maybe stack)
allocation at this point, to know which vector it was that triggered the
assertion. You can either do this analysis for us, or point us at the
xen-syms binary matching the xen.gz you used.

From the register values, the most likely candidates are vectors 0xe9 and
0x2a. The former, having two registers set to this value, seems more likely
from that angle, but vectors in the 0xe? range should never end up in
smp_irq_move_cleanup_interrupt(). And if it's the 0x2a one, then we'd need to
know which IRQ it was last used for. That can't be reconstructed from the
data above, so it would require you being able to reproduce this and adding
some instrumentation to the code.

Jan
Marek Marczykowski
2013-Mar-26 13:50 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
On 26.03.2013 14:11, Jan Beulich wrote:
>>>> On 26.03.13 at 13:17, Marek Marczykowski <marmarek@invisiblethingslab.com> wrote:
>> Finally got serial console :)
>> The debug=y problem is (actually at resume):
>> (XEN) Assertion 'test_bit(vector, cfg->used_vectors)' failed at io_apic.c:542
>> [register and stack dump snipped -- see the first message]
>
> To make sense of this, we need to know the register (and maybe
> stack) allocation at this point, to know which vector it was that
> triggered the assertion. You can either do this analysis for us, or
> point us at the xen-syms binary matching the xen.gz you used.

"info scope smp_irq_move_cleanup_interrupt" said vector is in %rbx, so 0xe9.

> From the register values, the most likely candidates are vector 0xe9
> and 0x2a. The former having two registers set to this value seems
> more likely from that angle, but vectors in the 0xe? range should
> never end up in smp_irq_move_cleanup_interrupt().
>
> And if it's the 0x2a one, then we'd need to know what IRQ it was
> last used for. That can't be reconstructed from the data above, so
> would require you being able to reproduce this and adding some
> instrumentation to the code.
>
> Jan

-- 
Best Regards / Pozdrawiam,
Marek Marczykowski
Invisible Things Lab
Andrew Cooper
2013-Mar-26 15:47 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
On 26/03/2013 13:50, Marek Marczykowski wrote:
> On 26.03.2013 14:11, Jan Beulich wrote:
>>>>> On 26.03.13 at 13:17, Marek Marczykowski <marmarek@invisiblethingslab.com> wrote:
>>> Finally got serial console :)
>>> The debug=y problem is (actually at resume):
>>> (XEN) Assertion 'test_bit(vector, cfg->used_vectors)' failed at io_apic.c:542
>>> [register and stack dump snipped -- see the first message]
>> To make sense of this, we need to know the register (and maybe
>> stack) allocation at this point, to know which vector it was that
>> triggered the assertion. You can either do this analysis for us, or
>> point us at the xen-syms binary matching the xen.gz you used.
> "info scope smp_irq_move_cleanup_interrupt" said vector is in %rbx, so 0xe9.

Could it be something to do with switching virtual wire mode, and having
PIC compatibility stuff left in the IO-APIC after leaving the BIOS but
before starting back up again?

Looking at the stack dump, there is an extra exception frame under what
is printed by the assertion failure.

0000002000000000 TRAP_syscall
ffffffff81a01db8 guest kernel addr
0000000000000246 FLAGS
000000000000e033 FLAT_RING3_CS64
ffffffff8105dd5a guest kernel addr
000000000000e02b FLAT_RING3_SS{64,32}

So it appears that we are already executing a guest (presumably dom0) by
the time this assertion occurs. From the serial, is there any indication
that dom0 has started up again?

I would have thought that we should have successfully reset the IO-APIC
back up properly before we would ever get back around to executing dom0.

~Andrew
Jan Beulich
2013-Mar-26 16:03 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
>>> On 26.03.13 at 14:50, Marek Marczykowski <marmarek@invisiblethingslab.com> wrote:
> On 26.03.2013 14:11, Jan Beulich wrote:
>>>>> On 26.03.13 at 13:17, Marek Marczykowski <marmarek@invisiblethingslab.com> wrote:
>>> Finally got serial console :)
>>> The debug=y problem is (actually at resume):
>>> (XEN) Assertion 'test_bit(vector, cfg->used_vectors)' failed at io_apic.c:542
>>> [register and stack dump snipped -- see the first message]
>>
>> To make sense of this, we need to know the register (and maybe
>> stack) allocation at this point, to know which vector it was that
>> triggered the assertion. You can either do this analysis for us, or
>> point us at the xen-syms binary matching the xen.gz you used.
>
> "info scope smp_irq_move_cleanup_interrupt" said vector is in %rbx, so 0xe9.

And that system isn't using a strange mixed mode IO-APIC/legacy
PIC model, where particularly IRQ 9 (usually ACPI SCI) gets
channeled through the legacy PIC?

Could you attach the complete log, ideally with 'i' output logged
right before suspending?

Is this reproducible with 4.2.x or 4.3-unstable? If not, but if readily
reproducible with 4.1.5-rc1, could you try changing the containing
loop's upper bound from "< NR_VECTORS" to "<= LAST_DYNAMIC_VECTOR"?

Jan
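For reference, the change being suggested is confined to the vector-scan loop in smp_irq_move_cleanup_interrupt(). A sketch of the patch, reconstructed from memory of the 4.1 code (the loop's lower bound and surrounding context are assumptions; only the upper-bound change is what is being asked for):

```diff
-    for ( vector = FIRST_DYNAMIC_VECTOR; vector < NR_VECTORS; vector++ )
+    for ( vector = FIRST_DYNAMIC_VECTOR; vector <= LAST_DYNAMIC_VECTOR; vector++ )
```

The effect would be that the cleanup IPI handler never examines vectors above the dynamically allocatable range, so a stray high vector such as 0xe9 could no longer trip the ASSERT() there.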
Andrew Cooper
2013-Mar-26 16:12 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
On 26/03/2013 15:47, Andrew Cooper wrote:
> On 26/03/2013 13:50, Marek Marczykowski wrote:
>> On 26.03.2013 14:11, Jan Beulich wrote:
>>>>>> On 26.03.13 at 13:17, Marek Marczykowski <marmarek@invisiblethingslab.com> wrote:
>>>> Finally got serial console :)
>>>> The debug=y problem is (actually at resume):
>>>> (XEN) Assertion 'test_bit(vector, cfg->used_vectors)' failed at io_apic.c:542
>>>> [register and stack dump snipped -- see the first message]
>> Could it be something to do with switching virtual wire mode, and having
>> PIC compatibility stuff left in the IO-APIC after leaving the BIOS but
>> before starting back up again?
>>
>> Looking at the stack dump, there is an extra exception frame under what
>> is printed by the assertion failure.
>>
>> 0000002000000000 TRAP_syscall

Apologies - this is a vector 0x20 interrupt, not TRAP_syscall, which
makes sense as 0x20 is FIRST_DYNAMIC_IRQ which is also the cleanup IPI
vector.

The other comments still stand, especially as we appear to be
interrupting dom0 which is already running.

~Andrew

>> ffffffff81a01db8 guest kernel addr
>> 0000000000000246 FLAGS
>> 000000000000e033 FLAT_RING3_CS64
>> ffffffff8105dd5a guest kernel addr
>> 000000000000e02b FLAT_RING3_SS{64,32}
>>
>> So it appears that we are already executing a guest (presumably dom0) by
>> the time this assertion occurs. From the serial, is there any indication
>> that dom0 has started up again?
>>
>> I would have thought that we should have successfully reset the IO-APIC
>> back up properly before we would ever get back around to executing dom0.
>>
>> ~Andrew
Marek Marczykowski
2013-Mar-26 16:45 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
On 26.03.2013 17:03, Jan Beulich wrote:
>>>> On 26.03.13 at 14:50, Marek Marczykowski <marmarek@invisiblethingslab.com> wrote:
>> [earlier quoting, including the register and stack dump, snipped]
>>
>> "info scope smp_irq_move_cleanup_interrupt" said vector is in %rbx, so 0xe9.
>
> And that system isn't using a strange mixed mode IO-APIC/legacy
> PIC model, where particularly IRQ 9 (usually ACPI SCI) gets
> channeled through the legacy PIC?

I don't know...

> Could you attach the complete log, ideally with 'i' output logged
> right before suspending?

Sure, attached.

> Is this reproducible with 4.2.x or 4.3-unstable? If not, but if readily
> reproducible with 4.1.5-rc1, could you try changing the containing
> loop's upper bound from "< NR_VECTORS" to "<= LAST_DYNAMIC_VECTOR"?

I've tried 4.2.x some time ago and the bug also exists there (but I had no
console, so I am not sure it is exactly the same). 4.3 seems to be not affected.

-- 
Best Regards / Pozdrawiam,
Marek Marczykowski
Invisible Things Lab
Marek Marczykowski
2013-Mar-26 16:47 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
On 26.03.2013 17:12, Andrew Cooper wrote:
> On 26/03/2013 15:47, Andrew Cooper wrote:
>> [earlier quoting, including the register and stack dump, snipped]
>>
>> Looking at the stack dump, there is an extra exception frame under what
>> is printed by the assertion failure.
>>
>> 0000002000000000 TRAP_syscall
>
> Apologies - this is a vector 0x20 interrupt, not TRAP_syscall, which
> makes sense as 0x20 is FIRST_DYNAMIC_IRQ which is also the cleanup IPI
> vector.
>
> The other comments still stand, especially as we appear to be
> interrupting dom0 which is already running.

Indeed, dom0 is running at this stage (see log in my second email).

> ~Andrew
>
>> So it appears that we are already executing a guest (presumably dom0) by
>> the time this assertion occurs. From the serial, is there any indication
>> that dom0 has started up again?
>>
>> I would have thought that we should have successfully reset the IO-APIC
>> back up properly before we would ever get back around to executing dom0.

-- 
Best Regards / Pozdrawiam,
Marek Marczykowski
Invisible Things Lab
Andrew Cooper
2013-Mar-26 17:02 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
On 26/03/2013 16:45, Marek Marczykowski wrote:
> On 26.03.2013 17:03, Jan Beulich wrote:
>> [earlier quoting, including the register and stack dump, snipped]
>>
>> And that system isn't using a strange mixed mode IO-APIC/legacy
>> PIC model, where particularly IRQ 9 (usually ACPI SCI) gets
>> channeled through the legacy PIC?
> I don't know...
>
>> Could you attach the complete log, ideally with 'i' output logged
>> right before suspending?
> Sure, attached.
>
>> Is this reproducible with 4.2.x or 4.3-unstable? If not, but if readily
>> reproducible with 4.1.5-rc1, could you try changing the containing
>> loop's upper bound from "< NR_VECTORS" to "<= LAST_DYNAMIC_VECTOR"?
> I've tried 4.2.x some time ago and the bug also exists there (but I had no
> console, so not sure if exactly the same). 4.3 seems to be not affected.

Can you replace the ASSERT() with code similar to that in

http://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=xen/arch/x86/irq.c;h=5e0f463c381750090373dabd8967635bc297d457;hb=refs/heads/staging#l668

which should call dump_irqs() before dying because of the ASSERT.

You might need to also take the latest version of dump_irqs() from
unstable, as I seem to remember there was another assertion failure due
to xfree()'ing in IRQ context.

~Andrew
Marek Marczykowski
2013-Mar-26 17:42 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
On 26.03.2013 18:02, Andrew Cooper wrote:> On 26/03/2013 16:45, Marek Marczykowski wrote: >> On 26.03.2013 17:03, Jan Beulich wrote: >>>>>> On 26.03.13 at 14:50, Marek Marczykowski <marmarek@invisiblethingslab.com> >>> wrote: >>>> On 26.03.2013 14:11, Jan Beulich wrote: >>>>>>>> On 26.03.13 at 13:17, Marek Marczykowski <marmarek@invisiblethingslab.com> >>>> wrote: >>>>>> Finally got serial console :) >>>>>> The debug=y problem is (actually at resume): >>>>>> (XEN) Assertion ''test_bit(vector, cfg->used_vectors)'' failed at io_apic.c:542 >>>>>> (XEN) ----[ Xen-4.1.5-rc1 x86_64 debug=y Tainted: C ]---- >>>>>> (XEN) CPU: 0 >>>>>> (XEN) RIP: e008:[<ffff82c48015e288>] >>>>>> smp_irq_move_cleanup_interrupt+0x1c3/0x23d >>>>>> (XEN) RFLAGS: 0000000000010046 CONTEXT: hypervisor >>>>>> (XEN) rax: 0000000000000000 rbx: 00000000000000e9 rcx: ffff82c48029ff18 >>>>>> (XEN) rdx: 00000000000000e9 rsi: 000000000000002a rdi: ffff830421060538 >>>>>> (XEN) rbp: ffff82c48029ff08 rsp: ffff82c48029feb8 r8: ffff88041820eb60 >>>>>> (XEN) r9: 0000000000000000 r10: 0000000000007ff0 r11: 0000000000000000 >>>>>> (XEN) r12: ffff830421080250 r13: ffff830421060534 r14: ffff82c48029ff18 >>>>>> (XEN) r15: ffff82c4802dd9e0 cr0: 000000008005003b cr4: 00000000000026f0 >>>>>> (XEN) cr3: 0000000300b81000 cr2: ffff880402070198 >>>>>> (XEN) ds: 0000 es: 0000 fs: 0000 gs: 0000 ss: 0000 cs: e008 >>>>>> (XEN) Xen stack trace from rsp=ffff82c48029feb8: >>>>>> (XEN) 0000000000000000 000000000000e030 ffff82c48029ff18 ffff82c4802dd9e0 >>>>>> (XEN) ffff8802cac3c7c0 00000000ffff3729 00000000ffff3729 000000013fff3728 >>>>>> (XEN) ffffffff81b907c0 00000000ffff3729 00007d3b7fd600c7 ffff82c48014de60 >>>>>> (XEN) 00000000ffff3729 ffffffff81b907c0 000000013fff3728 00000000ffff3729 >>>>>> (XEN) ffffffff81a01e18 00000000ffff3729 0000000000000000 0000000000007ff0 >>>>>> (XEN) 0000000000000000 ffff88041820eb60 ffff8803fd1820a8 ffffffff81b90a88 >>>>>> (XEN) 000000000000002a 000000000000002a 00000000ffff372a 0000002000000000 
>>>>>> (XEN) ffffffff8105dd5a 000000000000e033 0000000000000246 ffffffff81a01db8 >>>>>> (XEN) 000000000000e02b 0000000000000000 0000000000000000 0000000000000000 >>>>>> (XEN) 0000000000000000 0000000000000000 ffff8300ca9a0000 0000000000000000 >>>>>> (XEN) 0000000000000000 >>>>>> (XEN) Xen call trace: >>>>>> (XEN) [<ffff82c48015e288>] smp_irq_move_cleanup_interrupt+0x1c3/0x23d >>>>>> (XEN) >>>>>> (XEN) >>>>>> (XEN) **************************************** >>>>>> (XEN) Panic on CPU 0: >>>>>> (XEN) Assertion ''test_bit(vector, cfg->used_vectors)'' failed at io_apic.c:542 >>>>>> (XEN) **************************************** >>>>> To make sense of this, we need to know the register (and maybe >>>>> stack) allocation at this point, to know which vector it was that >>>>> triggered the assertion. You can either do this analysis for us, or >>>>> point us at the xen-syms binary matching the xen.gz you used. >>>> "info scope smp_irq_move_cleanup_interrupt" said vector is in %rbx, so 0xe9. >>> And that system isn''t using a strange mixed mode IO-APIC/legacy >>> PIC model, where particularly IRQ 9 (usually ACPI SCI) gets >>> channeled through the legacy PIC? >> I don''t know... >> >>> Could you attach the complete log, ideally with ''i'' output logged >>> right before suspending? >> Sure, attached. >> >>> Is this reproducible with 4.2.x or 4.3-unstable? If not, but if readily >>> reproducible with 4.1.5-rc1, could you try changing the containing >>> loop''s upper bound from "< NR_VECTORS" to >>> "<= LAST_DYNAMIC_VECTOR"? >> I''ve tried 4.2.x some time ago and bug also exists there (but I had not >> console, so not sure if exactly the same). 4.3 seems to be not affected.Checked 4.2 and indeed also assert() in similar place. 
If anyone interested, log here:
http://duch.mimuw.edu.pl/~marmarek/qubes/console-4.2-failed-resume.log

> Can you replace the ASSERT() with code similar to that in
>
> http://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=xen/arch/x86/irq.c;h=5e0f463c381750090373dabd8967635bc297d457;hb=refs/heads/staging#l668
>
> Which should call dump_irqs() in before dying because of the ASSERT.
> You might need to also take the latest version of dump_irqs() from
> unstable, as I seem to remember there was another assertion failure due
> to xfree()'ing in IRQ context.

Full log here:
http://duch.mimuw.edu.pl/~marmarek/qubes/xen-4.1-failed-resume-dump-irqs.log

Interesting part:
(XEN) *** IRQ BUG found ***
(XEN) CPU0 -Testing vector 233 from bitmap 39,47,63-65,72,80,88,96,98,112,120,125,144,152,160,168,174,182-183,190,192,198,200,208,214,222
(XEN) Guest interrupt information:
(XEN)    IRQ:  0 affinity:00000000,00000000,00000000,00000001 vec:f0 type=IO-APIC-edge    status=00000000 mapped, unbound
(XEN)    IRQ:  1 affinity:00000000,00000000,00000000,00000002 vec:c6 type=IO-APIC-edge    status=00000050 in-flight=0 domain-list=0: 1(-S--),
(XEN)    IRQ:  2 affinity:ffffffff,ffffffff,ffffffff,ffffffff vec:e2 type=XT-PIC          status=00000000 mapped, unbound
(XEN)    IRQ:  3 affinity:00000000,00000000,00000000,00000001 vec:40 type=IO-APIC-edge    status=00000002 mapped, unbound
(XEN)    IRQ:  4 affinity:00000000,00000000,00000000,00000001 vec:f1 type=IO-APIC-edge    status=00000000 mapped, unbound
(XEN)    IRQ:  5 affinity:00000000,00000000,00000000,00000001 vec:48 type=IO-APIC-edge    status=00000002 mapped, unbound
(XEN)    IRQ:  6 affinity:00000000,00000000,00000000,00000001 vec:50 type=IO-APIC-edge    status=00000002 mapped, unbound
(XEN)    IRQ:  7 affinity:00000000,00000000,00000000,00000001 vec:58 type=IO-APIC-edge    status=00000050 in-flight=0 domain-list=0: 7(-S--),
(XEN)    IRQ:  8 affinity:00000000,00000000,00000000,00000001 vec:60 type=IO-APIC-edge    status=00000050 in-flight=0 domain-list=0: 8(-S--),
(XEN)    IRQ:  9 affinity:00000000,00000000,00000000,00000001 vec:de type=IO-APIC-level   status=00000010 in-flight=0 domain-list=0: 9(-S--),
(XEN)    IRQ: 10 affinity:00000000,00000000,00000000,00000001 vec:70 type=IO-APIC-edge    status=00000002 mapped, unbound
(XEN)    IRQ: 11 affinity:00000000,00000000,00000000,00000001 vec:78 type=IO-APIC-edge    status=00000002 mapped, unbound
(XEN)    IRQ: 12 affinity:00000000,00000000,00000000,00000001 vec:27 type=IO-APIC-edge    status=00000050 in-flight=0 domain-list=0: 12(-S--),
(XEN)    IRQ: 13 affinity:00000000,00000000,00000000,0000000f vec:90 type=IO-APIC-edge    status=00000002 mapped, unbound
(XEN)    IRQ: 14 affinity:00000000,00000000,00000000,00000001 vec:98 type=IO-APIC-edge    status=00000002 mapped, unbound
(XEN)    IRQ: 15 affinity:00000000,00000000,00000000,00000001 vec:a0 type=IO-APIC-edge    status=00000002 mapped, unbound
(XEN)    IRQ: 16 affinity:00000000,00000000,00000000,00000001 vec:2f type=IO-APIC-level   status=00000050 in-flight=0 domain-list=0: 16(-S--),
(XEN)    IRQ: 17 affinity:00000000,00000000,00000000,00000001 vec:3f type=IO-APIC-level   status=00000050 in-flight=0 domain-list=0: 17(-S--),
(XEN)    IRQ: 18 affinity:00000000,00000000,00000000,00000008 vec:41 type=IO-APIC-level   status=00000002 mapped, unbound
(XEN)    IRQ: 19 affinity:00000000,00000000,00000000,0000000f vec:c8 type=IO-APIC-level   status=00000002 mapped, unbound
(XEN)    IRQ: 20 affinity:00000000,00000000,00000000,00000002 vec:b7 type=IO-APIC-level   status=00000050 in-flight=0 domain-list=0: 20(-S--),
(XEN)    IRQ: 22 affinity:00000000,00000000,00000000,0000000f vec:62 type=IO-APIC-level   status=00000002 mapped, unbound
(XEN)    IRQ: 23 affinity:00000000,00000000,00000000,0000000f vec:a8 type=IO-APIC-edge    status=00000002 mapped, unbound
(XEN)    IRQ: 24 affinity:00000000,00000000,00000000,00000001 vec:28 type=DMA_MSI         status=00000000 mapped, unbound
(XEN)    IRQ: 25 affinity:00000000,00000000,00000000,00000001 vec:30 type=DMA_MSI         status=00000000 mapped, unbound
(XEN)    IRQ: 26 affinity:ffffffff,ffffffff,ffffffff,ffffffff vec:6f type=PCI-MSI         status=00000042 mapped, unbound
(XEN)    IRQ: 27 affinity:ffffffff,ffffffff,ffffffff,ffffffff vec:77 type=PCI-MSI         status=00000042 mapped, unbound
(XEN)    IRQ: 28 affinity:ffffffff,ffffffff,ffffffff,ffffffff vec:7f type=PCI-MSI         status=00000042 mapped, unbound
(XEN)    IRQ: 29 affinity:ffffffff,ffffffff,ffffffff,ffffffff vec:87 type=PCI-MSI         status=00000042 mapped, unbound
(XEN)    IRQ: 31 affinity:00000000,00000000,00000000,00000002 vec:a6 type=PCI-MSI         status=00000002 mapped, unbound
(XEN)    IRQ: 32 affinity:00000000,00000000,00000000,00000001 vec:47 type=PCI-MSI         status=00000050 in-flight=0 domain-list=0:273(-S--),
(XEN)    IRQ: 33 affinity:00000000,00000000,00000000,00000002 vec:5f type=PCI-MSI         status=00000010 in-flight=0 domain-list=0:272(PS--),
(XEN)    IRQ: 34 affinity:00000000,00000000,00000000,00000001 vec:67 type=PCI-MSI         status=00000010 in-flight=0 domain-list=0:271(-S--),
(XEN)    IRQ: 35 affinity:00000000,00000000,00000000,00000001 vec:4f type=PCI-MSI         status=00000050 in-flight=0 domain-list=1: 55(-S--),
(XEN) IO-APIC interrupt information:
(XEN)   IRQ  0 Vec240:
(XEN)     Apic 0x00, Pin  2: vec=f0 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=0 dest_id:0
(XEN)   IRQ  1 Vec198:
(XEN)     Apic 0x00, Pin  1: vec=c6 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=0 dest_id:0
(XEN)   IRQ  3 Vec 64:
(XEN)     Apic 0x00, Pin  3: vec=40 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=0 dest_id:0
(XEN)   IRQ  4 Vec241:
(XEN)     Apic 0x00, Pin  4: vec=f1 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=0 dest_id:0
(XEN)   IRQ  5 Vec 72:
(XEN)     Apic 0x00, Pin  5: vec=48 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=0 dest_id:0
(XEN)   IRQ  6 Vec 80:
(XEN)     Apic 0x00, Pin  6: vec=50 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=0 dest_id:0
(XEN)   IRQ  7 Vec 88:
(XEN)     Apic 0x00, Pin  7: vec=58 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=0 dest_id:0
(XEN)   IRQ  8 Vec 96:
(XEN)     Apic 0x00, Pin  8: vec=60 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=0 dest_id:0
(XEN)   IRQ  9 Vec222:
(XEN)     Apic 0x00, Pin  9: vec=de delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=L mask=0 dest_id:0
(XEN)   IRQ 10 Vec112:
(XEN)     Apic 0x00, Pin 10: vec=70 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=0 dest_id:0
(XEN)   IRQ 11 Vec120:
(XEN)     Apic 0x00, Pin 11: vec=78 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=0 dest_id:0
(XEN)   IRQ 12 Vec 39:
(XEN)     Apic 0x00, Pin 12: vec=27 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=0 dest_id:0
(XEN)   IRQ 13 Vec144:
(XEN)     Apic 0x00, Pin 13: vec=90 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=1 dest_id:0
(XEN)   IRQ 14 Vec152:
(XEN)     Apic 0x00, Pin 14: vec=98 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=0 dest_id:0
(XEN)   IRQ 15 Vec160:
(XEN)     Apic 0x00, Pin 15: vec=a0 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=0 dest_id:0
(XEN)   IRQ 16 Vec 47:
(XEN)     Apic 0x00, Pin 16: vec=2f delivery=LoPri dest=L status=0 polarity=1 irr=0 trig=L mask=0 dest_id:0
(XEN)   IRQ 17 Vec 63:
(XEN)     Apic 0x00, Pin 17: vec=3f delivery=LoPri dest=L status=0 polarity=1 irr=0 trig=L mask=0 dest_id:0
(XEN)   IRQ 18 Vec 65:
(XEN)     Apic 0x00, Pin 18: vec=41 delivery=LoPri dest=L status=0 polarity=1 irr=0 trig=L mask=1 dest_id:0
(XEN)   IRQ 19 Vec200:
(XEN)     Apic 0x00, Pin 19: vec=c8 delivery=LoPri dest=L status=0 polarity=1 irr=0 trig=L mask=1 dest_id:0
(XEN)   IRQ 20 Vec183:
(XEN)     Apic 0x00, Pin 20: vec=b7 delivery=LoPri dest=L status=0 polarity=1 irr=0 trig=L mask=0 dest_id:0
(XEN)   IRQ 22 Vec 98:
(XEN)     Apic 0x00, Pin 22: vec=62 delivery=LoPri dest=L status=0 polarity=1 irr=0 trig=L mask=1 dest_id:0
(XEN)   IRQ 23 Vec168:
(XEN)     Apic 0x00, Pin 23: vec=a8 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=1 dest_id:0
(XEN) Xen BUG at io_apic.c:554
(XEN) ----[ Xen-4.1.5-rc1  x86_64  debug=y  Tainted:    C ]----
(XEN) CPU:    0
(XEN) RIP:    e008:[<ffff82c48015e2d6>] smp_irq_move_cleanup_interrupt+0x211/0x289
(XEN) RFLAGS: 0000000000010092   CONTEXT: hypervisor
(XEN) rax: 0000000000000000   rbx: 00000000000000e9   rcx: 0000000000000000
(XEN) rdx: 0000000000000016   rsi: 000000000000000a   rdi: ffff82c4802592e0
(XEN) rbp: ffff82c48029fd08   rsp: ffff82c48029fcb8   r8:  0000000000000018
(XEN) r9:  0000000000000000   r10: 0000000000000000   r11: 0000000000000001
(XEN) r12: ffff830421080250   r13: ffff830421060534   r14: ffff82c48029ff18
(XEN) r15: ffff82c4802dd9e0   cr0: 000000008005003b   cr4: 00000000000026f0
(XEN) cr3: 0000000119a96000   cr2: ffff880402070198
(XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: e010   cs: e008
(XEN) Xen stack trace from rsp=ffff82c48029fcb8:
(XEN)    0000000000000000 ffff82c48029ff18 ffff82c4802dd9e0 000000e900000000
(XEN)    ffff83042109ba04 ffff830421008000 0000000000000114 000000000000001d
(XEN)    0000000000000114 0000000000000000 00007d3b7fd602c7 ffff82c48014de60
(XEN)    0000000000000000 0000000000000114 000000000000001d 0000000000000114
(XEN)    ffff82c48029fdc8 ffff830421008000 0000000000000246 ffff82c48025c1f0
(XEN)    0000000000000003 0000001944602466 0000000000000000 0000000000000001
(XEN)    0000000000000000 0000000000000286 ffff830421060f34 0000002000000000
(XEN)    ffff82c4801226c0 000000000000e008 0000000000000286 ffff82c48029fdc8
(XEN)    000000000000e010 0000000000000286 ffff82c48029fe48 ffff82c480164446
(XEN)    ffff82c4802dd9e0 0000000000000286 ffff830421060f00 ffff830421060f34
(XEN)    ffff830421050ac0 000000000000001d 0000000000000246 ffff8301108fd140
(XEN)    ffff82c4801226d3 ffff82c48029fe78 000000000000001d ffff8803fa889af0
(XEN)    0000000000000114 ffff8804023be000 ffff82c48029fef8 ffff82c48017655b
(XEN)    ffff830114c7f300 ffffffff81381646 ffff82f600000008 ffff830421008000
(XEN)    0000000000000003 000000030000001d 00000000e2200000 0000000100a0fb00
(XEN)    0000000000007ff0 ffffffffffffffff 0000000000000003 0000000000000003
(XEN)    00000000e2200000 c390ed90d1ffffff 0000000000000202 ffff8300ca666000
(XEN)    ffff8803fc880240 0000000000000011 ffff8804023be858 ffff8804023be000
(XEN)    00007d3b7fd600c7 ffff82c480209f38 ffffffff8100142a 0000000000000021
(XEN)    ffff8804023be000 ffff8804023be858 0000000000000011 ffff8803fc880240
(XEN) Xen call trace:
(XEN)    [<ffff82c48015e2d6>] smp_irq_move_cleanup_interrupt+0x211/0x289
(XEN)    [<ffff82c48014de60>] irq_move_cleanup_interrupt+0x30/0x40
(XEN)    [<ffff82c4801226c0>] _spin_unlock_irqrestore+0x22/0x24
(XEN)    [<ffff82c480164446>] map_domain_pirq+0x37a/0x3df
(XEN)    [<ffff82c48017655b>] do_physdev_op+0xa2b/0x1508
(XEN)    [<ffff82c480209f38>] syscall_enter+0xc8/0x122

> ~Andrew
>

-- 
Best Regards / Pozdrawiam,
Marek Marczykowski
Invisible Things Lab

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
Andrew Cooper
2013-Mar-26 17:54 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
>> Can you replace the ASSERT() with code similar to that in
>>
>> http://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=xen/arch/x86/irq.c;h=5e0f463c381750090373dabd8967635bc297d457;hb=refs/heads/staging#l668
>>
>> Which should call dump_irqs() in before dying because of the ASSERT.
>> You might need to also take the latest version of dump_irqs() from
>> unstable, as I seem to remember there was another assertion failure due
>> to xfree()'ing in IRQ context.
> Full log here:
> http://duch.mimuw.edu.pl/~marmarek/qubes/xen-4.1-failed-resume-dump-irqs.log
> Interesting part:
> (...)

Even more curious.  vector e9 does not appear to be programmed in.  Can
you extend the debugging to also call __print_IO_APIC().

The i debug key and z debug key list IO-APIC entries from different
sources of information.

~Andrew
Marek Marczykowski
2013-Mar-26 18:21 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
On 26.03.2013 18:54, Andrew Cooper wrote:
>
>>> Can you replace the ASSERT() with code similar to that in
>>>
>>> http://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=xen/arch/x86/irq.c;h=5e0f463c381750090373dabd8967635bc297d457;hb=refs/heads/staging#l668
>>>
>>> Which should call dump_irqs() in before dying because of the ASSERT.
>>> You might need to also take the latest version of dump_irqs() from
>>> unstable, as I seem to remember there was another assertion failure due
>>> to xfree()'ing in IRQ context.
>> Full log here:
>> http://duch.mimuw.edu.pl/~marmarek/qubes/xen-4.1-failed-resume-dump-irqs.log
>> Interesting part:
(...)
> Even more curious.  vector e9 does not appear to be programmed in.  Can
> you extend the debugging to also call __print_IO_APIC().
>
> The i debug key and z debug key list IO-APIC entries from different
> sources of information.

As you wish, full log:
http://duch.mimuw.edu.pl/~marmarek/qubes/xen-4.1-failed-resume-dump-irqs2.log

Final part:
(XEN) *** IRQ BUG found ***
(XEN) CPU0 -Testing vector 233 from bitmap 43,49,64,72,80,87-88,95-96,103,112,119-121,127,135,143-144,151-152,159-160,168,192,197,200,211,216,218
(XEN) Guest interrupt information:
(XEN)    IRQ:  0 affinity:00000000,00000000,00000000,00000001 vec:f0 type=IO-APIC-edge    status=00000000 mapped, unbound
(XEN)    IRQ:  1 affinity:00000000,00000000,00000000,00000001 vec:7f type=IO-APIC-edge    status=00000050 in-flight=0 domain-list=0: 1(-S--),
(XEN)    IRQ:  2 affinity:ffffffff,ffffffff,ffffffff,ffffffff vec:e2 type=XT-PIC          status=00000000 mapped, unbound
(XEN)    IRQ:  3 affinity:00000000,00000000,00000000,00000001 vec:40 type=IO-APIC-edge    status=00000002 mapped, unbound
(XEN)    IRQ:  4 affinity:00000000,00000000,00000000,00000001 vec:f1 type=IO-APIC-edge    status=00000000 mapped, unbound
(XEN)    IRQ:  5 affinity:00000000,00000000,00000000,00000001 vec:48 type=IO-APIC-edge    status=00000002 mapped, unbound
(XEN)    IRQ:  6 affinity:00000000,00000000,00000000,00000001 vec:50 type=IO-APIC-edge    status=00000002 mapped, unbound
(XEN)    IRQ:  7 affinity:00000000,00000000,00000000,00000008 vec:da type=IO-APIC-edge    status=00000050 in-flight=0 domain-list=0: 7(-S--),
(XEN)    IRQ:  8 affinity:00000000,00000000,00000000,00000004 vec:d8 type=IO-APIC-edge    status=00000050 in-flight=0 domain-list=0: 8(-S--),
(XEN)    IRQ:  9 affinity:00000000,00000000,00000000,00000001 vec:87 type=IO-APIC-level   status=00000010 in-flight=0 domain-list=0: 9(-S--),
(XEN)    IRQ: 10 affinity:00000000,00000000,00000000,00000001 vec:70 type=IO-APIC-edge    status=00000002 mapped, unbound
(XEN)    IRQ: 11 affinity:00000000,00000000,00000000,00000001 vec:78 type=IO-APIC-edge    status=00000002 mapped, unbound
(XEN)    IRQ: 12 affinity:00000000,00000000,00000000,00000001 vec:8f type=IO-APIC-edge    status=00000050 in-flight=0 domain-list=0: 12(-S--),
(XEN)    IRQ: 13 affinity:00000000,00000000,00000000,0000000f vec:90 type=IO-APIC-edge    status=00000002 mapped, unbound
(XEN)    IRQ: 14 affinity:00000000,00000000,00000000,00000001 vec:98 type=IO-APIC-edge    status=00000002 mapped, unbound
(XEN)    IRQ: 15 affinity:00000000,00000000,00000000,00000001 vec:a0 type=IO-APIC-edge    status=00000002 mapped, unbound
(XEN)    IRQ: 16 affinity:00000000,00000000,00000000,00000001 vec:97 type=IO-APIC-level   status=00000050 in-flight=0 domain-list=0: 16(-S--),
(XEN)    IRQ: 17 affinity:00000000,00000000,00000000,00000001 vec:9f type=IO-APIC-level   status=00000050 in-flight=0 domain-list=0: 17(-S--),
(XEN)    IRQ: 18 affinity:00000000,00000000,00000000,00000004 vec:79 type=IO-APIC-level   status=00000002 mapped, unbound
(XEN)    IRQ: 19 affinity:00000000,00000000,00000000,0000000f vec:c8 type=IO-APIC-level   status=00000002 mapped, unbound
(XEN)    IRQ: 20 affinity:00000000,00000000,00000000,00000002 vec:d3 type=IO-APIC-level   status=00000050 in-flight=0 domain-list=0: 20(-S--),
(XEN)    IRQ: 22 affinity:00000000,00000000,00000000,0000000f vec:2b type=IO-APIC-level   status=00000002 mapped, unbound
(XEN)    IRQ: 23 affinity:00000000,00000000,00000000,0000000f vec:a8 type=IO-APIC-edge    status=00000002 mapped, unbound
(XEN)    IRQ: 24 affinity:00000000,00000000,00000000,00000001 vec:28 type=DMA_MSI         status=00000000 mapped, unbound
(XEN)    IRQ: 25 affinity:00000000,00000000,00000000,00000001 vec:30 type=DMA_MSI         status=00000000 mapped, unbound
(XEN)    IRQ: 26 affinity:00000000,00000000,00000000,00000001 vec:c7 type=PCI-MSI         status=00000010 in-flight=0 domain-list=0:279(-S--),
(XEN)    IRQ: 27 affinity:00000000,00000000,00000000,00000001 vec:cf type=PCI-MSI         status=00000050 in-flight=0 domain-list=0:278(-S--),
(XEN)    IRQ: 28 affinity:00000000,00000000,00000000,00000001 vec:d7 type=PCI-MSI         status=00000050 in-flight=0 domain-list=0:277(-S--),
(XEN)    IRQ: 29 affinity:00000000,00000000,00000000,00000001 vec:df type=PCI-MSI         status=00000050 in-flight=0 domain-list=0:276(-S--),
(XEN)    IRQ: 30 affinity:00000000,00000000,00000000,00000001 vec:38 type=PCI-MSI         status=00000050 in-flight=0 domain-list=0:275(-S--),
(XEN)    IRQ: 31 affinity:00000000,00000000,00000000,00000004 vec:47 type=PCI-MSI         status=00000002 mapped, unbound
(XEN)    IRQ: 32 affinity:00000000,00000000,00000000,00000001 vec:a7 type=PCI-MSI         status=00000050 in-flight=0 domain-list=0:273(-S--),
(XEN)    IRQ: 33 affinity:00000000,00000000,00000000,00000001 vec:b7 type=PCI-MSI         status=00000010 in-flight=0 domain-list=0:272(-S--),
(XEN)    IRQ: 34 affinity:00000000,00000000,00000000,00000004 vec:40 type=PCI-MSI         status=00000010 in-flight=0 domain-list=0:271(PS--),
(XEN)    IRQ: 35 affinity:00000000,00000000,00000000,00000001 vec:af type=PCI-MSI         status=00000050 in-flight=0 domain-list=1: 55(-S--),
(XEN) IO-APIC interrupt information:
(XEN)   IRQ  0 Vec240:
(XEN)     Apic 0x00, Pin  2: vec=f0 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=0 dest_id:0
(XEN)   IRQ  1 Vec127:
(XEN)     Apic 0x00, Pin  1: vec=7f delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=0 dest_id:0
(XEN)   IRQ  3 Vec 64:
(XEN)     Apic 0x00, Pin  3: vec=40 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=0 dest_id:0
(XEN)   IRQ  4 Vec241:
(XEN)     Apic 0x00, Pin  4: vec=f1 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=0 dest_id:0
(XEN)   IRQ  5 Vec 72:
(XEN)     Apic 0x00, Pin  5: vec=48 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=0 dest_id:0
(XEN)   IRQ  6 Vec 80:
(XEN)     Apic 0x00, Pin  6: vec=50 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=0 dest_id:0
(XEN)   IRQ  7 Vec218:
(XEN)     Apic 0x00, Pin  7: vec=da delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=0 dest_id:0
(XEN)   IRQ  8 Vec216:
(XEN)     Apic 0x00, Pin  8: vec=d8 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=0 dest_id:0
(XEN)   IRQ  9 Vec135:
(XEN)     Apic 0x00, Pin  9: vec=87 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=L mask=0 dest_id:0
(XEN)   IRQ 10 Vec112:
(XEN)     Apic 0x00, Pin 10: vec=70 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=0 dest_id:0
(XEN)   IRQ 11 Vec120:
(XEN)     Apic 0x00, Pin 11: vec=78 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=0 dest_id:0
(XEN)   IRQ 12 Vec143:
(XEN)     Apic 0x00, Pin 12: vec=8f delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=0 dest_id:0
(XEN)   IRQ 13 Vec144:
(XEN)     Apic 0x00, Pin 13: vec=90 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=1 dest_id:0
(XEN)   IRQ 14 Vec152:
(XEN)     Apic 0x00, Pin 14: vec=98 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=0 dest_id:0
(XEN)   IRQ 15 Vec160:
(XEN)     Apic 0x00, Pin 15: vec=a0 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=0 dest_id:0
(XEN)   IRQ 16 Vec151:
(XEN)     Apic 0x00, Pin 16: vec=97 delivery=LoPri dest=L status=0 polarity=1 irr=0 trig=L mask=0 dest_id:0
(XEN)   IRQ 17 Vec159:
(XEN)     Apic 0x00, Pin 17: vec=9f delivery=LoPri dest=L status=0 polarity=1 irr=0 trig=L mask=0 dest_id:0
(XEN)   IRQ 18 Vec121:
(XEN)     Apic 0x00, Pin 18: vec=79 delivery=LoPri dest=L status=0 polarity=1 irr=0 trig=L mask=1 dest_id:0
(XEN)   IRQ 19 Vec200:
(XEN)     Apic 0x00, Pin 19: vec=c8 delivery=LoPri dest=L status=0 polarity=1 irr=0 trig=L mask=1 dest_id:0
(XEN)   IRQ 20 Vec211:
(XEN)     Apic 0x00, Pin 20: vec=d3 delivery=LoPri dest=L status=0 polarity=1 irr=0 trig=L mask=0 dest_id:0
(XEN)   IRQ 22 Vec 43:
(XEN)     Apic 0x00, Pin 22: vec=2b delivery=LoPri dest=L status=0 polarity=1 irr=0 trig=L mask=1 dest_id:0
(XEN)   IRQ 23 Vec168:
(XEN)     Apic 0x00, Pin 23: vec=a8 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=1 dest_id:0
(XEN) number of MP IRQ sources: 15.
(XEN) number of IO-APIC #2 registers: 24.
(XEN) testing the IO APIC.......................
(XEN) IO APIC #2......
(XEN) .... register #00: 02000000
(XEN) .......    : physical APIC id: 02
(XEN) .......    : Delivery Type: 0
(XEN) .......    : LTS          : 0
(XEN) .... register #01: 00170020
(XEN) .......     : max redirection entries: 0017
(XEN) .......     : PRQ implemented: 0
(XEN) .......     : IO APIC version: 0020
(XEN) .... IRQ redirection table:
(XEN)  NR Log Phy Mask Trig IRR Pol Stat Dest Deli Vect:
(XEN)  00 0DC 0C  1    0    0   0   0    1    2    87
(XEN)  01 000 00  0    0    0   0   0    1    1    7F
(XEN)  02 000 00  0    0    0   0   0    1    1    F0
(XEN)  03 000 00  0    0    0   0   0    1    1    40
(XEN)  04 000 00  0    0    0   0   0    1    1    F1
(XEN)  05 000 00  0    0    0   0   0    1    1    48
(XEN)  06 000 00  0    0    0   0   0    1    1    50
(XEN)  07 000 00  0    0    0   0   0    1    1    DA
(XEN)  08 000 00  0    0    0   0   0    1    1    D8
(XEN)  09 000 00  0    1    0   0   0    1    1    87
(XEN)  0a 000 00  0    0    0   0   0    1    1    70
(XEN)  0b 000 00  0    0    0   0   0    1    1    78
(XEN)  0c 000 00  0    0    0   0   0    1    1    8F
(XEN)  0d 000 00  1    0    0   0   0    1    1    90
(XEN)  0e 000 00  0    0    0   0   0    1    1    98
(XEN)  0f 000 00  0    0    0   0   0    1    1    A0
(XEN)  10 000 00  0    1    0   1   0    1    1    97
(XEN)  11 000 00  0    1    0   1   0    1    1    9F
(XEN)  12 000 00  1    1    0   1   0    1    1    79
(XEN)  13 000 00  1    1    0   1   0    1    1    C8
(XEN)  14 000 00  0    1    0   1   0    1    1    D3
(XEN)  15 000 00  1    0    0   0   0    0    0    00
(XEN)  16 000 00  1    1    0   1   0    1    1    2B
(XEN)  17 000 00  1    0    0   0   0    1    1    A8
(XEN) Using vector-based indexing
(XEN) IRQ to pin mappings:
(XEN) IRQ240 -> 0:2
(XEN) IRQ127 -> 0:1
(XEN) IRQ64 -> 0:3
(XEN) IRQ241 -> 0:4
(XEN) IRQ72 -> 0:5
(XEN) IRQ80 -> 0:6
(XEN) IRQ218 -> 0:7
(XEN) IRQ216 -> 0:8
(XEN) IRQ135 -> 0:9
(XEN) IRQ112 -> 0:10
(XEN) IRQ120 -> 0:11
(XEN) IRQ143 -> 0:12
(XEN) IRQ144 -> 0:13
(XEN) IRQ152 -> 0:14
(XEN) IRQ160 -> 0:15
(XEN) IRQ151 -> 0:16
(XEN) IRQ159 -> 0:17
(XEN) IRQ121 -> 0:18
(XEN) IRQ200 -> 0:19
(XEN) IRQ211 -> 0:20
(XEN) IRQ43 -> 0:22
(XEN) IRQ168 -> 0:23
(XEN) .................................... done.
(XEN) Xen BUG at io_apic.c:556
(XEN) ----[ Xen-4.1.5-rc1  x86_64  debug=y  Tainted:    C ]----
(XEN) CPU:    0
(XEN) RIP:    e008:[<ffff82c48015e2db>] smp_irq_move_cleanup_interrupt+0x216/0x28e
(XEN) RFLAGS: 0000000000010092   CONTEXT: hypervisor
(XEN) rax: 0000000000000000   rbx: 00000000000000e9   rcx: 0000000000000000
(XEN) rdx: 0000000000000000   rsi: 000000000000000a   rdi: ffff82c4802592e0
(XEN) rbp: ffff82c48029ff08   rsp: ffff82c48029feb8   r8:  0000000000000004
(XEN) r9:  0000000000000004   r10: 0000000000000004   r11: 0000000000000002
(XEN) r12: ffff830421080250   r13: ffff830421060534   r14: ffff82c48029ff18
(XEN) r15: ffff82c4802dd9e0   cr0: 000000008005003b   cr4: 00000000000026f0
(XEN) cr3: 000000026582c000   cr2: ffff8804020701d8
(XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: 0000   cs: e008
(XEN) Xen stack trace from rsp=ffff82c48029feb8:
(XEN)    0000000000000000 ffff82c48029ff18 ffff82c4802dd9e0 000000e900000000
(XEN)    000000000000e02b 0000000000000000 000000004bf51982 00000000000060a9
(XEN)    0000000000000000 0000000000000000 00007d3b7fd600c7 ffff82c48014de60
(XEN)    0000000000000000 0000000000000000 00000000000060a9 000000004bf51982
(XEN)    ffff8802d2665b28 0000000000000000 0000000000000000 0000000000007ff0
(XEN)    0000000000000022 0000000000000000 000000024bf57322 0000000001307da0
(XEN)    00000000000059a0 0000000000000000 00000000000060a9 0000002000000000
(XEN)    ffffffff8123c51a 000000000000e033 0000000000000293 ffff8802d2665b08
(XEN)    000000000000e02b 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 ffff8300ca9a0000 0000000000000000
(XEN)    0000000000000000
(XEN) Xen call trace:
(XEN)    [<ffff82c48015e2db>] smp_irq_move_cleanup_interrupt+0x216/0x28e

-- 
Best Regards / Pozdrawiam,
Marek Marczykowski
Invisible Things Lab

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
Andrew Cooper
2013-Mar-26 18:50 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
On 26/03/2013 18:21, Marek Marczykowski wrote:> On 26.03.2013 18:54, Andrew Cooper wrote: >>>> Can you replace the ASSERT() with code similar to that in >>>> >>>> http://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=xen/arch/x86/irq.c;h=5e0f463c381750090373dabd8967635bc297d457;hb=refs/heads/staging#l668 >>>> >>>> Which should call dump_irqs() in before dying because of the ASSERT. >>>> You might need to also take the latest version of dump_irqs() from >>>> unstable, as I seem to remember there was another assertion failure due >>>> to xfree()''ing in IRQ context. >>> Full log here: >>> http://duch.mimuw.edu.pl/~marmarek/qubes/xen-4.1-failed-resume-dump-irqs.log >>> Interesting part: > (...) >> Even more curious. vector e9 does not appear to be programmed in. Can >> you extend the debugging to also call __print_IO_APIC(). >> >> The i debug key and z debug key list IO-APIC entries from different >> sources of information. > As you wish, full log: > http://duch.mimuw.edu.pl/~marmarek/qubes/xen-4.1-failed-resume-dump-irqs2.log > > Final part: > (XEN) *** IRQ BUG found *** > (XEN) CPU0 -Testing vector 233 from bitmap > 43,49,64,72,80,87-88,95-96,103,112,119-121,127,135,143-144,151-152,159-160,168,192,197,200,211,216,218 > (XEN) Guest interrupt information: > (XEN) IRQ: 0 affinity:00000000,00000000,00000000,00000001 vec:f0 > type=IO-APIC-edge status=00000000 mapped, unbound > (XEN) IRQ: 1 affinity:00000000,00000000,00000000,00000001 vec:7f > type=IO-APIC-edge status=00000050 in-flight=0 domain-list=0: 1(-S--), > (XEN) IRQ: 2 affinity:ffffffff,ffffffff,ffffffff,ffffffff vec:e2 > type=XT-PIC status=00000000 mapped, unbound > (XEN) IRQ: 3 affinity:00000000,00000000,00000000,00000001 vec:40 > type=IO-APIC-edge status=00000002 mapped, unbound > (XEN) IRQ: 4 affinity:00000000,00000000,00000000,00000001 vec:f1 > type=IO-APIC-edge status=00000000 mapped, unbound > (XEN) IRQ: 5 affinity:00000000,00000000,00000000,00000001 vec:48 > type=IO-APIC-edge status=00000002 mapped, unbound > 
(XEN) IRQ: 6 affinity:00000000,00000000,00000000,00000001 vec:50 > type=IO-APIC-edge status=00000002 mapped, unbound > (XEN) IRQ: 7 affinity:00000000,00000000,00000000,00000008 vec:da > type=IO-APIC-edge status=00000050 in-flight=0 domain-list=0: 7(-S--), > (XEN) IRQ: 8 affinity:00000000,00000000,00000000,00000004 vec:d8 > type=IO-APIC-edge status=00000050 in-flight=0 domain-list=0: 8(-S--), > (XEN) IRQ: 9 affinity:00000000,00000000,00000000,00000001 vec:87 > type=IO-APIC-level status=00000010 in-flight=0 domain-list=0: 9(-S--), > (XEN) IRQ: 10 affinity:00000000,00000000,00000000,00000001 vec:70 > type=IO-APIC-edge status=00000002 mapped, unbound > (XEN) IRQ: 11 affinity:00000000,00000000,00000000,00000001 vec:78 > type=IO-APIC-edge status=00000002 mapped, unbound > (XEN) IRQ: 12 affinity:00000000,00000000,00000000,00000001 vec:8f > type=IO-APIC-edge status=00000050 in-flight=0 domain-list=0: 12(-S--), > (XEN) IRQ: 13 affinity:00000000,00000000,00000000,0000000f vec:90 > type=IO-APIC-edge status=00000002 mapped, unbound > (XEN) IRQ: 14 affinity:00000000,00000000,00000000,00000001 vec:98 > type=IO-APIC-edge status=00000002 mapped, unbound > (XEN) IRQ: 15 affinity:00000000,00000000,00000000,00000001 vec:a0 > type=IO-APIC-edge status=00000002 mapped, unbound > (XEN) IRQ: 16 affinity:00000000,00000000,00000000,00000001 vec:97 > type=IO-APIC-level status=00000050 in-flight=0 domain-list=0: 16(-S--), > (XEN) IRQ: 17 affinity:00000000,00000000,00000000,00000001 vec:9f > type=IO-APIC-level status=00000050 in-flight=0 domain-list=0: 17(-S--), > (XEN) IRQ: 18 affinity:00000000,00000000,00000000,00000004 vec:79 > type=IO-APIC-level status=00000002 mapped, unbound > (XEN) IRQ: 19 affinity:00000000,00000000,00000000,0000000f vec:c8 > type=IO-APIC-level status=00000002 mapped, unbound > (XEN) IRQ: 20 affinity:00000000,00000000,00000000,00000002 vec:d3 > type=IO-APIC-level status=00000050 in-flight=0 domain-list=0: 20(-S--), > (XEN) IRQ: 22 
affinity:00000000,00000000,00000000,0000000f vec:2b > type=IO-APIC-level status=00000002 mapped, unbound > (XEN) IRQ: 23 affinity:00000000,00000000,00000000,0000000f vec:a8 > type=IO-APIC-edge status=00000002 mapped, unbound > (XEN) IRQ: 24 affinity:00000000,00000000,00000000,00000001 vec:28 > type=DMA_MSI status=00000000 mapped, unbound > (XEN) IRQ: 25 affinity:00000000,00000000,00000000,00000001 vec:30 > type=DMA_MSI status=00000000 mapped, unbound > (XEN) IRQ: 26 affinity:00000000,00000000,00000000,00000001 vec:c7 > type=PCI-MSI status=00000010 in-flight=0 domain-list=0:279(-S--), > (XEN) IRQ: 27 affinity:00000000,00000000,00000000,00000001 vec:cf > type=PCI-MSI status=00000050 in-flight=0 domain-list=0:278(-S--), > (XEN) IRQ: 28 affinity:00000000,00000000,00000000,00000001 vec:d7 > type=PCI-MSI status=00000050 in-flight=0 domain-list=0:277(-S--), > (XEN) IRQ: 29 affinity:00000000,00000000,00000000,00000001 vec:df > type=PCI-MSI status=00000050 in-flight=0 domain-list=0:276(-S--), > (XEN) IRQ: 30 affinity:00000000,00000000,00000000,00000001 vec:38 > type=PCI-MSI status=00000050 in-flight=0 domain-list=0:275(-S--), > (XEN) IRQ: 31 affinity:00000000,00000000,00000000,00000004 vec:47 > type=PCI-MSI status=00000002 mapped, unbound > (XEN) IRQ: 32 affinity:00000000,00000000,00000000,00000001 vec:a7 > type=PCI-MSI status=00000050 in-flight=0 domain-list=0:273(-S--), > (XEN) IRQ: 33 affinity:00000000,00000000,00000000,00000001 vec:b7 > type=PCI-MSI status=00000010 in-flight=0 domain-list=0:272(-S--), > (XEN) IRQ: 34 affinity:00000000,00000000,00000000,00000004 vec:40 > type=PCI-MSI status=00000010 in-flight=0 domain-list=0:271(PS--), > (XEN) IRQ: 35 affinity:00000000,00000000,00000000,00000001 vec:af > type=PCI-MSI status=00000050 in-flight=0 domain-list=1: 55(-S--), > (XEN) IO-APIC interrupt information: > (XEN) IRQ 0 Vec240: > (XEN) Apic 0x00, Pin 2: vec=f0 delivery=LoPri dest=L status=0 > polarity=0 irr=0 trig=E mask=0 dest_id:0 > (XEN) IRQ 1 Vec127: > (XEN) Apic 
0x00, Pin 1: vec=7f delivery=LoPri dest=L status=0 > polarity=0 irr=0 trig=E mask=0 dest_id:0 > (XEN) IRQ 3 Vec 64: > (XEN) Apic 0x00, Pin 3: vec=40 delivery=LoPri dest=L status=0 > polarity=0 irr=0 trig=E mask=0 dest_id:0 > (XEN) IRQ 4 Vec241: > (XEN) Apic 0x00, Pin 4: vec=f1 delivery=LoPri dest=L status=0 > polarity=0 irr=0 trig=E mask=0 dest_id:0 > (XEN) IRQ 5 Vec 72: > (XEN) Apic 0x00, Pin 5: vec=48 delivery=LoPri dest=L status=0 > polarity=0 irr=0 trig=E mask=0 dest_id:0 > (XEN) IRQ 6 Vec 80: > (XEN) Apic 0x00, Pin 6: vec=50 delivery=LoPri dest=L status=0 > polarity=0 irr=0 trig=E mask=0 dest_id:0 > (XEN) IRQ 7 Vec218: > (XEN) Apic 0x00, Pin 7: vec=da delivery=LoPri dest=L status=0 > polarity=0 irr=0 trig=E mask=0 dest_id:0 > (XEN) IRQ 8 Vec216: > (XEN) Apic 0x00, Pin 8: vec=d8 delivery=LoPri dest=L status=0 > polarity=0 irr=0 trig=E mask=0 dest_id:0 > (XEN) IRQ 9 Vec135: > (XEN) Apic 0x00, Pin 9: vec=87 delivery=LoPri dest=L status=0 > polarity=0 irr=0 trig=L mask=0 dest_id:0 > (XEN) IRQ 10 Vec112: > (XEN) Apic 0x00, Pin 10: vec=70 delivery=LoPri dest=L status=0 > polarity=0 irr=0 trig=E mask=0 dest_id:0 > (XEN) IRQ 11 Vec120: > (XEN) Apic 0x00, Pin 11: vec=78 delivery=LoPri dest=L status=0 > polarity=0 irr=0 trig=E mask=0 dest_id:0 > (XEN) IRQ 12 Vec143: > (XEN) Apic 0x00, Pin 12: vec=8f delivery=LoPri dest=L status=0 > polarity=0 irr=0 trig=E mask=0 dest_id:0 > (XEN) IRQ 13 Vec144: > (XEN) Apic 0x00, Pin 13: vec=90 delivery=LoPri dest=L status=0 > polarity=0 irr=0 trig=E mask=1 dest_id:0 > (XEN) IRQ 14 Vec152: > (XEN) Apic 0x00, Pin 14: vec=98 delivery=LoPri dest=L status=0 > polarity=0 irr=0 trig=E mask=0 dest_id:0 > (XEN) IRQ 15 Vec160: > (XEN) Apic 0x00, Pin 15: vec=a0 delivery=LoPri dest=L status=0 > polarity=0 irr=0 trig=E mask=0 dest_id:0 > (XEN) IRQ 16 Vec151: > (XEN) Apic 0x00, Pin 16: vec=97 delivery=LoPri dest=L status=0 > polarity=1 irr=0 trig=L mask=0 dest_id:0 > (XEN) IRQ 17 Vec159: > (XEN) Apic 0x00, Pin 17: vec=9f delivery=LoPri dest=L 
status=0 > polarity=1 irr=0 trig=L mask=0 dest_id:0 > (XEN) IRQ 18 Vec121: > (XEN) Apic 0x00, Pin 18: vec=79 delivery=LoPri dest=L status=0 > polarity=1 irr=0 trig=L mask=1 dest_id:0 > (XEN) IRQ 19 Vec200: > (XEN) Apic 0x00, Pin 19: vec=c8 delivery=LoPri dest=L status=0 > polarity=1 irr=0 trig=L mask=1 dest_id:0 > (XEN) IRQ 20 Vec211: > (XEN) Apic 0x00, Pin 20: vec=d3 delivery=LoPri dest=L status=0 > polarity=1 irr=0 trig=L mask=0 dest_id:0 > (XEN) IRQ 22 Vec 43: > (XEN) Apic 0x00, Pin 22: vec=2b delivery=LoPri dest=L status=0 > polarity=1 irr=0 trig=L mask=1 dest_id:0 > (XEN) IRQ 23 Vec168: > (XEN) Apic 0x00, Pin 23: vec=a8 delivery=LoPri dest=L status=0 > polarity=0 irr=0 trig=E mask=1 dest_id:0 > (XEN) number of MP IRQ sources: 15. > (XEN) number of IO-APIC #2 registers: 24. > (XEN) testing the IO APIC....................... > (XEN) IO APIC #2...... > (XEN) .... register #00: 02000000 > (XEN) ....... : physical APIC id: 02 > (XEN) ....... : Delivery Type: 0 > (XEN) ....... : LTS : 0 > (XEN) .... register #01: 00170020 > (XEN) ....... : max redirection entries: 0017 > (XEN) ....... : PRQ implemented: 0 > (XEN) ....... : IO APIC version: 0020 > (XEN) .... 
IRQ redirection table: > (XEN) NR Log Phy Mask Trig IRR Pol Stat Dest Deli Vect: > (XEN) 00 0DC 0C 1 0 0 0 0 1 2 87 > (XEN) 01 000 00 0 0 0 0 0 1 1 7F > (XEN) 02 000 00 0 0 0 0 0 1 1 F0 > (XEN) 03 000 00 0 0 0 0 0 1 1 40 > (XEN) 04 000 00 0 0 0 0 0 1 1 F1 > (XEN) 05 000 00 0 0 0 0 0 1 1 48 > (XEN) 06 000 00 0 0 0 0 0 1 1 50 > (XEN) 07 000 00 0 0 0 0 0 1 1 DA > (XEN) 08 000 00 0 0 0 0 0 1 1 D8 > (XEN) 09 000 00 0 1 0 0 0 1 1 87 > (XEN) 0a 000 00 0 0 0 0 0 1 1 70 > (XEN) 0b 000 00 0 0 0 0 0 1 1 78 > (XEN) 0c 000 00 0 0 0 0 0 1 1 8F > (XEN) 0d 000 00 1 0 0 0 0 1 1 90 > (XEN) 0e 000 00 0 0 0 0 0 1 1 98 > (XEN) 0f 000 00 0 0 0 0 0 1 1 A0 > (XEN) 10 000 00 0 1 0 1 0 1 1 97 > (XEN) 11 000 00 0 1 0 1 0 1 1 9F > (XEN) 12 000 00 1 1 0 1 0 1 1 79 > (XEN) 13 000 00 1 1 0 1 0 1 1 C8 > (XEN) 14 000 00 0 1 0 1 0 1 1 D3 > (XEN) 15 000 00 1 0 0 0 0 0 0 00 > (XEN) 16 000 00 1 1 0 1 0 1 1 2B > (XEN) 17 000 00 1 0 0 0 0 1 1 A8 > (XEN) Using vector-based indexing > (XEN) IRQ to pin mappings: > (XEN) IRQ240 -> 0:2 > (XEN) IRQ127 -> 0:1 > (XEN) IRQ64 -> 0:3 > (XEN) IRQ241 -> 0:4 > (XEN) IRQ72 -> 0:5 > (XEN) IRQ80 -> 0:6 > (XEN) IRQ218 -> 0:7 > (XEN) IRQ216 -> 0:8 > (XEN) IRQ135 -> 0:9 > (XEN) IRQ112 -> 0:10 > (XEN) IRQ120 -> 0:11 > (XEN) IRQ143 -> 0:12 > (XEN) IRQ144 -> 0:13 > (XEN) IRQ152 -> 0:14 > (XEN) IRQ160 -> 0:15 > (XEN) IRQ151 -> 0:16 > (XEN) IRQ159 -> 0:17 > (XEN) IRQ121 -> 0:18 > (XEN) IRQ200 -> 0:19 > (XEN) IRQ211 -> 0:20 > (XEN) IRQ43 -> 0:22 > (XEN) IRQ168 -> 0:23 > (XEN) .................................... done. 
> (XEN) Xen BUG at io_apic.c:556 > (XEN) ----[ Xen-4.1.5-rc1 x86_64 debug=y Tainted: C ]---- > (XEN) CPU: 0 > (XEN) RIP: e008:[<ffff82c48015e2db>] smp_irq_move_cleanup_interrupt+0x216/0x28e > (XEN) RFLAGS: 0000000000010092 CONTEXT: hypervisor > (XEN) rax: 0000000000000000 rbx: 00000000000000e9 rcx: 0000000000000000 > (XEN) rdx: 0000000000000000 rsi: 000000000000000a rdi: ffff82c4802592e0 > (XEN) rbp: ffff82c48029ff08 rsp: ffff82c48029feb8 r8: 0000000000000004 > (XEN) r9: 0000000000000004 r10: 0000000000000004 r11: 0000000000000002 > (XEN) r12: ffff830421080250 r13: ffff830421060534 r14: ffff82c48029ff18 > (XEN) r15: ffff82c4802dd9e0 cr0: 000000008005003b cr4: 00000000000026f0 > (XEN) cr3: 000000026582c000 cr2: ffff8804020701d8 > (XEN) ds: 0000 es: 0000 fs: 0000 gs: 0000 ss: 0000 cs: e008 > (XEN) Xen stack trace from rsp=ffff82c48029feb8: > (XEN) 0000000000000000 ffff82c48029ff18 ffff82c4802dd9e0 000000e900000000 > (XEN) 000000000000e02b 0000000000000000 000000004bf51982 00000000000060a9 > (XEN) 0000000000000000 0000000000000000 00007d3b7fd600c7 ffff82c48014de60 > (XEN) 0000000000000000 0000000000000000 00000000000060a9 000000004bf51982 > (XEN) ffff8802d2665b28 0000000000000000 0000000000000000 0000000000007ff0 > (XEN) 0000000000000022 0000000000000000 000000024bf57322 0000000001307da0 > (XEN) 00000000000059a0 0000000000000000 00000000000060a9 0000002000000000 > (XEN) ffffffff8123c51a 000000000000e033 0000000000000293 ffff8802d2665b08 > (XEN) 000000000000e02b 0000000000000000 0000000000000000 0000000000000000 > (XEN) 0000000000000000 0000000000000000 ffff8300ca9a0000 0000000000000000 > (XEN) 0000000000000000 > (XEN) Xen call trace: > (XEN) [<ffff82c48015e2db>] smp_irq_move_cleanup_interrupt+0x216/0x28e > > >So vector e9 doesn''t appear to be programmed in anywhere. 
I am starting to get more into the realm of guessing here, but can you use apic_verbosity=debug on the command line and copy this extra debugging logic into send_cleanup_vector()?

You should be able to trigger it conditionally on "desc->arch.vector == 0xe9". You will probably also want to change the BUG() to a WARN(), so we get the interrupt and IO-APIC information on both sides of the cleanup vector, as well as the stack trace of the code path through Xen taken as a result of vector 0xe9.

~Andrew
Marek Marczykowski
2013-Mar-27 08:50 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
On 26.03.2013 19:50, Andrew Cooper wrote:
> On 26/03/2013 18:21, Marek Marczykowski wrote:
>> [full dump_irqs()/__print_IO_APIC() output quoted in the previous
>> message - snipped]
>
> So vector e9 doesn't appear to be programmed in anywhere.
>
> I am starting to get more into the realm of guessing here, but can you
> use apic_verbosity=debug on the command line and copy this extra
> debugging logic into send_cleanup_vector()?
>
> You should be able to trigger it conditionally on "desc->arch.vector ==
> 0xe9". You will probably also want to change the BUG() to a WARN(), so
> we get the interrupt and IO-APIC information on both sides of the
> cleanup vector, as well as the stack trace of the code path through Xen
> taken as a result of vector 0xe9.

send_cleanup_vector() doesn't seem to be called with cfg->vector == 0xe9...
Can dom0 mess something here around?

--
Best Regards / Pozdrawiam,
Marek Marczykowski
Invisible Things Lab

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
Jan Beulich
2013-Mar-27 08:52 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
>>> On 26.03.13 at 19:50, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> So vector e9 doesn't appear to be programmed in anywhere.

Quite obviously, as it's the 8259A vector for IRQ 9. The question really is why an IRQ appears on that vector in the first place. The 8259A resume code _should_ leave all IRQs masked on a fully IO-APIC system (see my question raised yesterday). And that's also why I suggested, for an experiment, to fiddle with the loop exit condition to exclude legacy vectors (which wouldn't be a final solution, but would at least tell us whether the direction is the right one).

In the end, besides understanding why an interrupt on vector E9 gets raised at all, we may also need to tweak the IRQ migration logic to not do anything on legacy IRQs, but that would need to happen earlier than in smp_irq_move_cleanup_interrupt().

Considering that 4.3 apparently doesn't have this problem, we may need to go hunt for a change that isn't directly connected to this, yet deals with the problem as a side effect (at least I don't recall any particular fix since 4.2). One aspect here is the double mapping of legacy IRQs (once to their IO-APIC vector, and once to their legacy vector, i.e. vector_irq[] having two entries pointing to the same IRQ).

Jan
Jan Beulich
2013-Mar-27 08:58 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
>>> On 27.03.13 at 09:50, Marek Marczykowski <marmarek@invisiblethingslab.com> wrote:
> send_cleanup_vector() doesn't seem to be called with cfg->vector == 0xe9...
> Can dom0 mess something here around?

Of course not - I suppose it is being called for IRQ9 (with whatever vector the IO-APIC has set for that IRQ at that point in time).

Jan
Jan Beulich
2013-Mar-27 09:03 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
>>> On 27.03.13 at 09:52, "Jan Beulich" <JBeulich@suse.com> wrote:
>>>> On 26.03.13 at 19:50, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>> So vector e9 doesn't appear to be programmed in anywhere.
>
> Quite obviously, as it's the 8259A vector for IRQ 9. The question
> really is why an IRQ appears on that vector in the first place. The
> 8259A resume code _should_ leave all IRQs masked on a fully
> IO-APIC system (see my question raised yesterday).

So to put this in consumable form: Please log what i8259A_resume() writes to ports 21 and A1 (i.e. cached_21 and cached_A1), and also dump those ports' contents at the crash point (i.e. alongside the dump_irqs()).

Jan
Marek Marczykowski
2013-Mar-27 14:01 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
On 27.03.2013 10:03, Jan Beulich wrote:
>>>> On 27.03.13 at 09:52, "Jan Beulich" <JBeulich@suse.com> wrote:
>>>>> On 26.03.13 at 19:50, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>>> So vector e9 doesn't appear to be programmed in anywhere.
>>
>> Quite obviously, as it's the 8259A vector for IRQ 9. The question
>> really is why an IRQ appears on that vector in the first place. The
>> 8259A resume code _should_ leave all IRQs masked on a fully
>> IO-APIC system (see my question raised yesterday).
>
> So to put this in consumable form: Please log what i8259A_resume()
> writes to ports 21 and A1 (i.e. cached_21 and cached_A1), and also
> dump those ports' contents at the crash point (i.e. alongside the
> dump_irqs()).

I've noticed that not all messages are available on the serial console - in particular, nothing from inside i8259A_resume(). So I changed BUG to WARN and got some additional lines.

Ports: 21:0xfb, A1:0xff (the same in i8259A_resume() as at the crash point).

Part of http://duch.mimuw.edu.pl/~marmarek/qubes/xen-4.1-failed-resume-dump-irqs3.log:

(XEN) Preparing system for ACPI S3 state.
(XEN) Disabling non-boot CPUs ...
(XEN) Broke affinity for irq 1
(XEN) Broke affinity for irq 12
(XEN) Broke affinity for irq 17
(XEN) [VT-D]intremap.c:552: remap_entry_to_msi_msg: index (65535) get an empty entry!
(XEN) Broke affinity for irq 27
(XEN) Broke affinity for irq 1
(XEN) Broke affinity for irq 7
(XEN) Broke affinity for irq 9
(XEN) Broke affinity for irq 16
(XEN) Broke affinity for irq 20
(XEN) [VT-D]intremap.c:552: remap_entry_to_msi_msg: index (65535) get an empty entry!
(XEN) Broke affinity for irq 32
(XEN) Broke affinity for irq 36
(XEN) Broke affinity for irq 1
(XEN) Broke affinity for irq 7
(XEN) Broke affinity for irq 20
(XEN) [VT-D]intremap.c:552: remap_entry_to_msi_msg: index (65535) get an empty entry!
(XEN) Broke affinity for irq 28
(XEN) Broke affinity for irq 29
(XEN) Broke affinity for irq 30
(XEN) Broke affinity for irq 31
(XEN) Entering ACPI S3 state.
(XEN) i8259A_suspend: cached_21: 0xfb, cached_A1: 0xff (XEN) i8259A_resume: cached_21: 0xfb, cached_A1: 0xff (XEN) mce_intel.c:1162: MCA Capability: BCAST 1 SER 0 CMCI 1 firstbank 0 extended MCE MSR 0 (XEN) CPU0 CMCI LVT vector (0xf7) already installed (XEN) CPU0: Thermal LVT vector (0xfa) already installed (XEN) Finishing wakeup from ACPI S3 state. (XEN) Enabling non-boot CPUs ... (XEN) Suppress EOI broadcast on CPU#1 (XEN) masked ExtINT on CPU#1 (XEN) Suppress EOI broadcast on CPU#2 (XEN) masked ExtINT on CPU#2 (XEN) Suppress EOI broadcast on CPU#3 (XEN) masked ExtINT on CPU#3 (XEN) *** IRQ BUG found *** (XEN) CPU0 -Testing vector 233 from bitmap 44,49,57,64,68,72,76,80,84,88,96,100,108,112,120,122,144,152,154,160,168,192,194,200,208,211,218-219 (XEN) Guest interrupt information: (XEN) IRQ: 0 affinity:00000000,00000000,00000000,00000001 vec:f0 type=IO-APIC-edge status=00000000 mapped, unbound (XEN) IRQ: 1 affinity:00000000,00000000,00000000,00000002 vec:db type=IO-APIC-edge status=00000050 in-flight=0 domain-list=0: 1(-S--), (XEN) IRQ: 2 affinity:ffffffff,ffffffff,ffffffff,ffffffff vec:e2 type=XT-PIC status=00000000 mapped, unbound (XEN) IRQ: 3 affinity:00000000,00000000,00000000,00000001 vec:40 type=IO-APIC-edge status=00000002 mapped, unbound (XEN) IRQ: 4 affinity:00000000,00000000,00000000,00000001 vec:f1 type=IO-APIC-edge status=00000000 mapped, unbound (XEN) IRQ: 5 affinity:00000000,00000000,00000000,00000001 vec:48 type=IO-APIC-edge status=00000002 mapped, unbound (XEN) IRQ: 6 affinity:00000000,00000000,00000000,00000001 vec:50 type=IO-APIC-edge status=00000002 mapped, unbound (XEN) IRQ: 7 affinity:00000000,00000000,00000000,00000004 vec:7a type=IO-APIC-edge status=00000050 in-flight=0 domain-list=0: 7(-S--), (XEN) IRQ: 8 affinity:00000000,00000000,00000000,00000001 vec:60 type=IO-APIC-edge status=00000050 in-flight=0 domain-list=0: 8(-S--), (XEN) IRQ: 9 affinity:00000000,00000000,00000000,00000001 vec:64 type=IO-APIC-level status=00000010 in-flight=0 
domain-list=0: 9(-S--), (XEN) IRQ: 10 affinity:00000000,00000000,00000000,00000001 vec:70 type=IO-APIC-edge status=00000002 mapped, unbound (XEN) IRQ: 11 affinity:00000000,00000000,00000000,00000001 vec:78 type=IO-APIC-edge status=00000002 mapped, unbound (XEN) IRQ: 12 affinity:00000000,00000000,00000000,00000001 vec:4c type=IO-APIC-edge status=00000050 in-flight=0 domain-list=0: 12(-S--), (XEN) IRQ: 13 affinity:00000000,00000000,00000000,0000000f vec:90 type=IO-APIC-edge status=00000002 mapped, unbound (XEN) IRQ: 14 affinity:00000000,00000000,00000000,00000001 vec:98 type=IO-APIC-edge status=00000002 mapped, unbound (XEN) IRQ: 15 affinity:00000000,00000000,00000000,00000001 vec:a0 type=IO-APIC-edge status=00000002 mapped, unbound (XEN) IRQ: 16 affinity:00000000,00000000,00000000,00000001 vec:6c type=IO-APIC-level status=00000050 in-flight=0 domain-list=0: 16(-S--), (XEN) IRQ: 17 affinity:00000000,00000000,00000000,00000001 vec:54 type=IO-APIC-level status=00000050 in-flight=0 domain-list=0: 17(-S--), (XEN) IRQ: 18 affinity:00000000,00000000,00000000,00000008 vec:39 type=IO-APIC-level status=00000002 mapped, unbound (XEN) IRQ: 19 affinity:00000000,00000000,00000000,0000000f vec:c8 type=IO-APIC-level status=00000002 mapped, unbound (XEN) IRQ: 20 affinity:00000000,00000000,00000000,00000004 vec:da type=IO-APIC-level status=00000050 in-flight=0 domain-list=0: 20(-S--), (XEN) IRQ: 22 affinity:00000000,00000000,00000000,0000000f vec:9a type=IO-APIC-level status=00000002 mapped, unbound (XEN) IRQ: 23 affinity:00000000,00000000,00000000,0000000f vec:a8 type=IO-APIC-edge status=00000002 mapped, unbound (XEN) IRQ: 24 affinity:00000000,00000000,00000000,00000001 vec:28 type=DMA_MSI status=00000000 mapped, unbound (XEN) IRQ: 25 affinity:00000000,00000000,00000000,00000001 vec:30 type=DMA_MSI status=00000000 mapped, unbound (XEN) IRQ: 26 affinity:00000000,00000000,00000000,00000004 vec:3c type=PCI-MSI status=00000002 mapped, unbound (XEN) IRQ: 27 
affinity:ffffffff,ffffffff,ffffffff,ffffffff vec:9c type=PCI-MSI status=00000042 mapped, unbound (XEN) IRQ: 28 affinity:ffffffff,ffffffff,ffffffff,ffffffff vec:a4 type=PCI-MSI status=00000042 mapped, unbound (XEN) IRQ: 29 affinity:ffffffff,ffffffff,ffffffff,ffffffff vec:ac type=PCI-MSI status=00000042 mapped, unbound (XEN) IRQ: 32 affinity:00000000,00000000,00000000,00000001 vec:74 type=PCI-MSI status=00000050 in-flight=0 domain-list=0:273(-S--), (XEN) IRQ: 33 affinity:00000000,00000000,00000000,00000004 vec:8c type=PCI-MSI status=00000010 in-flight=0 domain-list=0:272(PS--), (XEN) IRQ: 34 affinity:00000000,00000000,00000000,00000001 vec:94 type=PCI-MSI status=00000010 in-flight=0 domain-list=0:271(-S--), (XEN) IRQ: 35 affinity:00000000,00000000,00000000,00000004 vec:d9 type=PCI-MSI status=00000042 mapped, unbound (XEN) IRQ: 36 affinity:00000000,00000000,00000000,00000001 vec:7c type=PCI-MSI status=00000050 in-flight=0 domain-list=1: 54(-S--), (XEN) IO-APIC interrupt information: (XEN) IRQ 0 Vec240: (XEN) Apic 0x00, Pin 2: vec=f0 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=0 dest_id:0 (XEN) IRQ 1 Vec219: (XEN) Apic 0x00, Pin 1: vec=db delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=0 dest_id:0 (XEN) IRQ 3 Vec 64: (XEN) Apic 0x00, Pin 3: vec=40 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=0 dest_id:0 (XEN) IRQ 4 Vec241: (XEN) Apic 0x00, Pin 4: vec=f1 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=0 dest_id:0 (XEN) IRQ 5 Vec 72: (XEN) Apic 0x00, Pin 5: vec=48 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=0 dest_id:0 (XEN) IRQ 6 Vec 80: (XEN) Apic 0x00, Pin 6: vec=50 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=0 dest_id:0 (XEN) IRQ 7 Vec122: (XEN) Apic 0x00, Pin 7: vec=7a delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=0 dest_id:0 (XEN) IRQ 8 Vec 96: (XEN) Apic 0x00, Pin 8: vec=60 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=0 dest_id:0 (XEN) IRQ 9 Vec100: 
(XEN) Apic 0x00, Pin 9: vec=64 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=L mask=0 dest_id:0 (XEN) IRQ 10 Vec112: (XEN) Apic 0x00, Pin 10: vec=70 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=0 dest_id:0 (XEN) IRQ 11 Vec120: (XEN) Apic 0x00, Pin 11: vec=78 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=0 dest_id:0 (XEN) IRQ 12 Vec 76: (XEN) Apic 0x00, Pin 12: vec=4c delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=0 dest_id:0 (XEN) IRQ 13 Vec144: (XEN) Apic 0x00, Pin 13: vec=90 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=1 dest_id:0 (XEN) IRQ 14 Vec152: (XEN) Apic 0x00, Pin 14: vec=98 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=0 dest_id:0 (XEN) IRQ 15 Vec160: (XEN) Apic 0x00, Pin 15: vec=a0 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=0 dest_id:0 (XEN) IRQ 16 Vec108: (XEN) Apic 0x00, Pin 16: vec=6c delivery=LoPri dest=L status=0 polarity=1 irr=0 trig=L mask=0 dest_id:0 (XEN) IRQ 17 Vec 84: (XEN) Apic 0x00, Pin 17: vec=54 delivery=LoPri dest=L status=0 polarity=1 irr=0 trig=L mask=0 dest_id:0 (XEN) IRQ 18 Vec 57: (XEN) Apic 0x00, Pin 18: vec=39 delivery=LoPri dest=L status=0 polarity=1 irr=0 trig=L mask=1 dest_id:0 (XEN) IRQ 19 Vec200: (XEN) Apic 0x00, Pin 19: vec=c8 delivery=LoPri dest=L status=0 polarity=1 irr=0 trig=L mask=1 dest_id:0 (XEN) IRQ 20 Vec218: (XEN) Apic 0x00, Pin 20: vec=da delivery=LoPri dest=L status=0 polarity=1 irr=0 trig=L mask=0 dest_id:0 (XEN) IRQ 22 Vec154: (XEN) Apic 0x00, Pin 22: vec=9a delivery=LoPri dest=L status=0 polarity=1 irr=0 trig=L mask=1 dest_id:0 (XEN) IRQ 23 Vec168: (XEN) Apic 0x00, Pin 23: vec=a8 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=1 dest_id:0 (XEN) number of MP IRQ sources: 15. (XEN) number of IO-APIC #2 registers: 24. (XEN) testing the IO APIC....................... (XEN) IO APIC #2...... (XEN) .... register #00: 02000000 (XEN) ....... : physical APIC id: 02 (XEN) ....... : Delivery Type: 0 (XEN) ....... 
: LTS : 0 (XEN) .... register #01: 00170020 (XEN) ....... : max redirection entries: 0017 (XEN) ....... : PRQ implemented: 0 (XEN) ....... : IO APIC version: 0020 (XEN) .... IRQ redirection table: (XEN) NR Log Phy Mask Trig IRR Pol Stat Dest Deli Vect: (XEN) 00 000 00 1 0 0 0 0 0 0 00 (XEN) 01 000 00 0 0 0 0 0 1 1 DB (XEN) 02 000 00 0 0 0 0 0 1 1 F0 (XEN) 03 000 00 0 0 0 0 0 1 1 40 (XEN) 04 000 00 0 0 0 0 0 1 1 F1 (XEN) 05 000 00 0 0 0 0 0 1 1 48 (XEN) 06 000 00 0 0 0 0 0 1 1 50 (XEN) 07 000 00 0 0 0 0 0 1 1 7A (XEN) 08 000 00 0 0 0 0 0 1 1 60 (XEN) 09 000 00 0 1 0 0 0 1 1 64 (XEN) 0a 000 00 0 0 0 0 0 1 1 70 (XEN) 0b 000 00 0 0 0 0 0 1 1 78 (XEN) 0c 000 00 0 0 0 0 0 1 1 4C (XEN) 0d 000 00 1 0 0 0 0 1 1 90 (XEN) 0e 000 00 0 0 0 0 0 1 1 98 (XEN) 0f 000 00 0 0 0 0 0 1 1 A0 (XEN) 10 000 00 0 1 0 1 0 1 1 6C (XEN) 11 000 00 0 1 0 1 0 1 1 54 (XEN) 12 000 00 1 1 0 1 0 1 1 39 (XEN) 13 000 00 1 1 0 1 0 1 1 C8 (XEN) 14 000 00 0 1 0 1 0 1 1 DA (XEN) 15 000 00 1 0 0 0 0 0 0 00 (XEN) 16 000 00 1 1 0 1 0 1 1 9A (XEN) 17 000 00 1 0 0 0 0 1 1 A8 (XEN) Using vector-based indexing (XEN) IRQ to pin mappings: (XEN) IRQ240 -> 0:2 (XEN) IRQ219 -> 0:1 (XEN) IRQ64 -> 0:3 (XEN) IRQ241 -> 0:4 (XEN) IRQ72 -> 0:5 (XEN) IRQ80 -> 0:6 (XEN) IRQ122 -> 0:7 (XEN) IRQ96 -> 0:8 (XEN) IRQ100 -> 0:9 (XEN) IRQ112 -> 0:10 (XEN) IRQ120 -> 0:11 (XEN) IRQ76 -> 0:12 (XEN) IRQ144 -> 0:13 (XEN) IRQ152 -> 0:14 (XEN) IRQ160 -> 0:15 (XEN) IRQ108 -> 0:16 (XEN) IRQ84 -> 0:17 (XEN) IRQ57 -> 0:18 (XEN) IRQ200 -> 0:19 (XEN) IRQ218 -> 0:20 (XEN) IRQ154 -> 0:22 (XEN) IRQ168 -> 0:23 (XEN) .................................... done. 
(XEN) i8259: 21: 0xfb, A1: 0xff
(XEN) Xen WARN at io_apic.c:558
(XEN) ----[ Xen-4.1.5-rc1 x86_64 debug=y Tainted: C ]----
(XEN) CPU: 0
(XEN) RIP: e008:[<ffff82c48015e341>] smp_irq_move_cleanup_interrupt+0x23c/0x2bc
(XEN) RFLAGS: 0000000000010086 CONTEXT: hypervisor
(XEN) rax: 0000000000000000 rbx: 00000000000000e9 rcx: 0000000000000000
(XEN) rdx: 0000000000000000 rsi: 000000000000000a rdi: ffff82c4802592e0
(XEN) rbp: ffff82c48029fb58 rsp: ffff82c48029fb08 r8: 0000000000000004
(XEN) r9: 0000000000000001 r10: 00000000000000ff r11: 0000000000000002
(XEN) r12: ffff830421080250 r13: ffff830421060534 r14: ffff82c48029ff18
(XEN) r15: ffff82c4802dd9e0 cr0: 000000008005003b cr4: 00000000000026f0
(XEN) cr3: 000000037e7a8000 cr2: ffff880402070318
(XEN) ds: 0000 es: 0000 fs: 0000 gs: 0000 ss: e010 cs: e008
(XEN) Xen stack trace from rsp=ffff82c48029fb08:
(XEN) 0000000000000000 0000000000000008 ffff82c48029ff18 ffff82c4802dd9e0
(XEN) ffff82c48029fb58 0000000000000004 0000000000000000 0000000080030014
(XEN) 0000000000000000 0000000000000000 00007d3b7fd60477 ffff82c48014de60
(XEN) 0000000000000000 0000000000000000 0000000080030014 0000000000000000
(XEN) ffff82c48029fc18 0000000000000004 0000000000000246 0000000000000000
(XEN) 00000000ffffffff 00000000ffffffff 0000000000000000 0000000000000001
(XEN) 0000000000000cfc 0000000000000282 ffff82c48025a9c0 0000002000000000
(XEN) ffff82c4801226c0 000000000000e008 0000000000000282 ffff82c48029fc18
(XEN) 000000000000e010 0000000000000282 ffff82c48029fc48 ffff82c480175950
(XEN) 0000000000000202 0000000000000006 0000000000000010 00000000e2200004
(XEN) ffff82c48029fc68 ffff82c4802105dc ffff82c48029fc78 ffff82c480122614
(XEN) ffff82c48029fcc8 ffff82c480160183 ffff82c48029fca8 ffff82c480175950
(XEN) 000082c4ffffffff 0000000000000003 ffff8301108fd1c0 ffff830421050ac0
(XEN) ffff8301108fd1c0 0000000000000000 0000000000000000 0000000000000003
(XEN) ffff82c48029fd58 ffff82c48016033a 000000000000002f 0000000000000082
(XEN) 000782c48029fd08 ffff82c48029fe10 0000006a00000008 ffff82c48029fe78
(XEN) 0000000300000068 0000000000000000 0000000000002000 ffff82c4ffffffff
(XEN) ffff82c48029fe10 ffff82c48029fe78 ffff82c48029fe10 ffff830421050ac0
(XEN) 0000000000000000 000000000000001e ffff82c48029fdc8 ffff82c4801610ef
(XEN) ffff82c48029fdb8 ffff82c480115ec5 0000000000000293 ffff83042100a1f8
(XEN) Xen call trace:
(XEN) [<ffff82c48015e341>] smp_irq_move_cleanup_interrupt+0x23c/0x2bc
(XEN) [<ffff82c48014de60>] irq_move_cleanup_interrupt+0x30/0x40
(XEN) [<ffff82c4801226c0>] _spin_unlock_irqrestore+0x22/0x24
(XEN) [<ffff82c480175950>] pci_conf_read+0xb0/0xc1
(XEN) [<ffff82c4802105dc>] pci_conf_read32+0x7c/0x7e
(XEN) [<ffff82c480160183>] read_pci_mem_bar+0x2b0/0x303
(XEN) [<ffff82c48016033a>] msix_capability_init+0x164/0x5fa
(XEN) [<ffff82c4801610ef>] pci_enable_msi+0x19b/0x49b
(XEN) [<ffff82c4801643bd>] map_domain_pirq+0x281/0x3df
(XEN) [<ffff82c4801765cb>] do_physdev_op+0xa2b/0x1508
(XEN) [<ffff82c480209fa8>] syscall_enter+0xc8/0x122
(XEN)

--
Best Regards / Pozdrawiam,
Marek Marczykowski
Invisible Things Lab
Marek Marczykowski
2013-Mar-27 14:31 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
On 27.03.2013 09:52, Jan Beulich wrote:
>>>> On 26.03.13 at 19:50, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>> So vector e9 doesn't appear to be programmed in anywhere.
>
> Quite obviously, as it's the 8259A vector for IRQ 9. The question
> really is why an IRQ appears on that vector in the first place. The
> 8259A resume code _should_ leave all IRQs masked on a fully
> IO-APIC system (see my question raised yesterday).
>
> And that's also why I suggested, for an experiment, to fiddle with
> the loop exit condition to exclude legacy vectors (which wouldn't
> be a final solution, but would at least tell us whether the direction
> is the right one). In the end, besides understanding why an
> interrupt on vector E9 gets raised at all, we may also need to
> tweak the IRQ migration logic to not do anything on legacy IRQs,
> but that would need to happen earlier than in
> smp_irq_move_cleanup_interrupt(). Considering that 4.3
> apparently doesn't have this problem, we may need to go hunt for
> a change that isn't directly connected to this, yet deals with the
> problem as a side effect (at least I don't recall any particular fix
> since 4.2). One aspect here is the double mapping of legacy IRQs
> (once to their IO-APIC vector, and once to their legacy vector,
> i.e. vector_irq[] having two entries pointing to the same IRQ).

So I tried changing the loop exit condition to LAST_DYNAMIC_VECTOR, and it no longer hits that BUG/ASSERT. But resume still doesn't work: only CPU0 is used by the scheduler, there are some errors from the dom0 kernel, and errors about the PCI devices assigned to domU(1).
Messages from resume (different tries):
http://duch.mimuw.edu.pl/~marmarek/qubes/xen-4.1-last-dynamic-vector.log
http://duch.mimuw.edu.pl/~marmarek/qubes/xen-4.1-last-dynamic-vector2.log

Also, one time I got a fatal page fault earlier in resume (it isn't deterministic):
http://duch.mimuw.edu.pl/~marmarek/qubes/xen-4.1-resume-page-fault.log

--
Best Regards / Pozdrawiam,
Marek Marczykowski
Invisible Things Lab
Andrew Cooper
2013-Mar-27 14:46 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
On 27/03/2013 14:31, Marek Marczykowski wrote:> On 27.03.2013 09:52, Jan Beulich wrote: >>>>> On 26.03.13 at 19:50, Andrew Cooper <andrew.cooper3@citrix.com> wrote: >>> So vector e9 doesn''t appear to be programmed in anywhere. >> Quite obviously, as it''s the 8259A vector for IRQ 9. The question >> really is why an IRQ appears on that vector in the first place. The >> 8259A resume code _should_ leave all IRQs masked on a fully >> IO-APIC system (see my question raised yesterday). >> >> And that''s also why I suggested, for an experiment, to fiddle with >> the loop exit condition to exclude legacy vectors (which wouldn''t >> be a final solution, but would at least tell us whether the direction >> is the right one). In the end, besides understanding why an >> interrupt on vector E9 gets raised at all, we may also need to >> tweak the IRQ migration logic to not do anything on legacy IRQs, >> but that would need to happen earlier than in >> smp_irq_move_cleanup_interrupt(). Considering that 4.3 >> apparently doesn''t have this problem, we may need to go hunt for >> a change that isn''t directly connected to this, yet deals with the >> problem as a side effect (at least I don''t recall any particular fix >> since 4.2). One aspect here is the double mapping of legacy IRQs >> (once to their IO-APIC vector, and once to their legacy vector, >> i.e. vector_irq[] having two entries pointing to the same IRQ). > So tried change loop condition to LAST_DYNAMIC_VECTOR and it doesn''t hit that > BUG/ASSERT. But still it doesn''t work - only CPU0 used by scheduler, also some > errors from dom0 kernel, and errors about PCI devices used by domU(1). 
>
> Messages from resume (different tries):
> http://duch.mimuw.edu.pl/~marmarek/qubes/xen-4.1-last-dynamic-vector.log
> http://duch.mimuw.edu.pl/~marmarek/qubes/xen-4.1-last-dynamic-vector2.log
>
> Also one time I've got fatal page fault error, earlier in resume (it isn't
> deterministic):
> http://duch.mimuw.edu.pl/~marmarek/qubes/xen-4.1-resume-page-fault.log
>

This page fault is a NULL structure-pointer dereference, likely of the scheduling data. At first glance it looks related to the assertion failures I have been seeing sporadically in testing but have been unable to reproduce reliably. There seems to be something quite dodgy in the interaction of vcpu_wake and the scheduling loops.

The other logs indicate that dom0 appears to have a domain id of 1, which is sure to cause problems.

As for locating the cause of the legacy vectors, it might be a good idea to stick a printk at the top of do_IRQ() which reports any interrupt with a vector between 0xe0 and 0xef. That would at least indicate whether legacy vectors are genuinely being delivered, or whether some memory corruption is causing these effects.

~Andrew
Marek Marczykowski
2013-Mar-27 14:49 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
On 27.03.2013 15:46, Andrew Cooper wrote:> On 27/03/2013 14:31, Marek Marczykowski wrote: >> On 27.03.2013 09:52, Jan Beulich wrote: >>>>>> On 26.03.13 at 19:50, Andrew Cooper <andrew.cooper3@citrix.com> wrote: >>>> So vector e9 doesn''t appear to be programmed in anywhere. >>> Quite obviously, as it''s the 8259A vector for IRQ 9. The question >>> really is why an IRQ appears on that vector in the first place. The >>> 8259A resume code _should_ leave all IRQs masked on a fully >>> IO-APIC system (see my question raised yesterday). >>> >>> And that''s also why I suggested, for an experiment, to fiddle with >>> the loop exit condition to exclude legacy vectors (which wouldn''t >>> be a final solution, but would at least tell us whether the direction >>> is the right one). In the end, besides understanding why an >>> interrupt on vector E9 gets raised at all, we may also need to >>> tweak the IRQ migration logic to not do anything on legacy IRQs, >>> but that would need to happen earlier than in >>> smp_irq_move_cleanup_interrupt(). Considering that 4.3 >>> apparently doesn''t have this problem, we may need to go hunt for >>> a change that isn''t directly connected to this, yet deals with the >>> problem as a side effect (at least I don''t recall any particular fix >>> since 4.2). One aspect here is the double mapping of legacy IRQs >>> (once to their IO-APIC vector, and once to their legacy vector, >>> i.e. vector_irq[] having two entries pointing to the same IRQ). >> So tried change loop condition to LAST_DYNAMIC_VECTOR and it doesn''t hit that >> BUG/ASSERT. But still it doesn''t work - only CPU0 used by scheduler, also some >> errors from dom0 kernel, and errors about PCI devices used by domU(1). 
>>
>> Messages from resume (different tries):
>> http://duch.mimuw.edu.pl/~marmarek/qubes/xen-4.1-last-dynamic-vector.log
>> http://duch.mimuw.edu.pl/~marmarek/qubes/xen-4.1-last-dynamic-vector2.log
>>
>> Also one time I've got fatal page fault error, earlier in resume (it isn't
>> deterministic):
>> http://duch.mimuw.edu.pl/~marmarek/qubes/xen-4.1-resume-page-fault.log
>>
>
> This pagefault is a Null structure pointer dereference, likely the
> scheduling data. At a first glance, it looks related to the assertion
> failures I have been seeing sporadically in testing, but unable to
> reproduce reliably. There seems to be something quite dodgy with
> interaction of vcpu_wake and scheduling loops.
>
> The other logs indicate that dom0 appears to have a domain id of 1,
> which is sure to cause problems.

Perhaps not - domain 1 exists and has some PCI devices assigned (namely two network adapters).

> As for locating the cause of the legacy vectors, it might be a good idea
> to stick a printk at the top of do_IRQ() which indicates an interrupt
> with vector between 0xe0 and 0xef. This might at least indicate whether
> legacy vectors are genuinely being delivered, or whether we have some
> memory corruption causing these effects.

Ok, will try something like this.

--
Best Regards / Pozdrawiam,
Marek Marczykowski
Invisible Things Lab
Andrew Cooper
2013-Mar-27 14:52 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
On 27/03/2013 14:46, Andrew Cooper wrote:> On 27/03/2013 14:31, Marek Marczykowski wrote: >> On 27.03.2013 09:52, Jan Beulich wrote: >>>>>> On 26.03.13 at 19:50, Andrew Cooper <andrew.cooper3@citrix.com> wrote: >>>> So vector e9 doesn''t appear to be programmed in anywhere. >>> Quite obviously, as it''s the 8259A vector for IRQ 9. The question >>> really is why an IRQ appears on that vector in the first place. The >>> 8259A resume code _should_ leave all IRQs masked on a fully >>> IO-APIC system (see my question raised yesterday). >>> >>> And that''s also why I suggested, for an experiment, to fiddle with >>> the loop exit condition to exclude legacy vectors (which wouldn''t >>> be a final solution, but would at least tell us whether the direction >>> is the right one). In the end, besides understanding why an >>> interrupt on vector E9 gets raised at all, we may also need to >>> tweak the IRQ migration logic to not do anything on legacy IRQs, >>> but that would need to happen earlier than in >>> smp_irq_move_cleanup_interrupt(). Considering that 4.3 >>> apparently doesn''t have this problem, we may need to go hunt for >>> a change that isn''t directly connected to this, yet deals with the >>> problem as a side effect (at least I don''t recall any particular fix >>> since 4.2). One aspect here is the double mapping of legacy IRQs >>> (once to their IO-APIC vector, and once to their legacy vector, >>> i.e. vector_irq[] having two entries pointing to the same IRQ). >> So tried change loop condition to LAST_DYNAMIC_VECTOR and it doesn''t hit that >> BUG/ASSERT. But still it doesn''t work - only CPU0 used by scheduler, also some >> errors from dom0 kernel, and errors about PCI devices used by domU(1). 
>>
>> Messages from resume (different tries):
>> http://duch.mimuw.edu.pl/~marmarek/qubes/xen-4.1-last-dynamic-vector.log
>> http://duch.mimuw.edu.pl/~marmarek/qubes/xen-4.1-last-dynamic-vector2.log
>>
>> Also one time I've got fatal page fault error, earlier in resume (it isn't
>> deterministic):
>> http://duch.mimuw.edu.pl/~marmarek/qubes/xen-4.1-resume-page-fault.log
>>
> This pagefault is a Null structure pointer dereference, likely the
> scheduling data. At a first glance, it looks related to the assertion
> failures I have been seeing sporadically in testing, but unable to
> reproduce reliably. There seems to be something quite dodgy with
> interaction of vcpu_wake and scheduling loops.
>
> The other logs indicate that dom0 appears to have a domain id of 1,
> which is sure to cause problems.

Actually - ignore this. From the log:

(XEN) physdev.c:153: dom0: can't create irq for msi!
[ 113.637037] xhci_hcd 0000:03:00.0: xen map irq failed -22 for 32752 domain
(XEN) physdev.c:153: dom0: can't create irq for msi!
[ 113.657911] xhci_hcd 0000:03:00.0: xen map irq failed -22 for 32752 domain

and later:

(XEN) physdev.c:153: dom1: can't create irq for msi!
[ 121.909814] pciback 0000:00:19.0: xen map irq failed -22 for 1 domain
[ 121.954080] error enable msi for guest 1 status ffffffea
(XEN) physdev.c:153: dom1: can't create irq for msi!
[ 122.035355] pciback 0000:00:19.0: xen map irq failed -22 for 1 domain
[ 122.044421] error enable msi for guest 1 status ffffffea

I think there is a separate bug where mapped irqs are not unmapped on the suspend path.

> As for locating the cause of the legacy vectors, it might be a good idea
> to stick a printk at the top of do_IRQ() which indicates an interrupt
> with vector between 0xe0 and 0xef. This might at least indicate whether
> legacy vectors are genuinely being delivered, or whether we have some
> memory corruption causing these effects.
>
> ~Andrew
Konrad Rzeszutek Wilk
2013-Mar-27 15:47 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
On Wed, Mar 27, 2013 at 02:52:14PM +0000, Andrew Cooper wrote:> On 27/03/2013 14:46, Andrew Cooper wrote: > > On 27/03/2013 14:31, Marek Marczykowski wrote: > >> On 27.03.2013 09:52, Jan Beulich wrote: > >>>>>> On 26.03.13 at 19:50, Andrew Cooper <andrew.cooper3@citrix.com> wrote: > >>>> So vector e9 doesn''t appear to be programmed in anywhere. > >>> Quite obviously, as it''s the 8259A vector for IRQ 9. The question > >>> really is why an IRQ appears on that vector in the first place. The > >>> 8259A resume code _should_ leave all IRQs masked on a fully > >>> IO-APIC system (see my question raised yesterday). > >>> > >>> And that''s also why I suggested, for an experiment, to fiddle with > >>> the loop exit condition to exclude legacy vectors (which wouldn''t > >>> be a final solution, but would at least tell us whether the direction > >>> is the right one). In the end, besides understanding why an > >>> interrupt on vector E9 gets raised at all, we may also need to > >>> tweak the IRQ migration logic to not do anything on legacy IRQs, > >>> but that would need to happen earlier than in > >>> smp_irq_move_cleanup_interrupt(). Considering that 4.3 > >>> apparently doesn''t have this problem, we may need to go hunt for > >>> a change that isn''t directly connected to this, yet deals with the > >>> problem as a side effect (at least I don''t recall any particular fix > >>> since 4.2). One aspect here is the double mapping of legacy IRQs > >>> (once to their IO-APIC vector, and once to their legacy vector, > >>> i.e. vector_irq[] having two entries pointing to the same IRQ). > >> So tried change loop condition to LAST_DYNAMIC_VECTOR and it doesn''t hit that > >> BUG/ASSERT. But still it doesn''t work - only CPU0 used by scheduler, also some > >> errors from dom0 kernel, and errors about PCI devices used by domU(1). 
> >> > >> Messages from resume (different tries): > >> http://duch.mimuw.edu.pl/~marmarek/qubes/xen-4.1-last-dynamic-vector.log > >> http://duch.mimuw.edu.pl/~marmarek/qubes/xen-4.1-last-dynamic-vector2.log > >> > >> Also one time I''ve got fatal page fault error, earlier in resume (it isn''t > >> deterministic): > >> http://duch.mimuw.edu.pl/~marmarek/qubes/xen-4.1-resume-page-fault.log > >> > > This pagefault is a Null structure pointer dereference, likely the > > scheduling data. At a first glance, it looks related to the assertion > > failures I have been seeing sporadically in testing, but unable to > > reproduce reliably. There seems to be something quite dodgy with > > interaction of vcpu_wake and scheduling loops. > > > > The other logs indicate that dom0 appears to have a domain id of 1, > > which is sure to cause problems. > > Actually - ignore this > > >From the log, > > (XEN) physdev.c:153: dom0: can''t create irq for msi! > [ 113.637037] xhci_hcd 0000:03:00.0: xen map irq failed -22 for 32752 > domain > (XEN) physdev.c:153: dom0: can''t create irq for msi! > [ 113.657911] xhci_hcd 0000:03:00.0: xen map irq failed -22 for 32752 > domain > > and later > > (XEN) physdev.c:153: dom1: can''t create irq for msi! > [ 121.909814] pciback 0000:00:19.0: xen map irq failed -22 for 1 domain > [ 121.954080] error enable msi for guest 1 status ffffffea > (XEN) physdev.c:153: dom1: can''t create irq for msi! > [ 122.035355] pciback 0000:00:19.0: xen map irq failed -22 for 1 domain > [ 122.044421] error enable msi for guest 1 status ffffffea > > I think that there is a separate bug where mapped irqs are not unmapped > on the suspend path.You thinking this is a Linux (xen irq machinery) issue? Meaning it should end up calling PHYSDEV_unmap_pirq as part of the suspend process?
Marek Marczykowski
2013-Mar-27 15:51 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
On 27.03.2013 15:49, Marek Marczykowski wrote:
> On 27.03.2013 15:46, Andrew Cooper wrote:
>> As for locating the cause of the legacy vectors, it might be a good idea
>> to stick a printk at the top of do_IRQ() which indicates an interrupt
>> with vector between 0xe0 and 0xef. This might at least indicate whether
>> legacy vectors are genuinely being delivered, or whether we have some
>> memory corruption causing these effects.
>
> Ok, will try something like this.

Nothing interesting here... only vector 0xf1 for irq 4 and 0xf0 for irq 0 (which matches the irq dump information).

--
Best Regards / Pozdrawiam,
Marek Marczykowski
Invisible Things Lab
Andrew Cooper
2013-Mar-27 16:27 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
On 27/03/2013 15:51, Marek Marczykowski wrote:
> On 27.03.2013 15:49, Marek Marczykowski wrote:
>> On 27.03.2013 15:46, Andrew Cooper wrote:
>>> As for locating the cause of the legacy vectors, it might be a good idea
>>> to stick a printk at the top of do_IRQ() which indicates an interrupt
>>> with vector between 0xe0 and 0xef. This might at least indicate whether
>>> legacy vectors are genuinely being delivered, or whether we have some
>>> memory corruption causing these effects.
>> Ok, will try something like this.
> Nothing interesting here...
> Only vector 0xf1 for irq 4 and 0xf0 for irq 0 (which match irq dump information).
>

Even in the case where we hit the original assertion? If so, then all I can think is that the move_pending flag for that specific GSI has been corrupted in memory somehow. I wonder whether hexdumping irq_desc[9] after setup, before sleep, on resume, and at the point of the assertion failure might give some hints.

~Andrew
Andrew Cooper
2013-Mar-27 16:56 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
On 27/03/2013 15:47, Konrad Rzeszutek Wilk wrote:> On Wed, Mar 27, 2013 at 02:52:14PM +0000, Andrew Cooper wrote: >> On 27/03/2013 14:46, Andrew Cooper wrote: >>> On 27/03/2013 14:31, Marek Marczykowski wrote: >>>> On 27.03.2013 09:52, Jan Beulich wrote: >>>>>>>> On 26.03.13 at 19:50, Andrew Cooper <andrew.cooper3@citrix.com> wrote: >>>>>> So vector e9 doesn''t appear to be programmed in anywhere. >>>>> Quite obviously, as it''s the 8259A vector for IRQ 9. The question >>>>> really is why an IRQ appears on that vector in the first place. The >>>>> 8259A resume code _should_ leave all IRQs masked on a fully >>>>> IO-APIC system (see my question raised yesterday). >>>>> >>>>> And that''s also why I suggested, for an experiment, to fiddle with >>>>> the loop exit condition to exclude legacy vectors (which wouldn''t >>>>> be a final solution, but would at least tell us whether the direction >>>>> is the right one). In the end, besides understanding why an >>>>> interrupt on vector E9 gets raised at all, we may also need to >>>>> tweak the IRQ migration logic to not do anything on legacy IRQs, >>>>> but that would need to happen earlier than in >>>>> smp_irq_move_cleanup_interrupt(). Considering that 4.3 >>>>> apparently doesn''t have this problem, we may need to go hunt for >>>>> a change that isn''t directly connected to this, yet deals with the >>>>> problem as a side effect (at least I don''t recall any particular fix >>>>> since 4.2). One aspect here is the double mapping of legacy IRQs >>>>> (once to their IO-APIC vector, and once to their legacy vector, >>>>> i.e. vector_irq[] having two entries pointing to the same IRQ). >>>> So tried change loop condition to LAST_DYNAMIC_VECTOR and it doesn''t hit that >>>> BUG/ASSERT. But still it doesn''t work - only CPU0 used by scheduler, also some >>>> errors from dom0 kernel, and errors about PCI devices used by domU(1). 
>>>> >>>> Messages from resume (different tries): >>>> http://duch.mimuw.edu.pl/~marmarek/qubes/xen-4.1-last-dynamic-vector.log >>>> http://duch.mimuw.edu.pl/~marmarek/qubes/xen-4.1-last-dynamic-vector2.log >>>> >>>> Also one time I''ve got fatal page fault error, earlier in resume (it isn''t >>>> deterministic): >>>> http://duch.mimuw.edu.pl/~marmarek/qubes/xen-4.1-resume-page-fault.log >>>> >>> This pagefault is a Null structure pointer dereference, likely the >>> scheduling data. At a first glance, it looks related to the assertion >>> failures I have been seeing sporadically in testing, but unable to >>> reproduce reliably. There seems to be something quite dodgy with >>> interaction of vcpu_wake and scheduling loops. >>> >>> The other logs indicate that dom0 appears to have a domain id of 1, >>> which is sure to cause problems. >> Actually - ignore this >> >> >From the log, >> >> (XEN) physdev.c:153: dom0: can''t create irq for msi! >> [ 113.637037] xhci_hcd 0000:03:00.0: xen map irq failed -22 for 32752 >> domain >> (XEN) physdev.c:153: dom0: can''t create irq for msi! >> [ 113.657911] xhci_hcd 0000:03:00.0: xen map irq failed -22 for 32752 >> domain >> >> and later >> >> (XEN) physdev.c:153: dom1: can''t create irq for msi! >> [ 121.909814] pciback 0000:00:19.0: xen map irq failed -22 for 1 domain >> [ 121.954080] error enable msi for guest 1 status ffffffea >> (XEN) physdev.c:153: dom1: can''t create irq for msi! >> [ 122.035355] pciback 0000:00:19.0: xen map irq failed -22 for 1 domain >> [ 122.044421] error enable msi for guest 1 status ffffffea >> >> I think that there is a separate bug where mapped irqs are not unmapped >> on the suspend path. > You thinking this is a Linux (xen irq machinery) issue? Meaning it should > end up calling PHYSDEV_unmap_pirq as part of the suspend process?I am not sure. Without looking at the code, I am only speculating. Beyond that, the main question is about the expected behaviour. 
Do we expect dom0/U to unmap its irqs and remap them after resume? What do we expect from domains which are unaware of the host sleep action? ~Andrew
Marek Marczykowski
2013-Mar-27 17:15 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
On 27.03.2013 17:56, Andrew Cooper wrote:> On 27/03/2013 15:47, Konrad Rzeszutek Wilk wrote: >> On Wed, Mar 27, 2013 at 02:52:14PM +0000, Andrew Cooper wrote: >>> On 27/03/2013 14:46, Andrew Cooper wrote: >>>> On 27/03/2013 14:31, Marek Marczykowski wrote: >>>>> On 27.03.2013 09:52, Jan Beulich wrote: >>>>>>>>> On 26.03.13 at 19:50, Andrew Cooper <andrew.cooper3@citrix.com> wrote: >>>>>>> So vector e9 doesn''t appear to be programmed in anywhere. >>>>>> Quite obviously, as it''s the 8259A vector for IRQ 9. The question >>>>>> really is why an IRQ appears on that vector in the first place. The >>>>>> 8259A resume code _should_ leave all IRQs masked on a fully >>>>>> IO-APIC system (see my question raised yesterday). >>>>>> >>>>>> And that''s also why I suggested, for an experiment, to fiddle with >>>>>> the loop exit condition to exclude legacy vectors (which wouldn''t >>>>>> be a final solution, but would at least tell us whether the direction >>>>>> is the right one). In the end, besides understanding why an >>>>>> interrupt on vector E9 gets raised at all, we may also need to >>>>>> tweak the IRQ migration logic to not do anything on legacy IRQs, >>>>>> but that would need to happen earlier than in >>>>>> smp_irq_move_cleanup_interrupt(). Considering that 4.3 >>>>>> apparently doesn''t have this problem, we may need to go hunt for >>>>>> a change that isn''t directly connected to this, yet deals with the >>>>>> problem as a side effect (at least I don''t recall any particular fix >>>>>> since 4.2). One aspect here is the double mapping of legacy IRQs >>>>>> (once to their IO-APIC vector, and once to their legacy vector, >>>>>> i.e. vector_irq[] having two entries pointing to the same IRQ). >>>>> So tried change loop condition to LAST_DYNAMIC_VECTOR and it doesn''t hit that >>>>> BUG/ASSERT. But still it doesn''t work - only CPU0 used by scheduler, also some >>>>> errors from dom0 kernel, and errors about PCI devices used by domU(1). 
>>>>> >>>>> Messages from resume (different tries): >>>>> http://duch.mimuw.edu.pl/~marmarek/qubes/xen-4.1-last-dynamic-vector.log >>>>> http://duch.mimuw.edu.pl/~marmarek/qubes/xen-4.1-last-dynamic-vector2.log >>>>> >>>>> Also one time I''ve got fatal page fault error, earlier in resume (it isn''t >>>>> deterministic): >>>>> http://duch.mimuw.edu.pl/~marmarek/qubes/xen-4.1-resume-page-fault.log >>>>> >>>> This pagefault is a Null structure pointer dereference, likely the >>>> scheduling data. At a first glance, it looks related to the assertion >>>> failures I have been seeing sporadically in testing, but unable to >>>> reproduce reliably. There seems to be something quite dodgy with >>>> interaction of vcpu_wake and scheduling loops. >>>> >>>> The other logs indicate that dom0 appears to have a domain id of 1, >>>> which is sure to cause problems. >>> Actually - ignore this >>> >>> >From the log, >>> >>> (XEN) physdev.c:153: dom0: can''t create irq for msi! >>> [ 113.637037] xhci_hcd 0000:03:00.0: xen map irq failed -22 for 32752 >>> domain >>> (XEN) physdev.c:153: dom0: can''t create irq for msi! >>> [ 113.657911] xhci_hcd 0000:03:00.0: xen map irq failed -22 for 32752 >>> domain >>> >>> and later >>> >>> (XEN) physdev.c:153: dom1: can''t create irq for msi! >>> [ 121.909814] pciback 0000:00:19.0: xen map irq failed -22 for 1 domain >>> [ 121.954080] error enable msi for guest 1 status ffffffea >>> (XEN) physdev.c:153: dom1: can''t create irq for msi! >>> [ 122.035355] pciback 0000:00:19.0: xen map irq failed -22 for 1 domain >>> [ 122.044421] error enable msi for guest 1 status ffffffea >>> >>> I think that there is a separate bug where mapped irqs are not unmapped >>> on the suspend path. >> You thinking this is a Linux (xen irq machinery) issue? Meaning it should >> end up calling PHYSDEV_unmap_pirq as part of the suspend process? > > I am not sure. Without looking at the code, I am only speculating. 
> > Beyond that, the main question is about the expected behaviour. Do we > expect dom0/U to unmap its irqs and remap them after resume? What do we > expect from domains which are unaware of the host sleep action?BTW this is exactly the case here: domain 1 isn't fully aware of sleep. It has some PCI devices assigned. The only action taken there before suspend is shutting down its network interfaces (without this the system hung during suspend). -- Best Regards / Pozdrawiam, Marek Marczykowski Invisible Things Lab _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Marek Marczykowski
2013-Mar-27 18:16 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
On 27.03.2013 17:27, Andrew Cooper wrote:> On 27/03/2013 15:51, Marek Marczykowski wrote: >> On 27.03.2013 15:49, Marek Marczykowski wrote: >>> On 27.03.2013 15:46, Andrew Cooper wrote: >>>> As for locating the cause of the legacy vectors, it might be a good idea >>>> to stick a printk at the top of do_IRQ() which indicates an interrupt >>>> with vector between 0xe0 and 0xef. This might at least indicate whether >>>> legacy vectors are genuinely being delivered, or whether we have some >>>> memory corruption causing these effects. >>> Ok, will try something like this. >> Nothing interesting here... >> Only vector 0xf1 for irq 4 and 0xf0 for irq 0 (which match irq dump information). >> > > Even in the case where we hit the original assertion?Yes, even then.> If so, then all I can thing is that the move_pending flag for that > specific GSI has been corrupted in memory somehow.I guess this isn't the case, see below.> I wonder if hexdumping irq_desc[9] after setup, before sleep, on resume > and in the case of the assertion failure might give some hints.I've tried something like this. Detailed log here: http://duch.mimuw.edu.pl/~marmarek/qubes/xen-4.1-suspend-irq9-dump.log Some interesting parts. After system startup: (XEN) irq_cfg of IRQ 9: (XEN) vector: 138 (XEN) move_cleanup_count: 0x0 (XEN) move_in_progress: 0x0 (XEN) irq_desc of IRQ 9: (XEN) status: 80 (IRQ_GUEST | IRQ_PENDING) Isn't this wrong (status vs move_in_progress)? Then I ran pm-suspend, intentionally making it fail at the end to prevent the actual suspend, but letting it run all its hooks. After that: (XEN) irq_cfg of IRQ 9: (XEN) vector: 181 (XEN) move_cleanup_count: 0x0 (XEN) move_in_progress: 0x1 (XEN) irq_desc of IRQ 9: (XEN) status: 80 So now move_in_progress is consistent with status. I waited a few seconds and move_in_progress was still 0x1. Isn't it supposed to be only a temporary state? Then I suspended, and at resume hit that bug. 
There was: (XEN) irq_cfg of IRQ 9: (XEN) vector: 60 (XEN) move_cleanup_count: 0x0 (XEN) move_in_progress: 0x0 (XEN) irq_desc of IRQ 9: (XEN) status: 16 move_in_progress==0, ok. But move_cleanup_count==0, even though move_in_progress was 1 at least once. Isn't that wrong? -- Best Regards / Pozdrawiam, Marek Marczykowski Invisible Things Lab
Andrew Cooper
2013-Mar-27 18:56 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
On 27/03/2013 18:16, Marek Marczykowski wrote:> On 27.03.2013 17:27, Andrew Cooper wrote: >> On 27/03/2013 15:51, Marek Marczykowski wrote: >>> On 27.03.2013 15:49, Marek Marczykowski wrote: >>>> On 27.03.2013 15:46, Andrew Cooper wrote: >>>>> As for locating the cause of the legacy vectors, it might be a good idea >>>>> to stick a printk at the top of do_IRQ() which indicates an interrupt >>>>> with vector between 0xe0 and 0xef. This might at least indicate whether >>>>> legacy vectors are genuinely being delivered, or whether we have some >>>>> memory corruption causing these effects. >>>> Ok, will try something like this. >>> Nothing interesting here... >>> Only vector 0xf1 for irq 4 and 0xf0 for irq 0 (which match irq dump information). >>> >> Even in the case where we hit the original assertion? > Yes, even then. > >> If so, then all I can thing is that the move_pending flag for that >> specific GSI has been corrupted in memory somehow. > I guest this isn''t the case, see below. > >> I wonder if hexdumping irq_desc[9] after setup, before sleep, on resume >> and in the case of the assertion failure might give some hints. > I''ve tried something like this. Detailed log here: > http://duch.mimuw.edu.pl/~marmarek/qubes/xen-4.1-suspend-irq9-dump.logThis is concerning, unless I am getting utterly confused. Jan: Do you mind double checking my reasoning? irq 0 through 15 should be the PIC irqs, set up in init_IRQ() in arch/x86/i8259.c irq9 should be the irq for the PIC vector which is set up as 0xe9, and its vector should never change. Could you put in extra checks for the sanity of per_cpu(vector_irq, cpu)[0xe0 thru 0xef] ?> > Some interesing parts: > after system startup: > (XEN) irq_cfg of IRQ 9: > (XEN) vector: 138 > (XEN) move_cleanup_count: 0x0 > (XEN) move_in_progress: 0x0 > (XEN) irq_desc of IRQ 9: > (XEN) status: 80 (IRQ_GUEST | IRQ_PENDING) > > Isn''t this wrong (status vs move_in_progress)?This here looks fine. 
What do you think is wrong about it?> > Then I''ve run pm-suspend, intentionally failed at the end to prevent actual > suspend, but run all its hooks. After that: > (XEN) irq_cfg of IRQ 9: > (XEN) vector: 181 > (XEN) move_cleanup_count: 0x0 > (XEN) move_in_progress: 0x1 > (XEN) irq_desc of IRQ 9: > (XEN) status: 80 > > So now move_in_progress consistent with status. > Wait few second, and still move_in_progress was 0x1. Isn''t it supposed to be > only temporary state?move_in_progress gets set by __assign_irq_vector() when the scheduler decides to move the IRQ. It can stay set for a long time. On the next interrupt from this source, the move_in_progress bit being set causes the IRQ source to be reprogrammed to the new destination.> > Then suspended, at resume hit that bug. There was: > (XEN) irq_cfg of IRQ 9: > (XEN) vector: 60 > (XEN) move_cleanup_count: 0x0 > (XEN) move_in_progress: 0x0 > (XEN) irq_desc of IRQ 9: > (XEN) status: 16 > > move_in_progress==0, ok. But move_cleanup_count==0, while at least once was > move_in_progress==1. Isn''t that wrong? >move_cleanup_count is only set in send_cleanup_vector, for the specific vector which is being cleaned up. However, as the IPI handler cleans up all vectors which are outstanding, the move_cleanup_count can be 0 for most vectors which are actually cleaned up. This is in an attempt to reduce the number of IPIs required to clean up all moving irqs. As the scheduler currently has a habit of moving vcpus at every scheduling opportunity, this means that irqs are constantly moving. ~Andrew
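Andrew's description of the two flags can be condensed into a small state model. This is a deliberately simplified, hypothetical sketch for readers following along — the real logic lives in xen/arch/x86/irq.c and involves per-CPU vector tables and IPIs; here the set of CPUs still holding the old vector is reduced to a plain count:

```c
#include <stdbool.h>

/* Reduced model of Xen's per-IRQ migration state (illustration only;
 * field names mirror Xen's struct irq_cfg, behaviour is simplified). */
struct irq_cfg {
    int vector;                /* currently assigned vector */
    bool move_in_progress;     /* a new vector was assigned, move pending */
    int move_cleanup_count;    /* CPUs that still must release the old vector */
};

/* The scheduler decided to move the IRQ: a new vector is assigned and the
 * move marked pending.  The flag can stay set for a long time - nothing
 * further happens until an interrupt actually arrives from this source. */
void assign_irq_vector(struct irq_cfg *cfg, int new_vector)
{
    cfg->vector = new_vector;
    cfg->move_in_progress = true;
}

/* First interrupt after the move: the source is reprogrammed to the new
 * destination, and only now does move_cleanup_count become non-zero
 * (modelling send_cleanup_vector()). */
void irq_complete_move(struct irq_cfg *cfg, int cpus_on_old_vector)
{
    if (!cfg->move_in_progress)
        return;
    cfg->move_in_progress = false;
    cfg->move_cleanup_count = cpus_on_old_vector;
}

/* Cleanup IPI handler on one CPU releases the old vector.  Since the
 * handler also cleans up vectors whose count is already zero, an IRQ
 * that was moved can legitimately show move_cleanup_count == 0. */
void irq_move_cleanup(struct irq_cfg *cfg)
{
    if (cfg->move_cleanup_count > 0)
        cfg->move_cleanup_count--;
}
```

This matches the observations in the dump: move_in_progress stays 0x1 until the next interrupt, and move_cleanup_count can be 0 afterwards even though a move happened.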
Jan Beulich
2013-Mar-28 10:50 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
>>> On 27.03.13 at 17:27, Andrew Cooper <andrew.cooper3@citrix.com> wrote: > On 27/03/2013 15:51, Marek Marczykowski wrote: >> On 27.03.2013 15:49, Marek Marczykowski wrote: >>> On 27.03.2013 15:46, Andrew Cooper wrote: >>>> As for locating the cause of the legacy vectors, it might be a good idea >>>> to stick a printk at the top of do_IRQ() which indicates an interrupt >>>> with vector between 0xe0 and 0xef. This might at least indicate whether >>>> legacy vectors are genuinely being delivered, or whether we have some >>>> memory corruption causing these effects. >>> Ok, will try something like this. >> Nothing interesting here... >> Only vector 0xf1 for irq 4 and 0xf0 for irq 0 (which match irq dump > information). >> > > Even in the case where we hit the original assertion? > > If so, then all I can thing is that the move_pending flag for that > specific GSI has been corrupted in memory somehow.No, I think the flag is legitimately set after resume, and gets looked at after the first SCI gets signaled (which would trigger the pending affinity change, initiated in the suspend path, to be carried out). The problem is a more fundamental one: irq_move_cleanup_interrupt() (in unstable terms) includes the legacy vectors, so if, upon encountering the move_cleanup_count for IRQ 9 (or any legacy IRQ), execution doesn't make it all the way through to carrying out the cleanup, the loop, once in the legacy vector range, will re-encounter the same IRQ, find move_cleanup_count non-zero again, and thus try to do something here. Hence I think skipping the legacy vector range here is indeed necessary, even outside the suspend/resume scenario (see below). Another alternative would be to invalidate the vector_irq[] entries for legacy vectors handled through the IO-APIC. 
Jan x86: irq_move_cleanup_interrupt() must ignore legacy vectors Since the main loop in the function includes legacy vectors, and since vector_irq[] gets set up for legacy vectors regardless of whether those get handled through the IO-APIC, it must not do anything on this vector range. In fact, we should never get here for IRQs not handled through the IO-APIC, so add a respective warning at once (could probably as well be an ASSERT()). Signed-off-by: Jan Beulich <jbeulich@suse.com> --- a/xen/arch/x86/irq.c +++ b/xen/arch/x86/irq.c @@ -625,6 +625,12 @@ void irq_move_cleanup_interrupt(struct c if ((int)irq < 0) continue; + if ( vector >= FIRST_LEGACY_VECTOR && vector <= LAST_LEGACY_VECTOR ) + { + WARN_ON(!IO_APIC_IRQ(irq)); + continue; + } + desc = irq_to_desc(irq); if (!desc) continue; _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Andrew Cooper
2013-Mar-28 11:53 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
On 28/03/2013 10:50, Jan Beulich wrote:>>>> On 27.03.13 at 17:27, Andrew Cooper <andrew.cooper3@citrix.com> wrote: >> On 27/03/2013 15:51, Marek Marczykowski wrote: >>> On 27.03.2013 15:49, Marek Marczykowski wrote: >>>> On 27.03.2013 15:46, Andrew Cooper wrote: >>>>> As for locating the cause of the legacy vectors, it might be a good idea >>>>> to stick a printk at the top of do_IRQ() which indicates an interrupt >>>>> with vector between 0xe0 and 0xef. This might at least indicate whether >>>>> legacy vectors are genuinely being delivered, or whether we have some >>>>> memory corruption causing these effects. >>>> Ok, will try something like this. >>> Nothing interesting here... >>> Only vector 0xf1 for irq 4 and 0xf0 for irq 0 (which match irq dump >> information). >> Even in the case where we hit the original assertion? >> >> If so, then all I can thing is that the move_pending flag for that >> specific GSI has been corrupted in memory somehow. > No, I think the flag is legitimately set after resume, and gets > looked at the after the first SCI got signaled (which would > trigger the pending affinity change to be carried out that was > initiated in the suspend path). The problem is a more > fundamental one: irq_move_cleanup_interrupt() (in unstable > terms) includes the legacy vectors, so if, upon encountering the > move_cleanup_count for IRQ 9 (or any legacy IRQ) execution > doesn''t make it all the way through to carrying out the cleanup, > the loop, once in the legacy vector range, will re-encounter the > same IRQ, find move_cleanup_count non-zero again, and thus > tries to do something here. > > Hence I think skipping the legacy vector range here is indeed > necessary, even outside the suspend/resume scenario (see > below). Another alternative would be to invalidate the > vector_irq[] entries for legacy vectors handled through the > IO-APIC. 
> > Jan > > x86: irq_move_cleanup_interrupt() must ignore legacy vectors > > Since the main loop in the function includes legacy vectors, and since > vector_irq[] gets set up for legacy vectors regardless of whether those > get handled through the IO-APIC, it must not do anything on this vector > range. In fact, we should never get here for IRQs not handled through > the IO-APIC, so add a respective warning at once (could probably as > well be an ASSERT()). > > Signed-off-by: Jan Beulich <jbeulich@suse.com>Under what circumstances would we have any vectors 0xe0-0xef programmed into the IOAPIC? I can't think of any offhand. As far as I am aware, it is not valid for any PIC interrupts to ever be up for moving, as they should only be delivered to the BSP. In addition to the check you have, the scope of the loop should probably be reduced. We should never be considering moving any vector larger than LAST_HIPRIORITY_VECTOR, which I believe are all LAPIC interrupts, making 8 useless iterations of the loop. I would also suggest that it be an ASSERT rather than a WARN, but that leaves us not fixing the bug at hand, as we have already verified that vector 0xe9 is not programmed into the IOAPIC. ~Andrew> > --- a/xen/arch/x86/irq.c > +++ b/xen/arch/x86/irq.c > @@ -625,6 +625,12 @@ void irq_move_cleanup_interrupt(struct c > if ((int)irq < 0) > continue; > > + if ( vector >= FIRST_LEGACY_VECTOR && vector <= LAST_LEGACY_VECTOR ) > + { > + WARN_ON(!IO_APIC_IRQ(irq)); > + continue; > + } > + > desc = irq_to_desc(irq); > if (!desc) > continue; > >
Jan Beulich
2013-Mar-28 12:54 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
>>> On 28.03.13 at 12:53, Andrew Cooper <andrew.cooper3@citrix.com> wrote: > On 28/03/2013 10:50, Jan Beulich wrote: >> x86: irq_move_cleanup_interrupt() must ignore legacy vectors >> >> Since the main loop in the function includes legacy vectors, and since >> vector_irq[] gets set up for legacy vectors regardless of whether those >> get handled through the IO-APIC, it must not do anything on this vector >> range. In fact, we should never get here for IRQs not handled through >> the IO-APIC, so add a respective warning at once (could probably as >> well be an ASSERT()). >> >> Signed-off-by: Jan Beulich <jbeulich@suse.com> > > Under what circumstances would we have any vectors 0xe0-0xef programmed > into the IOAPIC? I cant think of any offhand.Never. And I didn''t say it would.> As far as I am aware, it is not valid for any PIC interrupts to ever be > up for moving, as they should only be delivered to the BSP.Hence the WARN_ON() (or ASSERT()).> In addition to the check you have, the scope of the loop should probably > be reduced. We should never be considering to move any vector larger > than LAST_HIPRIORITY_VECTOR, which I believe are all LAPIC interrupts, > making 8 useless iterations of the loop.Agreed. Will update the patch to also do that.> I would also suggest that it > is an ASSERT rather than a WARN, but that leaves us not fixing the bug > at hand, as we have already verified that vector 0xe9 is not programmed > into the IOAPIC.So with you repeating this I think I didn''t explain well enough what I think is happening. Hence I''ll try again: We possibly (on at least one CPU for sure) have two vector_irq[] entries referring to any particular legacy IRQ - one for the vector that the IO-APIC is using, and one for the corresponding legacy vector. Hence there''ll be two iterations of the loop here looking at the _same_ IRQ, the second of which (wrongly) being the one pointed to by the entry in the legacy vector range. 
It is this second instance that the change is suppressing, with the WARN_ON() being there to ascertain that we indeed never get here for an IRQ handled through the 8259A. Jan
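The double mapping Jan describes can be demonstrated with a toy vector_irq[] table. This is a hypothetical sketch, not Xen's actual cleanup loop — the constant names are borrowed from Xen's headers, the vector values are illustrative:

```c
/* Toy model of the double mapping: a legacy IRQ (IRQ 9) appears in
 * vector_irq[] twice - once at the vector the IO-APIC is using, and once
 * at its fixed 8259A legacy vector - so a cleanup loop walking all
 * vectors re-encounters the same IRQ unless the legacy range is skipped. */

#define NR_VECTORS           256
#define FIRST_LEGACY_VECTOR  0xe0   /* as in Xen; usage here illustrative */
#define LAST_LEGACY_VECTOR   0xef

/* Count how many vector_irq[] entries a cleanup loop would visit for the
 * given IRQ, optionally skipping the legacy vector range as the proposed
 * patch does. */
int cleanup_visits(const int vector_irq[NR_VECTORS], int irq, int skip_legacy)
{
    int visits = 0;
    for (int vector = 0; vector < NR_VECTORS; vector++) {
        if (vector_irq[vector] != irq)
            continue;
        if (skip_legacy &&
            vector >= FIRST_LEGACY_VECTOR && vector <= LAST_LEGACY_VECTOR)
            continue;   /* the patch: ignore the 8259A alias */
        visits++;
    }
    return visits;
}
```

With IRQ 9 mapped at, say, vector 0x38 (IO-APIC) and 0xe9 (legacy), the unpatched loop visits it twice and the second, legacy-vector visit is the one the WARN_ON()-guarded `continue` suppresses.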
Jan Beulich
2013-Mar-28 13:19 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
>>> On 28.03.13 at 13:54, "Jan Beulich" <JBeulich@suse.com> wrote: >>>> On 28.03.13 at 12:53, Andrew Cooper <andrew.cooper3@citrix.com> wrote: >> On 28/03/2013 10:50, Jan Beulich wrote: >>> x86: irq_move_cleanup_interrupt() must ignore legacy vectors >>> >>> Since the main loop in the function includes legacy vectors, and since >>> vector_irq[] gets set up for legacy vectors regardless of whether those >>> get handled through the IO-APIC, it must not do anything on this vector >>> range. In fact, we should never get here for IRQs not handled through >>> the IO-APIC, so add a respective warning at once (could probably as >>> well be an ASSERT()). >>> >>> Signed-off-by: Jan Beulich <jbeulich@suse.com> >> >> Under what circumstances would we have any vectors 0xe0-0xef programmed >> into the IOAPIC? I cant think of any offhand. > > Never. And I didn''t say it would. > >> As far as I am aware, it is not valid for any PIC interrupts to ever be >> up for moving, as they should only be delivered to the BSP. > > Hence the WARN_ON() (or ASSERT()).You know what - now that I actually tried this out, I see that this triggers. For the moment I''m puzzled, will need to look into this in more detail. Jan
Marek Marczykowski
2013-Mar-28 14:43 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
On 27.03.2013 19:56, Andrew Cooper wrote:> On 27/03/2013 18:16, Marek Marczykowski wrote: >> On 27.03.2013 17:27, Andrew Cooper wrote: >>> On 27/03/2013 15:51, Marek Marczykowski wrote: >>>> On 27.03.2013 15:49, Marek Marczykowski wrote: >>>>> On 27.03.2013 15:46, Andrew Cooper wrote: >>>>>> As for locating the cause of the legacy vectors, it might be a good idea >>>>>> to stick a printk at the top of do_IRQ() which indicates an interrupt >>>>>> with vector between 0xe0 and 0xef. This might at least indicate whether >>>>>> legacy vectors are genuinely being delivered, or whether we have some >>>>>> memory corruption causing these effects. >>>>> Ok, will try something like this. >>>> Nothing interesting here... >>>> Only vector 0xf1 for irq 4 and 0xf0 for irq 0 (which match irq dump information). >>>> >>> Even in the case where we hit the original assertion? >> Yes, even then. >> >>> If so, then all I can thing is that the move_pending flag for that >>> specific GSI has been corrupted in memory somehow. >> I guest this isn''t the case, see below. >> >>> I wonder if hexdumping irq_desc[9] after setup, before sleep, on resume >>> and in the case of the assertion failure might give some hints. >> I''ve tried something like this. Detailed log here: >> http://duch.mimuw.edu.pl/~marmarek/qubes/xen-4.1-suspend-irq9-dump.log > > This is concerning, unless I am getting utterly confused. Jan: Do you > mind double checking my reasoning? > > irq 0 through 15 should be the PIC irqs, set up in init_IRQ() in > arch/x86/i8259.c > > irq9 should be the irq for the PIC vector which is set up as 0xe9, and > its vector should never change. > > Could you put in extra checks for the sanity of per_cpu(vector_irq, > cpu)[0xe0 thru 0xef] ?Ok, got something here: http://duch.mimuw.edu.pl/~marmarek/qubes/xen-4.1-suspend-irq9-dump2.log Now bug triggered after some time after resume (about 15s). But only CPU0 by scheduler immediately after resume. 
Interesting part - note vector_irq(e1): (XEN) irq_cfg of IRQ 9: (XEN) vector: 188 (XEN) cpu_mask: 00000000,00000000,00000000,00000001 (XEN) old_cpu_mask: 00000000,00000000,00000000,00000002 (XEN) move_cleanup_count: 0x0 (XEN) used_vectors: 49,64,72,74,80-81,88,98,112,120,144,148,152,156,160,164,168,172,178,188,192,196,200,207-208 (XEN) move_in_progress: 0x0 (XEN) irq_desc of IRQ 9: (XEN) status: 16 (XEN) handler: ffff82c480252660 (XEN) msi_desc: 0000000000000000 (XEN) action: ffff83041d9f1ed0 (XEN) depth: 0 (XEN) chip_data: ffff830421080250 (XEN) irq: 9 (XEN) affinity: 00000000,00000000,00000000,00000001 (XEN) pending_mask: 00000000,00000000,00000000,00000000 (XEN) (...) (XEN) vector_irq(e0): 0 (XEN) vector_irq(e1): -1 (XEN) vector_irq(e2): 2 (XEN) vector_irq(e3): 3 (XEN) vector_irq(e4): 4 (XEN) vector_irq(e5): 5 (XEN) vector_irq(e6): 6 (XEN) vector_irq(e7): 7 (XEN) vector_irq(e8): 8 (XEN) vector_irq(e9): 9 (XEN) vector_irq(ea): 10 (XEN) vector_irq(eb): 11 (XEN) vector_irq(ec): 12 (XEN) vector_irq(ed): 13 (XEN) vector_irq(ee): 14 (XEN) vector_irq(ef): 15 (XEN) Xen WARN at io_apic.c:639 (XEN) ----[ Xen-4.1.5-rc1 x86_64 debug=y Not tainted ]---- (XEN) CPU: 0 (XEN) RIP: e008:[<ffff82c48015e5fb>] smp_irq_move_cleanup_interrupt+0x246/0x2c6 (XEN) RFLAGS: 0000000000010046 CONTEXT: hypervisor (XEN) rax: 0000000000000000 rbx: 00000000000000e1 rcx: 0000000000000000 (XEN) rdx: 0000000000000000 rsi: 000000000000000a rdi: ffff82c4802592e0 (XEN) rbp: ffff82c48029fda8 rsp: ffff82c48029fd58 r8: 0000000000000004 (XEN) r9: 0000000000000001 r10: 000000000000000f r11: 0000000000000002 (XEN) r12: ffff830421080050 r13: ffff830421060134 r14: ffff82c48029ff18 (XEN) r15: ffff82c4802dd9e0 cr0: 000000008005003b cr4: 00000000000026f0 (XEN) cr3: 0000000273d3c000 cr2: ffff88000c360318 (XEN) ds: 002b es: 002b fs: 0000 gs: 0000 ss: e010 cs: e008 (XEN) Xen stack trace from rsp=ffff82c48029fd58: (XEN) 0000000000000000 000000008029fd70 ffff82c48029ff18 ffff82c4802dd9e0 (XEN) ffff82c480153f55 
ffff830421043260 ffff830421043320 0000006f207ab134 (XEN) 0000006f207c3b14 ffff82c4802dd600 00007d3b7fd60227 ffff82c48014de60 (XEN) ffff82c4802dd600 0000006f207c3b14 0000006f207ab134 ffff830421043320 (XEN) ffff82c48029fef0 ffff830421043260 0000ffff0000ffff 0000006f416dab2e (XEN) ffff830007ef4060 0000006f1fad2570 0000000000003f40 0000000000000001 (XEN) 0000000000000000 ffff82c4802de200 0000000002048cac 0000002000000000 (XEN) ffff82c480197940 000000000000e008 0000000000000246 ffff82c48029fe68 (XEN) 000000000000e010 ffff82c48029fef0 ffff82c4801987b7 ffff880402105d30 (XEN) 00000000ca9a4000 ffffffffffffffff aaaaaaaaaaaaaa00 aaaaaaaaaaaaaaaa (XEN) 0000006f21136437 0000000000000000 0000000000000000 ffffffffffffffff (XEN) 000004c200000542 0000000000000000 ffff82c48029ff18 ffff82c48029ff18 (XEN) 00000000ffffffff 0000000000000002 ffff82c4802dd600 ffff82c48029ff10 (XEN) ffff82c4801549ce ffff8300ca9a4000 ffff8300ca666000 ffff82c48029fdc8 (XEN) 0000000000000000 0000000000000000 0000000000000000 0000000000000001 (XEN) ffff880402105f00 ffff880402105fd8 0000000000000246 0000000000000001 (XEN) 0000000000000000 0000000000000000 0000000000000000 ffffffff810013aa (XEN) ffffffff81a2a858 00000000deadbeef 00000000deadbeef 0000010000000000 (XEN) ffffffff810013aa 000000000000e033 0000000000000246 ffff880402105ee8 (XEN) 000000000000e02b 0000000000000000 0000000000000000 0000000000000000 (XEN) Xen call trace: (XEN) [<ffff82c48015e5fb>] smp_irq_move_cleanup_interrupt+0x246/0x2c6 (XEN) [<ffff82c48014de60>] irq_move_cleanup_interrupt+0x30/0x40 (XEN) [<ffff82c480197940>] lapic_timer_nop+0x0/0x6 (XEN) [<ffff82c4801549ce>] idle_loop+0x4b/0x59 Ignore rest of comments from my previous mail - I clearly don''t understand IRQ handling code. -- Best Regards / Pozdrawiam, Marek Marczykowski Invisible Things Lab _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
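The sanity check Andrew asked for amounts to verifying the identity mapping of the legacy vector range. A hedged sketch of that check — in Xen this would iterate over per_cpu(vector_irq, cpu); here a plain array stands in, and the function name is made up:

```c
#define FIRST_LEGACY_VECTOR 0xe0

/* Return the first legacy vector whose vector_irq[] entry does not point
 * back at its own IRQ (vector 0xe0+i is expected to map to IRQ i), or -1
 * if the whole 0xe0-0xef range is consistent.  In the dump above, entry
 * e1 holds -1 instead of 1, so this would return 0xe1. */
int check_legacy_vector_irq(const int *vector_irq)
{
    for (int i = 0; i < 16; i++)
        if (vector_irq[FIRST_LEGACY_VECTOR + i] != i)
            return FIRST_LEGACY_VECTOR + i;
    return -1;
}
```

Run on each CPU after setup, before sleep, and on resume, this would pinpoint exactly when the e1 entry gets clobbered.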
Jan Beulich
2013-Mar-28 16:13 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
>>> On 27.03.13 at 15:31, Marek Marczykowski <marmarek@invisiblethingslab.com> wrote: > Also one time I''ve got fatal page fault error, earlier in resume (it isn''t > deterministic): > http://duch.mimuw.edu.pl/~marmarek/qubes/xen-4.1-resume-page-fault.logThis is mostly identical to http://lists.xen.org/archives/html/xen-devel/2013-01/msg02175.html, and hence I would assume that the patch Ben posted (v4 came through yesterday) would be fixing this. Care to give this a try? Jan
Jan Beulich
2013-Mar-28 16:25 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
>>> On 27.03.13 at 15:31, Marek Marczykowski <marmarek@invisiblethingslab.com>wrote:> On 27.03.2013 09:52, Jan Beulich wrote: >>>>> On 26.03.13 at 19:50, Andrew Cooper <andrew.cooper3@citrix.com> wrote: >>> So vector e9 doesn''t appear to be programmed in anywhere. >> >> Quite obviously, as it''s the 8259A vector for IRQ 9. The question >> really is why an IRQ appears on that vector in the first place. The >> 8259A resume code _should_ leave all IRQs masked on a fully >> IO-APIC system (see my question raised yesterday). >> >> And that''s also why I suggested, for an experiment, to fiddle with >> the loop exit condition to exclude legacy vectors (which wouldn''t >> be a final solution, but would at least tell us whether the direction >> is the right one). In the end, besides understanding why an >> interrupt on vector E9 gets raised at all, we may also need to >> tweak the IRQ migration logic to not do anything on legacy IRQs, >> but that would need to happen earlier than in >> smp_irq_move_cleanup_interrupt(). Considering that 4.3 >> apparently doesn''t have this problem, we may need to go hunt for >> a change that isn''t directly connected to this, yet deals with the >> problem as a side effect (at least I don''t recall any particular fix >> since 4.2). One aspect here is the double mapping of legacy IRQs >> (once to their IO-APIC vector, and once to their legacy vector, >> i.e. vector_irq[] having two entries pointing to the same IRQ). > > So tried change loop condition to LAST_DYNAMIC_VECTOR and it doesn''t hit > that > BUG/ASSERT. But still it doesn''t work - only CPU0 used by scheduler, also > some > errors from dom0 kernel, and errors about PCI devices used by domU(1). > > Messages from resume (different tries): > http://duch.mimuw.edu.pl/~marmarek/qubes/xen-4.1-last-dynamic-vector.log > http://duch.mimuw.edu.pl/~marmarek/qubes/xen-4.1-last-dynamic-vector2.logIs that a sensible usage scenario at all? 
I would think that a prerequisite to host S3 is that all guests get suspended. If you do that, do you still have these interrupt re-setup problems? Jan
Marek Marczykowski
2013-Mar-28 16:31 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
On 28.03.2013 17:25, Jan Beulich wrote:>>>> On 27.03.13 at 15:31, Marek Marczykowski <marmarek@invisiblethingslab.com> > wrote: >> On 27.03.2013 09:52, Jan Beulich wrote: >>>>>> On 26.03.13 at 19:50, Andrew Cooper <andrew.cooper3@citrix.com> wrote: >>>> So vector e9 doesn''t appear to be programmed in anywhere. >>> >>> Quite obviously, as it''s the 8259A vector for IRQ 9. The question >>> really is why an IRQ appears on that vector in the first place. The >>> 8259A resume code _should_ leave all IRQs masked on a fully >>> IO-APIC system (see my question raised yesterday). >>> >>> And that''s also why I suggested, for an experiment, to fiddle with >>> the loop exit condition to exclude legacy vectors (which wouldn''t >>> be a final solution, but would at least tell us whether the direction >>> is the right one). In the end, besides understanding why an >>> interrupt on vector E9 gets raised at all, we may also need to >>> tweak the IRQ migration logic to not do anything on legacy IRQs, >>> but that would need to happen earlier than in >>> smp_irq_move_cleanup_interrupt(). Considering that 4.3 >>> apparently doesn''t have this problem, we may need to go hunt for >>> a change that isn''t directly connected to this, yet deals with the >>> problem as a side effect (at least I don''t recall any particular fix >>> since 4.2). One aspect here is the double mapping of legacy IRQs >>> (once to their IO-APIC vector, and once to their legacy vector, >>> i.e. vector_irq[] having two entries pointing to the same IRQ). >> >> So tried change loop condition to LAST_DYNAMIC_VECTOR and it doesn''t hit >> that >> BUG/ASSERT. But still it doesn''t work - only CPU0 used by scheduler, also >> some >> errors from dom0 kernel, and errors about PCI devices used by domU(1). 
>> >> Messages from resume (different tries): >> http://duch.mimuw.edu.pl/~marmarek/qubes/xen-4.1-last-dynamic-vector.log >> http://duch.mimuw.edu.pl/~marmarek/qubes/xen-4.1-last-dynamic-vector2.log > > Is that a sensible usage scenario at all? I would think that a > prerequisite to host S3 is that all guests get suspended.What do you mean by "suspended"? I haven't found any sane method to do that with xl (only some manual xenstore write to control/shutdown). For now I do: - shut down all network adapters in the VMs - pause all VMs> If you > do that, do you still have these interrupt re-setup problems?Yes, even when no guest is running (which was the case on 4.2)... -- Best Regards / Pozdrawiam, Marek Marczykowski Invisible Things Lab
Jan Beulich
2013-Mar-28 16:52 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
>>> On 28.03.13 at 17:31, Marek Marczykowski <marmarek@invisiblethingslab.com> wrote:
> On 28.03.2013 17:25, Jan Beulich wrote:
>> [...]
>> Is that a sensible usage scenario at all? I would think that a
>> prerequisite to host S3 is that all guests get suspended.
>
> What do you mean by "suspended"? I haven't found any sane method to do that
> with xl (only a manual xenstore write to control/shutdown). For now I do:
> - shut down all network adapters in the VMs
> - pause all VMs

Aren't there "xl save" and "xl restore"? And for HVM guests, I think
there's also a way to do virtual S3.

Jan
Marek Marczykowski
2013-Mar-28 17:09 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
On 28.03.2013 17:52, Jan Beulich wrote:
>>>> On 28.03.13 at 17:31, Marek Marczykowski <marmarek@invisiblethingslab.com> wrote:
>> [...]
>> What do you mean by "suspended"? I haven't found any sane method to do that
>> with xl (only a manual xenstore write to control/shutdown). For now I do:
>> - shut down all network adapters in the VMs
>> - pause all VMs
>
> Aren't there "xl save" and "xl restore"? And for HVM guests, I think
> there's also a way to do virtual S3.

xl save/restore takes far too much time. I tried xenstore-write "suspend" to
control/shutdown, then an xc_domain_resume call, some time ago, but I had some
problems with that (unfortunately I don't remember the details...). This is
basically what xl save and restore do, but without the actual data dump.

-- 
Best Regards / Pozdrawiam,
Marek Marczykowski
Invisible Things Lab
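The lightweight "suspend without save" sequence Marek describes can be sketched as below. The xenstore node and tool names are the standard ones, but the handshake is deliberately simplified (a hypothetical sketch, not a robust implementation: a real one must wait for the guest to acknowledge the request before the host enters S3).

```shell
# Sketch only: trigger a PV guest's suspend handler via xenstore instead of
# a full "xl save". The helper just builds the control node path.
suspend_node() {
    # xenstore node a PV guest watches for shutdown/suspend requests
    echo "/local/domain/$1/control/shutdown"
}

# On a real Xen host one would then run (not executed here):
#   xenstore-write "$(suspend_node "$domid")" suspend
#   # ... wait until the guest acknowledges and suspends itself ...
#   # then resume it without a memory dump (xc_domain_resume() via libxc)
```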
Andrew Cooper
2013-Mar-28 17:41 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
On 27/03/2013 17:15, Marek Marczykowski wrote:
> On 27.03.2013 17:56, Andrew Cooper wrote:
>> On 27/03/2013 15:47, Konrad Rzeszutek Wilk wrote:
>>> On Wed, Mar 27, 2013 at 02:52:14PM +0000, Andrew Cooper wrote:
>>>> On 27/03/2013 14:46, Andrew Cooper wrote:
>>>>> On 27/03/2013 14:31, Marek Marczykowski wrote:
>>>>>> On 27.03.2013 09:52, Jan Beulich wrote:
>>>>>>> [...]
>>>>>> So tried change loop condition to LAST_DYNAMIC_VECTOR and it doesn't hit that
>>>>>> BUG/ASSERT. But still it doesn't work - only CPU0 used by scheduler, also some
>>>>>> errors from dom0 kernel, and errors about PCI devices used by domU(1).
>>>>>>
>>>>>> Messages from resume (different tries):
>>>>>> http://duch.mimuw.edu.pl/~marmarek/qubes/xen-4.1-last-dynamic-vector.log
>>>>>> http://duch.mimuw.edu.pl/~marmarek/qubes/xen-4.1-last-dynamic-vector2.log
>>>>>>
>>>>>> Also one time I've got fatal page fault error, earlier in resume (it isn't
>>>>>> deterministic):
>>>>>> http://duch.mimuw.edu.pl/~marmarek/qubes/xen-4.1-resume-page-fault.log
>>>>>>
>>>>> This pagefault is a Null structure pointer dereference, likely the
>>>>> scheduling data. At a first glance, it looks related to the assertion
>>>>> failures I have been seeing sporadically in testing, but unable to
>>>>> reproduce reliably. There seems to be something quite dodgy with
>>>>> interaction of vcpu_wake and scheduling loops.
>>>>>
>>>>> The other logs indicate that dom0 appears to have a domain id of 1,
>>>>> which is sure to cause problems.
>>>> Actually - ignore this
>>>>
>>>> From the log,
>>>>
>>>> (XEN) physdev.c:153: dom0: can't create irq for msi!
>>>> [  113.637037] xhci_hcd 0000:03:00.0: xen map irq failed -22 for 32752 domain
>>>> (XEN) physdev.c:153: dom0: can't create irq for msi!
>>>> [  113.657911] xhci_hcd 0000:03:00.0: xen map irq failed -22 for 32752 domain
>>>>
>>>> and later
>>>>
>>>> (XEN) physdev.c:153: dom1: can't create irq for msi!
>>>> [  121.909814] pciback 0000:00:19.0: xen map irq failed -22 for 1 domain
>>>> [  121.954080] error enable msi for guest 1 status ffffffea
>>>> (XEN) physdev.c:153: dom1: can't create irq for msi!
>>>> [  122.035355] pciback 0000:00:19.0: xen map irq failed -22 for 1 domain
>>>> [  122.044421] error enable msi for guest 1 status ffffffea
>>>>
>>>> I think that there is a separate bug where mapped irqs are not unmapped
>>>> on the suspend path.
>>> You thinking this is a Linux (xen irq machinery) issue? Meaning it should
>>> end up calling PHYSDEV_unmap_pirq as part of the suspend process?
>> I am not sure. Without looking at the code, I am only speculating.
>>
>> Beyond that, the main question is about the expected behaviour. Do we
>> expect dom0/U to unmap its irqs and remap them after resume? What do we
>> expect from domains which are unaware of the host sleep action?
> BTW this is the case: domain 1 isn't fully aware of sleep. It have some PCI
> devices assigned. The only action taken there before suspend is shutdown
> network interfaces (without this system hanged during suspend).
>
What do you mean here by shutting down the network interfaces? Are the
devices being assigned back to dom0? If so, is dom0 assigning them back
to domU before the domU driver tries to set itself up?

~Andrew
Marek Marczykowski
2013-Mar-28 17:44 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
On 28.03.2013 18:41, Andrew Cooper wrote:
> On 27/03/2013 17:15, Marek Marczykowski wrote:
>> [...]
>> BTW this is the case: domain 1 isn't fully aware of sleep. It have some PCI
>> devices assigned. The only action taken there before suspend is shutdown
>> network interfaces (without this system hanged during suspend).
>>
> What do you mean here by shutting down the network interfaces? Are the
> devices being assigned back to dom0?

No, just a simple "ip link set eth0 down". It seems to be enough for suspend to
succeed, at least on most hardware...

> If so, is dom0 assigning them back
> to domU before the domU driver tries to set itself up?

-- 
Best Regards / Pozdrawiam,
Marek Marczykowski
Invisible Things Lab
Andrew Cooper
2013-Mar-28 17:50 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
On 28/03/2013 17:44, Marek Marczykowski wrote:
> On 28.03.2013 18:41, Andrew Cooper wrote:
>> [...]
>> What do you mean here by shutting down the network interfaces? Are the
>> devices being assigned back to dom0?
> No, just a simple "ip link set eth0 down". It seems to be enough for suspend to
> succeed, at least on most hardware...

In which case repeat map_pirq hypercalls will fail with -EINVAL because
the pirq is already set up. It is probably worth putting a printk in
map_pirq and unmap_pirq to see exactly what is happening across the
sleep/resume cycle.

~Andrew
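A minimal instrumentation sketch along the lines Andrew suggests, in the same patch style as used elsewhere in this thread. The hunk locations are hypothetical (shown only to illustrate where the printks would go in 4.1-era xen/arch/x86/physdev.c); the field names assume the physdev_map_pirq/physdev_unmap_pirq argument structures.

```diff
--- a/xen/arch/x86/physdev.c
+++ b/xen/arch/x86/physdev.c
@@ (inside physdev_map_pirq(), hypothetical location)
+    /* debug: trace pirq mapping across the sleep/resume cycle */
+    printk("map_pirq: dom%d type %d index %d pirq %d\n",
+           d->domain_id, map->type, map->index, map->pirq);
@@ (inside physdev_unmap_pirq(), hypothetical location)
+    /* debug: trace pirq unmapping */
+    printk("unmap_pirq: dom%d pirq %d\n", d->domain_id, unmap->pirq);
```

Comparing the map/unmap trace before suspend and after resume would show directly whether stale pirq mappings are the cause of the -EINVAL failures in the logs above.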
Marek Marczykowski
2013-Mar-28 19:03 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
On 28.03.2013 17:13, Jan Beulich wrote:
>>>> On 27.03.13 at 15:31, Marek Marczykowski <marmarek@invisiblethingslab.com> wrote:
>> Also one time I've got fatal page fault error, earlier in resume (it isn't
>> deterministic):
>> http://duch.mimuw.edu.pl/~marmarek/qubes/xen-4.1-resume-page-fault.log
>
> This is mostly identical to
> http://lists.xen.org/archives/html/xen-devel/2013-01/msg02175.html,
> and hence I would assume that the patch Ben posted (v4 came
> through yesterday) would be fixing this. Care to give this a try?

With this, together with your previous patch ("x86: irq_move_cleanup_interrupt()
must ignore legacy vectors"), I can't hit the previous IRQ setup problem (at
least for a few tries).

But it still doesn't solve the original problem - after suspend the system
temperature goes high, and apparently only CPU0 is online. If I pin some domain
vCPU to a non-0 CPU before suspend, I hit an ASSERT() on resume:

(XEN) Finishing wakeup from ACPI S3 state.
(XEN) Enabling non-boot CPUs ...
(XEN) Suppress EOI broadcast on CPU#1
(XEN) masked ExtINT on CPU#1
(XEN) Suppress EOI broadcast on CPU#2
(XEN) masked ExtINT on CPU#2
(XEN) Suppress EOI broadcast on CPU#3
(XEN) masked ExtINT on CPU#3
(XEN) Restoring affinity for d2v3
(XEN) Assertion '!cpus_empty(cpus) && cpu_isset(cpu, cpus)' failed at sched_credit.c:481

xl cpupool-list -c:
Name            CPU list
Pool-0          0

xl cpupool-cpu-add Pool-0 1 -> -EBUSY

-- 
Best Regards / Pozdrawiam,
Marek Marczykowski
Invisible Things Lab
Marek Marczykowski
2013-Mar-29 00:26 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
On 28.03.2013 18:50, Andrew Cooper wrote:
> On 28/03/2013 17:44, Marek Marczykowski wrote:
>> [...]
>> No, just a simple "ip link set eth0 down". It seems to be enough for suspend
>> to succeed, at least on most hardware...
>
> In which case repeat map_pirq hypercalls will fail with -EINVAL because
> the pirq is already set up. It is probably worth putting a printk in
> map_pirq and unmap_pirq to see exactly what is happening across the
> sleep/resume cycle.

No unmap/map is done during the sleep/resume cycle for that domain (it has two
mapped pirqs). Even for dom0 I see only one unmap/map during suspend/resume.
For most devices this doesn't break anything. A few exceptions need a module
reload after resume (e.g. sky2), but I'm not sure about the reason (no
additional logs, simply no link detected).

-- 
Best Regards / Pozdrawiam,
Marek Marczykowski
Invisible Things Lab
Ben Guthro
2013-Apr-01 13:53 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
On Thu, Mar 28, 2013 at 3:03 PM, Marek Marczykowski
<marmarek@invisiblethingslab.com> wrote:
> (XEN) Restoring affinity for d2v3
> (XEN) Assertion '!cpus_empty(cpus) && cpu_isset(cpu, cpus)' failed at
> sched_credit.c:481

I think the "fix-suspend-scheduler-*" patches posted here are applicable here:
http://markmail.org/message/llj3oyhgjzvw3t23

Specifically, I think you need this bit:

diff --git a/xen/common/cpu.c b/xen/common/cpu.c
index 630881e..e20868c 100644
--- a/xen/common/cpu.c
+++ b/xen/common/cpu.c
@@ -5,6 +5,7 @@
 #include <xen/init.h>
 #include <xen/sched.h>
 #include <xen/stop_machine.h>
+#include <xen/sched-if.h>

 unsigned int __read_mostly nr_cpu_ids = NR_CPUS;
 #ifndef nr_cpumask_bits
@@ -212,6 +213,8 @@ void enable_nonboot_cpus(void)
             BUG_ON(error == -EBUSY);
             printk("Error taking CPU%d up: %d\n", cpu, error);
         }
+        if (system_state == SYS_STATE_resume)
+            cpumask_set_cpu(cpu, cpupool0->cpu_valid);
     }

     cpumask_clear(&frozen_cpus);
Marek Marczykowski
2013-Apr-02 01:13 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
On 01.04.2013 15:53, Ben Guthro wrote:
> On Thu, Mar 28, 2013 at 3:03 PM, Marek Marczykowski
> <marmarek@invisiblethingslab.com> wrote:
>> (XEN) Restoring affinity for d2v3
>> (XEN) Assertion '!cpus_empty(cpus) && cpu_isset(cpu, cpus)' failed at
>> sched_credit.c:481
>
> I think the "fix-suspend-scheduler-*" patches posted here are applicable here:
> http://markmail.org/message/llj3oyhgjzvw3t23
>
> Specifically, I think you need this bit:
>
> [patch to xen/common/cpu.c snipped]

Indeed, this makes things better, but still not ideal.

Now after resume all CPUs are in Pool-0, which is good. But CPU0 is much more
preferred than the others (xl vcpu-list). For example, if I start 4 busy loops
in dom0, I get (even after some time):

[user@dom0 ~]$ xl vcpu-list
Name           ID  VCPU   CPU State   Time(s) CPU Affinity
dom0            0     0     0   r--      98.5  any cpu
dom0            0     1     0   ---     181.3  any cpu
dom0            0     2     2   r--     262.4  any cpu
dom0            0     3     3   r--     230.8  any cpu
netvm           1     0     0   -b-      18.4  any cpu
netvm           1     1     0   -b-       9.1  any cpu
netvm           1     2     0   -b-       7.1  any cpu
netvm           1     3     0   -b-       5.4  any cpu
firewallvm      2     0     0   -b-      10.7  any cpu
firewallvm      2     1     0   -b-       3.0  any cpu
firewallvm      2     2     0   -b-       2.5  any cpu
firewallvm      2     3     3   -b-       3.6  any cpu

If I remove some CPU from Pool-0 and re-add it, things go back to normal for
that particular CPU (so I get two equally used CPUs) - to fully restore the
system I must remove all CPUs but CPU0 from Pool-0 and add them again.

Also, still only CPU0 has all C-states (C0-C3); all the others have only C0-C1.
This could probably be fixed by your "xen: Re-upload processor PM data to
hypervisor after S3 resume" patch (a reload of the xen-acpi-processor module
helps here). But I don't think that is the right way: it isn't necessary on
other systems (with somewhat older hardware). Something must be missing on the
resume path. The question is what...

Perhaps someone needs to go through enable_nonboot_cpus() (__cpu_up?) and check
whether it restores everything disabled in disable_nonboot_cpus()
(__cpu_disable?). Unfortunately I don't know the x86 details well enough to
follow that code...

-- 
Best Regards / Pozdrawiam,
Marek Marczykowski
Invisible Things Lab
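The cpupool cycling workaround described above can be scripted. This sketch only generates the xl commands (the CPU count is a parameter, and the function name is invented here) rather than running them, since executing them needs a live Xen host:

```shell
# Emit the xl commands that cycle CPUs 1..(ncpus-1) out of and back into
# Pool-0 after resume, per the workaround described above. Pipe to sh on a
# real system: cpupool_cycle_cmds 4 | sh
cpupool_cycle_cmds() {
    local ncpus=$1 cpu
    for cpu in $(seq 1 $((ncpus - 1))); do
        echo "xl cpupool-cpu-remove Pool-0 $cpu"
        echo "xl cpupool-cpu-add Pool-0 $cpu"
    done
}
```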
Konrad Rzeszutek Wilk
2013-Apr-02 14:05 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
On Tue, Apr 02, 2013 at 03:13:56AM +0200, Marek Marczykowski wrote:
> On 01.04.2013 15:53, Ben Guthro wrote:
> > On Thu, Mar 28, 2013 at 3:03 PM, Marek Marczykowski
> > <marmarek@invisiblethingslab.com> wrote:
> >> (XEN) Restoring affinity for d2v3
> >> (XEN) Assertion '!cpus_empty(cpus) && cpu_isset(cpu, cpus)' failed at
> >> sched_credit.c:481
> >
> > I think the "fix-suspend-scheduler-*" patches posted here are applicable here:
> > http://markmail.org/message/llj3oyhgjzvw3t23
> >
> > Specifically, I think you need this bit:
> >
> > diff --git a/xen/common/cpu.c b/xen/common/cpu.c
> > index 630881e..e20868c 100644
> > --- a/xen/common/cpu.c
> > +++ b/xen/common/cpu.c
> > @@ -5,6 +5,7 @@
> >  #include <xen/init.h>
> >  #include <xen/sched.h>
> >  #include <xen/stop_machine.h>
> > +#include <xen/sched-if.h>
> >
> >  unsigned int __read_mostly nr_cpu_ids = NR_CPUS;
> >  #ifndef nr_cpumask_bits
> > @@ -212,6 +213,8 @@ void enable_nonboot_cpus(void)
> >              BUG_ON(error == -EBUSY);
> >              printk("Error taking CPU%d up: %d\n", cpu, error);
> >          }
> > +        if (system_state == SYS_STATE_resume)
> > +            cpumask_set_cpu(cpu, cpupool0->cpu_valid);
> >      }
> >
> >      cpumask_clear(&frozen_cpus);
>
> Indeed, this makes things better, but still not ideal.
> Now after resume all CPUs are in Pool-0, which is good. But CPU0 is much more
> preferred than the others (xl vcpu-list). For example, if I start 4 busy
> loops in dom0, I get (even after some time):
>
> [user@dom0 ~]$ xl vcpu-list
> Name            ID  VCPU   CPU State   Time(s) CPU Affinity
> dom0             0     0     0   r--      98.5  any cpu
> dom0             0     1     0   ---     181.3  any cpu
> dom0             0     2     2   r--     262.4  any cpu
> dom0             0     3     3   r--     230.8  any cpu
> netvm            1     0     0   -b-      18.4  any cpu
> netvm            1     1     0   -b-       9.1  any cpu
> netvm            1     2     0   -b-       7.1  any cpu
> netvm            1     3     0   -b-       5.4  any cpu
> firewallvm       2     0     0   -b-      10.7  any cpu
> firewallvm       2     1     0   -b-       3.0  any cpu
> firewallvm       2     2     0   -b-       2.5  any cpu
> firewallvm       2     3     3   -b-       3.6  any cpu
>
> If I remove some CPU from Pool-0 and re-add it, things go back to normal for
> this particular CPU (so I get two equally used CPUs) - to fully restore the
> system I must remove all CPUs but CPU0 from Pool-0 and add them again.
>
> Also, still only CPU0 has all C-states (C0-C3); all the others have only
> C0-C1. This could probably be fixed by your "xen: Re-upload processor PM
> data to hypervisor after S3 resume" patch (a reload of the
> xen-acpi-processor module helps here). But I don't think that is the right
> way: it isn't necessary on other systems (with somewhat older hardware), so
> something must be missing on the resume path. The question is what...

The xen-acpi-processor should probably also have the cpu hotplug notification
in it to deal with this - so that you don't need to do the reload.

> Perhaps someone needs to go through enable_nonboot_cpus() (__cpu_up?) and
> check whether it restores everything disabled in disable_nonboot_cpus()
> (__cpu_disable?). Unfortunately I don't know the x86 details well enough to
> follow that code...
>
> --
> Best Regards / Pozdrawiam,
> Marek Marczykowski
> Invisible Things Lab
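Konrad's suggestion - re-uploading a CPU's PM data from a hotplug "online" callback instead of requiring a module reload - can be illustrated with a small stand-alone model. This is only a sketch of the idea in plain C; the types and names (`pm_data`, `upload_pm_data`, `cpu_online_notifier`) are illustrative stand-ins, not the real kernel API:

```c
#include <stddef.h>

/* Stand-in for the per-CPU ACPI PM data the real module would upload. */
struct pm_data { int acpi_id; int uploaded; };

#define NR_CPUS 4
static struct pm_data pm[NR_CPUS];

/* Stand-in for upload_pm_data(): push one CPU's C/P-state data to Xen. */
static void upload_pm_data(struct pm_data *d) { d->uploaded = 1; }

/* Hotplug callback: on a CPU-online event (as happens for each non-boot
 * CPU during S3 resume), re-upload that CPU's PM data automatically. */
static void cpu_online_notifier(int cpu)
{
    if (cpu >= 0 && cpu < NR_CPUS)
        upload_pm_data(&pm[cpu]);
}

/* Simulate resume: non-boot CPUs come back online one by one. */
void simulate_resume(void)
{
    for (int cpu = 1; cpu < NR_CPUS; cpu++)
        cpu_online_notifier(cpu);
}
```

In the real driver the callback would be registered with the kernel's CPU hotplug notification machinery at module load, so the re-upload happens on every online transition rather than only at init.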
Marek Marczykowski
2013-Apr-15 22:09 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
On 02.04.2013 03:13, Marek Marczykowski wrote:
> On 01.04.2013 15:53, Ben Guthro wrote:
>> On Thu, Mar 28, 2013 at 3:03 PM, Marek Marczykowski
>> <marmarek@invisiblethingslab.com> wrote:
>>> (XEN) Restoring affinity for d2v3
>>> (XEN) Assertion '!cpus_empty(cpus) && cpu_isset(cpu, cpus)' failed at
>>> sched_credit.c:481
>>
>> I think the "fix-suspend-scheduler-*" patches posted here are applicable here:
>> http://markmail.org/message/llj3oyhgjzvw3t23
>>
>> Specifically, I think you need this bit:
>>
>> diff --git a/xen/common/cpu.c b/xen/common/cpu.c
>> index 630881e..e20868c 100644
>> --- a/xen/common/cpu.c
>> +++ b/xen/common/cpu.c
>> @@ -5,6 +5,7 @@
>>  #include <xen/init.h>
>>  #include <xen/sched.h>
>>  #include <xen/stop_machine.h>
>> +#include <xen/sched-if.h>
>>
>>  unsigned int __read_mostly nr_cpu_ids = NR_CPUS;
>>  #ifndef nr_cpumask_bits
>> @@ -212,6 +213,8 @@ void enable_nonboot_cpus(void)
>>              BUG_ON(error == -EBUSY);
>>              printk("Error taking CPU%d up: %d\n", cpu, error);
>>          }
>> +        if (system_state == SYS_STATE_resume)
>> +            cpumask_set_cpu(cpu, cpupool0->cpu_valid);
>>      }
>>
>>      cpumask_clear(&frozen_cpus);
>
> Indeed, this makes things better, but still not ideal.
> Now after resume all CPUs are in Pool-0, which is good. But CPU0 is much more
> preferred than the others (xl vcpu-list). For example, if I start 4 busy
> loops in dom0, I get (even after some time):
>
> [user@dom0 ~]$ xl vcpu-list
> Name            ID  VCPU   CPU State   Time(s) CPU Affinity
> dom0             0     0     0   r--      98.5  any cpu
> dom0             0     1     0   ---     181.3  any cpu
> dom0             0     2     2   r--     262.4  any cpu
> dom0             0     3     3   r--     230.8  any cpu
> netvm            1     0     0   -b-      18.4  any cpu
> netvm            1     1     0   -b-       9.1  any cpu
> netvm            1     2     0   -b-       7.1  any cpu
> netvm            1     3     0   -b-       5.4  any cpu
> firewallvm       2     0     0   -b-      10.7  any cpu
> firewallvm       2     1     0   -b-       3.0  any cpu
> firewallvm       2     2     0   -b-       2.5  any cpu
> firewallvm       2     3     3   -b-       3.6  any cpu
>
> If I remove some CPU from Pool-0 and re-add it, things go back to normal for
> this particular CPU (so I get two equally used CPUs) - to fully restore the
> system I must remove all CPUs but CPU0 from Pool-0 and add them again.
>
> Also, still only CPU0 has all C-states (C0-C3); all the others have only
> C0-C1. This could probably be fixed by your "xen: Re-upload processor PM
> data to hypervisor after S3 resume" patch (a reload of the
> xen-acpi-processor module helps here). But I don't think that is the right
> way: it isn't necessary on other systems (with somewhat older hardware), so
> something must be missing on the resume path. The question is what...
>
> Perhaps someone needs to go through enable_nonboot_cpus() (__cpu_up?) and
> check whether it restores everything disabled in disable_nonboot_cpus()
> (__cpu_disable?). Unfortunately I don't know the x86 details well enough to
> follow that code...

To summarize the ACPI S3 issues:

I. Fixed issues:

1. IRQ problem, fixed by the "x86: irq_move_cleanup_interrupt() must ignore
   legacy vectors" commit
2. Assertion failure on resume when vcpu affinity is used, fixed by the
   "x86/S3: Restore broken vcpu affinity on resume" commit

II. Not (fully) fixed issues:

1. CPU Pool-0 contains only CPU0 after resume - the patch quoted above fixes
   the issue, but it isn't applied to xen-unstable
2. After resume the scheduler chooses (almost) only CPU0 (listing quoted
   above). Removing and re-adding all CPUs to Pool-0 solves the problem.
   Perhaps some timers are not restarted after resume?
3. ACPI C-states are only present for CPU0 (after resume, of course), fixed
   by Ben's "xen: Re-upload processor PM data to hypervisor after S3" patch,
   but it isn't in upstream Linux (nor in Konrad's acpi-s3 branches).

--
Best Regards / Pozdrawiam,
Marek Marczykowski
Invisible Things Lab
Ben Guthro
2013-Apr-15 23:36 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
On Mon, Apr 15, 2013 at 11:09 PM, Marek Marczykowski
<marmarek@invisiblethingslab.com> wrote:
> On 02.04.2013 03:13, Marek Marczykowski wrote:
>> On 01.04.2013 15:53, Ben Guthro wrote:
>>> On Thu, Mar 28, 2013 at 3:03 PM, Marek Marczykowski
>>> <marmarek@invisiblethingslab.com> wrote:
>>>> (XEN) Restoring affinity for d2v3
>>>> (XEN) Assertion '!cpus_empty(cpus) && cpu_isset(cpu, cpus)' failed at
>>>> sched_credit.c:481
>>>
>>> I think the "fix-suspend-scheduler-*" patches posted here are applicable here:
>>> http://markmail.org/message/llj3oyhgjzvw3t23
>>>
>>> Specifically, I think you need this bit:
>>>
>>> diff --git a/xen/common/cpu.c b/xen/common/cpu.c
>>> index 630881e..e20868c 100644
>>> --- a/xen/common/cpu.c
>>> +++ b/xen/common/cpu.c
>>> @@ -5,6 +5,7 @@
>>>  #include <xen/init.h>
>>>  #include <xen/sched.h>
>>>  #include <xen/stop_machine.h>
>>> +#include <xen/sched-if.h>
>>>
>>>  unsigned int __read_mostly nr_cpu_ids = NR_CPUS;
>>>  #ifndef nr_cpumask_bits
>>> @@ -212,6 +213,8 @@ void enable_nonboot_cpus(void)
>>>              BUG_ON(error == -EBUSY);
>>>              printk("Error taking CPU%d up: %d\n", cpu, error);
>>>          }
>>> +        if (system_state == SYS_STATE_resume)
>>> +            cpumask_set_cpu(cpu, cpupool0->cpu_valid);
>>>      }
>>>
>>>      cpumask_clear(&frozen_cpus);
>>
>> Indeed, this makes things better, but still not ideal.
>> Now after resume all CPUs are in Pool-0, which is good. But CPU0 is much
>> more preferred than the others (xl vcpu-list). For example, if I start 4
>> busy loops in dom0, I get (even after some time):
>>
>> [user@dom0 ~]$ xl vcpu-list
>> Name            ID  VCPU   CPU State   Time(s) CPU Affinity
>> dom0             0     0     0   r--      98.5  any cpu
>> dom0             0     1     0   ---     181.3  any cpu
>> dom0             0     2     2   r--     262.4  any cpu
>> dom0             0     3     3   r--     230.8  any cpu
>> netvm            1     0     0   -b-      18.4  any cpu
>> netvm            1     1     0   -b-       9.1  any cpu
>> netvm            1     2     0   -b-       7.1  any cpu
>> netvm            1     3     0   -b-       5.4  any cpu
>> firewallvm       2     0     0   -b-      10.7  any cpu
>> firewallvm       2     1     0   -b-       3.0  any cpu
>> firewallvm       2     2     0   -b-       2.5  any cpu
>> firewallvm       2     3     3   -b-       3.6  any cpu
>>
>> If I remove some CPU from Pool-0 and re-add it, things go back to normal
>> for this particular CPU (so I get two equally used CPUs) - to fully restore
>> the system I must remove all CPUs but CPU0 from Pool-0 and add them again.
>>
>> Also, still only CPU0 has all C-states (C0-C3); all the others have only
>> C0-C1. This could probably be fixed by your "xen: Re-upload processor PM
>> data to hypervisor after S3 resume" patch (a reload of the
>> xen-acpi-processor module helps here). But I don't think that is the right
>> way: it isn't necessary on other systems (with somewhat older hardware),
>> so something must be missing on the resume path. The question is what...
>>
>> Perhaps someone needs to go through enable_nonboot_cpus() (__cpu_up?) and
>> check whether it restores everything disabled in disable_nonboot_cpus()
>> (__cpu_disable?). Unfortunately I don't know the x86 details well enough
>> to follow that code...
>
> To summarize the ACPI S3 issues:
>
> I. Fixed issues:
>
> 1. IRQ problem, fixed by the "x86: irq_move_cleanup_interrupt() must ignore
>    legacy vectors" commit
> 2. Assertion failure on resume when vcpu affinity is used, fixed by the
>    "x86/S3: Restore broken vcpu affinity on resume" commit
>
> II. Not (fully) fixed issues:
>
> 1. CPU Pool-0 contains only CPU0 after resume - the patch quoted above fixes
>    the issue, but it isn't applied to xen-unstable
> 2. After resume the scheduler chooses (almost) only CPU0 (listing quoted
>    above). Removing and re-adding all CPUs to Pool-0 solves the problem.
>    Perhaps some timers are not restarted after resume?

Marek,
Please try the patch from this thread to see if it solves your two issues
above:
http://markmail.org/thread/35ecqimv7bwq3k6d

This patch was NAK'ed due to cpupool breakage... but in my testing, it solved
both of these problems.

I don't know how to properly solve it in a cpupool-compatible way... but I
also haven't put much additional effort into doing so.

> 3. ACPI C-states are only present for CPU0 (after resume, of course), fixed
>    by Ben's "xen: Re-upload processor PM data to hypervisor after S3" patch,
>    but it isn't in upstream Linux (nor in Konrad's acpi-s3 branches).

I don't recall seeing any ACK / NAK from Konrad on this.

Original post:
https://patchwork.kernel.org/patch/2033981/

Konrad - do you have any thoughts about incorporating this into a
future merge window?

Ben
konrad wilk
2013-Apr-15 23:51 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
>> 3. ACPI C-states are only present for CPU0 (after resume, of course),
>> fixed by Ben's "xen: Re-upload processor PM data to hypervisor after S3"
>> patch, but it isn't in upstream Linux (nor in Konrad's acpi-s3 branches).
>
> I don't recall seeing any ACK / NAK from Konrad on this.
>
> Original post:
> https://patchwork.kernel.org/patch/2033981/
>
> Konrad - do you have any thoughts about incorporating this into a
> future merge window?

Hey Ben,
I seem to have missed it.
I think the patch is missing a change to pr_backup->acpi_id = i, otherwise it
would resend the C-states with the same APIC ID. Also, the upstream version
does kfree(pr_backup) at some point.

But more importantly, do you know why it is needed? Is the Xen hypervisor
"losing" this information because the CPUs go offline and then are onlined
again?
Ben Guthro
2013-Apr-16 00:19 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
On Tue, Apr 16, 2013 at 12:51 AM, konrad wilk <konrad.wilk@oracle.com> wrote:
>
>>> 3. ACPI C-states are only present for CPU0 (after resume, of course),
>>> fixed by Ben's "xen: Re-upload processor PM data to hypervisor after S3"
>>> patch, but it isn't in upstream Linux (nor in Konrad's acpi-s3 branches).
>>
>> I don't recall seeing any ACK / NAK from Konrad on this.
>>
>> Original post:
>> https://patchwork.kernel.org/patch/2033981/
>>
>> Konrad - do you have any thoughts about incorporating this into a
>> future merge window?
>
> Hey Ben,
> I seem to have missed it.
> I think the patch is missing a change to pr_backup->acpi_id = i, otherwise
> it would resend the C-states with the same APIC ID. Also, the upstream
> version does kfree(pr_backup) at some point.

Hmm. I'll look into this, and re-submit.

> But more importantly, do you know why it is needed? Is the Xen hypervisor
> "losing" this information because the CPUs go offline and then are onlined
> again?

It was a while ago... the first of a number of 4.2 S3-related performance
issues we were chasing, from reports by users / automated QA, where the end
result was "slow performance on S3 in XP".

As it turns out - this didn't fix the performance problem... but it also
didn't seem right.

I'm not sure if it is because the non-boot cpus are offlined... but it would
seem to make logical sense.

Ben
Ben Guthro
2013-Apr-16 00:46 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
On Tue, Apr 16, 2013 at 1:19 AM, Ben Guthro <ben@guthro.net> wrote:
> On Tue, Apr 16, 2013 at 12:51 AM, konrad wilk <konrad.wilk@oracle.com> wrote:
>>
>>>> 3. ACPI C-states are only present for CPU0 (after resume, of course),
>>>> fixed by Ben's "xen: Re-upload processor PM data to hypervisor after S3"
>>>> patch, but it isn't in upstream Linux (nor in Konrad's acpi-s3 branches).
>>>
>>> I don't recall seeing any ACK / NAK from Konrad on this.
>>>
>>> Original post:
>>> https://patchwork.kernel.org/patch/2033981/
>>>
>>> Konrad - do you have any thoughts about incorporating this into a
>>> future merge window?
>>
>> Hey Ben,
>> I seem to have missed it.
>> I think the patch is missing a change to pr_backup->acpi_id = i, otherwise
>> it would resend the C-states with the same APIC ID. Also, the upstream
>> version does kfree(pr_backup) at some point.
>
> Hmm. I'll look into this, and re-submit.

At the risk of seeming a bit dim, could you elaborate a bit here?

I'm looking at the function again, and perhaps I'm missing something.

Since xen_acpi_processor_resume() was a subset of what was done in
xen_acpi_processor_init(), I trimmed a number of things unused in the
functionality I was using.
This included the pr_backup-related things (both alloc & free).

I'm not seeing exactly what you are suggesting I am missing, if I don't even
have a pr_backup.

This usually means I overlooked something embarrassingly obvious. If you
would be so kind as to point this out so I can slap my forehead, I'd
appreciate it.

Thanks

Ben

>> But more importantly, do you know why it is needed? Is the Xen hypervisor
>> "losing" this information because the CPUs go offline and then are onlined
>> again?
>
> It was a while ago... the first of a number of 4.2 S3-related performance
> issues we were chasing, from reports by users / automated QA, where the end
> result was "slow performance on S3 in XP".
>
> As it turns out - this didn't fix the performance problem... but it also
> didn't seem right.
>
> I'm not sure if it is because the non-boot cpus are offlined... but it
> would seem to make logical sense.
>
> Ben
Marek Marczykowski
2013-Apr-16 01:02 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
On 16.04.2013 01:36, Ben Guthro wrote:
> On Mon, Apr 15, 2013 at 11:09 PM, Marek Marczykowski
> <marmarek@invisiblethingslab.com> wrote:
>> On 02.04.2013 03:13, Marek Marczykowski wrote:
>>> On 01.04.2013 15:53, Ben Guthro wrote:
>>>> On Thu, Mar 28, 2013 at 3:03 PM, Marek Marczykowski
>>>> <marmarek@invisiblethingslab.com> wrote:
>>>>> (XEN) Restoring affinity for d2v3
>>>>> (XEN) Assertion '!cpus_empty(cpus) && cpu_isset(cpu, cpus)' failed at
>>>>> sched_credit.c:481
>>>>
>>>> I think the "fix-suspend-scheduler-*" patches posted here are applicable here:
>>>> http://markmail.org/message/llj3oyhgjzvw3t23
>>>>
>>>> Specifically, I think you need this bit:
>>>>
>>>> diff --git a/xen/common/cpu.c b/xen/common/cpu.c
>>>> index 630881e..e20868c 100644
>>>> --- a/xen/common/cpu.c
>>>> +++ b/xen/common/cpu.c
>>>> @@ -5,6 +5,7 @@
>>>>  #include <xen/init.h>
>>>>  #include <xen/sched.h>
>>>>  #include <xen/stop_machine.h>
>>>> +#include <xen/sched-if.h>
>>>>
>>>>  unsigned int __read_mostly nr_cpu_ids = NR_CPUS;
>>>>  #ifndef nr_cpumask_bits
>>>> @@ -212,6 +213,8 @@ void enable_nonboot_cpus(void)
>>>>              BUG_ON(error == -EBUSY);
>>>>              printk("Error taking CPU%d up: %d\n", cpu, error);
>>>>          }
>>>> +        if (system_state == SYS_STATE_resume)
>>>> +            cpumask_set_cpu(cpu, cpupool0->cpu_valid);
>>>>      }
>>>>
>>>>      cpumask_clear(&frozen_cpus);
>>>
>>> Indeed, this makes things better, but still not ideal.
>>> Now after resume all CPUs are in Pool-0, which is good. But CPU0 is much
>>> more preferred than the others (xl vcpu-list). For example, if I start 4
>>> busy loops in dom0, I get (even after some time):
>>>
>>> [user@dom0 ~]$ xl vcpu-list
>>> Name            ID  VCPU   CPU State   Time(s) CPU Affinity
>>> dom0             0     0     0   r--      98.5  any cpu
>>> dom0             0     1     0   ---     181.3  any cpu
>>> dom0             0     2     2   r--     262.4  any cpu
>>> dom0             0     3     3   r--     230.8  any cpu
>>> netvm            1     0     0   -b-      18.4  any cpu
>>> netvm            1     1     0   -b-       9.1  any cpu
>>> netvm            1     2     0   -b-       7.1  any cpu
>>> netvm            1     3     0   -b-       5.4  any cpu
>>> firewallvm       2     0     0   -b-      10.7  any cpu
>>> firewallvm       2     1     0   -b-       3.0  any cpu
>>> firewallvm       2     2     0   -b-       2.5  any cpu
>>> firewallvm       2     3     3   -b-       3.6  any cpu
>>>
>>> If I remove some CPU from Pool-0 and re-add it, things go back to normal
>>> for this particular CPU (so I get two equally used CPUs) - to fully
>>> restore the system I must remove all CPUs but CPU0 from Pool-0 and add
>>> them again.
>>>
>>> Also, still only CPU0 has all C-states (C0-C3); all the others have only
>>> C0-C1. This could probably be fixed by your "xen: Re-upload processor PM
>>> data to hypervisor after S3 resume" patch (a reload of the
>>> xen-acpi-processor module helps here). But I don't think that is the
>>> right way: it isn't necessary on other systems (with somewhat older
>>> hardware), so something must be missing on the resume path. The question
>>> is what...
>>>
>>> Perhaps someone needs to go through enable_nonboot_cpus() (__cpu_up?) and
>>> check whether it restores everything disabled in disable_nonboot_cpus()
>>> (__cpu_disable?). Unfortunately I don't know the x86 details well enough
>>> to follow that code...
>>
>> To summarize the ACPI S3 issues:
>>
>> I. Fixed issues:
>>
>> 1. IRQ problem, fixed by the "x86: irq_move_cleanup_interrupt() must
>>    ignore legacy vectors" commit
>> 2. Assertion failure on resume when vcpu affinity is used, fixed by the
>>    "x86/S3: Restore broken vcpu affinity on resume" commit
>>
>> II. Not (fully) fixed issues:
>>
>> 1. CPU Pool-0 contains only CPU0 after resume - the patch quoted above
>>    fixes the issue, but it isn't applied to xen-unstable
>> 2. After resume the scheduler chooses (almost) only CPU0 (listing quoted
>>    above). Removing and re-adding all CPUs to Pool-0 solves the problem.
>>    Perhaps some timers are not restarted after resume?
>
> Marek,
> Please try the patch from this thread to see if it solves your two issues
> above:
> http://markmail.org/thread/35ecqimv7bwq3k6d
>
> This patch was NAK'ed due to cpupool breakage... but in my testing, it
> solved both of these problems.
>
> I don't know how to properly solve it in a cpupool-compatible way... but I
> also haven't put much additional effort into doing so.

Indeed, this makes the problem disappear.

--
Best Regards / Pozdrawiam,
Marek Marczykowski
Invisible Things Lab
konrad wilk
2013-Apr-16 03:20 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
On 4/15/2013 8:46 PM, Ben Guthro wrote:
> On Tue, Apr 16, 2013 at 1:19 AM, Ben Guthro <ben@guthro.net> wrote:
>> On Tue, Apr 16, 2013 at 12:51 AM, konrad wilk <konrad.wilk@oracle.com> wrote:
>>>>> 3. ACPI C-states are only present for CPU0 (after resume, of course),
>>>>> fixed by Ben's "xen: Re-upload processor PM data to hypervisor after
>>>>> S3" patch, but it isn't in upstream Linux (nor in Konrad's acpi-s3
>>>>> branches).
>>>> I don't recall seeing any ACK / NAK from Konrad on this.
>>>>
>>>> Original post:
>>>> https://patchwork.kernel.org/patch/2033981/
>>>>
>>>> Konrad - do you have any thoughts about incorporating this into a
>>>> future merge window?
>>>
>>> Hey Ben,
>>> I seem to have missed it.
>>> I think the patch is missing a change to pr_backup->acpi_id = i,
>>> otherwise it would resend the C-states with the same APIC ID. Also, the
>>> upstream version does kfree(pr_backup) at some point.
>> Hmm. I'll look into this, and re-submit.
> At the risk of seeming a bit dim, could you elaborate a bit here?

Part of what xen-acpi-processor has to deal with is the 'dom0_max_vcpus='
case. Which means that when 'acpi_processor_get_performance_info' is called
to parse ACPI C-states, it will limit itself to only the 'online' CPUs it
sees. Meaning that all the other ones (which might be physically present)
which Linux does not see are skipped. As such there is this:

545         if (!pr_backup) {
546                 pr_backup = kzalloc(sizeof(struct acpi_processor), GFP_KERNEL);
547                 if (pr_backup)
548                         memcpy(pr_backup, _pr, sizeof(struct acpi_processor));
549         }

And then later

552         rc = check_acpi_ids(pr_backup);

which walks the ACPI namespace, checking whether it has uploaded the ACPI IDs
for all the CPUs. If there are some that are missing (because
dom0_max_vcpus=X was used), then it uploads the pr_backup with the ACPI ID
altered.

What I think you ought to try is just to call check_acpi_ids() after the
for_each_online_cpu() loop with the pr_backup.

Hm, you could actually make this even easier. Just move this code:

539         for_each_possible_cpu(i) {
540                 struct acpi_processor *_pr;
541                 _pr = per_cpu(processors, i /* APIC ID */);
542                 if (!_pr)
543                         continue;
544
545                 if (!pr_backup) {
546                         pr_backup = kzalloc(sizeof(struct acpi_processor), GFP_KERNEL);
547                         if (pr_backup)
548                                 memcpy(pr_backup, _pr, sizeof(struct acpi_processor));
549                 }
550                 (void)upload_pm_data(_pr);
551         }
552         rc = check_acpi_ids(pr_backup);

into its own function. Then make both the module loading _and_ the syscore
resume call said function. Voila!

Naturally the kfree(pr_backup) and pr_backup = NULL have to be eliminated
from the module_init function... and the module_exit needs the kfree of
pr_backup moved past the syscore_unregister.
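The refactor Konrad describes - factoring the upload loop into one function called from both module init and the syscore resume hook, with pr_backup allocated once and kept alive - can be sketched with a stand-alone model. This is only an illustration in plain C with stub types and stand-in names (`processors`, `upload_pm_data`, `check_acpi_ids` mimic the driver's identifiers but are not the real kernel API):

```c
#include <stdlib.h>
#include <string.h>

/* Minimal stand-ins for the kernel structures involved. */
struct acpi_processor { int acpi_id; };

#define NR_POSSIBLE_CPUS 4
/* Stand-in for per_cpu(processors, i): NULL where Linux sees no CPU. */
static struct acpi_processor *processors[NR_POSSIBLE_CPUS];
static struct acpi_processor *pr_backup;  /* allocated once, never freed early */
static int uploads;                       /* counts uploads, for illustration */

static void upload_pm_data(struct acpi_processor *pr) { (void)pr; uploads++; }

/* Stand-in for check_acpi_ids(): covers CPUs Linux doesn't see
 * (e.g. when dom0_max_vcpus=X hides some of them). */
static int check_acpi_ids(struct acpi_processor *tmpl) { return tmpl ? 0 : -1; }

/* The factored-out loop: callable from module init AND syscore resume. */
int upload_all_pm_data(void)
{
    for (int i = 0; i < NR_POSSIBLE_CPUS; i++) {
        struct acpi_processor *_pr = processors[i];
        if (!_pr)
            continue;
        if (!pr_backup) {                 /* keep one template around */
            pr_backup = malloc(sizeof(*pr_backup));
            if (pr_backup)
                memcpy(pr_backup, _pr, sizeof(*_pr));
        }
        upload_pm_data(_pr);
    }
    return check_acpi_ids(pr_backup);
}
```

Calling `upload_all_pm_data()` a second time (the "resume" call) simply re-uploads everything, which is exactly the behavior wanted after S3; in the real module the backup would only be `kfree`'d in module_exit, after unregistering the syscore ops.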
Jan Beulich
2013-Apr-16 08:47 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
>>> On 16.04.13 at 00:09, Marek Marczykowski <marmarek@invisiblethingslab.com> wrote:
> II. Not (fully) fixed issues:
>
> 1. CPU Pool-0 contains only CPU0 after resume - the patch quoted above fixes
>    the issue, but it isn't applied to xen-unstable
> 2. After resume the scheduler chooses (almost) only CPU0 (listing quoted
>    above). Removing and re-adding all CPUs to Pool-0 solves the problem.
>    Perhaps some timers are not restarted after resume?

So I understand there is a patch dealing with this, but I'm not clear
whether that's known to break CPU pools?

> 3. ACPI C-states are only present for CPU0 (after resume, of course), fixed
>    by Ben's "xen: Re-upload processor PM data to hypervisor after S3" patch,
>    but it isn't in upstream Linux (nor in Konrad's acpi-s3 branches).

Perhaps this rather ought to be fixed in the hypervisor (to not
forget the respective information; perhaps also for P-states)?
After all, that's another case where S3 is different from soft or hard
offlining an individual CPU (in particular, we can expect the same
CPU to come back up during resume, whereas a hot-unplugged one
could get replaced by a [slightly] different one).

Jan
Ben Guthro
2013-Apr-16 11:49 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
On Tue, Apr 16, 2013 at 4:47 AM, Jan Beulich <JBeulich@suse.com> wrote:
>>>> On 16.04.13 at 00:09, Marek Marczykowski <marmarek@invisiblethingslab.com> wrote:
>> II. Not (fully) fixed issues:
>>
>> 1. CPU Pool-0 contains only CPU0 after resume - the patch quoted above
>>    fixes the issue, but it isn't applied to xen-unstable
>> 2. After resume the scheduler chooses (almost) only CPU0 (listing quoted
>>    above). Removing and re-adding all CPUs to Pool-0 solves the problem.
>>    Perhaps some timers are not restarted after resume?
>
> So I understand there is a patch dealing with this, but I'm not clear
> whether that's known to break CPU pools?

All cpus will end up in cpu pool 0 after S3.
I'm not sure that is "broken" - but it probably isn't ideal either.

IMO - it is better than the alternative state... but Juergen seems to
disagree.

>> 3. ACPI C-states are only present for CPU0 (after resume, of course),
>>    fixed by Ben's "xen: Re-upload processor PM data to hypervisor after
>>    S3" patch, but it isn't in upstream Linux (nor in Konrad's acpi-s3
>>    branches).
>
> Perhaps this rather ought to be fixed in the hypervisor (to not
> forget the respective information; perhaps also for P-states)?
> After all, that's another case where S3 is different from soft or hard
> offlining an individual CPU (in particular, we can expect the same
> CPU to come back up during resume, whereas a hot-unplugged one
> could get replaced by a [slightly] different one).
>
> Jan
Jan Beulich
2013-Apr-16 11:57 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
>>> On 16.04.13 at 13:49, Ben Guthro <ben@guthro.net> wrote:
> On Tue, Apr 16, 2013 at 4:47 AM, Jan Beulich <JBeulich@suse.com> wrote:
>>>>> On 16.04.13 at 00:09, Marek Marczykowski
>>>>> <marmarek@invisiblethingslab.com> wrote:
>>> II. Not (fully) fixed issues:
>>>
>>> 1. CPU Pool-0 contains only CPU0 after resume - the patch quoted above
>>>    fixes the issue, but it isn't applied to xen-unstable
>>> 2. After resume the scheduler chooses (almost) only CPU0 (listing quoted
>>>    above). Removing and re-adding all CPUs to Pool-0 solves the problem.
>>>    Perhaps some timers are not restarted after resume?
>>
>> So I understand there is a patch dealing with this, but I'm not clear
>> whether that's known to break CPU pools?
>
> All cpus will end up in cpu pool 0 after S3.
> I'm not sure that is "broken" - but it probably isn't ideal either.
>
> IMO - it is better than the alternative state... but Juergen seems to
> disagree.

But it can't be that difficult to save/restore pool association on top
of said patch?

Jan
Ben Guthro
2013-Apr-16 12:09 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
On Tue, Apr 16, 2013 at 7:57 AM, Jan Beulich <JBeulich@suse.com> wrote:
>>>> On 16.04.13 at 13:49, Ben Guthro <ben@guthro.net> wrote:
>> On Tue, Apr 16, 2013 at 4:47 AM, Jan Beulich <JBeulich@suse.com> wrote:
>>>>>> On 16.04.13 at 00:09, Marek Marczykowski
>>>>>> <marmarek@invisiblethingslab.com> wrote:
>>>> II. Not (fully) fixed issues:
>>>>
>>>> 1. CPU Pool-0 contains only CPU0 after resume - the patch quoted above
>>>>    fixes the issue, but it isn't applied to xen-unstable
>>>> 2. After resume the scheduler chooses (almost) only CPU0 (listing quoted
>>>>    above). Removing and re-adding all CPUs to Pool-0 solves the problem.
>>>>    Perhaps some timers are not restarted after resume?
>>>
>>> So I understand there is a patch dealing with this, but I'm not clear
>>> whether that's known to break CPU pools?
>>
>> All cpus will end up in cpu pool 0 after S3.
>> I'm not sure that is "broken" - but it probably isn't ideal either.
>>
>> IMO - it is better than the alternative state... but Juergen seems to
>> disagree.
>
> But it can't be that difficult to save/restore pool association on top
> of said patch?

I took a brief look, in the hopes of taking a similar tack as with the vcpu
affinity restoration.
However, it seems to be a slightly more difficult problem.
With the vcpu affinity, there was an existing structure to stash away the
information we needed after resume.

In a pcpu, there is no such associated metadata... the SMP processor id is
just an integer.
So - where would we store the pool information temporarily across the S3
process?

Ben
Jan Beulich
2013-Apr-16 12:51 UTC
Re: High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x
>>> On 16.04.13 at 14:09, Ben Guthro <ben@guthro.net> wrote:
> On Tue, Apr 16, 2013 at 7:57 AM, Jan Beulich <JBeulich@suse.com> wrote:
>>>>> On 16.04.13 at 13:49, Ben Guthro <ben@guthro.net> wrote:
>>> On Tue, Apr 16, 2013 at 4:47 AM, Jan Beulich <JBeulich@suse.com> wrote:
>>>>>>> On 16.04.13 at 00:09, Marek Marczykowski
>>>>>>> <marmarek@invisiblethingslab.com> wrote:
>>>>> II. Not (fully) fixed issues:
>>>>>
>>>>> 1. CPU Pool-0 contains only CPU0 after resume - the patch quoted above
>>>>>    fixes the issue, but it isn't applied to xen-unstable
>>>>> 2. After resume the scheduler chooses (almost) only CPU0 (listing
>>>>>    quoted above). Removing and re-adding all CPUs to Pool-0 solves the
>>>>>    problem. Perhaps some timers are not restarted after resume?
>>>>
>>>> So I understand there is a patch dealing with this, but I'm not clear
>>>> whether that's known to break CPU pools?
>>>
>>> All cpus will end up in cpu pool 0 after S3.
>>> I'm not sure that is "broken" - but it probably isn't ideal either.
>>>
>>> IMO - it is better than the alternative state... but Juergen seems to
>>> disagree.
>>
>> But it can't be that difficult to save/restore pool association on top
>> of said patch?
>
> I took a brief look, in the hopes of taking a similar tack as with the vcpu
> affinity restoration.
> However, it seems to be a slightly more difficult problem.
> With the vcpu affinity, there was an existing structure to stash away the
> information we needed after resume.
>
> In a pcpu, there is no such associated metadata... the SMP processor id is
> just an integer.
> So - where would we store the pool information temporarily across the S3
> process?

Do it the other way around - the CPU pools have a mask of valid CPUs.
You could latch those pre-suspend for each of the pools (e.g. by again
introducing a second mask hanging off the same structure).

(Also adding Juergen to Cc in case he has other thoughts.)

Jan
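Jan's idea - latching each pool's valid-CPU mask before suspend and consulting it as CPUs come back up - can be modeled in a few lines. This is a toy model in plain C with 64-bit bitmasks standing in for Xen's cpumasks; the field name `cpu_suspended` and the function names are illustrative, not Xen's actual API:

```c
#include <stdint.h>
#include <stddef.h>

/* Toy cpupool: cpu_valid is the live mask; cpu_suspended is the second
 * mask hanging off the same structure, latched pre-suspend. */
struct cpupool {
    uint64_t cpu_valid;      /* CPUs currently in the pool */
    uint64_t cpu_suspended;  /* pre-suspend snapshot of cpu_valid */
};

/* Called once before S3: remember which CPUs belonged to which pool. */
void cpupool_presuspend(struct cpupool *pools, size_t n)
{
    for (size_t i = 0; i < n; i++)
        pools[i].cpu_suspended = pools[i].cpu_valid;
}

/* Called from the resume path as each non-boot CPU comes back online:
 * put it back into the pool it came from, not unconditionally Pool-0. */
void cpupool_resume_cpu(struct cpupool *pools, size_t n, unsigned int cpu)
{
    for (size_t i = 0; i < n; i++)
        if (pools[i].cpu_suspended & (1ull << cpu))
            pools[i].cpu_valid |= 1ull << cpu;
}
```

This is essentially the save/restore step layered on top of the `enable_nonboot_cpus()` patch quoted earlier in the thread: instead of hard-coding `cpupool0`, the resume hook would look the CPU up in the latched masks.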