Konrad Rzeszutek Wilk
2011-Nov-08 21:15 UTC
[Xen-devel] [PATCH] x86/acpi fixes for 3.2 (v1) impacting distributions.
I am posting three patches that are impacting distributions (both Ubuntu and Fedora Core 16) when running the Linux v3.1 (or later) under Xen. The first one is a regression: [PATCH 1/3] cpuidle: If disable_cpuidle() is called, set pm_idle to In 3.1 we set pm_idle to something else besides the default_idle which is not good. We want to use the default_halt as it does a yield hypercall, while the other pm_idle do not. Worst yet, when we would migrate a guest we could be using the wrong pm_idle code (on AMD boxes). The two other ones are more controversial and I am not sure if the path I had choosen is the "best" to fix the corruption problem. The "Right Way" would be to wrap pte_flags with a pvops call, but that has serious performance drawback implications. Ad nauseum details are in the patch: [PATCH 2/3] x86/cpa: Use pte_attrs instead of pte_flags on and the last one is not that important, but nonethless if somebody is running CONFIG_CPA_DEBUG and with a radeon or nouveau card they might get sporadic: "CPA (x) bad PTE" messages. This patch fixes that. [PATCH 3/3] x86/paravirt: Use pte_val instead of pte_flags on CPA The patches are also located in git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen.git stable/for-x86 arch/x86/include/asm/pgtable.h | 5 +++++ arch/x86/kernel/process.c | 5 +++++ arch/x86/mm/pageattr-test.c | 6 +++++- arch/x86/mm/pageattr.c | 2 +- include/linux/cpuidle.h | 2 ++ 5 files changed, 18 insertions(+), 2 deletions(-) _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Konrad Rzeszutek Wilk
2011-Nov-08 21:15 UTC
[Xen-devel] [PATCH 1/3] cpuidle: If disable_cpuidle() is called, set pm_idle to default_idle.
. however we ignore the disable_cpuidle() directive and in select_idle_routine() set it to type preferred by the CPU. This is a regression introduced by commit a0bfa1373859e9d11dc92561a8667588803e42d8 Author: Len Brown <len.brown@intel.com> Date: Fri Apr 1 19:34:59 2011 -0400 cpuidle: stop depending on pm_idle specifically this patch: commit d91ee5863b71e8c90eaf6035bff3078a85e2e7b5 Author: Len Brown <len.brown@intel.com> Date: Fri Apr 1 18:28:35 2011 -0400 cpuidle: replace xen access to x86 pm_idle and default_idle ..scribble on pm_idle and access default_idle, have it simply disable_cpuidle() so acpi_idle will not load and architecture default HLT will be used. .. but the default HLT does not get used. Instead we end up in the situation that select_idle_routine() is called from arch/x86/kernel/cpu/common.c and if we called cpuidle_disable(), then the pm_idle is not set, and we end up setting pm_idle to mwait_idle or amd_e400_idle. This patch uses the cpuidle_disabled() functionality and makes select_idle_routine() adhere to that. Reported-by: Stefan Bader <stefan.bader@canonical.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> CC: Thomas Gleixner <tglx@linutronix.de> CC: Ingo Molnar <mingo@redhat.com> CC: "H. Peter Anvin" <hpa@zytor.com> CC: x86@kernel.org CC: Len Brown <len.brown@intel.com> CC: Borislav Petkov <bp@alien8.de> CC: Tejun Heo <tj@kernel.org> CC: Thomas Renninger <trenn@suse.de> CC: stable@kernel.org --- arch/x86/kernel/process.c | 5 +++++ include/linux/cpuidle.h | 2 ++ 2 files changed, 7 insertions(+), 0 deletions(-) diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c index e7e3b01..1f7f8c8 100644 --- a/arch/x86/kernel/process.c +++ b/arch/x86/kernel/process.c @@ -14,6 +14,7 @@ #include <linux/utsname.h> #include <trace/events/power.h> #include <linux/hw_breakpoint.h> +#include <linux/cpuidle.h> #include <asm/cpu.h> #include <asm/system.h> #include <asm/apic.h> @@ -587,6 +588,10 @@ void __cpuinit select_idle_routine(const struct cpuinfo_x86 *c) if (pm_idle) return; + if (cpuidle_disabled()) { + pm_idle = default_idle; + return; + } if (cpu_has(c, X86_FEATURE_MWAIT) && mwait_usable(c)) { /* * One CPU supports mwait => All CPUs supports mwait diff --git a/include/linux/cpuidle.h b/include/linux/cpuidle.h index b51629e..123fe9e 100644 --- a/include/linux/cpuidle.h +++ b/include/linux/cpuidle.h @@ -122,6 +122,7 @@ struct cpuidle_driver { }; #ifdef CONFIG_CPU_IDLE +extern int cpuidle_disabled(void); extern void disable_cpuidle(void); extern int cpuidle_idle_call(void); @@ -137,6 +138,7 @@ extern int cpuidle_enable_device(struct cpuidle_device *dev); extern void cpuidle_disable_device(struct cpuidle_device *dev); #else +static inline int cpuidle_disabled(void) { return 1; } static inline void disable_cpuidle(void) { } static inline int cpuidle_idle_call(void) { return -ENODEV; } -- 1.7.7.1 _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Konrad Rzeszutek Wilk
2011-Nov-08 21:15 UTC
[Xen-devel] [PATCH 2/3] x86/cpa: Use pte_attrs instead of pte_flags on CPA/set_p.._wb/wc operations.
When using the paravirt interface, most of the page operations are wrapped in the pvops interface. The one that is not is the pte_flags. The reason being that for most cases, the "raw" PTE flag values for baremetal and whatever pvops platform is running (in this case) - share the same bit meaning. Except for PAT. Under Linux, the PAT MSR is written to be: PAT4 PAT0 +---+----+----+----+-----+----+----+ WC | WC | WB | UC | UC- | WC | WB | <= Linux +---+----+----+----+-----+----+----+ WC | WT | WB | UC | UC- | WT | WB | <= BIOS +---+----+----+----+-----+----+----+ WC | WP | WC | UC | UC- | WT | WB | <= Xen +---+----+----+----+-----+----+----+ The lookup of this index table translates to looking up Bit 7, Bit 4, and Bit 3 of PTE: PAT/PSE (bit 7) ... PCD (bit 4) .. PWT (bit 3). If all bits are off, then we are using PAT0. If bit 3 turned on, then we are using PAT1, if bit 3 and bit 4, then PAT2.. Back to the PAT MSR table: As you can see, the PAT1 translates to PAT4 under Xen. Under Linux we only use PAT0, PAT1, and PAT2 for the caching as: WB = none (so PAT0) WC = PWT (bit 3 on) UC = PWT | PCD (bit 3 and 4 are on). But to make it work with Xen, we end up doing for WC a translation: PWT (so bit 3 on) --> PAT (so bit 7 is on) and clear bit 3 And to translate back (when the paravirt pte_val is used) we would: PAT (bit 7 on) --> PWT (bit 3 on) and clear bit 7. This works quite well, except if code uses the pte_flags, as pte_flags reads the raw value and does not go through the paravirt. Which means that if (when running under Xen): 1) we allocate some pages. 2) call set_pages_array_wc, which ends up calling: __page_change_att_set_clr(.., __pgprot(__PAGE_WC), /* set */ , __pgprot(__PAGE_MASK), /* clear */ which ends up reading the _raw_ PTE flags and _only_ look at the _PTE_FLAG_MASK contents with __PAGE_MASK cleared (0x18) and __PAGE_WC (0x8) set. read raw *pte -> 0x67 *pte = 0x67 & ^0x18 | 0x8 *pte = 0x67 & 0xfffffe7 | 0x8 *pte = 0x6f [now set_pte_atomic is called, and 0x6f is written in, but under xen_make_pte, the bit 3 is translated to bit 7, so it ends up writting 0xa7, which is correct] 3) do something to them. 4) call set_pages_array_wb __page_change_att_set_clr(.., __pgprot(__PAGE_WB), /* set */ , __pgprot(__PAGE_MASK), /* clear */ which ends up reading the _raw_ PTE and _only_ look at the _PTE_FLAG_MASK contents with _PAGE_MASK cleared (0x18) and __PAGE_WB (0x0) set: read raw *pte -> 0xa7 *pte = 0xa7 & &0x18 | 0 *pte = 0xa7 & 0xfffffe7 | 0 *pte = 0xa7 [we check whether the old PTE is different from the new one if (pte_val(old_pte) != pte_val(new_pte)) { set_pte_atomic(kpte, new_pte); ... and find out that 0xA7 == 0xA7 so we do not write the new PTE value in] End result is that we failed at removing the WC caching bit! 5) free them. [and have pages with PAT4 (bit 7) set, so other subsystems end up using the pages that have the write combined bit set resulting in crashes. Yikes!]. The fix, which this patch proposes, is to wrap the pte_pgprot in the CPA code with newly introduced pte_attrs which can go through the pvops interface to get the "emulated" value instead of the raw. Naturally if CONFIG_PARAVIRT is not set, it would end calling native_pte_val. The other way to fix this is by wrapping pte_flags and go through the pvops interface and it really is the Right Thing to do. The problem is, that past experience with mprotect stuff demonstrates that it be really expensive in inner loops, and pte_flags() is used in some very perf-critical areas. Example code to run this and see the various mysterious subsystems/applications crashing MODULE_AUTHOR("Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>"); MODULE_DESCRIPTION("wb_to_wc_and_back"); MODULE_LICENSE("GPL"); MODULE_VERSION(WB_TO_WC); static int thread(void *arg) { struct page *a[MAX_PAGES]; unsigned int i, j; do { for (j = 0, i = 0;i < MAX_PAGES; i++, j++) { a[i] = alloc_page(GFP_KERNEL); if (!a[i]) break; } set_pages_array_wc(a, j); set_current_state(TASK_INTERRUPTIBLE); schedule_timeout_interruptible(HZ); for (i = 0; i < j; i++) { unsigned long *addr = page_address(a[i]); if (addr) { memset(addr, 0xc2, PAGE_SIZE); } } set_pages_array_wb(a, j); for (i = 0; i< MAX_PAGES; i++) { if (a[i]) __free_page(a[i]); a[i] = NULL; } } while (!kthread_should_stop()); return 0; } static struct task_struct *t; static int __init wb_to_wc_init(void) { t = kthread_run(thread, NULL, "wb_to_wc_and_back"); return 0; } static void __exit wb_to_wc_exit(void) { if (t) kthread_stop(t); } module_init(wb_to_wc_init); module_exit(wb_to_wc_exit); This fixes RH BZ #742032 Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> CC: stable@kernel.org --- arch/x86/include/asm/pgtable.h | 5 +++++ arch/x86/mm/pageattr.c | 2 +- 2 files changed, 6 insertions(+), 1 deletions(-) diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index 18601c8..34027b0 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -349,6 +349,11 @@ static inline pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot) return __pgprot(preservebits | addbits); } +static inline pgprot_t pte_attrs(pte_t pte) +{ + return __pgprot(pte_val(pte) & PTE_FLAGS_MASK); +} + #define pte_pgprot(x) __pgprot(pte_flags(x) & PTE_FLAGS_MASK) #define canon_pgprot(p) __pgprot(massage_pgprot(p)) diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c index f9e5267..efa8040 100644 --- a/arch/x86/mm/pageattr.c +++ b/arch/x86/mm/pageattr.c @@ -651,7 +651,7 @@ repeat: if (level == PG_LEVEL_4K) { pte_t new_pte; - pgprot_t new_prot = pte_pgprot(old_pte); + pgprot_t new_prot = pte_attrs(old_pte); unsigned long pfn = pte_pfn(old_pte); pgprot_val(new_prot) &= ~pgprot_val(cpa->mask_clr); -- 1.7.7.1 _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Konrad Rzeszutek Wilk
2011-Nov-08 21:15 UTC
[Xen-devel] [PATCH 3/3] x86/paravirt: Use pte_val instead of pte_flags on CPA pageattr_test
For details refer to patch "x86/paravirt: Use pte_attrs instead of pte_flags on CPA/set_p.._wb/wc operations." which explains that some pages have the _PAGE_PWT bit set in the _PAGE_PSE field when running under Xen. When pageattr_test is running it uses pte_flags to check whether it succedded in setting _PAGE_UNUSED1 bit, but also whether the page had _PAGE_PSE. This can happen when one of the randomly selected pages to be tested is a page that has been set to be _PAGE_WC as under Xen, that field is under _PAGE_PSE. Since the ''pte_huge'' call is using the pte_flags(x) macro, which extracts the "raw" contents of the PTE, the translation of _PAGE_PSE -> _PAGE_PWT does not happen and we incorrectly identify the PTE as bad. Using the ''pte_val'' instead of ''pte_flags'' fixes the problem and this patch does that. Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> CC: stable@kernel.org --- arch/x86/mm/pageattr-test.c | 6 +++++- 1 files changed, 5 insertions(+), 1 deletions(-) diff --git a/arch/x86/mm/pageattr-test.c b/arch/x86/mm/pageattr-test.c index b008656..da853e3 100644 --- a/arch/x86/mm/pageattr-test.c +++ b/arch/x86/mm/pageattr-test.c @@ -38,6 +38,10 @@ static int pte_testbit(pte_t pte) { return pte_flags(pte) & _PAGE_UNUSED1; } +static int pte_testhuge(pte_t pte) +{ + return pte_val(pte) & _PAGE_PSE; +} struct split_state { long lpg, gpg, spg, exec; @@ -180,7 +184,7 @@ static int pageattr_test(void) } pte = lookup_address(addr[i], &level); - if (!pte || !pte_testbit(*pte) || pte_huge(*pte)) { + if (!pte || !pte_testbit(*pte) || pte_testhuge(*pte)) { printk(KERN_ERR "CPA %lx: bad pte %Lx\n", addr[i], pte ? (u64)pte_val(*pte) : 0ULL); failed++; -- 1.7.7.1 _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Deepthi Dharwar
2011-Nov-09 10:19 UTC
[Xen-devel] Re: [PATCH 1/3] cpuidle: If disable_cpuidle() is called, set pm_idle to default_idle.
On Wednesday 09 November 2011 02:45 AM, Konrad Rzeszutek Wilk wrote:> . however we ignore the disable_cpuidle() directive and > in select_idle_routine() set it to type preferred by the CPU. > > This is a regression introduced by > > commit a0bfa1373859e9d11dc92561a8667588803e42d8 > Author: Len Brown <len.brown@intel.com> > Date: Fri Apr 1 19:34:59 2011 -0400 > > cpuidle: stop depending on pm_idle > > specifically this patch: > > commit d91ee5863b71e8c90eaf6035bff3078a85e2e7b5 > Author: Len Brown <len.brown@intel.com> > Date: Fri Apr 1 18:28:35 2011 -0400 > > cpuidle: replace xen access to x86 pm_idle and default_idle > > ..scribble on pm_idle and access default_idle, > have it simply disable_cpuidle() so acpi_idle will not load and > architecture default HLT will be used. > > .. but the default HLT does not get used. Instead we end up in the > situation that select_idle_routine() is called from arch/x86/kernel/cpu/common.c > and if we called cpuidle_disable(), then the pm_idle is not set, and > we end up setting pm_idle to mwait_idle or amd_e400_idle. > > This patch uses the cpuidle_disabled() functionality and > makes select_idle_routine() adhere to that. > > Reported-by: Stefan Bader <stefan.bader@canonical.com> > Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> > > CC: Thomas Gleixner <tglx@linutronix.de> > CC: Ingo Molnar <mingo@redhat.com> > CC: "H. Peter Anvin" <hpa@zytor.com> > CC: x86@kernel.org > CC: Len Brown <len.brown@intel.com> > CC: Borislav Petkov <bp@alien8.de> > CC: Tejun Heo <tj@kernel.org> > CC: Thomas Renninger <trenn@suse.de> > CC: stable@kernel.org > --- > arch/x86/kernel/process.c | 5 +++++ > include/linux/cpuidle.h | 2 ++ > 2 files changed, 7 insertions(+), 0 deletions(-) > > diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c > index e7e3b01..1f7f8c8 100644 > --- a/arch/x86/kernel/process.c > +++ b/arch/x86/kernel/process.c > @@ -14,6 +14,7 @@ > #include <linux/utsname.h> > #include <trace/events/power.h> > #include <linux/hw_breakpoint.h> > +#include <linux/cpuidle.h> > #include <asm/cpu.h> > #include <asm/system.h> > #include <asm/apic.h> > @@ -587,6 +588,10 @@ void __cpuinit select_idle_routine(const struct cpuinfo_x86 *c) > if (pm_idle) > return; > > + if (cpuidle_disabled()) { > + pm_idle = default_idle; > + return; > + }The above check is needed to initialise pm_idle to default_idle if cpuidle is disabled but this requirement here is Zen specific. On other architectures, if cpuidle is disabled on boot then we rather would want mwait_idle or amd_e400_idle to be enabled than default_idle. Can we make this check Zen specific ? ... if(ZEN_ARCH && cpuidle_disabled()) ...> if (cpu_has(c, X86_FEATURE_MWAIT) && mwait_usable(c)) { > /* > * One CPU supports mwait => All CPUs supports mwait > diff --git a/include/linux/cpuidle.h b/include/linux/cpuidle.h > index b51629e..123fe9e 100644 > --- a/include/linux/cpuidle.h > +++ b/include/linux/cpuidle.h > @@ -122,6 +122,7 @@ struct cpuidle_driver { > }; > > #ifdef CONFIG_CPU_IDLE > +extern int cpuidle_disabled(void); > extern void disable_cpuidle(void); > extern int cpuidle_idle_call(void); > > @@ -137,6 +138,7 @@ extern int cpuidle_enable_device(struct cpuidle_device *dev); > extern void cpuidle_disable_device(struct cpuidle_device *dev); > > #else > +static inline int cpuidle_disabled(void) { return 1; } > static inline void disable_cpuidle(void) { } > static inline int cpuidle_idle_call(void) { return -ENODEV; } >_______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Deepthi Dharwar
2011-Nov-09 10:51 UTC
[Xen-devel] Re: [PATCH 1/3] cpuidle: If disable_cpuidle() is called, set pm_idle to default_idle.
Just correcting the typo s/zen/xen/g . On Wednesday 09 November 2011 03:49 PM, Deepthi Dharwar wrote:> On Wednesday 09 November 2011 02:45 AM, Konrad Rzeszutek Wilk wrote: >> . however we ignore the disable_cpuidle() directive and >> in select_idle_routine() set it to type preferred by the CPU. >> >> This is a regression introduced by >> >> commit a0bfa1373859e9d11dc92561a8667588803e42d8 >> Author: Len Brown <len.brown@intel.com> >> Date: Fri Apr 1 19:34:59 2011 -0400 >> >> cpuidle: stop depending on pm_idle >> >> specifically this patch: >> >> commit d91ee5863b71e8c90eaf6035bff3078a85e2e7b5 >> Author: Len Brown <len.brown@intel.com> >> Date: Fri Apr 1 18:28:35 2011 -0400 >> >> cpuidle: replace xen access to x86 pm_idle and default_idle >> >> ..scribble on pm_idle and access default_idle, >> have it simply disable_cpuidle() so acpi_idle will not load and >> architecture default HLT will be used. >> >> .. but the default HLT does not get used. Instead we end up in the >> situation that select_idle_routine() is called from arch/x86/kernel/cpu/common.c >> and if we called cpuidle_disable(), then the pm_idle is not set, and >> we end up setting pm_idle to mwait_idle or amd_e400_idle. >> >> This patch uses the cpuidle_disabled() functionality and >> makes select_idle_routine() adhere to that. >> >> Reported-by: Stefan Bader <stefan.bader@canonical.com> >> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> >> >> CC: Thomas Gleixner <tglx@linutronix.de> >> CC: Ingo Molnar <mingo@redhat.com> >> CC: "H. Peter Anvin" <hpa@zytor.com> >> CC: x86@kernel.org >> CC: Len Brown <len.brown@intel.com> >> CC: Borislav Petkov <bp@alien8.de> >> CC: Tejun Heo <tj@kernel.org> >> CC: Thomas Renninger <trenn@suse.de> >> CC: stable@kernel.org >> --- >> arch/x86/kernel/process.c | 5 +++++ >> include/linux/cpuidle.h | 2 ++ >> 2 files changed, 7 insertions(+), 0 deletions(-) >> >> diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c >> index e7e3b01..1f7f8c8 100644 >> --- a/arch/x86/kernel/process.c >> +++ b/arch/x86/kernel/process.c >> @@ -14,6 +14,7 @@ >> #include <linux/utsname.h> >> #include <trace/events/power.h> >> #include <linux/hw_breakpoint.h> >> +#include <linux/cpuidle.h> >> #include <asm/cpu.h> >> #include <asm/system.h> >> #include <asm/apic.h> >> @@ -587,6 +588,10 @@ void __cpuinit select_idle_routine(const struct cpuinfo_x86 *c) >> if (pm_idle) >> return; >> >> + if (cpuidle_disabled()) { >> + pm_idle = default_idle; >> + return; >> + } > > > The above check is needed to initialise pm_idle to default_idle if > cpuidle is disabled but this requirement here is xen specific. > On other architectures, if cpuidle is disabled on boot then we > rather would want mwait_idle or amd_e400_idle to be enabled than > default_idle. Can we make this check xen specific ? > > ... if(XEN_ARCH && cpuidle_disabled()) ... > > > > >> if (cpu_has(c, X86_FEATURE_MWAIT) && mwait_usable(c)) { >> /* >> * One CPU supports mwait => All CPUs supports mwait >> diff --git a/include/linux/cpuidle.h b/include/linux/cpuidle.h >> index b51629e..123fe9e 100644 >> --- a/include/linux/cpuidle.h >> +++ b/include/linux/cpuidle.h >> @@ -122,6 +122,7 @@ struct cpuidle_driver { >> }; >> >> #ifdef CONFIG_CPU_IDLE >> +extern int cpuidle_disabled(void); >> extern void disable_cpuidle(void); >> extern int cpuidle_idle_call(void); >> >> @@ -137,6 +138,7 @@ extern int cpuidle_enable_device(struct cpuidle_device *dev); >> extern void cpuidle_disable_device(struct cpuidle_device *dev); >> >> #else >> +static inline int cpuidle_disabled(void) { return 1; } >> static inline void disable_cpuidle(void) { } >> static inline int cpuidle_idle_call(void) { return -ENODEV; } >> >_______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Konrad Rzeszutek Wilk
2011-Nov-09 14:44 UTC
[Xen-devel] Re: [PATCH 1/3] cpuidle: If disable_cpuidle() is called, set pm_idle to default_idle.
. snip..> > > > ..scribble on pm_idle and access default_idle, > > have it simply disable_cpuidle() so acpi_idle will not load and > > architecture default HLT will be used. > > > > .. but the default HLT does not get used. Instead we end up in theHey Deepthi,> > + if (cpuidle_disabled()) { > > + pm_idle = default_idle; > > + return; > > + } > > > The above check is needed to initialise pm_idle to default_idle if > cpuidle is disabled but this requirement here is Zen specific. > On other architectures, if cpuidle is disabled on boot then we > rather would want mwait_idle or amd_e400_idle to be enabled than > default_idle. Can we make this check Zen specific ?We do? Why? I would have thought that if you want to disable the cpuidle you would want the default_idle. Perhaps there is another way do this is: diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c index becd6d9..04b10a4 100644 --- a/drivers/cpuidle/cpuidle.c +++ b/drivers/cpuidle/cpuidle.c @@ -38,6 +38,7 @@ int cpuidle_disabled(void) void disable_cpuidle(void) { off = 1; + pm_idle = default_idle; } which would do almost the same thing as this patch (well, except that if you run cpuidle.off=1 you still end up with amd_e400_idle instead of default_idle, so it is not the complete solution).> > ... if(ZEN_ARCH && cpuidle_disabled()) ...That would not work very well. For one thing you would need to call ''xen_domain()'', and pull in a lots of header files. Second of, it looks quite ugly and kernel folks like pretty code. _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Konrad Rzeszutek Wilk
2011-Nov-10 04:06 UTC
Re: [Xen-devel] [PATCH 1/3] cpuidle: If disable_cpuidle() is called, set pm_idle to default_idle.
On Tue, Nov 08, 2011 at 04:15:12PM -0500, Konrad Rzeszutek Wilk wrote:> . however we ignore the disable_cpuidle() directive and > in select_idle_routine() set it to type preferred by the CPU. > > This is a regression introduced by > > commit a0bfa1373859e9d11dc92561a8667588803e42d8 > Author: Len Brown <len.brown@intel.com> > Date: Fri Apr 1 19:34:59 2011 -0400 > > cpuidle: stop depending on pm_idle > > specifically this patch: > > commit d91ee5863b71e8c90eaf6035bff3078a85e2e7b5 > Author: Len Brown <len.brown@intel.com> > Date: Fri Apr 1 18:28:35 2011 -0400 > > cpuidle: replace xen access to x86 pm_idle and default_idle > > ..scribble on pm_idle and access default_idle, > have it simply disable_cpuidle() so acpi_idle will not load and > architecture default HLT will be used. > > .. but the default HLT does not get used. Instead we end up in the > situation that select_idle_routine() is called from arch/x86/kernel/cpu/common.c > and if we called cpuidle_disable(), then the pm_idle is not set, and > we end up setting pm_idle to mwait_idle or amd_e400_idle. > > This patch uses the cpuidle_disabled() functionality and > makes select_idle_routine() adhere to that. > > Reported-by: Stefan Bader <stefan.bader@canonical.com> > Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>And just found that there is a BZ for this as well: https://bugzilla.redhat.com/show_bug.cgi?id=739499 And this patch fixes the Linux kernel to boot under Amazon EC2. Another option which I played with today was to do this instead: diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c index 46d6d21..45136f2 100644 --- a/arch/x86/xen/setup.c +++ b/arch/x86/xen/setup.c @@ -448,6 +448,6 @@ void __init xen_arch_setup(void) #endif disable_cpuidle(); boot_option_idle_override = IDLE_HALT; - + pm_idle = default_idle; fiddle_vdso(); } But that defeats the whole purpose of the patch series that Len developed where we would _not_ scribble on the pm_idle. <sigh> Thoughts?> > CC: Thomas Gleixner <tglx@linutronix.de> > CC: Ingo Molnar <mingo@redhat.com> > CC: "H. Peter Anvin" <hpa@zytor.com> > CC: x86@kernel.org > CC: Len Brown <len.brown@intel.com> > CC: Borislav Petkov <bp@alien8.de> > CC: Tejun Heo <tj@kernel.org> > CC: Thomas Renninger <trenn@suse.de> > CC: stable@kernel.org > --- > arch/x86/kernel/process.c | 5 +++++ > include/linux/cpuidle.h | 2 ++ > 2 files changed, 7 insertions(+), 0 deletions(-) > > diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c > index e7e3b01..1f7f8c8 100644 > --- a/arch/x86/kernel/process.c > +++ b/arch/x86/kernel/process.c > @@ -14,6 +14,7 @@ > #include <linux/utsname.h> > #include <trace/events/power.h> > #include <linux/hw_breakpoint.h> > +#include <linux/cpuidle.h> > #include <asm/cpu.h> > #include <asm/system.h> > #include <asm/apic.h> > @@ -587,6 +588,10 @@ void __cpuinit select_idle_routine(const struct cpuinfo_x86 *c) > if (pm_idle) > return; > > + if (cpuidle_disabled()) { > + pm_idle = default_idle; > + return; > + } > if (cpu_has(c, X86_FEATURE_MWAIT) && mwait_usable(c)) { > /* > * One CPU supports mwait => All CPUs supports mwait > diff --git a/include/linux/cpuidle.h b/include/linux/cpuidle.h > index b51629e..123fe9e 100644 > --- a/include/linux/cpuidle.h > +++ b/include/linux/cpuidle.h > @@ -122,6 +122,7 @@ struct cpuidle_driver { > }; > > #ifdef CONFIG_CPU_IDLE > +extern int cpuidle_disabled(void); > extern void disable_cpuidle(void); > extern int cpuidle_idle_call(void); > > @@ -137,6 +138,7 @@ extern int cpuidle_enable_device(struct cpuidle_device *dev); > extern void cpuidle_disable_device(struct cpuidle_device *dev); > > #else > +static inline int cpuidle_disabled(void) { return 1; } > static inline void disable_cpuidle(void) { } > static inline int cpuidle_idle_call(void) { return -ENODEV; } > > -- > 1.7.7.1 > > > _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xensource.com > http://lists.xensource.com/xen-devel_______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Deepthi Dharwar
2011-Nov-11 11:41 UTC
[Xen-devel] Re: [PATCH 1/3] cpuidle: If disable_cpuidle() is called, set pm_idle to default_idle.
On Wednesday 09 November 2011 08:14 PM, Konrad Rzeszutek Wilk wrote:> . snip.. >>> >>> ..scribble on pm_idle and access default_idle, >>> have it simply disable_cpuidle() so acpi_idle will not load and >>> architecture default HLT will be used. >>> >>> .. but the default HLT does not get used. Instead we end up in the > > Hey Deepthi, > >>> + if (cpuidle_disabled()) { >>> + pm_idle = default_idle; >>> + return; >>> + } >> >> >> The above check is needed to initialise pm_idle to default_idle if >> cpuidle is disabled but this requirement here is Zen specific. >> On other architectures, if cpuidle is disabled on boot then we >> rather would want mwait_idle or amd_e400_idle to be enabled than >> default_idle. Can we make this check Zen specific ? > > We do? Why? I would have thought that if you want to disable the cpuidle > you would want the default_idle.Well I was with a view that if cpuidle is disabled, mwait and arm_e400 would take precedence over default_idle. But if is ok to have default_idle instead, then go ahead.> Perhaps there is another way do this is: > > diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c > index becd6d9..04b10a4 100644 > --- a/drivers/cpuidle/cpuidle.c > +++ b/drivers/cpuidle/cpuidle.c > @@ -38,6 +38,7 @@ int cpuidle_disabled(void) > void disable_cpuidle(void) > { > off = 1; > + pm_idle = default_idle; > } >Brining pm_idle pointer back to cpuidle code is going a step back coz we wanted to complete remove using pm_idle going forward. As having a pointer exported into various files is not a good thing. That''s the reason we started the clean up in the first place :)> which would do almost the same thing as this patch (well, except > that if you run cpuidle.off=1 you still end up with amd_e400_idle > instead of default_idle, so it is not the complete solution). > >> >> ... if(ZEN_ARCH && cpuidle_disabled()) ... > > That would not work very well. For one thing you would need to call > ''xen_domain()'', and pull in a lots of header files. Second of, it > looks quite ugly and kernel folks like pretty code. >Yes, I agree to this. _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Len Brown
2011-Nov-13 06:00 UTC
Re: [Xen-devel] [PATCH 1/3] cpuidle: If disable_cpuidle() is called, set pm_idle to default_idle.
> And just found that there is a BZ for this as well: > https://bugzilla.redhat.com/show_bug.cgi?id=739499 > > And this patch fixes the Linux kernel to boot under Amazon EC2.doesn''t matter. Working around an Amazon EC2 bug in a newly compiled upstream kernel isn''t going to help with any of the kernels that don''t include that workaround. Amazon EC2 should not advertise MWAIT support that it does not have, and all kernels should run on it w/o any workaround. Len Brown, Intel Open Source Technology Center _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Konrad Rzeszutek Wilk
2011-Nov-14 14:37 UTC
Re: [Xen-devel] [PATCH 1/3] cpuidle: If disable_cpuidle() is called, set pm_idle to default_idle.
On Sun, Nov 13, 2011 at 01:00:15AM -0500, Len Brown wrote:> > And just found that there is a BZ for this as well: > > https://bugzilla.redhat.com/show_bug.cgi?id=739499 > > > > And this patch fixes the Linux kernel to boot under Amazon EC2. > > doesn''t matter. > > Working around an Amazon EC2 bug in a newly compiled upstream kernel > isn''t going to help with any of the kernels that don''t include that workaround.Prior to 3.1 we did have a solution for this - we would use the default_idle and not pick and/choose one based on the CPUID flags. It was not really a choice made by "hypervisor wrongly advertises the MWAIT flag", but rather that we want to use the safe_halt, which ends up calling the yield hypercall. That is the optimal solution irregardless of what version of hypervisor is being used.> > Amazon EC2 should not advertise MWAIT support that it does not have, > and all kernels should run on it w/o any workaround. > > Len Brown, Intel Open Source Technology Center > > _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xensource.com > http://lists.xensource.com/xen-devel_______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Konrad Rzeszutek Wilk
2011-Nov-15 14:40 UTC
Re: [Xen-devel] Re: [PATCH 1/3] cpuidle: If disable_cpuidle() is called, set pm_idle to default_idle.
> Well I was with a view that if cpuidle is disabled, mwait and arm_e400 > would take precedence over default_idle. But if is ok to have default_idle instead, > then go ahead. > > > Perhaps there is another way do this is: > > > > diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c > > index becd6d9..04b10a4 100644 > > --- a/drivers/cpuidle/cpuidle.c > > +++ b/drivers/cpuidle/cpuidle.c > > @@ -38,6 +38,7 @@ int cpuidle_disabled(void) > > void disable_cpuidle(void) > > { > > off = 1; > > + pm_idle = default_idle; > > } > > > Brining pm_idle pointer back to cpuidle code is going a step back coz > we wanted to complete remove using pm_idle going forward. As having > a pointer exported into various files is not a good thing. That''s the > reason we started the clean up in the first place :)Perhaps then something along these lines, where we still set default_idle but don''t fiddle with the pm_idle (and can still rip out the EXPORT_SYMBOL_GPL(default_idle) in 2012): (Hadn''t tested this yet). diff --git a/arch/x86/include/asm/system.h b/arch/x86/include/asm/system.h index c2ff2a1..2d2f01c 100644 --- a/arch/x86/include/asm/system.h +++ b/arch/x86/include/asm/system.h @@ -401,6 +401,7 @@ extern unsigned long arch_align_stack(unsigned long sp); extern void free_init_pages(char *what, unsigned long begin, unsigned long end); void default_idle(void); +bool set_pm_idle_to_default(void); void stop_this_cpu(void *dummy); diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c index e7e3b01..bfb113f 100644 --- a/arch/x86/kernel/process.c +++ b/arch/x86/kernel/process.c @@ -403,6 +403,14 @@ void default_idle(void) EXPORT_SYMBOL(default_idle); #endif +bool set_pm_idle_to_default() +{ + if (!pm_idle) { + pm_idle = default_idle; + return true; + } + return false; +} void stop_this_cpu(void *dummy) { local_irq_disable(); diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c index 46d6d21..7506181 100644 --- a/arch/x86/xen/setup.c +++ b/arch/x86/xen/setup.c @@ -448,6 +448,6 @@ void __init xen_arch_setup(void) #endif disable_cpuidle(); boot_option_idle_override = IDLE_HALT; - + WARN_ON(!set_pm_idle_to_default()); fiddle_vdso(); } _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Konrad Rzeszutek Wilk
2011-Nov-22 13:46 UTC
Re: [Xen-devel] Re: [PATCH 1/3] cpuidle: If disable_cpuidle() is called, set pm_idle to default_idle.
On Tue, Nov 15, 2011 at 09:40:04AM -0500, Konrad Rzeszutek Wilk wrote:> > Well I was with a view that if cpuidle is disabled, mwait and arm_e400 > > would take precedence over default_idle. But if is ok to have default_idle instead, > > then go ahead. > > > > > Perhaps there is another way do this is: > > > > > > diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c > > > index becd6d9..04b10a4 100644 > > > --- a/drivers/cpuidle/cpuidle.c > > > +++ b/drivers/cpuidle/cpuidle.c > > > @@ -38,6 +38,7 @@ int cpuidle_disabled(void) > > > void disable_cpuidle(void) > > > { > > > off = 1; > > > + pm_idle = default_idle; > > > } > > > > > Brining pm_idle pointer back to cpuidle code is going a step back coz > > we wanted to complete remove using pm_idle going forward. As having > > a pointer exported into various files is not a good thing. That''s the > > reason we started the clean up in the first place :) > > Perhaps then something along these lines, where we still set default_idle > but don''t fiddle with the pm_idle (and can still rip out the > EXPORT_SYMBOL_GPL(default_idle) in 2012): > > (Hadn''t tested this yet).Now have tested it.> > diff --git a/arch/x86/include/asm/system.h b/arch/x86/include/asm/system.h > index c2ff2a1..2d2f01c 100644 > --- a/arch/x86/include/asm/system.h > +++ b/arch/x86/include/asm/system.h > @@ -401,6 +401,7 @@ extern unsigned long arch_align_stack(unsigned long sp); > extern void free_init_pages(char *what, unsigned long begin, unsigned long end); > > void default_idle(void); > +bool set_pm_idle_to_default(void); > > void stop_this_cpu(void *dummy); > > diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c > index e7e3b01..bfb113f 100644 > --- a/arch/x86/kernel/process.c > +++ b/arch/x86/kernel/process.c > @@ -403,6 +403,14 @@ void default_idle(void) > EXPORT_SYMBOL(default_idle); > #endif > > +bool set_pm_idle_to_default() > +{ > + if (!pm_idle) { > + pm_idle = default_idle; > + return true; > + } > + return false; > +} > void stop_this_cpu(void *dummy) > { > local_irq_disable(); > diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c > index 46d6d21..7506181 100644 > --- a/arch/x86/xen/setup.c > +++ b/arch/x86/xen/setup.c > @@ -448,6 +448,6 @@ void __init xen_arch_setup(void) > #endif > disable_cpuidle(); > boot_option_idle_override = IDLE_HALT; > - > + WARN_ON(!set_pm_idle_to_default()); > fiddle_vdso(); > } > > _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xensource.com > http://lists.xensource.com/xen-devel
Deepthi Dharwar
2011-Nov-23 11:06 UTC
Re: [PATCH 1/3] cpuidle: If disable_cpuidle() is called, set pm_idle to default_idle.
Hi Konrad, On 11/22/2011 07:16 PM, Konrad Rzeszutek Wilk wrote:> On Tue, Nov 15, 2011 at 09:40:04AM -0500, Konrad Rzeszutek Wilk wrote: >>> Well I was with a view that if cpuidle is disabled, mwait and arm_e400 >>> would take precedence over default_idle. But if is ok to have default_idle instead, >>> then go ahead. >>> >>>> Perhaps there is another way do this is: >>>> >>>> diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c >>>> index becd6d9..04b10a4 100644 >>>> --- a/drivers/cpuidle/cpuidle.c >>>> +++ b/drivers/cpuidle/cpuidle.c >>>> @@ -38,6 +38,7 @@ int cpuidle_disabled(void) >>>> void disable_cpuidle(void) >>>> { >>>> off = 1; >>>> + pm_idle = default_idle; >>>> } >>>> >>> Brining pm_idle pointer back to cpuidle code is going a step back coz >>> we wanted to complete remove using pm_idle going forward. As having >>> a pointer exported into various files is not a good thing. That''s the >>> reason we started the clean up in the first place :) >> >> Perhaps then something along these lines, where we still set default_idle >> but don''t fiddle with the pm_idle (and can still rip out the >> EXPORT_SYMBOL_GPL(default_idle) in 2012): >> >> (Hadn''t tested this yet). > > Now have tested it. >As long as you dont bring back exporting pm_idle and using it in cpuidle code it should be ok. So one should not use pm_idle in disable_cpuidle function.>> >> diff --git a/arch/x86/include/asm/system.h b/arch/x86/include/asm/system.h >> index c2ff2a1..2d2f01c 100644 >> --- a/arch/x86/include/asm/system.h >> +++ b/arch/x86/include/asm/system.h >> @@ -401,6 +401,7 @@ extern unsigned long arch_align_stack(unsigned long sp); >> extern void free_init_pages(char *what, unsigned long begin, unsigned long end); >> >> void default_idle(void); >> +bool set_pm_idle_to_default(void); >> >> void stop_this_cpu(void *dummy); >> >> diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c >> index e7e3b01..bfb113f 100644 >> --- a/arch/x86/kernel/process.c >> +++ b/arch/x86/kernel/process.c >> @@ -403,6 +403,14 @@ void default_idle(void) >> EXPORT_SYMBOL(default_idle); >> #endif >> >> +bool set_pm_idle_to_default() >> +{ >> + if (!pm_idle) { >> + pm_idle = default_idle; >> + return true; >> + } >> + return false; >> +} >> void stop_this_cpu(void *dummy) >> { >> local_irq_disable(); >> diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c >> index 46d6d21..7506181 100644 >> --- a/arch/x86/xen/setup.c >> +++ b/arch/x86/xen/setup.c >> @@ -448,6 +448,6 @@ void __init xen_arch_setup(void) >> #endif >> disable_cpuidle(); >> boot_option_idle_override = IDLE_HALT; >> - >> + WARN_ON(!set_pm_idle_to_default()); >> fiddle_vdso(); >> } >>This looks good, where we see setting of default_idle only in Xen setup code and not the in the generic cpuidle code path. This would allow other architectures to use mwait or arm_e400 to be enabled when cpuidle is disabled.>> _______________________________________________ >> Xen-devel mailing list >> Xen-devel@lists.xensource.com >> http://lists.xensource.com/xen-devel >Cheers, Deepthi _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Konrad Rzeszutek Wilk
2011-Dec-02 23:31 UTC
Re: [PATCH 2/3] x86/cpa: Use pte_attrs instead of pte_flags on CPA/set_p.._wb/wc operations.
> The fix, which this patch proposes, is to wrap the pte_pgprot in the CPA > code with newly introduced pte_attrs which can go through the pvops interface > to get the "emulated" value instead of the raw. Naturally if CONFIG_PARAVIRT is > not set, it would end calling native_pte_val. > > The other way to fix this is by wrapping pte_flags and go through the pvops > interface and it really is the Right Thing to do. The problem is, that past > experience with mprotect stuff demonstrates that it be really expensive in inner > loops, and pte_flags() is used in some very perf-critical areas.I did not get to verify the mprotect stuff as I need to chase down the details of it, but I did run some benchmarks using kernbench on three different boxes: AMD A8-3850 (8GB) - tst005 Intel i3-2100 (8GB) - tst007 Nehelem EX (32logical cpus) (32GB) - tst010 I''ve put all the kernebench results in https://www.dumpdata.com/results/baseline_pte_flags_pte_attrs/ (and the chart for the AMD is attached). The boxes have a fresh install of F16, with a 3.2-rc3 variant kernel using the .config that F16 came with. I just hit Enter when oldconfig asked me to choose. The baseline is virgin v3.2-rc3. The pte_attrs is the patch that this email is replaying too (on top of v3.2-rc3). The pte_flags are two patches that wrap pte_flags in paravirt and use alternative_asm to patch the code (on top of v3.2-rc3). The patches are in the URL mentioned or in my git branch as devel/pte_attrs.v1 ( git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen.git). I am also attaching them in this email. The summary is that I could only get the numbers to show some difference when the maximum load was run - and _only_ on the AMD machine. The small SandyBridge and the big SandyBridge had no trouble with. The AMD machine the difference was 13% worst if pte_flags (so alternative_asm) was used instead of pte_attrs. The way I did these tests is to bootup with ''init=/bin/bash'', remount / as rw, activate swap disk and run kernbench on the v3.2-rc3 linux tree. Then unplug the machine for a tea break and then repeat the cycle with a different kernel.