Andre Przywara
2012-May-22 16:07 UTC
[PATCH] RFC: Linux: disable APERF/MPERF feature in PV kernels
Hi, while testing some APERF/MPERF semantics I discovered that this feature is enabled in Xen Dom0, but is not reliable. The Linux kernel''s scheduler uses this feature if it sees the CPUID bit, leading to costly RDMSR traps (a few 100,000s during a kernel compile) and bogus values due to VCPU migration during the measurement. The attached patch explicitly disables this CPU capability inside the Linux kernel, I couldn''t measure any APERF/MPERF reads anymore with the patch applied. I am not sure if the PVOPS code is the right place to fix this, we could as well do it in the HV''s xen/arch/x86/traps.c:pv_cpuid(). Also when the Dom0 VCPUs are pinned, we could allow this, but I am not sure if it''s worth to do so. Awaiting your comments. Regards, Andre. P.S. Of course this doesn''t fix pure userland software like cpupower, but I would consider this in the user''s responsibility to not use these tools in Dom0, but instead use xenpm. -- Andre Przywara AMD-Operating System Research Center (OSRC), Dresden, Germany _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Jeremy Fitzhardinge
2012-May-22 16:52 UTC
Re: [PATCH] RFC: Linux: disable APERF/MPERF feature in PV kernels
On 05/22/2012 09:07 AM, Andre Przywara wrote:> Hi, > > while testing some APERF/MPERF semantics I discovered that this > feature is enabled in Xen Dom0, but is not reliable. > The Linux kernel''s scheduler uses this feature if it sees the CPUID > bit, leading to costly RDMSR traps (a few 100,000s during a kernel > compile) and bogus values due to VCPU migration during the measurement. > The attached patch explicitly disables this CPU capability inside the > Linux kernel, I couldn''t measure any APERF/MPERF reads anymore with > the patch applied. > I am not sure if the PVOPS code is the right place to fix this, we > could as well do it in the HV''s xen/arch/x86/traps.c:pv_cpuid(). > Also when the Dom0 VCPUs are pinned, we could allow this, but I am not > sure if it''s worth to do so.Seems reasonable to me. Do all those RDMSR traps have a measurable performance effect? Also, is there a symbolic constant for that bit? J
Malcolm Crossley
2012-May-22 17:08 UTC
Re: [PATCH] RFC: Linux: disable APERF/MPERF feature in PV kernels
On 22/05/12 17:52, Jeremy Fitzhardinge wrote:> On 05/22/2012 09:07 AM, Andre Przywara wrote: >> Hi, >> >> while testing some APERF/MPERF semantics I discovered that this >> feature is enabled in Xen Dom0, but is not reliable. >> The Linux kernel''s scheduler uses this feature if it sees the CPUID >> bit, leading to costly RDMSR traps (a few 100,000s during a kernel >> compile) and bogus values due to VCPU migration during the measurement. >> The attached patch explicitly disables this CPU capability inside the >> Linux kernel, I couldn''t measure any APERF/MPERF reads anymore with >> the patch applied. >> I am not sure if the PVOPS code is the right place to fix this, we >> could as well do it in the HV''s xen/arch/x86/traps.c:pv_cpuid(). >> Also when the Dom0 VCPUs are pinned, we could allow this, but I am not >> sure if it''s worth to do so. > Seems reasonable to me. Do all those RDMSR traps have a measurable > performance effect? > > Also, is there a symbolic constant for that bit? > > J > > _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xen.org > http://lists.xen.org/xen-develHi, I''ve attached a patch which masks the matching CPUID leaves in the Xen pv_cpuid function. Should the logic in pv_cpuid be changed to only pass through explictly allowed CPUID leaves rather than masking them using case statements? Malcolm _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Konrad Rzeszutek Wilk
2012-May-22 17:18 UTC
Re: [PATCH] RFC: Linux: disable APERF/MPERF feature in PV kernels
On Tue, May 22, 2012 at 06:07:11PM +0200, Andre Przywara wrote:> Hi, > > while testing some APERF/MPERF semantics I discovered that this > feature is enabled in Xen Dom0, but is not reliable. > The Linux kernel''s scheduler uses this feature if it sees the CPUID > bit, leading to costly RDMSR traps (a few 100,000s during a kernel > compile) and bogus values due to VCPU migration during theCan you point me to the Linux scheduler code that does this? Thanks.> measurement. > The attached patch explicitly disables this CPU capability inside > the Linux kernel, I couldn''t measure any APERF/MPERF reads anymore > with the patch applied. > I am not sure if the PVOPS code is the right place to fix this, we > could as well do it in the HV''s xen/arch/x86/traps.c:pv_cpuid(). > Also when the Dom0 VCPUs are pinned, we could allow this, but I am > not sure if it''s worth to do so. > > Awaiting your comments. > > Regards, > Andre. > > P.S. Of course this doesn''t fix pure userland software like > cpupower, but I would consider this in the user''s responsibility toWhich would not work anymore as the cpufreq support is disabled when it boots under Xen.> not use these tools in Dom0, but instead use xenpm. > > -- > Andre Przywara > AMD-Operating System Research Center (OSRC), Dresden, Germany> commit e802e47d85314b4541288e4a19d057e2ea885a28 > Author: Andre Przywara <andre.przywara@amd.com> > Date: Tue May 22 15:13:07 2012 +0200 > > filter APERFMPERF feature in Xen to avoid kernel internal usage > > Signed-off-by: Andre Przywara <andre.przywara@amd.com> > > diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c > index 95dccce..71252d5 100644 > --- a/arch/x86/xen/enlighten.c > +++ b/arch/x86/xen/enlighten.c > @@ -240,6 +240,11 @@ static void xen_cpuid(unsigned int *ax, unsigned int *bx, > *dx = cpuid_leaf5_edx_val; > return; > > + case 6: > + /* Disabling APERFMPERF for kernel usage */ > + maskecx = ~(1U << 0); > + break; > + > case 0xb: > /* Suppress extended topology stuff */ > maskebx = 0;> _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xen.org > http://lists.xen.org/xen-devel
Andre Przywara
2012-May-22 20:46 UTC
Re: [PATCH] RFC: Linux: disable APERF/MPERF feature in PV kernels
On 05/22/2012 06:52 PM, Jeremy Fitzhardinge wrote:> On 05/22/2012 09:07 AM, Andre Przywara wrote: >> Hi, >> >> while testing some APERF/MPERF semantics I discovered that this >> feature is enabled in Xen Dom0, but is not reliable. >> The Linux kernel''s scheduler uses this feature if it sees the CPUID >> bit, leading to costly RDMSR traps (a few 100,000s during a kernel >> compile) and bogus values due to VCPU migration during the measurement. >> The attached patch explicitly disables this CPU capability inside the >> Linux kernel, I couldn''t measure any APERF/MPERF reads anymore with >> the patch applied. >> I am not sure if the PVOPS code is the right place to fix this, we >> could as well do it in the HV''s xen/arch/x86/traps.c:pv_cpuid(). >> Also when the Dom0 VCPUs are pinned, we could allow this, but I am not >> sure if it''s worth to do so. > > Seems reasonable to me. Do all those RDMSR traps have a measurable > performance effect?I haven''t tested this. I was more concerned about the invalid readouts from the MSRs from Dom0 side. One of our tests complained that the second of two consecutive APERF reads was actually lower than the first, violating the strict monotony requirement. Even if these negative values would be rare, I assume the differences are somewhat random if the readouts happen to come from different pCPUs. Start "watch cpupower monitor" and compile a kernel in Dom0 to see what I mean...> > Also, is there a symbolic constant for that bit?APERFMPERF is one of the scattered bits (arch/x86/kernel/cpu/scattered.c), so the existing constant uses the artificial Linux numbering. But right, for the next version I''d create a symbolic constant. Regards, Andre. -- Andre Przywara AMD-OSRC (Dresden) Tel: x29712
Konrad Rzeszutek Wilk
2012-May-22 21:00 UTC
Re: [PATCH] RFC: Linux: disable APERF/MPERF feature in PV kernels
On Tue, May 22, 2012 at 11:02:01PM +0200, Andre Przywara wrote:> On 05/22/2012 07:18 PM, Konrad Rzeszutek Wilk wrote: > >On Tue, May 22, 2012 at 06:07:11PM +0200, Andre Przywara wrote: > >>Hi, > >> > >>while testing some APERF/MPERF semantics I discovered that this > >>feature is enabled in Xen Dom0, but is not reliable. > >>The Linux kernel''s scheduler uses this feature if it sees the CPUID > >>bit, leading to costly RDMSR traps (a few 100,000s during a kernel > >>compile) and bogus values due to VCPU migration during the > > > >Can you point me to the Linux scheduler code that does this? Thanks. > > arch/x86/kernel/cpu/sched.c contains code to read out and compute > APERF/MPERF registers. I added a Xen debug-key to dump a usage > counter added in traps.c and thus could prove that it is actually > the kernel that accesses these registers. > As far as I understood this the idea is to learn about boosting and > down-clocking (P-states) to get a fairer view on the actual > computing time a process consumed.Looks like its looking for this: X86_FEATURE_APERFMPERF Perhaps masking that should do it? Something along this in enlighten.c: cpuid_leaf1_edx_mask ~((1 << X86_FEATURE_MCE) | /* disable MCE */ (1 << X86_FEATURE_MCA) | /* disable MCA */ (1 << X86_FEATURE_MTRR) | /* disable MTRR */ (1 << X86_FEATURE_ACC)); /* thermal monitoring would be more appropiate? Or is that attribute on a different leaf?> > >>measurement. > >>The attached patch explicitly disables this CPU capability inside > >>the Linux kernel, I couldn''t measure any APERF/MPERF reads anymore > >>with the patch applied. > >>I am not sure if the PVOPS code is the right place to fix this, we > >>could as well do it in the HV''s xen/arch/x86/traps.c:pv_cpuid(). > >>Also when the Dom0 VCPUs are pinned, we could allow this, but I am > >>not sure if it''s worth to do so. > >> > >>Awaiting your comments. > >> > >>Regards, > >>Andre. > >> > >>P.S. Of course this doesn''t fix pure userland software like > >>cpupower, but I would consider this in the user''s responsibility to > > > >Which would not work anymore as the cpufreq support is disabled > >when it boots under Xen. > > Do you mean with "anymore" in a future kernel? I tested this on > 3.4.0 and cpupower monitor worked fine. Right, cpufreq is not > enabled, but cpupower uses the /dev/cpu/<n>/msr device file to > directly read the MSRs. So I get this output if run on an idle Dom0:Ahh. Neat. Will have to play with that.
Andre Przywara
2012-May-22 21:02 UTC
Re: [PATCH] RFC: Linux: disable APERF/MPERF feature in PV kernels
On 05/22/2012 07:18 PM, Konrad Rzeszutek Wilk wrote:> On Tue, May 22, 2012 at 06:07:11PM +0200, Andre Przywara wrote: >> Hi, >> >> while testing some APERF/MPERF semantics I discovered that this >> feature is enabled in Xen Dom0, but is not reliable. >> The Linux kernel''s scheduler uses this feature if it sees the CPUID >> bit, leading to costly RDMSR traps (a few 100,000s during a kernel >> compile) and bogus values due to VCPU migration during the > > Can you point me to the Linux scheduler code that does this? Thanks.arch/x86/kernel/cpu/sched.c contains code to read out and compute APERF/MPERF registers. I added a Xen debug-key to dump a usage counter added in traps.c and thus could prove that it is actually the kernel that accesses these registers. As far as I understood this the idea is to learn about boosting and down-clocking (P-states) to get a fairer view on the actual computing time a process consumed.>> measurement. >> The attached patch explicitly disables this CPU capability inside >> the Linux kernel, I couldn''t measure any APERF/MPERF reads anymore >> with the patch applied. >> I am not sure if the PVOPS code is the right place to fix this, we >> could as well do it in the HV''s xen/arch/x86/traps.c:pv_cpuid(). >> Also when the Dom0 VCPUs are pinned, we could allow this, but I am >> not sure if it''s worth to do so. >> >> Awaiting your comments. >> >> Regards, >> Andre. >> >> P.S. Of course this doesn''t fix pure userland software like >> cpupower, but I would consider this in the user''s responsibility to > > Which would not work anymore as the cpufreq support is disabled > when it boots under Xen.Do you mean with "anymore" in a future kernel? I tested this on 3.4.0 and cpupower monitor worked fine. Right, cpufreq is not enabled, but cpupower uses the /dev/cpu/<n>/msr device file to directly read the MSRs. So I get this output if run on an idle Dom0: |Mperf CPU | C0 | Cx | Freq 0| 0.10| 99.90| 3473 .... Regards, Andre.> >> not use these tools in Dom0, but instead use xenpm. >> >> -- >> Andre Przywara >> AMD-Operating System Research Center (OSRC), Dresden, Germany > >> commit e802e47d85314b4541288e4a19d057e2ea885a28 >> Author: Andre Przywara<andre.przywara@amd.com> >> Date: Tue May 22 15:13:07 2012 +0200 >> >> filter APERFMPERF feature in Xen to avoid kernel internal usage >> >> Signed-off-by: Andre Przywara<andre.przywara@amd.com> >> >> diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c >> index 95dccce..71252d5 100644 >> --- a/arch/x86/xen/enlighten.c >> +++ b/arch/x86/xen/enlighten.c >> @@ -240,6 +240,11 @@ static void xen_cpuid(unsigned int *ax, unsigned int *bx, >> *dx = cpuid_leaf5_edx_val; >> return; >> >> + case 6: >> + /* Disabling APERFMPERF for kernel usage */ >> + maskecx = ~(1U<< 0); >> + break; >> + >> case 0xb: >> /* Suppress extended topology stuff */ >> maskebx = 0; > >> _______________________________________________ >> Xen-devel mailing list >> Xen-devel@lists.xen.org >> http://lists.xen.org/xen-devel > >-- Andre Przywara AMD-Operating System Research Center (OSRC), Dresden, Germany Tel: +49 351 448-3567-12
Andre Przywara
2012-May-22 22:44 UTC
Re: [PATCH] RFC: Linux: disable APERF/MPERF feature in PV kernels
On 05/22/2012 11:00 PM, Konrad Rzeszutek Wilk wrote:> On Tue, May 22, 2012 at 11:02:01PM +0200, Andre Przywara wrote: >> On 05/22/2012 07:18 PM, Konrad Rzeszutek Wilk wrote: >>> On Tue, May 22, 2012 at 06:07:11PM +0200, Andre Przywara wrote: >>>> Hi, >>>> >>>> while testing some APERF/MPERF semantics I discovered that this >>>> feature is enabled in Xen Dom0, but is not reliable. >>>> The Linux kernel''s scheduler uses this feature if it sees the CPUID >>>> bit, leading to costly RDMSR traps (a few 100,000s during a kernel >>>> compile) and bogus values due to VCPU migration during the >>> >>> Can you point me to the Linux scheduler code that does this? Thanks. >> >> arch/x86/kernel/cpu/sched.c contains code to read out and compute >> APERF/MPERF registers. I added a Xen debug-key to dump a usage >> counter added in traps.c and thus could prove that it is actually >> the kernel that accesses these registers. >> As far as I understood this the idea is to learn about boosting and >> down-clocking (P-states) to get a fairer view on the actual >> computing time a process consumed. > > Looks like its looking for this: > > X86_FEATURE_APERFMPERF > > Perhaps masking that should do it? Something along this in enlighten.c: > > cpuid_leaf1_edx_mask > ~((1<< X86_FEATURE_MCE) | /* disable MCE */ > (1<< X86_FEATURE_MCA) | /* disable MCA */ > (1<< X86_FEATURE_MTRR) | /* disable MTRR */ > (1<< X86_FEATURE_ACC)); /* thermal monitoring > > would be more appropiate? > > Or is that attribute on a different leaf?Right, it is bit 0 on level 6. That''s why I couldn''t use any of the predefined masks and I didn''t feel like inventing a new one just for this single bit. We could as well explicitly use clear_cpu_cap somewhere, but I didn''t find any code place in the Xen tree already doing this, instead it looks like it belongs to where I put it (we handle leaf 5 in a special way already here)>> >>>> measurement. >>>> The attached patch explicitly disables this CPU capability inside >>>> the Linux kernel, I couldn''t measure any APERF/MPERF reads anymore >>>> with the patch applied. >>>> I am not sure if the PVOPS code is the right place to fix this, we >>>> could as well do it in the HV''s xen/arch/x86/traps.c:pv_cpuid(). >>>> Also when the Dom0 VCPUs are pinned, we could allow this, but I am >>>> not sure if it''s worth to do so. >>>> >>>> Awaiting your comments. >>>> >>>> Regards, >>>> Andre. >>>> >>>> P.S. Of course this doesn''t fix pure userland software like >>>> cpupower, but I would consider this in the user''s responsibility to >>> >>> Which would not work anymore as the cpufreq support is disabled >>> when it boots under Xen. >> >> Do you mean with "anymore" in a future kernel? I tested this on >> 3.4.0 and cpupower monitor worked fine. Right, cpufreq is not >> enabled, but cpupower uses the /dev/cpu/<n>/msr device file to >> directly read the MSRs. So I get this output if run on an idle Dom0: > > Ahh. Neat. Will have to play with that.Bad news is we cannot forbid cpupower querying the feature directly using the CPUID instruction in PV guests. Only we could patch it to use /proc/cpuinfo readout instead, as this reflects the kernel view of available features. With my patch aperfmperf is no longer there. Regards, Andre. -- Andre Przywara AMD-OSRC (Dresden) Tel: x29712
Jan Beulich
2012-May-23 07:34 UTC
Re: [PATCH] RFC: Linux: disable APERF/MPERF feature in PV kernels
>>> On 22.05.12 at 18:07, Andre Przywara <andre.przywara@amd.com> wrote: > while testing some APERF/MPERF semantics I discovered that this feature > is enabled in Xen Dom0, but is not reliable. > The Linux kernel''s scheduler uses this feature if it sees the CPUID bit, > leading to costly RDMSR traps (a few 100,000s during a kernel compile) > and bogus values due to VCPU migration during the measurement. > The attached patch explicitly disables this CPU capability inside the > Linux kernel, I couldn''t measure any APERF/MPERF reads anymore with the > patch applied. > I am not sure if the PVOPS code is the right place to fix this, we could > as well do it in the HV''s xen/arch/x86/traps.c:pv_cpuid(). > Also when the Dom0 VCPUs are pinned, we could allow this, but I am not > sure if it''s worth to do so. > > Awaiting your comments.First of all I''m of the opinion that this indeed should not be masked in the hypervisor - there''s no reason to disallow the guest to read these registers (but we should of course deny writes as long as Xen is controlling P-states, which we do). Next I''d like to note that in our kernels we simply don''t build arch/x86/kernel/cpu/sched.o. Together with CPU_FREQ being suppressed, there''s no consumer of the feature flag in our kernels. So I would think that your suggested change is appropriate, but I''m adding Konrad to Cc as these days he''s the one to pick this up. Jan
Jan Beulich
2012-May-23 08:10 UTC
Re: [PATCH] RFC: Linux: disable APERF/MPERF feature in PV kernels
>>> On 22.05.12 at 19:08, Malcolm Crossley <malcolm.crossley@citrix.com> wrote: > On 22/05/12 17:52, Jeremy Fitzhardinge wrote: >> On 05/22/2012 09:07 AM, Andre Przywara wrote: >>> Hi, >>> >>> while testing some APERF/MPERF semantics I discovered that this >>> feature is enabled in Xen Dom0, but is not reliable. >>> The Linux kernel''s scheduler uses this feature if it sees the CPUID >>> bit, leading to costly RDMSR traps (a few 100,000s during a kernel >>> compile) and bogus values due to VCPU migration during the measurement. >>> The attached patch explicitly disables this CPU capability inside the >>> Linux kernel, I couldn''t measure any APERF/MPERF reads anymore with >>> the patch applied. >>> I am not sure if the PVOPS code is the right place to fix this, we >>> could as well do it in the HV''s xen/arch/x86/traps.c:pv_cpuid(). >>> Also when the Dom0 VCPUs are pinned, we could allow this, but I am not >>> sure if it''s worth to do so. >> Seems reasonable to me. Do all those RDMSR traps have a measurable >> performance effect? >> >> Also, is there a symbolic constant for that bit? >> >> J >> >> _______________________________________________ >> Xen-devel mailing list >> Xen-devel@lists.xen.org >> http://lists.xen.org/xen-devel > Hi, > > I''ve attached a patch which masks the matching CPUID leaves in the Xen > pv_cpuid function. > Should the logic in pv_cpuid be changed to only pass through explictly > allowed CPUID leaves rather than masking > them using case statements? > > MalcolmAs said in another reply, I don''t think we should mask the feature in the hypervisor (and certainly not when "cpufreq=dom0-kernel"). Furthermore your patch does this only for Dom0. Jan
Andre Przywara
2012-May-23 09:14 UTC
Re: [PATCH] RFC: Linux: disable APERF/MPERF feature in PV kernels
On 05/23/2012 09:34 AM, Jan Beulich wrote:>>>> On 22.05.12 at 18:07, Andre Przywara<andre.przywara@amd.com> wrote: >> while testing some APERF/MPERF semantics I discovered that this feature >> is enabled in Xen Dom0, but is not reliable. >> The Linux kernel''s scheduler uses this feature if it sees the CPUID bit, >> leading to costly RDMSR traps (a few 100,000s during a kernel compile) >> and bogus values due to VCPU migration during the measurement. >> The attached patch explicitly disables this CPU capability inside the >> Linux kernel, I couldn''t measure any APERF/MPERF reads anymore with the >> patch applied. >> I am not sure if the PVOPS code is the right place to fix this, we could >> as well do it in the HV''s xen/arch/x86/traps.c:pv_cpuid(). >> Also when the Dom0 VCPUs are pinned, we could allow this, but I am not >> sure if it''s worth to do so. >> >> Awaiting your comments. > > First of all I''m of the opinion that this indeed should not be > masked in the hypervisor - there''s no reason to disallow the > guest to read these registers (but we should of course deny > writes as long as Xen is controlling P-states, which we do).Ok. Thanks for the acknowledgment.> Next I''d like to note that in our kernels we simply don''t build > arch/x86/kernel/cpu/sched.o. Together with CPU_FREQ being > suppressed, there''s no consumer of the feature flag in our > kernels.With "our kernels" you mean OpenSuSE/SLES kernels? I quickly checked upstream as well as the repos on kernel.opensuse.org. In all of them sched.o is unconditionally included in the Makefile. So is there a build patch to exclude this file for builds of distro Xen kernels? Regards, Andre.> So I would think that your suggested change is appropriate, > but I''m adding Konrad to Cc as these days he''s the one to pick > this up. > > Jan-- Andre Przywara AMD-Operating System Research Center (OSRC), Dresden, Germany
Jan Beulich
2012-May-23 09:43 UTC
Re: [PATCH] RFC: Linux: disable APERF/MPERF feature in PV kernels
>>> On 23.05.12 at 11:14, Andre Przywara <andre.przywara@amd.com> wrote: > On 05/23/2012 09:34 AM, Jan Beulich wrote: >> Next I''d like to note that in our kernels we simply don''t build >> arch/x86/kernel/cpu/sched.o. Together with CPU_FREQ being >> suppressed, there''s no consumer of the feature flag in our >> kernels. > > With "our kernels" you mean OpenSuSE/SLES kernels? I quickly checked > upstream as well as the repos on kernel.opensuse.org. In all of them > sched.o is unconditionally included in the Makefile. > So is there a build patch to exclude this file for builds of distro Xen > kernels?Did you perhaps overlook disabled-obj-$(CONFIG_XEN) := hypervisor.o mshyperv.o perfctr-watchdog.o \ perf_event.o sched.o vmware.o in that very Makefile? Jan
Andre Przywara
2012-May-23 09:52 UTC
Re: [PATCH] RFC: Linux: disable APERF/MPERF feature in PV kernels
On 05/23/2012 11:43 AM, Jan Beulich wrote:>>>> On 23.05.12 at 11:14, Andre Przywara<andre.przywara@amd.com> wrote: >> On 05/23/2012 09:34 AM, Jan Beulich wrote: >>> Next I''d like to note that in our kernels we simply don''t build >>> arch/x86/kernel/cpu/sched.o. Together with CPU_FREQ being >>> suppressed, there''s no consumer of the feature flag in our >>> kernels. >> >> With "our kernels" you mean OpenSuSE/SLES kernels? I quickly checked >> upstream as well as the repos on kernel.opensuse.org. In all of them >> sched.o is unconditionally included in the Makefile. >> So is there a build patch to exclude this file for builds of distro Xen >> kernels? > > Did you perhaps overlook > > disabled-obj-$(CONFIG_XEN) := hypervisor.o mshyperv.o perfctr-watchdog.o \ > perf_event.o sched.o vmware.o > > in that very Makefile?No, I just utterly ignored it ;-) Thanks for the hint. But this works only for SuSE since you build different kernels for Xen and without Xen, right? Starting with 3.x I now have only a single kernel image which I use for both native and as Dom0. So the patch to disable APERFMPERF is still useful, at least for upstream, right? Regards, Andre. -- Andre Przywara AMD-Operating System Research Center (OSRC), Dresden, Germany
Jan Beulich
2012-May-23 10:01 UTC
Re: [PATCH] RFC: Linux: disable APERF/MPERF feature in PV kernels
>>> On 23.05.12 at 11:52, Andre Przywara <andre.przywara@amd.com> wrote: > But this works only for SuSE since you build different kernels for Xen > and without Xen, right? > Starting with 3.x I now have only a single kernel image which I use for > both native and as Dom0. > > So the patch to disable APERFMPERF is still useful, at least for > upstream, right?Oh yes, absolutely. I just wanted to show (not the least to myself) that we''re already taking care of this issue. Jan
Andrew Cooper
2012-May-23 11:11 UTC
Re: [PATCH] RFC: Linux: disable APERF/MPERF feature in PV kernels
On 23/05/12 08:34, Jan Beulich wrote:>>>> On 22.05.12 at 18:07, Andre Przywara <andre.przywara@amd.com> wrote: >> while testing some APERF/MPERF semantics I discovered that this feature >> is enabled in Xen Dom0, but is not reliable. >> The Linux kernel''s scheduler uses this feature if it sees the CPUID bit, >> leading to costly RDMSR traps (a few 100,000s during a kernel compile) >> and bogus values due to VCPU migration during the measurement. >> The attached patch explicitly disables this CPU capability inside the >> Linux kernel, I couldn''t measure any APERF/MPERF reads anymore with the >> patch applied. >> I am not sure if the PVOPS code is the right place to fix this, we could >> as well do it in the HV''s xen/arch/x86/traps.c:pv_cpuid(). >> Also when the Dom0 VCPUs are pinned, we could allow this, but I am not >> sure if it''s worth to do so. >> >> Awaiting your comments. > First of all I''m of the opinion that this indeed should not be > masked in the hypervisor - there''s no reason to disallow the > guest to read these registers (but we should of course deny > writes as long as Xen is controlling P-states, which we do).I am sorry but I am going to have to disagree with you on this point. We should not be advertising this feature to any guest at all if we can''t provide an implementation which works as native expects. Else we are failing in our job of virtualisation. There is ''dom0_vcpus_pin''[1] which identity pins dom0 vcpus, and prevents update of the affinity masks, and appears to conditionally allow access to certain MSRs. I think it would be fine to expose this feature iff dom0s vcpus are pinned in this fashion. That way, the measurement should succeed, even if dom0 only has read access to the MSRs. The same logic applies to domU guests, although there is currently no way I can see to get domU domains into a state where advertising it would be safe. ~Andrew [1] I am however about to submit a patch (for inclusion after the feature freeze) which adds more semantics to this command line option. There is currently no way ask Xen to pin dom0s vcpus from domain create time but allow their affinity masks to be updated later. The XenServer performance team have been experimenting and have found performance benefits from being able to do this.> > Next I''d like to note that in our kernels we simply don''t build > arch/x86/kernel/cpu/sched.o. Together with CPU_FREQ being > suppressed, there''s no consumer of the feature flag in our > kernels. > > So I would think that your suggested change is appropriate, > but I''m adding Konrad to Cc as these days he''s the one to pick > this up. > > Jan > > > _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xen.org > http://lists.xen.org/xen-devel-- Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer T: +44 (0)1223 225 900, http://www.citrix.com
Jan Beulich
2012-May-23 12:18 UTC
Re: [PATCH] RFC: Linux: disable APERF/MPERF feature in PV kernels
>>> On 23.05.12 at 13:11, Andrew Cooper <andrew.cooper3@citrix.com> wrote: > On 23/05/12 08:34, Jan Beulich wrote: >> First of all I''m of the opinion that this indeed should not be >> masked in the hypervisor - there''s no reason to disallow the >> guest to read these registers (but we should of course deny >> writes as long as Xen is controlling P-states, which we do). > > I am sorry but I am going to have to disagree with you on this point. > > We should not be advertising this feature to any guest at all if we > can''t provide an implementation which works as native expects. Else we > are failing in our job of virtualisation.That''s perhaps a matter of the position you take - for HVM, I would agree with yours, but there''s many more aspects (not the least related to accessing other MSRs) that we fail to "properly" virtualize for PV guests - my position is that it is the nature of PV that guest kernels have to be aware of being virtualized (and hence stay away from doing certain things unless [they think] they know what they''re doing).> There is ''dom0_vcpus_pin''[1] which identity pins dom0 vcpus, and > prevents update of the affinity masks, and appears to conditionally > allow access to certain MSRs. I think it would be fine to expose this > feature iff dom0s vcpus are pinned in this fashion. That way, the > measurement should succeed, even if dom0 only has read access to the MSRs.Restricting it to this case would be too restrictive - it really makes sense at any time where the vCPU''s affinity has exactly one bit set (or to be precise, the intersection of it and the set of online pCPU-s). Jan
Andrew Cooper
2012-May-23 13:21 UTC
Re: [PATCH] RFC: Linux: disable APERF/MPERF feature in PV kernels
On 23/05/12 13:18, Jan Beulich wrote:>>>> On 23.05.12 at 13:11, Andrew Cooper <andrew.cooper3@citrix.com> wrote: >> On 23/05/12 08:34, Jan Beulich wrote: >>> First of all I''m of the opinion that this indeed should not be >>> masked in the hypervisor - there''s no reason to disallow the >>> guest to read these registers (but we should of course deny >>> writes as long as Xen is controlling P-states, which we do). >> I am sorry but I am going to have to disagree with you on this point. >> >> We should not be advertising this feature to any guest at all if we >> can''t provide an implementation which works as native expects. Else we >> are failing in our job of virtualisation. > That''s perhaps a matter of the position you take - for HVM, I > would agree with yours, but there''s many more aspects (not > the least related to accessing other MSRs) that we fail to > "properly" virtualize for PV guests - my position is that it is the > nature of PV that guest kernels have to be aware of being > virtualized (and hence stay away from doing certain things > unless [they think] they know what they''re doing). > >> There is ''dom0_vcpus_pin''[1] which identity pins dom0 vcpus, and >> prevents update of the affinity masks, and appears to conditionally >> allow access to certain MSRs. I think it would be fine to expose this >> feature iff dom0s vcpus are pinned in this fashion. That way, the >> measurement should succeed, even if dom0 only has read access to the MSRs. > Restricting it to this case would be too restrictive - it really > makes sense at any time where the vCPU''s affinity has exactly > one bit set (or to be precise, the intersection of it and the set > of online pCPU-s). > > Jan >That is unfortunately too lax. You also need to be able to guarantee that the affinity mask is not updated (and vcpu rescheduled) while in the middle of a measurement. Xen cant sensibly work out if or when a guest is taking a measurement, nor can dom0. So the only safe solution I can see is for Xen to prevent the affinity masks from ever being updated. With more thought, this would also preclude migration of a guest to another host. -- Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer T: +44 (0)1223 225 900, http://www.citrix.com
Konrad Rzeszutek Wilk
2012-May-23 13:26 UTC
Re: [PATCH] RFC: Linux: disable APERF/MPERF feature in PV kernels
On Wed, May 23, 2012 at 12:44:07AM +0200, Andre Przywara wrote:> On 05/22/2012 11:00 PM, Konrad Rzeszutek Wilk wrote: > >On Tue, May 22, 2012 at 11:02:01PM +0200, Andre Przywara wrote: > >>On 05/22/2012 07:18 PM, Konrad Rzeszutek Wilk wrote: > >>>On Tue, May 22, 2012 at 06:07:11PM +0200, Andre Przywara wrote: > >>>>Hi, > >>>> > >>>>while testing some APERF/MPERF semantics I discovered that this > >>>>feature is enabled in Xen Dom0, but is not reliable. > >>>>The Linux kernel''s scheduler uses this feature if it sees the CPUID > >>>>bit, leading to costly RDMSR traps (a few 100,000s during a kernel > >>>>compile) and bogus values due to VCPU migration during the > >>> > >>>Can you point me to the Linux scheduler code that does this? Thanks. > >> > >>arch/x86/kernel/cpu/sched.c contains code to read out and compute > >>APERF/MPERF registers. I added a Xen debug-key to dump a usage > >>counter added in traps.c and thus could prove that it is actually > >>the kernel that accesses these registers. > >>As far as I understood this the idea is to learn about boosting and > >>down-clocking (P-states) to get a fairer view on the actual > >>computing time a process consumed. > > > >Looks like its looking for this: > > > >X86_FEATURE_APERFMPERF > > > >Perhaps masking that should do it? Something along this in enlighten.c: > > > > cpuid_leaf1_edx_mask > > ~((1<< X86_FEATURE_MCE) | /* disable MCE */ > > (1<< X86_FEATURE_MCA) | /* disable MCA */ > > (1<< X86_FEATURE_MTRR) | /* disable MTRR */ > > (1<< X86_FEATURE_ACC)); /* thermal monitoring > > > >would be more appropiate? > > > >Or is that attribute on a different leaf? > > Right, it is bit 0 on level 6. That''s why I couldn''t use any of the > predefined masks and I didn''t feel like inventing a new one just for > this single bit. > We could as well explicitly use clear_cpu_cap somewhere, but I > didn''t find any code place in the Xen tree already doing this, > instead it looks like it belongs to where I put it (we handle leaf 5 > in a special way already here)OK, can you resend the patch please, looking similar to what you sent earlier, but do use a #define if possible (you can have the #define in that file) and an comment explaining why this is neccessary - and point to the Linux source code that uses this. Thanks! .. snip..> >>>>P.S. Of course this doesn''t fix pure userland software like > >>>>cpupower, but I would consider this in the user''s responsibility to > >>> > >>>Which would not work anymore as the cpufreq support is disabled > >>>when it boots under Xen. > >> > >>Do you mean with "anymore" in a future kernel? I tested this on > >>3.4.0 and cpupower monitor worked fine. Right, cpufreq is not > >>enabled, but cpupower uses the /dev/cpu/<n>/msr device file to > >>directly read the MSRs. So I get this output if run on an idle Dom0: > > > >Ahh. Neat. Will have to play with that. > > Bad news is we cannot forbid cpupower querying the feature directly > using the CPUID instruction in PV guests. Only we could patch it to > use /proc/cpuinfo readout instead, as this reflects the kernel view > of available features. With my patch aperfmperf is no longer there.Looks like a patch to cpupower should be cooked up too?
Andre Przywara
2012-May-23 13:31 UTC
Re: [PATCH] RFC: Linux: disable APERF/MPERF feature in PV kernels
On 05/23/2012 03:21 PM, Andrew Cooper wrote:> On 23/05/12 13:18, Jan Beulich wrote: >>>>> On 23.05.12 at 13:11, Andrew Cooper<andrew.cooper3@citrix.com> wrote: >>> On 23/05/12 08:34, Jan Beulich wrote: >>>> First of all I''m of the opinion that this indeed should not be >>>> masked in the hypervisor - there''s no reason to disallow the >>>> guest to read these registers (but we should of course deny >>>> writes as long as Xen is controlling P-states, which we do). >>> I am sorry but I am going to have to disagree with you on this point. >>> >>> We should not be advertising this feature to any guest at all if we >>> can''t provide an implementation which works as native expects. Else we >>> are failing in our job of virtualisation. >> That''s perhaps a matter of the position you take - for HVM, I >> would agree with yours, but there''s many more aspects (not >> the least related to accessing other MSRs) that we fail to >> "properly" virtualize for PV guests - my position is that it is the >> nature of PV that guest kernels have to be aware of being >> virtualized (and hence stay away from doing certain things >> unless [they think] they know what they''re doing). >> >>> There is ''dom0_vcpus_pin''[1] which identity pins dom0 vcpus, and >>> prevents update of the affinity masks, and appears to conditionally >>> allow access to certain MSRs. I think it would be fine to expose this >>> feature iff dom0s vcpus are pinned in this fashion. That way, the >>> measurement should succeed, even if dom0 only has read access to the MSRs. >> Restricting it to this case would be too restrictive - it really >> makes sense at any time where the vCPU''s affinity has exactly >> one bit set (or to be precise, the intersection of it and the set >> of online pCPU-s). >> >> Jan >> > > That is unfortunately too lax. You also need to be able to guarantee > that the affinity mask is not updated (and vcpu rescheduled) while in > the middle of a measurement. Xen cant sensibly work out if or when a > guest is taking a measurement, nor can dom0. So the only safe solution > I can see is for Xen to prevent the affinity masks from ever being > updated. With more thought, this would also preclude migration of a > guest to another host.Iff we really care about this feature, we could as well emulate it: On every VCPU migration we calculate the difference between the two pCPU''s values of APERF and MPERF. On the trap this value is added to the current MSR value. Similar to what is done with the TSC in HVM. We trap on every MSR access anyway, so the performance impact is only four HV rdmsrs on every VCPU migration. Only I am not sure if this is really a problem we should solve or if wouldn''t be easier for us and clearer to the user to just discourage those accesses. Regards, Andre. -- Andre Przywara AMD-Operating System Research Center (OSRC), Dresden, Germany
Andre Przywara
2012-May-24 13:24 UTC
Re: [PATCH] RFC: Linux: disable APERF/MPERF feature in PV kernels
On 05/23/2012 03:26 PM, Konrad Rzeszutek Wilk wrote:> On Wed, May 23, 2012 at 12:44:07AM +0200, Andre Przywara wrote: >> On 05/22/2012 11:00 PM, Konrad Rzeszutek Wilk wrote: >>> On Tue, May 22, 2012 at 11:02:01PM +0200, Andre Przywara wrote: >>>> On 05/22/2012 07:18 PM, Konrad Rzeszutek Wilk wrote: >>>>> On Tue, May 22, 2012 at 06:07:11PM +0200, Andre Przywara wrote: >>>>>> Hi, >>>>>> >>>>>> while testing some APERF/MPERF semantics I discovered that this >>>>>> feature is enabled in Xen Dom0, but is not reliable. >>>>>> The Linux kernel''s scheduler uses this feature if it sees the CPUID >>>>>> bit, leading to costly RDMSR traps (a few 100,000s during a kernel >>>>>> compile) and bogus values due to VCPU migration during the >>>>> >>>>> Can you point me to the Linux scheduler code that does this? Thanks. >>>> >>>> arch/x86/kernel/cpu/sched.c contains code to read out and compute >>>> APERF/MPERF registers. I added a Xen debug-key to dump a usage >>>> counter added in traps.c and thus could prove that it is actually >>>> the kernel that accesses these registers. >>>> As far as I understood this the idea is to learn about boosting and >>>> down-clocking (P-states) to get a fairer view on the actual >>>> computing time a process consumed. >>> >>> Looks like its looking for this: >>> >>> X86_FEATURE_APERFMPERF >>> >>> Perhaps masking that should do it? Something along this in enlighten.c: >>> >>> cpuid_leaf1_edx_mask >>> ~((1<< X86_FEATURE_MCE) | /* disable MCE */ >>> (1<< X86_FEATURE_MCA) | /* disable MCA */ >>> (1<< X86_FEATURE_MTRR) | /* disable MTRR */ >>> (1<< X86_FEATURE_ACC)); /* thermal monitoring >>> >>> would be more appropiate? >>> >>> Or is that attribute on a different leaf? >> >> Right, it is bit 0 on level 6. That''s why I couldn''t use any of the >> predefined masks and I didn''t feel like inventing a new one just for >> this single bit. >> We could as well explicitly use clear_cpu_cap somewhere, but I >> didn''t find any code place in the Xen tree already doing this, >> instead it looks like it belongs to where I put it (we handle leaf 5 >> in a special way already here) > > OK, can you resend the patch please, looking similar to what you sent > earlier, but do use a #define if possible (you can have the #define > in that file) and an comment explaining why this is neccessary - > and point to the Linux source code that uses this.Well, I was about to do this and wanted to see if this has any performance impact - only to discover that 3.4 does not trigger it anymore. After some debugging it turns out the guy reading APERF/MPERF was not arch/x86/kernel/cpu/sched.c, but drivers/cpufreq/mperf.c. So with disabling cpufreq the only real user is gone already. So the patch is kind of pointless as it on 3.4 with cpufreq already disabled. Remains to be investigated why sched.c is not called (I added a usage counter, it stays at zero). To avoid future mis-uses of APERF/MPERF by the kernel, I''d like to add the patch anyway. I will send it again when I have a clearer picture of this. .. snip..> Looks like a patch to cpupower should be cooked up too?I will contact the author. Regards, Andre. -- Andre Przywara AMD-OSRC (Dresden) Tel: x29712
Andre Przywara
2012-May-29 10:54 UTC
Re: [PATCH] RFC: Linux: disable APERF/MPERF feature in PV kernels
On 05/24/2012 03:24 PM, Andre Przywara wrote:> On 05/23/2012 03:26 PM, Konrad Rzeszutek Wilk wrote: >> On Wed, May 23, 2012 at 12:44:07AM +0200, Andre Przywara wrote: >>> On 05/22/2012 11:00 PM, Konrad Rzeszutek Wilk wrote: >>>> On Tue, May 22, 2012 at 11:02:01PM +0200, Andre Przywara wrote: >>>>> On 05/22/2012 07:18 PM, Konrad Rzeszutek Wilk wrote: >>>>>> On Tue, May 22, 2012 at 06:07:11PM +0200, Andre Przywara wrote: >>>>>>> Hi, >>>>>>> >>>>>>> while testing some APERF/MPERF semantics I discovered that this >>>>>>> feature is enabled in Xen Dom0, but is not reliable. >>>>>>> The Linux kernel''s scheduler uses this feature if it sees the CPUID >>>>>>> bit, leading to costly RDMSR traps (a few 100,000s during a kernel >>>>>>> compile) and bogus values due to VCPU migration during the >>>>>> >>>>>> Can you point me to the Linux scheduler code that does this? Thanks. >>>>> >>>>> arch/x86/kernel/cpu/sched.c contains code to read out and compute >>>>> APERF/MPERF registers. I added a Xen debug-key to dump a usage >>>>> counter added in traps.c and thus could prove that it is actually >>>>> the kernel that accesses these registers. >>>>> As far as I understood this the idea is to learn about boosting and >>>>> down-clocking (P-states) to get a fairer view on the actual >>>>> computing time a process consumed. >>>> >>>> Looks like its looking for this: >>>> >>>> X86_FEATURE_APERFMPERF >>>> >>>> Perhaps masking that should do it? Something along this in enlighten.c: >>>> >>>> cpuid_leaf1_edx_mask >>>> ~((1<< X86_FEATURE_MCE) | /* disable MCE */ >>>> (1<< X86_FEATURE_MCA) | /* disable MCA */ >>>> (1<< X86_FEATURE_MTRR) | /* disable MTRR */ >>>> (1<< X86_FEATURE_ACC)); /* thermal monitoring >>>> >>>> would be more appropiate? >>>> >>>> Or is that attribute on a different leaf? >>> >>> Right, it is bit 0 on level 6. That''s why I couldn''t use any of the >>> predefined masks and I didn''t feel like inventing a new one just for >>> this single bit. >>> We could as well explicitly use clear_cpu_cap somewhere, but I >>> didn''t find any code place in the Xen tree already doing this, >>> instead it looks like it belongs to where I put it (we handle leaf 5 >>> in a special way already here) >> >> OK, can you resend the patch please, looking similar to what you sent >> earlier, but do use a #define if possible (you can have the #define >> in that file) and an comment explaining why this is neccessary - >> and point to the Linux source code that uses this. > > Well, I was about to do this and wanted to see if this has any > performance impact - only to discover that 3.4 does not trigger it > anymore. After some debugging it turns out the guy reading APERF/MPERF > was not arch/x86/kernel/cpu/sched.c, but drivers/cpufreq/mperf.c. So > with disabling cpufreq the only real user is gone already. > So the patch is kind of pointless as it on 3.4 with cpufreq already > disabled. Remains to be investigated why sched.c is not called (I added > a usage counter, it stays at zero).The scheduler code accessing the MSRs is disabled by default: kernel/sched/features.h: SCHED_FEAT(ARCH_POWER, false) I enabled this and saw the traps. So the only kernel user remaining is/was cpufreq (and /dev/cpu/<n>/msr).> To avoid future mis-uses of APERF/MPERF by the kernel, I''d like to add > the patch anyway. I will send it again when I have a clearer picture of > this.The patch will arrive in a minute. Although this is kind of pointless for 3.4, I put in a stable tag. Also I find the removed aperfmperf flag from /proc/cpuinfo useful, this can be handy for Xen aware power management tools.> .. snip.. >> Looks like a patch to cpupower should be cooked up too? > > I will contact the author.Have done this. We are still discussing the best solution with Thomas Renninger and Jan Beulich, but a simple warning (and maybe a hint to xenpm) looks the best for the time being. Not sure if proper emulation/virtualization is worth the effort. Regards, Andre. -- Andre Przywara AMD-Operating System Research Center (OSRC), Dresden, Germany