Jan Beulich
2011-Oct-20 13:36 UTC
[Xen-devel] [PATCH 00/12] cpumask handling scalability improvements
This patch set makes some first steps towards eliminating the old cpumask
accessors, replacing them with ones that don't require the full NR_CPUS
bits to be allocated (which obviously can be pretty wasteful when
NR_CPUS is high, but the actual number is low or moderate).

01: introduce and use nr_cpu_ids and nr_cpumask_bits
02: eliminate cpumask accessors referencing NR_CPUS
03: eliminate direct assignments of CPU masks
04: x86: allocate IRQ actions' cpu_eoi_map dynamically
05: allocate CPU sibling and core maps dynamically
06: allow efficient allocation of multiple CPU masks at once

One reason I put the following ones together was to reduce the
differences between the disassembly of hypervisors built for 4095
and 2047 CPUs, which I looked at to determine the places where
cpumask_t variables get copied without using cpumask_copy() (a
job where grep is of no help). Hence consider these patches optional,
but recommended.

07: cpufreq: allocate CPU masks dynamically
08: x86/p2m: allocate CPU masks dynamically
09: cpupools: allocate CPU masks dynamically
10: credit: allocate CPU masks dynamically
11: x86/hpet: allocate CPU masks dynamically
12: cpumask <=> xenctl_cpumap: allocate CPU masks and byte maps dynamically

Signed-off-by: Jan Beulich <jbeulich@suse.com>
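For context, a minimal sketch of the distinction the series is built around: an embedded cpumask_t always carries NR_CPUS bits, while a cpumask_var_t carries only a pointer and is backed by a bitmap sized at run time. The names below follow the Linux-derived API that Xen adopts; the struct and function names foo_* are invented for illustration, and the exact Xen signatures may differ slightly, so treat this as an approximation rather than code from the patches.

    /* Illustrative sketch only -- not taken from the patches. */

    /* Old style: the full NR_CPUS-bit mask is embedded, e.g. 512 bytes
     * when building for 4096 CPUs, even if only a few CPUs exist. */
    struct foo_old {
        cpumask_t mask;
    };

    /* New style: only a pointer is embedded; the bitmap behind it is
     * sized by nr_cpumask_bits and allocated at run time. */
    struct foo_new {
        cpumask_var_t mask;
    };

    static int foo_init(struct foo_new *f)
    {
        if ( !zalloc_cpumask_var(&f->mask) )   /* allocation can fail */
            return -ENOMEM;
        cpumask_set_cpu(smp_processor_id(), f->mask);
        return 0;
    }

    static void foo_teardown(struct foo_new *f)
    {
        free_cpumask_var(f->mask);
    }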
Keir Fraser
2011-Oct-20 15:09 UTC
Re: [Xen-devel] [PATCH 00/12] cpumask handling scalability improvements
On 20/10/2011 14:36, "Jan Beulich" <JBeulich@suse.com> wrote:> This patch set makes some first steps towards eliminating the old cpumask > accessors, replacing them by such that don''t require the full NR_CPUS > bits to be allocated (which obviously can be pretty wasteful when > NR_CPUS is high, but the actual number is low or moderate). > > 01: introduce and use nr_cpu_ids and nr_cpumask_bits > 02: eliminate cpumask accessors referencing NR_CPUS > 03: eliminate direct assignments of CPU masks > 04: x86: allocate IRQ actions'' cpu_eoi_map dynamically > 05: allocate CPU sibling and core maps dynamicallyI''m not sure about this. We can save ~500 bytes per cpumask_t when NR_CPUS=4096 and actual nr_cpus<64. But how many cpumask_t''s do we typically have dynamically allocated all at once? Let''s say we waste 2kB per VCPU and per IRQ, and we have a massive system with ~1k VCPUs and ~1k IRQs -- we''d save ~4MB in that extreme case. But such a large system probably actually will have a lot of CPUs. And also a lot of memory, such that 4MB is quite insignificant. I suppose there is a second argument that it shrinks the containing structures (struct domain, struct vcpu, struct irq_desc, ...) and maybe helps reduce our order!=0 allocations? By the way, I think we could avoid the NR_CPUS copying overhead everywhere by having the cpumask.h functions respect nr_cpu_ids, but continuing to return NR_CPUS for sentinel value (e.g., end of loop; or no bit found)? This would not need to change tonnes of code. It only gets part of the benefit (reducing cpu time overhead) but is more palatable?> 06: allow efficient allocation of multiple CPU masks at onceThat is utterly hideous and for insignificant saving.> One reason I put the following ones together was to reduce the > differences between the disassembly of hypervisors built for 4095 > and 2047 CPUs, which I looked at to determine the places where > cpumask_t variables get copied without using cpumask_copy() (a > job where grep is of no help). Hence consider these patch optional, > but recommended. > > 07: cpufreq: allocate CPU masks dynamically > 08: x86/p2m: allocate CPU masks dynamically > 09: cpupools: allocate CPU masks dynamically > 10: credit: allocate CPU masks dynamically > 11: x86/hpet: allocate CPU masks dynamically > 12: cpumask <=> xenctl_cpumap: allocate CPU masks and byte maps dynamicallyQuestionable. Any subsystem that allocates no more than a handful of cpumask_t''s is possibly just as well left alone... I''m not dead set against them if we deicde that 01-05 are actually worth pursuing, however. -- Keir> Signed-off-by: Jan Beulich <jbeulich@suse.com> > > > _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xensource.com > http://lists.xensource.com/xen-devel_______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Jan Beulich
2011-Oct-20 15:19 UTC
Re: [Xen-devel] [PATCH 00/12] cpumask handling scalability improvements
>>> On 20.10.11 at 17:09, Keir Fraser <keir.xen@gmail.com> wrote:
> On 20/10/2011 14:36, "Jan Beulich" <JBeulich@suse.com> wrote:
>
>> This patch set makes some first steps towards eliminating the old cpumask
>> accessors, replacing them with ones that don't require the full NR_CPUS
>> bits to be allocated (which obviously can be pretty wasteful when
>> NR_CPUS is high, but the actual number is low or moderate).
>>
>> 01: introduce and use nr_cpu_ids and nr_cpumask_bits
>> 02: eliminate cpumask accessors referencing NR_CPUS
>> 03: eliminate direct assignments of CPU masks
>> 04: x86: allocate IRQ actions' cpu_eoi_map dynamically
>> 05: allocate CPU sibling and core maps dynamically
>
> I'm not sure about this. We can save ~500 bytes per cpumask_t when
> NR_CPUS=4096 and actual nr_cpus<64. But how many cpumask_t's do we typically
> have dynamically allocated all at once? Let's say we waste 2kB per VCPU and
> per IRQ, and we have a massive system with ~1k VCPUs and ~1k IRQs -- we'd
> save ~4MB in that extreme case. But such a large system probably actually
> will have a lot of CPUs. And also a lot of memory, such that 4MB is quite
> insignificant.

It's not only the memory savings, but the time savings in manipulating
less space.

> I suppose there is a second argument that it shrinks the containing
> structures (struct domain, struct vcpu, struct irq_desc, ...) and maybe
> helps reduce our order!=0 allocations?

Yes - that's what made me start taking over these Linux bits. What I
sent here just continues on that route. I was really hoping that we
wouldn't leave this in a half-baked state.

> By the way, I think we could avoid the NR_CPUS copying overhead everywhere
> by having the cpumask.h functions respect nr_cpu_ids, but continuing to
> return NR_CPUS as the sentinel value (e.g., end of loop; or no bit found)?
> This would not need to change tonnes of code. It only gets part of the
> benefit (reducing CPU time overhead) but is more palatable?

That would be possible, but would again leave us in a somewhat
incomplete state. (Note that I did leave NR_CPUS in the stop-machine
logic.)

>> 06: allow efficient allocation of multiple CPU masks at once
>
> That is utterly hideous and for insignificant saving.

I was afraid you would say that, and I'm not fully convinced
either. But I wanted to give it a try to see how bad it is. The
more significant saving here really comes from not allocating
the CPU masks at all for unused irq_desc-s.

Jan
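The time-saving argument rests on the new accessors being bounded by nr_cpumask_bits rather than NR_CPUS, so mask operations only touch the words actually in use. A sketch patterned on the Linux-derived implementation the series imports; the exact Xen definition may differ in details such as the return type.

    /* Sketch: only nr_cpumask_bits bits are processed, however large
     * NR_CPUS was at build time. */
    static inline void cpumask_and(cpumask_t *dstp, const cpumask_t *src1p,
                                   const cpumask_t *src2p)
    {
        bitmap_and(dstp->bits, src1p->bits, src2p->bits, nr_cpumask_bits);
    }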
Jan Beulich
2011-Oct-20 15:25 UTC
Re: [Xen-devel] [PATCH 00/12] cpumask handling scalability improvements
>>> On 20.10.11 at 17:19, "Jan Beulich" <JBeulich@suse.com> wrote:
>>>> On 20.10.11 at 17:09, Keir Fraser <keir.xen@gmail.com> wrote:
>> I suppose there is a second argument that it shrinks the containing
>> structures (struct domain, struct vcpu, struct irq_desc, ...) and maybe
>> helps reduce our order!=0 allocations?
>
> Yes - that's what made me start taking over these Linux bits. What I
> sent here just continues on that route. I was really hoping that we
> wouldn't leave this in a half-baked state.

I forgot to say: The new accessors produce a little more legible code
with cpumask_var_t and cpumask_t * (the second of which we already
converted to wherever possible) than with a plain cpumask_t, because of
the need to use & on the plain variant (and as you will see they also
allow removing a number of ugly-looking * operators, which I always
considered only temporary).

Jan
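To illustrate the legibility point with a hypothetical field access (not an actual line from the tree):

    /* Embedded cpumask_t: '&' is needed on every access. */
    cpumask_copy(&desc->affinity, &cpu_online_map);

    /* cpumask_var_t / cpumask_t *: the accessor takes the pointer directly. */
    cpumask_copy(desc->affinity, &cpu_online_map);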
Andrew Cooper
2011-Oct-20 15:49 UTC
Re: [Xen-devel] [PATCH 00/12] cpumask handling scalability improvements
On 20/10/11 16:19, Jan Beulich wrote:
>>>> On 20.10.11 at 17:09, Keir Fraser <keir.xen@gmail.com> wrote:
>> On 20/10/2011 14:36, "Jan Beulich" <JBeulich@suse.com> wrote:
>>
>>> This patch set makes some first steps towards eliminating the old cpumask
>>> accessors, replacing them with ones that don't require the full NR_CPUS
>>> bits to be allocated (which obviously can be pretty wasteful when
>>> NR_CPUS is high, but the actual number is low or moderate).
>>>
>>> 01: introduce and use nr_cpu_ids and nr_cpumask_bits
>>> 02: eliminate cpumask accessors referencing NR_CPUS
>>> 03: eliminate direct assignments of CPU masks
>>> 04: x86: allocate IRQ actions' cpu_eoi_map dynamically
>>> 05: allocate CPU sibling and core maps dynamically
>> I'm not sure about this. We can save ~500 bytes per cpumask_t when
>> NR_CPUS=4096 and actual nr_cpus<64. But how many cpumask_t's do we typically
>> have dynamically allocated all at once? Let's say we waste 2kB per VCPU and
>> per IRQ, and we have a massive system with ~1k VCPUs and ~1k IRQs -- we'd
>> save ~4MB in that extreme case. But such a large system probably actually
>> will have a lot of CPUs. And also a lot of memory, such that 4MB is quite
>> insignificant.
> It's not only the memory savings, but the time savings in manipulating
> less space.
>
>> I suppose there is a second argument that it shrinks the containing
>> structures (struct domain, struct vcpu, struct irq_desc, ...) and maybe
>> helps reduce our order!=0 allocations?
> Yes - that's what made me start taking over these Linux bits. What I
> sent here just continues on that route. I was really hoping that we
> wouldn't leave this in a half-baked state.
>
>> By the way, I think we could avoid the NR_CPUS copying overhead everywhere
>> by having the cpumask.h functions respect nr_cpu_ids, but continuing to
>> return NR_CPUS as the sentinel value (e.g., end of loop; or no bit found)?
>> This would not need to change tonnes of code. It only gets part of the
>> benefit (reducing CPU time overhead) but is more palatable?
> That would be possible, but would again leave us in a somewhat
> incomplete state. (Note that I did leave NR_CPUS in the stop-machine
> logic.)
>
>>> 06: allow efficient allocation of multiple CPU masks at once
>> That is utterly hideous and for insignificant saving.
> I was afraid you would say that, and I'm not fully convinced
> either. But I wanted to give it a try to see how bad it is. The
> more significant saving here really comes from not allocating
> the CPU masks at all for unused irq_desc-s.
>
> Jan

The saving of not allocating masks for unused irq_desc's (and irq_cfg's)
will be significant in the general case (3 * NR_UNUSED_IRQs *
sizeof(mask)), where the average system is wasting most of 224 IRQs per
CPU.

However, I am against moving the masks out of irq_desc (perhaps this is
the C++ coder inside me).

Would an acceptable alternative be to change irq_desc to use
cpumask_var_t's and allocate them on first use? (I have not spent long
thinking about this, so it is possible that the extra checks for NULL
pointers on the IRQ path might be counterproductive?)

~Andrew

-- 
Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
T: +44 (0)1223 225 900, http://www.citrix.com
Jan Beulich
2011-Oct-20 15:57 UTC
Re: [Xen-devel] [PATCH 00/12] cpumask handling scalability improvements
>>> On 20.10.11 at 17:49, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> The saving of not allocating masks for unused irq_desc's (and irq_cfg's)
> will be significant in the general case (3 * NR_UNUSED_IRQs *
> sizeof(mask)), where the average system is wasting most of 224 IRQs per
> CPU.
>
> However, I am against moving the masks out of irq_desc (perhaps this is
> the C++ coder inside me).

I didn't propose moving them out of there.

> Would an acceptable alternative be to change irq_desc to use
> cpumask_var_t's and allocate them on first use? (I have not spent long
> thinking about this, so it is possible that the extra checks for NULL
> pointers on the IRQ path might be counterproductive?)

That's what it does (or really it allocates them for all GSIs at boot,
and then in create_irq() for the dynamically allocated ones). There's no
reason to worry about NULL checks - any access to a never-initialized
irq_desc is a bug anyway.

Jan
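A rough sketch of what such per-descriptor allocation can look like; the function and field names here are illustrative rather than the exact ones used in the patches.

    /* Illustrative only: allocate a descriptor's masks when the
     * descriptor itself is brought into use, instead of embedding
     * NR_CPUS-sized bitmaps in every irq_desc. */
    static int init_irq_desc_masks(struct irq_desc *desc)
    {
        if ( !zalloc_cpumask_var(&desc->affinity) )
            return -ENOMEM;
        if ( !zalloc_cpumask_var(&desc->pending_mask) )
        {
            free_cpumask_var(desc->affinity);
            return -ENOMEM;
        }
        cpumask_setall(desc->affinity);
        return 0;
    }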
Keir Fraser
2011-Oct-20 16:04 UTC
Re: [Xen-devel] [PATCH 00/12] cpumask handling scalability improvements
On 20/10/2011 16:19, "Jan Beulich" <JBeulich@suse.com> wrote:>>> 06: allow efficient allocation of multiple CPU masks at once >> >> That is utterly hideous and for insignificant saving. > > I was afraid you would say that, and I''m not fully convinced > either. But I wanted to give it a try to see how bad it is. The > more significant saving here really comes from not allocating > the CPU masks at all for unused irq_desc-s.Aren''t we planning to dynamically allocate irq_desc-s? That would seem the nicer solution here. -- Keir _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Keir Fraser
2011-Oct-20 16:06 UTC
Re: [Xen-devel] [PATCH 00/12] cpumask handling scalability improvements
On 20/10/2011 16:19, "Jan Beulich" <JBeulich@suse.com> wrote:

>> I suppose there is a second argument that it shrinks the containing
>> structures (struct domain, struct vcpu, struct irq_desc, ...) and maybe
>> helps reduce our order!=0 allocations?
>
> Yes - that's what made me start taking over these Linux bits. What I
> sent here just continues on that route. I was really hoping that we
> wouldn't leave this in a half-baked state.

Has Linux gone wholesale down this route, then?

 -- Keir
Jan Beulich
2011-Oct-21 07:01 UTC
Re: [Xen-devel] [PATCH 00/12] cpumask handling scalability improvements
>>> On 20.10.11 at 18:06, Keir Fraser <keir@xen.org> wrote:
> On 20/10/2011 16:19, "Jan Beulich" <JBeulich@suse.com> wrote:
>
>>> I suppose there is a second argument that it shrinks the containing
>>> structures (struct domain, struct vcpu, struct irq_desc, ...) and maybe
>>> helps reduce our order!=0 allocations?
>>
>> Yes - that's what made me start taking over these Linux bits. What I
>> sent here just continues on that route. I was really hoping that we
>> wouldn't leave this in a half-baked state.
>
> Has Linux gone wholesale down this route, then?

Yes - the old ones are only left in as long as nothing forces
CONFIG_DISABLE_OBSOLETE_CPUMASK_FUNCTIONS (which we can avoid
altogether).

Jan
Keir Fraser
2011-Oct-21 07:08 UTC
Re: [Xen-devel] [PATCH 00/12] cpumask handling scalability improvements
On 20/10/2011 14:36, "Jan Beulich" <JBeulich@suse.com> wrote:

> This patch set makes some first steps towards eliminating the old cpumask
> accessors, replacing them with ones that don't require the full NR_CPUS
> bits to be allocated (which obviously can be pretty wasteful when
> NR_CPUS is high, but the actual number is low or moderate).
>
> 01: introduce and use nr_cpu_ids and nr_cpumask_bits
> 02: eliminate cpumask accessors referencing NR_CPUS
> 03: eliminate direct assignments of CPU masks
> 04: x86: allocate IRQ actions' cpu_eoi_map dynamically
> 05: allocate CPU sibling and core maps dynamically

01-05/07-12:
Acked-by: Keir Fraser <keir@xen.org>

> 06: allow efficient allocation of multiple CPU masks at once

Not this one.

 -- Keir

> One reason I put the following ones together was to reduce the
> differences between the disassembly of hypervisors built for 4095
> and 2047 CPUs, which I looked at to determine the places where
> cpumask_t variables get copied without using cpumask_copy() (a
> job where grep is of no help). Hence consider these patches optional,
> but recommended.
>
> 07: cpufreq: allocate CPU masks dynamically
> 08: x86/p2m: allocate CPU masks dynamically
> 09: cpupools: allocate CPU masks dynamically
> 10: credit: allocate CPU masks dynamically
> 11: x86/hpet: allocate CPU masks dynamically
> 12: cpumask <=> xenctl_cpumap: allocate CPU masks and byte maps dynamically
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich
2011-Oct-21 07:10 UTC
Re: [Xen-devel] [PATCH 00/12] cpumask handling scalability improvements
>>> On 20.10.11 at 18:04, Keir Fraser <keir@xen.org> wrote:
> On 20/10/2011 16:19, "Jan Beulich" <JBeulich@suse.com> wrote:
>
>>>> 06: allow efficient allocation of multiple CPU masks at once
>>>
>>> That is utterly hideous and for insignificant saving.
>>
>> I was afraid you would say that, and I'm not fully convinced
>> either. But I wanted to give it a try to see how bad it is. The
>> more significant saving here really comes from not allocating
>> the CPU masks at all for unused irq_desc-s.
>
> Aren't we planning to dynamically allocate irq_desc-s? That would seem the
> nicer solution here.

Yes, I would hope so. But irrespective of that, allocating e.g. 512 bits
(times 4) just to use, say, 20-30 of them is bad - again, not so much
from a memory wasting point of view, but rather from the fact that this
needlessly causes a larger cache and TLB footprint.

I actually think that ultimately we should try to remove all
non-dynamically allocated CPU masks (including statics, per-CPU
ones, and local variables - the latter being particularly important as
they cause pretty big stack frames, despite there now being at
most one [with the rare exception of two] of them per function,
which will continue to grow with higher NR_CPUS values).

Jan
Jan Beulich
2011-Oct-21 07:14 UTC
Re: [Xen-devel] [PATCH 00/12] cpumask handling scalability improvements
>>> On 21.10.11 at 09:08, Keir Fraser <keir.xen@gmail.com> wrote:
> On 20/10/2011 14:36, "Jan Beulich" <JBeulich@suse.com> wrote:
>
>> This patch set makes some first steps towards eliminating the old cpumask
>> accessors, replacing them with ones that don't require the full NR_CPUS
>> bits to be allocated (which obviously can be pretty wasteful when
>> NR_CPUS is high, but the actual number is low or moderate).
>>
>> 01: introduce and use nr_cpu_ids and nr_cpumask_bits
>> 02: eliminate cpumask accessors referencing NR_CPUS
>> 03: eliminate direct assignments of CPU masks
>> 04: x86: allocate IRQ actions' cpu_eoi_map dynamically
>> 05: allocate CPU sibling and core maps dynamically
>
> 01-05/07-12:
> Acked-by: Keir Fraser <keir@xen.org>
>
>> 06: allow efficient allocation of multiple CPU masks at once
>
> Not this one.

Okay, I'll re-work (and re-send) it then to use non-array allocations.

Jan

>> One reason I put the following ones together was to reduce the
>> differences between the disassembly of hypervisors built for 4095
>> and 2047 CPUs, which I looked at to determine the places where
>> cpumask_t variables get copied without using cpumask_copy() (a
>> job where grep is of no help). Hence consider these patches optional,
>> but recommended.
>>
>> 07: cpufreq: allocate CPU masks dynamically
>> 08: x86/p2m: allocate CPU masks dynamically
>> 09: cpupools: allocate CPU masks dynamically
>> 10: credit: allocate CPU masks dynamically
>> 11: x86/hpet: allocate CPU masks dynamically
>> 12: cpumask <=> xenctl_cpumap: allocate CPU masks and byte maps dynamically
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich
2011-Oct-21 07:49 UTC
Re: [Xen-devel] [PATCH 00/12] cpumask handling scalability improvements
>>> On 21.10.11 at 09:08, Keir Fraser <keir.xen@gmail.com> wrote:
> On 20/10/2011 14:36, "Jan Beulich" <JBeulich@suse.com> wrote:
> 01-05/07-12:
> Acked-by: Keir Fraser <keir@xen.org>
>
>> 07: cpufreq: allocate CPU masks dynamically

I didn't apply this one, as it wants
http://lists.xensource.com/archives/html/xen-devel/2011-10/msg01180.html
to be applied first, but there was no reaction from AMD so far. Mark?

Jan
Keir Fraser
2011-Oct-21 08:00 UTC
Re: [Xen-devel] [PATCH 00/12] cpumask handling scalability improvements
On 21/10/2011 08:10, "Jan Beulich" <JBeulich@suse.com> wrote:

>> Aren't we planning to dynamically allocate irq_desc-s? That would seem the
>> nicer solution here.
>
> Yes, I would hope so. But irrespective of that, allocating e.g. 512 bits
> (times 4) just to use, say, 20-30 of them is bad - again, not so much
> from a memory wasting point of view, but rather from the fact that this
> needlessly causes a larger cache and TLB footprint.

Oh, I don't mind you changing irq_desc to use cpumask_var_t-s. It's the
extra step of using an array to save a few pointers that I reject.

> I actually think that ultimately we should try to remove all
> non-dynamically allocated CPU masks (including statics, per-CPU
> ones, and local variables - the latter being particularly important as
> they cause pretty big stack frames, despite there now being at
> most one [with the rare exception of two] of them per function,
> which will continue to grow with higher NR_CPUS values).

OTOH they are probably in some cases used in contexts where dynamic
allocation (and the possibility of failure) is not nice? And we can compare
this effort with just increasing per-CPU stack sizes, for example?

I'm not particularly against moving in this direction, but there might be
cases where it isn't worth the effort, or there are better solutions.

 -- Keir
Jan Beulich
2011-Oct-21 08:23 UTC
Re: [Xen-devel] [PATCH 00/12] cpumask handling scalability improvements
>>> On 21.10.11 at 10:00, Keir Fraser <keir.xen@gmail.com> wrote:
> On 21/10/2011 08:10, "Jan Beulich" <JBeulich@suse.com> wrote:
>
>>> Aren't we planning to dynamically allocate irq_desc-s? That would seem the
>>> nicer solution here.
>>
>> Yes, I would hope so. But irrespective of that, allocating e.g. 512 bits
>> (times 4) just to use, say, 20-30 of them is bad - again, not so much
>> from a memory wasting point of view, but rather from the fact that this
>> needlessly causes a larger cache and TLB footprint.
>
> Oh, I don't mind you changing irq_desc to use cpumask_var_t-s. It's the
> extra step of using an array to save a few pointers that I reject.

That's what I understood.

>> I actually think that ultimately we should try to remove all
>> non-dynamically allocated CPU masks (including statics, per-CPU
>> ones, and local variables - the latter being particularly important as
>> they cause pretty big stack frames, despite there now being at
>> most one [with the rare exception of two] of them per function,
>> which will continue to grow with higher NR_CPUS values).
>
> OTOH they are probably in some cases used in contexts where dynamic
> allocation (and the possibility of failure) is not nice? And we can compare
> this effort with just increasing per-CPU stack sizes, for example?
>
> I'm not particularly against moving in this direction, but there might be
> cases where it isn't worth the effort, or there are better solutions.

Indeed, in some cases it may mean moving from stack variables to e.g.
per-CPU pre-allocated ones.

Jan
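A hedged sketch of the direction being described here: trading an on-stack mask (512 bytes at NR_CPUS=4096) for a per-CPU, pre-allocated one. All names below are invented for illustration, and a real user would additionally have to ensure the scratch mask cannot be used reentrantly (e.g. by restricting it to non-IRQ context).

    /* Before: with NR_CPUS=4096 this local eats 512 bytes of stack. */
    static void notify_cpus_stack(const cpumask_t *in)
    {
        cpumask_t mask;

        cpumask_and(&mask, in, &cpu_online_map);
        /* ... act on 'mask' ... */
    }

    /* After: a per-CPU scratch mask, allocated once (e.g. at CPU
     * bring-up) and reused instead of a stack object. */
    static DEFINE_PER_CPU(cpumask_var_t, scratch_mask);

    static int prepare_scratch_mask(unsigned int cpu)
    {
        return zalloc_cpumask_var(&per_cpu(scratch_mask, cpu)) ? 0 : -ENOMEM;
    }

    static void notify_cpus_percpu(const cpumask_t *in)
    {
        cpumask_t *mask = this_cpu(scratch_mask);

        cpumask_and(mask, in, &cpu_online_map);
        /* ... act on '*mask' ... */
    }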