George,

while hunting down direct assignments of cpumask_t variables (which I'm trying to eliminate so that hypervisor binaries built for many CPUs won't have undue memory overhead on "normal" [smaller] systems), I stumbled across memory references that at first glance looked buggy to me due to their huge immediates. As it turns out, they're not buggy, but a result of credit2's struct csched_private - particularly its NR_CPUS-sized array of struct csched_runqueue_data, which in turn has three cpumask_t-s, totaling a structure size of about 6.5MB when building for 4095 CPUs (the current upper limit in Xen).

Apart from the possibility of allocating the arrays (and maybe also the cpumask_t-s) separately (for which I can come up with a patch on top of what I'm currently putting together) - is it really necessary to have all these, all the more so since there can be multiple instances of the structure with CPU pools?

Jan
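For reference, the shape of the structure in question, with the size arithmetic (a simplified sketch, not the literal sched_credit2.c definitions; field names are illustrative, and the real structures have more members):

    /* Simplified sketch -- not the literal sched_credit2.c definitions. */
    #define NR_CPUS 4095

    /* 4095 bits round up to 64 longs, i.e. 512 bytes per mask. */
    typedef struct { unsigned long bits[(NR_CPUS + 63) / 64]; } cpumask_t;

    struct csched_runqueue_data {
        cpumask_t active;     /* CPUs assigned to this runqueue */
        cpumask_t idle;       /* CPUs currently idle */
        cpumask_t tickled;    /* CPUs poked to reschedule */
        /* ... runqueue lists, load tracking, etc. ... */
    };

    struct csched_private {
        struct csched_runqueue_data rqd[NR_CPUS];  /* the offending array */
        /* ... */
    };

    /* 3 masks * 512 bytes = 1.5KB per entry; times 4095 entries is
     * about 6.3MB from the masks alone, with the remaining members
     * accounting for the rest of the ~6.5MB total. */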
On Thu, Oct 13, 2011 at 10:42 AM, Jan Beulich <JBeulich@suse.com> wrote:
> Apart from the possibility of allocating the arrays (and maybe also the
> cpumask_t-s) separately (for which I can come up with a patch on top
> of what I'm currently putting together) - is it really necessary to have
> all these, all the more so since there can be multiple instances of the
> structure with CPU pools?

I'm not quite sure what it is that you're asking. Do you mean, are all of the things in each runqueue structure necessary? Specifically, I guess, the cpumask_t structures (because the rest of the structure isn't significantly larger than the per-cpu structure for credit1)?

At first blush, all of those cpu masks are necessary. The assignment of cpus to runqueues may be arbitrary, so we need a cpu mask for that. In theory, "idle" and "tickled" only need bits for the cpus actually assigned to this runqueue (which should be 2-8 under normal circumstances). But then we would need some kind of mechanism to translate "mask just for these cpus" to "mask of all cpus" in order to use the normal cpumask mechanisms, which seems like a lot of extra complexity just to save a few bytes. Surely a system with 4096 logical cpus can afford 6 megabytes of memory for scheduling?

For one thing, the number of runqueues in credit2 is actually meant to be smaller than the number of logical cpus -- it's meant to be one per L2 cache, which should have between 2 and 8 logical cpus, depending on the architecture. I just put NR_CPUS because it was easier to get working. Making that an array of pointers, which is allocated on an as-needed basis, should reduce that requirement a great deal.

 -George
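The pointer-array change suggested above, as a rough sketch (names and the lookup helper are illustrative, not the actual patch):

    /* An NR_CPUS array of pointers instead of embedded structures. */
    struct csched_private {
        struct csched_runqueue_data *rqd[NR_CPUS];  /* ~8 bytes per entry */
        /* ... */
    };

    /* Allocate a runqueue only when it is first needed. */
    static struct csched_runqueue_data *
    rqd_get(struct csched_private *prv, unsigned int rqi)
    {
        if ( prv->rqd[rqi] == NULL )
        {
            prv->rqd[rqi] = xmalloc(struct csched_runqueue_data);
            if ( prv->rqd[rqi] != NULL )
                memset(prv->rqd[rqi], 0, sizeof(*prv->rqd[rqi]));
        }
        return prv->rqd[rqi];
    }

With one runqueue per L2 cache, only a handful of the slots would ever be populated.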
On 13/10/2011 11:11, "George Dunlap" <George.Dunlap@eu.citrix.com> wrote:
> For one thing, the number of runqueues in credit2 is actually meant to
> be smaller than the number of logical cpus -- it's meant to be one per
> L2 cache, which should have between 2 and 8 logical cpus, depending on
> the architecture. I just put NR_CPUS because it was easier to get
> working. Making that an array of pointers, which is allocated on an
> as-needed basis, should reduce that requirement a great deal.

That would suffice. If we can put per-cpu stuff in the per_cpu() data area, then even better. The fact that credit2 burns a couple of kB per CPU isn't a problem at all, as long as it does it only for active CPUs.

 -- Keir
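What "per-cpu stuff in the per_cpu() data area" would look like, roughly (a sketch only; the variable name is hypothetical):

    #include <xen/percpu.h>

    /* One slot per CPU, living in the per-CPU data area that Xen
     * allocates only for CPUs actually brought up: */
    static DEFINE_PER_CPU(struct csched_runqueue_data *, csched_runqueue);

    /* Lookup from scheduling code: */
    #define RQD(cpu) per_cpu(csched_runqueue, cpu)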
>>> On 13.10.11 at 12:57, Keir Fraser <keir@xen.org> wrote:
> That would suffice. If we can put per-cpu stuff in the per_cpu() data area,
> then even better.

That might not be possible, as there can be more than one instance of that scheduler.

> The fact that credit2 burns a couple of kB per CPU isn't a
> problem at all, as long as it does it only for active CPUs.

Fully agree.

Jan
>>> On 13.10.11 at 12:11, George Dunlap <George.Dunlap@eu.citrix.com> wrote:
> I'm not quite sure what it is that you're asking. Do you mean, are
> all of the things in each runqueue structure necessary? Specifically,
> I guess, the cpumask_t structures (because the rest of the structure
> isn't significantly larger than the per-cpu structure for credit1)?

No, it's really the NR_CPUS-sized array of struct csched_runqueue_data. Credit1, OTOH, has *no* NR_CPUS-sized arrays at all.

> At first blush, all of those cpu masks are necessary. The assignment
> of cpus to runqueues may be arbitrary, so we need a cpu mask for that.
> In theory, "idle" and "tickled" only need bits for the cpus actually
> assigned to this runqueue (which should be 2-8 under normal
> circumstances). But then we would need some kind of mechanism to
> translate "mask just for these cpus" to "mask of all cpus" in order to
> use the normal cpumask mechanisms, which seems like a lot of extra
> complexity just to save a few bytes. Surely a system with 4096
> logical cpus can afford 6 megabytes of memory for scheduling?

I'm not concerned about the total amount when run on a system that large. I'm more concerned about this being a single chunk (possibly allocated post-boot, where we're really aiming at having no allocations larger than a page at all) and this size being allocated even when running on a much smaller system (i.e. depending only on compile-time parameters).

> For one thing, the number of runqueues in credit2 is actually meant to
> be smaller than the number of logical cpus -- it's meant to be one per
> L2 cache, which should have between 2 and 8 logical cpus, depending on
> the architecture. I just put NR_CPUS because it was easier to get
> working. Making that an array of pointers, which is allocated on an
> as-needed basis, should reduce that requirement a great deal.

That would help, but would probably not suffice (since a NR_CPUS-sized array of pointers is still going to be larger than a page). We may need to introduce dynamic per-CPU data allocation for this...

Jan
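The arithmetic behind the page-size concern (assuming 64-bit pointers and 4KB pages):

    /* 4096 pointers * 8 bytes = 32KB = 8 pages -- still well above a
     * "nothing larger than a page post-boot" target, even though it is
     * roughly a 200x improvement over the ~6.5MB embedded array. */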
On 10/13/2011 02:24 PM, Jan Beulich wrote:
> [...]
>> Making that an array of pointers, which is allocated on an
>> as-needed basis, should reduce that requirement a great deal.
> That would help, but would probably not suffice (since a NR_CPUS-sized
> array of pointers is still going to be larger than a page). We may need
> to introduce dynamic per-CPU data allocation for this...

Couldn't the run-queue data be dynamically allocated and the pcpu-data of credit2 contain a pointer to it?

Juergen
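A sketch of this suggestion (names are illustrative; the natural place for the allocation would be the scheduler's alloc_pdata hook, assuming that is where per-CPU scheduler data gets set up):

    /* Per-CPU private data for credit2: just a pointer to the (shared,
     * separately allocated) runqueue, so nothing NR_CPUS-sized exists. */
    struct csched_pcpu {
        struct csched_runqueue_data *rqd;  /* runqueue this CPU belongs to */
        /* ... */
    };

    static void *csched_alloc_pdata(const struct scheduler *ops, int cpu)
    {
        struct csched_pcpu *spc = xmalloc(struct csched_pcpu);

        if ( spc == NULL )
            return NULL;
        memset(spc, 0, sizeof(*spc));
        /* spc->rqd gets filled in when the CPU is assigned a runqueue. */
        return spc;
    }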
>>> On 13.10.11 at 14:54, Juergen Gross <juergen.gross@ts.fujitsu.com> wrote:
> On 10/13/2011 02:24 PM, Jan Beulich wrote:
>> [...] We may need to introduce dynamic per-CPU data allocation for this...
>
> Couldn't the run-queue data be dynamically allocated and the pcpu-data of
> credit2 contain a pointer to it?

Not if the per-CPU data is also per scheduler instance (which I can't easily tell whether it is).

Jan
On 10/13/2011 04:17 PM, Jan Beulich wrote:
>> [...] Couldn't the run-queue data be dynamically allocated and the
>> pcpu-data of credit2 contain a pointer to it?
> Not if the per-CPU data is also per scheduler instance (which I can't
> easily tell whether it is).

Each cpu has only one dynamically allocated scheduler pcpu-data area, which is anchored in the per_cpu area of that cpu.

Juergen
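Concretely, the arrangement being described (a simplified sketch; the anchor here is modeled on the sched_priv pointer in the per-CPU struct schedule_data). Since each CPU belongs to exactly one cpupool at a time, a single slot per CPU suffices even with multiple scheduler instances:

    /* Simplified -- the real struct schedule_data has more members. */
    struct schedule_data {
        void *sched_priv;   /* per-CPU data of whichever scheduler
                             * currently owns this CPU */
        /* ... */
    };

    DEFINE_PER_CPU(struct schedule_data, schedule_data);

    /* Credit2 could then reach its runqueue via two indirections,
     * with no NR_CPUS-sized array anywhere: */
    #define CSCHED_PCPU(cpu) \
        ((struct csched_pcpu *)per_cpu(schedule_data, cpu).sched_priv)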