There are a couple of issues with building the Hypervisor with
max_phys_cpus=128 for x86_64. (Note that this was on a 3.1 base, but
unstable appears to have the same issue, at least with the first part.)

First is a build assertion: the page_info and shadow_page_info
structures get out of sync in size due to the presence of cpumask_t in
the page_info structure.

A possible fix is to tack the following onto the end of the
shadow_page_info structure:

--- xen/arch/x86/mm/shadow/private.h.orig  2007-12-06 12:48:38.000000000 -0500
+++ xen/arch/x86/mm/shadow/private.h       2008-08-12 12:52:49.000000000 -0400
@@ -243,6 +243,12 @@ struct shadow_page_info
         /* For non-pinnable shadows, a higher entry that points at us */
         paddr_t up;
     };
+#if NR_CPUS > 64
+    /* Need to add some padding to match struct page_info size,
+     * if cpumask_t is larger than a long
+     */
+    u8 padding[sizeof(cpumask_t)-sizeof(long)];
+#endif
 };
 
 /* The structure above *must* be the same size as a struct page_info

The other issue is at runtime: a fault when trying to bring up cpu 126.
It seems the GDT space reserved is not quite enough to hold the per-CPU
entries. Crude fix (awaiting test results, so not sure that this is
sufficient):

--- xen/include/asm-x86/desc.h.orig  2007-12-06 12:48:39.000000000 -0500
+++ xen/include/asm-x86/desc.h       2008-07-31 13:19:52.000000000 -0400
@@ -5,7 +5,11 @@
  * Xen reserves a memory page of GDT entries.
  * No guest GDT entries exist beyond the Xen reserved area.
  */
+#if MAX_PHYS_CPUS > 64
+#define NR_RESERVED_GDT_PAGES 2
+#else
 #define NR_RESERVED_GDT_PAGES 1
+#endif
 #define NR_RESERVED_GDT_BYTES (NR_RESERVED_GDT_PAGES * PAGE_SIZE)
 #define NR_RESERVED_GDT_ENTRIES (NR_RESERVED_GDT_BYTES / 8)

Bill

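For context, a rough sketch of the arithmetic behind the cpu 126
failure, assuming each CPU consumes four 8-byte GDT slots (a 16-byte
TSS descriptor plus a 16-byte LDT descriptor on x86_64) and that the
per-CPU range starts a few slots into the reserved page; the PER_CPU_*
constants below are assumptions for illustration, not the real desc.h
values:

/* Illustrative only: base offset and per-CPU slot count are assumed. */
#define GDT_SLOTS_PER_PAGE  (PAGE_SIZE / 8)  /* 512 slots in one 4kB page */
#define PER_CPU_GDT_BASE    8                /* fixed Xen descriptors     */
#define PER_CPU_GDT_SLOTS   4                /* 16-byte TSS + 16-byte LDT */

/* CPU n fits only if
 *     PER_CPU_GDT_BASE + PER_CPU_GDT_SLOTS * (n + 1) <= GDT_SLOTS_PER_PAGE
 * i.e. n <= 125 with the numbers above -- which would explain a fault
 * at cpu 126 when NR_RESERVED_GDT_PAGES is 1. */
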
Both seem to be hacks to get to 128 CPUs, without consideration of how
to go beyond that, or perhaps even drop the fixed (compile-time) limit
altogether. Since we have to expect to be run on larger systems not too
far into the future, I think it rather needs to be explored how to
address these issues (and any potential others) in a fully scalable
way.

Jan

>>> Bill Burns <bburns@redhat.com> 12.08.08 20:41 >>>
There are a couple of issues with building the Hypervisor with
max_phys_cpus=128 for x86_64. [...]

At 14:41 -0400 on 12 Aug (1218552070), Bill Burns wrote:
> First is a build assertion: the page_info and
> shadow_page_info structures get out of sync in size
> due to the presence of cpumask_t in the page_info
> structure.
>
> A possible fix is to tack the following onto
> the end of the shadow_page_info structure:

Yep, that'll sort it out fine. I don't think the #if is even needed,
because a cpumask is always at least the size of a long. Or you could
add a "cpumask_t _unused;" to the union with mbz in it, since that's
where the sizes get out of sync.

Cheers,

Tim.

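A minimal sketch of that second option, showing only the relevant
union; the other members of shadow_page_info are elided and the exact
placement is illustrative:

    union {
        /* Must be zero: the build-time check requires this to sit at
         * the same offset as the owner field in struct page_info. */
        unsigned long mbz;
        /* Dummy member, sized like the cpumask_t in struct page_info,
         * so this union (and hence the whole structure) grows in step
         * with page_info whenever cpumask_t is wider than a long. */
        cpumask_t _unused;
        /* ... existing members of this union elided ... */
    };
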
At 09:21 +0100 on 13 Aug (1218619274), Jan Beulich wrote:
> Both seem to be hacks to get to 128 CPUs, without consideration of how
> to go beyond that

I think the shadow_page_info one is a general fix for my implicit
assumption that sizeof(cpumask_t) == sizeof(long).

Tim.

On 13/8/08 09:22, "Tim Deegan" <Tim.Deegan@citrix.com> wrote:
> At 09:21 +0100 on 13 Aug (1218619274), Jan Beulich wrote:
>> Both seem to be hacks to get to 128 CPUs, without consideration of how
>> to go beyond that
>
> I think the shadow_page_info one is a general fix for my implicit
> assumption that sizeof(cpumask_t) == sizeof(long).

Do some fields after the cpumask need to line up in both structures?
Placing a dummy cpumask in the shadow_page structure might make most
sense.

For the other one I'll have to think a bit. The need for GDT entries
per CPU currently obviously means scaling much past a few hundred CPUs
is going to be difficult.

 -- Keir

>>> Keir Fraser <keir.fraser@eu.citrix.com> 13.08.08 10:26 >>>
> On 13/8/08 09:22, "Tim Deegan" <Tim.Deegan@citrix.com> wrote:
>> At 09:21 +0100 on 13 Aug (1218619274), Jan Beulich wrote:
>>> Both seem to be hacks to get to 128 CPUs, without consideration of how
>>> to go beyond that
>>
>> I think the shadow_page_info one is a general fix for my implicit
>> assumption that sizeof(cpumask_t) == sizeof(long).
>
> Do some fields after the cpumask need to line up in both structures?
> Placing a dummy cpumask in the shadow_page structure might make most
> sense.
>
> For the other one I'll have to think a bit. The need for GDT entries
> per CPU currently obviously means scaling much past a few hundred CPUs
> is going to be difficult.

But the cpumask-in-page_info is a scalability concern, too - systems
with many CPUs will tend to have a lot of memory, and the growing
overhead of the page_info array may become an issue then, too. Page
clustering may be an option to reduce/eliminate the growth, though I
didn't spend much thought on this or possible alternatives.

Jan

On 13/8/08 09:45, "Jan Beulich" <jbeulich@novell.com> wrote:
> But the cpumask-in-page_info is a scalability concern, too - systems
> with many CPUs will tend to have a lot of memory, and the growing
> overhead of the page_info array may become an issue then, too. Page
> clustering may be an option to reduce/eliminate the growth, though I
> didn't spend much thought on this or possible alternatives.

An extra 8 bytes per page per 64 CPUs is hardly a concern I think.
We're talking an overhead of 32 bytes per megabyte per CPU. The concern
over growing page_info array with growing memory is fallacious -- the
overhead is a constant fraction of total memory, if #CPUs is held
constant.

 -- Keir

On 13/8/08 09:47, "Keir Fraser" <keir.fraser@eu.citrix.com> wrote:
> On 13/8/08 09:45, "Jan Beulich" <jbeulich@novell.com> wrote:
>> But the cpumask-in-page_info is a scalability concern, too [...]
>
> An extra 8 bytes per page per 64 CPUs is hardly a concern I think.
> We're talking an overhead of 32 bytes per megabyte per CPU.

Put another way, at 512 CPUs the cpumasks would incur an overhead of
<2% of total memory (a 512-CPU cpumask is 64 bytes per 4kB page, i.e.
about 1.6%). It's only really beyond that threshold that I'd be
concerned. The fact is it'll be a good while before 512 CPUs is
concerning us, and we'll have plenty of other scalability concerns, no
doubt, by that point.

 -- Keir

Keir Fraser wrote:
> On 13/8/08 09:22, "Tim Deegan" <Tim.Deegan@citrix.com> wrote:
>> At 09:21 +0100 on 13 Aug (1218619274), Jan Beulich wrote:
>>> Both seem to be hacks to get to 128 CPUs, without consideration of how
>>> to go beyond that
>>
>> I think the shadow_page_info one is a general fix for my implicit
>> assumption that sizeof(cpumask_t) == sizeof(long).
>
> Do some fields after the cpumask need to line up in both structures?
> Placing a dummy cpumask in the shadow_page structure might make most
> sense.

Yes, there is a check that a field of page_info and a field of
shadow_page_info are at the same offset. Both compile-time checks are
in private.h.

> For the other one I'll have to think a bit. The need for GDT entries
> per CPU currently obviously means scaling much past a few hundred CPUs
> is going to be difficult.

Yes, would like something better here. And as I said, we don't know yet
that just adding the additional page solves anything.

Bill

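For reference, a sketch of what those two compile-time checks amount
to, assuming a BUILD_BUG_ON-style macro and the mbz / u.inuse._domain
field names; treat it as an illustration of the checks rather than the
exact code in private.h:

static inline void shadow_check_page_struct_offsets(void)
{
    /* The two structures overlay the same frametable slots, so they
     * must be exactly the same size... */
    BUILD_BUG_ON(sizeof(struct shadow_page_info) !=
                 sizeof(struct page_info));
    /* ...and the must-be-zero field has to sit at the same offset as
     * the owner field of an ordinary page. */
    BUILD_BUG_ON(offsetof(struct shadow_page_info, mbz) !=
                 offsetof(struct page_info, u.inuse._domain));
}
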
On 13/8/08 11:23, "Bill Burns" <bburns@redhat.com> wrote:
>> For the other one I'll have to think a bit. The need for GDT entries
>> per CPU currently obviously means scaling much past a few hundred
>> CPUs is going to be difficult.
>
> Yes, would like something better here. And as I said, we don't know
> yet that just adding the additional page solves anything.

How many CPUs do you currently need/want to support?

 -- Keir

Keir Fraser wrote:
> On 13/8/08 11:23, "Bill Burns" <bburns@redhat.com> wrote:
>>> For the other one I'll have to think a bit. The need for GDT entries
>>> per CPU currently obviously means scaling much past a few hundred
>>> CPUs is going to be difficult.
>>
>> Yes, would like something better here. And as I said, we don't know
>> yet that just adding the additional page solves anything.
>
> How many CPUs do you currently need/want to support?

Currently just looking to get 128 working. But it would be nice to have
some proper sizing, or even detection of running out. There is a 'last'
GDT entry or some such #define that is never used (at least in the 3.1
code base).

Bill

On 13/8/08 11:53, "Bill Burns" <bburns@redhat.com> wrote:
>> How many CPUs do you currently need/want to support?
>
> Currently just looking to get 128 working. But it would be nice to
> have some proper sizing, or even detection of running out. There is a
> 'last' GDT entry or some such #define that is never used (at least in
> the 3.1 code base).

I think your 'two pages' change will probably work. Then we just need a
run-time check, when bringing a CPU online, that there is space in the
GDT for its entries.

 -- Keir

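Something along these lines, perhaps; a sketch only, where
cpu_gdt_last_slot() is a made-up helper standing in for however desc.h
actually computes the last descriptor slot a given CPU needs, and the
bound is meant to be the 'last reserved entry' #define mentioned above
(whatever its exact name):

/* Sketch of a bring-up check; cpu_gdt_last_slot() is hypothetical. */
static int check_gdt_space(unsigned int cpu)
{
    if ( cpu_gdt_last_slot(cpu) > LAST_RESERVED_GDT_ENTRY )
    {
        printk("CPU%u: reserved GDT area too small for its entries\n", cpu);
        return -ENOSPC;
    }
    return 0;
}
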