Implement the structure that will be shared with hvmloader (with HVMs)
and directly with the VMs (with PV).

-dulloor

Signed-off-by: Dulloor <dulloor@gmail.com>
Keir Fraser
2010-Jul-05 07:39 UTC
Re: [Xen-devel] [XEN][vNUMA][PATCH 3/9] public interface
This patch is incomplete.

On 03/07/2010 00:54, "Dulloor" <dulloor@gmail.com> wrote:
> Implement the structure that will be shared with hvmloader (with HVMs)
> and directly with the VMs (with PV).
>
> -dulloor
>
> Signed-off-by: Dulloor <dulloor@gmail.com>
oops .. sorry, here it is.

-dulloor

On Mon, Jul 5, 2010 at 12:39 AM, Keir Fraser <keir.fraser@eu.citrix.com> wrote:
> This patch is incomplete.
Keir Fraser
2010-Jul-05 10:23 UTC
Re: [Xen-devel] [XEN][vNUMA][PATCH 3/9] public interface
> +#ifndef __XEN_PUBLIC_DOM_NUMA_X86_H__
> +#define __XEN_PUBLIC_DOM_NUMA_X86_H__
> +
> +/* struct xc_cpumask : static structure */
> +#define XEN_CPUMASK_BITS_PER_BYTE 8
> +#define XEN_CPUMASK_BITS_TO_BYTES(bits) \
> +    (((bits)+XEN_CPUMASK_BITS_PER_BYTE-1)/XEN_CPUMASK_BITS_PER_BYTE)
> +
> +#define XEN_MAX_VCPUS 128
> +#define XEN_CPUMASK_DECLARE_BITMAP(name,bits) \
> +    uint8_t name[XEN_CPUMASK_BITS_TO_BYTES(bits)]
> +struct xen_cpumask{ XEN_CPUMASK_DECLARE_BITMAP(bits, XEN_MAX_VCPUS); };
> +#define XEN_CPUMASK_BITMAP(maskp) ((maskp)->bits)

What are xc_cpumask (a libxc concept) related definitions doing in a
hypervisor public header? These aren't even used in this header file. Below
I suggest a vcpu_to_vnode[] array, which probably gets rid of the need for
this bitmask stuff anyway.

> +#define XEN_MAX_VNODES 4

A small number to be statically defined. Better to make your structure
extensible I think, perhaps including pointers out to vnode-indexed arrays?

> +/* vnodes are 1GB-aligned */
> +#define XEN_MIN_VNODE_SHIFT (30)
> +
> +struct xen_vnode_info {
> +    uint8_t vnode_id;
> +    uint8_t mnode_id;

How do vnodes and mnodes differ? Why should a guest care about or need to
know about both, whatever they are?

> +    uint32_t nr_pages;

Not an address range? Is that implicitly worked out somehow? Should be
commented, but even better just a <start,end> range explicitly given?

> +    struct xen_cpumask vcpu_mask; /* vnode_to_vcpumask */
> +};

Why not have a single integer array vcpu_to_vnode[] in the main
xen_domain_numa_info structure?

> +#define XEN_DOM_NUMA_INTERFACE_VERSION 0x01
> +
> +#define XEN_DOM_NUMA_CONFINE  0x01
> +#define XEN_DOM_NUMA_SPLIT    0x02
> +#define XEN_DOM_NUMA_STRIPE   0x03
> +#define XEN_DOM_NUMA_DONTCARE 0x04

What should the guest do with these? You're rather light on comments in this
critical interface-defining header file.

> +struct xen_domain_numa_info {
> +    uint8_t version;
> +    uint8_t type;
> +
> +    uint8_t nr_vcpus;
> +    uint8_t nr_vnodes;
> +
> +    /* XXX: hvm_info_table uses 32-bit for high_mem_pgend,
> +     * so we should be fine 32-bits too */
> +    uint32_t nr_pages;

If this is going to be visible outside HVMloader (e.g., in PV guests) then
just make it a uint64_aligned_t and be done with it.

> +    /* Only (nr_vnodes) entries are filled */
> +    struct xen_vnode_info vnode_info[XEN_MAX_VNODES];
> +    /* Only (nr_vnodes*nr_vnodes) entries are filled */
> +    uint8_t vnode_distance[XEN_MAX_VNODES*XEN_MAX_VNODES];

As suggested above, make these pointers out to dynamic-sized arrays. No need
for XEN_MAX_VNODES at all.

 -- Keir

> +};
> +
> +#endif
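For readers following the cpumask discussion, a minimal standalone sketch (not
part of the patch) of what the quoted macros amount to: XEN_CPUMASK_BITS_TO_BYTES
rounds a bit count up to whole bytes, and a vCPU's bit is reached with ordinary
shift/mask arithmetic. The set/test helpers below are illustrative additions;
the header itself defines none.

    #include <stdint.h>
    #include <stdio.h>

    #define XEN_CPUMASK_BITS_PER_BYTE 8
    #define XEN_CPUMASK_BITS_TO_BYTES(bits) \
        (((bits)+XEN_CPUMASK_BITS_PER_BYTE-1)/XEN_CPUMASK_BITS_PER_BYTE)
    #define XEN_MAX_VCPUS 128

    struct xen_cpumask {
        uint8_t bits[XEN_CPUMASK_BITS_TO_BYTES(XEN_MAX_VCPUS)];
    };

    /* Illustrative helpers only; the patch defines no set/test macros. */
    static void vcpu_set(struct xen_cpumask *m, unsigned int vcpu)
    {
        m->bits[vcpu / XEN_CPUMASK_BITS_PER_BYTE] |=
            1u << (vcpu % XEN_CPUMASK_BITS_PER_BYTE);
    }

    static int vcpu_test(const struct xen_cpumask *m, unsigned int vcpu)
    {
        return (m->bits[vcpu / XEN_CPUMASK_BITS_PER_BYTE] >>
                (vcpu % XEN_CPUMASK_BITS_PER_BYTE)) & 1;
    }

    int main(void)
    {
        struct xen_cpumask mask = { {0} };   /* 128 bits -> 16 bytes */
        vcpu_set(&mask, 5);
        printf("bytes=%zu vcpu5=%d vcpu6=%d\n",
               sizeof(mask.bits), vcpu_test(&mask, 5), vcpu_test(&mask, 6));
        return 0;
    }

With XEN_MAX_VCPUS at 128 the mask costs 16 bytes per vnode, whereas a
vcpu_to_vnode[] table costs one byte per vCPU in total, which is the
simplification being suggested here.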
On Mon, Jul 5, 2010 at 3:23 AM, Keir Fraser <keir.fraser@eu.citrix.com> wrote:
>> +/* struct xc_cpumask : static structure */
>> +#define XEN_CPUMASK_BITS_PER_BYTE 8
>> +#define XEN_CPUMASK_BITS_TO_BYTES(bits) \
>> +    (((bits)+XEN_CPUMASK_BITS_PER_BYTE-1)/XEN_CPUMASK_BITS_PER_BYTE)
>> +
>> +#define XEN_MAX_VCPUS 128
>> +#define XEN_CPUMASK_DECLARE_BITMAP(name,bits) \
>> +    uint8_t name[XEN_CPUMASK_BITS_TO_BYTES(bits)]
>> +struct xen_cpumask{ XEN_CPUMASK_DECLARE_BITMAP(bits, XEN_MAX_VCPUS); };
>> +#define XEN_CPUMASK_BITMAP(maskp) ((maskp)->bits)
>
> What are xc_cpumask (a libxc concept) related definitions doing in a
> hypervisor public header? These aren't even used in this header file. Below
> I suggest a vcpu_to_vnode[] array, which probably gets rid of the need for
> this bitmask stuff anyway.

Stale comment with xc_cpumask .. sorry!
I did think of the vcpu_to_vnode array, but then we use the bitmask in
hvm_info anyway (with vcpu_online). I thought I could at least fold them
into a single structure. I could change that if you insist.

>> +#define XEN_MAX_VNODES 4
>
> A small number to be statically defined. Better to make your structure
> extensible I think, perhaps including pointers out to vnode-indexed arrays?

This structure is passed in the hvm_info page. Should I use offset/len for
these dynamic-sized, vnode-indexed arrays?

>> +/* vnodes are 1GB-aligned */
>> +#define XEN_MIN_VNODE_SHIFT (30)
>> +
>> +struct xen_vnode_info {
>> +    uint8_t vnode_id;
>> +    uint8_t mnode_id;
>
> How do vnodes and mnodes differ? Why should a guest care about or need to
> know about both, whatever they are?

vnode_id is the node-id in the guest and mnode_id refers to the real node
it maps to. Actually I don't need vnode_id. Will take that out.

>> +    uint32_t nr_pages;
>
> Not an address range? Is that implicitly worked out somehow? Should be
> commented, but even better just a <start,end> range explicitly given?

The node address ranges are assumed contiguous and increasing. I will
change that to <start,end> ranges.

>> +    struct xen_cpumask vcpu_mask; /* vnode_to_vcpumask */
>> +};
>
> Why not have a single integer array vcpu_to_vnode[] in the main
> xen_domain_numa_info structure?

No specific reason, except that all the vnode-related info is folded into
a single structure. I will change that if you insist.

>> +#define XEN_DOM_NUMA_INTERFACE_VERSION 0x01
>> +
>> +#define XEN_DOM_NUMA_CONFINE  0x01
>> +#define XEN_DOM_NUMA_SPLIT    0x02
>> +#define XEN_DOM_NUMA_STRIPE   0x03
>> +#define XEN_DOM_NUMA_DONTCARE 0x04
>
> What should the guest do with these? You're rather light on comments in
> this critical interface-defining header file.

I will add comments. The intent is to share this information with the
hypervisor and PV guests (for ballooning).

>> +struct xen_domain_numa_info {
>> +    uint8_t version;
>> +    uint8_t type;
>> +
>> +    uint8_t nr_vcpus;
>> +    uint8_t nr_vnodes;
>> +
>> +    /* XXX: hvm_info_table uses 32-bit for high_mem_pgend,
>> +     * so we should be fine 32-bits too */
>> +    uint32_t nr_pages;
>
> If this is going to be visible outside HVMloader (e.g., in PV guests) then
> just make it a uint64_aligned_t and be done with it.

Will do that.

>> +    /* Only (nr_vnodes) entries are filled */
>> +    struct xen_vnode_info vnode_info[XEN_MAX_VNODES];
>> +    /* Only (nr_vnodes*nr_vnodes) entries are filled */
>> +    uint8_t vnode_distance[XEN_MAX_VNODES*XEN_MAX_VNODES];
>
> As suggested above, make these pointers out to dynamic-sized arrays. No
> need for XEN_MAX_VNODES at all.

In general, I realise I should add more comments.
Keir Fraser
2010-Jul-06 12:57 UTC
Re: [Xen-devel] [XEN][vNUMA][PATCH 3/9] public interface
On 06/07/2010 06:57, "Dulloor" <dulloor@gmail.com> wrote:
>> What are xc_cpumask (a libxc concept) related definitions doing in a
>> hypervisor public header? These aren't even used in this header file.
>> Below I suggest a vcpu_to_vnode[] array, which probably gets rid of the
>> need for this bitmask stuff anyway.
>
> Stale comment with xc_cpumask .. sorry!
> I did think of the vcpu_to_vnode array, but then we use the bitmask in
> hvm_info anyway (with vcpu_online). I thought I could at least fold them
> into a single structure. I could change that if you insist.

I think overall vcpu_to_vnode[] is the better way round, unless the per-node
vcpu maps are really particularly handy for some reason.

>> A small number to be statically defined. Better to make your structure
>> extensible I think, perhaps including pointers out to vnode-indexed
>> arrays?
>
> This structure is passed in the hvm_info page. Should I use offset/len for
> these dynamic-sized, vnode-indexed arrays?

The 'hvm_info page' is a slightly restrictive concept really. Actually the
hvm_info data gets plopped down at a fixed location below 1MB in the guest's
memory map, and you can just extend from there even across a page boundary.
I would simply include pointers out to the dynamically-sized arrays; and
their sizes should be implicit given nr_vnodes.

>> How do vnodes and mnodes differ? Why should a guest care about or need to
>> know about both, whatever they are?
>
> vnode_id is the node-id in the guest and mnode_id refers to the real node
> it maps to. Actually I don't need vnode_id. Will take that out.

Yes, that's a completely pointless, unnecessary distinction.

>>> +    uint32_t nr_pages;
>>
>> Not an address range? Is that implicitly worked out somehow? Should be
>> commented, but even better just a <start,end> range explicitly given?
>
> The node address ranges are assumed contiguous and increasing. I will
> change that to <start,end> ranges.

Thanks.

>>> +    struct xen_cpumask vcpu_mask; /* vnode_to_vcpumask */
>>> +};
>>
>> Why not have a single integer array vcpu_to_vnode[] in the main
>> xen_domain_numa_info structure?
>
> No specific reason, except that all the vnode-related info is folded into
> a single structure. I will change that if you insist.

Personally I think it would be neater to change it. A whole bunch of cpumask
machinery disappears.

 -- Keir
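Keir suggests pointers out to the arrays; the v2 patch (quoted later in the
thread) went with a trailing data[] blob whose internal offsets follow
entirely from nr_vcpus and nr_vnodes, so no XEN_MAX_VNODES is needed. A
hypothetical sketch of that latter reading — field names and widths are
placeholders, not the posted patch, and real code would pack/align the
structures explicitly:

    #include <stddef.h>
    #include <stdint.h>

    struct xen_vnode_info {
        uint8_t  node_id;          /* placeholder field */
        uint64_t start, end;       /* vnode range, in pages */
    };

    struct xen_domain_numa_info {
        uint8_t version;
        uint8_t type;
        uint8_t nr_vcpus;
        uint8_t nr_vnodes;
        /* Trailing data, in order:
         *   struct xen_vnode_info vnode_info[nr_vnodes];
         *   uint8_t vcpu_to_vnode[nr_vcpus];
         *   uint8_t vnode_distance[nr_vnodes * nr_vnodes];
         */
        uint8_t data[];
    };

    static struct xen_vnode_info *vnode_info(struct xen_domain_numa_info *ni)
    {
        return (struct xen_vnode_info *)ni->data;
    }

    static uint8_t *vcpu_to_vnode(struct xen_domain_numa_info *ni)
    {
        return ni->data + ni->nr_vnodes * sizeof(struct xen_vnode_info);
    }

    static uint8_t *vnode_distance(struct xen_domain_numa_info *ni)
    {
        return vcpu_to_vnode(ni) + ni->nr_vcpus;
    }

    static size_t numa_info_size(const struct xen_domain_numa_info *ni)
    {
        return sizeof(*ni)
             + ni->nr_vnodes * sizeof(struct xen_vnode_info)
             + ni->nr_vcpus
             + (size_t)ni->nr_vnodes * ni->nr_vnodes;
    }

hvmloader (or the PV start-of-day code) would then parse numa_info_size(ni)
bytes from the fixed location, crossing a page boundary if necessary, as
described above.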
Thanks for the comments. I will make changes over the weekend and post v2
patches.

thanks,
dulloor
Interface definition. Structure that will be shared with hvmloader (with HVMs)
and directly with the VMs (with PV).

-dulloor

Signed-off-by: Dulloor <dulloor@gmail.com>
Andre Przywara
2010-Aug-03 12:40 UTC
Re: [xen-devel][vNUMA v2][PATCH 2/8] public interface
Dulloor wrote:
> Interface definition. Structure that will be shared with hvmloader (with
> HVMs) and directly with the VMs (with PV).
>
> -dulloor
>
> Signed-off-by: Dulloor <dulloor@gmail.com>
>
> +struct xen_vnode_info {
> +    uint8_t mnode_id;  /* physical node vnode is allocated from */
> +    uint32_t start;    /* start of the vnode range (in pages) */
> +    uint32_t end;      /* end of the vnode range (in pages) */
> +};

Is there a particular reason you made the start and end members 32-bit only?
This effectively limits the amount of memory to 16TB, which is not that far
from what is in use today (if larger NUMA machines are not already shipping
with that much memory). So can you push these variables to 64 bits?

> +struct xen_domain_numa_info {
> +    uint8_t version;   /* Interface version */
> +    uint8_t type;      /* VM memory allocation scheme (see above) */
> +
> +    uint8_t nr_vcpus;
> +    uint8_t nr_vnodes;

The same here. I'd like to see at least nr_vcpus prepared for more than 256
vCPUs. Please keep in mind that if Dom0 is also NUMA-aware, by default the
whole host's resources are first given to Dom0, so the NUMA info should not
be restricted to the size of a typical guest only.

Regards,
Andre.

--
Andre Przywara
AMD-Operating System Research Center (OSRC), Dresden, Germany
Tel: +49 351 448-3567-12
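For reference, the 16TB figure follows directly from 32-bit page indices and
4KiB pages; a tiny check (assuming, as the comments say, that start/end hold
page frame numbers):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint64_t max_pages = UINT64_C(1) << 32;   /* largest 32-bit page count */
        uint64_t bytes     = max_pages << 12;     /* 4KiB pages                */
        printf("limit = %llu TiB\n",
               (unsigned long long)(bytes >> 40)); /* prints 16 */
        return 0;
    }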
Andre Przywara
2010-Aug-03 13:37 UTC
Re: [xen-devel][vNUMA v2][PATCH 2/8] public interface
Dulloor wrote:
> Interface definition. Structure that will be shared with hvmloader (with
> HVMs) and directly with the VMs (with PV).
>
> -dulloor
>
> Signed-off-by: Dulloor <dulloor@gmail.com>
>
> +/* vnodes are 1GB-aligned */
> +#define XEN_MIN_VNODE_SHIFT (30)

Why that? Do you mean guest memory here? Isn't that a bit restrictive? What
if the remaining system resources do not allow this? What about a 5GB guest
on 2 nodes? In AMD hardware the minimum shift is 16MB, so I think 24 bits
would be better.

> +struct xen_vnode_info {
> +    uint8_t mnode_id;  /* physical node vnode is allocated from */
> +    uint32_t start;    /* start of the vnode range (in pages) */
> +    uint32_t end;      /* end of the vnode range (in pages) */
> +};
> +
> +struct xen_domain_numa_info {
> +    uint8_t version;   /* Interface version */
> +    uint8_t type;      /* VM memory allocation scheme (see above) */
> +
> +    uint8_t nr_vcpus;

Isn't that redundant with info stored somewhere else (for instance in the
hvm_info table)?

> +    uint8_t nr_vnodes;
> +    /* data[] has the following entries :
> +     * //Only (nr_vnodes) entries are filled, each sizeof(struct xen_vnode_info)
> +     * struct xen_vnode_info vnode_info[nr_vnodes];

Why would the guest need that info (physical node, start and end) here?
Wouldn't just the size of the node's memory be sufficient?

Regards,
Andre.

> +     * //Only (nr_vcpus) entries are filled, each sizeof(uint8_t)
> +     * uint8_t vcpu_to_vnode[nr_vcpus];
> +     * //Only (nr_vnodes*nr_vnodes) entries are filled, each sizeof(uint8_t)
> +     * uint8_t vnode_distance[nr_vnodes*nr_vnodes];
> +     */
> +    uint8_t data[0];
> +};
> +
> +#endif

--
Andre Przywara
AMD-Operating System Research Center (OSRC), Dresden, Germany
Tel: +49 351 448-3567-12
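To put numbers on the 5GB-on-2-nodes concern, a small sketch (assumptions:
two vnodes, and the per-node share simply rounded down to the alignment
implied by the minimum vnode shift) comparing a 1GB shift with the 256MB
granularity mentioned later in the thread:

    #include <stdint.h>
    #include <stdio.h>

    /* Round a per-node share down to the given alignment shift. */
    static uint64_t node_share(uint64_t guest_bytes, unsigned nodes,
                               unsigned shift)
    {
        uint64_t align = UINT64_C(1) << shift;
        return ((guest_bytes / nodes) / align) * align;
    }

    int main(void)
    {
        uint64_t guest = UINT64_C(5) << 30;   /* 5GB guest, 2 vnodes */
        /* With a 30-bit (1GB) shift each node gets only 2GB and 1GB is left
         * over; with a 28-bit (256MB) shift each node gets 2.5GB exactly. */
        printf("1GB-aligned  : %llu MB/node\n",
               (unsigned long long)(node_share(guest, 2, 30) >> 20));
        printf("256MB-aligned: %llu MB/node\n",
               (unsigned long long)(node_share(guest, 2, 28) >> 20));
        return 0;
    }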
On 03/08/2010 14:37, "Andre Przywara" <andre.przywara@amd.com> wrote:
>> +struct xen_vnode_info {
>> +    uint8_t mnode_id;  /* physical node vnode is allocated from */
>> +    uint32_t start;    /* start of the vnode range (in pages) */
>> +    uint32_t end;      /* end of the vnode range (in pages) */
>> +};
>> +
>> +struct xen_domain_numa_info {
>> +    uint8_t version;   /* Interface version */
>> +    uint8_t type;      /* VM memory allocation scheme (see above) */
>> +
>> +    uint8_t nr_vcpus;
>
> Isn't that redundant with info stored somewhere else (for instance in the
> hvm_info table)?
>
>> +    uint8_t nr_vnodes;
>> +    /* data[] has the following entries :
>> +     * //Only (nr_vnodes) entries are filled, each sizeof(struct xen_vnode_info)
>> +     * struct xen_vnode_info vnode_info[nr_vnodes];
>
> Why would the guest need that info (physical node, start and end) here?
> Wouldn't just the size of the node's memory be sufficient?

I would expect the guest to see nodes 0 to nr_vnodes-1, and the mnode_id
could go away. Do you think the <start,end> ranges are also unnecessary, and
could really be replaced by just a size? I asked for that to be changed the
other way last time round.

 -- Keir
On Tue, Aug 3, 2010 at 5:40 AM, Andre Przywara <andre.przywara@amd.com> wrote:
> Dulloor wrote:
>> +struct xen_vnode_info {
>> +    uint8_t mnode_id;  /* physical node vnode is allocated from */
>> +    uint32_t start;    /* start of the vnode range (in pages) */
>> +    uint32_t end;      /* end of the vnode range (in pages) */
>> +};
>
> Is there a particular reason you made the start and end members 32-bit
> only? This effectively limits the amount of memory to 16TB, which is not
> that far from what is in use today. So can you push these variables to 64
> bits?

Yeah, Keir had asked for this too. I guess I missed it. Will do that.

>> +struct xen_domain_numa_info {
>> +    uint8_t version;   /* Interface version */
>> +    uint8_t type;      /* VM memory allocation scheme (see above) */
>> +
>> +    uint8_t nr_vcpus;
>> +    uint8_t nr_vnodes;
>
> The same here. I'd like to see at least nr_vcpus prepared for more than 256
> vCPUs. Please keep in mind that if Dom0 is also NUMA-aware, by default the
> whole host's resources are first given to Dom0, so the NUMA info should not
> be restricted to the size of a typical guest only.

Sure. Shall I change nr_vcpus and nr_vnodes to 16 bits each?
And that's a good point about Dom0. I have a patch ready in the queue that
allows Dom0 to be made NUMA-aware. I would expect control and driver code in
Dom0 to benefit from that.

> Regards,
> Andre.
On Tue, Aug 3, 2010 at 6:37 AM, Andre Przywara <andre.przywara@amd.com> wrote:
> Dulloor wrote:
>> +/* vnodes are 1GB-aligned */
>> +#define XEN_MIN_VNODE_SHIFT (30)
>
> Why that? Do you mean guest memory here? Isn't that a bit restrictive? What
> if the remaining system resources do not allow this? What about a 5GB guest
> on 2 nodes? In AMD hardware the minimum shift is 16MB, so I think 24 bits
> would be better.

Linux has stricter restrictions on the minimum vnode shift (256MB afair). And
I remember one of the emails from Jan Beulich where the minimum node size was
discussed (but in another context). I will verify my facts and reply on this.

>> +struct xen_domain_numa_info {
>> +    uint8_t version;   /* Interface version */
>> +    uint8_t type;      /* VM memory allocation scheme (see above) */
>> +
>> +    uint8_t nr_vcpus;
>
> Isn't that redundant with info stored somewhere else (for instance in the
> hvm_info table)?

But, this being a dynamic structure, nr_vcpus and nr_vnodes determine the
actual size of the populated structure. It's just easier to use them in the
helper macros.

>> +    uint8_t nr_vnodes;
>> +    /* data[] has the following entries :
>> +     * //Only (nr_vnodes) entries are filled, each sizeof(struct xen_vnode_info)
>> +     * struct xen_vnode_info vnode_info[nr_vnodes];
>
> Why would the guest need that info (physical node, start and end) here?
> Wouldn't just the size of the node's memory be sufficient?

I changed that from size to (start, end) on the last review. size should be
sufficient since all nodes are contiguous. Will revert this back to use size.

> Regards,
> Andre.
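As an aside on consuming that layout: vnode_distance is a flattened
nr_vnodes x nr_vnodes matrix, so a guest-side consumer (hvmloader building a
SLIT-style table, say) would index it as row*nr_vnodes+col. An illustrative
sketch, not taken from the patch:

    #include <stdint.h>
    #include <stdio.h>

    /* Walk a vcpu_to_vnode[] table and a flattened distance matrix as laid
     * out in the quoted data[] block. */
    static void dump_numa_info(unsigned nr_vcpus, unsigned nr_vnodes,
                               const uint8_t *vcpu_to_vnode,
                               const uint8_t *vnode_distance)
    {
        for (unsigned v = 0; v < nr_vcpus; v++)
            printf("vcpu%u -> vnode%u\n", v, vcpu_to_vnode[v]);

        for (unsigned i = 0; i < nr_vnodes; i++)
            for (unsigned j = 0; j < nr_vnodes; j++)
                printf("distance(%u,%u) = %u\n", i, j,
                       vnode_distance[i * nr_vnodes + j]);
    }

    int main(void)
    {
        /* Two vnodes, four vcpus: 10 for local, 20 for remote distances. */
        const uint8_t v2n[4]  = { 0, 0, 1, 1 };
        const uint8_t dist[4] = { 10, 20, 20, 10 };
        dump_numa_info(4, 2, v2n, dist);
        return 0;
    }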
On Tue, Aug 3, 2010 at 7:10 AM, Keir Fraser <keir.fraser@eu.citrix.com> wrote:
> On 03/08/2010 14:37, "Andre Przywara" <andre.przywara@amd.com> wrote:
>> Why would the guest need that info (physical node, start and end) here?
>> Wouldn't just the size of the node's memory be sufficient?
>
> I would expect the guest to see nodes 0 to nr_vnodes-1, and the mnode_id
> could go away.

mnode_id maps the vnode to a particular physical node. This will be used by
the balloon driver in the VMs when the structure is passed as NUMA
enlightenment to PV guests and PV-on-HVM. I have a patch ready for that (once
we are done with this series).

> Do you think the <start,end> ranges are also unnecessary, and could really
> be replaced by just a size? I asked for that to be changed the other way
> last time round.

Actually, we don't need <start,end> ranges, since the nodes are always
contiguous, which could be made implicit in the domain builders (hvmloader
and PV enlightenment). But your previous suggestion to use <start,end> ranges
could still be seen as a style/consistency thing, which can be important in
interfaces. Your pick :)
On 03/08/2010 16:43, "Dulloor" <dulloor@gmail.com> wrote:
>> I would expect the guest to see nodes 0 to nr_vnodes-1, and the mnode_id
>> could go away.
>
> mnode_id maps the vnode to a particular physical node. This will be used by
> the balloon driver in the VMs when the structure is passed as NUMA
> enlightenment to PV guests and PV-on-HVM. I have a patch ready for that
> (once we are done with this series).

So what happens when the guest is migrated to another system with different
physical node ids? Is that never to be supported? I'm not sure why you
wouldn't hide the vnode-to-mnode translation in the hypervisor.

 -- Keir
On 03/08/2010 16:43, "Dulloor" <dulloor@gmail.com> wrote:
>> Do you think the <start,end> ranges are also unnecessary, and could really
>> be replaced by just a size? I asked for that to be changed the other way
>> last time round.
>
> Actually, we don't need <start,end> ranges, since the nodes are always
> contiguous, which could be made implicit in the domain builders (hvmloader
> and PV enlightenment). But your previous suggestion to use <start,end>
> ranges could still be seen as a style/consistency thing, which can be
> important in interfaces. Your pick :)

I think <start,end> is clearer.

 -- Keir
On Tue, Aug 3, 2010 at 8:52 AM, Keir Fraser <keir.fraser@eu.citrix.com> wrote:
> So what happens when the guest is migrated to another system with different
> physical node ids? Is that never to be supported? I'm not sure why you
> wouldn't hide the vnode-to-mnode translation in the hypervisor.

Right now, migration is not supported when a NUMA strategy is set. This is in
my TODO list (along with PoD support).

There are a few open questions wrt migration:
- What if the destination host is not NUMA, but the guest is NUMA? Do we fake
  those nodes? Or should we not select such a destination host to begin with?
- What if the destination host is not NUMA, but the guest has asked to be
  striped across a specific number of nodes (possibly for higher aggregate
  memory bandwidth)?
- What if the guest has asked for a particular memory strategy
  (split/confined/striped), but the destination host can't guarantee that
  (because of the distribution of free memory across the nodes)?

Once we answer these questions, we will know whether the vnode-to-mnode
translation is better exposed or not. And, if exposed, whether we could just
renegotiate the vnode-to-mnode translation at the destination host. I have
started working on this. But I have some other patches ready to go which we
might want to check in first - PV/Dom0 NUMA patches, ballooning support (see
below).

As such, the purpose of the vnode-to-mnode translation is for enlightened
guests to know where their underlying memory comes from, so that
over-provisioning features like ballooning are given a chance to maintain
this distribution. This way, all the hypervisor has to do is sanity-check
increase/exchange-reservation requests from the guests, and the guest can
decide whether to make an exact_node_request or not.

Other options, which would allow us to discard this translation, are:
- Ballooning at your own risk: leave ballooning as it is even when guests use
  a NUMA strategy (particularly split/confined).
- Hypervisor-level policies: let Xen do its best to maintain the guest nodes
  (using the gpfn ranges in guest nodes), which I think is not a
  clean/flexible solution.

But what I could do is leave out the vnode-to-mnode translation for now and
add it along with ballooning support (if/when we decide to add it). I will
just bump up the interface version at that time. That might give us time to
mull this over?
On 03/08/2010 18:24, "Dulloor" <dulloor@gmail.com> wrote:
> But what I could do is leave out the vnode-to-mnode translation for now and
> add it along with ballooning support (if/when we decide to add it). I will
> just bump up the interface version at that time. That might give us time to
> mull this over?

I would rename your 'nr_vnodes' field to 'nr_nodes' and your 'mnode_id' field
to 'node_id'. Then I'm happy. It's the unnecessary distinction between vnodes
and mnodes that I was troubled by -- the two can be collapsed together and
the interface is clearer that way.

 -- Keir
On Tue, Aug 3, 2010 at 12:52 PM, Keir Fraser <keir.fraser@eu.citrix.com> wrote:
> I would rename your 'nr_vnodes' field to 'nr_nodes' and your 'mnode_id'
> field to 'node_id'. Then I'm happy. It's the unnecessary distinction between
> vnodes and mnodes that I was troubled by -- the two can be collapsed
> together and the interface is clearer that way.

Sure, will do that.
Andre Przywara
2010-Aug-03 21:21 UTC
Re: [xen-devel][vNUMA v2][PATCH 2/8] public interface
Dulloor wrote:
> On Tue, Aug 3, 2010 at 6:37 AM, Andre Przywara <andre.przywara@amd.com> wrote:
>> Why that? Do you mean guest memory here? Isn't that a bit restrictive?
>> What if the remaining system resources do not allow this? What about a 5GB
>> guest on 2 nodes? In AMD hardware the minimum shift is 16MB, so I think 24
>> bits would be better.
>
> Linux has stricter restrictions on the minimum vnode shift (256MB afair).
> And I remember one of the emails from Jan Beulich where the minimum node
> size was discussed (but in another context). I will verify my facts and
> reply on this.

OK. I was just asking because I wondered how the PCI-hole issue is solved (I
haven't managed to review these patches today). 256MB looks OK to me.

>> Isn't that redundant with info stored somewhere else (for instance in the
>> hvm_info table)?
>
> But, this being a dynamic structure, nr_vcpus and nr_vnodes determine the
> actual size of the populated structure. It's just easier to use them in the
> helper macros.

Right, that is better. My concern was how to deal with possible
inconsistencies, but the number of vCPUs shouldn't be a problem.

>> Why would the guest need that info (physical node, start and end) here?
>> Wouldn't just the size of the node's memory be sufficient?
>
> I changed that from size to (start, end) on the last review. size should be
> sufficient since all nodes are contiguous. Will revert this back to use
> size.

start and end look fine at first glance, but you gain nothing by using them
if you only allow one entry per node. See the simple example of 4GB in 2
nodes; the SRAT looks like this:
 node0: 0 - 640K
 node0: 1MB - 2GB
 node1: 2GB - 3.5GB
 node1: 4GB - 4.5GB
In my patches I did this hole-punching in hvmloader and only sent 2G/2G via
hvm_info. From an architectural point of view the Xen tools code shouldn't
deal with these internals if this can be hidden in hvmloader.

Regards,
Andre.

--
Andre Przywara
AMD-Operating System Research Center (OSRC), Dresden, Germany
Tel: +49 351 448-3567-12
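The hole-punching Andre describes might look roughly like the following — an
illustrative sketch only, not the actual hvmloader code, ignoring the sub-1MB
VGA hole for brevity and assuming a 512MB PCI hole just below 4GB. Each
vnode's contiguous size is honoured by pushing whatever would overlap the
hole up above 4GB, which reproduces the node1: 2GB-3.5GB plus 4GB-4.5GB split
from the example.

    #include <stdint.h>
    #include <stdio.h>

    #define MB(x) ((uint64_t)(x) << 20)
    #define GB(x) ((uint64_t)(x) << 30)

    /* Illustrative constants: a 512MB PCI hole just below 4GB. */
    #define PCI_HOLE_START  (GB(4) - MB(512))
    #define HIGHMEM_START   GB(4)

    int main(void)
    {
        uint64_t size[2] = { GB(2), GB(2) };   /* 4GB guest, two 2GB vnodes */
        uint64_t low = 0;                      /* cursor below the PCI hole */
        uint64_t high = HIGHMEM_START;         /* cursor above 4GB          */

        for (unsigned n = 0; n < 2; n++) {
            uint64_t left = size[n];

            /* Place as much of this vnode as fits below the PCI hole... */
            if (low < PCI_HOLE_START) {
                uint64_t end = low + left;
                if (end > PCI_HOLE_START)
                    end = PCI_HOLE_START;
                printf("node%u: %#llx - %#llx\n", n,
                       (unsigned long long)low, (unsigned long long)end);
                left -= end - low;
                low = end;
            }
            /* ...and push the remainder above 4GB. */
            if (left) {
                printf("node%u: %#llx - %#llx\n", n,
                       (unsigned long long)high,
                       (unsigned long long)(high + left));
                high += left;
            }
        }
        return 0;
    }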
Andre Przywara
2010-Aug-03 21:35 UTC
Re: [xen-devel][vNUMA v2][PATCH 2/8] public interface
Keir Fraser wrote:
> On 03/08/2010 16:43, "Dulloor" <dulloor@gmail.com> wrote:
>> mnode_id maps the vnode to a particular physical node. This will be used
>> by the balloon driver in the VMs when the structure is passed as NUMA
>> enlightenment to PV guests and PV-on-HVM. I have a patch ready for that
>> (once we are done with this series).
>
> So what happens when the guest is migrated to another system with different
> physical node ids? Is that never to be supported? I'm not sure why you
> wouldn't hide the vnode-to-mnode translation in the hypervisor.

And what if the node assignment changes at the guest's runtime to satisfy
load balancing? I think we should have the opportunity to change the
assignment, although this could be costly when it involves copying guest
memory to another physical location. A major virtualization product ;-)
solves this with a "hot pages first, the rest in background" algorithm.
I see that this is definitely a future extension, but we shouldn't block
that path already at this early stage.

Andre.

--
Andre Przywara
AMD-Operating System Research Center (OSRC), Dresden, Germany
Tel: +49 351 448-3567-12
Andre Przywara
2010-Aug-03 21:55 UTC
Re: [xen-devel][vNUMA v2][PATCH 2/8] public interface
Dulloor wrote:
> On Tue, Aug 3, 2010 at 8:52 AM, Keir Fraser <keir.fraser@eu.citrix.com> wrote:
>> So what happens when the guest is migrated to another system with
>> different physical node ids? Is that never to be supported? I'm not sure
>> why you wouldn't hide the vnode-to-mnode translation in the hypervisor.
>
> Right now, migration is not supported when a NUMA strategy is set. This is
> in my TODO list (along with PoD support).
>
> There are a few open questions wrt migration:
> - What if the destination host is not NUMA, but the guest is NUMA? Do we
>   fake those nodes? Or should we not select such a destination host to
>   begin with?

I don't see a problem with this situation. The guest has virtual nodes; these
can be mapped in any way to actual physical nodes (but only by the
hypervisor/Dom0, not by the guest itself). A corner case could clearly be to
map all guest nodes to one single host node. In terms of performance this
should be even better if the new host can satisfy the requirement from one
node, because there will be no remote accesses at all.

> - What if the destination host is not NUMA, but the guest has asked to be
>   striped across a specific number of nodes (possibly for higher aggregate
>   memory bandwidth)?

Most people deal with NUMA because they want to cure performance _drops_
caused by bad allocation policies. After all, NUMA awareness is a performance
optimization. If the user asks to migrate to another host, then we shouldn't
come back with a fussy argument like NUMA. In my eyes it is a question of
priorities; I don't want to deny migration because of this.

> - What if the guest has asked for a particular memory strategy
>   (split/confined/striped), but the destination host can't guarantee that
>   (because of the distribution of free memory across the nodes)?

I see, there is one case where the new host has more nodes than the old one,
but the memory on each node is not sufficient (like migrating from a 2*8GB
machine to an 8*4GB one). I think we should inform the user about this and,
if she persists in the migration, use some kind of interleaving to join two
(or more) nodes together. Looks like future work, though.

> As such, the purpose of the vnode-to-mnode translation is for enlightened
> guests to know where their underlying memory comes from, so that
> over-provisioning features like ballooning are given a chance to maintain
> this distribution.

I was afraid you were saying that ;-) I haven't thought about this in detail,
but maybe we can make an exception for Dom0 only, because this is the most
prominent and frequent user of ballooning. But I really think that DomUs
should not know about or deal with host NUMA nodes.

Regards,
Andre.
--
Andre Przywara
AMD-Operating System Research Center (OSRC), Dresden, Germany
Tel: +49 351 448-3567-12
On 03/08/2010 22:55, "Andre Przywara" <andre.przywara@amd.com> wrote:
>> As such, the purpose of the vnode-to-mnode translation is for enlightened
>> guests to know where their underlying memory comes from, so that
>> over-provisioning features like ballooning are given a chance to maintain
>> this distribution.
>
> I was afraid you were saying that ;-) I haven't thought about this in
> detail, but maybe we can make an exception for Dom0 only, because this is
> the most prominent and frequent user of ballooning. But I really think that
> DomUs should not know about or deal with host NUMA nodes.

So long as it gets renamed to 'node_id' in the info structure I'm okay with
it. That doesn't preclude simply setting that field to 0...nr_vnodes-1 and
doing translation in the hypervisor, but also leaves us a bit of flexibility.

 -- Keir
On Tue, Aug 3, 2010 at 10:27 PM, Keir Fraser <keir.fraser@eu.citrix.com> wrote:
> So long as it gets renamed to 'node_id' in the info structure I'm okay with
> it. That doesn't preclude simply setting that field to 0...nr_vnodes-1 and
> doing translation in the hypervisor, but also leaves us a bit of
> flexibility.

Yes, I like this idea. Will do that.
Andre Przywara
2010-Aug-04 07:01 UTC
Re: [xen-devel][vNUMA v2][PATCH 2/8] public interface
Dulloor wrote:
> On Tue, Aug 3, 2010 at 10:27 PM, Keir Fraser <keir.fraser@eu.citrix.com> wrote:
>> So long as it gets renamed to 'node_id' in the info structure I'm okay
>> with it. That doesn't preclude simply setting that field to
>> 0...nr_vnodes-1 and doing translation in the hypervisor, but also leaves
>> us a bit of flexibility.
>
> Yes, I like this idea. Will do that.

Ack, that sounds good. So we have virtual nodes, virtual physical nodes and
machine nodes.

Andre.

--
Andre Przywara
AMD-OSRC (Dresden)
Tel: x29712
On 04/08/2010 08:01, "Andre Przywara" <andre.przywara@amd.com> wrote:
>>> So long as it gets renamed to 'node_id' in the info structure I'm okay
>>> with it. That doesn't preclude simply setting that field to
>>> 0...nr_vnodes-1 and doing translation in the hypervisor, but also leaves
>>> us a bit of flexibility.
>>
>> Yes, I like this idea. Will do that.
>
> Ack, that sounds good. So we have virtual nodes, virtual physical nodes and
> machine nodes.

The index into the 'node table' in the informational structure would have no
particular relevance in this scheme.

 -- Keir
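Pulling the agreed changes together — node_id in place of mnode_id, 64-bit
page ranges, and wider vCPU/node counts — a hypothetical sketch of how a
follow-up version of the structure could look. This is a reader's
reconstruction, not a posted patch; exact field names, widths, and padding
were still open at this point in the thread.

    #include <stdint.h>

    struct xen_vnode_info {
        uint32_t node_id;   /* 0..nr_vnodes-1, or a hypervisor-chosen id */
        uint32_t pad;
        uint64_t start;     /* start of the vnode range (in pages) */
        uint64_t end;       /* end of the vnode range (in pages)   */
    };

    struct xen_domain_numa_info {
        uint8_t  version;   /* interface version */
        uint8_t  type;      /* CONFINE/SPLIT/STRIPE/DONTCARE */
        uint16_t nr_vcpus;
        uint16_t nr_vnodes;
        uint16_t pad;
        /* data[] holds, in order:
         *   struct xen_vnode_info vnode_info[nr_vnodes];
         *   uint8_t vcpu_to_vnode[nr_vcpus];
         *   uint8_t vnode_distance[nr_vnodes * nr_vnodes];
         */
        uint8_t  data[];
    };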
Dan Magenheimer
2010-Aug-04 13:34 UTC
RE: [xen-devel][vNUMA v2][PATCH 2/8] public interface
I have to be brief right now, but since I was the one that suggested that
Dulloor turn on "no_migrate" when the guest is using NUMA optimizations, I
should reply.

Andre, you are assuming a very static world, e.g.:

> I think we should inform the user
> about this and if she persists in the migration...

and:

> but maybe we can make an exception for Dom0 only, because this
> is the most prominent and frequent user of ballooning.

One of the reasons virtualization is increasingly successful is highly
dynamic resource management. Please don't assume you can "inform the user"
that a migration might result in greatly decreased performance... there may
be hundreds or thousands of migrations per minute in the data center. And
please don't assume that today's horribly static "virtual physical memory"
model will be the same even a year or two from now. If virtualization is to
survive the looming "memory wall", much more creative mechanisms must be in
place to better optimize memory utilization, and these will likely result in
thousands of "page-ownership" changes per guest per second.

> Most people deal with NUMA because they want to cure performance
> _drops_ caused by bad allocation policies. After all, NUMA awareness
> is a performance optimization. If the user asks to migrate to
> another host, then we shouldn't come back with a fussy argument like
> NUMA. In my eyes it is a question of priorities; I don't want
> to deny migration because of this.

Actually, most people deal with NUMA because they are publishing benchmark
numbers ;-)

I firmly believe that virtualization exposes many tradeoffs between
performance and flexibility, and these choices should be presented to the
"virtualization user" as simply as possible, e.g. there should be a very
high-level choice between performance and flexibility. If a user is
attempting to improve performance by exposing the underlying NUMA-ness of a
physical machine to the guests, they are implicitly making a policy decision
choosing performance over flexibility. If we then turn around and migrate the
guest in such a way that performance plummets, we've defeated the intent of
the user's implicitly stated policy.

Dan