Implement the structure that will be shared with hvmloader (with HVMs)
and directly with the VMs (with PV).

-dulloor

Signed-off-by: Dulloor <dulloor@gmail.com>
Keir Fraser
2010-Jul-05 07:39 UTC
Re: [Xen-devel] [XEN][vNUMA][PATCH 3/9] public interface
This patch is incomplete.

On 03/07/2010 00:54, "Dulloor" <dulloor@gmail.com> wrote:
> Implement the structure that will be shared with hvmloader (with HVMs)
> and directly with the VMs (with PV).
>
> -dulloor
>
> Signed-off-by: Dulloor <dulloor@gmail.com>
oops .. sorry, here it is.

-dulloor

On Mon, Jul 5, 2010 at 12:39 AM, Keir Fraser <keir.fraser@eu.citrix.com> wrote:
> This patch is incomplete.
Keir Fraser
2010-Jul-05 10:23 UTC
Re: [Xen-devel] [XEN][vNUMA][PATCH 3/9] public interface
> +#ifndef __XEN_PUBLIC_DOM_NUMA_X86_H__
> +#define __XEN_PUBLIC_DOM_NUMA_X86_H__
> +
> +/* struct xc_cpumask : static structure */
> +#define XEN_CPUMASK_BITS_PER_BYTE 8
> +#define XEN_CPUMASK_BITS_TO_BYTES(bits) \
> +    (((bits)+XEN_CPUMASK_BITS_PER_BYTE-1)/XEN_CPUMASK_BITS_PER_BYTE)
> +
> +#define XEN_MAX_VCPUS 128
> +#define XEN_CPUMASK_DECLARE_BITMAP(name,bits) \
> +    uint8_t name[XEN_CPUMASK_BITS_TO_BYTES(bits)]
> +struct xen_cpumask{ XEN_CPUMASK_DECLARE_BITMAP(bits, XEN_MAX_VCPUS); };
> +#define XEN_CPUMASK_BITMAP(maskp) ((maskp)->bits)

What are xc_cpumask (a libxc concept) related definitions doing in a
hypervisor public header? These aren't even used in this header file. Below
I suggest a vcpu_to_vnode[] array, which probably gets rid of the need for
this bitmask stuff anyway.

> +#define XEN_MAX_VNODES 4

A small number to be statically defined. Better to make your structure
extensible I think, perhaps including pointers out to vnode-indexed arrays?

> +/* vnodes are 1GB-aligned */
> +#define XEN_MIN_VNODE_SHIFT (30)
> +
> +struct xen_vnode_info {
> +    uint8_t vnode_id;
> +    uint8_t mnode_id;

How do vnodes and mnodes differ? Why should a guest care about or need to
know about both, whatever they are?

> +    uint32_t nr_pages;

Not an address range? Is that implicitly worked out somehow? Should be
commented, but even better just a <start,end> range explicitly given?

> +    struct xen_cpumask vcpu_mask; /* vnode_to_vcpumask */
> +};

Why not have a single integer array vcpu_to_vnode[] in the main
xen_domain_numa_info structure?

> +#define XEN_DOM_NUMA_INTERFACE_VERSION 0x01
> +
> +#define XEN_DOM_NUMA_CONFINE  0x01
> +#define XEN_DOM_NUMA_SPLIT    0x02
> +#define XEN_DOM_NUMA_STRIPE   0x03
> +#define XEN_DOM_NUMA_DONTCARE 0x04

What should the guest do with these? You're rather light on comments in this
critical interface-defining header file.

> +struct xen_domain_numa_info {
> +    uint8_t version;
> +    uint8_t type;
> +
> +    uint8_t nr_vcpus;
> +    uint8_t nr_vnodes;
> +
> +    /* XXX: hvm_info_table uses 32-bit for high_mem_pgend,
> +     * so we should be fine 32-bits too */
> +    uint32_t nr_pages;

If this is going to be visible outside HVMloader (e.g., in PV guests) then
just make it a uint64_aligned_t and be done with it.

> +    /* Only (nr_vnodes) entries are filled */
> +    struct xen_vnode_info vnode_info[XEN_MAX_VNODES];
> +    /* Only (nr_vnodes*nr_vnodes) entries are filled */
> +    uint8_t vnode_distance[XEN_MAX_VNODES*XEN_MAX_VNODES];

As suggested above, make these pointers out to dynamic-sized arrays. No need
for XEN_MAX_VNODES at all.

 -- Keir

> +};
> +
> +#endif
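For readers following the cpumask discussion, a minimal standalone sketch (not
part of the patch) of what the quoted macros amount to: XEN_CPUMASK_BITS_TO_BYTES
rounds a bit count up to whole bytes, and a vCPU's bit is reached with ordinary
shift/mask arithmetic. The set/test helpers below are illustrative additions;
the header itself defines none.

    #include <stdint.h>
    #include <stdio.h>

    #define XEN_CPUMASK_BITS_PER_BYTE 8
    #define XEN_CPUMASK_BITS_TO_BYTES(bits) \
        (((bits)+XEN_CPUMASK_BITS_PER_BYTE-1)/XEN_CPUMASK_BITS_PER_BYTE)
    #define XEN_MAX_VCPUS 128

    struct xen_cpumask {
        uint8_t bits[XEN_CPUMASK_BITS_TO_BYTES(XEN_MAX_VCPUS)];
    };

    /* Illustrative helpers only; the patch defines no set/test macros. */
    static void vcpu_set(struct xen_cpumask *m, unsigned int vcpu)
    {
        m->bits[vcpu / XEN_CPUMASK_BITS_PER_BYTE] |=
            1u << (vcpu % XEN_CPUMASK_BITS_PER_BYTE);
    }

    static int vcpu_test(const struct xen_cpumask *m, unsigned int vcpu)
    {
        return (m->bits[vcpu / XEN_CPUMASK_BITS_PER_BYTE] >>
                (vcpu % XEN_CPUMASK_BITS_PER_BYTE)) & 1;
    }

    int main(void)
    {
        struct xen_cpumask mask = { {0} };   /* 128 bits -> 16 bytes */
        vcpu_set(&mask, 5);
        printf("bytes=%zu vcpu5=%d vcpu6=%d\n",
               sizeof(mask.bits), vcpu_test(&mask, 5), vcpu_test(&mask, 6));
        return 0;
    }

With XEN_MAX_VCPUS at 128 the mask costs 16 bytes per vnode, whereas a
vcpu_to_vnode[] table costs one byte per vCPU in total, which is the
simplification being suggested here.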
On Mon, Jul 5, 2010 at 3:23 AM, Keir Fraser <keir.fraser@eu.citrix.com> wrote:
>> +/* struct xc_cpumask : static structure */
>> +#define XEN_CPUMASK_BITS_PER_BYTE 8
>> +#define XEN_CPUMASK_BITS_TO_BYTES(bits) \
>> +    (((bits)+XEN_CPUMASK_BITS_PER_BYTE-1)/XEN_CPUMASK_BITS_PER_BYTE)
>> +
>> +#define XEN_MAX_VCPUS 128
>> +#define XEN_CPUMASK_DECLARE_BITMAP(name,bits) \
>> +    uint8_t name[XEN_CPUMASK_BITS_TO_BYTES(bits)]
>> +struct xen_cpumask{ XEN_CPUMASK_DECLARE_BITMAP(bits, XEN_MAX_VCPUS); };
>> +#define XEN_CPUMASK_BITMAP(maskp) ((maskp)->bits)
>
> What are xc_cpumask (a libxc concept) related definitions doing in a
> hypervisor public header? These aren't even used in this header file. Below
> I suggest a vcpu_to_vnode[] array, which probably gets rid of the need for
> this bitmask stuff anyway.

Stale comment with xc_cpumask .. sorry!
I did think of the vcpu_to_vnode array, but then we use the bitmask in
hvm_info anyway (with vcpu_online). I thought I could at least fold them
into a single structure. I could change that if you insist.

>> +#define XEN_MAX_VNODES 4
>
> A small number to be statically defined. Better to make your structure
> extensible I think, perhaps including pointers out to vnode-indexed arrays?

This structure is passed in the hvm_info page. Should I use offset/len for
these dynamic-sized, vnode-indexed arrays?

>> +/* vnodes are 1GB-aligned */
>> +#define XEN_MIN_VNODE_SHIFT (30)
>> +
>> +struct xen_vnode_info {
>> +    uint8_t vnode_id;
>> +    uint8_t mnode_id;
>
> How do vnodes and mnodes differ? Why should a guest care about or need to
> know about both, whatever they are?

vnode_id is the node-id in the guest and mnode_id refers to the real node
it maps to. Actually I don't need vnode_id. Will take that out.

>> +    uint32_t nr_pages;
>
> Not an address range? Is that implicitly worked out somehow? Should be
> commented, but even better just a <start,end> range explicitly given?

The node address ranges are assumed contiguous and increasing. I will
change that to <start,end> ranges.

>> +    struct xen_cpumask vcpu_mask; /* vnode_to_vcpumask */
>> +};
>
> Why not have a single integer array vcpu_to_vnode[] in the main
> xen_domain_numa_info structure?

No specific reason, except that all the vnode-related info is folded into
a single structure. I will change that if you insist.

>> +#define XEN_DOM_NUMA_INTERFACE_VERSION 0x01
>> +
>> +#define XEN_DOM_NUMA_CONFINE  0x01
>> +#define XEN_DOM_NUMA_SPLIT    0x02
>> +#define XEN_DOM_NUMA_STRIPE   0x03
>> +#define XEN_DOM_NUMA_DONTCARE 0x04
>
> What should the guest do with these? You're rather light on comments in
> this critical interface-defining header file.

I will add comments. The intent is to share this information with the
hypervisor and PV guests (for ballooning).

>> +struct xen_domain_numa_info {
>> +    uint8_t version;
>> +    uint8_t type;
>> +
>> +    uint8_t nr_vcpus;
>> +    uint8_t nr_vnodes;
>> +
>> +    /* XXX: hvm_info_table uses 32-bit for high_mem_pgend,
>> +     * so we should be fine 32-bits too */
>> +    uint32_t nr_pages;
>
> If this is going to be visible outside HVMloader (e.g., in PV guests) then
> just make it a uint64_aligned_t and be done with it.

Will do that.

>> +    /* Only (nr_vnodes) entries are filled */
>> +    struct xen_vnode_info vnode_info[XEN_MAX_VNODES];
>> +    /* Only (nr_vnodes*nr_vnodes) entries are filled */
>> +    uint8_t vnode_distance[XEN_MAX_VNODES*XEN_MAX_VNODES];
>
> As suggested above, make these pointers out to dynamic-sized arrays. No
> need for XEN_MAX_VNODES at all.

In general, I realise I should add more comments.
Keir Fraser
2010-Jul-06 12:57 UTC
Re: [Xen-devel] [XEN][vNUMA][PATCH 3/9] public interface
On 06/07/2010 06:57, "Dulloor" <dulloor@gmail.com> wrote:
>> What are xc_cpumask (a libxc concept) related definitions doing in a
>> hypervisor public header? These aren't even used in this header file.
>> Below I suggest a vcpu_to_vnode[] array, which probably gets rid of the
>> need for this bitmask stuff anyway.
>
> Stale comment with xc_cpumask .. sorry!
> I did think of the vcpu_to_vnode array, but then we use the bitmask in
> hvm_info anyway (with vcpu_online). I thought I could at least fold them
> into a single structure. I could change that if you insist.

I think overall vcpu_to_vnode[] is the better way round, unless the per-node
vcpu maps are really particularly handy for some reason.

>> A small number to be statically defined. Better to make your structure
>> extensible I think, perhaps including pointers out to vnode-indexed
>> arrays?
>
> This structure is passed in the hvm_info page. Should I use offset/len for
> these dynamic-sized, vnode-indexed arrays?

The 'hvm_info page' is a slightly restrictive concept really. Actually the
hvm_info data gets plopped down at a fixed location below 1MB in the guest's
memory map, and you can just extend from there even across a page boundary.
I would simply include pointers out to the dynamically-sized arrays; and
their sizes should be implicit given nr_vnodes.

>> How do vnodes and mnodes differ? Why should a guest care about or need to
>> know about both, whatever they are?
>
> vnode_id is the node-id in the guest and mnode_id refers to the real node
> it maps to. Actually I don't need vnode_id. Will take that out.

Yes, that's a completely pointless, unnecessary distinction.

>>> +    uint32_t nr_pages;
>>
>> Not an address range? Is that implicitly worked out somehow? Should be
>> commented, but even better just a <start,end> range explicitly given?
>
> The node address ranges are assumed contiguous and increasing. I will
> change that to <start,end> ranges.

Thanks.

>>> +    struct xen_cpumask vcpu_mask; /* vnode_to_vcpumask */
>>> +};
>>
>> Why not have a single integer array vcpu_to_vnode[] in the main
>> xen_domain_numa_info structure?
>
> No specific reason, except that all the vnode-related info is folded into
> a single structure. I will change that if you insist.

Personally I think it would be neater to change it. A whole bunch of cpumask
machinery disappears.

 -- Keir
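Keir suggests pointers out to the arrays; the v2 patch (quoted later in the
thread) went with a trailing data[] blob whose internal offsets follow
entirely from nr_vcpus and nr_vnodes, so no XEN_MAX_VNODES is needed. A
hypothetical sketch of that latter reading — field names and widths are
placeholders, not the posted patch, and real code would pack/align the
structures explicitly:

    #include <stddef.h>
    #include <stdint.h>

    struct xen_vnode_info {
        uint8_t  node_id;          /* placeholder field */
        uint64_t start, end;       /* vnode range, in pages */
    };

    struct xen_domain_numa_info {
        uint8_t version;
        uint8_t type;
        uint8_t nr_vcpus;
        uint8_t nr_vnodes;
        /* Trailing data, in order:
         *   struct xen_vnode_info vnode_info[nr_vnodes];
         *   uint8_t vcpu_to_vnode[nr_vcpus];
         *   uint8_t vnode_distance[nr_vnodes * nr_vnodes];
         */
        uint8_t data[];
    };

    static struct xen_vnode_info *vnode_info(struct xen_domain_numa_info *ni)
    {
        return (struct xen_vnode_info *)ni->data;
    }

    static uint8_t *vcpu_to_vnode(struct xen_domain_numa_info *ni)
    {
        return ni->data + ni->nr_vnodes * sizeof(struct xen_vnode_info);
    }

    static uint8_t *vnode_distance(struct xen_domain_numa_info *ni)
    {
        return vcpu_to_vnode(ni) + ni->nr_vcpus;
    }

    static size_t numa_info_size(const struct xen_domain_numa_info *ni)
    {
        return sizeof(*ni)
             + ni->nr_vnodes * sizeof(struct xen_vnode_info)
             + ni->nr_vcpus
             + (size_t)ni->nr_vnodes * ni->nr_vnodes;
    }

hvmloader (or the PV start-of-day code) would then parse numa_info_size(ni)
bytes from the fixed location, crossing a page boundary if necessary, as
described above.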
Thanks for the comments. I will make changes over the weekend and post v2
patches.

thanks,
dulloor
Interface definition. Structure that will be shared with hvmloader (with HVMs)
and directly with the VMs (with PV).

-dulloor

Signed-off-by: Dulloor <dulloor@gmail.com>
Andre Przywara
2010-Aug-03 12:40 UTC
Re: [xen-devel][vNUMA v2][PATCH 2/8] public interface
Dulloor wrote:
> Interface definition. Structure that will be shared with hvmloader (with
> HVMs) and directly with the VMs (with PV).
>
> -dulloor
>
> Signed-off-by: Dulloor <dulloor@gmail.com>
>
> +struct xen_vnode_info {
> +    uint8_t mnode_id;  /* physical node vnode is allocated from */
> +    uint32_t start;    /* start of the vnode range (in pages) */
> +    uint32_t end;      /* end of the vnode range (in pages) */
> +};

Is there a particular reason you made the start and end members 32-bit only?
This effectively limits the amount of memory to 16TB, which is not that far
from what is in use today (if larger NUMA machines are not already shipping
with that much memory). So can you push these variables to 64 bits?

> +struct xen_domain_numa_info {
> +    uint8_t version;   /* Interface version */
> +    uint8_t type;      /* VM memory allocation scheme (see above) */
> +
> +    uint8_t nr_vcpus;
> +    uint8_t nr_vnodes;

The same here. I'd like to see at least nr_vcpus prepared for more than 256
vCPUs. Please keep in mind that if Dom0 is also NUMA-aware, by default the
whole host's resources are first given to Dom0, so the NUMA info should not
be restricted to the size of a typical guest only.

Regards,
Andre.

--
Andre Przywara
AMD-Operating System Research Center (OSRC), Dresden, Germany
Tel: +49 351 448-3567-12
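For reference, the 16TB figure follows directly from 32-bit page indices and
4KiB pages; a tiny check (assuming, as the comments say, that start/end hold
page frame numbers):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint64_t max_pages = UINT64_C(1) << 32;   /* largest 32-bit page count */
        uint64_t bytes     = max_pages << 12;     /* 4KiB pages                */
        printf("limit = %llu TiB\n",
               (unsigned long long)(bytes >> 40)); /* prints 16 */
        return 0;
    }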
Andre Przywara
2010-Aug-03 13:37 UTC
Re: [xen-devel][vNUMA v2][PATCH 2/8] public interface
Dulloor wrote:
> Interface definition. Structure that will be shared with hvmloader (with
> HVMs) and directly with the VMs (with PV).
>
> -dulloor
>
> Signed-off-by: Dulloor <dulloor@gmail.com>
>
> +/* vnodes are 1GB-aligned */
> +#define XEN_MIN_VNODE_SHIFT (30)

Why that? Do you mean guest memory here? Isn't that a bit restrictive? What
if the remaining system resources do not allow this? What about a 5GB guest
on 2 nodes? In AMD hardware the minimum shift is 16MB, so I think 24 bits
would be better.

> +struct xen_vnode_info {
> +    uint8_t mnode_id;  /* physical node vnode is allocated from */
> +    uint32_t start;    /* start of the vnode range (in pages) */
> +    uint32_t end;      /* end of the vnode range (in pages) */
> +};
> +
> +struct xen_domain_numa_info {
> +    uint8_t version;   /* Interface version */
> +    uint8_t type;      /* VM memory allocation scheme (see above) */
> +
> +    uint8_t nr_vcpus;

Isn't that redundant with info stored somewhere else (for instance in the
hvm_info table)?

> +    uint8_t nr_vnodes;
> +    /* data[] has the following entries :
> +     * //Only (nr_vnodes) entries are filled, each sizeof(struct xen_vnode_info)
> +     * struct xen_vnode_info vnode_info[nr_vnodes];

Why would the guest need that info (physical node, start and end) here?
Wouldn't just the size of the node's memory be sufficient?

Regards,
Andre.

> +     * //Only (nr_vcpus) entries are filled, each sizeof(uint8_t)
> +     * uint8_t vcpu_to_vnode[nr_vcpus];
> +     * //Only (nr_vnodes*nr_vnodes) entries are filled, each sizeof(uint8_t)
> +     * uint8_t vnode_distance[nr_vnodes*nr_vnodes];
> +     */
> +    uint8_t data[0];
> +};
> +
> +#endif

--
Andre Przywara
AMD-Operating System Research Center (OSRC), Dresden, Germany
Tel: +49 351 448-3567-12
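To put numbers on the 5GB-on-2-nodes concern, a small sketch (assumptions:
two vnodes, and the per-node share simply rounded down to the alignment
implied by the minimum vnode shift) comparing a 1GB shift with the 256MB
granularity mentioned later in the thread:

    #include <stdint.h>
    #include <stdio.h>

    /* Round a per-node share down to the given alignment shift. */
    static uint64_t node_share(uint64_t guest_bytes, unsigned nodes,
                               unsigned shift)
    {
        uint64_t align = UINT64_C(1) << shift;
        return ((guest_bytes / nodes) / align) * align;
    }

    int main(void)
    {
        uint64_t guest = UINT64_C(5) << 30;   /* 5GB guest, 2 vnodes */
        /* With a 30-bit (1GB) shift each node gets only 2GB and 1GB is left
         * over; with a 28-bit (256MB) shift each node gets 2.5GB exactly. */
        printf("1GB-aligned  : %llu MB/node\n",
               (unsigned long long)(node_share(guest, 2, 30) >> 20));
        printf("256MB-aligned: %llu MB/node\n",
               (unsigned long long)(node_share(guest, 2, 28) >> 20));
        return 0;
    }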
On 03/08/2010 14:37, "Andre Przywara" <andre.przywara@amd.com> wrote:
>> +struct xen_vnode_info {
>> +    uint8_t mnode_id;  /* physical node vnode is allocated from */
>> +    uint32_t start;    /* start of the vnode range (in pages) */
>> +    uint32_t end;      /* end of the vnode range (in pages) */
>> +};
>> +
>> +struct xen_domain_numa_info {
>> +    uint8_t version;   /* Interface version */
>> +    uint8_t type;      /* VM memory allocation scheme (see above) */
>> +
>> +    uint8_t nr_vcpus;
>
> Isn't that redundant with info stored somewhere else (for instance in the
> hvm_info table)?
>
>> +    uint8_t nr_vnodes;
>> +    /* data[] has the following entries :
>> +     * //Only (nr_vnodes) entries are filled, each sizeof(struct xen_vnode_info)
>> +     * struct xen_vnode_info vnode_info[nr_vnodes];
>
> Why would the guest need that info (physical node, start and end) here?
> Wouldn't just the size of the node's memory be sufficient?

I would expect the guest to see nodes 0 to nr_vnodes-1, and the mnode_id
could go away. Do you think the <start,end> ranges are also unnecessary, and
could really be replaced by just a size? I asked for that to be changed the
other way last time round.

 -- Keir
On Tue, Aug 3, 2010 at 5:40 AM, Andre Przywara <andre.przywara@amd.com> wrote:
> Dulloor wrote:
>> +struct xen_vnode_info {
>> +    uint8_t mnode_id;  /* physical node vnode is allocated from */
>> +    uint32_t start;    /* start of the vnode range (in pages) */
>> +    uint32_t end;      /* end of the vnode range (in pages) */
>> +};
>
> Is there a particular reason you made the start and end members 32-bit
> only? This effectively limits the amount of memory to 16TB, which is not
> that far from what is in use today. So can you push these variables to 64
> bits?

Yeah, Keir had asked for this too. I guess I missed it. Will do that.

>> +struct xen_domain_numa_info {
>> +    uint8_t version;   /* Interface version */
>> +    uint8_t type;      /* VM memory allocation scheme (see above) */
>> +
>> +    uint8_t nr_vcpus;
>> +    uint8_t nr_vnodes;
>
> The same here. I'd like to see at least nr_vcpus prepared for more than 256
> vCPUs. Please keep in mind that if Dom0 is also NUMA-aware, by default the
> whole host's resources are first given to Dom0, so the NUMA info should not
> be restricted to the size of a typical guest only.

Sure. Shall I change nr_vcpus and nr_vnodes to 16 bits each?
And that's a good point about Dom0. I have a patch ready in the queue that
allows Dom0 to be made NUMA-aware. I would expect control and driver code in
Dom0 to benefit from that.

> Regards,
> Andre.
On Tue, Aug 3, 2010 at 6:37 AM, Andre Przywara <andre.przywara@amd.com> wrote:
> Dulloor wrote:
>> +/* vnodes are 1GB-aligned */
>> +#define XEN_MIN_VNODE_SHIFT (30)
>
> Why that? Do you mean guest memory here? Isn't that a bit restrictive? What
> if the remaining system resources do not allow this? What about a 5GB guest
> on 2 nodes? In AMD hardware the minimum shift is 16MB, so I think 24 bits
> would be better.

Linux has stricter restrictions on the minimum vnode shift (256MB afair). And
I remember one of the emails from Jan Beulich where the minimum node size was
discussed (but in another context). I will verify my facts and reply on this.

>> +struct xen_domain_numa_info {
>> +    uint8_t version;   /* Interface version */
>> +    uint8_t type;      /* VM memory allocation scheme (see above) */
>> +
>> +    uint8_t nr_vcpus;
>
> Isn't that redundant with info stored somewhere else (for instance in the
> hvm_info table)?

But, this being a dynamic structure, nr_vcpus and nr_vnodes determine the
actual size of the populated structure. It's just easier to use them in the
helper macros.

>> +    uint8_t nr_vnodes;
>> +    /* data[] has the following entries :
>> +     * //Only (nr_vnodes) entries are filled, each sizeof(struct xen_vnode_info)
>> +     * struct xen_vnode_info vnode_info[nr_vnodes];
>
> Why would the guest need that info (physical node, start and end) here?
> Wouldn't just the size of the node's memory be sufficient?

I changed that from size to (start, end) on the last review. size should be
sufficient since all nodes are contiguous. Will revert this back to use size.

> Regards,
> Andre.
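As an aside on consuming that layout: vnode_distance is a flattened
nr_vnodes x nr_vnodes matrix, so a guest-side consumer (hvmloader building a
SLIT-style table, say) would index it as row*nr_vnodes+col. An illustrative
sketch, not taken from the patch:

    #include <stdint.h>
    #include <stdio.h>

    /* Walk a vcpu_to_vnode[] table and a flattened distance matrix as laid
     * out in the quoted data[] block. */
    static void dump_numa_info(unsigned nr_vcpus, unsigned nr_vnodes,
                               const uint8_t *vcpu_to_vnode,
                               const uint8_t *vnode_distance)
    {
        for (unsigned v = 0; v < nr_vcpus; v++)
            printf("vcpu%u -> vnode%u\n", v, vcpu_to_vnode[v]);

        for (unsigned i = 0; i < nr_vnodes; i++)
            for (unsigned j = 0; j < nr_vnodes; j++)
                printf("distance(%u,%u) = %u\n", i, j,
                       vnode_distance[i * nr_vnodes + j]);
    }

    int main(void)
    {
        /* Two vnodes, four vcpus: 10 for local, 20 for remote distances. */
        const uint8_t v2n[4]  = { 0, 0, 1, 1 };
        const uint8_t dist[4] = { 10, 20, 20, 10 };
        dump_numa_info(4, 2, v2n, dist);
        return 0;
    }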
On Tue, Aug 3, 2010 at 7:10 AM, Keir Fraser <keir.fraser@eu.citrix.com> wrote:
> On 03/08/2010 14:37, "Andre Przywara" <andre.przywara@amd.com> wrote:
>> Why would the guest need that info (physical node, start and end) here?
>> Wouldn't just the size of the node's memory be sufficient?
>
> I would expect the guest to see nodes 0 to nr_vnodes-1, and the mnode_id
> could go away.

mnode_id maps the vnode to a particular physical node. This will be used by
the balloon driver in the VMs when the structure is passed as NUMA
enlightenment to PV guests and PV-on-HVM. I have a patch ready for that (once
we are done with this series).

> Do you think the <start,end> ranges are also unnecessary, and could really
> be replaced by just a size? I asked for that to be changed the other way
> last time round.

Actually, we don't need <start,end> ranges, since the nodes are always
contiguous, which could be made implicit in the domain builders (hvmloader
and PV enlightenment). But your previous suggestion to use <start,end> ranges
could still be seen as a style/consistency thing, which can be important in
interfaces. Your pick :)
On 03/08/2010 16:43, "Dulloor" <dulloor@gmail.com> wrote:
>> I would expect the guest to see nodes 0 to nr_vnodes-1, and the mnode_id
>> could go away.
>
> mnode_id maps the vnode to a particular physical node. This will be used by
> the balloon driver in the VMs when the structure is passed as NUMA
> enlightenment to PV guests and PV-on-HVM. I have a patch ready for that
> (once we are done with this series).

So what happens when the guest is migrated to another system with different
physical node ids? Is that never to be supported? I'm not sure why you
wouldn't hide the vnode-to-mnode translation in the hypervisor.

 -- Keir
On 03/08/2010 16:43, "Dulloor" <dulloor@gmail.com> wrote:
>> Do you think the <start,end> ranges are also unnecessary, and could really
>> be replaced by just a size? I asked for that to be changed the other way
>> last time round.
>
> Actually, we don't need <start,end> ranges, since the nodes are always
> contiguous, which could be made implicit in the domain builders (hvmloader
> and PV enlightenment). But your previous suggestion to use <start,end>
> ranges could still be seen as a style/consistency thing, which can be
> important in interfaces. Your pick :)

I think <start,end> is clearer.

 -- Keir
On Tue, Aug 3, 2010 at 8:52 AM, Keir Fraser <keir.fraser@eu.citrix.com> wrote:
> So what happens when the guest is migrated to another system with different
> physical node ids? Is that never to be supported? I'm not sure why you
> wouldn't hide the vnode-to-mnode translation in the hypervisor.

Right now, migration is not supported when a NUMA strategy is set. This is in
my TODO list (along with PoD support).

There are a few open questions wrt migration:
- What if the destination host is not NUMA, but the guest is NUMA? Do we fake
  those nodes? Or should we not select such a destination host to begin with?
- What if the destination host is not NUMA, but the guest has asked to be
  striped across a specific number of nodes (possibly for higher aggregate
  memory bandwidth)?
- What if the guest has asked for a particular memory strategy
  (split/confined/striped), but the destination host can't guarantee that
  (because of the distribution of free memory across the nodes)?

Once we answer these questions, we will know whether the vnode-to-mnode
translation is better exposed or not. And, if exposed, whether we could just
renegotiate the vnode-to-mnode translation at the destination host. I have
started working on this. But I have some other patches ready to go which we
might want to check in first - PV/Dom0 NUMA patches, ballooning support (see
below).

As such, the purpose of the vnode-to-mnode translation is for enlightened
guests to know where their underlying memory comes from, so that
over-provisioning features like ballooning are given a chance to maintain
this distribution. This way, all the hypervisor has to do is sanity-check
increase/exchange-reservation requests from the guests, and the guest can
decide whether to make an exact_node_request or not.

Other options, which would allow us to discard this translation, are:
- Ballooning at your own risk: leave ballooning as it is even when guests use
  a NUMA strategy (particularly split/confined).
- Hypervisor-level policies: let Xen do its best to maintain the guest nodes
  (using the gpfn ranges in guest nodes), which I think is not a
  clean/flexible solution.

But what I could do is leave out the vnode-to-mnode translation for now and
add it along with ballooning support (if/when we decide to add it). I will
just bump up the interface version at that time. That might give us time to
mull this over?
On 03/08/2010 18:24, "Dulloor" <dulloor@gmail.com> wrote:
> But what I could do is leave out the vnode-to-mnode translation for now and
> add it along with ballooning support (if/when we decide to add it). I will
> just bump up the interface version at that time. That might give us time to
> mull this over?

I would rename your 'nr_vnodes' field to 'nr_nodes' and your 'mnode_id' field
to 'node_id'. Then I'm happy. It's the unnecessary distinction between vnodes
and mnodes that I was troubled by -- the two can be collapsed together and
the interface is clearer that way.

 -- Keir
On Tue, Aug 3, 2010 at 12:52 PM, Keir Fraser <keir.fraser@eu.citrix.com> wrote:
> I would rename your 'nr_vnodes' field to 'nr_nodes' and your 'mnode_id'
> field to 'node_id'. Then I'm happy. It's the unnecessary distinction between
> vnodes and mnodes that I was troubled by -- the two can be collapsed
> together and the interface is clearer that way.

Sure, will do that.
Andre Przywara
2010-Aug-03 21:21 UTC
Re: [xen-devel][vNUMA v2][PATCH 2/8] public interface
Dulloor wrote:
> On Tue, Aug 3, 2010 at 6:37 AM, Andre Przywara <andre.przywara@amd.com> wrote:
>> Why that? Do you mean guest memory here? Isn't that a bit restrictive?
>> What if the remaining system resources do not allow this? What about a 5GB
>> guest on 2 nodes? In AMD hardware the minimum shift is 16MB, so I think 24
>> bits would be better.
>
> Linux has stricter restrictions on the minimum vnode shift (256MB afair).
> And I remember one of the emails from Jan Beulich where the minimum node
> size was discussed (but in another context). I will verify my facts and
> reply on this.

OK. I was just asking because I wondered how the PCI-hole issue is solved (I
haven't managed to review these patches today). 256MB looks OK to me.

>> Isn't that redundant with info stored somewhere else (for instance in the
>> hvm_info table)?
>
> But, this being a dynamic structure, nr_vcpus and nr_vnodes determine the
> actual size of the populated structure. It's just easier to use them in the
> helper macros.

Right, that is better. My concern was how to deal with possible
inconsistencies, but the number of vCPUs shouldn't be a problem.

>> Why would the guest need that info (physical node, start and end) here?
>> Wouldn't just the size of the node's memory be sufficient?
>
> I changed that from size to (start, end) on the last review. size should be
> sufficient since all nodes are contiguous. Will revert this back to use
> size.

start and end look fine at first glance, but you gain nothing by using them
if you only allow one entry per node. See the simple example of 4GB in 2
nodes; the SRAT looks like this:
 node0: 0 - 640K
 node0: 1MB - 2GB
 node1: 2GB - 3.5GB
 node1: 4GB - 4.5GB
In my patches I did this hole-punching in hvmloader and only sent 2G/2G via
hvm_info. From an architectural point of view the Xen tools code shouldn't
deal with these internals if this can be hidden in hvmloader.

Regards,
Andre.

--
Andre Przywara
AMD-Operating System Research Center (OSRC), Dresden, Germany
Tel: +49 351 448-3567-12
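The hole-punching Andre describes might look roughly like the following — an
illustrative sketch only, not the actual hvmloader code, ignoring the sub-1MB
VGA hole for brevity and assuming a 512MB PCI hole just below 4GB. Each
vnode's contiguous size is honoured by pushing whatever would overlap the
hole up above 4GB, which reproduces the node1: 2GB-3.5GB plus 4GB-4.5GB split
from the example.

    #include <stdint.h>
    #include <stdio.h>

    #define MB(x) ((uint64_t)(x) << 20)
    #define GB(x) ((uint64_t)(x) << 30)

    /* Illustrative constants: a 512MB PCI hole just below 4GB. */
    #define PCI_HOLE_START  (GB(4) - MB(512))
    #define HIGHMEM_START   GB(4)

    int main(void)
    {
        uint64_t size[2] = { GB(2), GB(2) };   /* 4GB guest, two 2GB vnodes */
        uint64_t low = 0;                      /* cursor below the PCI hole */
        uint64_t high = HIGHMEM_START;         /* cursor above 4GB          */

        for (unsigned n = 0; n < 2; n++) {
            uint64_t left = size[n];

            /* Place as much of this vnode as fits below the PCI hole... */
            if (low < PCI_HOLE_START) {
                uint64_t end = low + left;
                if (end > PCI_HOLE_START)
                    end = PCI_HOLE_START;
                printf("node%u: %#llx - %#llx\n", n,
                       (unsigned long long)low, (unsigned long long)end);
                left -= end - low;
                low = end;
            }
            /* ...and push the remainder above 4GB. */
            if (left) {
                printf("node%u: %#llx - %#llx\n", n,
                       (unsigned long long)high,
                       (unsigned long long)(high + left));
                high += left;
            }
        }
        return 0;
    }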
Andre Przywara
2010-Aug-03 21:35 UTC
Re: [xen-devel][vNUMA v2][PATCH 2/8] public interface
Keir Fraser wrote:
> On 03/08/2010 16:43, "Dulloor" <dulloor@gmail.com> wrote:
>> mnode_id maps the vnode to a particular physical node. This will be used
>> by the balloon driver in the VMs when the structure is passed as NUMA
>> enlightenment to PV guests and PV-on-HVM. I have a patch ready for that
>> (once we are done with this series).
>
> So what happens when the guest is migrated to another system with different
> physical node ids? Is that never to be supported? I'm not sure why you
> wouldn't hide the vnode-to-mnode translation in the hypervisor.

And what if the node assignment changes at the guest's runtime to satisfy
load balancing? I think we should have the opportunity to change the
assignment, although this could be costly when it involves copying guest
memory to another physical location. A major virtualization product ;-)
solves this with a "hot pages first, the rest in background" algorithm.
I see that this is definitely a future extension, but we shouldn't block
that path already at this early stage.

Andre.

--
Andre Przywara
AMD-Operating System Research Center (OSRC), Dresden, Germany
Tel: +49 351 448-3567-12
Andre Przywara
2010-Aug-03 21:55 UTC
Re: [xen-devel][vNUMA v2][PATCH 2/8] public interface
Dulloor wrote:
> On Tue, Aug 3, 2010 at 8:52 AM, Keir Fraser <keir.fraser@eu.citrix.com> wrote:
>> So what happens when the guest is migrated to another system with
>> different physical node ids? Is that never to be supported? I'm not sure
>> why you wouldn't hide the vnode-to-mnode translation in the hypervisor.
>
> Right now, migration is not supported when a NUMA strategy is set. This is
> in my TODO list (along with PoD support).
>
> There are a few open questions wrt migration:
> - What if the destination host is not NUMA, but the guest is NUMA? Do we
>   fake those nodes? Or should we not select such a destination host to
>   begin with?

I don't see a problem with this situation. The guest has virtual nodes; these
can be mapped in any way to actual physical nodes (but only by the
hypervisor/Dom0, not by the guest itself). A corner case could clearly be to
map all guest nodes to one single host node. In terms of performance this
should be even better if the new host can satisfy the requirement from one
node, because there will be no remote accesses at all.

> - What if the destination host is not NUMA, but the guest has asked to be
>   striped across a specific number of nodes (possibly for higher aggregate
>   memory bandwidth)?

Most people deal with NUMA because they want to cure performance _drops_
caused by bad allocation policies. After all, NUMA awareness is a performance
optimization. If the user asks to migrate to another host, then we shouldn't
come back with a fussy argument like NUMA. In my eyes it is a question of
priorities; I don't want to deny migration because of this.

> - What if the guest has asked for a particular memory strategy
>   (split/confined/striped), but the destination host can't guarantee that
>   (because of the distribution of free memory across the nodes)?

I see, there is one case where the new host has more nodes than the old one,
but the memory on each node is not sufficient (like migrating from a 2*8GB
machine to an 8*4GB one). I think we should inform the user about this and,
if she persists in the migration, use some kind of interleaving to join two
(or more) nodes together. Looks like future work, though.

> As such, the purpose of the vnode-to-mnode translation is for enlightened
> guests to know where their underlying memory comes from, so that
> over-provisioning features like ballooning are given a chance to maintain
> this distribution.

I was afraid you were saying that ;-) I haven't thought about this in detail,
but maybe we can make an exception for Dom0 only, because this is the most
prominent and frequent user of ballooning. But I really think that DomUs
should not know about or deal with host NUMA nodes.

Regards,
Andre.
--
Andre Przywara
AMD-Operating System Research Center (OSRC), Dresden, Germany
Tel: +49 351 448-3567-12
On 03/08/2010 22:55, "Andre Przywara" <andre.przywara@amd.com> wrote:
>> As such, the purpose of the vnode-to-mnode translation is for enlightened
>> guests to know where their underlying memory comes from, so that
>> over-provisioning features like ballooning are given a chance to maintain
>> this distribution.
>
> I was afraid you were saying that ;-) I haven't thought about this in
> detail, but maybe we can make an exception for Dom0 only, because this is
> the most prominent and frequent user of ballooning. But I really think that
> DomUs should not know about or deal with host NUMA nodes.

So long as it gets renamed to 'node_id' in the info structure I'm okay with
it. That doesn't preclude simply setting that field to 0...nr_vnodes-1 and
doing translation in the hypervisor, but also leaves us a bit of flexibility.

 -- Keir
On Tue, Aug 3, 2010 at 10:27 PM, Keir Fraser <keir.fraser@eu.citrix.com> wrote:
> So long as it gets renamed to 'node_id' in the info structure I'm okay with
> it. That doesn't preclude simply setting that field to 0...nr_vnodes-1 and
> doing translation in the hypervisor, but also leaves us a bit of
> flexibility.

Yes, I like this idea. Will do that.
Andre Przywara
2010-Aug-04 07:01 UTC
Re: [xen-devel][vNUMA v2][PATCH 2/8] public interface
Dulloor wrote:
> On Tue, Aug 3, 2010 at 10:27 PM, Keir Fraser <keir.fraser@eu.citrix.com> wrote:
>> So long as it gets renamed to 'node_id' in the info structure I'm okay
>> with it. That doesn't preclude simply setting that field to
>> 0...nr_vnodes-1 and doing translation in the hypervisor, but also leaves
>> us a bit of flexibility.
>
> Yes, I like this idea. Will do that.

Ack, that sounds good. So we have virtual nodes, virtual physical nodes and
machine nodes.

Andre.

--
Andre Przywara
AMD-OSRC (Dresden)
Tel: x29712
On 04/08/2010 08:01, "Andre Przywara" <andre.przywara@amd.com> wrote:
>>> So long as it gets renamed to 'node_id' in the info structure I'm okay
>>> with it. That doesn't preclude simply setting that field to
>>> 0...nr_vnodes-1 and doing translation in the hypervisor, but also leaves
>>> us a bit of flexibility.
>>
>> Yes, I like this idea. Will do that.
>
> Ack, that sounds good. So we have virtual nodes, virtual physical nodes and
> machine nodes.

The index into the 'node table' in the informational structure would have no
particular relevance in this scheme.

 -- Keir
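Pulling the agreed changes together — node_id in place of mnode_id, 64-bit
page ranges, and wider vCPU/node counts — a hypothetical sketch of how a
follow-up version of the structure could look. This is a reader's
reconstruction, not a posted patch; exact field names, widths, and padding
were still open at this point in the thread.

    #include <stdint.h>

    struct xen_vnode_info {
        uint32_t node_id;   /* 0..nr_vnodes-1, or a hypervisor-chosen id */
        uint32_t pad;
        uint64_t start;     /* start of the vnode range (in pages) */
        uint64_t end;       /* end of the vnode range (in pages)   */
    };

    struct xen_domain_numa_info {
        uint8_t  version;   /* interface version */
        uint8_t  type;      /* CONFINE/SPLIT/STRIPE/DONTCARE */
        uint16_t nr_vcpus;
        uint16_t nr_vnodes;
        uint16_t pad;
        /* data[] holds, in order:
         *   struct xen_vnode_info vnode_info[nr_vnodes];
         *   uint8_t vcpu_to_vnode[nr_vcpus];
         *   uint8_t vnode_distance[nr_vnodes * nr_vnodes];
         */
        uint8_t  data[];
    };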
Dan Magenheimer
2010-Aug-04 13:34 UTC
RE: [xen-devel][vNUMA v2][PATCH 2/8] public interface
I have to be brief right now, but since I was the one that suggested that
Dulloor turn on "no_migrate" when the guest is using NUMA optimizations, I
should reply.

Andre, you are assuming a very static world, e.g.:

> I think we should inform the user
> about this and if she persists in the migration...

and:

> but maybe we can make an exception for Dom0 only, because this
> is the most prominent and frequent user of ballooning.

One of the reasons virtualization is increasingly successful is highly
dynamic resource management. Please don't assume you can "inform the user"
that a migration might result in greatly decreased performance... there may
be hundreds or thousands of migrations per minute in the data center. And
please don't assume that today's horribly static "virtual physical memory"
model will be the same even a year or two from now. If virtualization is to
survive the looming "memory wall", much more creative mechanisms must be in
place to better optimize memory utilization, and these will likely result in
thousands of "page-ownership" changes per guest per second.

> Most people deal with NUMA because they want to cure performance
> _drops_ caused by bad allocation policies. After all, NUMA awareness
> is a performance optimization. If the user asks to migrate to
> another host, then we shouldn't come back with a fussy argument like
> NUMA. In my eyes it is a question of priorities; I don't want
> to deny migration because of this.

Actually, most people deal with NUMA because they are publishing benchmark
numbers ;-)

I firmly believe that virtualization exposes many tradeoffs between
performance and flexibility, and these choices should be presented to the
"virtualization user" as simply as possible, e.g. there should be a very
high-level choice between performance and flexibility. If a user is
attempting to improve performance by exposing the underlying NUMA-ness of a
physical machine to the guests, they are implicitly making a policy decision
choosing performance over flexibility. If we then turn around and migrate the
guest in such a way that performance plummets, we've defeated the intent of
the user's implicitly stated policy.

Dan