Dulloor
2010-Feb-13 06:25 UTC
[Xen-devel] [RFC] pv guest numa [RE: Host Numa informtion in dom0]
I am attaching (RFC) patches for NUMA-aware pv guests.

* The patch adds hypervisor interfaces to export minimal numa-related
information about the memory of a pv domain, which can then be used to
set up the node ranges, virtual cpu<->node maps, and virtual slit tables
in the pv domain.
* The guest domain also maintains a mapping between its vnodes and mnodes
(actual machine nodes). These mappings can be used in memory operations,
such as those in ballooning.
* In the patch, dom0 is made numa-aware using these interfaces. Other
domains should be simpler. I am in the process of adding python
interfaces for this, and this would work with any node selection policy.
* The patch is tested only for 64-on-64 (on x86_64).
* Along with the following other patches, this could provide a good
solution for numa-aware guests:
  - numa-aware ballooning (previously posted by me on xen-devel)
  - Andre's patch for HVM domains (posted by Andre recently)

I am in the process of making other places of dynamic memory
mgmt/operations numa-aware - tmem, memory exchange operations, etc.

Please let me know your comments.

-dulloor

On Thu, Feb 11, 2010 at 10:21 AM, Ian Pratt <Ian.Pratt@eu.citrix.com> wrote:
>> > If guest NUMA is disabled, we just use a single node mask which is the
>> > union of the per-VCPU node masks.
>> >
>> > Where allowed node masks span more than one physical node, we should
>> > allocate memory to the guest's virtual node by pseudo-randomly striping
>> > memory allocations (in 2MB chunks) from across the specified physical
>> > nodes. [pseudo-random is probably better than round robin]
>>
>> Do we really want to support this? I don't think the allowed node masks
>> should span more than one physical NUMA node. We also need to look at I/O
>> devices as well.
>
> Given that we definitely need this striping code in the case where the
> guest is non-NUMA, I'd be inclined to still allow it to be used even if
> the guest has multiple NUMA nodes.
> It could come in handy where there is a hierarchy between physical NUMA
> nodes, enabling for example striping to be used between a pair of 'close'
> nodes, while exposing the higher-level topology of the sets of paired
> nodes to the guest.
>
> Ian

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
Dan Magenheimer
2010-Feb-16 01:15 UTC
RE: [Xen-devel] [RFC] pv guest numa [RE: Host Numa informtion in dom0]
Hi Dulloor --

> I am in the process of making other places of dynamic memory
> mgmt/operations numa-aware - tmem, memory exchange operations, etc.

I'd be interested in your thoughts on numa-aware tmem, as well as on the
other dynamic memory mechanisms in Xen 4.0. Tmem is special in that it
primarily uses full-page copies from/to tmem-space to/from guest-space,
so, assuming the interconnect can pipeline/stream a memcpy, the overhead
of off-node memory vs on-node memory should be less noticeable. However,
tmem uses large data structures (rbtrees and radix-trees) and the lookup
process might benefit from being NUMA-aware.

Also, I will be looking into adding some page-sharing techniques into
tmem in the near future. This (and the existing page-sharing feature just
added to 4.0) may create some other interesting challenges for
NUMA-awareness.

Dan

> -----Original Message-----
> From: Dulloor [mailto:dulloor@gmail.com]
> Sent: Friday, February 12, 2010 11:25 PM
> To: Ian Pratt
> Cc: Andre Przywara; xen-devel@lists.xensource.com; Nakajima, Jun; Keir
> Fraser
> Subject: [Xen-devel] [RFC] pv guest numa [RE: Host Numa informtion in
> dom0]
>
> [...]
Konrad Rzeszutek Wilk
2010-Feb-16 17:49 UTC
Re: [Xen-devel] [RFC] pv guest numa [RE: Host Numa informtion in dom0]
Run this patch through scripts/checkpatch.pl

On Sat, Feb 13, 2010 at 01:25:22AM -0500, Dulloor wrote:
> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> index 86bef0f..7a24070 100644
> --- a/arch/x86/kernel/setup.c
> +++ b/arch/x86/kernel/setup.c
> @@ -940,7 +940,8 @@ void __init setup_arch(char **cmdline_p)
> 	/*
> 	 * Parse SRAT to discover nodes.
> 	 */
> -	acpi_numa_init();
> +	if (acpi_numa > 0)
> +		acpi_numa_init();

Why? Why can't we just try acpi_numa_init()?

> #endif
>
> 	initmem_init(0, max_pfn);
> diff --git a/arch/x86/mm/numa_64.c b/arch/x86/mm/numa_64.c
> index 459913b..14fa654 100644
> --- a/arch/x86/mm/numa_64.c
> +++ b/arch/x86/mm/numa_64.c
> @@ -11,7 +11,9 @@
> #include <linux/ctype.h>
> #include <linux/module.h>
> #include <linux/nodemask.h>
> +#include <linux/cpumask.h>
> #include <linux/sched.h>
> +#include <xen/interface/xen.h>
>
> #include <asm/e820.h>
> #include <asm/proto.h>
> @@ -19,6 +21,7 @@
> #include <asm/numa.h>
> #include <asm/acpi.h>
> #include <asm/k8.h>
> +#include <asm/xen/hypervisor.h>

If one does not set CONFIG_XEN, does this compile?

>
> struct pglist_data *node_data[MAX_NUMNODES] __read_mostly;
> EXPORT_SYMBOL(node_data);
> @@ -428,7 +431,6 @@ static int __init numa_emulation(unsigned long start_pfn, unsigned long last_pfn
> 	 */
> 	if (!strchr(cmdline, '*') && !strchr(cmdline, ',')) {
> 		long n = simple_strtol(cmdline, NULL, 0);
> -

No need for this.

> 		num_nodes = split_nodes_equally(nodes, &addr, max_addr, 0, n);
> 		if (num_nodes < 0)
> 			return num_nodes;
> @@ -522,16 +524,162 @@ out:
> 	numa_init_array();
> 	return 0;
> }
> +struct xen_domain_numa_layout pv_numa_layout;
> +
> +void dump_numa_layout(struct xen_domain_numa_layout *layout)
> +{
> +	unsigned int i, j;
> +	char vcpumask[128];
> +	printk("NUMA-LAYOUT(Dom0) : vcpus(%u), vnodes(%u)\n",
> +			layout->max_vcpus, layout->max_vnodes);

Redo the printk's. They need KERN_DEBUG.

> +	for (i = 0; i < layout->max_vnodes; i++)
> +	{
> +		struct xen_vnode_data *vnode_data = &layout->vnode_data[i];
> +		cpumask_scnprintf(vcpumask, sizeof(vcpumask),
> +				((cpumask_t *)&vnode_data->vcpu_mask));
> +		printk("vnode[%u]:mnode(%u), node_nr_pages(%lx), vcpu_mask(%s)\n",
> +			vnode_data->vnode_id, vnode_data->mnode_id,
> +			(unsigned long)vnode_data->nr_pages, vcpumask);

This one too.

> +	}
> +
> +	printk("vnode distances :\n");

and

> +	for (i = 0; i < layout->max_vnodes; i++)
> +		printk("\tvnode[%u]", i);
> +	for (i = 0; i < layout->max_vnodes; i++)
> +	{
> +		printk("\nvnode[%u]", i);

this

> +		for (j = 0; j < layout->max_vnodes; j++)
> +			printk("\t%u", layout->vnode_distance[i*layout->max_vnodes + j]);
> +		printk("\n");

one too.

> +	}
> +	return;
> +}
> +
> +static void __init xen_init_slit_table(struct xen_domain_numa_layout *layout)
> +{
> +	/* Construct a slit table (using layout->vnode_distance).
> +	 * Copy it to acpi_slit. */
> +	return;
> +}
> +/* Distribute the vcpus over the vnodes according to their affinity */
> +static void __init xen_init_numa_array(struct xen_domain_numa_layout *layout)
> +{
> +	int vcpu, vnode;
> +
> +	printk(KERN_INFO "xen_numa_init_array - cpu_to_node initialization\n");

pr_debug instead?

> +
> +	for (vnode = 0; vnode < layout->max_vnodes; vnode++)
> +	{
> +		struct xen_vnode_data *vnode_data = &layout->vnode_data[vnode];
> +		cpumask_t vcpu_mask = *((cpumask_t *)&vnode_data->vcpu_mask);
> +
> +		for (vcpu = 0; vcpu < layout->max_vcpus; vcpu++)
> +		{
> +			if (cpu_isset(vcpu, vcpu_mask))
> +			{
> +				if (early_cpu_to_node(vcpu) != NUMA_NO_NODE)
> +				{
> +					printk(KERN_INFO "EARLY vcpu[%d] on vnode[%d]\n",
> +							vcpu, early_cpu_to_node(vcpu));
> +					continue;
> +				}
> +				printk(KERN_INFO "vcpu[%d] on vnode[%d]\n", vcpu, vnode);
> +				numa_set_node(vcpu, vnode);
> +			}
> +		}
> +	}
> +	return;
> +}
> +
> +static int __init xen_numa_emulation(struct xen_domain_numa_layout *layout,
> +			unsigned long start_pfn, unsigned long last_pfn)
> +{
> +	int num_vnodes, i;
> +	u64 node_start_addr, node_end_addr, max_addr;
> +
> +	printk(KERN_INFO "xen_numa_emulation : max_vnodes(%d), max_vcpus(%d)",
> +				layout->max_vnodes, layout->max_vcpus);
> +	dump_numa_layout(layout);
> +	memset(&nodes, 0, sizeof(nodes));
> +
> +	num_vnodes = layout->max_vnodes;
> +	BUG_ON(num_vnodes > MAX_NUMNODES);

Hmm.. Is that really necessary? What if we just do WARN("some lengthy
explanation"), and bail out and not initialize these structures?

> +
> +	max_addr = last_pfn << PAGE_SHIFT;
> +
> +	node_start_addr = start_pfn << PAGE_SHIFT;
> +	for (i = 0; i < num_vnodes; i++)
> +	{
> +		struct xen_vnode_data *vnode_data = &layout->vnode_data[i];
> +		u64 node_size = vnode_data->nr_pages << PAGE_SHIFT;
> +
> +		node_size &= FAKE_NODE_MIN_HASH_MASK; /* 64MB aligned */
> +
> +		if (i == (num_vnodes-1))
> +			node_end_addr = max_addr;
> +		else
> +		{
> +			node_end_addr = node_start_addr + node_size;
> +			while ((node_end_addr - node_start_addr -
> +				e820_hole_size(node_start_addr, node_end_addr)) < node_size)
> +			{
> +				node_end_addr += FAKE_NODE_MIN_SIZE;
> +				if (node_end_addr > max_addr) {
> +					node_end_addr = max_addr;
> +					break;
> +				}
> +			}
> +		}
> +		/* node_start_addr updated inside the function */
> +		if (setup_node_range(i, nodes, &node_start_addr,
> +				(node_end_addr-node_start_addr), max_addr+1))
> +			BUG();
> +	}
> +
> +	printk(KERN_INFO "XEN domain numa emulation - setup nodes\n");

Is this necessary? Can it be debug?

> +
> +	memnode_shift = compute_hash_shift(nodes, num_vnodes, NULL);
> +	if (memnode_shift < 0) {
> +		printk(KERN_ERR "No NUMA hash function found.\n");
> +		BUG();

Wow. BUG? What about just bailing out and unsetting the xen_pv_emulation flag?

> +	}
> +	/* XXX: Shouldn't be needed because we disabled acpi_numa very early ! */
> +	/*
> +	 * We need to vacate all active ranges that may have been registered by
> +	 * SRAT and set acpi_numa to -1 so that srat_disabled() always returns
> +	 * true. NUMA emulation has succeeded so we will not scan ACPI nodes.
> +	 */
> +	remove_all_active_ranges();
> +
> +	BUG_ON(acpi_numa >= 0);
> +	for_each_node_mask(i, node_possible_map) {
> +		e820_register_active_regions(i, nodes[i].start >> PAGE_SHIFT,
> +						nodes[i].end >> PAGE_SHIFT);
> +		setup_node_bootmem(i, nodes[i].start, nodes[i].end);
> +	}
> +	xen_init_slit_table(layout);
> +	xen_init_numa_array(layout);
> +	return 0;
> +}
> #endif /* CONFIG_NUMA_EMU */
>
> void __init initmem_init(unsigned long start_pfn, unsigned long last_pfn)
> {
> 	int i;
> +	struct xen_domain_numa_layout *numa_layout = &pv_numa_layout;
> +
> +	int xen_pv_numa_enabled = numa_layout->max_vnodes;
>
> 	nodes_clear(node_possible_map);
> 	nodes_clear(node_online_map);
>
> #ifdef CONFIG_NUMA_EMU
> +	if (xen_pv_domain() && xen_pv_numa_enabled)
> +	{
> +		if (!xen_numa_emulation(numa_layout, start_pfn, last_pfn))
> +			return;
> +	}
> +
> 	if (cmdline && !numa_emulation(start_pfn, last_pfn))
> 		return;
> 	nodes_clear(node_possible_map);
> diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
> index ecb9b0d..b020555 100644
> --- a/arch/x86/xen/enlighten.c
> +++ b/arch/x86/xen/enlighten.c
> @@ -83,6 +83,8 @@ void *xen_initial_gdt;
>  */
> struct shared_info *HYPERVISOR_shared_info = (void *)&xen_dummy_shared_info;
>
> +struct xen_domain_numa_layout *HYPERVISOR_domain_numa_layout;
> +
> /*
>  * Flag to determine whether vcpu info placement is available on all
>  * VCPUs. We assume it is to start with, and then set it to zero on
> @@ -1089,6 +1091,7 @@ static void __init xen_setup_stackprotector(void)
> 	pv_cpu_ops.load_gdt = xen_load_gdt;
> }
>
> +extern struct xen_domain_numa_layout pv_numa_layout;
> /* First C function to be called on Xen boot */
> asmlinkage void __init xen_start_kernel(void)
> {
> @@ -1230,6 +1233,12 @@ asmlinkage void __init xen_start_kernel(void)
> 		xen_start_info->console.domU.evtchn = 0;
> 	}
>
> +	{
> +		struct xen_domain_numa_layout *layout =
> +			(void *)((char *)xen_start_info +
> +				xen_start_info->numa_layout_info.info_off);
> +		memcpy(&pv_numa_layout, layout, sizeof(*layout));
> +	}

Shouldn't there be a check to see if Xen actually exports this structure?
Otherwise you might copy garbage in and set pv_numa_layout values to
random values.

> 	xen_raw_console_write("about to get started...\n");
>
> 	/* Start the world */
> diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c
> index 612f2c9..cb944a2 100644
> --- a/arch/x86/xen/setup.c
> +++ b/arch/x86/xen/setup.c
> @@ -281,6 +281,9 @@ void __init xen_arch_setup(void)
> 		printk(KERN_INFO "ACPI in unprivileged domain disabled\n");
> 		disable_acpi();
> 	}
> +
> +	acpi_numa = -1;
> +	numa_off = 1;
> #endif
>
> 	/*
> diff --git a/include/xen/interface/xen.h b/include/xen/interface/xen.h
> index 812ffd5..d4588fa 100644
> --- a/include/xen/interface/xen.h
> +++ b/include/xen/interface/xen.h
> @@ -398,6 +398,53 @@ struct shared_info {
>
> };
>
> +#define XEN_NR_CPUS 64

No way to share this with some other variable? Like NR_CPUS?

> +#if defined(__i386__)
> +#define XEN_BITS_PER_LONG 32
> +#define XEN_BYTES_PER_LONG 4

There gotta be some of these defined in other headers.

> +#define XEN_LONG_BYTEORDER 2
> +#elif defined(__x86_64__)
> +#define XEN_BITS_PER_LONG 64
> +#define XEN_BYTES_PER_LONG 8
> +#define XEN_LONG_BYTEORDER 3
> +#endif
> +
> +/* same as cpumask_t - in xen and even Linux (for now) */
> +#define XEN_BITS_TO_LONGS(bits) \
> +    (((bits)+XEN_BITS_PER_LONG-1)/XEN_BITS_PER_LONG)
> +#define XEN_DECLARE_BITMAP(name,bits) \
> +    unsigned long name[XEN_BITS_TO_LONGS(bits)]

I am pretty sure there are macros for this already.

> +struct xen_cpumask{ XEN_DECLARE_BITMAP(bits, XEN_NR_CPUS); };
> +#ifndef __XEN__

Is that necessary? typedefs are frowned upon in the kernel.

> +typedef struct xen_cpumask xen_cpumask_t;
> +#endif
> +
> +#define XEN_MAX_VNODES 8
> +struct xen_vnode_data {
> +    uint32_t vnode_id;
> +    uint32_t mnode_id;
> +    uint64_t nr_pages;
> +    /* XXX: Can we use this in xen<->domain interfaces ? */
> +    struct xen_cpumask vcpu_mask; /* vnode_to_vcpumask */
> +};
> +#ifndef __XEN__
> +typedef struct xen_vnode_data xen_vnode_data_t;

Ditto.

> +#endif
> +
> +/* NUMA layout for the domain at the time of startup.
> + * Structure has to fit within a page. */
> +struct xen_domain_numa_layout {
> +    uint32_t max_vcpus;
> +    uint32_t max_vnodes;
> +
> +    /* Only (max_vnodes*max_vnodes) entries are filled */
> +    uint32_t vnode_distance[XEN_MAX_VNODES * XEN_MAX_VNODES];
> +    struct xen_vnode_data vnode_data[XEN_MAX_VNODES];
> +};
> +#ifndef __XEN__
> +typedef struct xen_domain_numa_layout xen_domain_numa_layout_t;

Ditto.

> +#endif
> +
> /*
>  * Start-of-day memory layout for the initial domain (DOM0):
>  *  1. The domain is started within contiguous virtual-memory region.
> @@ -449,6 +496,13 @@ struct start_info {
>     unsigned long mod_start;    /* VIRTUAL address of pre-loaded module. */
>     unsigned long mod_len;      /* Size (bytes) of pre-loaded module.  */
>     int8_t cmd_line[MAX_GUEST_CMDLINE];
> +    /* The pfn range here covers both page table and p->m table frames. */
> +    unsigned long first_p2m_pfn;/* 1st pfn forming initial P->M table. */

Why is this added in this patch? That doesn't look to be used by your patch.

> +    unsigned long nr_p2m_frames;/* # of pfns forming initial P->M table. */

Ditto.

> +    struct {
> +        uint32_t info_off;  /* Offset of console_info struct. *
> +        uint32_t info_size; /* Size of console_info struct from start.*/

Wrong comment.

> +    } numa_layout_info;
> };
>
> struct dom0_vga_console_info {
Dulloor
2010-Feb-18 16:24 UTC
Re: [Xen-devel] [RFC] pv guest numa [RE: Host Numa informtion in dom0]
Konrad, thanks for the comments. I will make the changes and send over
the patch. Any comments on the general approach are welcome too.

-dulloor

On Tue, Feb 16, 2010 at 12:49 PM, Konrad Rzeszutek Wilk
<konrad.wilk@oracle.com> wrote:
> Run this patch through scripts/checkpatch.pl
>
> [...]