Currently, when alloc_domheap_pages() is called with no range specified
(which is the case when allocating a domain's memory at creation time), it
uses the following priority order to allocate memory:

1) current node, above the DMA range (2^dma_bitsize = 4G)
2) other nodes, above the DMA range
3) current node, full range
4) other nodes, full range

Let's say we have a 2-node system, with node0's and node1's memory ranges
being 0-0xc0000000 (<4G) and 0x100000000-0x1c0000000 (>4G) respectively.
In that case node1's memory is always preferred for domain memory
allocation, no matter which node the created domain is pinned to, which
results in a performance penalty.

One possible fix is to specify the full range for domain memory allocation,
which means local memory is preferred. To limit the impact, this change
could be restricted to domains pinned to a single node.

One side effect is that the DMA memory size may become smaller, which makes
device domains unhappy. This can be addressed by reserving node0 to be used
last.

Comments?

Thanks,
Xiaowei
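A rough sketch of the search order described above, for illustration only
(this is not the actual Xen code; try_alloc_from() is a hypothetical
stand-in for the real per-node, per-zone heap walk):

typedef struct page_info page_info_t;

/* Hypothetical stand-in for the real heap walk; always fails here. */
static page_info_t *try_alloc_from(int node, int above_dma_only,
                                   unsigned int order)
{
    (void)node; (void)above_dma_only; (void)order;
    return NULL;
}

static page_info_t *alloc_no_range(int local_node, int nr_nodes,
                                   unsigned int order)
{
    page_info_t *pg;
    int node, pass;

    /* Pass 0: only memory above 2^dma_bitsize; pass 1: any memory. */
    for ( pass = 0; pass < 2; pass++ )
    {
        /* Steps 1) and 3): prefer the node the domain is pinned to. */
        if ( (pg = try_alloc_from(local_node, pass == 0, order)) != NULL )
            return pg;

        /* Steps 2) and 4): then fall back to the remaining nodes. */
        for ( node = 0; node < nr_nodes; node++ )
            if ( node != local_node &&
                 (pg = try_alloc_from(node, pass == 0, order)) != NULL )
                return pg;
    }

    return NULL;
}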
On 25/7/08 04:34, "Yang, Xiaowei" <xiaowei.yang@intel.com> wrote:> Let''s say we have a 2-node system, with node0 and node1''s memory range > being 0-0xc0000000 (<4G) and 0x100000000-0x1c0000000 (>4G) respectively. > In that case, node1''s memory is always preferred for domain memory > allocation, no matter which node the created domain is pinned to. It > results in performance penalty. > > One possible fix is to specify all range for the domain memory > allocation, which means local memory is preferred. This change may be > restricted only to the domain pinned to one node for less impact. > > One side effect is that the DMA memory size may be smaller, which makes > device domain unhappy. This can be addressed by reserving node0 to be > used lastly.Doesn''t your solution amount to what we already do, for the 2-node example? i.e., node0 would not be chosen until node1 is exhausted? -- Keir _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Keir Fraser wrote:
> On 25/7/08 04:34, "Yang, Xiaowei" <xiaowei.yang@intel.com> wrote:
>
>> Let's say we have a 2-node system, with node0's and node1's memory ranges
>> being 0-0xc0000000 (<4G) and 0x100000000-0x1c0000000 (>4G) respectively.
>> In that case node1's memory is always preferred for domain memory
>> allocation, no matter which node the created domain is pinned to, which
>> results in a performance penalty.
>>
>> One possible fix is to specify the full range for domain memory
>> allocation, which means local memory is preferred. To limit the impact,
>> this change could be restricted to domains pinned to a single node.
>>
>> One side effect is that the DMA memory size may become smaller, which
>> makes device domains unhappy. This can be addressed by reserving node0 to
>> be used last.
>
> Doesn't your solution amount to what we already do, for the 2-node example?
> i.e., node0 would not be chosen until node1 is exhausted?

Oh, what I mean is: with the possible fix above, a domain's memory is
allocated from the node it is pinned to. As node0's memory is precious for
DMA, the suggestion is to pin VMs to the other nodes first.

And for non-pinned VMs, we can stick to the original method.

Thanks,
Xiaowei
On 25/7/08 08:22, "Yang, Xiaowei" <xiaowei.yang@intel.com> wrote:>> Doesn''t your solution amount to what we already do, for the 2-node example? >> i.e., node0 would not be chosen until node1 is exhausted? >> > Oh, what I mean is: > With the above possible fix, the domain memory is allocated from the > node it pinned to. As node0''s memory is precious for DMA, it''s suggested > to pin VMs to other nodes firstly. > > And for non-pinned VM, we can stick to the original method.How about by default we guarantee no more than 25% of a node''s memory is classed as ''DMA memory'', and we reduce the DMA address width variable in Xen to ensure that? So, in your example, we would reduce dma_bitsize to 30. -- Keir _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Keir Fraser wrote:
> On 25/7/08 08:22, "Yang, Xiaowei" <xiaowei.yang@intel.com> wrote:
>
>>> Doesn't your solution amount to what we already do, for the 2-node
>>> example? i.e., node0 would not be chosen until node1 is exhausted?
>>
>> Oh, what I mean is: with the possible fix above, a domain's memory is
>> allocated from the node it is pinned to. As node0's memory is precious
>> for DMA, the suggestion is to pin VMs to the other nodes first.
>>
>> And for non-pinned VMs, we can stick to the original method.
>
> How about by default we guarantee no more than 25% of a node's memory is
> classed as 'DMA memory', and we reduce the DMA address width variable in
> Xen to ensure that?
>
> So, in your example, we would reduce dma_bitsize to 30.
>
> -- Keir

Yes, a good suggestion!

Thanks,
Xiaowei
On 25/7/08 08:51, "Yang, Xiaowei" <xiaowei.yang@intel.com> wrote:

>> How about by default we guarantee no more than 25% of a node's memory is
>> classed as 'DMA memory', and we reduce the DMA address width variable in
>> Xen to ensure that?
>>
>> So, in your example, we would reduce dma_bitsize to 30.
>>
>> -- Keir
>
> Yes, a good suggestion!

Indeed, the only reason we still have dma_bitsize is to break the
select-NUMA-node-first memory allocation search strategy. So tweaking the
dma_bitsize approach further to strike the correct NUMA-vs-DMA balance does
seem the right thing to do. Feel free to work up a patch.

 -- Keir
Keir Fraser wrote:
> On 25/7/08 08:51, "Yang, Xiaowei" <xiaowei.yang@intel.com> wrote:
>
>>> How about by default we guarantee no more than 25% of a node's memory is
>>> classed as 'DMA memory', and we reduce the DMA address width variable in
>>> Xen to ensure that?
>>>
>>> So, in your example, we would reduce dma_bitsize to 30.
>>>
>>> -- Keir
>>
>> Yes, a good suggestion!
>
> Indeed, the only reason we still have dma_bitsize is to break the
> select-NUMA-node-first memory allocation search strategy. So tweaking the
> dma_bitsize approach further to strike the correct NUMA-vs-DMA balance
> does seem the right thing to do. Feel free to work up a patch.
>
> -- Keir

How about this one?

diff -r 63317b6c3eab xen/common/page_alloc.c
--- a/xen/common/page_alloc.c   Mon Jul 14 15:21:03 2008 +0100
+++ b/xen/common/page_alloc.c   Fri Jul 25 18:24:16 2008 +0800
@@ -55,7 +55,7 @@
 /*
  * Bit width of the DMA heap.
  */
-static unsigned int dma_bitsize = CONFIG_DMA_BITSIZE;
+static unsigned int dma_bitsize;
 static void __init parse_dma_bits(char *s)
 {
     unsigned int v = simple_strtol(s, NULL, 0);
@@ -583,6 +583,16 @@
         init_heap_pages(pfn_dom_zone_type(i), mfn_to_page(i), 1);
     }
 
+    /* Reserve up to 25% of node0's memory for DMA */
+    if ( dma_bitsize == 0 )
+    {
+        dma_bitsize = pfn_dom_zone_type(NODE_DATA(0)->node_spanned_pages / 4)
+                      + PAGE_SHIFT;
+
+        ASSERT(dma_bitsize <= BITS_PER_LONG + PAGE_SHIFT);
+        ASSERT(dma_bitsize > PAGE_SHIFT + 1);
+    }
+
     printk("Domain heap initialised: DMA width %u bits\n", dma_bitsize);
 }
 #undef avail_for_domheap

diff -r 63317b6c3eab xen/include/asm-x86/config.h
--- a/xen/include/asm-x86/config.h  Mon Jul 14 15:21:03 2008 +0100
+++ b/xen/include/asm-x86/config.h  Fri Jul 25 18:24:16 2008 +0800
@@ -96,8 +96,6 @@
 
 /* Primary stack is restricted to 8kB by guard pages. */
 #define PRIMARY_STACK_SIZE 8192
-
-#define CONFIG_DMA_BITSIZE 32
 
 #define BOOT_TRAMPOLINE 0x8c000
 #define bootsym_phys(sym)                               \

Thanks,
Xiaowei
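For the 2-node example discussed above, the formula in this patch works out
roughly as follows (the exact result depends on how pfn_dom_zone_type()
rounds):

    node0 spans 0 - 0xc0000000          -> node_spanned_pages = 0xc0000 (3 GiB)
    node_spanned_pages / 4 = 0x30000    -> 768 MiB
    log2-style width + PAGE_SHIFT (12)  -> dma_bitsize of roughly 30

That is, the DMA boundary drops from the old 2^32 (4 GiB) default to
roughly 2^30 (1 GiB), in line with the value suggested earlier in the
thread.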
On 25/7/08 11:26, "Yang, Xiaowei" <xiaowei.yang@intel.com> wrote:

>> Indeed, the only reason we still have dma_bitsize is to break the
>> select-NUMA-node-first memory allocation search strategy. So tweaking the
>> dma_bitsize approach further to strike the correct NUMA-vs-DMA balance
>> does seem the right thing to do. Feel free to work up a patch.
>>
>> -- Keir
>
> How about this one?

Hmmm... something like that. Let's wait until 3.4 development opens to get
this checked in.

 -- Keir
Keir Fraser wrote:
> On 25/7/08 11:26, "Yang, Xiaowei" <xiaowei.yang@intel.com> wrote:
>
>>> Indeed, the only reason we still have dma_bitsize is to break the
>>> select-NUMA-node-first memory allocation search strategy. So tweaking
>>> the dma_bitsize approach further to strike the correct NUMA-vs-DMA
>>> balance does seem the right thing to do. Feel free to work up a patch.
>>>
>>> -- Keir
>>
>> How about this one?
>
> Hmmm... something like that. Let's wait until 3.4 development opens to get
> this checked in.

Mmh, why not check this in for 3.3? I noticed this problem a year ago
already and had a different kind of fix for it (which actually preferred
nodes over zones):
http://lists.xensource.com/archives/html/xen-devel/2007-12/msg00831.html

I think this is a somewhat serious issue on NUMA machines, since with the
automatic pinning now active (new in 3.3!) many domains will end up with
remote memory _all the time_. So I think of this as a bugfix. Actually I
have had dma_bitsize=30 hardwired in my Grub's menu.lst for some months
now...

Regards,
Andre.

--
Andre Przywara
AMD-Operating System Research Center (OSRC), Dresden, Germany
Tel: +49 351 277-84917
On 28/7/08 13:21, "Andre Przywara" <andre.przywara@amd.com> wrote:> Mmh, why not check this in in 3.3? I have noticed this problem already a > year ago and was having some other kind of fix for it (which actually > prefered nodes over zones): > http://lists.xensource.com/archives/html/xen-devel/2007-12/msg00831.html > I think this is a somewhat serious issue on NUMA machines, since with > the automatic pinning now active (new in 3.3!) many domains will end up > with remote memory _all the time_. So I think of this as a bugfix. > Actually I have dma_bitsize=30 hardwired in my Grub''s menu.lst for some > months now...Well, fine, but unfortunately the patch breaks ia64 and doesn''t even work properly: - why should NUMA node 0 be the one that overlaps with default DMA memory? - a ''large'' NUMA node 0 will cause dma_bitsize to be set much larger than it is currently, thus breaking its original intent. -- Keir _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Keir Fraser wrote:
> On 28/7/08 13:21, "Andre Przywara" <andre.przywara@amd.com> wrote:
>
>> Mmh, why not check this in for 3.3? I noticed this problem a year ago
>> already and had a different kind of fix for it (which actually preferred
>> nodes over zones):
>> http://lists.xensource.com/archives/html/xen-devel/2007-12/msg00831.html
>>
>> I think this is a somewhat serious issue on NUMA machines, since with the
>> automatic pinning now active (new in 3.3!) many domains will end up with
>> remote memory _all the time_. So I think of this as a bugfix. Actually I
>> have had dma_bitsize=30 hardwired in my Grub's menu.lst for some months
>> now...
>
> Well, fine, but unfortunately the patch breaks ia64

Fixed.

> and doesn't even work properly:
>
>  - why should NUMA node 0 be the one that overlaps with default DMA
>    memory?

Because that is the most common configuration? Do you know of any machine
where this is not true? I agree that a dual-node machine with 2 GB on each
node does not need this patch, but NUMA machines tend to have more memory
than this (especially given the current memory costs). I changed the
default DMA_BITSIZE to 30 bits; this seems to be a reasonable value.

>  - a 'large' NUMA node 0 will cause dma_bitsize to be set much larger than
>    it is currently, thus breaking its original intent.

Fixed in the attached patch. It now caps dma_bitsize to at most 1/4 of
node0's memory.

What about using this patch for Xen 3.3 and working out a more general
solution for Xen 3.4?

Signed-off-by: Andre Przywara <andre.przywara@amd.com>
Based on the patch from: "Yang, Xiaowei" <xiaowei.yang@intel.com>

--
Andre Przywara
AMD-Operating System Research Center (OSRC), Dresden, Germany
Tel: +49 351 277-84917
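Since the attachment itself is not reproduced in this thread, the following
is only a minimal, self-contained sketch of the capping behaviour Andre
describes. The names clamp_dma_bitsize and the local PAGE_SHIFT and
CONFIG_DMA_BITSIZE defines are made up for illustration; this is not the
attached patch:

#include <stdio.h>

#define PAGE_SHIFT          12
#define CONFIG_DMA_BITSIZE  30   /* new default: 2^30 = 1 GiB */

static unsigned int dma_bitsize = CONFIG_DMA_BITSIZE;

/*
 * Clamp dma_bitsize so that the DMA region never exceeds a quarter of
 * node0's memory (the cap rounds down to a power of two).
 */
static void clamp_dma_bitsize(unsigned long node0_spanned_pages)
{
    unsigned long quarter_pages = node0_spanned_pages / 4;
    unsigned int cap = PAGE_SHIFT;

    while ( (quarter_pages >>= 1) != 0 )
        cap++;               /* cap = PAGE_SHIFT + floor(log2(quarter)) */

    if ( dma_bitsize > cap )
        dma_bitsize = cap;
}

int main(void)
{
    /* A 2 GiB node0 pulls dma_bitsize down to 29 (512 MiB DMA region). */
    clamp_dma_bitsize(0x80000);
    printf("2 GiB node0: dma_bitsize = %u\n", dma_bitsize);

    /* An 8 GiB node0 leaves the 30-bit default untouched. */
    dma_bitsize = CONFIG_DMA_BITSIZE;
    clamp_dma_bitsize(0x200000);
    printf("8 GiB node0: dma_bitsize = %u\n", dma_bitsize);

    return 0;
}

Whether the real attachment rounds the cap exactly this way is an
assumption; the sketch only captures the stated intent that dma_bitsize
defaults to 30 and can only shrink, never grow, because of node0's size.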
On 28/7/08 15:26, "Andre Przywara" <andre.przywara@amd.com> wrote:> Because that is the most common configuration? Do you know of any > machine where this is not true? I agree that a dual node machine with 2 > gig on each node does not need this patch, but NUMA machines tend to > have more memory than this (especially given the current memory costs). > I changed the default DMA_BITSIZE to 30 bits, this seems to be a > reasonable value.I''ll take that bit then (the CONFIG_DMA_BITSIZE change). Sounds like it suffices for all systems you care about. -- Keir _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel