Currently, when alloc_domheap_pages() is called with no range specified
(which is the case when allocating a domain's memory at creation time), it
uses the following priority order to allocate memory:

1) current node, above the DMA range (2^dma_bitsize = 4G)
2) other nodes, above the DMA range
3) current node, full range
4) other nodes, full range

Let's say we have a 2-node system, with node0's and node1's memory ranges
being 0-0xc0000000 (<4G) and 0x100000000-0x1c0000000 (>4G) respectively.
In that case node1's memory is always preferred for domain memory
allocation, no matter which node the created domain is pinned to, which
results in a performance penalty.

One possible fix is to specify the full range for domain memory allocation,
which means local memory is preferred. To limit the impact, this change
could be restricted to domains pinned to a single node.

One side effect is that the DMA memory size may become smaller, which makes
device domains unhappy. This can be addressed by reserving node0 to be used
last.

Comments?

Thanks,
Xiaowei
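A rough sketch of the search order described above, for illustration only
(this is not the actual Xen code; try_alloc_from() is a hypothetical
stand-in for the real per-node, per-zone heap walk):

typedef struct page_info page_info_t;

/* Hypothetical stand-in for the real heap walk; always fails here. */
static page_info_t *try_alloc_from(int node, int above_dma_only,
                                   unsigned int order)
{
    (void)node; (void)above_dma_only; (void)order;
    return NULL;
}

static page_info_t *alloc_no_range(int local_node, int nr_nodes,
                                   unsigned int order)
{
    page_info_t *pg;
    int node, pass;

    /* Pass 0: only memory above 2^dma_bitsize; pass 1: any memory. */
    for ( pass = 0; pass < 2; pass++ )
    {
        /* Steps 1) and 3): prefer the node the domain is pinned to. */
        if ( (pg = try_alloc_from(local_node, pass == 0, order)) != NULL )
            return pg;

        /* Steps 2) and 4): then fall back to the remaining nodes. */
        for ( node = 0; node < nr_nodes; node++ )
            if ( node != local_node &&
                 (pg = try_alloc_from(node, pass == 0, order)) != NULL )
                return pg;
    }

    return NULL;
}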
On 25/7/08 04:34, "Yang, Xiaowei" <xiaowei.yang@intel.com> wrote:> Let''s say we have a 2-node system, with node0 and node1''s memory range > being 0-0xc0000000 (<4G) and 0x100000000-0x1c0000000 (>4G) respectively. > In that case, node1''s memory is always preferred for domain memory > allocation, no matter which node the created domain is pinned to. It > results in performance penalty. > > One possible fix is to specify all range for the domain memory > allocation, which means local memory is preferred. This change may be > restricted only to the domain pinned to one node for less impact. > > One side effect is that the DMA memory size may be smaller, which makes > device domain unhappy. This can be addressed by reserving node0 to be > used lastly.Doesn''t your solution amount to what we already do, for the 2-node example? i.e., node0 would not be chosen until node1 is exhausted? -- Keir _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Keir Fraser wrote:
> On 25/7/08 04:34, "Yang, Xiaowei" <xiaowei.yang@intel.com> wrote:
>
>> Let's say we have a 2-node system, with node0's and node1's memory ranges
>> being 0-0xc0000000 (<4G) and 0x100000000-0x1c0000000 (>4G) respectively.
>> In that case node1's memory is always preferred for domain memory
>> allocation, no matter which node the created domain is pinned to, which
>> results in a performance penalty.
>>
>> One possible fix is to specify the full range for domain memory
>> allocation, which means local memory is preferred. To limit the impact,
>> this change could be restricted to domains pinned to a single node.
>>
>> One side effect is that the DMA memory size may become smaller, which
>> makes device domains unhappy. This can be addressed by reserving node0 to
>> be used last.
>
> Doesn't your solution amount to what we already do, for the 2-node example?
> i.e., node0 would not be chosen until node1 is exhausted?

Oh, what I mean is: with the possible fix above, a domain's memory is
allocated from the node it is pinned to. As node0's memory is precious for
DMA, the suggestion is to pin VMs to the other nodes first.

And for non-pinned VMs, we can stick to the original method.

Thanks,
Xiaowei
On 25/7/08 08:22, "Yang, Xiaowei" <xiaowei.yang@intel.com> wrote:>> Doesn''t your solution amount to what we already do, for the 2-node example? >> i.e., node0 would not be chosen until node1 is exhausted? >> > Oh, what I mean is: > With the above possible fix, the domain memory is allocated from the > node it pinned to. As node0''s memory is precious for DMA, it''s suggested > to pin VMs to other nodes firstly. > > And for non-pinned VM, we can stick to the original method.How about by default we guarantee no more than 25% of a node''s memory is classed as ''DMA memory'', and we reduce the DMA address width variable in Xen to ensure that? So, in your example, we would reduce dma_bitsize to 30. -- Keir _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Keir Fraser wrote:
> On 25/7/08 08:22, "Yang, Xiaowei" <xiaowei.yang@intel.com> wrote:
>
>>> Doesn't your solution amount to what we already do, for the 2-node
>>> example? i.e., node0 would not be chosen until node1 is exhausted?
>>
>> Oh, what I mean is: with the possible fix above, a domain's memory is
>> allocated from the node it is pinned to. As node0's memory is precious
>> for DMA, the suggestion is to pin VMs to the other nodes first.
>>
>> And for non-pinned VMs, we can stick to the original method.
>
> How about by default we guarantee no more than 25% of a node's memory is
> classed as 'DMA memory', and we reduce the DMA address width variable in
> Xen to ensure that?
>
> So, in your example, we would reduce dma_bitsize to 30.
>
> -- Keir

Yes, a good suggestion!

Thanks,
Xiaowei
On 25/7/08 08:51, "Yang, Xiaowei" <xiaowei.yang@intel.com> wrote:

>> How about by default we guarantee no more than 25% of a node's memory is
>> classed as 'DMA memory', and we reduce the DMA address width variable in
>> Xen to ensure that?
>>
>> So, in your example, we would reduce dma_bitsize to 30.
>>
>> -- Keir
>
> Yes, a good suggestion!

Indeed, the only reason we still have dma_bitsize is to break the
select-NUMA-node-first memory allocation search strategy. So tweaking the
dma_bitsize approach further to strike the correct NUMA-vs-DMA balance does
seem the right thing to do. Feel free to work up a patch.

 -- Keir
Keir Fraser wrote:
> On 25/7/08 08:51, "Yang, Xiaowei" <xiaowei.yang@intel.com> wrote:
>
>>> How about by default we guarantee no more than 25% of a node's memory is
>>> classed as 'DMA memory', and we reduce the DMA address width variable in
>>> Xen to ensure that?
>>>
>>> So, in your example, we would reduce dma_bitsize to 30.
>>>
>>> -- Keir
>>
>> Yes, a good suggestion!
>
> Indeed, the only reason we still have dma_bitsize is to break the
> select-NUMA-node-first memory allocation search strategy. So tweaking the
> dma_bitsize approach further to strike the correct NUMA-vs-DMA balance
> does seem the right thing to do. Feel free to work up a patch.
>
> -- Keir

How about this one?

diff -r 63317b6c3eab xen/common/page_alloc.c
--- a/xen/common/page_alloc.c   Mon Jul 14 15:21:03 2008 +0100
+++ b/xen/common/page_alloc.c   Fri Jul 25 18:24:16 2008 +0800
@@ -55,7 +55,7 @@
 /*
  * Bit width of the DMA heap.
  */
-static unsigned int dma_bitsize = CONFIG_DMA_BITSIZE;
+static unsigned int dma_bitsize;
 static void __init parse_dma_bits(char *s)
 {
     unsigned int v = simple_strtol(s, NULL, 0);
@@ -583,6 +583,16 @@
         init_heap_pages(pfn_dom_zone_type(i), mfn_to_page(i), 1);
     }
 
+    /* Reserve up to 25% of node0's memory for DMA */
+    if ( dma_bitsize == 0 )
+    {
+        dma_bitsize = pfn_dom_zone_type(NODE_DATA(0)->node_spanned_pages / 4)
+                      + PAGE_SHIFT;
+
+        ASSERT(dma_bitsize <= BITS_PER_LONG + PAGE_SHIFT);
+        ASSERT(dma_bitsize > PAGE_SHIFT + 1);
+    }
+
     printk("Domain heap initialised: DMA width %u bits\n", dma_bitsize);
 }
 #undef avail_for_domheap

diff -r 63317b6c3eab xen/include/asm-x86/config.h
--- a/xen/include/asm-x86/config.h  Mon Jul 14 15:21:03 2008 +0100
+++ b/xen/include/asm-x86/config.h  Fri Jul 25 18:24:16 2008 +0800
@@ -96,8 +96,6 @@
 
 /* Primary stack is restricted to 8kB by guard pages. */
 #define PRIMARY_STACK_SIZE 8192
-
-#define CONFIG_DMA_BITSIZE 32
 
 #define BOOT_TRAMPOLINE 0x8c000
 #define bootsym_phys(sym)                               \

Thanks,
Xiaowei
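For the 2-node example discussed above, the formula in this patch works out
roughly as follows (the exact result depends on how pfn_dom_zone_type()
rounds):

    node0 spans 0 - 0xc0000000          -> node_spanned_pages = 0xc0000 (3 GiB)
    node_spanned_pages / 4 = 0x30000    -> 768 MiB
    log2-style width + PAGE_SHIFT (12)  -> dma_bitsize of roughly 30

That is, the DMA boundary drops from the old 2^32 (4 GiB) default to
roughly 2^30 (1 GiB), in line with the value suggested earlier in the
thread.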
On 25/7/08 11:26, "Yang, Xiaowei" <xiaowei.yang@intel.com> wrote:

>> Indeed, the only reason we still have dma_bitsize is to break the
>> select-NUMA-node-first memory allocation search strategy. So tweaking the
>> dma_bitsize approach further to strike the correct NUMA-vs-DMA balance
>> does seem the right thing to do. Feel free to work up a patch.
>>
>> -- Keir
>
> How about this one?

Hmmm... something like that. Let's wait until 3.4 development opens to get
this checked in.

 -- Keir
Keir Fraser wrote:
> On 25/7/08 11:26, "Yang, Xiaowei" <xiaowei.yang@intel.com> wrote:
>
>>> Indeed, the only reason we still have dma_bitsize is to break the
>>> select-NUMA-node-first memory allocation search strategy. So tweaking
>>> the dma_bitsize approach further to strike the correct NUMA-vs-DMA
>>> balance does seem the right thing to do. Feel free to work up a patch.
>>>
>>> -- Keir
>>
>> How about this one?
>
> Hmmm... something like that. Let's wait until 3.4 development opens to get
> this checked in.

Mmh, why not check this in for 3.3? I noticed this problem a year ago
already and had a different kind of fix for it (which actually preferred
nodes over zones):
http://lists.xensource.com/archives/html/xen-devel/2007-12/msg00831.html

I think this is a somewhat serious issue on NUMA machines, since with the
automatic pinning now active (new in 3.3!) many domains will end up with
remote memory _all the time_. So I think of this as a bugfix. Actually I
have had dma_bitsize=30 hardwired in my Grub's menu.lst for some months
now...

Regards,
Andre.

--
Andre Przywara
AMD-Operating System Research Center (OSRC), Dresden, Germany
Tel: +49 351 277-84917
On 28/7/08 13:21, "Andre Przywara" <andre.przywara@amd.com> wrote:> Mmh, why not check this in in 3.3? I have noticed this problem already a > year ago and was having some other kind of fix for it (which actually > prefered nodes over zones): > http://lists.xensource.com/archives/html/xen-devel/2007-12/msg00831.html > I think this is a somewhat serious issue on NUMA machines, since with > the automatic pinning now active (new in 3.3!) many domains will end up > with remote memory _all the time_. So I think of this as a bugfix. > Actually I have dma_bitsize=30 hardwired in my Grub''s menu.lst for some > months now...Well, fine, but unfortunately the patch breaks ia64 and doesn''t even work properly: - why should NUMA node 0 be the one that overlaps with default DMA memory? - a ''large'' NUMA node 0 will cause dma_bitsize to be set much larger than it is currently, thus breaking its original intent. -- Keir _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Keir Fraser wrote:
> On 28/7/08 13:21, "Andre Przywara" <andre.przywara@amd.com> wrote:
>
>> Mmh, why not check this in for 3.3? I noticed this problem a year ago
>> already and had a different kind of fix for it (which actually preferred
>> nodes over zones):
>> http://lists.xensource.com/archives/html/xen-devel/2007-12/msg00831.html
>>
>> I think this is a somewhat serious issue on NUMA machines, since with the
>> automatic pinning now active (new in 3.3!) many domains will end up with
>> remote memory _all the time_. So I think of this as a bugfix. Actually I
>> have had dma_bitsize=30 hardwired in my Grub's menu.lst for some months
>> now...
>
> Well, fine, but unfortunately the patch breaks ia64

Fixed.

> and doesn't even work properly:
>
>  - why should NUMA node 0 be the one that overlaps with default DMA
>    memory?

Because that is the most common configuration? Do you know of any machine
where this is not true? I agree that a dual-node machine with 2 GB on each
node does not need this patch, but NUMA machines tend to have more memory
than this (especially given the current memory costs). I changed the
default DMA_BITSIZE to 30 bits; this seems to be a reasonable value.

>  - a 'large' NUMA node 0 will cause dma_bitsize to be set much larger than
>    it is currently, thus breaking its original intent.

Fixed in the attached patch. It now caps dma_bitsize to at most 1/4 of
node0's memory.

What about using this patch for Xen 3.3 and working out a more general
solution for Xen 3.4?

Signed-off-by: Andre Przywara <andre.przywara@amd.com>
Based on the patch from: "Yang, Xiaowei" <xiaowei.yang@intel.com>

--
Andre Przywara
AMD-Operating System Research Center (OSRC), Dresden, Germany
Tel: +49 351 277-84917
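Since the attachment itself is not reproduced in this thread, the following
is only a minimal, self-contained sketch of the capping behaviour Andre
describes. The names clamp_dma_bitsize and the local PAGE_SHIFT and
CONFIG_DMA_BITSIZE defines are made up for illustration; this is not the
attached patch:

#include <stdio.h>

#define PAGE_SHIFT          12
#define CONFIG_DMA_BITSIZE  30   /* new default: 2^30 = 1 GiB */

static unsigned int dma_bitsize = CONFIG_DMA_BITSIZE;

/*
 * Clamp dma_bitsize so that the DMA region never exceeds a quarter of
 * node0's memory (the cap rounds down to a power of two).
 */
static void clamp_dma_bitsize(unsigned long node0_spanned_pages)
{
    unsigned long quarter_pages = node0_spanned_pages / 4;
    unsigned int cap = PAGE_SHIFT;

    while ( (quarter_pages >>= 1) != 0 )
        cap++;               /* cap = PAGE_SHIFT + floor(log2(quarter)) */

    if ( dma_bitsize > cap )
        dma_bitsize = cap;
}

int main(void)
{
    /* A 2 GiB node0 pulls dma_bitsize down to 29 (512 MiB DMA region). */
    clamp_dma_bitsize(0x80000);
    printf("2 GiB node0: dma_bitsize = %u\n", dma_bitsize);

    /* An 8 GiB node0 leaves the 30-bit default untouched. */
    dma_bitsize = CONFIG_DMA_BITSIZE;
    clamp_dma_bitsize(0x200000);
    printf("8 GiB node0: dma_bitsize = %u\n", dma_bitsize);

    return 0;
}

Whether the real attachment rounds the cap exactly this way is an
assumption; the sketch only captures the stated intent that dma_bitsize
defaults to 30 and can only shrink, never grow, because of node0's size.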
On 28/7/08 15:26, "Andre Przywara" <andre.przywara@amd.com> wrote:> Because that is the most common configuration? Do you know of any > machine where this is not true? I agree that a dual node machine with 2 > gig on each node does not need this patch, but NUMA machines tend to > have more memory than this (especially given the current memory costs). > I changed the default DMA_BITSIZE to 30 bits, this seems to be a > reasonable value.I''ll take that bit then (the CONFIG_DMA_BITSIZE change). Sounds like it suffices for all systems you care about. -- Keir _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel