I'm in the process of putting together patches to eliminate undue overhead
resulting from Xen's current assumption of a non-sparse machine address
map. After removing large holes (symmetric across all nodes) from the
machine address map and after also mapping the resulting (much smaller)
frame table sparsely (to account for smaller holes distinct for one or more
nodes), I intended to also map the P2M table sparsely.

However, there's a tools-side assumption (for domain save) on this table
being contiguously populated. While this could be fixed in the tools
relatively easily, I'm wondering if there are other dependencies on this
table (or its read-only equivalent readable by all guests) being non-sparse.
In particular, while I can easily see that Linux has always been careful to
access the read-only copy of the table with exception fixup, I don't know
whether that would also apply to the other PV OSes (Solaris, NetBSD, ???).

One alternative would be to dedicate a zero-filled 2M page for all these
holes, but I have to admit that this doesn't seem very attractive. Another
option would be to use a made-up MFN to represent the holes, allowing Xen
(once they get fed back into mmu_update) to recognize this and simply
ignore the mapping request (after all, the tools should not access these
ranges anyway). This wouldn't address potential issues of non-Linux guests
with a sparse R/O M2P table though.

Other thoughts?

Jan
On 16/09/2009 11:04, "Jan Beulich" <JBeulich@novell.com> wrote:

> I'm in the process of putting together patches to eliminate undue overhead
> resulting from Xen's current assumption of a non-sparse machine address
> map. After removing large holes (symmetric across all nodes) from the
> machine address map and after also mapping the resulting (much smaller)
> frame table sparsely (to account for smaller holes distinct for one or more
> nodes), I intended to also map the P2M table sparsely.

The guest P2Ms? Why would you want to do that - so that you can add some
hierarchy (and therefore sparseness) to the pseudo-phys address space too?
Otherwise I don't see why you would ever have a sparse P2M, and therefore
why you would care about efficiently handling large holes in it.

-- Keir
On 16/09/2009 11:04, "Jan Beulich" <JBeulich@novell.com> wrote:

> I'm in the process of putting together patches to eliminate undue overhead
> resulting from Xen's current assumption of a non-sparse machine address
> map.

It would have been good to announce your intention to work on this on the
list, by the way. I nearly started on this a couple of weeks back, since
sparse-memory x86 systems are in the pipeline now, and one of us would have
been wasting our time. I'm happy to wait for your patches now, of course.

-- Keir
>>> Keir Fraser <keir.fraser@eu.citrix.com> 16.09.09 12:39 >>>
> On 16/09/2009 11:04, "Jan Beulich" <JBeulich@novell.com> wrote:
>
>> I'm in the process of putting together patches to eliminate undue overhead
>> resulting from Xen's current assumption of a non-sparse machine address
>> map. After removing large holes (symmetric across all nodes) from the
>> machine address map and after also mapping the resulting (much smaller)
>> frame table sparsely (to account for smaller holes distinct for one or more
>> nodes), I intended to also map the P2M table sparsely.
>
> The guest P2Ms? Why would you want to do that - so that you can add some
> hierarchy (and therefore sparseness) to the pseudo-phys address space too?
> Otherwise I don't see why you would ever have a sparse P2M, and therefore
> why you would care about efficiently handling large holes in it.

Oh, typo (see subject) - I really mean the M2P table here. The P2M table
has no reason to be sparse.

Jan
On 16/09/2009 12:05, "Jan Beulich" <JBeulich@novell.com> wrote:

>> The guest P2Ms? Why would you want to do that - so that you can add some
>> hierarchy (and therefore sparseness) to the pseudo-phys address space too?
>> Otherwise I don't see why you would ever have a sparse P2M, and therefore
>> why you would care about efficiently handling large holes in it.
>
> Oh, typo (see subject) - I really mean the M2P table here. The P2M table
> has no reason to be sparse.

Ah okay, that makes more sense! I think M2P should be mapped sparsely and we
should expect guests to deal with it. We can revise that opinion if we find
strong reason to do otherwise. It's certainly what I was intending to do.

-- Keir
>>> Keir Fraser <keir.fraser@eu.citrix.com> 16.09.09 14:27 >>>
> On 16/09/2009 12:05, "Jan Beulich" <JBeulich@novell.com> wrote:
>
>>> The guest P2Ms? Why would you want to do that - so that you can add some
>>> hierarchy (and therefore sparseness) to the pseudo-phys address space too?
>>> Otherwise I don't see why you would ever have a sparse P2M, and therefore
>>> why you would care about efficiently handling large holes in it.
>>
>> Oh, typo (see subject) - I really mean the M2P table here. The P2M table
>> has no reason to be sparse.
>
> Ah okay, that makes more sense! I think M2P should be mapped sparsely and we
> should expect guests to deal with it. We can revise that opinion if we find
> strong reason to do otherwise. It's certainly what I was intending to do.

So for domain save, would you think that passing out zeroes for the holes
in the output of XENMEM_machphys_mfn_list is a reasonable thing to do,
with the expectation that the tools would split up their mapping request
when encountering zeroes (a single mmap(), but multiple subsequent calls
through privcmd)?

Or should the tools perhaps not even do this with a single mmap() anymore,
as the table can be pretty much unbounded now? (I'm limiting it to 256G in
my patches, but that doesn't need to be the final limit.)

Also, will it be okay to leave the tools-side work to be done by someone
more familiar with it than I am?

Jan
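For illustration only, the tools-side splitting proposed above could look
roughly like the sketch below. map_m2p_extents() is a made-up placeholder
for whatever privcmd-based mapping call the tools would really use, and the
zero-MFN-means-hole convention is just the proposal from this mail, not
settled behaviour:

/*
 * Walk the MFN list returned by XENMEM_machphys_mfn_list and issue one
 * mapping request per contiguous run of populated extents, skipping holes
 * that are reported as zero MFNs.
 */
#include <stdint.h>
#include <stddef.h>

/* Hypothetical stand-in for the real privcmd mapping call. */
int map_m2p_extents(void *virt, const uint64_t *mfns, size_t count);

static int map_sparse_m2p(void *virt_base, const uint64_t *extent_mfns,
                          size_t nr_extents, size_t extent_size)
{
    size_t start = 0;

    while (start < nr_extents) {
        size_t end;

        /* Skip holes, reported as zero MFNs. */
        if (extent_mfns[start] == 0) {
            start++;
            continue;
        }

        /* Find the end of this contiguous run of populated extents. */
        for (end = start; end < nr_extents && extent_mfns[end] != 0; end++)
            ;

        if (map_m2p_extents((char *)virt_base + start * extent_size,
                            &extent_mfns[start], end - start) < 0)
            return -1;

        start = end;
    }
    return 0;
}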
On 16/09/2009 13:49, "Jan Beulich" <JBeulich@novell.com> wrote:

>> Ah okay, that makes more sense! I think M2P should be mapped sparsely and we
>> should expect guests to deal with it. We can revise that opinion if we find
>> strong reason to do otherwise. It's certainly what I was intending to do.
>
> So for domain save, would you think that passing out zeroes for the holes
> in the output of XENMEM_machphys_mfn_list is a reasonable thing to do,
> with the expectation that the tools would split up their mapping request
> when encountering zeroes (a single mmap(), but multiple subsequent calls
> through privcmd)?
>
> Or should the tools perhaps not even do this with a single mmap() anymore,
> as the table can be pretty much unbounded now? (I'm limiting it to 256G in
> my patches, but that doesn't need to be the final limit.)
>
> Also, will it be okay to leave the tools-side work to be done by someone
> more familiar with it than I am?

Ah, I'd forgotten we guarantee that the M2P is made up of aligned 2MB
extents. Given that, I don't think we should bother having mapping holes
after all -- the cost of the extra page-directory entries is only ~4kB per
1GB of M2P. What I would instead do is just alias the first 2MB extent of
the M2P into every 'empty' 2MB extent of the M2P (this is just a handy
'safe' 2MB piece of memory to map in, so that (a) tools requests to map the
M2P just continue to work; and (b) we have no doubts about guests possibly
not handling faults on accesses into the M2P). Aliasing the first M2P
extent like that just avoids us burning another 2MB for no good reason, to
map regions of the M2P which really only contain 'garbage'.

Does that sound good?

-- Keir
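A minimal sketch of this aliasing idea, assuming the M2P is mapped with 2MB
superpages and that l2e_get_flags()/_PAGE_PRESENT behave as in Xen of this
era; the function name and the flat-array simplification are illustrative
only, not the actual implementation:

/*
 * Simplified: treats the L2 entries covering the M2P's 2MB slots as one
 * flat array; real code would walk the page tables covering the M2P range.
 */
static void fill_m2p_holes(l2_pgentry_t *m2p_l2e, unsigned long nr_slots)
{
    l2_pgentry_t safe_l2e = l2e_empty();
    unsigned long i;

    /* Remember the first populated 2MB extent as the 'safe' one. */
    for ( i = 0; i < nr_slots; i++ )
        if ( l2e_get_flags(m2p_l2e[i]) & _PAGE_PRESENT )
        {
            safe_l2e = m2p_l2e[i];
            break;
        }

    /* Alias it into every empty slot so mappings and reads never fault. */
    for ( i = 0; i < nr_slots; i++ )
        if ( !(l2e_get_flags(m2p_l2e[i]) & _PAGE_PRESENT) )
            m2p_l2e[i] = safe_l2e;
}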
>>> Keir Fraser <keir.fraser@eu.citrix.com> 16.09.09 15:40 >>>
> Ah, I'd forgotten we guarantee that the M2P is made up of aligned 2MB
> extents. Given that, I don't think we should bother having mapping holes
> after all -- the cost of the extra page-directory entries is only ~4kB per
> 1GB of M2P. What I would instead do is just alias the first 2MB extent of
> the M2P into every 'empty' 2MB extent of the M2P (this is just a handy
> 'safe' 2MB piece of memory to map in, so that (a) tools requests to map the
> M2P just continue to work; and (b) we have no doubts about guests possibly
> not handling faults on accesses into the M2P). Aliasing the first M2P
> extent like that just avoids us burning another 2MB for no good reason, to
> map regions of the M2P which really only contain 'garbage'.
>
> Does that sound good?

Indeed, it doesn't really matter *what* gets mapped there. I wouldn't,
however, strictly stick to the first chunk - if there's a 1G chunk, and
later a 1G hole, I'd prefer using the existing 1G chunk there, and
iteratively populate it with 2M chunks only when no 1G chunk was previously
allocated. Perhaps (depending on what turns out simpler) I might also just
use the most recent chunk for sticking in a hole...

Jan
On 16/09/2009 15:04, "Jan Beulich" <JBeulich@novell.com> wrote:

>> Does that sound good?
>
> Indeed, it doesn't really matter *what* gets mapped there. I wouldn't,
> however, strictly stick to the first chunk - if there's a 1G chunk, and
> later a 1G hole, I'd prefer using the existing 1G chunk there, and
> iteratively populate it with 2M chunks only when no 1G chunk was previously
> allocated. Perhaps (depending on what turns out simpler) I might also just
> use the most recent chunk for sticking in a hole...

Sounds fine to me.

K.
xen-devel-bounces@lists.xensource.com wrote:
>>>> Keir Fraser <keir.fraser@eu.citrix.com> 16.09.09 14:27 >>>
>> On 16/09/2009 12:05, "Jan Beulich" <JBeulich@novell.com> wrote:
>>
>>>> The guest P2Ms? Why would you want to do that - so that you can
>>>> add some hierarchy (and therefore sparseness) to the pseudo-phys
>>>> address space too? Otherwise I don't see why you would ever have a
>>>> sparse P2M, and therefore why you would care about efficiently
>>>> handling large holes in it.
>>>
>>> Oh, typo (see subject) - I really mean the M2P table here. The P2M
>>> table has no reason to be sparse.
>>
>> Ah okay, that makes more sense! I think M2P should be mapped
>> sparsely and we should expect guests to deal with it. We can revise
>> that opinion if we find strong reason to do otherwise. It's
>> certainly what I was intending to do.
>
> So for domain save, would you think that passing out zeroes for the holes
> in the output of XENMEM_machphys_mfn_list is a reasonable thing to do,
> with the expectation that the tools would split up their mapping request
> when encountering zeroes (a single mmap(), but multiple subsequent calls
> through privcmd)?
>
> Or should the tools perhaps not even do this with a single mmap() anymore,
> as the table can be pretty much unbounded now? (I'm limiting it to 256G in
> my patches, but that doesn't need to be the final limit.)

I'm working on similar stuff as well, mainly because memory hotplug
requires such support. Maybe I can base my work on yours :-)

256G is not enough if memory hotplug is enabled. On some platforms the user
can set the start address for hotplugged memory, and the default value is
1024G. That means 4M of memory will be used for the L3 table entries
(1024 * 4K).

I'm not sure whether Keir's "sparse-memory x86 systems are in the pipeline"
is about memory hotplug as well? Will there be sparse memory when the
system boots up?

Currently I didn't change the M2P setup code at paging init; instead, I
just add a new sparse M2P table when hot-add happens.

--jyh

> Also, will it be okay to leave the tools-side work to be done by someone
> more familiar with it than I am?
>
> Jan
On 17/09/2009 06:42, "Jiang, Yunhong" <yunhong.jiang@intel.com> wrote:

> I'm working on similar stuff as well, mainly because memory hotplug
> requires such support. Maybe I can base my work on yours :-)
>
> 256G is not enough if memory hotplug is enabled. On some platforms the user
> can set the start address for hotplugged memory, and the default value is
> 1024G. That means 4M of memory will be used for the L3 table entries
> (1024 * 4K).

Jan doesn't mean that the M2P only addresses up to 256G; he means that the
M2P itself can be up to 256G in size! I.e., it will be able to address up
to 128TB :-)

The only structure in Xen that I think doesn't just work with expanding its
virtual memory allocation and sparse-mapping is the '1:1 memory mapping'.
Because to address such large sparse memory maps, the virtual memory
allocation would be too large. So I'm guessing the '1:1 memory map' area
will end up divided into say 1TB strips with phys_to_virt() executing a
radix-tree lookup to map physical address ranges onto those
dynamically-allocated strips.

-- Keir
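To make the idea concrete, a strip-based phys_to_virt() might look roughly
like the sketch below, using a plain array where a radix tree could equally
well be used; all names and sizes here are illustrative assumptions, not
anything that exists in the tree:

#define STRIP_SHIFT  40                       /* 1TB strips */
#define MAX_STRIPS   128                      /* covers up to 128TB */

/* Virtual base of each populated strip; NULL => strip not populated. */
static void *strip_virt_base[MAX_STRIPS];

static inline void *phys_to_virt_sparse(unsigned long pa)
{
    unsigned long strip  = pa >> STRIP_SHIFT;
    unsigned long offset = pa & ((1UL << STRIP_SHIFT) - 1);

    /* Caller must only pass addresses within a populated strip. */
    return (char *)strip_virt_base[strip] + offset;
}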
Keir Fraser wrote:
> On 17/09/2009 06:42, "Jiang, Yunhong" <yunhong.jiang@intel.com> wrote:
>
>> I'm working on similar stuff as well, mainly because memory hotplug
>> requires such support. Maybe I can base my work on yours :-)
>>
>> 256G is not enough if memory hotplug is enabled. On some platforms the
>> user can set the start address for hotplugged memory, and the default
>> value is 1024G. That means 4M of memory will be used for the L3 table
>> entries (1024 * 4K).
>
> Jan doesn't mean that the M2P only addresses up to 256G; he means that the
> M2P itself can be up to 256G in size! I.e., it will be able to address up
> to 128TB :-)
>
> The only structure in Xen that I think doesn't just work with expanding
> its virtual memory allocation and sparse-mapping is the '1:1 memory
> mapping'. Because to address such large sparse memory maps, the virtual
> memory allocation would be too large. So I'm guessing the '1:1 memory map'
> area will end up divided into say 1TB strips with phys_to_virt() executing
> a radix-tree lookup to map physical address ranges onto those
> dynamically-allocated strips.

Yes, and the strip size may be determined dynamically.

Also, will we always keep the 1:1 memory mapping for all memory? Currently
we have at most 5TB of virtual address space for the 1:1 mapping; hopefully
that will work for most systems now.

Some other things need to change for hot-add and sparse memory, for example
phys_to_nid/memnodemap etc.

--jyh

> -- Keir
On 17/09/2009 08:28, "Jiang, Yunhong" <yunhong.jiang@intel.com> wrote:

>> The only structure in Xen that I think doesn't just work with expanding
>> its virtual memory allocation and sparse-mapping is the '1:1 memory
>> mapping'. Because to address such large sparse memory maps, the virtual
>> memory allocation would be too large. So I'm guessing the '1:1 memory map'
>> area will end up divided into say 1TB strips with phys_to_virt() executing
>> a radix-tree lookup to map physical address ranges onto those
>> dynamically-allocated strips.
>
> Yes, and the strip size may be determined dynamically.
>
> Also, will we always keep the 1:1 memory mapping for all memory? Currently
> we have at most 5TB of virtual address space for the 1:1 mapping; hopefully
> that will work for most systems now.

Yeah, I think 5TB will just about do us for now. ;-) After all, I think the
biggest machine Xen has been tested on so far is 256GB.

-- Keir
>>> Keir Fraser <keir.fraser@eu.citrix.com> 17.09.09 08:19 >>>
> The only structure in Xen that I think doesn't just work with expanding its
> virtual memory allocation and sparse-mapping is the '1:1 memory mapping'.

The frame_table really also needs compression - a 256G M2P would imply 2T
of frame table, which isn't reasonable. I'm therefore using the same
indexing for the 1:1 mapping and the frame table.

> Because to address such large sparse memory maps, the virtual memory
> allocation would be too large. So I'm guessing the '1:1 memory map' area
> will end up divided into say 1TB strips with phys_to_virt() executing a
> radix-tree lookup to map physical address ranges onto those
> dynamically-allocated strips.

Actually, I considered anything but a simple address transformation as too
expensive (for a first cut at least), and I'm thus not using any sort of
lookup, but rather determine bits below the most significant one that
aren't used in any (valid) MFN. Thus the transformation is two ANDs, a
shift, and an OR.

A more involved translation (including some sort of lookup) can imo be used
as a replacement if this simple mechanism turns out insufficient.

Btw., I also decided against filling the holes in the M2P table mapping -
for debuggability purposes, I definitely want to keep the holes in the
writeable copy of the table (so that invalid accesses crash rather than
causing data corruption). Instead, I now fill the holes only in the
XENMEM_machphys_mfn_list handler (and I'm intentionally using the most
recently stored MFN in favor of the first one, to reduce the chance of
reference count overflows when these get passed back to mmu_update - if the
holes turn out still too large, this might need further tuning, but otoh in
such setups, as said in an earlier reply, I think the tools should avoid
mapping the whole M2P in a single chunk, and hence immediately recurring
MFNs can serve as a good indication to the tools that this ought to
happen).

Jan
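For illustration, the "two ANDs, a shift, and an OR" transformation
described above could be sketched as follows; the hole_shift/hole_width
values and all names are made up here, and in the real patches they would
be derived at boot from the machine's memory layout:

/*
 * A run of bits that is zero in every valid MFN is squeezed out, so the
 * 1:1 map and the frame table only need virtual space for the compressed
 * index.
 */
static unsigned int hole_shift;   /* lowest bit of the unused run */
static unsigned int hole_width;   /* number of unused bits */

static inline unsigned long mfn_to_idx(unsigned long mfn)
{
    unsigned long low_mask = (1UL << hole_shift) - 1;

    /* two ANDs, a shift, and an OR */
    return (mfn & low_mask) | ((mfn & ~low_mask) >> hole_width);
}

static inline unsigned long idx_to_mfn(unsigned long idx)
{
    unsigned long low_mask = (1UL << hole_shift) - 1;

    return (idx & low_mask) | ((idx & ~low_mask) << hole_width);
}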
>>> Keir Fraser <keir.fraser@eu.citrix.com> 17.09.09 10:02 >>>
> Yeah, I think 5TB will just about do us for now. ;-) After all, I think the
> biggest machine Xen has been tested on so far is 256GB.

1TB has been tested by IBM for us (SLE11), and required quite a few
adjustments. And yes, part of the patch set is one to bump the limit to
5TB.

Jan
Jan Beulich wrote:
>>>> Keir Fraser <keir.fraser@eu.citrix.com> 17.09.09 08:19 >>>
>> The only structure in Xen that I think doesn't just work with expanding
>> its virtual memory allocation and sparse-mapping is the '1:1 memory
>> mapping'.
>
> The frame_table really also needs compression - a 256G M2P would imply 2T
> of frame table, which isn't reasonable. I'm therefore using the same
> indexing for the 1:1 mapping and the frame table.

Hmm, originally I thought it was OK to have holes in the frame_table, but
yes, it seems we can't have holes in it. So either we squeeze out the
holes, or we waste a lot of memory.

>> Because to address such large sparse memory maps, the virtual memory
>> allocation would be too large. So I'm guessing the '1:1 memory map' area
>> will end up divided into say 1TB strips with phys_to_virt() executing a
>> radix-tree lookup to map physical address ranges onto those
>> dynamically-allocated strips.
>
> Actually, I considered anything but a simple address transformation as too
> expensive (for a first cut at least), and I'm thus not using any sort of
> lookup, but rather determine bits below the most significant one that
> aren't used in any (valid) MFN. Thus the transformation is two ANDs, a
> shift, and an OR.

Can you elaborate on it a bit? For example, considering a system with the
following memory layout: 1G ~ 3G, 1024G ~ 1028G, 1056G ~ 1060G, I didn't
catch your algorithm :$

> A more involved translation (including some sort of lookup) can imo be used
> as a replacement if this simple mechanism turns out insufficient.
>
> Btw., I also decided against filling the holes in the M2P table mapping -
> for debuggability purposes, I definitely want to keep the holes in the
> writeable copy of the table (so that invalid accesses crash rather than
> causing data corruption). Instead, I now fill the holes only in the
> XENMEM_machphys_mfn_list handler (and I'm intentionally using the most
> recently stored MFN in favor of the first one, to reduce the chance of
> reference count overflows when these get passed back to mmu_update - if the
> holes turn out still too large, this might need further tuning, but otoh in
> such setups, as said in an earlier reply, I think the tools should avoid
> mapping the whole M2P in a single chunk, and hence immediately recurring
> MFNs can serve as a good indication to the tools that this ought to
> happen).
>
> Jan
>>> "Jiang, Yunhong" <yunhong.jiang@intel.com> 17.09.09 11:05 >>> >Can you elaborate it a bit? For example, considering system with following >memory layout: 1G ~ 3G, 1024G ~ 1028G, 1056G~1060G, I did''t catch >you algrithom :$That would be (assuming it really starts a 0) 0000000000000000-00000000bfffffff 0000010000000000-00000100ffffffff 0000010800000000-00000108ffffffff right? The common non-top zero bits are 36-39, which would reduce the virtual address space needed for the 1:1 mapping and frame table approximately by a factor of 16 (with the remaining gaps dealt with by leaving holes in these tables'' mappings). Actually, this tells me that I shouldn''t simply use the first range of non- top zero bits, but the largest one (currently, I would use bits 32-34). But, to be clear, for the purposes of memory hotplug, the SRAT is what the parameters get determined from, not the E820 (since these parameters, other than the upper boundaries, must not change post- boot). Jan _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Jan Beulich wrote:
>>>> "Jiang, Yunhong" <yunhong.jiang@intel.com> 17.09.09 11:05 >>>
>> Can you elaborate on it a bit? For example, considering a system with
>> the following memory layout: 1G ~ 3G, 1024G ~ 1028G, 1056G ~ 1060G, I
>> didn't catch your algorithm :$
>
> That would be (assuming it really starts at 0)
>
> 0000000000000000-00000000bfffffff
> 0000010000000000-00000100ffffffff
> 0000010800000000-00000108ffffffff
>
> right? The common non-top zero bits are 36-39, which would reduce the
> virtual address space needed for the 1:1 mapping and frame table
> approximately by a factor of 16 (with the remaining gaps dealt with by
> leaving holes in these tables' mappings).
>
> Actually, this tells me that I shouldn't simply use the first range of
> non-top zero bits, but the largest one (currently, I would use bits 32-34).
>
> But, to be clear, for the purposes of memory hotplug, the SRAT is what the
> parameters get determined from, not the E820 (since these parameters, other
> than the upper boundaries, must not change post-boot).

Hmm, this method is difficult for hotplug. I'm not sure all new memory will
be reported in the SRAT.

Also, did you change mfn_valid()? Otherwise, the holes in the frame table
and M2P table will cause corruption in the hypervisor. Currently I'm using
an array to keep track of memory population. Each entry guards something
like 32G of memory. If the entry is empty, that means the MFN is invalid.
I think we could simply make the entry value the virtual address, but I
suspect that's not efficient enough.

--jyh

> Jan
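For illustration, the coarse population array described here might look
roughly like the sketch below, assuming 4K pages; all names and sizes are
made up and only stand in for whatever the actual patches use:

#define CHUNK_SHIFT      35                          /* 32G chunks */
#define PAGE_SHIFT       12
#define FRAMES_PER_CHUNK (1UL << (CHUNK_SHIFT - PAGE_SHIFT))
#define MAX_CHUNKS       4096                        /* covers up to 128TB */

/* Non-NULL if the chunk's frame-table section is populated (e.g. hot-added). */
static void *chunk_frame_table[MAX_CHUNKS];

static inline int sparse_mfn_valid(unsigned long mfn)
{
    unsigned long chunk = mfn / FRAMES_PER_CHUNK;

    return chunk < MAX_CHUNKS && chunk_frame_table[chunk] != NULL;
}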