Context: x86_64, 6 GB RAM.
(XEN) Physical RAM map:
(XEN) 0000000000000000 - 000000000009dc00 (usable)
(XEN) 000000000009dc00 - 00000000000a0000 (reserved)
(XEN) 00000000000d0000 - 0000000000100000 (reserved)
(XEN) 0000000000100000 - 00000000dff60000 (usable)
(XEN) 00000000dff60000 - 00000000dff72000 (ACPI data)
(XEN) 00000000dff72000 - 00000000dff80000 (ACPI NVS)
(XEN) 00000000dff80000 - 00000000e0000000 (reserved)
(XEN) 00000000fec00000 - 00000000fec00400 (reserved)
(XEN) 00000000fee00000 - 00000000fee01000 (reserved)
(XEN) 00000000fff80000 - 0000000100000000 (reserved)
(XEN) 0000000100000000 - 0000000180000000 (usable) <---- above hole
Notice that there is now memory above the PCI hole.
get_page_from_l1e() has the following code:
    /* No reference counting for out-of-range I/O pages. */
    if ( !pfn_valid(mfn) )
        return 1;
where:
#define pfn_valid(_pfn) ((_pfn) < max_page)
Since max_page is now above the out-of-range I/O region, pfn_valid()
reports those MFNs as valid. get_page() is therefore called, but it
fails because the page count is zero ("not allocated"), and ultimately
ioremap() fails with ENOMEM for several device drivers.
While the attached patch fixes this problem (based on empirical
evidence), there may be a better solution.
sRp
--
Scott Parish
Signed-off-by: srparish@us.ibm.com
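
(For concreteness, here is the failure path annotated as a sketch.
This is not the attached patch -- it just restates the snippet above
with the numbers for a 6 GB machine; the reference-counting details
after the check are paraphrased and may differ slightly from the
actual source.)

    /* No reference counting for out-of-range I/O pages. */
    if ( !pfn_valid(mfn) )   /* e.g. the IO-APIC page: mfn 0xfec00, and
                              * 0xfec00 < max_page (~0x180000 with 6 GB
                              * of RAM), so this test is now false...  */
        return 1;            /* ...and the early exit for I/O pages is
                              * no longer taken.                       */

    /* Control falls through to the normal get_page()/get_page_and_type()
     * path; the frame_table entry for mfn 0xfec00 was never allocated,
     * so its count is zero, the reference is refused, and the guest's
     * ioremap() eventually fails with -ENOMEM. */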
> #define pfn_valid(_pfn) ((_pfn) < max_page)
>
> Since max_page is now above the out-of-range I/O region, pfn_valid()
> reports those MFNs as valid. get_page() is therefore called, but it
> fails because the page count is zero ("not allocated"), and
> ultimately ioremap() fails with ENOMEM for several device drivers.
>
> While the attached patch fixes this problem (based on empirical
> evidence), there may be a better solution.

I think the best fix is to have the frame_table cover the whole of
physical RAM, and then mark non-RAM pages in the frame_table.

To save some memory, we could map the frame_table in virtual address
space, then use __get_user() when reading from it (a fault indicates
a non-RAM page too).

Best,
Ian
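
(A minimal sketch of the check Ian describes. It is illustrative only:
__get_user() and frame_table exist in the Xen of this era, but the
mfn_is_ram() helper and the PGC_io_page flag are made-up names standing
in for whatever marker the real fix would use.)

    static inline int mfn_is_ram(unsigned long mfn)
    {
        u32 count_info;

        /* The frame_table lives in virtual address space and is only
         * backed for RAM ranges, so reading an entry for a non-RAM mfn
         * simply faults. */
        if ( __get_user(count_info, &frame_table[mfn].count_info) )
            return 0;                        /* fault: no entry, not RAM */

        return !(count_info & PGC_io_page);  /* assumed non-RAM marker   */
    }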
> -----Original Message-----
> From: xen-devel-bounces@lists.xensource.com
> [mailto:xen-devel-bounces@lists.xensource.com] On Behalf Of Ian Pratt
> Sent: Thursday, May 19, 2005 6:18 AM
>
> I think the best fix is to have the frame_table cover the whole of
> physical RAM, and then mark non-RAM pages in the frame_table.
>
> To save some memory, we could map the frame_table in virtual address
> space, then use __get_user() when reading from it (a fault indicates
> a non-RAM page too).

I agree with the approach of a virtual frame_table to support big
holes, much like the virtual memmap Linux currently uses on large
machines. However, this also raises the question of how to support
large holes for DomU. The frame table only describes the machine
hierarchy; the same problem exists for the PMT (the 1:1 page table),
which describes the guest-physical layer for DomU.

With wider deployment of Xen and more hardware resources, it is
reasonable to configure a DomU with gigabytes of memory, which may
also span a big I/O hole in the guest-physical layer. Say we give 4G
to a DomU: the PMT table then has to cover the whole 5G space if the
I/O hole is 1G. This problem is especially practical on platforms
with large address spaces, like x86_64 and IPF. We ran into it while
designing and implementing DomU support on IPF/VTI.

The current working model in the control panel (CP) and device model
(DM) does not really match this requirement. The usual sequence in
the CP or DM is to call xc_get_pfn_list, extract the guest pfn -> mfn
mapping as a plain array, and later use xc_map_foreign to map the
domU's space into Dom0's virtual space based on that plain array. The
major problems are:

- xc_get_pfn_list is currently implemented in the HV by walking
  domain->page_list, but page_list is only the collection of machine
  page frames allocated to the domain; it does not reflect the actual
  guest pfn -> machine pfn relationship.
- The DM and CP should not assume a plain, contiguous memory mapping.

To solve this, we are considering the following approaches:

1. Add a new DOM0_GETMEMSECTIONS hypercall to get the holes first,
   and then use xc_get_pfn_list to get the pure memory mapping. This
   information can then be passed to xc_map_foreign to construct the
   mapping in Dom0's virtual space.
2. Add a new DOM0_GETPMTTABLE hypercall which returns a plain array
   that includes the holes. xc_map_foreign would then need no change,
   since the array itself accounts for the holes.

Comments?

Rgs,
Kevin
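
(To make option 1 above concrete, one possible shape for the interface
is sketched below. It is purely illustrative: the structure names,
fields and the hypercall itself are a proposal, not anything in the
tree.)

    /* One guest-physical section: either RAM or a hole (MMIO etc.). */
    typedef struct {
        unsigned long start_pfn;      /* first guest pfn of the section */
        unsigned long nr_pfns;        /* length of the section in pages */
        int           is_ram;         /* 1 = RAM, 0 = hole / I/O        */
    } dom0_memsection_t;

    typedef struct {
        /* IN */
        domid_t            domain;
        unsigned int       max_sections;  /* size of caller's buffer    */
        dom0_memsection_t *sections;      /* buffer filled in by Xen    */
        /* OUT */
        unsigned int       nr_sections;   /* entries actually written   */
    } dom0_getmemsections_t;

The CP/DM would then call xc_get_pfn_list for the RAM sections only
and skip the holes when setting up the xc_map_foreign mapping.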
> I agree with the approach of a virtual frame_table to support big
> holes, much like the virtual memmap Linux currently uses on large
> machines. However, this also raises the question of how to support
> large holes for DomU.

Guests can have arbitrary sparse machine memory maps already on x86.

Ian
> -----Original Message-----
> From: Ian Pratt [mailto:m+Ian.Pratt@cl.cam.ac.uk]
> Sent: Thursday, May 19, 2005 2:51 PM
> To: Tian, Kevin; Scott Parish; xen-devel@lists.xensource.com
>
> Guests can have arbitrary sparse machine memory maps already on x86.
>
> Ian

So how is that sparse style implemented? Could you say more or show a
link to the place in the source tree? :)

Take the following sequence in xc_linux_build.c as an example:

1. setup_guest() calls xc_get_pfn_list(xc_handle, dom, page_array,
   nr_pages), where page_array is acquired by walking
   domain->page_list in the HV. So page_array is really the mapping
   [index in page_list -> machine pfn], not [guest pfn -> machine
   pfn].

2. loadelfimage() then uses that page_array to load the domU kernel,
   like:

       pa = (phdr->p_paddr + done) - dsi->v_start;
       va = xc_map_foreign_range(xch, dom, PAGE_SIZE, PROT_WRITE,
                                 parray[pa >> PAGE_SHIFT]);

   Here parray[pa >> PAGE_SHIFT] is used, which is tempting to read as
   treating the index into page_array as a guest pfn -- but it is not,
   per point 1.

Yes, it happens to work in the example above, since the kernel is
usually loaded at a low address, far from the I/O hole, and in that
low range "index in page_list" == "guest pfn". But this is not a
correct model in general. In particular the device model, which needs
to map all of domU's machine pages, follows the same flawed
xc_get_pfn_list + xc_map_foreign model.

Maybe the sparse memory map is already managed inside the HV as you
say, but we also need to propagate the same sparseness information to
the CP and DM, especially for GB-sized memory. That is why we are
considering adding a new hypercall.

Correct me if I have misunderstood something. :)

Thanks,
Kevin
> So how is that sparse style implemented? Could you say more or show
> a link to the place in the source tree? :)

On x86, for fully virtualized guests the pfn->mfn table is virtually
mapped, and hence you can have holes in the 'physical' memory and
arbitrary page-granularity mappings to machine memory. See
phys_to_machine_mapping().

For paravirtualized guests we provide a model whereby 'physical'
memory starts at 0 and is contiguous, but maps to arbitrary machine
pages. Since for paravirtualized guests you can hack the kernel, I
don't see any need to support anything else. [Note that I/O addresses
do not have pages in this map, whereas they do in the fully
virtualized case.]

Ian
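
(A conceptual sketch of the two models Ian describes -- not the real
Xen definitions. phys_to_machine_table, INVALID_MFN and the function
names are placeholders; the dense phys_to_machine_mapping[] array on
the paravirtualized side is only meant to mirror what the guest kernel
uses.)

    /* Fully virtualized (VMX) guest: the gpfn->mfn table is itself
     * virtually mapped, so entries inside a hole are simply not backed. */
    unsigned long vmx_gpfn_to_mfn(unsigned long gpfn)
    {
        unsigned long mfn;

        /* a fault here means the gpfn lies in a hole / MMIO range */
        if ( __get_user(mfn, &phys_to_machine_table[gpfn]) )
            return INVALID_MFN;
        return mfn;
    }

    /* Paravirtualized guest: pseudophysical memory runs 0..nr_pages-1
     * with no holes, and a dense array maps each pfn to an arbitrary
     * mfn. */
    unsigned long pv_pfn_to_mfn(unsigned long pfn)
    {
        return phys_to_machine_mapping[pfn];
    }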
> -----Original Message-----
> From: Ian Pratt [mailto:m+Ian.Pratt@cl.cam.ac.uk]
> Sent: Thursday, May 19, 2005 3:44 PM

Thanks for the nice explanation. As background, let's not restrict the
discussion to x86, since other architectures like x86-64/ia64 will
have more memory, which may span the traditional I/O hole (MMIO
ranges, etc.).

> On x86, for fully virtualized guests the pfn->mfn table is virtually
> mapped, and hence you can have holes in the 'physical' memory and
> arbitrary page-granularity mappings to machine memory. See
> phys_to_machine_mapping().

I can see that 1:1 mapping table, mapped by one pgd entry on current
x86. But, as I described at the tail of this mail, why isn't that
information about holes used by the CP and DM? Why don't the CP and DM
use phys_to_machine_mapping() instead of xc_get_pfn_list? Ideally the
meaning of xc_get_pfn_list is only to get all the machine frames
allocated to a domain, not the guest pfn -> machine pfn mapping, so it
is not the right anchor for dom0 to manipulate domN's memory...

> For paravirtualized guests we provide a model whereby 'physical'
> memory starts at 0 and is contiguous, but maps to arbitrary machine
> pages. Since for paravirtualized guests you can hack the kernel, I
> don't see any need to support anything else. [Note that I/O
> addresses do not have pages in this map, whereas they do in the
> fully virtualized case.]

Sorry, I need some time to understand the trick here. Do you mean the
'physical' memory will always be contiguous for any memory size, like
4G, 16G, nG...? Does that mean there is some other way to arrange the
MMIO addresses, PIB addresses, etc. dynamically based on memory size?
Or will all the I/O be dummy operations... But dom0 has to access
physical memory... Sorry, I am getting confused here, and I appreciate
your input. :)

Thanks,
Kevin
> -----Original Message-----
> From: xen-devel-bounces@lists.xensource.com
> [mailto:xen-devel-bounces@lists.xensource.com] On Behalf Of Tian, Kevin
> Sent: Thursday, May 19, 2005 4:06 PM
>
> Sorry, I need some time to understand the trick here. Do you mean
> the 'physical' memory will always be contiguous for any memory size,
> like 4G, 16G, nG...?

Hi Ian,

For this part I made a mistake and confused domN with dom0. OK, for a
paravirtualized guest there is actually no I/O range for domN, since
the frontend driver in domN does everything by communicating with the
backend in Dom0. But what about a driver domain which has access to a
physical device and thus needs real I/O addresses?

Thanks,
Kevin
On 18 May 2005, at 23:18, Ian Pratt wrote:

> I think the best fix is to have the frame_table cover the whole of
> physical RAM, and then mark non-RAM pages in the frame_table.
>
> To save some memory, we could map the frame_table in virtual address
> space, then use __get_user() when reading from it (a fault indicates
> a non-RAM page too).

We already do this, and I initialise every non-RAM e820 entry as an
I/O area in the frame table. Actually, now I think about it, I/O
mappings probably don't tend to appear in the e820... The best fix is
to initialise the frame table to all I/O, then override the RAM
sections and the Xen section (to protect it). I'll sort out a fix.

 -- Keir
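
(A sketch of the initialisation order Keir describes. Illustrative
only: the e820 walk follows the usual Xen structures, but the
PGC_io_page marker is an assumed name for however non-RAM frames end
up being tagged.)

    unsigned long mfn;
    int i;

    /* 1. Pessimistically mark every frame as an I/O page... */
    for ( mfn = 0; mfn < max_page; mfn++ )
        frame_table[mfn].count_info = PGC_io_page;       /* assumed flag */

    /* 2. ...then override the RAM ranges taken from the e820 map... */
    for ( i = 0; i < e820.nr_map; i++ )
    {
        if ( e820.map[i].type != E820_RAM )
            continue;
        for ( mfn = e820.map[i].addr >> PAGE_SHIFT;
              mfn < ((e820.map[i].addr + e820.map[i].size) >> PAGE_SHIFT);
              mfn++ )
            frame_table[mfn].count_info = 0;             /* ordinary RAM */
    }

    /* 3. ...and finally re-reserve the frames covering the Xen image so
     *    a guest cannot take a writable reference to them. */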
> For this part I made a mistake and confused domN with dom0. OK, for
> a paravirtualized guest there is actually no I/O range for domN,
> since the frontend driver in domN does everything by communicating
> with the backend in Dom0. But what about a driver domain which has
> access to a physical device and thus needs real I/O addresses?

This works the same in dom0 and other domains: I/O machine addresses
must be mapped into the kernel virtual address space before you can
use them. They are totally orthogonal to RAM addresses, and don't get
mfn->pfn translated.

Ian
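
(For example, a driver domain touches device registers the usual Linux
way; the BAR value is a machine address and stays one. Generic 2.6-era
driver fragment, register offset made up.)

    /* inside the driver's PCI probe routine; 'pdev' is the struct
     * pci_dev being probed */
    void __iomem *regs;

    regs = ioremap(pci_resource_start(pdev, 0), pci_resource_len(pdev, 0));
    if (regs == NULL)
            return -ENOMEM;

    writel(0x1, regs + 0x10);     /* 0x10: hypothetical control register */
    iounmap(regs);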
On 19 May 2005, at 09:14, Tian, Kevin wrote:

> For this part I made a mistake and confused domN with dom0. OK, for
> a paravirtualized guest there is actually no I/O range for domN,
> since the frontend driver in domN does everything by communicating
> with the backend in Dom0. But what about a driver domain which has
> access to a physical device and thus needs real I/O addresses?

We rely on drivers using the dma_coherent/pci_consistent/bus_address
macros for mapping device memory. Originally designed to handle
IOMMUs, they are handy for us for identifying the places where code is
handling real machine physical addresses rather than pseudophysical
addresses. This gets groans of distaste from the kernel maintainers
but has worked enormously well so far (AGP needed patching separately,
though).

 -- Keir
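
(An example of the convention Keir refers to, using the 2.6-era PCI
DMA API; fragment from a hypothetical driver.)

    void       *vaddr;
    dma_addr_t  bus_addr;

    /* The DMA layer hands back both a CPU virtual address and a
     * bus/machine address; the driver never derives one from the other
     * by arithmetic, which is the hook that lets Xen hand out
     * machine-contiguous memory underneath. */
    vaddr = pci_alloc_consistent(pdev, PAGE_SIZE, &bus_addr);
    if (vaddr == NULL)
            return -ENOMEM;

    /* program the device with 'bus_addr', touch the buffer via 'vaddr' */

    pci_free_consistent(pdev, PAGE_SIZE, vaddr, bus_addr);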
> -----Original Message-----
> From: Ian Pratt [mailto:m+Ian.Pratt@cl.cam.ac.uk]
> Sent: Thursday, May 19, 2005 4:20 PM
>
> This works the same in dom0 and other domains: I/O machine addresses
> must be mapped into the kernel virtual address space before you can
> use them. They are totally orthogonal to RAM addresses, and don't
> get mfn->pfn translated.
>
> Ian

Thanks, that makes it clearer. So just for a final confirmation (sorry
to be tedious):

1. A driver domain domN's 'physical' memory is set up as 0 - 4G,
   contiguous, and
2. When dom0 does the PCI bus init, the machine MMIO space is set up
   in [3G, 3G+512M] (taking a large range as an example).

Under these two conditions, can the current paravirtualized
implementation cleanly distinguish between:

1. a normal access to 'physical' address 3G + 4k, and
2. an access to machine MMIO address 3G + 4k of some physical device?

Is that assumption right? BTW, does this create any complexity for
non-access operations, like comparisons of addresses?

Thanks a lot,
- Kevin
> Thanks, that makes it clearer. So just for a final confirmation
> (sorry to be tedious):
>
> 1. A driver domain domN's 'physical' memory is set up as 0 - 4G,
>    contiguous, and
> 2. When dom0 does the PCI bus init, the machine MMIO space is set up
>    in [3G, 3G+512M] (taking a large range as an example).
>
> Under these two conditions, can the current paravirtualized
> implementation cleanly distinguish between:
>
> 1. a normal access to 'physical' address 3G + 4k, and
> 2. an access to machine MMIO address 3G + 4k of some physical
>    device?
>
> Is that assumption right?

Yes, that's it.

> BTW, does this create any complexity for non-access operations, like
> comparisons of addresses?

Linux doesn't do this (it doesn't make sense anyhow).

Ian
> -----Original Message-----
> From: Keir Fraser [mailto:Keir.Fraser@cl.cam.ac.uk]
> Sent: Thursday, May 19, 2005 4:23 PM
> To: Tian, Kevin
>
> We rely on drivers using the dma_coherent/pci_consistent/bus_address
> macros for mapping device memory. Originally designed to handle
> IOMMUs, they are handy for us for identifying the places where code
> is handling real machine physical addresses rather than
> pseudophysical addresses. This gets groans of distaste from the
> kernel maintainers but has worked enormously well so far (AGP needed
> patching separately, though).
>
>  -- Keir

Thanks for the guidance. That is indeed a way to differentiate normal
memory from machine memory used by a device. After searching the
source tree: yes, if all drivers conform to this convention (they
should), the low-level pci-dma code can adjust the guest pfn ->
machine pfn mapping to guarantee contiguity at the machine level, and
also honour the 4G limitation of old DMA drivers on new platforms.

But this only handles the difference between machine memory and
'physical' memory, not the one between 'physical' memory and machine
MMIO space. If 'physical' memory is kept contiguous regardless of the
memory size, it will overlap the machine MMIO space somewhere. But
since your careful design ensures there is no confusion, so be it --
there is already plenty of mixed knowledge in our dear paravirtualized
domain. :)

Thanks,
Kevin
> -----Original Message-----
> From: Ian Pratt [mailto:m+Ian.Pratt@cl.cam.ac.uk]
> Sent: Thursday, May 19, 2005 5:14 PM
>
>> Is that assumption right?
>
> Yes, that's it.

OK, got it. Then, aside from paravirtualized Linux, do you agree that
some change should be made to the unmodified (VMX) domain builder and
the DM? When the CP creates a domain, and when the DM services another
domain, they shouldn't operate on DomN's memory by simply acquiring a
plain contiguous page_array that carries no hole information. Either
extra information about the holes, or a page_array that itself
contains the holes, should be added.

>> BTW, does this create any complexity for non-access operations,
>> like comparisons of addresses?
>
> Linux doesn't do this (it doesn't make sense anyhow).

I'll keep my mouth shut now until I convince myself with more
knowledge of Linux. :)

- Kevin
> OK, got it. Then, aside from paravirtualized Linux, do you agree
> that some change should be made to the unmodified (VMX) domain
> builder and the DM? When the CP creates a domain, and when the DM
> services another domain, they shouldn't operate on DomN's memory by
> simply acquiring a plain contiguous page_array that carries no hole
> information. Either extra information about the holes, or a
> page_array that itself contains the holes, should be added.

VMX domains already have a virtually mapped pfn->mfn table stored
within Xen. See phys_to_machine_mapping(gpfn).

Ian
Ian Pratt wrote:
>> OK, got it. Then, aside from paravirtualized Linux, do you agree
>> that some change should be made to the unmodified (VMX) domain
>> builder and the DM? When the CP creates a domain, and when the DM
>> services another domain, they shouldn't operate on DomN's memory by
>> simply acquiring a plain contiguous page_array that carries no hole
>> information. Either extra information about the holes, or a
>> page_array that itself contains the holes, should be added.
>
> VMX domains already have a virtually mapped pfn->mfn table stored
> within Xen.
> See phys_to_machine_mapping(gpfn)

Yes, but the current VMX code still uses a simple contiguous
page_array to do the foreign map, even though phys_to_machine_mapping
exists in the HV. It is just a minor bug. Probably exposing
phys_to_machine_mapping to the DM through a special hypercall, as
Kevin suggested, is an easy way for the DM to do the mapping right.

Comments?
Eddie
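
(A sketch of what such a libxc wrapper might look like. Hypothetical:
the DOM0_GETPMTTABLE command and its union member do not exist; only
do_dom0_op()/dom0_op_t are real libxc plumbing.)

    /* Returns, for each guest pfn in [0, max_pfn), the backing mfn, or
     * an invalid marker if that pfn falls inside a hole. */
    int xc_get_pfn_to_mfn_table(int xc_handle, uint32_t domid,
                                unsigned long *table, unsigned long max_pfn)
    {
        dom0_op_t op;

        op.cmd = DOM0_GETPMTTABLE;           /* proposed, not yet in tree */
        op.u.getpmttable.domain  = (domid_t)domid;
        op.u.getpmttable.max_pfn = max_pfn;
        op.u.getpmttable.buffer  = table;

        return do_dom0_op(xc_handle, &op);
    }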
> -----Original Message-----
> From: Ian Pratt [mailto:m+Ian.Pratt@cl.cam.ac.uk]
> Sent: Thursday, May 19, 2005 5:37 PM
>
> VMX domains already have a virtually mapped pfn->mfn table stored
> within Xen.
> See phys_to_machine_mapping(gpfn)
>
> Ian

Yes, as you said earlier. But it is not used by the CP and DM so far,
which still follow the page_array style. :( I just mean some change
may be required for the CP and DM to correctly manipulate another
domain's memory.

Thanks,
Kevin