Konrad Rzeszutek Wilk
2010-Nov-12 23:08 UTC
[Xen-devel] [RFC Patch] Support for making an E820 PCI hole in toolstack (xl + xm)
Hey guys,

Attached is an RFC patch for making a PCI hole in PV guests. This allows PV guests(*) with 4GB or more to now work properly with or without PCI passthrough cards.

Previously the Linux kernel would not be able to allocate the PCI region underneath the 4GB boundary, as that region was all System RAM. And you would see this:

[    0.000000] PM: Registered nosave memory: 00000000000a0000 - 0000000000100000
[    0.000000] PCI: Warning: Cannot find a gap in the 32bit address range
[    0.000000] PCI: Unassigned devices with 32bit resource registers may break!
[    0.000000] Allocating PCI resources starting at 100100000 (gap: 100100000:400000)

This patchset punches a PCI hole in the E820 map and also fills the P2M properly, so that now you can see (*):

[    0.000000] Allocating PCI resources starting at a0000000 (gap: a0000000:60000000)

It adds a new option to the guest config file, "pci_hole". The user can specify the start PFN, such as '0xc0000', or, in case of using xl, '1', which will automatically figure out the start of the PCI hole.

*: This option requires support in the Linux kernel to actually deal with two entries in the E820 map and P2M space filled with ~0.
The patches (draft, not ready for upstream) for the Linux kernel to support this are available at:

git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen.git devel/e820-hole

All of these patches make the E820 of the Linux guest with 4GB (or more) passed look like this (2.6.37-rc1+devel/e820-hole):
[    0.000000] Xen: 0000000000000000 - 00000000000a0000 (usable)
[    0.000000] Xen: 00000000000a0000 - 0000000000100000 (reserved)
[    0.000000] Xen: 0000000000100000 - 00000000a0000000 (usable)
[    0.000000] Xen: 0000000100000000 - 0000000160800000 (usable)

compared to (2.6.36):
[    0.000000] Xen: 0000000000000000 - 00000000000a0000 (usable)
[    0.000000] Xen: 00000000000a0000 - 0000000000100000 (reserved)
[    0.000000] Xen: 0000000000100000 - 0000000100000000 (usable)

and (2.6.37-rc1):
[    0.000000] Xen: 0000000000000000 - 00000000000a0000 (usable)
[    0.000000] Xen: 00000000000a0000 - 0000000000100000 (reserved)
[    0.000000] Xen: 0000000000100000 - 0000000100800000 (usable)

In regards to the patches that I am attaching here, what is the magic incantation to make the indentation/StyleGuide proper for the tools/libxc directory? The tab spacing is off a bit (I think).

I've tested this so far only on 64-bit guests, and I am quite sure the tool-stack needs some extra care for the 32-bit guests.

But please take a look and give feedback.

diff --git a/tools/libxc/xc_dom.h b/tools/libxc/xc_dom.h
--- a/tools/libxc/xc_dom.h
+++ b/tools/libxc/xc_dom.h
@@ -91,6 +91,8 @@ struct xc_dom_image {

     /* physical memory */
     xen_pfn_t total_pages;
+    /* start of the pci_hole. goes up to 4gb */
+    xen_pfn_t pci_hole;
     struct xc_dom_phys *phys_pages;
     int realmodearea_log;

diff --git a/tools/libxc/xc_dom_core.c b/tools/libxc/xc_dom_core.c
--- a/tools/libxc/xc_dom_core.c
+++ b/tools/libxc/xc_dom_core.c
@@ -715,17 +715,22 @@ int xc_dom_update_guest_p2m(struct xc_do
     uint32_t *p2m_32;
     uint64_t *p2m_64;
     xen_pfn_t i;
+    size_t tot_pages;

     if ( !dom->p2m_guest )
         return 0;

+    tot_pages = dom->total_pages;
+    if (dom->pci_hole)
+        tot_pages += (0x100000 - dom->pci_hole);
+
     switch ( dom->arch_hooks->sizeof_pfn )
     {
     case 4:
         DOMPRINTF("%s: dst 32bit, pages 0x%" PRIpfn "",
-                  __FUNCTION__, dom->total_pages);
+                  __FUNCTION__, tot_pages);
         p2m_32 = dom->p2m_guest;
-        for ( i = 0; i < dom->total_pages; i++ )
+        for ( i = 0; i < tot_pages; i++ )
             if ( dom->p2m_host[i] != INVALID_P2M_ENTRY )
                 p2m_32[i] = dom->p2m_host[i];
             else
@@ -733,9 +738,9 @@ int xc_dom_update_guest_p2m(struct xc_do
         break;
     case 8:
         DOMPRINTF("%s: dst 64bit, pages 0x%" PRIpfn "",
-                  __FUNCTION__, dom->total_pages);
+                  __FUNCTION__, tot_pages);
         p2m_64 = dom->p2m_guest;
-        for ( i = 0; i < dom->total_pages; i++ )
+        for ( i = 0; i < tot_pages; i++ )
             if ( dom->p2m_host[i] != INVALID_P2M_ENTRY )
                 p2m_64[i] = dom->p2m_host[i];
             else
diff --git a/tools/libxc/xc_dom_x86.c b/tools/libxc/xc_dom_x86.c
--- a/tools/libxc/xc_dom_x86.c
+++ b/tools/libxc/xc_dom_x86.c
@@ -406,6 +406,15 @@ static int alloc_magic_pages(struct xc_d
 {
     size_t p2m_size = dom->total_pages * dom->arch_hooks->sizeof_pfn;

+    if (dom->pci_hole && (dom->total_pages > dom->pci_hole))
+    {
+        size_t p2m_pci_hole_size = (0x100000 - dom->pci_hole) *
+                                   dom->arch_hooks->sizeof_pfn;
+
+        DOMPRINTF("%s: Expanding P2M to include PCI hole (%ld->%ld)\n",
+                  __FUNCTION__, p2m_size, p2m_size + p2m_pci_hole_size);
+        p2m_size += p2m_pci_hole_size;
+    }
     /* allocate phys2mach table */
     if ( xc_dom_alloc_segment(dom, &dom->p2m_seg, "phys2mach", 0, p2m_size) )
         return -1;
@@ -712,6 +721,7 @@ int arch_setup_meminit(struct xc_dom_ima
 {
     int rc;
     xen_pfn_t pfn, allocsz, i, j, mfn;
+    size_t p2m_size;

     rc = x86_compat(dom->xch, dom->guest_domid, dom->guest_type);
     if ( rc )
@@ -723,8 +733,13 @@ int arch_setup_meminit(struct xc_dom_ima
         if ( rc )
             return rc;
     }
+    p2m_size = dom->total_pages;

-    dom->p2m_host = xc_dom_malloc(dom, sizeof(xen_pfn_t) * dom->total_pages);
+    if (dom->pci_hole && (dom->total_pages > dom->pci_hole))
+        p2m_size += (0x100000 - dom->pci_hole);
+
+    DOMPRINTF("Allocating %ld bytes for P2M", p2m_size * sizeof(xen_pfn_t));
+    dom->p2m_host = xc_dom_malloc(dom, sizeof(xen_pfn_t) * p2m_size);
     if ( dom->superpages )
     {
         int count = dom->total_pages >> SUPERPAGE_PFN_SHIFT;
@@ -750,21 +765,66 @@ int arch_setup_meminit(struct xc_dom_ima
     }
     else
     {
-        /* setup initial p2m */
-        for ( pfn = 0; pfn < dom->total_pages; pfn++ )
-            dom->p2m_host[pfn] = pfn;
-
-        /* allocate guest memory */
-        for ( i = rc = allocsz = 0;
-              (i < dom->total_pages) && !rc;
-              i += allocsz )
+        /* for PCI mapping, stick INVALID_MFN in the PCI_HOLE */
+        if ( dom->pci_hole && (dom->total_pages > dom->pci_hole) )
         {
-            allocsz = dom->total_pages - i;
-            if ( allocsz > 1024*1024 )
-                allocsz = 1024*1024;
-            rc = xc_domain_populate_physmap_exact(
-                dom->xch, dom->guest_domid, allocsz,
-                0, 0, &dom->p2m_host[i]);
+            /* setup initial p2m in three passes. */
+            for (pfn = 0; pfn < dom->pci_hole; pfn++)
+                dom->p2m_host[pfn] = pfn;
+
+            xc_dom_printf (dom->xch, "%s: 0x0->0x%lx has PFNs.", __FUNCTION__, pfn);
+            xc_dom_printf (dom->xch, "%s: 0x%lx -> 0x%x has INVALID_MFN",
+                           __FUNCTION__, pfn, 0x100000);
+            for (; pfn < 0x100000; pfn++)
+                dom->p2m_host[pfn] = INVALID_MFN;
+
+            for (; pfn < 0x100000 + dom->total_pages - dom->pci_hole; pfn++)
+                dom->p2m_host[pfn] = pfn;
+            xc_dom_printf (dom->xch, "%s: 0x%x -> 0x%lx has PFNs.", __FUNCTION__,
+                           0x100000, pfn);
+
+            /* allocate guest memory in two passes. */
+            for (i = rc = allocsz = 0; (i < dom->pci_hole) && !rc; i += allocsz)
+            {
+                allocsz = dom->pci_hole - i;
+                xc_dom_printf (dom->xch, "%s: Populating M2P 0x%lx->0x%lx",
+                               __FUNCTION__, i, i + allocsz);
+                rc = xc_domain_populate_physmap_exact (dom->xch, dom->guest_domid,
+                                                       allocsz, 0, 0,
+                                                       &dom->p2m_host[i]);
+            }
+            for (i = 0x100000, allocsz = rc = 0;
+                 (i < (0x100000 + dom->total_pages - dom->pci_hole))
+                 && !rc; i += allocsz)
+            {
+                allocsz = (dom->total_pages - dom->pci_hole) - (i - 0x100000);
+                if (allocsz > 1024 * 1024)
+                    allocsz = 1024 * 1024;
+                xc_dom_printf (dom->xch, "%s: Populating M2P 0x%lx->0x%lx",
+                               __FUNCTION__, i, i + allocsz);
+                rc = xc_domain_populate_physmap_exact (dom->xch, dom->guest_domid,
+                                                       allocsz, 0, 0,
+                                                       &dom->p2m_host[i]);
+            }
+            xc_dom_printf (dom->xch, "%s: Done with PCI populate physmap",
+                           __FUNCTION__);
+        } else {
+            /* setup initial p2m */
+            for ( pfn = 0; pfn < dom->total_pages; pfn++ )
+                dom->p2m_host[pfn] = pfn;
+
+            /* allocate guest memory */
+            for ( i = rc = allocsz = 0;
+                  (i < dom->total_pages) && !rc;
+                  i += allocsz )
+            {
+                allocsz = dom->total_pages - i;
+                if ( allocsz > 1024*1024 )
+                    allocsz = 1024*1024;
+                rc = xc_domain_populate_physmap_exact(
+                    dom->xch, dom->guest_domid, allocsz,
+                    0, 0, &dom->p2m_host[i]);
+            }
         }
     }

diff --git a/tools/libxc/xc_domain.c b/tools/libxc/xc_domain.c
--- a/tools/libxc/xc_domain.c
+++ b/tools/libxc/xc_domain.c
@@ -481,16 +481,25 @@ int xc_domain_pin_memory_cacheattr(xc_in
 #include "xc_e820.h"
 int xc_domain_set_memmap_limit(xc_interface *xch,
                                uint32_t domid,
-                               unsigned long map_limitkb)
+                               unsigned long map_limitkb,
+                               xen_pfn_t pci_hole_start)
 {
     int rc;
+    uint64_t delta_kb;
+    size_t e820_sz;
     struct xen_foreign_memory_map fmap = {
         .domid = domid,
         .map = { .nr_entries = 1 }
     };
     DECLARE_HYPERCALL_BUFFER(struct e820entry, e820);

-    e820 = xc_hypercall_buffer_alloc(xch, e820, sizeof(*e820));
+    delta_kb = map_limitkb - (uint64_t)(pci_hole_start << 2);
+    if (pci_hole_start && (delta_kb > 0))
+        e820_sz = sizeof(*e820);
+    else
+        e820_sz = sizeof(*e820)*2;
+
+    e820 = xc_hypercall_buffer_alloc(xch, e820, e820_sz);

     if ( e820 == NULL )
     {
@@ -502,6 +511,16 @@ int xc_domain_set_memmap_limit(xc_interf
     e820->size = (uint64_t)map_limitkb << 10;
     e820->type = E820_RAM;

+    if (pci_hole_start && (delta_kb > 0))
+    {
+        fmap.map.nr_entries ++;
+        e820[0].size = (uint64_t)pci_hole_start << 12;
+
+        e820[1].type = E820_RAM;
+        e820[1].addr = (uint64_t)0x100000 << 12; /* val in pfn... */
+        e820[1].size = (uint64_t)delta_kb << 10; /* .. while here in in kB. */
+    }
+
     set_xen_guest_handle(fmap.map.buffer, e820);

     rc = do_memory_op(xch, XENMEM_set_memory_map, &fmap, sizeof(fmap));
@@ -513,7 +532,8 @@ int xc_domain_set_memmap_limit(xc_interf
 #else
 int xc_domain_set_memmap_limit(xc_interface *xch,
                                uint32_t domid,
-                               unsigned long map_limitkb)
+                               unsigned long map_limitkb,
+                               xen_pfn_t pci_hole_start)
 {
     PERROR("Function not implemented");
     errno = ENOSYS;
diff --git a/tools/libxc/xenctrl.h b/tools/libxc/xenctrl.h
--- a/tools/libxc/xenctrl.h
+++ b/tools/libxc/xenctrl.h
@@ -913,7 +913,8 @@ int xc_domain_setmaxmem(xc_interface *xc

 int xc_domain_set_memmap_limit(xc_interface *xch,
                                uint32_t domid,
-                               unsigned long map_limitkb);
+                               unsigned long map_limitkb,
+                               xen_pfn_t pci_hole_start);

 int xc_domain_set_time_offset(xc_interface *xch,
                               uint32_t domid,
diff --git a/tools/libxl/libxl.h b/tools/libxl/libxl.h
--- a/tools/libxl/libxl.h
+++ b/tools/libxl/libxl.h
@@ -392,6 +392,7 @@ int libxl_device_disk_getinfo(libxl_ctx
                               libxl_device_disk *disk, libxl_diskinfo *diskinfo);
 int libxl_cdrom_insert(libxl_ctx *ctx, uint32_t domid, libxl_device_disk *disk);

+int libxl_find_pci_hole(uint32_t *start_pfn);
 /*
  * Make a disk available in this domain. Returns path to a device.
  */
diff --git a/tools/libxl/libxl.idl b/tools/libxl/libxl.idl
--- a/tools/libxl/libxl.idl
+++ b/tools/libxl/libxl.idl
@@ -110,6 +110,7 @@ libxl_domain_build_info = Struct("domain
                                  ])),
                 ("pv", "!%s", Struct(None,
                                      [("slack_memkb", uint32),
+                                      ("pci_hole_start", uint32),
                                       ("bootloader", string),
                                       ("bootloader_args", string),
                                       ("cmdline", string),
diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c
--- a/tools/libxl/libxl_dom.c
+++ b/tools/libxl/libxl_dom.c
@@ -71,7 +71,8 @@ int libxl__build_pre(libxl_ctx *ctx, uin
     xc_domain_setmaxmem(ctx->xch, domid, info->target_memkb + LIBXL_MAXMEM_CONSTANT);
     xc_domain_set_memmap_limit(ctx->xch, domid, (info->hvm) ? info->max_memkb :
-        (info->max_memkb + info->u.pv.slack_memkb));
+        (info->max_memkb + info->u.pv.slack_memkb),
+        (info->hvm) ? 0 : info->u.pv.pci_hole_start);
     xc_domain_set_tsc_info(ctx->xch, domid, info->tsc_mode, 0, 0, 0);
     if ( info->disable_migrate )
         xc_domain_disable_migrate(ctx->xch, domid);
@@ -181,6 +182,8 @@ int libxl__build_pv(libxl_ctx *ctx, uint
             }
         }
     }
+    if ( info->u.pv.pci_hole_start)
+        dom->pci_hole = info->u.pv.pci_hole_start;

     dom->flags = flags;
     dom->console_evtchn = state->console_port;
diff --git a/tools/libxl/libxl_pci.c b/tools/libxl/libxl_pci.c
--- a/tools/libxl/libxl_pci.c
+++ b/tools/libxl/libxl_pci.c
@@ -1066,3 +1066,51 @@ int libxl_device_pci_shutdown(libxl_ctx
     free(pcidevs);
     return 0;
 }
+
+#define MAX_LINE 300
+int libxl_find_pci_hole(uint32_t *start_pfn)
+{
+    FILE *fp;
+    char *s;
+    char buf[MAX_LINE];
+    int ret = -ENODEV;
+    long int pci_hole_phys;
+
+    *start_pfn = 0;
+    fp = fopen("/proc/iomem", "r");
+    if (!fp)
+        return ret;
+
+    while (1) {
+        s = fgets(buf, MAX_LINE, fp);
+        if (!s)
+            break;
+        if (strlen(buf) < 1)
+            continue;
+        if (buf[strlen(buf)-1] == '\n')
+            buf[strlen(buf)-1] = '\0';
+        s = strchr(buf,'P');
+        if (!s)
+            continue;
+        if (strncmp(s, "PCI", 3) == 0) {
+            if (buf[0] == ' ')
+                continue;
+            s = strchr(buf,'-');
+            if (!s)
+                break;
+            s[0]='\0';
+            pci_hole_phys = strtol(buf, NULL, 16);
+            if (!pci_hole_phys)
+                break;
+            /* We don't want to the holes below 16MB. */
+            if (pci_hole_phys <= 0x1000)
+                continue;
+            *start_pfn = pci_hole_phys >> 12;
+            fprintf(stderr,"The value is 0x%d\n", *start_pfn);
+            ret = 0;
+            break;
+        }
+    }
+    fclose(fp);
+    return ret;
+}
diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
--- a/tools/libxl/xl_cmdimpl.c
+++ b/tools/libxl/xl_cmdimpl.c
@@ -1078,6 +1078,14 @@ skip_vfb:
     if (!xlu_cfg_get_long (config, "pci_power_mgmt", &l))
         pci_power_mgmt = l;

+    if (!xlu_cfg_get_long (config, "pci_hole", &l)) {
+        if (l == 1) {
+            uint32_t pfn_start = 0;
+            if (!libxl_find_pci_hole(&pfn_start))
+                b_info->u.pv.pci_hole_start = pfn_start;
+        } else
+            b_info->u.pv.pci_hole_start = l;
+    }
     if (!xlu_cfg_get_list (config, "pci", &pcis, 0, 0)) {
         int i;
         d_config->num_pcidevs = 0;
diff --git a/tools/python/xen/lowlevel/xc/xc.c b/tools/python/xen/lowlevel/xc/xc.c
--- a/tools/python/xen/lowlevel/xc/xc.c
+++ b/tools/python/xen/lowlevel/xc/xc.c
@@ -458,6 +458,7 @@ static PyObject *pyxc_linux_build(XcObje
     unsigned int mem_mb;
     unsigned long store_mfn = 0;
     unsigned long console_mfn = 0;
+    unsigned long pci_hole_start = 0;
     PyObject* elfnote_dict;
     PyObject* elfnote = NULL;
     PyObject* ret;
@@ -467,14 +468,16 @@ static PyObject *pyxc_linux_build(XcObje
                                 "console_evtchn", "image",
                                 /* optional */
                                 "ramdisk", "cmdline", "flags",
-                                "features", "vhpt", "superpages", NULL };
-
-    if ( !PyArg_ParseTupleAndKeywords(args, kwds, "iiiis|ssisii", kwd_list,
+                                "features", "vhpt", "superpages",
+                                "pci_hole", NULL };
+
+    if ( !PyArg_ParseTupleAndKeywords(args, kwds, "iiiis|ssisiii", kwd_list,
                                       &domid, &store_evtchn, &mem_mb,
                                       &console_evtchn, &image,
                                       /* optional */
                                       &ramdisk, &cmdline, &flags,
-                                      &features, &vhpt, &superpages) )
+                                      &features, &vhpt, &superpages,
+                                      &pci_hole_start) )
         return NULL;

     xc_dom_loginit(self->xc_handle);
@@ -486,6 +489,8 @@ static PyObject *pyxc_linux_build(XcObje

     dom->superpages = superpages;

+    dom->pci_hole = pci_hole_start;
+
     if ( xc_dom_linux_build(self->xc_handle, dom, domid, mem_mb, image,
                             ramdisk, flags, store_evtchn, &store_mfn,
                             console_evtchn, &console_mfn) != 0 ) {
@@ -1659,11 +1664,13 @@ static PyObject *pyxc_domain_set_memmap_
 {
     uint32_t dom;
     unsigned int maplimit_kb;
-
-    if ( !PyArg_ParseTuple(args, "ii", &dom, &maplimit_kb) )
+    unsigned long pci_hole_start = 0;
+
+    if ( !PyArg_ParseTuple(args, "ii|i", &dom, &maplimit_kb, &pci_hole_start) )
         return NULL;

-    if ( xc_domain_set_memmap_limit(self->xc_handle, dom, maplimit_kb) != 0 )
+    if ( xc_domain_set_memmap_limit(self->xc_handle, dom, maplimit_kb,
+                                    pci_hole_start) != 0 )
         return pyxc_error_to_exception(self->xc_handle);

     Py_INCREF(zero);
@@ -2661,6 +2668,7 @@ static PyMethodDef pyxc_methods[] = {
       "Set a domain's physical memory mappping limit\n"
       " dom [int]: Identifier of domain.\n"
       " map_limitkb [int]: .\n"
+      " pci_hole_start [int]: PFN for start of PCI hole (optional).\n"
       "Returns: [int] 0 on success; -1 on error.\n" },
 #ifdef __ia64__
diff --git a/tools/python/xen/xend/XendConfig.py b/tools/python/xen/xend/XendConfig.py
--- a/tools/python/xen/xend/XendConfig.py
+++ b/tools/python/xen/xend/XendConfig.py
@@ -241,6 +241,7 @@ XENAPI_CFG_TYPES = {
     'suppress_spurious_page_faults': bool0,
     's3_integrity' : int,
     'superpages' : int,
+    'pci_hole' : int,
     'memory_sharing': int,
     'pool_name' : str,
     'Description': str,
@@ -422,6 +423,7 @@ class XendConfig(dict):
             'target': 0,
             'pool_name' : 'Pool-0',
             'superpages': 0,
+            'pci_hole': 0,
             'description': '',
         }
@@ -2135,6 +2137,9 @@ class XendConfig(dict):
             image.append(['args', self['PV_args']])
         if self.has_key('superpages'):
             image.append(['superpages', self['superpages']])
+        if self.has_key('pci_hole'):
+            image.append(['pci_hole', self['pci_hole']])
+
         for key in XENAPI_PLATFORM_CFG_TYPES.keys():
             if key in self['platform']:
@@ -2179,6 +2184,10 @@ class XendConfig(dict):
             val = sxp.child_value(image_sxp, 'superpages')
             if val is not None:
                 self['superpages'] = val
+
+            val = sxp.child_value(image_sxp, 'pci_hole')
+            if val is not None:
+                self['pci_hole'] = val

             val = sxp.child_value(image_sxp, 'memory_sharing')
             if val is not None:
diff --git a/tools/python/xen/xend/image.py b/tools/python/xen/xend/image.py
--- a/tools/python/xen/xend/image.py
+++ b/tools/python/xen/xend/image.py
@@ -84,6 +84,7 @@ class ImageHandler:
     ostype = None

     superpages = 0
+    pci_hole = 0
     memory_sharing = 0

     def __init__(self, vm, vmConfig):
@@ -711,6 +712,7 @@ class LinuxImageHandler(ImageHandler):
         self.vramsize = int(vmConfig['platform'].get('videoram',4)) * 1024
         self.is_stubdom = (self.kernel.find('stubdom') >= 0)
         self.superpages = int(vmConfig['superpages'])
+        self.pci_hole = int(vmConfig['pci_hole'])

     def buildDomain(self):
         store_evtchn = self.vm.getStorePort()
@@ -729,6 +731,7 @@ class LinuxImageHandler(ImageHandler):
         log.debug("features = %s", self.vm.getFeatures())
         log.debug("flags = %d", self.flags)
         log.debug("superpages = %d", self.superpages)
+        log.debug("pci_hole = %d", self.pci_hole)

         if arch.type == "ia64":
             log.debug("vhpt = %d", self.vhpt)
@@ -742,7 +745,8 @@ class LinuxImageHandler(ImageHandler):
                               features = self.vm.getFeatures(),
                               flags = self.flags,
                               vhpt = self.vhpt,
-                              superpages = self.superpages)
+                              superpages = self.superpages,
+                              pci_hole = self.pci_hole)

     def getBitSize(self):
         return xc.getBitSize(image = self.kernel,
@@ -774,7 +778,6 @@ class LinuxImageHandler(ImageHandler):
             args = args + ([ "-M", "xenpv"])
         return args

-
 class HVMImageHandler(ImageHandler):

     ostype = "hvm"
@@ -1065,7 +1068,7 @@ class X86_Linux_ImageHandler(LinuxImageH
         # set physical mapping limit
         # add an 8MB slack to balance backend allocations.
         mem_kb = self.getRequiredMaximumReservation() + (8 * 1024)
-        xc.domain_set_memmap_limit(self.vm.getDomid(), mem_kb)
+        xc.domain_set_memmap_limit(self.vm.getDomid(), mem_kb, self.pci_hole)
         rc = LinuxImageHandler.buildDomain(self)
         self.setCpuid()
         return rc
diff --git a/tools/python/xen/xm/create.py b/tools/python/xen/xm/create.py
--- a/tools/python/xen/xm/create.py
+++ b/tools/python/xen/xm/create.py
@@ -680,6 +680,11 @@ gopts.var('superpages', val='0|1',
           fn=set_int, default=0,
           use="Create domain with superpages")

+gopts.var('pci_hole', val='0x<XXX>|0',
+          fn=set_int, default=0,
+          use="""Create domain with a PCI hole. The value is the PFN of the
+          start of PCI hole. Usually that is 0xc0000.""")
+
 def err(msg):
     """Print an error to stderr and exit.
     """
@@ -770,6 +775,9 @@ def configure_image(vals):
         config_image.append(['args', vals.extra])
     if vals.superpages:
         config_image.append(['superpages', vals.superpages])
+    if vals.pci_hole:
+        config_image.append(['pci_hole', vals.pci_hole])
+
     if vals.builder == 'hvm':
         configure_hvm(config_image, vals)
diff --git a/tools/python/xen/xm/xenapi_create.py b/tools/python/xen/xm/xenapi_create.py
--- a/tools/python/xen/xm/xenapi_create.py
+++ b/tools/python/xen/xm/xenapi_create.py
@@ -285,6 +285,8 @@ class xenapi_create:
                 vm.attributes["s3_integrity"].value,
             "superpages":
                 vm.attributes["superpages"].value,
+            "pci_hole":
+                vm.attributes["pci_hole"].value,
             "memory_static_max":
                 get_child_node_attribute(vm, "memory", "static_max"),
             "memory_static_min":
@@ -697,6 +699,8 @@ class sxp2xml:
             = str(get_child_by_name(config, "s3_integrity", 0))
         vm.attributes["superpages"] \
             = str(get_child_by_name(config, "superpages", 0))
+        vm.attributes["pci_hole"] \
+            = str(get_child_by_name(config, "pci_hole", 0))
         vm.attributes["pool_name"] \
             = str(get_child_by_name(config, "pool_name", "Pool-0"))

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
Konrad Rzeszutek Wilk
2010-Nov-12 23:16 UTC
Re: [Xen-devel] [RFC Patch] Support for making an E820 PCI hole in toolstack (xl + xm)
> It adds a new option to guest config file, which is "pci_hole". The user can
> specify the PFN number, such as '0xc0000' or in case of using the xl, '1' which
> will automatically figure out the start of the PCI address.

Here is an example:

kernel="/home/konrad/git/xtt/bootstrap/dist/common/vmlinuz"
ramdisk="/home/konrad/git/xtt/bootstrap/dist/common/initramfs.cpio.gz"
extra="inittab=/etc/inittab-xen console=hvc0 debug earlyprintk=xen memblock=debug iommu=soft"
memory=6048
vcpus=4
on_crash="preserve"
vif = [ 'mac=00:0f:4b:00:00:68, bridge=switch' ]
pci= ["0000:03:00.0","00:1d.0","00:1d.1","00:1d.2","00:1d.7"]
vfb = [ 'vnc=1, vnclisten=0.0.0.0,vncunused=1' ]
pci_hole=0xa0000

Also I found out that 'xl' does not understand multiple options in the pci line, so testing with that might give you some weird results. However, "xm" works fine.
Keir Fraser
2010-Nov-13 07:40 UTC
Re: [Xen-devel] [RFC Patch] Support for making an E820 PCI hole in toolstack (xl + xm)
Why doesn't the guest punch its own hole, by relocating RAM above 4GB? That's what all HVM guests do (in hvmloader).

 -- Keir

On 12/11/2010 23:08, "Konrad Rzeszutek Wilk" <konrad.wilk@oracle.com> wrote:

> Hey guys,
>
> Attached is an RFC patch for making a PCI hole in the PV guests. This allows
> PV guests(*) with 4GB or more to now properly work with or without
> PCI passthrough cards.
>
> Previously the Linux kernel would not be able to allocate the PCI region
> underneath the 4GB region as that region was all System RAM. And you would see
> this:
>
> [    0.000000] PM: Registered nosave memory: 00000000000a0000 - 0000000000100000
> [    0.000000] PCI: Warning: Cannot find a gap in the 32bit address range
> [    0.000000] PCI: Unassigned devices with 32bit resource registers may break!
> [    0.000000] Allocating PCI resources starting at 100100000 (gap: 100100000:400000)
>
> This patchset punches a PCI hole in the E820 region and as well fills the P2M
> properly, so that now you can see (*):
> [    0.000000] Allocating PCI resources starting at a0000000 (gap: a0000000:60000000)
>
> It adds a new option to guest config file, which is "pci_hole". The user can
> specify the PFN number, such as '0xc0000' or in case of using the xl, '1' which
> will automatically figure out the start of the PCI address.
>
> *: This option requires support in the Linux kernel to actually deal with two
> entries in the E820 map and P2M space filled with ~0.
dom->pci_hole = pci_hole_start; > + > if ( xc_dom_linux_build(self->xc_handle, dom, domid, mem_mb, image, > ramdisk, flags, store_evtchn, &store_mfn, > console_evtchn, &console_mfn) != 0 ) { > @@ -1659,11 +1664,13 @@ static PyObject *pyxc_domain_set_memmap_ > { > uint32_t dom; > unsigned int maplimit_kb; > - > - if ( !PyArg_ParseTuple(args, "ii", &dom, &maplimit_kb) ) > + unsigned long pci_hole_start = 0; > + > + if ( !PyArg_ParseTuple(args, "ii|i", &dom, &maplimit_kb, &pci_hole_start) > ) > return NULL; > > - if ( xc_domain_set_memmap_limit(self->xc_handle, dom, maplimit_kb) != 0 ) > + if ( xc_domain_set_memmap_limit(self->xc_handle, dom, maplimit_kb, > + pci_hole_start) != 0 ) > return pyxc_error_to_exception(self->xc_handle); > > Py_INCREF(zero); > @@ -2661,6 +2668,7 @@ static PyMethodDef pyxc_methods[] = { > "Set a domain''s physical memory mappping limit\n" > " dom [int]: Identifier of domain.\n" > " map_limitkb [int]: .\n" > + " pci_hole_start [int]: PFN for start of PCI hole (optional).\n" > "Returns: [int] 0 on success; -1 on error.\n" }, > > #ifdef __ia64__ > diff --git a/tools/python/xen/xend/XendConfig.py > b/tools/python/xen/xend/XendConfig.py > --- a/tools/python/xen/xend/XendConfig.py > +++ b/tools/python/xen/xend/XendConfig.py > @@ -241,6 +241,7 @@ XENAPI_CFG_TYPES = { > ''suppress_spurious_page_faults'': bool0, > ''s3_integrity'' : int, > ''superpages'' : int, > + ''pci_hole'' : int, > ''memory_sharing'': int, > ''pool_name'' : str, > ''Description'': str, > @@ -422,6 +423,7 @@ class XendConfig(dict): > ''target'': 0, > ''pool_name'' : ''Pool-0'', > ''superpages'': 0, > + ''pci_hole'': 0, > ''description'': '''', > } > > @@ -2135,6 +2137,9 @@ class XendConfig(dict): > image.append([''args'', self[''PV_args'']]) > if self.has_key(''superpages''): > image.append([''superpages'', self[''superpages'']]) > + if self.has_key(''pci_hole''): > + image.append([''pci_hole'', self[''pci_hole'']]) > + > > for key in XENAPI_PLATFORM_CFG_TYPES.keys(): > if key in 
self[''platform'']: > @@ -2179,6 +2184,10 @@ class XendConfig(dict): > val = sxp.child_value(image_sxp, ''superpages'') > if val is not None: > self[''superpages''] = val > + > + val = sxp.child_value(image_sxp, ''pci_hole'') > + if val is not None: > + self[''pci_hole''] = val > > val = sxp.child_value(image_sxp, ''memory_sharing'') > if val is not None: > diff --git a/tools/python/xen/xend/image.py b/tools/python/xen/xend/image.py > --- a/tools/python/xen/xend/image.py > +++ b/tools/python/xen/xend/image.py > @@ -84,6 +84,7 @@ class ImageHandler: > > ostype = None > superpages = 0 > + pci_hole = 0 > memory_sharing = 0 > > def __init__(self, vm, vmConfig): > @@ -711,6 +712,7 @@ class LinuxImageHandler(ImageHandler): > self.vramsize = int(vmConfig[''platform''].get(''videoram'',4)) * 1024 > self.is_stubdom = (self.kernel.find(''stubdom'') >= 0) > self.superpages = int(vmConfig[''superpages'']) > + self.pci_hole = int(vmConfig[''pci_hole'']) > > def buildDomain(self): > store_evtchn = self.vm.getStorePort() > @@ -729,6 +731,7 @@ class LinuxImageHandler(ImageHandler): > log.debug("features = %s", self.vm.getFeatures()) > log.debug("flags = %d", self.flags) > log.debug("superpages = %d", self.superpages) > + log.debug("pci_hole = %d", self.pci_hole) > if arch.type == "ia64": > log.debug("vhpt = %d", self.vhpt) > > @@ -742,7 +745,8 @@ class LinuxImageHandler(ImageHandler): > features = self.vm.getFeatures(), > flags = self.flags, > vhpt = self.vhpt, > - superpages = self.superpages) > + superpages = self.superpages, > + pci_hole = self.pci_hole) > > def getBitSize(self): > return xc.getBitSize(image = self.kernel, > @@ -774,7 +778,6 @@ class LinuxImageHandler(ImageHandler): > args = args + ([ "-M", "xenpv"]) > return args > > - > class HVMImageHandler(ImageHandler): > > ostype = "hvm" > @@ -1065,7 +1068,7 @@ class X86_Linux_ImageHandler(LinuxImageH > # set physical mapping limit > # add an 8MB slack to balance backend allocations. 
>         mem_kb = self.getRequiredMaximumReservation() + (8 * 1024)
> -       xc.domain_set_memmap_limit(self.vm.getDomid(), mem_kb)
> +       xc.domain_set_memmap_limit(self.vm.getDomid(), mem_kb, self.pci_hole)
>         rc = LinuxImageHandler.buildDomain(self)
>         self.setCpuid()
>         return rc
> diff --git a/tools/python/xen/xm/create.py b/tools/python/xen/xm/create.py
> --- a/tools/python/xen/xm/create.py
> +++ b/tools/python/xen/xm/create.py
> @@ -680,6 +680,11 @@ gopts.var('superpages', val='0|1',
>           fn=set_int, default=0,
>           use="Create domain with superpages")
>
> +gopts.var('pci_hole', val='0x<XXX>|0',
> +         fn=set_int, default=0,
> +         use="""Create domain with a PCI hole. The value is the PFN of the
> +         start of the PCI hole. Usually that is 0xc0000.""")
> +
> def err(msg):
>     """Print an error to stderr and exit.
>     """
> @@ -770,6 +775,9 @@ def configure_image(vals):
>         config_image.append(['args', vals.extra])
>     if vals.superpages:
>         config_image.append(['superpages', vals.superpages])
> +   if vals.pci_hole:
> +       config_image.append(['pci_hole', vals.pci_hole])
> +
>
>     if vals.builder == 'hvm':
>         configure_hvm(config_image, vals)
> diff --git a/tools/python/xen/xm/xenapi_create.py b/tools/python/xen/xm/xenapi_create.py
> --- a/tools/python/xen/xm/xenapi_create.py
> +++ b/tools/python/xen/xm/xenapi_create.py
> @@ -285,6 +285,8 @@ class xenapi_create:
>                 vm.attributes["s3_integrity"].value,
>             "superpages":
>                 vm.attributes["superpages"].value,
> +           "pci_hole":
> +               vm.attributes["pci_hole"].value,
>             "memory_static_max":
>                 get_child_node_attribute(vm, "memory", "static_max"),
>             "memory_static_min":
> @@ -697,6 +699,8 @@ class sxp2xml:
>             = str(get_child_by_name(config, "s3_integrity", 0))
>         vm.attributes["superpages"] \
>             = str(get_child_by_name(config, "superpages", 0))
> +       vm.attributes["pci_hole"] \
> +           = str(get_child_by_name(config, "pci_hole", 0))
>         vm.attributes["pool_name"] \
>             = str(get_child_by_name(config, "pool_name", "Pool-0"))
>
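As a sanity check on the xc_domain_set_memmap_limit hunk above, the two-entry E820 split it performs can be sketched in Python. This is a hypothetical illustration of the arithmetic, not part of the patch; `split_e820` is an invented name:

```python
E820_RAM = 1

def split_e820(map_limitkb, pci_hole_start_pfn):
    """Mirror the libxc hunk: one RAM entry below the hole start,
    and any remainder pushed up to the 4GB boundary (pfn 0x100000)."""
    # pfn -> kB is << 2 (4 kB pages); pfn -> bytes is << 12, kB -> bytes is << 10.
    delta_kb = map_limitkb - (pci_hole_start_pfn << 2)
    if pci_hole_start_pfn and delta_kb > 0:
        return [
            {'addr': 0, 'size': pci_hole_start_pfn << 12, 'type': E820_RAM},
            {'addr': 0x100000 << 12, 'size': delta_kb << 10, 'type': E820_RAM},
        ]
    return [{'addr': 0, 'size': map_limitkb << 10, 'type': E820_RAM}]
```

For a 4GB guest with the hole starting at pfn 0xa0000 this yields RAM at [0, 0xa0000000) plus [4GB, 4GB + 0x60000000), matching the devel/e820-hole E820 shown at the top of the thread.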
_______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Konrad Rzeszutek Wilk
2010-Nov-15 17:03 UTC
Re: [Xen-devel] [RFC Patch] Support for making an E820 PCI hole in toolstack (xl + xm)
On Sat, Nov 13, 2010 at 07:40:30AM +0000, Keir Fraser wrote:

> Why doesn't the guest punch its own hole, by relocating RAM above 4GB?

1). Did not work for me - I am not sure why, but I had the hardest time with hypervisor_populate_physmap - it would just hang the guest.

2). It is much simpler to parse the E820 in the Linux kernel than to actually create new E820 entries in the kernel (hypercall), make a bunch of hypervisor calls that unmap and then remap the space, fill out the P2M with INVALID_MFN, and do all of that before the "real" Linux kernel actually starts (all of it would have to be done in xen_start_kernel). I have a sinking feeling that the upstream community would not like that much.

> That's what all HVM guests do (in hvmloader).

3). Which is also part of the Xen tool-stack.

Keir, I think you posted it at some point - was there a standard 'indent' incantation for the tools/libxc StyleGuide (which is similar to the xen/* one)? I can't find it in my mail archive...
Ian Campbell
2010-Nov-15 17:20 UTC
Re: [Xen-devel] [RFC Patch] Support for making an E820 PCI hole in toolstack (xl + xm)
On Mon, 2010-11-15 at 17:03 +0000, Konrad Rzeszutek Wilk wrote:

> 2). It is much simpler to parse the E820 in the Linux kernel[...]

The ability to ingest an e820 from the hypervisor is also needed for dom0 to consume the host e820, so it's not like there is additional code on the kernel side to cope with this.

Ian.
Konrad Rzeszutek Wilk
2010-Nov-15 17:28 UTC
Re: [Xen-devel] [RFC Patch] Support for making an E820 PCI hole in toolstack (xl + xm)
On Mon, Nov 15, 2010 at 05:20:00PM +0000, Ian Campbell wrote:

> On Mon, 2010-11-15 at 17:03 +0000, Konrad Rzeszutek Wilk wrote:
> > 2). It is much simpler to parse the E820 in the Linux kernel[...]
>
> The ability to ingest an e820 from the hypervisor is also needed for
> dom0 to consume the host e820, so it's not like there is additional code
> on the kernel side to cope with this.

There is a bit. We need to decouple the info->nr_pages magic that is used throughout the code b/c we are treating that value as a 'last pfn'. With the offset, we need to be more careful about it.

P.S. (devel code, probably going to post a refresh today): git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen.git devel/e820-hole
Keir Fraser
2010-Nov-15 17:48 UTC
Re: [Xen-devel] [RFC Patch] Support for making an E820 PCI hole in toolstack (xl + xm)
On 15/11/2010 17:03, "Konrad Rzeszutek Wilk" <konrad.wilk@oracle.com> wrote:

> On Sat, Nov 13, 2010 at 07:40:30AM +0000, Keir Fraser wrote:
>> Why doesn't the guest punch its own hole, by relocating RAM above 4GB?
>
> 1). Did not work for me - I am not sure why, but I had the hardest time with
> hypervisor_populate_physmap - it would just hang the guest.

For a PV guest you don't need to do any alloc/free/move memory hypercalls. You rewrite your own p2m to relocate mfns where you want them in pfn space. Then some hypercalls just to update the m2p array to match.

> 2). It is much simpler to parse the E820 in the Linux kernel than actually
> creating new E820 entries in the kernel (hypercall), making a bunch of
> hypervisor calls that unmap, then remap the space, filling out the P2M
> with INVALID_MFN, and doing all of that before the "real" Linux kernel
> actually starts (all would have to be done in xen_start_kernel).
> I have a sinking feeling that the upstream community would not like that much.

Well it is all quite Xen specific, so I'm surprised.

>> That's what all HVM guests do (in hvmloader).
>
> 3). Which is also part of the Xen tool-stack.
>
> Keir, I think you posted it at some point - was there a standard 'indent'
> incantation for the tools/libxc StyleGuide (which is similar to the xen/* one)
> - I can't find it in my mail archive...

I think someone else worked one out, but I don't have it to hand I'm afraid.

-- Keir
Konrad Rzeszutek Wilk
2010-Nov-15 18:15 UTC
Re: [Xen-devel] [RFC Patch] Support for making an E820 PCI hole in toolstack (xl + xm)
> > 1). Did not work for me - I am not sure why, but I had the hardest time with
> > hypervisor_populate_physmap - it would just hang the guest.
>
> For a PV guest you don't need to do any alloc/free/move memory hypercalls.
> You rewrite your own p2m to relocate mfns where you want them in pfn space.
> Then some hypercalls just to update the m2p array to match.

Ok, I can play with that and see what fun/havoc I can create.

> > 2). It is much simpler to parse the E820 in the Linux kernel than actually
> > creating new E820 entries in the kernel (hypercall), making a bunch of
> > hypervisor calls that unmap, then remap the space, filling out the P2M
> > with INVALID_MFN, and doing all of that before the "real" Linux kernel
> > actually starts (all would have to be done in xen_start_kernel).
> > I have a sinking feeling that the upstream community would not like that much.
>
> Well it is all quite Xen specific, so I'm surprised.

Oh, there was another reason that is so obvious that I completely forgot. DomU has no idea where the host PCI hole starts. In most cases it is at 3GB (or even further up - 3.5GB), but a quick search for 'Allocating PCI resources starting at' on Google shows that there are some that start at 1.2G.
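Since the guest cannot see the host hole, the toolstack has to discover it, which is what the RFC's libxl_find_pci_hole does by scanning /proc/iomem. A minimal Python sketch of the same scan (hypothetical helper; it takes the file contents as a string, and uses the 16MB threshold the RFC's comment intends):

```python
def find_pci_hole(iomem_text):
    """Return the start PFN of the first top-level 'PCI' window in
    /proc/iomem-style text, or 0 if none is found.  A sketch of the
    RFC's libxl_find_pci_hole; real parsing needs to be more careful."""
    for line in iomem_text.splitlines():
        if line.startswith(' '):          # skip indented (nested) resources
            continue
        rng, _, desc = line.partition(':')
        if 'PCI' not in desc:
            continue
        start = int(rng.split('-')[0], 16)
        if start <= 0x1000000:            # ignore windows below 16MB
            continue
        return start >> 12                # physical address -> PFN
    return 0
```

On a host whose top-level entry is `c0000000-febfffff : PCI Bus 0000:00`, this returns 0xc0000, which is then handed to the domain builder as pci_hole_start.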
Keir Fraser
2010-Nov-15 18:41 UTC
Re: [Xen-devel] [RFC Patch] Support for making an E820 PCI hole in toolstack (xl + xm)
On 15/11/2010 18:15, "Konrad Rzeszutek Wilk" <konrad.wilk@oracle.com> wrote:

>> Well it is all quite Xen specific, so I'm surprised.
>
> Oh, there was another reason that is so obvious that I completely forgot. DomU
> has no idea where the host PCI hole starts. In most cases it is at 3GB (or even
> further up - 3.5GB), but a quick search for 'Allocating PCI resources starting at'
> on Google shows that there are some that start at 1.2G.

Hm, true. We could give you access to XENMEM_machine_memory_map? It's not really got any big secrets or privileged things in it. :-)

Or, is there much disadvantage to having a static, really big PCI hole? Say starting at 1GB? The advantage of this would be the ability to hotplug PCI devices to a domU even across save/restore/migrate -- this may not work so well if you commit yourself to the hole size of the original host, and the restore/migrate target host has a bigger hole!

-- Keir
Jeremy Fitzhardinge
2010-Nov-15 19:30 UTC
Re: [Xen-devel] [RFC Patch] Support for making an E820 PCI hole in toolstack (xl + xm)
On 11/15/2010 10:15 AM, Konrad Rzeszutek Wilk wrote:

>>> 1). Did not work for me - I am not sure why, but I had the hardest time with
>>> hypervisor_populate_physmap - it would just hang the guest.
>>
>> For a PV guest you don't need to do any alloc/free/move memory hypercalls.
>> You rewrite your own p2m to relocate mfns where you want them in pfn space.
>> Then some hypercalls just to update the m2p array to match.
>
> Ok, I can play with that and see what fun/havoc I can create.
>
>>> 2). It is much simpler to parse the E820 in the Linux kernel than actually
>>> creating new E820 entries in the kernel (hypercall), making a bunch of
>>> hypervisor calls that unmap, then remap the space, filling out the P2M
>>> with INVALID_MFN, and doing all of that before the "real" Linux kernel
>>> actually starts (all would have to be done in xen_start_kernel).
>>> I have a sinking feeling that the upstream community would not like that much.
>>
>> Well it is all quite Xen specific, so I'm surprised.
>
> Oh, there was another reason that is so obvious that I completely forgot. DomU
> has no idea where the host PCI hole starts. In most cases it is at 3GB (or even
> further up - 3.5GB), but a quick search for 'Allocating PCI resources starting at'
> on Google shows that there are some that start at 1.2G.

Yes, that's the main reason I think it should be in the toolstack. The domain doesn't know whether the PCI hole is necessary or not, and it's too early for it to poke around in xenstore to look for passed devices or anything. It could conservatively always reserve a 1G hole at 3G, but that seems too pessimistic and is a waste of sub-4G address space which it might otherwise have use for. If the toolstack can tell it where to make the hole by editing the E820, then the dom0 and domU cases are very similar.
The question of whether all the pages given by the builder to the domain should be distributed over the E820 RAM ranges, or should just be considered ballooned out of the holes (which is what currently happens), is pretty much orthogonal.

J
Jeremy Fitzhardinge
2010-Nov-15 19:32 UTC
Re: [Xen-devel] [RFC Patch] Support for making an E820 PCI hole in toolstack (xl + xm)
On 11/15/2010 10:41 AM, Keir Fraser wrote:

>> Oh, there was another reason that is so obvious that I completely forgot. DomU
>> has no idea where the host PCI hole starts. In most cases it is at 3GB (or even
>> further up - 3.5GB), but a quick search for 'Allocating PCI resources starting at'
>> on Google shows that there are some that start at 1.2G.
>
> Hm, true. We could give you access to XENMEM_machine_memory_map? It's not
> really got any big secrets or privileged things in it. :-)
>
> Or, is there much disadvantage to having a static, really big PCI hole? Say
> starting at 1GB? The advantage of this would be the ability to hotplug PCI
> devices to a domU even across save/restore/migrate -- this may not work so
> well if you commit yourself to the hole size of the original host, and the
> restore/migrate target host has a bigger hole!

Well, the other question is whether the devices have to have the same pfn as mfn within the hole. We're emulating the PCI config space anyway - couldn't we stick the passthrough PCI space at 3G regardless of where it is on the real hardware?

J
Keir Fraser
2010-Nov-15 19:57 UTC
Re: [Xen-devel] [RFC Patch] Support for making an E820 PCI hole in toolstack (xl + xm)
On 15/11/2010 19:32, "Jeremy Fitzhardinge" <jeremy@goop.org> wrote:

>> Or, is there much disadvantage to having a static, really big PCI hole? Say
>> starting at 1GB? The advantage of this would be the ability to hotplug PCI
>> devices to a domU even across save/restore/migrate -- this may not work so
>> well if you commit yourself to the hole size of the original host, and the
>> restore/migrate target host has a bigger hole!
>
> Well, the other question is whether the devices have to have the same
> pfn as mfn within the hole. We're emulating the PCI config space anyway
> - couldn't we stick the passthrough PCI space at 3G regardless of where
> it is on the real hardware?

Well, I don't know. It sounds pretty sensible to me. :-)

Certain virtualisation features disappearing after a save/restore/migrate -- or worse, becoming unreliable -- would be a bit sad.

-- Keir
Konrad Rzeszutek Wilk
2010-Nov-15 23:11 UTC
Re: [Xen-devel] [RFC Patch] Support for making an E820 PCI hole in toolstack (xl + xm)
On Mon, Nov 15, 2010 at 07:57:47PM +0000, Keir Fraser wrote:

> On 15/11/2010 19:32, "Jeremy Fitzhardinge" <jeremy@goop.org> wrote:
>
>>> Or, is there much disadvantage to having a static, really big PCI hole? Say
>>> starting at 1GB? The advantage of this would be the ability to hotplug PCI
>>> devices to a domU even across save/restore/migrate -- this may not work so
>>> well if you commit yourself to the hole size of the original host, and the
>>> restore/migrate target host has a bigger hole!
>>
>> Well, the other question is whether the devices have to have the same
>> pfn as mfn within the hole. We're emulating the PCI config space anyway
>> - couldn't we stick the passthrough PCI space at 3G regardless of where
>> it is on the real hardware?

Your thinking is that on the Linux side, any of the pfns that are within those System RAM gaps (regardless of whether they are above or below 4GB) would be consulted during PTE creation/lookup (xen_pte_val..). And if those PFNs are within those System RAM gaps, we would store the MFN in the P2M list and, instead of doing:

    val = ((pteval_t)pfn << PAGE_SHIFT) | flags

we would actually do mfn = pfn_to_mfn(pfn) and stick on the _PAGE_IOMAP flag. An example patch (compile tested, not tested any other way) is attached at the end of this email.

How does that work on the Xen side? Does the hypervisor depend on the pages that belong to the DOM_IO domain having an INVALID_MFN value in the mfn_list? We do make the PTEs that refer to physical devices belong to the DOM_IO domain..

> Well, I don't know. It sounds pretty sensible to me. :-)
>
> Certain virtualisation features disappearing after a save/restore/migrate --
> or worse, becoming unreliable -- would be a bit sad.

So having the option of the PCI hole being passed through, and giving the tools the value (pci_hole), would mean we could migrate an SR-IOV type device from one machine to another.
Constructing the PCI hole using the XENMEM_machine_memory_map could generate different E820 for the two guests, which would be indeed a bit sad. --- the patch --- diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c index 50dc626..96a08ef 100644 --- a/arch/x86/xen/mmu.c +++ b/arch/x86/xen/mmu.c @@ -699,7 +699,7 @@ static bool xen_page_pinned(void *ptr) static bool xen_iomap_pte(pte_t pte) { - return pte_flags(pte) & _PAGE_IOMAP; + return xen_pfn_is_pci(pte_mfn(pte)); } void xen_set_domain_pte(pte_t *ptep, pte_t pteval, unsigned domid) @@ -801,11 +801,6 @@ void set_pte_mfn(unsigned long vaddr, unsigned long mfn, pgprot_t flags) void xen_set_pte_at(struct mm_struct *mm, unsigned long addr, pte_t *ptep, pte_t pteval) { - if (xen_iomap_pte(pteval)) { - xen_set_iomap_pte(ptep, pteval); - goto out; - } - ADD_STATS(set_pte_at, 1); // ADD_STATS(set_pte_at_pinned, xen_page_pinned(ptep)); ADD_STATS(set_pte_at_current, mm == current->mm); @@ -889,19 +884,6 @@ static pteval_t pte_pfn_to_mfn(pteval_t val) return val; } -static pteval_t iomap_pte(pteval_t val) -{ - if (val & _PAGE_PRESENT) { - unsigned long pfn = (val & PTE_PFN_MASK) >> PAGE_SHIFT; - pteval_t flags = val & PTE_FLAGS_MASK; - - /* We assume the pte frame number is a MFN, so - just use it as-is. */ - val = ((pteval_t)pfn << PAGE_SHIFT) | flags; - } - - return val; -} pteval_t xen_pte_val(pte_t pte) { @@ -913,8 +895,8 @@ pteval_t xen_pte_val(pte_t pte) pteval = (pteval & ~_PAGE_PAT) | _PAGE_PWT; } - if (xen_initial_domain() && (pteval & _PAGE_IOMAP)) - return pteval; + if (xen_pfn_is_pci(pte_mfn(pte))) + pteval |= _PAGE_IOMAP; return pte_mfn_to_pfn(pteval); } @@ -974,13 +956,14 @@ pte_t xen_make_pte(pteval_t pte) * mappings are just dummy local mappings to keep other * parts of the kernel happy. 
*/ - if (unlikely(pte & _PAGE_IOMAP) && - (xen_initial_domain() || addr >= ISA_END_ADDRESS)) { - pte = iomap_pte(pte); - } else { + if ((unlikely(pte & _PAGE_IOMAP) && + (xen_initial_domain() || addr >= ISA_END_ADDRESS)) || + (unlikely(xen_pfn_is_pci(PFN_UP(addr))))) + pte |= _PAGE_IOMAP; + else pte &= ~_PAGE_IOMAP; - pte = pte_pfn_to_mfn(pte); - } + + pte = pte_pfn_to_mfn(pte); return native_make_pte(pte); } @@ -1037,10 +1020,8 @@ void xen_set_pud(pud_t *ptr, pud_t val) void xen_set_pte(pte_t *ptep, pte_t pte) { - if (xen_iomap_pte(pte)) { + if (xen_iomap_pte(pte)) xen_set_iomap_pte(ptep, pte); - return; - } ADD_STATS(pte_update, 1); // ADD_STATS(pte_update_pinned, xen_page_pinned(ptep)); @@ -1058,10 +1039,8 @@ void xen_set_pte(pte_t *ptep, pte_t pte) #ifdef CONFIG_X86_PAE void xen_set_pte_atomic(pte_t *ptep, pte_t pte) { - if (xen_iomap_pte(pte)) { + if (xen_iomap_pte(pte)) xen_set_iomap_pte(ptep, pte); - return; - } set_64bit((u64 *)ptep, native_pte_val(pte)); } diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c index 5a1f22d..bb424e3 100644 --- a/arch/x86/xen/setup.c +++ b/arch/x86/xen/setup.c @@ -196,6 +196,31 @@ unsigned long xen_find_max_pfn(void) xen_raw_printk("E820 max_pfn = %ld (nr_pages: %ld)\n", max_pfn, xen_start_info->nr_pages); return max_pfn; } + +int xen_pfn_is_pci(unsigned long pfn) +{ + static struct e820entry map[E820MAX] __initdata; + int rc, op, i; + struct xen_memory_map memmap; + unsigned long long addr = PFN_PHYS(pfn); + memmap.nr_entries = E820MAX; + set_xen_guest_handle(memmap.buffer, map); + + op = xen_initial_domain() ? + XENMEM_machine_memory_map : + XENMEM_memory_map; + rc = HYPERVISOR_memory_op(op, &memmap); + BUG_ON(rc); + + for (i = 0; i < memmap.nr_entries; i++) { + unsigned long long end = map[i].addr + map[i].size; + if (map[i].type != E820_RAM) + continue; + if (addr >= map[i].addr && addr <= end) + return 0; + } + return 1; +} /** * machine_specific_memory_setup - Hook for machine specific memory setup. 
**/ diff --git a/include/xen/xen-ops.h b/include/xen/xen-ops.h index eee2045..f859b04 100644 --- a/include/xen/xen-ops.h +++ b/include/xen/xen-ops.h @@ -31,4 +31,6 @@ int xen_remap_domain_mfn_range(struct vm_area_struct *vma, extern phys_addr_t xen_extra_mem_start; unsigned long xen_find_max_pfn(void); +int xen_pfn_is_pci(unsigned long pfn); + #endif /* INCLUDE_XEN_OPS_H */
Jeremy Fitzhardinge
2010-Nov-16 01:06 UTC
Re: [Xen-devel] [RFC Patch] Support for making an E820 PCI hole in toolstack (xl + xm)
On 11/15/2010 03:11 PM, Konrad Rzeszutek Wilk wrote:

> On Mon, Nov 15, 2010 at 07:57:47PM +0000, Keir Fraser wrote:
>> On 15/11/2010 19:32, "Jeremy Fitzhardinge" <jeremy@goop.org> wrote:
>>
>>>> Or, is there much disadvantage to having a static, really big PCI hole? Say
>>>> starting at 1GB? The advantage of this would be the ability to hotplug PCI
>>>> devices to a domU even across save/restore/migrate -- this may not work so
>>>> well if you commit yourself to the hole size of the original host, and the
>>>> restore/migrate target host has a bigger hole!
>>>
>>> Well, the other question is whether the devices have to have the same
>>> pfn as mfn within the hole. We're emulating the PCI config space anyway
>>> - couldn't we stick the passthrough PCI space at 3G regardless of where
>>> it is on the real hardware?
>
> Your thinking is that on the Linux side, any of the pfns that are within
> those System RAM gaps (regardless of whether they are above or below 4GB) would
> be consulted during PTE creation/lookup (xen_pte_val..).
>
> And if those PFNs are within those System RAM gaps, we would store the
> MFN in the P2M list and instead of doing:
>     val = ((pteval_t)pfn << PAGE_SHIFT) | flags
>
> we would actually do mfn = pfn_to_mfn(pfn) and stick on the _PAGE_IOMAP flag.
>
> An example patch (compile tested, not tested any other way) is attached at the
> end of this email.

Right, it basically depends on dropping _PAGE_IOMAP and populating the p2m with the correct mapping for both memory and hardware pages.

> How does that work on the Xen side? Does the hypervisor depend on the pages
> that belong to the DOM_IO domain having an INVALID_MFN value in the mfn_list?

Xen wouldn't care.
I don''t think its necessary to explicitly do a cross-domain mapping with DOM_IO as we currently do; that''s overkill and/or a misunderstanding on my part.> We do make the PTE that refer to physical devices to be the DOM_IO domain..I think Xen will sort that out for itself when presented with a hardware/device mfn.>> Well, I don''t know. It sounds pretty sensible to me. :-) >> >> Certain virtualisation feature sdisappearing after a save/restore/migrate -- >> or worsse, becoming unreliable -- would be a bit sad. > So having the option of the PCI hole being passed through, and giving > the tools the value (pci_hole) would mean we could migrate an SR-IOV type > device from one machine to another. Constructing the PCI hole using the > XENMEM_machine_memory_map could generate different E820 for the two guests, which > would be indeed a bit sad. > > > --- the patch --- > > diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c > index 50dc626..96a08ef 100644 > --- a/arch/x86/xen/mmu.c > +++ b/arch/x86/xen/mmu.c > @@ -699,7 +699,7 @@ static bool xen_page_pinned(void *ptr) > > static bool xen_iomap_pte(pte_t pte) > { > - return pte_flags(pte) & _PAGE_IOMAP; > + return xen_pfn_is_pci(pte_mfn(pte)); > }I think populating the p2m appropriately in advance is better than this test; this is OK for prototyping I guess, but way to expensive for every set_pte. 
J> > void xen_set_domain_pte(pte_t *ptep, pte_t pteval, unsigned domid) > @@ -801,11 +801,6 @@ void set_pte_mfn(unsigned long vaddr, unsigned long mfn, pgprot_t flags) > void xen_set_pte_at(struct mm_struct *mm, unsigned long addr, > pte_t *ptep, pte_t pteval) > { > - if (xen_iomap_pte(pteval)) { > - xen_set_iomap_pte(ptep, pteval); > - goto out; > - } > - > ADD_STATS(set_pte_at, 1); > // ADD_STATS(set_pte_at_pinned, xen_page_pinned(ptep)); > ADD_STATS(set_pte_at_current, mm == current->mm); > @@ -889,19 +884,6 @@ static pteval_t pte_pfn_to_mfn(pteval_t val) > return val; > } > > -static pteval_t iomap_pte(pteval_t val) > -{ > - if (val & _PAGE_PRESENT) { > - unsigned long pfn = (val & PTE_PFN_MASK) >> PAGE_SHIFT; > - pteval_t flags = val & PTE_FLAGS_MASK; > - > - /* We assume the pte frame number is a MFN, so > - just use it as-is. */ > - val = ((pteval_t)pfn << PAGE_SHIFT) | flags; > - } > - > - return val; > -} > > pteval_t xen_pte_val(pte_t pte) > { > @@ -913,8 +895,8 @@ pteval_t xen_pte_val(pte_t pte) > pteval = (pteval & ~_PAGE_PAT) | _PAGE_PWT; > } > > - if (xen_initial_domain() && (pteval & _PAGE_IOMAP)) > - return pteval; > + if (xen_pfn_is_pci(pte_mfn(pte))) > + pteval |= _PAGE_IOMAP; > > return pte_mfn_to_pfn(pteval); > } > @@ -974,13 +956,14 @@ pte_t xen_make_pte(pteval_t pte) > * mappings are just dummy local mappings to keep other > * parts of the kernel happy. 
>  	 */
> -	if (unlikely(pte & _PAGE_IOMAP) &&
> -	    (xen_initial_domain() || addr >= ISA_END_ADDRESS)) {
> -		pte = iomap_pte(pte);
> -	} else {
> +	if ((unlikely(pte & _PAGE_IOMAP) &&
> +	     (xen_initial_domain() || addr >= ISA_END_ADDRESS)) ||
> +	    (unlikely(xen_pfn_is_pci(PFN_UP(addr)))))
> +		pte |= _PAGE_IOMAP;
> +	else
>  		pte &= ~_PAGE_IOMAP;
> -		pte = pte_pfn_to_mfn(pte);
> -	}
> +
> +	pte = pte_pfn_to_mfn(pte);
>
>  	return native_make_pte(pte);
>  }
> @@ -1037,10 +1020,8 @@ void xen_set_pud(pud_t *ptr, pud_t val)
>
>  void xen_set_pte(pte_t *ptep, pte_t pte)
>  {
> -	if (xen_iomap_pte(pte)) {
> +	if (xen_iomap_pte(pte))
>  		xen_set_iomap_pte(ptep, pte);
> -		return;
> -	}
>
>  	ADD_STATS(pte_update, 1);
>  //	ADD_STATS(pte_update_pinned, xen_page_pinned(ptep));
> @@ -1058,10 +1039,8 @@ void xen_set_pte(pte_t *ptep, pte_t pte)
>  #ifdef CONFIG_X86_PAE
>  void xen_set_pte_atomic(pte_t *ptep, pte_t pte)
>  {
> -	if (xen_iomap_pte(pte)) {
> +	if (xen_iomap_pte(pte))
>  		xen_set_iomap_pte(ptep, pte);
> -		return;
> -	}
>
>  	set_64bit((u64 *)ptep, native_pte_val(pte));
>  }
> diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c
> index 5a1f22d..bb424e3 100644
> --- a/arch/x86/xen/setup.c
> +++ b/arch/x86/xen/setup.c
> @@ -196,6 +196,31 @@ unsigned long xen_find_max_pfn(void)
>  	xen_raw_printk("E820 max_pfn = %ld (nr_pages: %ld)\n", max_pfn, xen_start_info->nr_pages);
>  	return max_pfn;
>  }
> +
> +int xen_pfn_is_pci(unsigned long pfn)
> +{
> +	static struct e820entry map[E820MAX] __initdata;
> +	int rc, op, i;
> +	struct xen_memory_map memmap;
> +	unsigned long long addr = PFN_PHYS(pfn);
> +	memmap.nr_entries = E820MAX;
> +	set_xen_guest_handle(memmap.buffer, map);
> +
> +	op = xen_initial_domain() ?
> +		XENMEM_machine_memory_map :
> +		XENMEM_memory_map;
> +	rc = HYPERVISOR_memory_op(op, &memmap);
> +	BUG_ON(rc);
> +
> +	for (i = 0; i < memmap.nr_entries; i++) {
> +		unsigned long long end = map[i].addr + map[i].size;
> +		if (map[i].type != E820_RAM)
> +			continue;
> +		if (addr >= map[i].addr && addr <= end)
> +			return 0;
> +	}
> +	return 1;
> +}
>  /**
>   * machine_specific_memory_setup - Hook for machine specific memory setup.
>   **/
> diff --git a/include/xen/xen-ops.h b/include/xen/xen-ops.h
> index eee2045..f859b04 100644
> --- a/include/xen/xen-ops.h
> +++ b/include/xen/xen-ops.h
> @@ -31,4 +31,6 @@ int xen_remap_domain_mfn_range(struct vm_area_struct *vma,
>  extern phys_addr_t xen_extra_mem_start;
>  unsigned long xen_find_max_pfn(void);
>
> +int xen_pfn_is_pci(unsigned long pfn);
> +
>  #endif /* INCLUDE_XEN_OPS_H */

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
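[Editorial note: for readers following along outside a kernel tree, the E820 walk in xen_pfn_is_pci() above boils down to a range check against the RAM entries of the map. Below is a minimal userspace sketch of that check; the struct and E820_RAM constant are simplified stand-ins for the kernel definitions, and pfn_is_pci() is a hypothetical helper name, not the patch's actual code.]

```c
#include <assert.h>
#include <stdint.h>

#define E820_RAM 1

/* Simplified stand-in for the kernel's struct e820entry. */
struct e820entry {
	uint64_t addr;	/* start of region */
	uint64_t size;	/* size of region  */
	uint32_t type;	/* E820_RAM, reserved, ... */
};

/*
 * Return 1 if the physical address of @pfn falls outside every RAM
 * region -- i.e. it is a candidate for the PCI hole -- mirroring the
 * logic of xen_pfn_is_pci() in the patch (which refetches the map on
 * every call; a real implementation would cache it).
 *
 * Note: this sketch uses a half-open range check (addr < end); the
 * posted patch's `addr <= end` also matches the first page just past
 * each RAM region.
 */
static int pfn_is_pci(const struct e820entry *map, int nr, uint64_t pfn)
{
	uint64_t addr = pfn << 12;	/* PFN_PHYS() */
	int i;

	for (i = 0; i < nr; i++) {
		uint64_t end = map[i].addr + map[i].size;

		if (map[i].type != E820_RAM)
			continue;
		if (addr >= map[i].addr && addr < end)
			return 0;
	}
	return 1;
}
```

With the E820 quoted earlier in the thread (RAM up to 0xa0000000, then RAM again above 4GB), pfn 0xc0000 (physical 0xC0000000) lands in the hole, while a low pfn lands in RAM.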
Keir Fraser
2010-Nov-16 07:40 UTC
Re: [Xen-devel] [RFC Patch] Support for making an E820 PCI hole in toolstack (xl + xm)
On 15/11/2010 23:11, "Konrad Rzeszutek Wilk" <konrad.wilk@oracle.com> wrote:

>> Well, I don't know. It sounds pretty sensible to me. :-)
>>
>> Certain virtualisation features disappearing after a save/restore/migrate --
>> or worse, becoming unreliable -- would be a bit sad.
>
> So having the option of the PCI hole being passed through, and giving
> the tools the value (pci_hole) would mean we could migrate an SR-IOV type
> device from one machine to another. Constructing the PCI hole using the
> XENMEM_machine_memory_map could generate different E820 for the two guests,
> which would be indeed a bit sad.

SR-IOV is a nice example, but even assuming the user does not have pass-thru
devices attached during migrate, or save/restore, it would still be nice if
they could have the pci-hotplug-passthru facility available to them fully
working both before and after such an event.

 -- Keir
Ian Campbell
2010-Nov-16 09:26 UTC
Re: [Xen-devel] [RFC Patch] Support for making an E820 PCI hole in toolstack (xl + xm)
On Tue, 2010-11-16 at 01:06 +0000, Jeremy Fitzhardinge wrote:

> > How does that work on the Xen side? Does the hypervisor depend on the
> > pages that belong to the DOM_IO domain to have a INVALID_MFN value in
> > the mfn_list?
>
> Xen wouldn't care. I don't think its necessary to explicitly do a
> cross-domain mapping with DOM_IO as we currently do; that's overkill
> and/or a misunderstanding on my part.
>
> > We do make the PTE that refer to physical devices to be the DOM_IO
> > domain..
>
> I think Xen will sort that out for itself when presented with a
> hardware/device mfn.

My main concern would be that the save/restore code will canonicalise all
the MFNs in the page tables back into PFNs and then convert back to MFNs
on the other side, which is likely to go pretty wrong on one end or the
other unless the save/restore code is aware of which MFNs are device
MFNs and which are actual memory. I'm not sure there is any way it can
tell.

** scrobbles around in xc_domain_save.c **

Hrmm... The MFN_IS_IN_PSEUDOPHYS_MAP macro might have some impact on this
issue in some way (depending on what the m2p contains for DOM_IO owned
pages) but I don't think it actually fixes anything. I don't see anything
else which would make this work...

Best case as it stands AFAICT is that MFN_IS_IN_PSEUDOPHYS_MAP causes
device mappings to get zapped, requiring the kernel to reinstate them on
restore. Which isn't so bad I guess.

On an unrelated note I think if we do go down the route of having the
guest kernel punch the holes itself and such we should do so iff
XENMEM_memory_map returns either ENOSYS or nr_entries == 1 to leave open
the possibility of cunning tricks on the tools side in the future.

Ian.
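[Editorial note: the canonicalisation worry can be made concrete with a toy model. Save walks the page tables turning MFNs into PFNs via the M2P; a device MFN belongs to no domain, so the M2P has no meaningful translation for it and the round trip loses the mapping. This is a sketch with invented table contents and helper names, not the actual xc_domain_save logic.]

```c
#include <assert.h>
#include <stdint.h>

#define INVALID_P2M_ENTRY (~0ULL)

/* Toy M2P: machine frame -> pseudo-physical frame for a 4-frame guest.
 * MFNs 0, 1, 2 and 4 are guest RAM; MFN 3 is a device frame, so the
 * M2P holds no sensible translation for it. */
static uint64_t m2p[8] = {
	2, 0, 1, INVALID_P2M_ENTRY, 3,
	INVALID_P2M_ENTRY, INVALID_P2M_ENTRY, INVALID_P2M_ENTRY,
};

/* Canonicalise one PTE's frame for the save image: returns the PFN, or
 * INVALID_P2M_ENTRY when the MFN is not guest memory (roughly the
 * situation the MFN_IS_IN_PSEUDOPHYS_MAP test guards against). The
 * saver can then only zap the mapping and rely on the kernel
 * reinstating it after restore. */
static uint64_t canonicalise(uint64_t mfn, uint64_t max_guest_pfn)
{
	uint64_t pfn = m2p[mfn];

	if (pfn == INVALID_P2M_ENTRY || pfn >= max_guest_pfn)
		return INVALID_P2M_ENTRY;
	return pfn;
}
```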
Keir Fraser
2010-Nov-16 09:52 UTC
Re: [Xen-devel] [RFC Patch] Support for making an E820 PCI hole in toolstack (xl + xm)
On 16/11/2010 09:26, "Ian Campbell" <Ian.Campbell@citrix.com> wrote:

>>> We do make the PTE that refer to physical devices to be the DOM_IO
>>> domain..
>>
>> I think Xen will sort that out for itself when presented with a
>> hardware/device mfn.
>
> My main concern would be that the save/restore code will canonicalise all
> the MFNs in the page tables back into PFNs and then convert back to MFNs
> on the other side, which is likely to go pretty wrong on one end or the
> other unless the save/restore code is aware of which MFNs are device
> MFNs and which are actual memory. I'm not sure there is any way it can
> tell.

The right answer is probably to refuse save/restore/migrate when devices
are passed through. It's somewhere between very hard and very nuts to
attempt that in general. For example, even with SR-IOV, we've only been
talking about it so far for NICs, and then in terms of having a
Solarflare-like acceleration abstraction allowing us to step off of SR-IOV
for at least the duration of the critical bit of the save/restore.

A sensible first goal would simply be to be able to do PCI passthrough
both before and after a s/r/m across reasonably heterogeneous hardware,
but not attempt to maintain such a device passthru *during* the s/r/m.

 -- Keir
Ian Campbell
2010-Nov-16 10:02 UTC
Re: [Xen-devel] [RFC Patch] Support for making an E820 PCI hole in toolstack (xl + xm)
On Tue, 2010-11-16 at 09:52 +0000, Keir Fraser wrote:

> On 16/11/2010 09:26, "Ian Campbell" <Ian.Campbell@citrix.com> wrote:
>
> >>> We do make the PTE that refer to physical devices to be the DOM_IO
> >>> domain..
> >>
> >> I think Xen will sort that out for itself when presented with a
> >> hardware/device mfn.
> >
> > My main concern would be that the save/restore code will canonicalise all
> > the MFNs in the page tables back into PFNs and then convert back to MFNs
> > on the other side, which is likely to go pretty wrong on one end or the
> > other unless the save/restore code is aware of which MFNs are device
> > MFNs and which are actual memory. I'm not sure there is any way it can
> > tell.
>
> The right answer is probably to refuse save/restore/migrate when devices
> are passed through.

Absolutely.

However we are talking about setting up a 1-1 mapping in the P2M region
corresponding to the PCI hole at guest boot and preserving that until
such a time as a device is plugged in, which may be after a migration.

I don't think it matters that no device is passed through at the time of
the migration; in this configuration we still need to arrange for the
relevant P2M entries to be correct after the migration (or at least
before the device gets plugged in; perhaps we can leave holes and only
establish the 1-1 p2m on demand in pcifront?).

So long as this configuration doesn't cause the save/restore code to go
mad it's something we can likely fix up in the guest on restore. My worry
is that the save/restore code will just barf before we get that
opportunity...

Ian.
Keir Fraser
2010-Nov-16 10:11 UTC
Re: [Xen-devel] [RFC Patch] Support for making an E820 PCI hole in toolstack (xl + xm)
On 16/11/2010 10:02, "Ian Campbell" <Ian.Campbell@eu.citrix.com> wrote:

>> The right answer is probably to refuse save/restore/migrate when devices
>> are passed through.
>
> Absolutely.
>
> However we are talking about setting up a 1-1 mapping in the P2M region
> corresponding to the PCI hole at guest boot and preserving that until
> such a time as a device is plugged in, which may be after a migration.
>
> I don't think it matters that no device is passed through at the time of
> the migration; in this configuration we still need to arrange for the
> relevant P2M entries to be correct after the migration (or at least
> before the device gets plugged in; perhaps we can leave holes and only
> establish the 1-1 p2m on demand in pcifront?).

Leaving the hole empty and populating on demand when devices are passed
through would seem sensible.

 -- Keir

> So long as this configuration doesn't cause the save/restore code to go
> mad it's something we can likely fix up in the guest on restore. My worry
> is that the save/restore code will just barf before we get that
> opportunity...
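[Editorial note: the populate-on-demand scheme discussed here amounts to leaving the hole's P2M entries at INVALID_MFN at boot and writing identity entries only when pcifront learns a device's BAR range. A toy sketch follows; the array-backed P2M, the hole bounds and the helper names are all invented for illustration.]

```c
#include <assert.h>
#include <stdint.h>

#define INVALID_MFN (~0ULL)
#define HOLE_START  0xa0000ULL	/* first PFN of the PCI hole (example) */
#define HOLE_END    0x100000ULL	/* first PFN past the hole (4GB) */

static uint64_t p2m[HOLE_END];	/* toy array-backed P2M */

static void p2m_init(void)
{
	uint64_t pfn;

	for (pfn = 0; pfn < HOLE_START; pfn++)
		p2m[pfn] = pfn + 0x1000;	/* arbitrary RAM MFNs */
	for (pfn = HOLE_START; pfn < HOLE_END; pfn++)
		p2m[pfn] = INVALID_MFN;		/* hole left empty at boot */
}

/* Called when a BAR is assigned inside the hole: establish identity
 * 1-1 entries on demand, just for the frames the device actually uses. */
static int p2m_map_bar(uint64_t first_pfn, uint64_t nr)
{
	uint64_t pfn;

	if (first_pfn < HOLE_START || first_pfn + nr > HOLE_END)
		return -1;	/* BAR must fall inside the hole */
	for (pfn = first_pfn; pfn < first_pfn + nr; pfn++)
		p2m[pfn] = pfn;	/* identity: frame is a device MFN */
	return 0;
}
```

Because untouched hole entries stay INVALID_MFN, a save/restore that zaps device mappings only needs the kernel to replay p2m_map_bar() when the device is plugged back in.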
Konrad Rzeszutek Wilk
2010-Nov-16 15:50 UTC
Re: [Xen-devel] [RFC Patch] Support for making an E820 PCI hole in toolstack (xl + xm)
disclaimer: This email got a bit lengthy - so make sure you got a cup of
coffee when you read this.

> On an unrelated note I think if we do go down the route of having the
> guest kernel punch the holes itself and such we should do so iff
> XENMEM_memory_map returns either ENOSYS or nr_entries == 1 to leave open

When would that actually happen? Is that return value returned when the
hypervisor does not implement it (and what version implemented this)?

> the possibility of cunning tricks on the tools side in the future.

<shudders>

I think we have three options in regards to this RFC patch I posted:

1). Continue with this and have the toolstack punch the PCI hole. It
    would fill the PCI hole area with INVALID_MFN. The toolstack
    determines where the PCI hole starts.

2). Do this in the guest, where the guest calls both
    XENMEM_machine_memory_map and XENMEM_memory_map to get an idea of
    the host side PCI hole and set it up. Requires changes in the
    hypervisor to allow a non-privileged PV guest to make the
    XENMEM_machine_memory_map call. The Linux kernel decides where the
    PCI hole starts, and the PCI hole is filled with INVALID_MFN.

3). Make a PCI hole unconditionally, starting at 3GB. PCI hole filled
    with INVALID_MFN.

4). Another one I didn't think of?

For all of those cases, when devices show up we populate the P2M array
with the MFNs on demand.

For the first two proposals the BARs we read off the PCI devices are
going to be written to the P2M array as identity (so mfn_list[0xc0000] ==
0xc0000). Code has not been written.

For the third proposal, we would have non-identity mappings in the P2M
array, as during the migration we could move from a device with BARs of
0xc0000 to 0x20000. So mfn_list[0xc0000] = 0x20000. But for the third
case I am unsure how we would get the "real" MFNs. We initially get the
BARs via 0xcf8 calls and if we don't filter them, it gets to the ioremap
function. Say the host side BAR is at 0x20000, and our PCI hole starts
at 0xc0000.
The ioremap gets called with 0x20000, and in its E820 that region is
'System RAM'.

	last_pfn = last_addr >> PAGE_SHIFT;
	for (pfn = phys_addr >> PAGE_SHIFT; pfn <= last_pfn; pfn++) {
		int is_ram = page_is_ram(pfn);

		if (is_ram && pfn_valid(pfn) && !PageReserved(pfn_to_page(pfn)))
			return NULL;
		WARN_ON_ONCE(is_ram);
	}

Ugh, and it will think (correctly) that it falls within RAM.

If we filter the 0xcf8 calls, which we can do in the Xen PCI backend
case, we can then provide BARs that always start at 0xC0000. But that
does not help the PV guest to know the "real" MFNs which it needs so it
can program the P2M array. So the Xen PCI front would have to do this -
which it could, though it adds complexity to it.

We also need to make sure all of this works with domain zero, and here
1) or 2) can easily be used, as the Xen hypervisor has given us the E820
nicely peppered with holes. (I wonder, what happens if dom0 makes a
XENMEM_memory_map call - does it get anything?)

If we then go with 3), we would need to instrument the code that reads
the BARs so that it can filter them properly. That would be the
low-level Linux pci_conf_read, and that is not going to happen - so we
would have to make the Xen hypervisor be aware of this and, when it
traps the in/out, provide new BAR values starting at 0xC0000. I am not
comfortable maintaining this filter/keep-state code in both the Xen
hypervisor and the Xen PCI front module, so I think 3) would not work
that well, unless there are better ways that I have missed?

Back to 1) and 2). Migration would work if we unplug the PCI devices
before suspend and on resume plug them back in - otherwise the PCI BARs
might have changed between migrations. When the guest gets recreated -
how does it iterate over the E820 to create the P2M list? Or is that
something that is not done, and we just save the P2M list and restore it
as-is on the other side? Naturally, since we would unplug the PCI
device, the entries in the E820 gaps would be INVALID_MFN...
If we consult the E820 during resume, I think doing the PCI hole in the
toolstack has merits - simply b/c the user can set the PCI hole to an
arbitrary address that is low enough (0x2000 say) to cover all of the
machines that he/she would migrate to. While if we do it in the Linux
kernel we do not have that information.

Even if we don't consult the E820, the toolstack still has merits - as
the PCI hole start address might be different between the migration
machines, and we might have started on a box with the PCI hole being way
up (3.9GB) while the other machines might have it at 1.2GB.

The other thing I don't know is how all of this works with 32-bit
kernels?

P.S. I've done the testing of 1) with 64-bit w/ and w/o ballooning and
it worked fine.
Jeremy Fitzhardinge
2010-Nov-16 18:01 UTC
Re: [Xen-devel] [RFC Patch] Support for making an E820 PCI hole in toolstack (xl + xm)
On 11/16/2010 02:11 AM, Keir Fraser wrote:

> On 16/11/2010 10:02, "Ian Campbell" <Ian.Campbell@eu.citrix.com> wrote:
>
> >>> The right answer is probably to refuse save/restore/migrate when
> >>> devices are passed through.
> >>
> >> Absolutely.
> >>
> >> However we are talking about setting up a 1-1 mapping in the P2M region
> >> corresponding to the PCI hole at guest boot and preserving that until
> >> such a time as a device is plugged in, which may be after a migration.
> >>
> >> I don't think it matters that no device is passed through at the time of
> >> the migration, in this configuration we still need arrange for the
> >> relevant P2M entries to be correct after the migration (or at least
> >> before the device gets plugged in, perhaps we can leave holes and only
> >> establish the 1-1 p2m on demand in pcifront?).
>
> Leave the hole empty and populate on demand when devices are passed
> through would seem sensible.

Actually I was originally thinking that the hole would all be INVALID_MFN
but then pfn_to_mfn() would translate that to being an identity
translation. But that's pretty hacky, and only works if you actually want
identity. On-demand population of regions is much cleaner.

    J
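[Editorial note: the hack described (and discarded) above would have pfn_to_mfn() special-case hole PFNs whose P2M entry is INVALID_MFN and hand back the PFN unchanged. A toy illustration, with an invented lookup table; the hypothetical name pfn_to_mfn_hack makes clear this is not the kernel's pfn_to_mfn.]

```c
#include <assert.h>
#include <stdint.h>

#define INVALID_MFN (~0ULL)

/* Toy P2M: PFNs 3 and 4 sit in the PCI hole, so their entries were
 * left at INVALID_MFN at boot. */
static uint64_t p2m[16] = {
	7, 3, 9, INVALID_MFN, INVALID_MFN,
	/* remaining entries zero-initialised */
};

/* The rejected variant: INVALID_MFN inside the hole means "identity".
 * It only works when identity is what you want -- which is exactly why
 * on-demand population of real entries is the cleaner design. */
static uint64_t pfn_to_mfn_hack(uint64_t pfn)
{
	if (p2m[pfn] == INVALID_MFN)
		return pfn;
	return p2m[pfn];
}
```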
Gianni Tedesco
2010-Nov-17 11:14 UTC
[Xen-devel] Re: [RFC Patch] Support for making an E820 PCI hole in toolstack (xl + xm)
On Fri, 2010-11-12 at 23:08 +0000, Konrad Rzeszutek Wilk wrote:

> Hey guys,
>
> Attached is an RFC patch for making a PCI hole in the PV guests. This
> allows PV guests(*) with 4GB or more to now properly work with or
> without PCI passthrough cards.
>
> Previously the Linux kernel would not be able to allocate the PCI region
> underneath the 4GB region as that region was all System RAM. And you
> would see this:
>
> [ 0.000000] PM: Registered nosave memory: 00000000000a0000 - 0000000000100000
> [ 0.000000] PCI: Warning: Cannot find a gap in the 32bit address range
> [ 0.000000] PCI: Unassigned devices with 32bit resource registers may break!
> [ 0.000000] Allocating PCI resources starting at 100100000 (gap: 100100000:400000)
>
> This patchset punches a PCI hole in the E820 region and as well fills
> the P2M properly, so that now you can see (*):
>
> [ 0.000000] Allocating PCI resources starting at a0000000 (gap: a0000000:60000000)
>
> It adds a new option to the guest config file, which is "pci_hole". The
> user can specify the PFN number, such as '0xc0000' or in case of using
> xl, '1' which will automatically figure out the start of the PCI
> address.
>
> *: This option requires support in the Linux kernel to actually deal
> with two entries in the E820 map and P2M space filled with ~0.

Doesn't this problem affect dom0 also? How is it to be fixed there?

Gianni
Ian Campbell
2010-Nov-17 11:43 UTC
Re: [Xen-devel] Re: [RFC Patch] Support for making an E820 PCI hole in toolstack (xl + xm)
On Wed, 2010-11-17 at 11:14 +0000, Gianni Tedesco wrote:

> On Fri, 2010-11-12 at 23:08 +0000, Konrad Rzeszutek Wilk wrote:
> > Hey guys,
> >
> > Attached is an RFC patch for making a PCI hole in the PV guests. This
> > allows PV guests(*) with 4GB or more to now properly work with or
> > without PCI passthrough cards.
[...]
> Doesn't this problem affect dom0 also? How is it to be fixed there?

Domain 0 sees the actual host e820 map instead of the paravirtualised
one constructed by the tools for domU. So the BIOS effectively takes the
place of the tools side of this patch for dom0.

Ian.
Gianni Tedesco
2010-Nov-17 13:37 UTC
Re: [Xen-devel] Re: [RFC Patch] Support for making an E820 PCI hole in toolstack (xl + xm)
On Wed, 2010-11-17 at 11:43 +0000, Ian Campbell wrote:

> On Wed, 2010-11-17 at 11:14 +0000, Gianni Tedesco wrote:
> > On Fri, 2010-11-12 at 23:08 +0000, Konrad Rzeszutek Wilk wrote:
> > > Attached is an RFC patch for making a PCI hole in the PV guests.
[...]
> > Doesn't this problem affect dom0 also? How is it to be fixed there?
>
> Domain 0 sees the actual host e820 map instead of the paravirtualised
> one constructed by the tools for domU. So the BIOS effectively takes the
> place of the tools side of this patch for dom0.

Hmm, of course, I did think xen "sanitised" it somehow though. I suppose
I just assumed (hoped) this may fix my dom0 boot crash issues.
Gianni
Ian Campbell
2010-Nov-17 14:23 UTC
Re: [Xen-devel] [RFC Patch] Support for making an E820 PCI hole in toolstack (xl + xm)
On Tue, 2010-11-16 at 15:50 +0000, Konrad Rzeszutek Wilk wrote:

> disclaimer: This email got a bit lengthy - so make sure you got a cup
> of coffee when you read this.
>
> > On an unrelated note I think if we do go down the route of having the
> > guest kernel punch the holes itself and such we should do so iff
> > XENMEM_memory_map returns either ENOSYS or nr_entries == 1 to leave open
>
> When would that actually happen? Is that return value returned when the
> hypervisor does not implement it (and what version implemented this)?

-ENOSYS implies an older Xen which does not have this interface.
Currently this causes the guest to create a fake one-entry e820 covering
0..nr_pages, which is what old guests which don't know about the
hypercall do too.

If the hypercall returns an e820 with nr_entries == 1 then this implies
a newer Xen which implements the interface but where the tools have only
poked down a simple one-entry e820 covering 0..nr_pages or possibly
0..max_pages (this is all any existing hypervisor/tools will do).

If the hypervisor returns nr_entries >= 2 then you have some future Xen
which has tools which (think they) know what they are doing, and so we
should trust the e820 given to us.

Without allowing for this now we will end up with
XENFEAT_tools_provide_a_useful_guest_e820, which would be a shame!

Ian.
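[Editorial note: the versioning rule above can be phrased as a single predicate -- trust the tools-provided map only when the hypercall exists and returned more than one entry, otherwise synthesise the flat 0..nr_pages map and punch holes in the guest. This sketch uses a hypothetical helper name, not actual kernel code.]

```c
#include <errno.h>

/* Decide whether to trust the e820 returned by XENMEM_memory_map.
 * rc: hypercall return value; nr_entries: entries it filled in.
 * Returns 1 to trust the map as-is, 0 to fall back to a fake
 * one-entry e820 covering 0..nr_pages (and let the guest kernel
 * punch the PCI hole itself). */
static int trust_tools_e820(int rc, unsigned int nr_entries)
{
	if (rc == -ENOSYS)		/* old hypervisor: no such interface */
		return 0;
	if (rc == 0 && nr_entries <= 1)	/* tools poked down a flat map */
		return 0;
	return rc == 0;			/* >= 2 entries: trust the tools */
}
```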