Konrad Rzeszutek Wilk
2010-Nov-12 23:08 UTC
[Xen-devel] [RFC Patch] Support for making an E820 PCI hole in toolstack (xl + xm)
Hey guys,

Attached is an RFC patch for making a PCI hole in PV guests. This allows PV guests(*) with 4GB or more to now work properly with or without PCI passthrough cards.

Previously the Linux kernel would not be able to allocate the PCI region underneath the 4GB boundary, as that region was all System RAM. And you would see this:

[    0.000000] PM: Registered nosave memory: 00000000000a0000 - 0000000000100000
[    0.000000] PCI: Warning: Cannot find a gap in the 32bit address range
[    0.000000] PCI: Unassigned devices with 32bit resource registers may break!
[    0.000000] Allocating PCI resources starting at 100100000 (gap: 100100000:400000)

This patchset punches a PCI hole in the E820 map and also fills the P2M properly, so that now you can see (*):

[    0.000000] Allocating PCI resources starting at a0000000 (gap: a0000000:60000000)

It adds a new option to the guest config file, "pci_hole". The user can specify the start PFN, such as '0xc0000', or, in case of using xl, '1', which will automatically figure out the start of the PCI hole.

*: This option requires support in the Linux kernel to actually deal with two entries in the E820 map and P2M space filled with ~0.
The patches (draft, not ready for upstream) for the Linux kernel to support this are available at:

git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen.git devel/e820-hole

All of these patches make the E820 of the Linux guest with 4GB (or more) passed look like this (2.6.37-rc1+devel/e820-hole):
[    0.000000] Xen: 0000000000000000 - 00000000000a0000 (usable)
[    0.000000] Xen: 00000000000a0000 - 0000000000100000 (reserved)
[    0.000000] Xen: 0000000000100000 - 00000000a0000000 (usable)
[    0.000000] Xen: 0000000100000000 - 0000000160800000 (usable)

compared to (2.6.36):
[    0.000000] Xen: 0000000000000000 - 00000000000a0000 (usable)
[    0.000000] Xen: 00000000000a0000 - 0000000000100000 (reserved)
[    0.000000] Xen: 0000000000100000 - 0000000100000000 (usable)

and (2.6.37-rc1):
[    0.000000] Xen: 0000000000000000 - 00000000000a0000 (usable)
[    0.000000] Xen: 00000000000a0000 - 0000000000100000 (reserved)
[    0.000000] Xen: 0000000000100000 - 0000000100800000 (usable)

In regards to the patches that I am attaching here, what is the magic incantation to make the indentation/StyleGuide proper for the tools/libxc directory? The tab spacing is off a bit (I think).

I've tested this so far only on 64-bit guests, and I am quite sure the tool-stack needs some extra care for the 32-bit guests.

But please take a look and give feedback.

diff --git a/tools/libxc/xc_dom.h b/tools/libxc/xc_dom.h
--- a/tools/libxc/xc_dom.h
+++ b/tools/libxc/xc_dom.h
@@ -91,6 +91,8 @@ struct xc_dom_image {

     /* physical memory */
     xen_pfn_t total_pages;
+    /* start of the pci_hole. goes up to 4gb */
+    xen_pfn_t pci_hole;
     struct xc_dom_phys *phys_pages;
     int realmodearea_log;

diff --git a/tools/libxc/xc_dom_core.c b/tools/libxc/xc_dom_core.c
--- a/tools/libxc/xc_dom_core.c
+++ b/tools/libxc/xc_dom_core.c
@@ -715,17 +715,22 @@ int xc_dom_update_guest_p2m(struct xc_do
     uint32_t *p2m_32;
     uint64_t *p2m_64;
     xen_pfn_t i;
+    size_t tot_pages;

     if ( !dom->p2m_guest )
         return 0;

+    tot_pages = dom->total_pages;
+    if (dom->pci_hole)
+        tot_pages += (0x100000 - dom->pci_hole);
+
     switch ( dom->arch_hooks->sizeof_pfn )
     {
     case 4:
         DOMPRINTF("%s: dst 32bit, pages 0x%" PRIpfn "",
-                  __FUNCTION__, dom->total_pages);
+                  __FUNCTION__, tot_pages);
         p2m_32 = dom->p2m_guest;
-        for ( i = 0; i < dom->total_pages; i++ )
+        for ( i = 0; i < tot_pages; i++ )
             if ( dom->p2m_host[i] != INVALID_P2M_ENTRY )
                 p2m_32[i] = dom->p2m_host[i];
             else
@@ -733,9 +738,9 @@ int xc_dom_update_guest_p2m(struct xc_do
         break;
     case 8:
         DOMPRINTF("%s: dst 64bit, pages 0x%" PRIpfn "",
-                  __FUNCTION__, dom->total_pages);
+                  __FUNCTION__, tot_pages);
         p2m_64 = dom->p2m_guest;
-        for ( i = 0; i < dom->total_pages; i++ )
+        for ( i = 0; i < tot_pages; i++ )
             if ( dom->p2m_host[i] != INVALID_P2M_ENTRY )
                 p2m_64[i] = dom->p2m_host[i];
             else
diff --git a/tools/libxc/xc_dom_x86.c b/tools/libxc/xc_dom_x86.c
--- a/tools/libxc/xc_dom_x86.c
+++ b/tools/libxc/xc_dom_x86.c
@@ -406,6 +406,15 @@ static int alloc_magic_pages(struct xc_d
 {
     size_t p2m_size = dom->total_pages * dom->arch_hooks->sizeof_pfn;

+    if (dom->pci_hole && (dom->total_pages > dom->pci_hole))
+    {
+        size_t p2m_pci_hole_size = (0x100000 - dom->pci_hole) *
+                                   dom->arch_hooks->sizeof_pfn;
+
+        DOMPRINTF("%s: Expanding P2M to include PCI hole (%ld->%ld)\n",
+                  __FUNCTION__, p2m_size, p2m_size + p2m_pci_hole_size);
+        p2m_size += p2m_pci_hole_size;
+    }
     /* allocate phys2mach table */
     if ( xc_dom_alloc_segment(dom, &dom->p2m_seg, "phys2mach", 0, p2m_size) )
         return -1;
@@ -712,6 +721,7 @@ int arch_setup_meminit(struct xc_dom_ima
 {
     int rc;
     xen_pfn_t pfn, allocsz, i, j, mfn;
+    size_t p2m_size;

     rc = x86_compat(dom->xch, dom->guest_domid, dom->guest_type);
     if ( rc )
@@ -723,8 +733,13 @@ int arch_setup_meminit(struct xc_dom_ima
         if ( rc )
             return rc;
     }
+    p2m_size = dom->total_pages;

-    dom->p2m_host = xc_dom_malloc(dom, sizeof(xen_pfn_t) * dom->total_pages);
+    if (dom->pci_hole && (dom->total_pages > dom->pci_hole))
+        p2m_size += (0x100000 - dom->pci_hole);
+
+    DOMPRINTF("Allocating %ld bytes for P2M", p2m_size * sizeof(xen_pfn_t));
+    dom->p2m_host = xc_dom_malloc(dom, sizeof(xen_pfn_t) * p2m_size);
     if ( dom->superpages )
     {
         int count = dom->total_pages >> SUPERPAGE_PFN_SHIFT;
@@ -750,21 +765,66 @@ int arch_setup_meminit(struct xc_dom_ima
     }
     else
     {
-        /* setup initial p2m */
-        for ( pfn = 0; pfn < dom->total_pages; pfn++ )
-            dom->p2m_host[pfn] = pfn;
-
-        /* allocate guest memory */
-        for ( i = rc = allocsz = 0;
-              (i < dom->total_pages) && !rc;
-              i += allocsz )
+        /* for PCI mapping, stick INVALID_MFN in the PCI_HOLE */
+        if ( dom->pci_hole && (dom->total_pages > dom->pci_hole) )
         {
-            allocsz = dom->total_pages - i;
-            if ( allocsz > 1024*1024 )
-                allocsz = 1024*1024;
-            rc = xc_domain_populate_physmap_exact(
-                dom->xch, dom->guest_domid, allocsz,
-                0, 0, &dom->p2m_host[i]);
+            /* setup initial p2m in three passes. */
+            for (pfn = 0; pfn < dom->pci_hole; pfn++)
+                dom->p2m_host[pfn] = pfn;
+
+            xc_dom_printf (dom->xch, "%s: 0x0->0x%lx has PFNs.", __FUNCTION__, pfn);
+            xc_dom_printf (dom->xch, "%s: 0x%lx -> 0x%x has INVALID_MFN",
+                           __FUNCTION__, pfn, 0x100000);
+            for (; pfn < 0x100000; pfn++)
+                dom->p2m_host[pfn] = INVALID_MFN;
+
+            for (; pfn < 0x100000 + dom->total_pages - dom->pci_hole; pfn++)
+                dom->p2m_host[pfn] = pfn;
+            xc_dom_printf (dom->xch, "%s: 0x%x -> 0x%lx has PFNs.", __FUNCTION__,
+                           0x100000, pfn);
+
+            /* allocate guest memory in two passes. */
+            for (i = rc = allocsz = 0; (i < dom->pci_hole) && !rc; i += allocsz)
+            {
+                allocsz = dom->pci_hole - i;
+                xc_dom_printf (dom->xch, "%s: Populating M2P 0x%lx->0x%lx",
+                               __FUNCTION__, i, i + allocsz);
+                rc = xc_domain_populate_physmap_exact (dom->xch, dom->guest_domid,
+                                                       allocsz, 0, 0,
+                                                       &dom->p2m_host[i]);
+            }
+            for (i = 0x100000, allocsz = rc = 0;
+                 (i < (0x100000 + dom->total_pages - dom->pci_hole))
+                 && !rc; i += allocsz)
+            {
+                allocsz = (dom->total_pages - dom->pci_hole) - (i - 0x100000);
+                if (allocsz > 1024 * 1024)
+                    allocsz = 1024 * 1024;
+                xc_dom_printf (dom->xch, "%s: Populating M2P 0x%lx->0x%lx",
+                               __FUNCTION__, i, i + allocsz);
+                rc = xc_domain_populate_physmap_exact (dom->xch, dom->guest_domid,
+                                                       allocsz, 0, 0,
+                                                       &dom->p2m_host[i]);
+            }
+            xc_dom_printf (dom->xch, "%s: Done with PCI populate physmap",
+                           __FUNCTION__);
+        } else {
+            /* setup initial p2m */
+            for ( pfn = 0; pfn < dom->total_pages; pfn++ )
+                dom->p2m_host[pfn] = pfn;
+
+            /* allocate guest memory */
+            for ( i = rc = allocsz = 0;
+                  (i < dom->total_pages) && !rc;
+                  i += allocsz )
+            {
+                allocsz = dom->total_pages - i;
+                if ( allocsz > 1024*1024 )
+                    allocsz = 1024*1024;
+                rc = xc_domain_populate_physmap_exact(
+                    dom->xch, dom->guest_domid, allocsz,
+                    0, 0, &dom->p2m_host[i]);
+            }
         }
     }

diff --git a/tools/libxc/xc_domain.c b/tools/libxc/xc_domain.c
--- a/tools/libxc/xc_domain.c
+++ b/tools/libxc/xc_domain.c
@@ -481,16 +481,25 @@ int xc_domain_pin_memory_cacheattr(xc_in
 #include "xc_e820.h"
 int xc_domain_set_memmap_limit(xc_interface *xch,
                                uint32_t domid,
-                               unsigned long map_limitkb)
+                               unsigned long map_limitkb,
+                               xen_pfn_t pci_hole_start)
 {
     int rc;
+    uint64_t delta_kb;
+    size_t e820_sz;
     struct xen_foreign_memory_map fmap = {
         .domid = domid,
         .map = { .nr_entries = 1 }
     };
     DECLARE_HYPERCALL_BUFFER(struct e820entry, e820);

-    e820 = xc_hypercall_buffer_alloc(xch, e820, sizeof(*e820));
+    delta_kb = map_limitkb - (uint64_t)(pci_hole_start << 2);
+    if (pci_hole_start && (delta_kb > 0))
+        e820_sz = sizeof(*e820);
+    else
+        e820_sz = sizeof(*e820)*2;
+
+    e820 = xc_hypercall_buffer_alloc(xch, e820, e820_sz);

     if ( e820 == NULL )
     {
@@ -502,6 +511,16 @@ int xc_domain_set_memmap_limit(xc_interf
     e820->size = (uint64_t)map_limitkb << 10;
     e820->type = E820_RAM;

+    if (pci_hole_start && (delta_kb > 0))
+    {
+        fmap.map.nr_entries ++;
+        e820[0].size = (uint64_t)pci_hole_start << 12;
+
+        e820[1].type = E820_RAM;
+        e820[1].addr = (uint64_t)0x100000 << 12; /* val in pfn... */
+        e820[1].size = (uint64_t)delta_kb << 10; /* .. while here in in kB. */
+    }
+
     set_xen_guest_handle(fmap.map.buffer, e820);

     rc = do_memory_op(xch, XENMEM_set_memory_map, &fmap, sizeof(fmap));
@@ -513,7 +532,8 @@ int xc_domain_set_memmap_limit(xc_interf
 #else
 int xc_domain_set_memmap_limit(xc_interface *xch,
                                uint32_t domid,
-                               unsigned long map_limitkb)
+                               unsigned long map_limitkb,
+                               xen_pfn_t pci_hole_start)
 {
     PERROR("Function not implemented");
     errno = ENOSYS;
diff --git a/tools/libxc/xenctrl.h b/tools/libxc/xenctrl.h
--- a/tools/libxc/xenctrl.h
+++ b/tools/libxc/xenctrl.h
@@ -913,7 +913,8 @@ int xc_domain_setmaxmem(xc_interface *xc

 int xc_domain_set_memmap_limit(xc_interface *xch,
                                uint32_t domid,
-                               unsigned long map_limitkb);
+                               unsigned long map_limitkb,
+                               xen_pfn_t pci_hole_start);

 int xc_domain_set_time_offset(xc_interface *xch,
                               uint32_t domid,
diff --git a/tools/libxl/libxl.h b/tools/libxl/libxl.h
--- a/tools/libxl/libxl.h
+++ b/tools/libxl/libxl.h
@@ -392,6 +392,7 @@ int libxl_device_disk_getinfo(libxl_ctx
                               libxl_device_disk *disk, libxl_diskinfo *diskinfo);
 int libxl_cdrom_insert(libxl_ctx *ctx, uint32_t domid, libxl_device_disk *disk);

+int libxl_find_pci_hole(uint32_t *start_pfn);
 /*
  * Make a disk available in this domain. Returns path to a device.
  */
diff --git a/tools/libxl/libxl.idl b/tools/libxl/libxl.idl
--- a/tools/libxl/libxl.idl
+++ b/tools/libxl/libxl.idl
@@ -110,6 +110,7 @@ libxl_domain_build_info = Struct("domain
                                  ])),
                 ("pv", "!%s", Struct(None,
                                      [("slack_memkb", uint32),
+                                      ("pci_hole_start", uint32),
                                       ("bootloader", string),
                                       ("bootloader_args", string),
                                       ("cmdline", string),
diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c
--- a/tools/libxl/libxl_dom.c
+++ b/tools/libxl/libxl_dom.c
@@ -71,7 +71,8 @@ int libxl__build_pre(libxl_ctx *ctx, uin
     xc_domain_setmaxmem(ctx->xch, domid, info->target_memkb + LIBXL_MAXMEM_CONSTANT);
     xc_domain_set_memmap_limit(ctx->xch, domid, (info->hvm) ? info->max_memkb :
-        (info->max_memkb + info->u.pv.slack_memkb));
+        (info->max_memkb + info->u.pv.slack_memkb),
+        (info->hvm) ? 0 : info->u.pv.pci_hole_start);
     xc_domain_set_tsc_info(ctx->xch, domid, info->tsc_mode, 0, 0, 0);
     if ( info->disable_migrate )
         xc_domain_disable_migrate(ctx->xch, domid);
@@ -181,6 +182,8 @@ int libxl__build_pv(libxl_ctx *ctx, uint
             }
         }
     }
+    if ( info->u.pv.pci_hole_start)
+        dom->pci_hole = info->u.pv.pci_hole_start;

     dom->flags = flags;
     dom->console_evtchn = state->console_port;
diff --git a/tools/libxl/libxl_pci.c b/tools/libxl/libxl_pci.c
--- a/tools/libxl/libxl_pci.c
+++ b/tools/libxl/libxl_pci.c
@@ -1066,3 +1066,51 @@ int libxl_device_pci_shutdown(libxl_ctx
     free(pcidevs);
     return 0;
 }
+
+#define MAX_LINE 300
+int libxl_find_pci_hole(uint32_t *start_pfn)
+{
+    FILE *fp;
+    char *s;
+    char buf[MAX_LINE];
+    int ret = -ENODEV;
+    long int pci_hole_phys;
+
+    *start_pfn = 0;
+    fp = fopen("/proc/iomem", "r");
+    if (!fp)
+        return ret;
+
+    while (1) {
+        s = fgets(buf, MAX_LINE, fp);
+        if (!s)
+            break;
+        if (strlen(buf) < 1)
+            continue;
+        if (buf[strlen(buf)-1] == '\n')
+            buf[strlen(buf)-1] = '\0';
+        s = strchr(buf,'P');
+        if (!s)
+            continue;
+        if (strncmp(s, "PCI", 3) == 0) {
+            if (buf[0] == ' ')
+                continue;
+            s = strchr(buf,'-');
+            if (!s)
+                break;
+            s[0]='\0';
+            pci_hole_phys = strtol(buf, NULL, 16);
+            if (!pci_hole_phys)
+                break;
+            /* We don't want to the holes below 16MB. */
+            if (pci_hole_phys <= 0x1000)
+                continue;
+            *start_pfn = pci_hole_phys >> 12;
+            fprintf(stderr,"The value is 0x%d\n", *start_pfn);
+            ret = 0;
+            break;
+        }
+    }
+    fclose(fp);
+    return ret;
+}
diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
--- a/tools/libxl/xl_cmdimpl.c
+++ b/tools/libxl/xl_cmdimpl.c
@@ -1078,6 +1078,14 @@ skip_vfb:
     if (!xlu_cfg_get_long (config, "pci_power_mgmt", &l))
         pci_power_mgmt = l;

+    if (!xlu_cfg_get_long (config, "pci_hole", &l)) {
+        if (l == 1) {
+            uint32_t pfn_start = 0;
+            if (!libxl_find_pci_hole(&pfn_start))
+                b_info->u.pv.pci_hole_start = pfn_start;
+        } else
+            b_info->u.pv.pci_hole_start = l;
+    }
     if (!xlu_cfg_get_list (config, "pci", &pcis, 0, 0)) {
         int i;
         d_config->num_pcidevs = 0;
diff --git a/tools/python/xen/lowlevel/xc/xc.c b/tools/python/xen/lowlevel/xc/xc.c
--- a/tools/python/xen/lowlevel/xc/xc.c
+++ b/tools/python/xen/lowlevel/xc/xc.c
@@ -458,6 +458,7 @@ static PyObject *pyxc_linux_build(XcObje
     unsigned int mem_mb;
     unsigned long store_mfn = 0;
     unsigned long console_mfn = 0;
+    unsigned long pci_hole_start = 0;
     PyObject* elfnote_dict;
     PyObject* elfnote = NULL;
     PyObject* ret;
@@ -467,14 +468,16 @@ static PyObject *pyxc_linux_build(XcObje
                                 "console_evtchn", "image",
                                 /* optional */
                                 "ramdisk", "cmdline", "flags",
-                                "features", "vhpt", "superpages", NULL };
-
-    if ( !PyArg_ParseTupleAndKeywords(args, kwds, "iiiis|ssisii", kwd_list,
+                                "features", "vhpt", "superpages",
+                                "pci_hole", NULL };
+
+    if ( !PyArg_ParseTupleAndKeywords(args, kwds, "iiiis|ssisiii", kwd_list,
                                       &domid, &store_evtchn, &mem_mb,
                                       &console_evtchn, &image,
                                       /* optional */
                                       &ramdisk, &cmdline, &flags,
-                                      &features, &vhpt, &superpages) )
+                                      &features, &vhpt, &superpages,
+                                      &pci_hole_start) )
         return NULL;

     xc_dom_loginit(self->xc_handle);
@@ -486,6 +489,8 @@ static PyObject *pyxc_linux_build(XcObje

     dom->superpages = superpages;

+    dom->pci_hole = pci_hole_start;
+
     if ( xc_dom_linux_build(self->xc_handle, dom, domid, mem_mb, image,
                             ramdisk, flags, store_evtchn, &store_mfn,
                             console_evtchn, &console_mfn) != 0 ) {
@@ -1659,11 +1664,13 @@ static PyObject *pyxc_domain_set_memmap_
 {
     uint32_t dom;
     unsigned int maplimit_kb;
-
-    if ( !PyArg_ParseTuple(args, "ii", &dom, &maplimit_kb) )
+    unsigned long pci_hole_start = 0;
+
+    if ( !PyArg_ParseTuple(args, "ii|i", &dom, &maplimit_kb, &pci_hole_start) )
         return NULL;

-    if ( xc_domain_set_memmap_limit(self->xc_handle, dom, maplimit_kb) != 0 )
+    if ( xc_domain_set_memmap_limit(self->xc_handle, dom, maplimit_kb,
+                                    pci_hole_start) != 0 )
         return pyxc_error_to_exception(self->xc_handle);

     Py_INCREF(zero);
@@ -2661,6 +2668,7 @@ static PyMethodDef pyxc_methods[] = {
       "Set a domain's physical memory mappping limit\n"
       " dom [int]: Identifier of domain.\n"
       " map_limitkb [int]: .\n"
+      " pci_hole_start [int]: PFN for start of PCI hole (optional).\n"
       "Returns: [int] 0 on success; -1 on error.\n" },
 #ifdef __ia64__
diff --git a/tools/python/xen/xend/XendConfig.py b/tools/python/xen/xend/XendConfig.py
--- a/tools/python/xen/xend/XendConfig.py
+++ b/tools/python/xen/xend/XendConfig.py
@@ -241,6 +241,7 @@ XENAPI_CFG_TYPES = {
     'suppress_spurious_page_faults': bool0,
     's3_integrity' : int,
     'superpages' : int,
+    'pci_hole' : int,
     'memory_sharing': int,
     'pool_name' : str,
     'Description': str,
@@ -422,6 +423,7 @@ class XendConfig(dict):
             'target': 0,
             'pool_name' : 'Pool-0',
             'superpages': 0,
+            'pci_hole': 0,
             'description': '',
         }
@@ -2135,6 +2137,9 @@ class XendConfig(dict):
             image.append(['args', self['PV_args']])
         if self.has_key('superpages'):
             image.append(['superpages', self['superpages']])
+        if self.has_key('pci_hole'):
+            image.append(['pci_hole', self['pci_hole']])
+
         for key in XENAPI_PLATFORM_CFG_TYPES.keys():
             if key in self['platform']:
@@ -2179,6 +2184,10 @@ class XendConfig(dict):
             val = sxp.child_value(image_sxp, 'superpages')
             if val is not None:
                 self['superpages'] = val
+
+            val = sxp.child_value(image_sxp, 'pci_hole')
+            if val is not None:
+                self['pci_hole'] = val

             val = sxp.child_value(image_sxp, 'memory_sharing')
             if val is not None:
diff --git a/tools/python/xen/xend/image.py b/tools/python/xen/xend/image.py
--- a/tools/python/xen/xend/image.py
+++ b/tools/python/xen/xend/image.py
@@ -84,6 +84,7 @@ class ImageHandler:
     ostype = None

     superpages = 0
+    pci_hole = 0
     memory_sharing = 0

     def __init__(self, vm, vmConfig):
@@ -711,6 +712,7 @@ class LinuxImageHandler(ImageHandler):
         self.vramsize = int(vmConfig['platform'].get('videoram',4)) * 1024
         self.is_stubdom = (self.kernel.find('stubdom') >= 0)
         self.superpages = int(vmConfig['superpages'])
+        self.pci_hole = int(vmConfig['pci_hole'])

     def buildDomain(self):
         store_evtchn = self.vm.getStorePort()
@@ -729,6 +731,7 @@ class LinuxImageHandler(ImageHandler):
         log.debug("features = %s", self.vm.getFeatures())
         log.debug("flags = %d", self.flags)
         log.debug("superpages = %d", self.superpages)
+        log.debug("pci_hole = %d", self.pci_hole)

         if arch.type == "ia64":
             log.debug("vhpt = %d", self.vhpt)
@@ -742,7 +745,8 @@ class LinuxImageHandler(ImageHandler):
                               features = self.vm.getFeatures(),
                               flags = self.flags,
                               vhpt = self.vhpt,
-                              superpages = self.superpages)
+                              superpages = self.superpages,
+                              pci_hole = self.pci_hole)

     def getBitSize(self):
         return xc.getBitSize(image = self.kernel,
@@ -774,7 +778,6 @@ class LinuxImageHandler(ImageHandler):
             args = args + ([ "-M", "xenpv"])
         return args

-
 class HVMImageHandler(ImageHandler):

     ostype = "hvm"
@@ -1065,7 +1068,7 @@ class X86_Linux_ImageHandler(LinuxImageH
         # set physical mapping limit
         # add an 8MB slack to balance backend allocations.
         mem_kb = self.getRequiredMaximumReservation() + (8 * 1024)
-        xc.domain_set_memmap_limit(self.vm.getDomid(), mem_kb)
+        xc.domain_set_memmap_limit(self.vm.getDomid(), mem_kb, self.pci_hole)
         rc = LinuxImageHandler.buildDomain(self)
         self.setCpuid()
         return rc
diff --git a/tools/python/xen/xm/create.py b/tools/python/xen/xm/create.py
--- a/tools/python/xen/xm/create.py
+++ b/tools/python/xen/xm/create.py
@@ -680,6 +680,11 @@ gopts.var('superpages', val='0|1',
           fn=set_int, default=0,
           use="Create domain with superpages")

+gopts.var('pci_hole', val='0x<XXX>|0',
+          fn=set_int, default=0,
+          use="""Create domain with a PCI hole. The value is the PFN of the
+          start of PCI hole. Usually that is 0xc0000.""")
+
 def err(msg):
     """Print an error to stderr and exit.
     """
@@ -770,6 +775,9 @@ def configure_image(vals):
         config_image.append(['args', vals.extra])
     if vals.superpages:
         config_image.append(['superpages', vals.superpages])
+    if vals.pci_hole:
+        config_image.append(['pci_hole', vals.pci_hole])
+
     if vals.builder == 'hvm':
         configure_hvm(config_image, vals)
diff --git a/tools/python/xen/xm/xenapi_create.py b/tools/python/xen/xm/xenapi_create.py
--- a/tools/python/xen/xm/xenapi_create.py
+++ b/tools/python/xen/xm/xenapi_create.py
@@ -285,6 +285,8 @@ class xenapi_create:
                 vm.attributes["s3_integrity"].value,
             "superpages":
                 vm.attributes["superpages"].value,
+            "pci_hole":
+                vm.attributes["pci_hole"].value,
             "memory_static_max":
                 get_child_node_attribute(vm, "memory", "static_max"),
             "memory_static_min":
@@ -697,6 +699,8 @@ class sxp2xml:
             = str(get_child_by_name(config, "s3_integrity", 0))
         vm.attributes["superpages"] \
             = str(get_child_by_name(config, "superpages", 0))
+        vm.attributes["pci_hole"] \
+            = str(get_child_by_name(config, "pci_hole", 0))
         vm.attributes["pool_name"] \
             = str(get_child_by_name(config, "pool_name", "Pool-0"))

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
Konrad Rzeszutek Wilk
2010-Nov-12 23:16 UTC
Re: [Xen-devel] [RFC Patch] Support for making an E820 PCI hole in toolstack (xl + xm)
> It adds a new option to guest config file, which is "pci_hole". The user can
> specify the PFN number, such as '0xc0000' or in case of using the xl, '1' which
> will automatically figure out the start of the PCI address.

Here is an example:

kernel="/home/konrad/git/xtt/bootstrap/dist/common/vmlinuz"
ramdisk="/home/konrad/git/xtt/bootstrap/dist/common/initramfs.cpio.gz"
extra="inittab=/etc/inittab-xen console=hvc0 debug earlyprintk=xen memblock=debug iommu=soft"
memory=6048
vcpus=4
on_crash="preserve"
vif = [ 'mac=00:0f:4b:00:00:68, bridge=switch' ]
pci= ["0000:03:00.0","00:1d.0","00:1d.1","00:1d.2","00:1d.7"]
vfb = [ 'vnc=1, vnclisten=0.0.0.0,vncunused=1' ]
pci_hole=0xa0000

Also I found out that 'xl' does not understand multiple options in the pci line, so testing with that might give you some weird results. However, "xm" works fine.
Keir Fraser
2010-Nov-13 07:40 UTC
Re: [Xen-devel] [RFC Patch] Support for making an E820 PCI hole in toolstack (xl + xm)
Why doesn't the guest punch its own hole, by relocating RAM above 4GB? That's what all HVM guests do (in hvmloader).

 -- Keir

On 12/11/2010 23:08, "Konrad Rzeszutek Wilk" <konrad.wilk@oracle.com> wrote:

> Hey guys,
>
> Attached is an RFC patch for making a PCI hole in the PV guests. This allows
> PV guests(*) with 4GB or more to now properly work with or without
> PCI passthrough cards.
>
> Previously the Linux kernel would not be able to allocate the PCI region
> underneath the 4GB region as that region was all System RAM. And you would see
> this:
>
> [    0.000000] PM: Registered nosave memory: 00000000000a0000 - 0000000000100000
> [    0.000000] PCI: Warning: Cannot find a gap in the 32bit address range
> [    0.000000] PCI: Unassigned devices with 32bit resource registers may break!
> [    0.000000] Allocating PCI resources starting at 100100000 (gap: 100100000:400000)
>
> This patchset punches a PCI hole in the E820 region and as well fills the P2M
> properly, so that now you can see (*):
> [    0.000000] Allocating PCI resources starting at a0000000 (gap: a0000000:60000000)
>
> It adds a new option to guest config file, which is "pci_hole". The user can
> specify the PFN number, such as '0xc0000' or in case of using the xl, '1' which
> will automatically figure out the start of the PCI address.
>
> *: This option requires support in the Linux kernel to actually deal with two
> entries in the E820 map and P2M space filled with ~0.
dom->pci_hole = pci_hole_start; > + > if ( xc_dom_linux_build(self->xc_handle, dom, domid, mem_mb, image, > ramdisk, flags, store_evtchn, &store_mfn, > console_evtchn, &console_mfn) != 0 ) { > @@ -1659,11 +1664,13 @@ static PyObject *pyxc_domain_set_memmap_ > { > uint32_t dom; > unsigned int maplimit_kb; > - > - if ( !PyArg_ParseTuple(args, "ii", &dom, &maplimit_kb) ) > + unsigned long pci_hole_start = 0; > + > + if ( !PyArg_ParseTuple(args, "ii|i", &dom, &maplimit_kb, &pci_hole_start) > ) > return NULL; > > - if ( xc_domain_set_memmap_limit(self->xc_handle, dom, maplimit_kb) != 0 ) > + if ( xc_domain_set_memmap_limit(self->xc_handle, dom, maplimit_kb, > + pci_hole_start) != 0 ) > return pyxc_error_to_exception(self->xc_handle); > > Py_INCREF(zero); > @@ -2661,6 +2668,7 @@ static PyMethodDef pyxc_methods[] = { > "Set a domain''s physical memory mappping limit\n" > " dom [int]: Identifier of domain.\n" > " map_limitkb [int]: .\n" > + " pci_hole_start [int]: PFN for start of PCI hole (optional).\n" > "Returns: [int] 0 on success; -1 on error.\n" }, > > #ifdef __ia64__ > diff --git a/tools/python/xen/xend/XendConfig.py > b/tools/python/xen/xend/XendConfig.py > --- a/tools/python/xen/xend/XendConfig.py > +++ b/tools/python/xen/xend/XendConfig.py > @@ -241,6 +241,7 @@ XENAPI_CFG_TYPES = { > ''suppress_spurious_page_faults'': bool0, > ''s3_integrity'' : int, > ''superpages'' : int, > + ''pci_hole'' : int, > ''memory_sharing'': int, > ''pool_name'' : str, > ''Description'': str, > @@ -422,6 +423,7 @@ class XendConfig(dict): > ''target'': 0, > ''pool_name'' : ''Pool-0'', > ''superpages'': 0, > + ''pci_hole'': 0, > ''description'': '''', > } > > @@ -2135,6 +2137,9 @@ class XendConfig(dict): > image.append([''args'', self[''PV_args'']]) > if self.has_key(''superpages''): > image.append([''superpages'', self[''superpages'']]) > + if self.has_key(''pci_hole''): > + image.append([''pci_hole'', self[''pci_hole'']]) > + > > for key in XENAPI_PLATFORM_CFG_TYPES.keys(): > if key in 
self[''platform'']: > @@ -2179,6 +2184,10 @@ class XendConfig(dict): > val = sxp.child_value(image_sxp, ''superpages'') > if val is not None: > self[''superpages''] = val > + > + val = sxp.child_value(image_sxp, ''pci_hole'') > + if val is not None: > + self[''pci_hole''] = val > > val = sxp.child_value(image_sxp, ''memory_sharing'') > if val is not None: > diff --git a/tools/python/xen/xend/image.py b/tools/python/xen/xend/image.py > --- a/tools/python/xen/xend/image.py > +++ b/tools/python/xen/xend/image.py > @@ -84,6 +84,7 @@ class ImageHandler: > > ostype = None > superpages = 0 > + pci_hole = 0 > memory_sharing = 0 > > def __init__(self, vm, vmConfig): > @@ -711,6 +712,7 @@ class LinuxImageHandler(ImageHandler): > self.vramsize = int(vmConfig[''platform''].get(''videoram'',4)) * 1024 > self.is_stubdom = (self.kernel.find(''stubdom'') >= 0) > self.superpages = int(vmConfig[''superpages'']) > + self.pci_hole = int(vmConfig[''pci_hole'']) > > def buildDomain(self): > store_evtchn = self.vm.getStorePort() > @@ -729,6 +731,7 @@ class LinuxImageHandler(ImageHandler): > log.debug("features = %s", self.vm.getFeatures()) > log.debug("flags = %d", self.flags) > log.debug("superpages = %d", self.superpages) > + log.debug("pci_hole = %d", self.pci_hole) > if arch.type == "ia64": > log.debug("vhpt = %d", self.vhpt) > > @@ -742,7 +745,8 @@ class LinuxImageHandler(ImageHandler): > features = self.vm.getFeatures(), > flags = self.flags, > vhpt = self.vhpt, > - superpages = self.superpages) > + superpages = self.superpages, > + pci_hole = self.pci_hole) > > def getBitSize(self): > return xc.getBitSize(image = self.kernel, > @@ -774,7 +778,6 @@ class LinuxImageHandler(ImageHandler): > args = args + ([ "-M", "xenpv"]) > return args > > - > class HVMImageHandler(ImageHandler): > > ostype = "hvm" > @@ -1065,7 +1068,7 @@ class X86_Linux_ImageHandler(LinuxImageH > # set physical mapping limit > # add an 8MB slack to balance backend allocations. 
>         mem_kb = self.getRequiredMaximumReservation() + (8 * 1024)
> -       xc.domain_set_memmap_limit(self.vm.getDomid(), mem_kb)
> +       xc.domain_set_memmap_limit(self.vm.getDomid(), mem_kb, self.pci_hole)
>         rc = LinuxImageHandler.buildDomain(self)
>         self.setCpuid()
>         return rc
> diff --git a/tools/python/xen/xm/create.py b/tools/python/xen/xm/create.py
> --- a/tools/python/xen/xm/create.py
> +++ b/tools/python/xen/xm/create.py
> @@ -680,6 +680,11 @@ gopts.var('superpages', val='0|1',
>           fn=set_int, default=0,
>           use="Create domain with superpages")
>
> +gopts.var('pci_hole', val='0x<XXX>|0',
> +         fn=set_int, default=0,
> +         use="""Create domain with a PCI hole. The value is the PFN of the
> +         start of the PCI hole. Usually that is 0xc0000.""")
> +
> def err(msg):
>     """Print an error to stderr and exit.
>     """
> @@ -770,6 +775,9 @@ def configure_image(vals):
>         config_image.append(['args', vals.extra])
>     if vals.superpages:
>         config_image.append(['superpages', vals.superpages])
> +   if vals.pci_hole:
> +       config_image.append(['pci_hole', vals.pci_hole])
> +
>
>     if vals.builder == 'hvm':
>         configure_hvm(config_image, vals)
> diff --git a/tools/python/xen/xm/xenapi_create.py b/tools/python/xen/xm/xenapi_create.py
> --- a/tools/python/xen/xm/xenapi_create.py
> +++ b/tools/python/xen/xm/xenapi_create.py
> @@ -285,6 +285,8 @@ class xenapi_create:
>                 vm.attributes["s3_integrity"].value,
>             "superpages":
>                 vm.attributes["superpages"].value,
> +           "pci_hole":
> +               vm.attributes["pci_hole"].value,
>             "memory_static_max":
>                 get_child_node_attribute(vm, "memory", "static_max"),
>             "memory_static_min":
> @@ -697,6 +699,8 @@ class sxp2xml:
>             = str(get_child_by_name(config, "s3_integrity", 0))
>         vm.attributes["superpages"] \
>             = str(get_child_by_name(config, "superpages", 0))
> +       vm.attributes["pci_hole"] \
> +           = str(get_child_by_name(config, "pci_hole", 0))
>         vm.attributes["pool_name"] \
>             = str(get_child_by_name(config, "pool_name", "Pool-0"))
>
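As a sanity check on the xc_domain_set_memmap_limit hunk above, the two-entry E820 split it performs can be sketched in Python. This is a hypothetical illustration of the arithmetic, not part of the patch; `split_e820` is an invented name:

```python
E820_RAM = 1

def split_e820(map_limitkb, pci_hole_start_pfn):
    """Mirror the libxc hunk: one RAM entry below the hole start,
    and any remainder pushed up to the 4GB boundary (pfn 0x100000)."""
    # pfn -> kB is << 2 (4 kB pages); pfn -> bytes is << 12, kB -> bytes is << 10.
    delta_kb = map_limitkb - (pci_hole_start_pfn << 2)
    if pci_hole_start_pfn and delta_kb > 0:
        return [
            {'addr': 0, 'size': pci_hole_start_pfn << 12, 'type': E820_RAM},
            {'addr': 0x100000 << 12, 'size': delta_kb << 10, 'type': E820_RAM},
        ]
    return [{'addr': 0, 'size': map_limitkb << 10, 'type': E820_RAM}]
```

For a 4GB guest with the hole starting at pfn 0xa0000 this yields RAM at [0, 0xa0000000) plus [4GB, 4GB + 0x60000000), matching the devel/e820-hole E820 shown at the top of the thread.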
_______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Konrad Rzeszutek Wilk
2010-Nov-15 17:03 UTC
Re: [Xen-devel] [RFC Patch] Support for making an E820 PCI hole in toolstack (xl + xm)
On Sat, Nov 13, 2010 at 07:40:30AM +0000, Keir Fraser wrote:

> Why doesn't the guest punch its own hole, by relocating RAM above 4GB?

1). Did not work for me - I am not sure why, but I had the hardest time with hypervisor_populate_physmap - it would just hang the guest.

2). It is much simpler to parse the E820 in the Linux kernel than to actually create new E820 entries in the kernel (hypercall), make a bunch of hypervisor calls that unmap and then remap the space, fill out the P2M with INVALID_MFN, and do all of that before the "real" Linux kernel actually starts (all of it would have to be done in xen_start_kernel). I have a sinking feeling that the upstream community would not like that much.

> That's what all HVM guests do (in hvmloader).

3). Which is also part of the Xen tool-stack.

Keir, I think you posted it at some point - was there a standard 'indent' incantation for the tools/libxc StyleGuide (which is similar to the xen/* one)? I can't find it in my mail archive...
Ian Campbell
2010-Nov-15 17:20 UTC
Re: [Xen-devel] [RFC Patch] Support for making an E820 PCI hole in toolstack (xl + xm)
On Mon, 2010-11-15 at 17:03 +0000, Konrad Rzeszutek Wilk wrote:

> 2). It is much simpler to parse the E820 in the Linux kernel[...]

The ability to ingest an e820 from the hypervisor is also needed for dom0 to consume the host e820, so it's not like there is additional code on the kernel side to cope with this.

Ian.
Konrad Rzeszutek Wilk
2010-Nov-15 17:28 UTC
Re: [Xen-devel] [RFC Patch] Support for making an E820 PCI hole in toolstack (xl + xm)
On Mon, Nov 15, 2010 at 05:20:00PM +0000, Ian Campbell wrote:

> On Mon, 2010-11-15 at 17:03 +0000, Konrad Rzeszutek Wilk wrote:
> > 2). It is much simpler to parse the E820 in the Linux kernel[...]
>
> The ability to ingest an e820 from the hypervisor is also needed for
> dom0 to consume the host e820, so it's not like there is additional code
> on the kernel side to cope with this.

There is a bit. We need to decouple the info->nr_pages magic that is used throughout the code b/c we are treating that value as a 'last pfn'. With the offset, we need to be more careful about it.

P.S. (devel code, probably going to post a refresh today): git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen.git devel/e820-hole
Keir Fraser
2010-Nov-15 17:48 UTC
Re: [Xen-devel] [RFC Patch] Support for making an E820 PCI hole in toolstack (xl + xm)
On 15/11/2010 17:03, "Konrad Rzeszutek Wilk" <konrad.wilk@oracle.com> wrote:

> On Sat, Nov 13, 2010 at 07:40:30AM +0000, Keir Fraser wrote:
>> Why doesn't the guest punch its own hole, by relocating RAM above 4GB?
>
> 1). Did not work for me - I am not sure why, but I had the hardest time with
> hypervisor_populate_physmap - it would just hang the guest.

For a PV guest you don't need to do any alloc/free/move memory hypercalls. You rewrite your own p2m to relocate mfns where you want them in pfn space. Then some hypercalls just to update the m2p array to match.

> 2). It is much simpler to parse the E820 in the Linux kernel than actually
> creating new E820 entries in the kernel (hypercall), making a bunch of
> hypervisor calls that unmap, then remap the space, filling out the P2M
> with INVALID_MFN, and doing all of that before the "real" Linux kernel
> actually starts (all would have to be done in xen_start_kernel).
> I have a sinking feeling that the upstream community would not like that much.

Well it is all quite Xen specific, so I'm surprised.

>> That's what all HVM guests do (in hvmloader).
>
> 3). Which is also part of the Xen tool-stack.
>
> Keir, I think you posted it at some point - was there a standard 'indent'
> incantation for the tools/libxc StyleGuide (which is similar to the xen/* one)
> - I can't find it in my mail archive...

I think someone else worked one out, but I don't have it to hand I'm afraid.

-- Keir
Konrad Rzeszutek Wilk
2010-Nov-15 18:15 UTC
Re: [Xen-devel] [RFC Patch] Support for making an E820 PCI hole in toolstack (xl + xm)
> > 1). Did not work for me - I am not sure why, but I had the hardest time with
> > hypervisor_populate_physmap - it would just hang the guest.
>
> For a PV guest you don't need to do any alloc/free/move memory hypercalls.
> You rewrite your own p2m to relocate mfns where you want them in pfn space.
> Then some hypercalls just to update the m2p array to match.

Ok, I can play with that and see what fun/havoc I can create.

> > 2). It is much simpler to parse the E820 in the Linux kernel than actually
> > creating new E820 entries in the kernel (hypercall), making a bunch of
> > hypervisor calls that unmap, then remap the space, filling out the P2M
> > with INVALID_MFN, and doing all of that before the "real" Linux kernel
> > actually starts (all would have to be done in xen_start_kernel).
> > I have a sinking feeling that the upstream community would not like that much.
>
> Well it is all quite Xen specific, so I'm surprised.

Oh, there was another reason that is so obvious that I completely forgot. DomU has no idea where the host PCI hole starts. In most cases it is at 3GB (or even further up - 3.5GB), but a quick search for 'Allocating PCI resources starting at' on Google shows that there are some that start at 1.2G.
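Since the guest cannot see the host hole, the toolstack has to discover it, which is what the RFC's libxl_find_pci_hole does by scanning /proc/iomem. A minimal Python sketch of the same scan (hypothetical helper; it takes the file contents as a string, and uses the 16MB threshold the RFC's comment intends):

```python
def find_pci_hole(iomem_text):
    """Return the start PFN of the first top-level 'PCI' window in
    /proc/iomem-style text, or 0 if none is found.  A sketch of the
    RFC's libxl_find_pci_hole; real parsing needs to be more careful."""
    for line in iomem_text.splitlines():
        if line.startswith(' '):          # skip indented (nested) resources
            continue
        rng, _, desc = line.partition(':')
        if 'PCI' not in desc:
            continue
        start = int(rng.split('-')[0], 16)
        if start <= 0x1000000:            # ignore windows below 16MB
            continue
        return start >> 12                # physical address -> PFN
    return 0
```

On a host whose top-level entry is `c0000000-febfffff : PCI Bus 0000:00`, this returns 0xc0000, which is then handed to the domain builder as pci_hole_start.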
Keir Fraser
2010-Nov-15 18:41 UTC
Re: [Xen-devel] [RFC Patch] Support for making an E820 PCI hole in toolstack (xl + xm)
On 15/11/2010 18:15, "Konrad Rzeszutek Wilk" <konrad.wilk@oracle.com> wrote:

>> Well it is all quite Xen specific, so I'm surprised.
>
> Oh, there was another reason that is so obvious that I completely forgot. DomU
> has no idea where the host PCI hole starts. In most cases it is at 3GB (or even
> further up - 3.5GB), but a quick search for 'Allocating PCI resources starting at'
> on Google shows that there are some that start at 1.2G.

Hm, true. We could give you access to XENMEM_machine_memory_map? It's not really got any big secrets or privileged things in it. :-)

Or, is there much disadvantage to having a static, really big PCI hole? Say starting at 1GB? The advantage of this would be the ability to hotplug PCI devices to a domU even across save/restore/migrate -- this may not work so well if you commit yourself to the hole size of the original host, and the restore/migrate target host has a bigger hole!

-- Keir
Jeremy Fitzhardinge
2010-Nov-15 19:30 UTC
Re: [Xen-devel] [RFC Patch] Support for making an E820 PCI hole in toolstack (xl + xm)
On 11/15/2010 10:15 AM, Konrad Rzeszutek Wilk wrote:

>>> 1). Did not work for me - I am not sure why, but I had the hardest time with
>>> hypervisor_populate_physmap - it would just hang the guest.
>>
>> For a PV guest you don't need to do any alloc/free/move memory hypercalls.
>> You rewrite your own p2m to relocate mfns where you want them in pfn space.
>> Then some hypercalls just to update the m2p array to match.
>
> Ok, I can play with that and see what fun/havoc I can create.
>
>>> 2). It is much simpler to parse the E820 in the Linux kernel than actually
>>> creating new E820 entries in the kernel (hypercall), making a bunch of
>>> hypervisor calls that unmap, then remap the space, filling out the P2M
>>> with INVALID_MFN, and doing all of that before the "real" Linux kernel
>>> actually starts (all would have to be done in xen_start_kernel).
>>> I have a sinking feeling that the upstream community would not like that much.
>>
>> Well it is all quite Xen specific, so I'm surprised.
>
> Oh, there was another reason that is so obvious that I completely forgot. DomU
> has no idea where the host PCI hole starts. In most cases it is at 3GB (or even
> further up - 3.5GB), but a quick search for 'Allocating PCI resources starting at'
> on Google shows that there are some that start at 1.2G.

Yes, that's the main reason I think it should be in the toolstack. The domain doesn't know whether the PCI hole is necessary or not, and it's too early for it to poke around in xenstore to look for passed devices or anything. It could conservatively always reserve a 1G hole at 3G, but that seems too pessimistic and is a waste of sub-4G address space which it might otherwise have use for. If the toolstack can tell it where to make the hole by editing the E820, then the dom0 and domU cases are very similar.
The question of whether all the pages given by the builder to the domain should be distributed over the E820 RAM ranges, or should just be considered ballooned out of the holes (which is what currently happens), is pretty much orthogonal.

J
Jeremy Fitzhardinge
2010-Nov-15 19:32 UTC
Re: [Xen-devel] [RFC Patch] Support for making an E820 PCI hole in toolstack (xl + xm)
On 11/15/2010 10:41 AM, Keir Fraser wrote:

>> Oh, there was another reason that is so obvious that I completely forgot. DomU
>> has no idea where the host PCI hole starts. In most cases it is at 3GB (or even
>> further up - 3.5GB), but a quick search for 'Allocating PCI resources starting at'
>> on Google shows that there are some that start at 1.2G.
>
> Hm, true. We could give you access to XENMEM_machine_memory_map? It's not
> really got any big secrets or privileged things in it. :-)
>
> Or, is there much disadvantage to having a static, really big PCI hole? Say
> starting at 1GB? The advantage of this would be the ability to hotplug PCI
> devices to a domU even across save/restore/migrate -- this may not work so
> well if you commit yourself to the hole size of the original host, and the
> restore/migrate target host has a bigger hole!

Well, the other question is whether the devices have to have the same pfn as mfn within the hole. We're emulating the PCI config space anyway - couldn't we stick the passthrough PCI space at 3G regardless of where it is on the real hardware?

J
Keir Fraser
2010-Nov-15 19:57 UTC
Re: [Xen-devel] [RFC Patch] Support for making an E820 PCI hole in toolstack (xl + xm)
On 15/11/2010 19:32, "Jeremy Fitzhardinge" <jeremy@goop.org> wrote:

>> Or, is there much disadvantage to having a static, really big PCI hole? Say
>> starting at 1GB? The advantage of this would be the ability to hotplug PCI
>> devices to a domU even across save/restore/migrate -- this may not work so
>> well if you commit yourself to the hole size of the original host, and the
>> restore/migrate target host has a bigger hole!
>
> Well, the other question is whether the devices have to have the same
> pfn as mfn within the hole. We're emulating the PCI config space anyway
> - couldn't we stick the passthrough PCI space at 3G regardless of where
> it is on the real hardware?

Well, I don't know. It sounds pretty sensible to me. :-)

Certain virtualisation features disappearing after a save/restore/migrate -- or worse, becoming unreliable -- would be a bit sad.

-- Keir
Konrad Rzeszutek Wilk
2010-Nov-15 23:11 UTC
Re: [Xen-devel] [RFC Patch] Support for making an E820 PCI hole in toolstack (xl + xm)
On Mon, Nov 15, 2010 at 07:57:47PM +0000, Keir Fraser wrote:

> On 15/11/2010 19:32, "Jeremy Fitzhardinge" <jeremy@goop.org> wrote:
>
>>> Or, is there much disadvantage to having a static, really big PCI hole? Say
>>> starting at 1GB? The advantage of this would be the ability to hotplug PCI
>>> devices to a domU even across save/restore/migrate -- this may not work so
>>> well if you commit yourself to the hole size of the original host, and the
>>> restore/migrate target host has a bigger hole!
>>
>> Well, the other question is whether the devices have to have the same
>> pfn as mfn within the hole. We're emulating the PCI config space anyway
>> - couldn't we stick the passthrough PCI space at 3G regardless of where
>> it is on the real hardware?

Your thinking is that on the Linux side, any of the pfns that are within those System RAM gaps (regardless of whether they are above or below 4GB) would be consulted during PTE creation/lookup (xen_pte_val..). And if those PFNs are within those System RAM gaps, we would store the MFN in the P2M list and, instead of doing:

    val = ((pteval_t)pfn << PAGE_SHIFT) | flags

we would actually do mfn = pfn_to_mfn(pfn) and stick on the _PAGE_IOMAP flag. An example patch (compile tested, not tested any other way) is attached at the end of this email.

How does that work on the Xen side? Does the hypervisor depend on the pages that belong to the DOM_IO domain having an INVALID_MFN value in the mfn_list? We do make the PTEs that refer to physical devices belong to the DOM_IO domain..

> Well, I don't know. It sounds pretty sensible to me. :-)
>
> Certain virtualisation features disappearing after a save/restore/migrate --
> or worse, becoming unreliable -- would be a bit sad.

So having the option of the PCI hole being passed through, and giving the tools the value (pci_hole), would mean we could migrate an SR-IOV type device from one machine to another.
Constructing the PCI hole using the XENMEM_machine_memory_map could generate different E820 for the two guests, which would be indeed a bit sad. --- the patch --- diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c index 50dc626..96a08ef 100644 --- a/arch/x86/xen/mmu.c +++ b/arch/x86/xen/mmu.c @@ -699,7 +699,7 @@ static bool xen_page_pinned(void *ptr) static bool xen_iomap_pte(pte_t pte) { - return pte_flags(pte) & _PAGE_IOMAP; + return xen_pfn_is_pci(pte_mfn(pte)); } void xen_set_domain_pte(pte_t *ptep, pte_t pteval, unsigned domid) @@ -801,11 +801,6 @@ void set_pte_mfn(unsigned long vaddr, unsigned long mfn, pgprot_t flags) void xen_set_pte_at(struct mm_struct *mm, unsigned long addr, pte_t *ptep, pte_t pteval) { - if (xen_iomap_pte(pteval)) { - xen_set_iomap_pte(ptep, pteval); - goto out; - } - ADD_STATS(set_pte_at, 1); // ADD_STATS(set_pte_at_pinned, xen_page_pinned(ptep)); ADD_STATS(set_pte_at_current, mm == current->mm); @@ -889,19 +884,6 @@ static pteval_t pte_pfn_to_mfn(pteval_t val) return val; } -static pteval_t iomap_pte(pteval_t val) -{ - if (val & _PAGE_PRESENT) { - unsigned long pfn = (val & PTE_PFN_MASK) >> PAGE_SHIFT; - pteval_t flags = val & PTE_FLAGS_MASK; - - /* We assume the pte frame number is a MFN, so - just use it as-is. */ - val = ((pteval_t)pfn << PAGE_SHIFT) | flags; - } - - return val; -} pteval_t xen_pte_val(pte_t pte) { @@ -913,8 +895,8 @@ pteval_t xen_pte_val(pte_t pte) pteval = (pteval & ~_PAGE_PAT) | _PAGE_PWT; } - if (xen_initial_domain() && (pteval & _PAGE_IOMAP)) - return pteval; + if (xen_pfn_is_pci(pte_mfn(pte))) + pteval |= _PAGE_IOMAP; return pte_mfn_to_pfn(pteval); } @@ -974,13 +956,14 @@ pte_t xen_make_pte(pteval_t pte) * mappings are just dummy local mappings to keep other * parts of the kernel happy. 
*/ - if (unlikely(pte & _PAGE_IOMAP) && - (xen_initial_domain() || addr >= ISA_END_ADDRESS)) { - pte = iomap_pte(pte); - } else { + if ((unlikely(pte & _PAGE_IOMAP) && + (xen_initial_domain() || addr >= ISA_END_ADDRESS)) || + (unlikely(xen_pfn_is_pci(PFN_UP(addr))))) + pte |= _PAGE_IOMAP; + else pte &= ~_PAGE_IOMAP; - pte = pte_pfn_to_mfn(pte); - } + + pte = pte_pfn_to_mfn(pte); return native_make_pte(pte); } @@ -1037,10 +1020,8 @@ void xen_set_pud(pud_t *ptr, pud_t val) void xen_set_pte(pte_t *ptep, pte_t pte) { - if (xen_iomap_pte(pte)) { + if (xen_iomap_pte(pte)) xen_set_iomap_pte(ptep, pte); - return; - } ADD_STATS(pte_update, 1); // ADD_STATS(pte_update_pinned, xen_page_pinned(ptep)); @@ -1058,10 +1039,8 @@ void xen_set_pte(pte_t *ptep, pte_t pte) #ifdef CONFIG_X86_PAE void xen_set_pte_atomic(pte_t *ptep, pte_t pte) { - if (xen_iomap_pte(pte)) { + if (xen_iomap_pte(pte)) xen_set_iomap_pte(ptep, pte); - return; - } set_64bit((u64 *)ptep, native_pte_val(pte)); } diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c index 5a1f22d..bb424e3 100644 --- a/arch/x86/xen/setup.c +++ b/arch/x86/xen/setup.c @@ -196,6 +196,31 @@ unsigned long xen_find_max_pfn(void) xen_raw_printk("E820 max_pfn = %ld (nr_pages: %ld)\n", max_pfn, xen_start_info->nr_pages); return max_pfn; } + +int xen_pfn_is_pci(unsigned long pfn) +{ + static struct e820entry map[E820MAX] __initdata; + int rc, op, i; + struct xen_memory_map memmap; + unsigned long long addr = PFN_PHYS(pfn); + memmap.nr_entries = E820MAX; + set_xen_guest_handle(memmap.buffer, map); + + op = xen_initial_domain() ? + XENMEM_machine_memory_map : + XENMEM_memory_map; + rc = HYPERVISOR_memory_op(op, &memmap); + BUG_ON(rc); + + for (i = 0; i < memmap.nr_entries; i++) { + unsigned long long end = map[i].addr + map[i].size; + if (map[i].type != E820_RAM) + continue; + if (addr >= map[i].addr && addr <= end) + return 0; + } + return 1; +} /** * machine_specific_memory_setup - Hook for machine specific memory setup. 
**/ diff --git a/include/xen/xen-ops.h b/include/xen/xen-ops.h index eee2045..f859b04 100644 --- a/include/xen/xen-ops.h +++ b/include/xen/xen-ops.h @@ -31,4 +31,6 @@ int xen_remap_domain_mfn_range(struct vm_area_struct *vma, extern phys_addr_t xen_extra_mem_start; unsigned long xen_find_max_pfn(void); +int xen_pfn_is_pci(unsigned long pfn); + #endif /* INCLUDE_XEN_OPS_H */
Jeremy Fitzhardinge
2010-Nov-16 01:06 UTC
Re: [Xen-devel] [RFC Patch] Support for making an E820 PCI hole in toolstack (xl + xm)
On 11/15/2010 03:11 PM, Konrad Rzeszutek Wilk wrote:

> On Mon, Nov 15, 2010 at 07:57:47PM +0000, Keir Fraser wrote:
>> On 15/11/2010 19:32, "Jeremy Fitzhardinge" <jeremy@goop.org> wrote:
>>
>>>> Or, is there much disadvantage to having a static, really big PCI hole? Say
>>>> starting at 1GB? The advantage of this would be the ability to hotplug PCI
>>>> devices to a domU even across save/restore/migrate -- this may not work so
>>>> well if you commit yourself to the hole size of the original host, and the
>>>> restore/migrate target host has a bigger hole!
>>>
>>> Well, the other question is whether the devices have to have the same
>>> pfn as mfn within the hole. We're emulating the PCI config space anyway
>>> - couldn't we stick the passthrough PCI space at 3G regardless of where
>>> it is on the real hardware?
>
> Your thinking is that on the Linux side, any of the pfns that are within
> those System RAM gaps (regardless of whether they are above or below 4GB) would
> be consulted during PTE creation/lookup (xen_pte_val..).
>
> And if those PFNs are within those System RAM gaps, we would store the
> MFN in the P2M list and instead of doing:
>     val = ((pteval_t)pfn << PAGE_SHIFT) | flags
>
> we would actually do mfn = pfn_to_mfn(pfn) and stick on the _PAGE_IOMAP flag.
>
> An example patch (compile tested, not tested any other way) is attached at the
> end of this email.

Right, it basically depends on dropping _PAGE_IOMAP and populating the p2m with the correct mapping for both memory and hardware pages.

> How does that work on the Xen side? Does the hypervisor depend on the pages
> that belong to the DOM_IO domain having an INVALID_MFN value in the mfn_list?

Xen wouldn't care.
I don''t think its necessary to explicitly do a cross-domain mapping with DOM_IO as we currently do; that''s overkill and/or a misunderstanding on my part.> We do make the PTE that refer to physical devices to be the DOM_IO domain..I think Xen will sort that out for itself when presented with a hardware/device mfn.>> Well, I don''t know. It sounds pretty sensible to me. :-) >> >> Certain virtualisation feature sdisappearing after a save/restore/migrate -- >> or worsse, becoming unreliable -- would be a bit sad. > So having the option of the PCI hole being passed through, and giving > the tools the value (pci_hole) would mean we could migrate an SR-IOV type > device from one machine to another. Constructing the PCI hole using the > XENMEM_machine_memory_map could generate different E820 for the two guests, which > would be indeed a bit sad. > > > --- the patch --- > > diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c > index 50dc626..96a08ef 100644 > --- a/arch/x86/xen/mmu.c > +++ b/arch/x86/xen/mmu.c > @@ -699,7 +699,7 @@ static bool xen_page_pinned(void *ptr) > > static bool xen_iomap_pte(pte_t pte) > { > - return pte_flags(pte) & _PAGE_IOMAP; > + return xen_pfn_is_pci(pte_mfn(pte)); > }I think populating the p2m appropriately in advance is better than this test; this is OK for prototyping I guess, but way to expensive for every set_pte. 
J> > void xen_set_domain_pte(pte_t *ptep, pte_t pteval, unsigned domid) > @@ -801,11 +801,6 @@ void set_pte_mfn(unsigned long vaddr, unsigned long mfn, pgprot_t flags) > void xen_set_pte_at(struct mm_struct *mm, unsigned long addr, > pte_t *ptep, pte_t pteval) > { > - if (xen_iomap_pte(pteval)) { > - xen_set_iomap_pte(ptep, pteval); > - goto out; > - } > - > ADD_STATS(set_pte_at, 1); > // ADD_STATS(set_pte_at_pinned, xen_page_pinned(ptep)); > ADD_STATS(set_pte_at_current, mm == current->mm); > @@ -889,19 +884,6 @@ static pteval_t pte_pfn_to_mfn(pteval_t val) > return val; > } > > -static pteval_t iomap_pte(pteval_t val) > -{ > - if (val & _PAGE_PRESENT) { > - unsigned long pfn = (val & PTE_PFN_MASK) >> PAGE_SHIFT; > - pteval_t flags = val & PTE_FLAGS_MASK; > - > - /* We assume the pte frame number is a MFN, so > - just use it as-is. */ > - val = ((pteval_t)pfn << PAGE_SHIFT) | flags; > - } > - > - return val; > -} > > pteval_t xen_pte_val(pte_t pte) > { > @@ -913,8 +895,8 @@ pteval_t xen_pte_val(pte_t pte) > pteval = (pteval & ~_PAGE_PAT) | _PAGE_PWT; > } > > - if (xen_initial_domain() && (pteval & _PAGE_IOMAP)) > - return pteval; > + if (xen_pfn_is_pci(pte_mfn(pte))) > + pteval |= _PAGE_IOMAP; > > return pte_mfn_to_pfn(pteval); > } > @@ -974,13 +956,14 @@ pte_t xen_make_pte(pteval_t pte) > * mappings are just dummy local mappings to keep other > * parts of the kernel happy. 
>  	 */
> -	if (unlikely(pte & _PAGE_IOMAP) &&
> -	    (xen_initial_domain() || addr >= ISA_END_ADDRESS)) {
> -		pte = iomap_pte(pte);
> -	} else {
> +	if ((unlikely(pte & _PAGE_IOMAP) &&
> +	     (xen_initial_domain() || addr >= ISA_END_ADDRESS)) ||
> +	    (unlikely(xen_pfn_is_pci(PFN_UP(addr)))))
> +		pte |= _PAGE_IOMAP;
> +	else
>  		pte &= ~_PAGE_IOMAP;
> -		pte = pte_pfn_to_mfn(pte);
> -	}
> +
> +	pte = pte_pfn_to_mfn(pte);
>
>  	return native_make_pte(pte);
>  }
> @@ -1037,10 +1020,8 @@ void xen_set_pud(pud_t *ptr, pud_t val)
>
>  void xen_set_pte(pte_t *ptep, pte_t pte)
>  {
> -	if (xen_iomap_pte(pte)) {
> +	if (xen_iomap_pte(pte))
>  		xen_set_iomap_pte(ptep, pte);
> -		return;
> -	}
>
>  	ADD_STATS(pte_update, 1);
>  //	ADD_STATS(pte_update_pinned, xen_page_pinned(ptep));
> @@ -1058,10 +1039,8 @@ void xen_set_pte(pte_t *ptep, pte_t pte)
>  #ifdef CONFIG_X86_PAE
>  void xen_set_pte_atomic(pte_t *ptep, pte_t pte)
>  {
> -	if (xen_iomap_pte(pte)) {
> +	if (xen_iomap_pte(pte))
>  		xen_set_iomap_pte(ptep, pte);
> -		return;
> -	}
>
>  	set_64bit((u64 *)ptep, native_pte_val(pte));
>  }
> diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c
> index 5a1f22d..bb424e3 100644
> --- a/arch/x86/xen/setup.c
> +++ b/arch/x86/xen/setup.c
> @@ -196,6 +196,31 @@ unsigned long xen_find_max_pfn(void)
>  	xen_raw_printk("E820 max_pfn = %ld (nr_pages: %ld)\n", max_pfn, xen_start_info->nr_pages);
>  	return max_pfn;
>  }
> +
> +int xen_pfn_is_pci(unsigned long pfn)
> +{
> +	static struct e820entry map[E820MAX] __initdata;
> +	int rc, op, i;
> +	struct xen_memory_map memmap;
> +	unsigned long long addr = PFN_PHYS(pfn);
> +	memmap.nr_entries = E820MAX;
> +	set_xen_guest_handle(memmap.buffer, map);
> +
> +	op = xen_initial_domain() ?
> +		XENMEM_machine_memory_map :
> +		XENMEM_memory_map;
> +	rc = HYPERVISOR_memory_op(op, &memmap);
> +	BUG_ON(rc);
> +
> +	for (i = 0; i < memmap.nr_entries; i++) {
> +		unsigned long long end = map[i].addr + map[i].size;
> +		if (map[i].type != E820_RAM)
> +			continue;
> +		if (addr >= map[i].addr && addr <= end)
> +			return 0;
> +	}
> +	return 1;
> +}
>  /**
>   * machine_specific_memory_setup - Hook for machine specific memory setup.
>   **/
> diff --git a/include/xen/xen-ops.h b/include/xen/xen-ops.h
> index eee2045..f859b04 100644
> --- a/include/xen/xen-ops.h
> +++ b/include/xen/xen-ops.h
> @@ -31,4 +31,6 @@ int xen_remap_domain_mfn_range(struct vm_area_struct *vma,
>  extern phys_addr_t xen_extra_mem_start;
>  unsigned long xen_find_max_pfn(void);
>
> +int xen_pfn_is_pci(unsigned long pfn);
> +
>  #endif /* INCLUDE_XEN_OPS_H */

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
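[Editorial note: for readers following along outside a kernel tree, the E820 walk in xen_pfn_is_pci() above boils down to a range check against the RAM entries of the map. Below is a minimal userspace sketch of that check; the struct and E820_RAM constant are simplified stand-ins for the kernel definitions, and pfn_is_pci() is a hypothetical helper name, not the patch's actual code.]

```c
#include <assert.h>
#include <stdint.h>

#define E820_RAM 1

/* Simplified stand-in for the kernel's struct e820entry. */
struct e820entry {
	uint64_t addr;	/* start of region */
	uint64_t size;	/* size of region  */
	uint32_t type;	/* E820_RAM, reserved, ... */
};

/*
 * Return 1 if the physical address of @pfn falls outside every RAM
 * region -- i.e. it is a candidate for the PCI hole -- mirroring the
 * logic of xen_pfn_is_pci() in the patch (which refetches the map on
 * every call; a real implementation would cache it).
 *
 * Note: this sketch uses a half-open range check (addr < end); the
 * posted patch's `addr <= end` also matches the first page just past
 * each RAM region.
 */
static int pfn_is_pci(const struct e820entry *map, int nr, uint64_t pfn)
{
	uint64_t addr = pfn << 12;	/* PFN_PHYS() */
	int i;

	for (i = 0; i < nr; i++) {
		uint64_t end = map[i].addr + map[i].size;

		if (map[i].type != E820_RAM)
			continue;
		if (addr >= map[i].addr && addr < end)
			return 0;
	}
	return 1;
}
```

With the E820 quoted earlier in the thread (RAM up to 0xa0000000, then RAM again above 4GB), pfn 0xc0000 (physical 0xC0000000) lands in the hole, while a low pfn lands in RAM.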
Keir Fraser
2010-Nov-16 07:40 UTC
Re: [Xen-devel] [RFC Patch] Support for making an E820 PCI hole in toolstack (xl + xm)
On 15/11/2010 23:11, "Konrad Rzeszutek Wilk" <konrad.wilk@oracle.com> wrote:

>> Well, I don't know. It sounds pretty sensible to me. :-)
>>
>> Certain virtualisation features disappearing after a save/restore/migrate --
>> or worse, becoming unreliable -- would be a bit sad.
>
> So having the option of the PCI hole being passed through, and giving
> the tools the value (pci_hole) would mean we could migrate an SR-IOV type
> device from one machine to another. Constructing the PCI hole using the
> XENMEM_machine_memory_map could generate different E820 for the two guests,
> which would be indeed a bit sad.

SR-IOV is a nice example, but even assuming the user does not have pass-thru
devices attached during migrate, or save/restore, it would still be nice if
they could have the pci-hotplug-passthru facility available to them fully
working both before and after such an event.

 -- Keir
Ian Campbell
2010-Nov-16 09:26 UTC
Re: [Xen-devel] [RFC Patch] Support for making an E820 PCI hole in toolstack (xl + xm)
On Tue, 2010-11-16 at 01:06 +0000, Jeremy Fitzhardinge wrote:

> > How does that work on the Xen side? Does the hypervisor depend on the
> > pages that belong to the DOM_IO domain to have a INVALID_MFN value in
> > the mfn_list?
>
> Xen wouldn't care. I don't think its necessary to explicitly do a
> cross-domain mapping with DOM_IO as we currently do; that's overkill
> and/or a misunderstanding on my part.
>
> > We do make the PTE that refer to physical devices to be the DOM_IO
> > domain..
>
> I think Xen will sort that out for itself when presented with a
> hardware/device mfn.

My main concern would be that the save/restore code will canonicalise all
the MFNs in the page tables back into PFNs and then convert back to MFNs
on the other side, which is likely to go pretty wrong on one end or the
other unless the save/restore code is aware of which MFNs are device
MFNs and which are actual memory. I'm not sure there is any way it can
tell.

** scrobbles around in xc_domain_save.c **

Hrmm... The MFN_IS_IN_PSEUDOPHYS_MAP macro might have some impact on this
issue in some way (depending on what the m2p contains for DOM_IO owned
pages) but I don't think it actually fixes anything. I don't see anything
else which would make this work...

Best case as it stands AFAICT is that MFN_IS_IN_PSEUDOPHYS_MAP causes
device mappings to get zapped, requiring the kernel to reinstate them on
restore. Which isn't so bad I guess.

On an unrelated note I think if we do go down the route of having the
guest kernel punch the holes itself and such we should do so iff
XENMEM_memory_map returns either ENOSYS or nr_entries == 1 to leave open
the possibility of cunning tricks on the tools side in the future.

Ian.
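[Editorial note: the canonicalisation worry can be made concrete with a toy model. Save walks the page tables turning MFNs into PFNs via the M2P; a device MFN belongs to no domain, so the M2P has no meaningful translation for it and the round trip loses the mapping. This is a sketch with invented table contents and helper names, not the actual xc_domain_save logic.]

```c
#include <assert.h>
#include <stdint.h>

#define INVALID_P2M_ENTRY (~0ULL)

/* Toy M2P: machine frame -> pseudo-physical frame for a 4-frame guest.
 * MFNs 0, 1, 2 and 4 are guest RAM; MFN 3 is a device frame, so the
 * M2P holds no sensible translation for it. */
static uint64_t m2p[8] = {
	2, 0, 1, INVALID_P2M_ENTRY, 3,
	INVALID_P2M_ENTRY, INVALID_P2M_ENTRY, INVALID_P2M_ENTRY,
};

/* Canonicalise one PTE's frame for the save image: returns the PFN, or
 * INVALID_P2M_ENTRY when the MFN is not guest memory (roughly the
 * situation the MFN_IS_IN_PSEUDOPHYS_MAP test guards against). The
 * saver can then only zap the mapping and rely on the kernel
 * reinstating it after restore. */
static uint64_t canonicalise(uint64_t mfn, uint64_t max_guest_pfn)
{
	uint64_t pfn = m2p[mfn];

	if (pfn == INVALID_P2M_ENTRY || pfn >= max_guest_pfn)
		return INVALID_P2M_ENTRY;
	return pfn;
}
```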
Keir Fraser
2010-Nov-16 09:52 UTC
Re: [Xen-devel] [RFC Patch] Support for making an E820 PCI hole in toolstack (xl + xm)
On 16/11/2010 09:26, "Ian Campbell" <Ian.Campbell@citrix.com> wrote:

>>> We do make the PTE that refer to physical devices to be the DOM_IO
>>> domain..
>>
>> I think Xen will sort that out for itself when presented with a
>> hardware/device mfn.
>
> My main concern would be that the save/restore code will canonicalise all
> the MFNs in the page tables back into PFNs and then convert back to MFNs
> on the other side, which is likely to go pretty wrong on one end or the
> other unless the save/restore code is aware of which MFNs are device
> MFNs and which are actual memory. I'm not sure there is any way it can
> tell.

The right answer is probably to refuse save/restore/migrate when devices
are passed through. It's somewhere between very hard and very nuts to
attempt that in general. For example, even with SR-IOV, we've only been
talking about it so far for NICs, and then in terms of having a
Solarflare-like acceleration abstraction allowing us to step off of SR-IOV
for at least the duration of the critical bit of the save/restore.

A sensible first goal would simply be to be able to do PCI passthrough
both before and after a s/r/m across reasonably heterogeneous hardware,
but not attempt to maintain such a device passthru *during* the s/r/m.

 -- Keir
Ian Campbell
2010-Nov-16 10:02 UTC
Re: [Xen-devel] [RFC Patch] Support for making an E820 PCI hole in toolstack (xl + xm)
On Tue, 2010-11-16 at 09:52 +0000, Keir Fraser wrote:

> On 16/11/2010 09:26, "Ian Campbell" <Ian.Campbell@citrix.com> wrote:
>
> >>> We do make the PTE that refer to physical devices to be the DOM_IO
> >>> domain..
> >>
> >> I think Xen will sort that out for itself when presented with a
> >> hardware/device mfn.
> >
> > My main concern would be that the save/restore code will canonicalise all
> > the MFNs in the page tables back into PFNs and then convert back to MFNs
> > on the other side, which is likely to go pretty wrong on one end or the
> > other unless the save/restore code is aware of which MFNs are device
> > MFNs and which are actual memory. I'm not sure there is any way it can
> > tell.
>
> The right answer is probably to refuse save/restore/migrate when devices
> are passed through.

Absolutely.

However we are talking about setting up a 1-1 mapping in the P2M region
corresponding to the PCI hole at guest boot and preserving that until
such a time as a device is plugged in, which may be after a migration.

I don't think it matters that no device is passed through at the time of
the migration; in this configuration we still need to arrange for the
relevant P2M entries to be correct after the migration (or at least
before the device gets plugged in; perhaps we can leave holes and only
establish the 1-1 p2m on demand in pcifront?).

So long as this configuration doesn't cause the save/restore code to go
mad it's something we can likely fix up in the guest on restore. My worry
is that the save/restore code will just barf before we get that
opportunity...

Ian.
Keir Fraser
2010-Nov-16 10:11 UTC
Re: [Xen-devel] [RFC Patch] Support for making an E820 PCI hole in toolstack (xl + xm)
On 16/11/2010 10:02, "Ian Campbell" <Ian.Campbell@eu.citrix.com> wrote:

>> The right answer is probably to refuse save/restore/migrate when devices
>> are passed through.
>
> Absolutely.
>
> However we are talking about setting up a 1-1 mapping in the P2M region
> corresponding to the PCI hole at guest boot and preserving that until
> such a time as a device is plugged in, which may be after a migration.
>
> I don't think it matters that no device is passed through at the time of
> the migration; in this configuration we still need to arrange for the
> relevant P2M entries to be correct after the migration (or at least
> before the device gets plugged in; perhaps we can leave holes and only
> establish the 1-1 p2m on demand in pcifront?).

Leaving the hole empty and populating on demand when devices are passed
through would seem sensible.

 -- Keir

> So long as this configuration doesn't cause the save/restore code to go
> mad it's something we can likely fix up in the guest on restore. My worry
> is that the save/restore code will just barf before we get that
> opportunity...
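[Editorial note: the populate-on-demand scheme discussed here amounts to leaving the hole's P2M entries at INVALID_MFN at boot and writing identity entries only when pcifront learns a device's BAR range. A toy sketch follows; the array-backed P2M, the hole bounds and the helper names are all invented for illustration.]

```c
#include <assert.h>
#include <stdint.h>

#define INVALID_MFN (~0ULL)
#define HOLE_START  0xa0000ULL	/* first PFN of the PCI hole (example) */
#define HOLE_END    0x100000ULL	/* first PFN past the hole (4GB) */

static uint64_t p2m[HOLE_END];	/* toy array-backed P2M */

static void p2m_init(void)
{
	uint64_t pfn;

	for (pfn = 0; pfn < HOLE_START; pfn++)
		p2m[pfn] = pfn + 0x1000;	/* arbitrary RAM MFNs */
	for (pfn = HOLE_START; pfn < HOLE_END; pfn++)
		p2m[pfn] = INVALID_MFN;		/* hole left empty at boot */
}

/* Called when a BAR is assigned inside the hole: establish identity
 * 1-1 entries on demand, just for the frames the device actually uses. */
static int p2m_map_bar(uint64_t first_pfn, uint64_t nr)
{
	uint64_t pfn;

	if (first_pfn < HOLE_START || first_pfn + nr > HOLE_END)
		return -1;	/* BAR must fall inside the hole */
	for (pfn = first_pfn; pfn < first_pfn + nr; pfn++)
		p2m[pfn] = pfn;	/* identity: frame is a device MFN */
	return 0;
}
```

Because untouched hole entries stay INVALID_MFN, a save/restore that zaps device mappings only needs the kernel to replay p2m_map_bar() when the device is plugged back in.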
Konrad Rzeszutek Wilk
2010-Nov-16 15:50 UTC
Re: [Xen-devel] [RFC Patch] Support for making an E820 PCI hole in toolstack (xl + xm)
disclaimer: This email got a bit lengthy - so make sure you got a cup of
coffee when you read this.

> On an unrelated note I think if we do go down the route of having the
> guest kernel punch the holes itself and such we should do so iff
> XENMEM_memory_map returns either ENOSYS or nr_entries == 1 to leave open

When would that actually happen? Is that return value returned when the
hypervisor does not implement it (and what version implemented this)?

> the possibility of cunning tricks on the tools side in the future.

<shudders>

I think we have three options in regards to this RFC patch I posted:

1). Continue with this and have the toolstack punch the PCI hole. It
    would fill the PCI hole area with INVALID_MFN. The toolstack
    determines where the PCI hole starts.

2). Do this in the guest, where the guest calls both
    XENMEM_machine_memory_map and XENMEM_memory_map to get an idea of
    the host side PCI hole and set it up. Requires changes in the
    hypervisor to allow a non-privileged PV guest to make the
    XENMEM_machine_memory_map call. The Linux kernel decides where the
    PCI hole starts, and the PCI hole is filled with INVALID_MFN.

3). Make a PCI hole unconditionally, starting at 3GB. PCI hole filled
    with INVALID_MFN.

4). Another one I didn't think of?

For all of those cases, when devices show up we populate the P2M array
with the MFNs on demand.

For the first two proposals the BARs we read off the PCI devices are
going to be written to the P2M array as identity (so mfn_list[0xc0000] ==
0xc0000). Code has not been written.

For the third proposal, we would have non-identity mappings in the P2M
array, as during the migration we could move from a device with BARs of
0xc0000 to 0x20000. So mfn_list[0xc0000] = 0x20000. But for the third
case I am unsure how we would get the "real" MFNs. We initially get the
BARs via 0xcf8 calls and if we don't filter them, it gets to the ioremap
function. Say the host side BAR is at 0x20000, and our PCI hole starts
at 0xc0000.
The ioremap gets called with 0x20000, and in its E820 that region is
'System RAM'.

	last_pfn = last_addr >> PAGE_SHIFT;
	for (pfn = phys_addr >> PAGE_SHIFT; pfn <= last_pfn; pfn++) {
		int is_ram = page_is_ram(pfn);

		if (is_ram && pfn_valid(pfn) && !PageReserved(pfn_to_page(pfn)))
			return NULL;
		WARN_ON_ONCE(is_ram);
	}

Ugh, and it will think (correctly) that it falls within RAM.

If we filter the 0xcf8 calls, which we can do in the Xen PCI backend
case, we can then provide BARs that always start at 0xC0000. But that
does not help the PV guest to know the "real" MFNs which it needs so it
can program the P2M array. So the Xen PCI front would have to do this -
which it could, though it adds complexity to it.

We also need to make sure all of this works with domain zero, and here
1) or 2) can easily be used, as the Xen hypervisor has given us the E820
nicely peppered with holes. (I wonder, what happens if dom0 makes a
XENMEM_memory_map call - does it get anything?)

If we then go with 3), we would need to instrument the code that reads
the BARs so that it can filter them properly. That would be the
low-level Linux pci_conf_read, and that is not going to happen - so we
would have to make the Xen hypervisor be aware of this and, when it
traps the in/out, provide new BAR values starting at 0xC0000. I am not
comfortable maintaining this filter/keep-state code in both the Xen
hypervisor and the Xen PCI front module, so I think 3) would not work
that well, unless there are better ways that I have missed?

Back to 1) and 2). Migration would work if we unplug the PCI devices
before suspend and on resume plug them back in - otherwise the PCI BARs
might have changed between migrations. When the guest gets recreated -
how does it iterate over the E820 to create the P2M list? Or is that
something that is not done, and we just save the P2M list and restore it
as-is on the other side? Naturally, since we would unplug the PCI
device, the entries in the E820 gaps would be INVALID_MFN...
If we consult the E820 during resume, I think doing the PCI hole in the
toolstack has merits - simply b/c the user can set the PCI hole to an
arbitrary address that is low enough (0x2000 say) to cover all of the
machines that he/she would migrate to. While if we do it in the Linux
kernel we do not have that information.

Even if we don't consult the E820, the toolstack still has merits - as
the PCI hole start address might be different between the migration
machines, and we might have started on a box with the PCI hole being way
up (3.9GB) while the other machines might have it at 1.2GB.

The other thing I don't know is how all of this works with 32-bit
kernels?

P.S. I've done the testing of 1) with 64-bit w/ and w/o ballooning and
it worked fine.
Jeremy Fitzhardinge
2010-Nov-16 18:01 UTC
Re: [Xen-devel] [RFC Patch] Support for making an E820 PCI hole in toolstack (xl + xm)
On 11/16/2010 02:11 AM, Keir Fraser wrote:

> On 16/11/2010 10:02, "Ian Campbell" <Ian.Campbell@eu.citrix.com> wrote:
>
> >>> The right answer is probably to refuse save/restore/migrate when
> >>> devices are passed through.
> >>
> >> Absolutely.
> >>
> >> However we are talking about setting up a 1-1 mapping in the P2M region
> >> corresponding to the PCI hole at guest boot and preserving that until
> >> such a time as a device is plugged in, which may be after a migration.
> >>
> >> I don't think it matters that no device is passed through at the time of
> >> the migration, in this configuration we still need arrange for the
> >> relevant P2M entries to be correct after the migration (or at least
> >> before the device gets plugged in, perhaps we can leave holes and only
> >> establish the 1-1 p2m on demand in pcifront?).
>
> Leave the hole empty and populate on demand when devices are passed
> through would seem sensible.

Actually I was originally thinking that the hole would all be INVALID_MFN
but then pfn_to_mfn() would translate that to being an identity
translation. But that's pretty hacky, and only works if you actually want
identity. On-demand population of regions is much cleaner.

    J
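[Editorial note: the hack described (and discarded) above would have pfn_to_mfn() special-case hole PFNs whose P2M entry is INVALID_MFN and hand back the PFN unchanged. A toy illustration, with an invented lookup table; the hypothetical name pfn_to_mfn_hack makes clear this is not the kernel's pfn_to_mfn.]

```c
#include <assert.h>
#include <stdint.h>

#define INVALID_MFN (~0ULL)

/* Toy P2M: PFNs 3 and 4 sit in the PCI hole, so their entries were
 * left at INVALID_MFN at boot. */
static uint64_t p2m[16] = {
	7, 3, 9, INVALID_MFN, INVALID_MFN,
	/* remaining entries zero-initialised */
};

/* The rejected variant: INVALID_MFN inside the hole means "identity".
 * It only works when identity is what you want -- which is exactly why
 * on-demand population of real entries is the cleaner design. */
static uint64_t pfn_to_mfn_hack(uint64_t pfn)
{
	if (p2m[pfn] == INVALID_MFN)
		return pfn;
	return p2m[pfn];
}
```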
Gianni Tedesco
2010-Nov-17 11:14 UTC
[Xen-devel] Re: [RFC Patch] Support for making an E820 PCI hole in toolstack (xl + xm)
On Fri, 2010-11-12 at 23:08 +0000, Konrad Rzeszutek Wilk wrote:

> Hey guys,
>
> Attached is an RFC patch for making a PCI hole in the PV guests. This
> allows PV guests(*) with 4GB or more to now properly work with or
> without PCI passthrough cards.
>
> Previously the Linux kernel would not be able to allocate the PCI region
> underneath the 4GB region as that region was all System RAM. And you
> would see this:
>
> [ 0.000000] PM: Registered nosave memory: 00000000000a0000 - 0000000000100000
> [ 0.000000] PCI: Warning: Cannot find a gap in the 32bit address range
> [ 0.000000] PCI: Unassigned devices with 32bit resource registers may break!
> [ 0.000000] Allocating PCI resources starting at 100100000 (gap: 100100000:400000)
>
> This patchset punches a PCI hole in the E820 region and as well fills
> the P2M properly, so that now you can see (*):
>
> [ 0.000000] Allocating PCI resources starting at a0000000 (gap: a0000000:60000000)
>
> It adds a new option to the guest config file, which is "pci_hole". The
> user can specify the PFN number, such as '0xc0000' or in case of using
> xl, '1' which will automatically figure out the start of the PCI
> address.
>
> *: This option requires support in the Linux kernel to actually deal
> with two entries in the E820 map and P2M space filled with ~0.

Doesn't this problem affect dom0 also? How is it to be fixed there?

Gianni
Ian Campbell
2010-Nov-17 11:43 UTC
Re: [Xen-devel] Re: [RFC Patch] Support for making an E820 PCI hole in toolstack (xl + xm)
On Wed, 2010-11-17 at 11:14 +0000, Gianni Tedesco wrote:

> On Fri, 2010-11-12 at 23:08 +0000, Konrad Rzeszutek Wilk wrote:
> > Hey guys,
> >
> > Attached is an RFC patch for making a PCI hole in the PV guests. This
> > allows PV guests(*) with 4GB or more to now properly work with or
> > without PCI passthrough cards.
[...]
> Doesn't this problem affect dom0 also? How is it to be fixed there?

Domain 0 sees the actual host e820 map instead of the paravirtualised
one constructed by the tools for domU. So the BIOS effectively takes the
place of the tools side of this patch for dom0.

Ian.
Gianni Tedesco
2010-Nov-17 13:37 UTC
Re: [Xen-devel] Re: [RFC Patch] Support for making an E820 PCI hole in toolstack (xl + xm)
On Wed, 2010-11-17 at 11:43 +0000, Ian Campbell wrote:

> On Wed, 2010-11-17 at 11:14 +0000, Gianni Tedesco wrote:
> > On Fri, 2010-11-12 at 23:08 +0000, Konrad Rzeszutek Wilk wrote:
> > > Attached is an RFC patch for making a PCI hole in the PV guests.
[...]
> > Doesn't this problem affect dom0 also? How is it to be fixed there?
>
> Domain 0 sees the actual host e820 map instead of the paravirtualised
> one constructed by the tools for domU. So the BIOS effectively takes the
> place of the tools side of this patch for dom0.

Hmm, of course, I did think xen "sanitised" it somehow though. I suppose
I just assumed (hoped) this may fix my dom0 boot crash issues.
Gianni
Ian Campbell
2010-Nov-17 14:23 UTC
Re: [Xen-devel] [RFC Patch] Support for making an E820 PCI hole in toolstack (xl + xm)
On Tue, 2010-11-16 at 15:50 +0000, Konrad Rzeszutek Wilk wrote:

> disclaimer: This email got a bit lengthy - so make sure you got a cup
> of coffee when you read this.
>
> > On an unrelated note I think if we do go down the route of having the
> > guest kernel punch the holes itself and such we should do so iff
> > XENMEM_memory_map returns either ENOSYS or nr_entries == 1 to leave open
>
> When would that actually happen? Is that return value returned when the
> hypervisor does not implement it (and what version implemented this)?

-ENOSYS implies an older Xen which does not have this interface.
Currently this causes the guest to create a fake one-entry e820 covering
0..nr_pages, which is what old guests which don't know about the
hypercall do too.

If the hypercall returns an e820 with nr_entries == 1 then this implies
a newer Xen which implements the interface but where the tools have only
poked down a simple one-entry e820 covering 0..nr_pages or possibly
0..max_pages (this is all any existing hypervisor/tools will do).

If the hypervisor returns nr_entries >= 2 then you have some future Xen
which has tools which (think they) know what they are doing, and so we
should trust the e820 given to us.

Without allowing for this now we will end up with
XENFEAT_tools_provide_a_useful_guest_e820, which would be a shame!

Ian.
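[Editorial note: the versioning rule above can be phrased as a single predicate -- trust the tools-provided map only when the hypercall exists and returned more than one entry, otherwise synthesise the flat 0..nr_pages map and punch holes in the guest. This sketch uses a hypothetical helper name, not actual kernel code.]

```c
#include <errno.h>

/* Decide whether to trust the e820 returned by XENMEM_memory_map.
 * rc: hypercall return value; nr_entries: entries it filled in.
 * Returns 1 to trust the map as-is, 0 to fall back to a fake
 * one-entry e820 covering 0..nr_pages (and let the guest kernel
 * punch the PCI hole itself). */
static int trust_tools_e820(int rc, unsigned int nr_entries)
{
	if (rc == -ENOSYS)		/* old hypervisor: no such interface */
		return 0;
	if (rc == 0 && nr_entries <= 1)	/* tools poked down a flat map */
		return 0;
	return rc == 0;			/* >= 2 entries: trust the tools */
}
```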